Hightouch vs. Segment: How we do reverse ETL at Reforge

How we use Reverse ETL

We use reverse ETL here at Reforge for a bunch of different use cases. They are:

  • Sending data about our customers to our email automation system, Iterable
  • Sending data that is aggregated in our data warehouse (Snowflake) to a Google Sheet
  • Sending data to our behavioral analytics system, Amplitude
  • Sending data to our CRM (HubSpot) from our data warehouse
  • Sending data to Slack from our data warehouse

Examples

We use Segment and Personas features to send data to our email system, Iterable.

There are cool features in Personas that allow you to use a GUI to define things like:

  • Counters (example: how many page views met certain criteria in the last X days)
  • Aggregation (example: how much revenue have they generated in the past quarter)
  • Most frequent (example: what’s the most frequent blog post they’ve visited)
  • First (example: what’s the first marketing campaign that got a user to visit your site)
  • Last (example: what’s the last marketing campaign that got a user to visit your site)

They also let you write arbitrary SQL to pull a list of people, and any column in the SQL can become a trait associated with a user. So you can use the GUI to define basic things or write complicated SQL to pull sophisticated stats.
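To make the five trait types above concrete, here is a minimal Python sketch of the same calculations over one user’s events. It is only an illustration of the idea: the event fields (`type`, `path`, `campaign`, `revenue`, `ts`) and the sample data are hypothetical, and this is not how Segment implements Personas.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical events for a single user; field names are illustrative only.
events = [
    {"type": "page_view", "path": "/blog/a", "campaign": "twitter", "revenue": 0.0, "ts": datetime(2022, 10, 1)},
    {"type": "page_view", "path": "/blog/b", "campaign": "newsletter", "revenue": 0.0, "ts": datetime(2022, 11, 1)},
    {"type": "page_view", "path": "/blog/b", "campaign": None, "revenue": 0.0, "ts": datetime(2022, 11, 5)},
    {"type": "purchase", "path": "/checkout", "campaign": "partner", "revenue": 100.0, "ts": datetime(2022, 11, 10)},
]


def computed_traits(events, now, window_days):
    """Compute counter, aggregation, most-frequent, first, and last traits."""
    ordered = sorted(events, key=lambda e: e["ts"])
    cutoff = now - timedelta(days=window_days)
    recent = [e for e in ordered if e["ts"] >= cutoff]
    page_views = [e for e in recent if e["type"] == "page_view"]
    paths = Counter(e["path"] for e in page_views)
    campaigns = [e["campaign"] for e in ordered if e["campaign"]]
    return {
        # Counter: page views inside the window
        "page_views_in_window": len(page_views),
        # Aggregation: revenue inside the window
        "revenue_in_window": sum(e["revenue"] for e in recent),
        # Most frequent: the page viewed most often inside the window
        "most_frequent_page": paths.most_common(1)[0][0] if paths else None,
        # First / last: campaigns across the user's full history
        "first_campaign": campaigns[0] if campaigns else None,
        "last_campaign": campaigns[-1] if campaigns else None,
    }


traits = computed_traits(events, now=datetime(2022, 11, 15), window_days=30)
```

The GUI saves you from writing this kind of logic by hand, and the SQL option covers the cases where it is not enough.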

An example of a user trait we have in Segment:

A SQL user trait in Segment that pulls a list of applicants to Reforge and values associated with their application

Important caveat: it’s on our roadmap to pull all of this logic out of SQL and put it into a dbt model. The benefits are better testing, a clearer view of data lineage in a DAG, and having all of it under source control. In the future the SQL for this trait could be something like applications.user_application_values.

In our data dictionary we define these properties:

Screenshot of our data dictionary covering the same fields

Then we can use the traits to send personalized campaigns in Iterable for many different scenarios:

  • Application acceptance or denial
  • Payment nudges to remind someone they’ve been accepted
  • Stats about how many people from their company have participated

Here’s an example of a campaign in Iterable:

Sample campaign in Iterable that uses the fields from the SQL trait / data dictionary

You can see how we have a dynamic template in the email above that personalizes the email subject to have the company name they select as well as stats about the company with things like:

  • How many people from their company have participated in Reforge in the past
  • How many people from their company are current members
  • How many people from their company have applied recently
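As a rough illustration of what that personalization does, here is a small Python stand-in that fills a subject and body from a user’s traits. Iterable has its own templating, so treat this only as a sketch of the substitution idea; the trait names and the copy are hypothetical rather than the fields in our data dictionary.

```python
from string import Template

# Hypothetical trait names and copy; the real fields come from the SQL trait.
subject_template = Template("$company_name and Reforge")
body_template = Template(
    "$past_participants people from $company_name have participated in Reforge, "
    "$current_members are current members, and $recent_applicants applied recently."
)


def render_email(traits):
    """Fill the subject and body templates from a user's synced traits."""
    return {
        "subject": subject_template.substitute(traits),
        "body": body_template.substitute(traits),
    }


example = render_email(
    {
        "company_name": "Acme",
        "past_participants": 12,
        "current_members": 5,
        "recent_applicants": 3,
    }
)
```

`substitute` raises a `KeyError` if a trait is missing, which is the behavior you want when testing a template against a user whose traits have not synced yet.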

I think it’s pretty great that Segment has this functionality especially when you combine it with their other platform tools like Protocols. That said, I have some gripes about the tool as of November 2022:

  • On our current tier, the most frequently a trait can run is hourly. Maybe I can pay for a higher frequency, but that drives me nuts and I’d have to negotiate with a sales rep for that.
  • It’s really hard to test that a trait is working the way you want because you can’t trigger a sync. This drives me nuts because I always make a simple mistake and then either try to get it to trigger or have to wait an hour to see if an update fixed it.
  • The destinations you can sync to are the typical ones for a CDP. I wish I could sync a result set with a spreadsheet, but that’s not possible.
  • I can’t sync non-contact objects to our CRM. For instance, I can sync contacts and their properties to HubSpot but I can’t sync subscriptions to a custom object or companies and their details to a company object in HubSpot.
  • It feels like Segment has abandoned working on this feature. They’re busy integrating with Twilio and building things like Twilio Engage rather than investing in making Personas great. I might have missed it, but it feels like they haven’t made changes to the product in years.

How we use Hightouch

We use a second reverse ETL provider at Reforge. It sucks that we have two, but my hope is that this is a transitional period until we ultimately simplify to one. I started using Hightouch for free to sync data to a Google Sheet and to Slack, and it became clear it was the vastly superior product to Segment Personas.

The world runs on spreadsheets, and oftentimes it’s really helpful to give teams a constantly updated resource that has data synced from your data warehouse. It’s a great free path to start using their tools because once you have it connected, it’s so easy to add additional sources.

Here’s an example of how we send paid data into a spreadsheet for the marketing team (we pull paid numbers across all channels into our data warehouse and this helps them understand all of it and play with it in a spreadsheet environment rather than pulling it from a BI tool constantly):

Sending aggregate paid spend numbers into a spreadsheet for the marketing team

The model can be custom SQL, a table in a DB, or a model in dbt or Looker. It makes it really easy.

We also use Hightouch to sync data to our CRM, HubSpot. It’s just so easy.

We also use Hightouch to send customized messages to Slack. I’ve done this using Zapier in the past but it’s nice to be able to have all of these syncs in one place. Here’s an example of a daily ARR report that the team built:

Our daily update that sends data from Snowflake to Slack

Here’s what it looks like in Slack:

An example of a daily summary the team built via Hightouch

I spoke with the co-founders of Hightouch when it was early days and that’s when I found out they were part of the early team that built Personas at Segment. It feels like they were able to address a lot of their frustrations about how the product was built and make big improvements on it. Things that are just so much better than Segment:

  • Their support is so much better than Segment’s. This should be every startup’s advantage against a company acquired by a big behemoth, and they do a good job. I have been so disappointed with Segment’s support over the years, and Hightouch makes the overall experience so much better because you can quickly get an answer to why something is broken. With Segment support, I oftentimes have to wait days, which is so painful when you’re working on a deeply technical integration like this.
  • They have a “run now” button where you can run a sync anytime you want. This sounds so simple, but I’m sure it makes the technical implementation much harder. I can’t tell you how many times I’ve messed up a piece of a sync in Segment Personas and then just had to wait until the sync ran again. It baffles me they haven’t changed this.
  • It feels like it’s being worked on all the time. Lots of changes and improvements, new features, and integrations with the latest and greatest tools.
  • The logging and insight into what happened is so much better with Hightouch. This is a real interaction I had with Segment’s support team (no judgement on the support rep, this is a reflection of the state of their product). This type of response is infuriating when you are working to integrate two systems together and you don’t know what’s wrong or when it’ll retry some logic.

A screenshot of my support ticket with Segment. I was very frustrated!

Conclusion

I look forward to moving all of our reverse ETL infrastructure over to Hightouch from Segment at annual contract renewal time. While there are benefits to having the reverse ETL logic and elements all in Segment (we pay for their Protocols feature and there are benefits in blocking bad data / being alerted when your data doesn’t adhere to the spec), I would rather use the better tool.

How do others do this type of work at their companies? I’d love to know if others solve their challenges similarly and if there are even better ways.

Alerting on Snowflake Cost Spikes with Metaplane

We switched from a Postgres data warehouse to a Snowflake data warehouse earlier this year. As part of that transition we spoke with sales and learned what drives a Snowflake bill. It turned out that 15-20% of our Snowflake bill came from our BI tool, Metabase.

That led me to wonder whether I could better understand what was driving our costs on Snowflake and ensure that we didn’t get an unexpected bill. Our storage costs are tiny (we’re lucky at Reforge that we have enough data to make it useful, but not so much that the costs are prohibitive / make us move slowly).

I did two things as a result of this:

  1. Visualized the costs over time.
  2. Set up Metaplane to monitor our daily usage and alert us if it deviated from the historical norm. This was really easy.

Visualizing Costs over Time

I got some great feedback from the dbt community about the types of queries I should be running to track down costs. They asked good questions like:

  • How are your warehouses configured?
  • What warehouses are creating the most spend?
  • Are the warehouses set to auto-suspend?
  • What services are contributing to your bill?
  • Is your spend mostly on storage or compute?
  • What queries cost the most?

Ultimately, one of the things that came out of the analysis was that essentially all of our bill was compute-related rather than storage. The sales team at Snowflake said that most customers don’t pay for “cloud services” and that we were an outlier. That led us to look into what was driving the overage on our cloud services. I visualized the cost over time with Deepnote for both compute and cloud services. You can see that “overage-cloud services” represents a big part of our bill, and we attributed most of that cost to the Metabase queries.

Our monthly Snowflake bill broken out by usage type
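If you want to reproduce that breakdown outside of a notebook tool, the grouping itself is simple. The sketch below is a minimal Python version that assumes you have already pulled daily rows shaped like (usage_date, usage_type, usage_in_currency). The dollar amounts are made up, and the usage type names other than “overage-cloud services” are placeholders.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily rows: (usage_date, usage_type, usage_in_currency).
daily_rows = [
    (date(2022, 3, 30), "compute", 40.0),
    (date(2022, 3, 31), "overage-cloud services", 10.0),
    (date(2022, 4, 1), "compute", 45.0),
    (date(2022, 4, 1), "overage-cloud services", 15.0),
    (date(2022, 4, 2), "storage", 1.0),
    (date(2022, 4, 2), "compute", 35.0),
]


def monthly_cost_by_type(rows):
    """Sum cost per (first day of month, usage type)."""
    totals = defaultdict(float)
    for usage_date, usage_type, cost in rows:
        month = usage_date.replace(day=1)
        totals[(month, usage_type)] += cost
    return dict(totals)


def share_by_type(totals, month):
    """Return each usage type's share of the given month's total cost."""
    month_total = sum(cost for (m, _), cost in totals.items() if m == month)
    return {t: cost / month_total for (m, t), cost in totals.items() if m == month}


totals = monthly_cost_by_type(daily_rows)
april_shares = share_by_type(totals, date(2022, 4, 1))
```

Looking at shares rather than raw totals is what made the cloud services overage stand out for us.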

We discovered the issue in April and took some action to try to reduce it. If you want to do the same analysis, I posted the notebook here. I’d love to build a flowchart of the types of things you should be looking at when analyzing your Snowflake bill and add them to the Jupyter notebook.

Being Alerted to When Costs Go Haywire

Metaplane is a tool that looks at your data systems and lets you know when something is broken. It could be that tables have stopped being updated, that they’re not being updated at the same rate, or that someone is spamming events to your system. It has many different kinds of tests, but ultimately it helps to ensure that you know when things are going haywire so you can fix them before they cause problems or cause others to lose confidence in the reliability of your data and systems.

We configured Metaplane to run this query every day for both cloud services and compute spend, which are the vast majority of our spend. If it deviates from the historical norm it alerts the team so we can dig into what is increasing our usage and our bill. A big thanks to Ian Whitestone in the dbt community who helped with suggestions on this query as well as other ideas.

select
    sum(usage_in_currency) as total_cost
from snowflake.organization_usage.usage_in_currency_daily
where usage_type = 'overage-cloud services'
    and usage_date >= current_date - interval '1 day'
    and usage_date < current_date

Metaplane monitoring compute credits over time, our usage spiked in the last data point to 2X the normal range

The slack notification when spend deviates from what we expect
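I don’t know the details of the model Metaplane uses to decide what counts as a deviation, so the following is only a simple stand-in to show the idea: compare the latest daily total against the mean and standard deviation of recent history. The function name, the threshold of 3, and the sample numbers are all assumptions for illustration.

```python
from statistics import mean, stdev


def is_spend_anomalous(history, latest, threshold=3.0):
    """Flag `latest` if it is more than `threshold` standard deviations from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough history to estimate a normal range
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # flat history: any change is a deviation
    return abs(latest - mu) / sigma > threshold


# Hypothetical daily totals in dollars.
history = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0]
spike_detected = is_spend_anomalous(history, latest=20.0)
normal_day_flagged = is_spend_anomalous(history, latest=10.4)
```

A dedicated tool earns its keep by handling seasonality, trend, and alert routing, which this sketch ignores.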

Example Use

In the above image you can see how there was a big spike in usage that was much higher than the previous sixty days. Our team was notified, looked into the issue, and this was the result:

found two queries made in the Snowflake UI that had been running for 20 and 40 hours respectively. I force canceled the queries and let the user know what happened. We also decided to update the query timeout account wide from its default value of 48 hours down to 1 hour.

One of the data engineers on the team

This is a great example of how Metaplane ensures that we’re on top of anything that is unexpected and can affect our team. It may be data quality issues, reporting issues, delays, logic issues, or something like costs. Examples like this confirm that data observability is a must-have that all companies will have in the next couple of years.

Next Steps

Ultimately, we didn’t think it was worth our team’s time to try to optimize the Snowflake spend driven by our Metabase issues. I’d rather the team spend time on creating new value for our company and our customers than spend a lot of time to save $1,000 / month. There will come a point where it’s worth it for someone on our team to look at this and a myriad of other areas for improvement, but until then we’ll be monitoring our spend closely to ensure it doesn’t change materially from where it is today. If anyone has any tips or insight into how to address the Metabase problems, let me know!

Setting up a simple on-call schedule in Slack and Zapier

The Reforge data team doesn’t require anyone to “carry a pager” and be available during certain periods to respond to issues at any time of day or night. That said, we do have a rotating responsibility for responding to Slack forums and looking at alerts from our data observability tool, Metaplane.

We ask people to be available for a week at a time as the point person for questions or issues that come up. We don’t expect them to be the one to fix / respond to every issue, but we ask that they be the point person to help move it forward.

I built a really simple reminder workflow in Slack to remind people of what they should be doing when they are in this role because it’s easy to forget. I thought I’d share how easy it was to set up and see if anyone had any other tools / processes they use.

We already had a Google Doc that listed the dates that each person was responsible for, and I built a quick mapping table that takes a person’s name and fills in their Slack ID. It’s easy to get someone’s Slack ID.

The Google Doc has a row for each day, lists the person responsible, and their Slack ID

I then built a workflow in Zapier that:

  • Runs every day at a standard time (12pm ET)
  • Checks to make sure that it’s a weekday (not a weekend)
  • Looks up the row in the spreadsheet for the current day
  • Continues only if there’s a Slack ID for that person that day (if someone doesn’t want the reminders they can delete their Slack ID from the row)
  • Sends them a direct message on Slack from a reminder bot
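The workflow lives in Zapier, not in code, but the decision logic is small enough to sketch in Python. Everything here is hypothetical: the schedule rows, the Slack IDs, and the message text are placeholders, and actually sending the DM is left out.

```python
from datetime import date

# Hypothetical schedule mirroring the spreadsheet: one row per day.
schedule = {
    date(2022, 11, 12): {"name": "Alex", "slack_id": "U0000000A"},  # a Saturday
    date(2022, 11, 14): {"name": "Alex", "slack_id": "U0000000A"},  # a Monday
    date(2022, 11, 15): {"name": "Sam", "slack_id": ""},  # opted out of reminders
}


def build_reminder(today, schedule):
    """Return (slack_id, message) for today's point person, or None if no DM should go out."""
    if today.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return None
    row = schedule.get(today)
    if row is None or not row["slack_id"]:
        return None  # no row for today, or the Slack ID was deleted
    message = (
        f"Hi {row['name']}, you are the data team point person today. "
        "Please keep an eye on the Slack forums and Metaplane alerts."
    )
    return row["slack_id"], message


monday = build_reminder(date(2022, 11, 14), schedule)
```

Each early `return None` corresponds to one of the filter steps in the Zap, so the workflow stops quietly instead of sending an empty message.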

This is the DM that people get in Slack:

How do others in the data space ensure that the team rotates through different responsibilities, gets reminded of what they’re supposed to do, and knows how to do it? I don’t want to go full-blown pager service and this felt like the right level of reminder for our stage of company (the team has 5 analysts, a manager, a data engineer, and myself).

Analyzing Metabase Usage

I am the head of data at Reforge (we’re hiring!) and data is critical to our success. We use it to track our business, individual teams leverage it in their work, and we believe in empowering people in the company with data to help them do great work. We use all sorts of great tools, but as the head of data I felt it was important to understand whether people were actually using our BI (business intelligence) tool.

I built a weekly report that lays out how my company is using our BI tool, Metabase. It has the following reports:

  • Dashboards created in the past 7 days
  • Questions created in the past 7 days
  • Individual Dashboard views by our exec team in the past 7 days
  • Dashboard view counts by user in the past 7 days
  • Activity by my boss (our CTO)
  • Metabase MAUs
  • Metabase WAUs

It helps me to understand at a high level what is happening inside the company from a data perspective:

  • Who is building reports (questions / dashboards)
  • What questions people are asking of our data
  • What is most important to our exec team
  • Who is looking at information on a regular basis
  • What is top of mind for my boss
  • How engaged our company is with our data & analytics

It was surprisingly easy to set this up. All we had to do was add the Metabase internal database as a source to Metabase (sounds confusing, but actually quite simple). It wouldn’t shock me if this became a paid feature long-term for Metabase on their enterprise plan, but it took less than 30 minutes to figure out their data model and build some quick and dirty SQL questions that produced answers to our most important questions. Metabase is already storing all of the data we needed to visualize to get answers to our questions.

I set this up as a dashboard subscription so it is delivered to my email every week and I can quickly scan it to understand what’s top of mind for the org.

Reforge is Data Driven

Since the beginning of the company we’ve looked at data to understand how well we’re tracking against key goals for the company (revenue, applications, content completion, weekly activity, etc). These stats span across the entire company in terms of usefulness (support, sales, marketing, product, engineering, design, finance, etc) and we strive to make as much information accessible to everyone so that they can be data informed.

If you’re considering joining a company – do you want to be at one that uses data to help inform decisions? You can see below that Reforge employees are using data on a regular basis and you can see how it scales as we’ve grown the team and hit key milestones. We’ve typically seen 75-100% of employees leveraging Metabase on a monthly basis.

Our Metabase monthly active users along with company milestones
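For the percentage figure, the calculation is distinct active users per month divided by headcount for that month. Here is a minimal Python sketch of that arithmetic. The view rows and the headcount numbers are made up, and where headcount comes from (an HR system, a spreadsheet) is an assumption left outside the sketch.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical view-log rows: (timestamp, user_id).
views = [
    (datetime(2022, 10, 3), 1),
    (datetime(2022, 10, 4), 1),
    (datetime(2022, 10, 10), 2),
    (datetime(2022, 10, 21), 3),
    (datetime(2022, 11, 1), 1),
    (datetime(2022, 11, 2), 2),
    (datetime(2022, 11, 8), 3),
    (datetime(2022, 11, 9), 4),
]

# Hypothetical headcount per month.
headcount = {"2022-10": 4, "2022-11": 4}


def monthly_active_share(views, headcount):
    """Return the share of employees with at least one view in each month."""
    active = defaultdict(set)
    for ts, user_id in views:
        active[ts.strftime("%Y-%m")].add(user_id)
    return {month: len(users) / headcount[month] for month, users in active.items()}


shares = monthly_active_share(views, headcount)
```

Using a set per month is what keeps a heavy user from being counted more than once.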

The queries that we use to power the charts:

Dashboards created in the past week

select created_at, name, u.email, left(description, 40) as description, 'https://BI_HOSTNAME/dashboard/' || d.id as url
from report_dashboard d
    inner join core_user u 
        on d.creator_id = u.id
where created_at > current_date - interval '7 days'
order by 1 desc

Questions created in the past week

select created_at, name, u.email, 'https://BI_HOSTNAME/question/' || d.id as url
from report_card d
    inner join core_user u 
        on d.creator_id = u.id
where created_at > current_date - interval '7 days'
order by 1 desc

Exec team dashboard views in the past week

For this report, I wanted to know what the exec team was looking at in the past week. I want to know how much they’re looking at the metrics and whether they’re looking at the dashboards I think they should be looking at.

select u.email, d.name, count(v.id)
from view_log v
    inner join core_user u
        on v.user_id = u.id
    inner join report_dashboard d
        on v.model_id = d.id
where model = 'dashboard'
    and u.email in 
        (LIST_OF_EMAILS)
    and v.timestamp > (current_date - interval '7 days')
group by 1, 2
order by 3 desc
limit 200

Dashboard views leaderboard (views in the past week)

This shows the number of views by user in the past 7 days. It’s always a good sign when your head of data isn’t the most prolific viewer of dashboards.

List of users and their views

select u.email, count(v.id)
from view_log v
    inner join core_user u
        on v.user_id = u.id
    inner join report_dashboard d
        on v.model_id = d.id
where model = 'dashboard'
    and v.timestamp > (current_date - interval '7 days')
group by 1
order by 2 desc
limit 200

Most popular dashboards by views

select d.name, count(v.id)
from view_log v
    inner join core_user u
        on v.user_id = u.id
    inner join report_dashboard d
        on v.model_id = d.id
where model = 'dashboard'
    and v.timestamp > (current_date - interval '7 days')
group by 1
order by 2 desc
limit 200

Questions created by a single person

select created_at, name, u.email, 'https://BI_HOSTNAME/question/' || d.id as url
from report_card d
    inner join core_user u 
        on d.creator_id = u.id
where created_at > current_date - interval '28 days'
    and u.email = EMAIL_OF_YOUR_BOSS
order by 1 desc

Metabase MAUs

select date_trunc('month', timestamp), count(distinct user_id) 
from view_log 
group by 1
order by 1

Metabase WAUs

select date_trunc('week', timestamp), count(distinct user_id) 
from view_log 
group by 1
order by 1

Hiring a Senior Data Engineer at Reforge

It is an exciting time to be joining the data team at Reforge. In the past 16 months, we created the data team and scaled it to 8 people. On the team today, there are five analysts, a data engineer, an analytics manager, and me (I’m the Head of Data at Reforge).

We’ve been busy in this time. Some of the work that we’ve done:

  • Migrated our data warehouse from Postgres to Snowflake
  • Rebuilt our entire data warehouse schema organization
  • Built countless models to model our programs
  • Enabled the switch to a subscription model (reporting on subscriptions, change over time, renewals, etc)
  • Worked with our R&D team to adopt and use a behavioral analytics system
  • Defined our data dictionary
  • Hired and onboarded 8 people
  • Built data pipelines to enrich data in our system via Airflow
  • Migrated our email automation system

Reforge as a company has also been doing very well during this time. Our business has grown tremendously, we raised our $60M Series B round, and we have more than doubled the team in that time.

We had the good fortune of raising our Series B earlier this spring before the recent economic turmoil, and we have the money, time, and resources now to invest heavily in creating new experiences and products. It’s an exciting time to be investing and building when it feels like many companies are cutting back or deprioritizing investments.

Data engineering will be a critical component to our upcoming growth and I’m particularly excited about the projects we are undertaking:

  • We will build data pipelines to enrich data about our prospects and customers. This will be the foundation for personalization and recommendations in the future. Having a reliable and performant system that can help us understand who someone is, where they work, and build classification models for their attributes will be critical for future product experiences as well as analyzing how our business is performing.
  • We will build more sophisticated foundational reporting for our growing analyst team to work with. Our analysts work with stakeholders in every department to help find insight in data, build visualizations, and impact our strategy and future roadmap. You will have a ton of leverage in creating impact – the data sources you create will be used by every department.
  • We will invest in additional data syncs between our systems to ensure that every customer interaction is excellent. Reforge is teaching how the best technology companies execute, and we aspire to be the best example of our own frameworks. Our customers are the leading practitioners in the tech industry and love to hold us accountable.
  • We will revamp the infrastructure to enable reporting for our marketing organization. When the data team was spun up we didn’t have anyone in the marketing org, yet were growing 100% year over year due to word of mouth. The team has grown a lot since then, and we are becoming much more sophisticated in how we approach our marketing efforts. We need to empower our analysts and the marketing team with the data they need to understand how well we’re attracting people to Reforge, where our best returns on investment are, and what best explains the value of Reforge.

We use best-of-breed tools in our data stack. Here’s a video of our current data stack. Most of these tools have been implemented in the past year and we will continue to invest heavily as our business grows and we become even more sophisticated.

If you know of anyone who wants more responsibility and has a strong vision for how data engineering should work today, please send them this post or let me know if I should reach out to them. This is a critical role for us to fill before the end of the year.

Apply here for the role.

© 2024 Dan Wolchonok
