
Hightouch vs. Segment: How we do reverse ETL at Reforge

How we use Reverse ETL

We use reverse ETL here at Reforge for a bunch of different use cases. They are:

  • Sending data about our customers to our email automation system, Iterable
  • Sending data that is aggregated in our data warehouse (Snowflake) to a Google Sheet
  • Sending data to our behavioral analytics system, Amplitude
  • Sending data to our CRM (HubSpot) from our data warehouse
  • Sending data to Slack from our data warehouse

Examples

We use Segment and its Personas feature to send data to our email system, Iterable.

There are cool features in Personas that allow you to use a GUI to define things like:

  • Counters (example: how many page views that meet a certain criteria in the last X days)
  • Aggregation (example: how much revenue have they generated in the past quarter)
  • Most frequent (example: what’s the most frequent blog post they’ve visited)
  • First (example: what’s the first marketing campaign that got a user to visit your site)
  • Last (example: what’s the last marketing campaign that got a user to visit your site)
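
To make the trait types above concrete, here is a minimal Python sketch that computes one of each from a list of events. It is not how Personas works internally; the event fields (`type`, `page`, `campaign`, `revenue`, `ts`), the sample rows, and the trait names are all hypothetical and exist only for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical events for a single user; the field names are illustrative only.
EVENTS = [
    {"type": "page_view", "page": "/blog/growth-loops", "campaign": "newsletter",
     "revenue": 0, "ts": datetime(2022, 10, 1)},
    {"type": "page_view", "page": "/blog/retention", "campaign": None,
     "revenue": 0, "ts": datetime(2022, 11, 5)},
    {"type": "page_view", "page": "/blog/growth-loops", "campaign": "paid_social",
     "revenue": 0, "ts": datetime(2022, 11, 10)},
    {"type": "purchase", "page": None, "campaign": None,
     "revenue": 100.0, "ts": datetime(2022, 11, 12)},
]


def compute_traits(events, now, window_days=30):
    """Compute counter, aggregation, most-frequent, first, and last traits."""
    ordered = sorted(events, key=lambda e: e["ts"])
    cutoff = now - timedelta(days=window_days)
    page_views = [e for e in ordered if e["type"] == "page_view"]
    campaigns = [e["campaign"] for e in ordered if e["campaign"]]
    page_counts = Counter(e["page"] for e in page_views)
    return {
        # Counter: page views inside the trailing window
        "recent_page_views": sum(1 for e in page_views if e["ts"] >= cutoff),
        # Aggregation: total revenue across all events
        "total_revenue": sum(e["revenue"] for e in ordered),
        # Most frequent: the page viewed most often
        "most_frequent_page": page_counts.most_common(1)[0][0] if page_counts else None,
        # First / last: earliest and latest campaign seen
        "first_campaign": campaigns[0] if campaigns else None,
        "last_campaign": campaigns[-1] if campaigns else None,
    }


traits = compute_traits(EVENTS, now=datetime(2022, 11, 15))
print(traits)
```

Each returned key would map to one trait on the user profile.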

Personas can also run arbitrary SQL to pull a list of people, and any column in the result can become a trait associated with a user. So you can use the GUI to define basic things or use complicated SQL to pull sophisticated stats.

An example of a user trait we have in Segment:

A SQL user trait in Segment that pulls a list of applicants to Reforge and values associated with their application

Important caveat: it’s on our roadmap to pull all of this logic out of SQL and put it into a dbt model. The benefits there are better testing, an understanding of data lineage in a DAG, and having all of it under source control. In the future the SQL for this trait could reference something like applications.user_application_values.

In our data dictionary we define these properties:

Screenshot of our data dictionary covering the same fields

Then we can use the traits to send personalized campaigns in Iterable for many different scenarios:

  • Application acceptance or denial
  • Payment nudges to remind someone they’ve been accepted
  • Stats about how many people from their company have participated

Here’s an example of a campaign in Iterable:

Sample campaign in Iterable that uses the fields from the SQL trait / data dictionary

The email above uses a dynamic template that personalizes the subject line with the company name they selected and includes stats about the company, such as:

  • How many people from their company have participated in Reforge in the past
  • How many people from their company are current members
  • How many people from their company have applied recently
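
As a rough illustration of how synced traits feed a personalized email, here is a small Python sketch. It uses Python's `string.Template` as a stand-in and does not reproduce Iterable's own template syntax; the trait names (`company_name`, `past_participants`, `current_members`, `recent_applicants`), the wording, and the sample values are hypothetical.

```python
from string import Template

# Hypothetical subject and body; the placeholders correspond to user traits.
SUBJECT = Template("$company_name teammates are already in Reforge")
BODY = Template(
    "$past_participants people from $company_name have participated in the past, "
    "$current_members are current members, and $recent_applicants applied recently."
)


def render_email(traits):
    """Fill the templates from a dict of traits.

    substitute() raises KeyError when a trait is missing, so a gap in the
    synced data fails loudly instead of sending a half-filled email.
    """
    return {"subject": SUBJECT.substitute(traits), "body": BODY.substitute(traits)}


email = render_email({
    "company_name": "Acme",  # made-up sample values
    "past_participants": 12,
    "current_members": 5,
    "recent_applicants": 3,
})
print(email["subject"])
print(email["body"])
```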

I think it’s pretty great that Segment has this functionality especially when you combine it with their other platform tools like Protocols. That said, I have some gripes about the tool as of November 2022:

  • On our current tier, traits can run at most hourly. Maybe I can pay for a higher frequency, but that drives me nuts and I’d have to negotiate with a sales rep for that.
  • It’s really hard to test that a trait is working the way you want because you can’t trigger a sync. This drives me nuts because I always make a simple mistake and then either try to get it to trigger or have to wait an hour to see if an update fixed it.
  • The destinations you can sync to are the typical ones for a CDP. I wish I could sync a result set with a spreadsheet, but that’s not possible.
  • I can’t sync non-contact objects to our CRM. For instance, I can sync contacts and their properties to HubSpot but I can’t sync subscriptions to a custom object or companies and their details to a company object in HubSpot.
  • It feels like Segment has abandoned working on this feature. They’re busy integrating with Twilio and building things like Twilio Engage rather than investing in making Personas great. I might have missed it, but it feels like they haven’t made changes to the product in years.

How we use Hightouch

We use a second reverse ETL provider at Reforge. It sucks that we have two, but my hope is that this is a transitional period until we ultimately simplify to one. I started using Hightouch for free to sync data to a Google Sheet and to Slack, and it became clear it was the vastly superior product to Segment Personas.

The world runs on spreadsheets, and oftentimes it’s really helpful to give teams a constantly updated resource that has data synced from your data warehouse. It’s a great free way to start using their tools because once you have it connected, it’s so easy to add additional sources.

Here’s an example of how we send paid data into a spreadsheet for the marketing team (we pull paid numbers across all channels into our data warehouse and this helps them understand all of it and play with it in a spreadsheet environment rather than pulling it from a BI tool constantly):

Sending aggregate paid spend numbers into a spreadsheet for the marketing team

The model can be custom SQL, a table in a DB, or a model in dbt or Looker. It makes it really easy.

We also use Hightouch to sync data to our CRM, HubSpot. It’s just so easy.

We also use Hightouch to send customized messages to Slack. I’ve done this using Zapier in the past but it’s nice to be able to have all of these syncs in one place. Here’s an example of a daily ARR report that the team built:

Our daily update that sends data from Snowflake to Slack

Here’s what it looks like in Slack:

An example of a daily summary the team built via Hightouch
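
Hightouch builds and sends the message itself, so no code is needed for the sync above. Purely to illustrate the shape of such a summary, here is a Python sketch that formats a daily ARR line into the JSON body that a Slack incoming webhook accepts (an object with a `text` field). The function name, the wording, and the numbers are made up, and nothing is actually sent.

```python
import json


def build_arr_message(report_date, arr, previous_arr):
    """Format a daily ARR summary as a Slack webhook-style JSON payload."""
    change = arr - previous_arr
    pct = (change / previous_arr * 100) if previous_arr else 0.0
    direction = "up" if change >= 0 else "down"
    text = (
        f"Daily ARR report for {report_date}\n"
        f"ARR: ${arr:,.0f} ({direction} ${abs(change):,.0f}, "
        f"{abs(pct):.1f}% vs. previous day)"
    )
    return json.dumps({"text": text})


# Sample values are invented for illustration.
print(build_arr_message("2022-11-01", 1_050_000, 1_000_000))
```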

I spoke with the co-founders of Hightouch when it was early days and that’s when I found out they were part of the early team that built Personas at Segment. It feels like they were able to address a lot of their frustrations about how the product was built and make big improvements on it. Things that are just so much better than Segment:

  • Their support is so much better than Segment. This should be every startup’s advantage against the company acquired by a big behemoth, and they do a good job. I have been so disappointed with Segment’s support over the years and Hightouch makes the overall experience so much better because you can get an answer to why something is broken quickly. With Segment support, I often times have to wait days which is so painful when you’re working on a deeply technical integration like this.
  • They have a “run now” button where you can run a sync anytime you want. This sounds so simple, but I’m sure it makes the technical implementation much harder. I can’t tell you how many times I’ve messed up a piece of a sync in Segment Personas and then just had to wait until the sync ran again. It baffles me they haven’t changed this.
  • It feels like it’s being worked on all the time. Lots of changes and improvements, new features, and integrations with the latest and greatest tools.
  • The logging and insight into what happened is so much better with Hightouch. This is a real interaction I had with Segment’s support team (no judgement on the support rep, this is a reflection of the state of their product). This type of response is infuriating when you are working to integrate two systems together and you don’t know what’s wrong or when it’ll retry some logic.
A screenshot of my support ticket with Segment. I was very frustrated!

Conclusion

I look forward to moving all of our reverse ETL infrastructure over to Hightouch from Segment at annual contract renewal time. While there are benefits to having the reverse ETL logic and elements all in Segment (we pay for their Protocols feature and there are benefits in blocking bad data / being alerted when your data doesn’t adhere to the spec), I would rather use the better tool.

How do others do this type of work at their companies? I’d love to know if others solve their challenges similarly and if there are even better ways.

Alerting on Snowflake Cost Spikes with Metaplane

We switched from a Postgres data warehouse earlier this year to a Snowflake data warehouse. As part of that transition we spoke with sales and learned about what drives your bill on Snowflake. It turned out that 15-20% of our bill with Snowflake was because of our BI tool, Metabase.

That led me to wonder if I could better understand what was driving our costs on Snowflake and ensure that we didn’t get an unexpected bill. Our storage costs are tiny (we’re lucky at Reforge that we have enough data to make it useful, but not so much that the costs are prohibitive / make us move slowly).

I did two things as a result of this:

  1. Visualized the costs over time.
  2. Set up Metaplane so that it would monitor our daily usage and alert us if it deviated from the historical norm. This was really easy.

Visualizing Costs over Time

I got some great feedback from the dbt community about the types of queries I should be running to track down costs. They asked good questions like:

  • How are your warehouses configured?
  • What warehouses are creating the most spend?
  • Are the warehouses set to auto-suspend?
  • What services are contributing to your bill?
  • Is your spend mostly on storage or compute?
  • What queries cost the most?

Ultimately, one of the things that came out of the analysis was that all of our bill was going to compute rather than storage. The sales team at Snowflake said that most customers don’t pay for “cloud services” and that we were an outlier. That led us to look into what was driving the overage on our cloud services. I visualized the cost over time with Deepnote for both compute and cloud services. You can see that “overage-cloud services” represents a big part of our bill, and we attributed most of that cost to the Metabase queries.

Our monthly Snowflake bill broken out by usage type
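
The breakdown in the chart can be approximated in a few lines of Python once the daily usage rows are exported. This is only a sketch of the aggregation, not the notebook's actual code. It assumes rows of (usage_date, usage_type, usage_in_currency), which are the column names used in the monitoring query later in this post. The sample amounts are made up, and the "compute" label is a placeholder for whatever usage types appear in your account.

```python
from collections import defaultdict
from datetime import date


def monthly_cost_by_usage_type(rows):
    """Sum daily costs into (month, usage_type) buckets."""
    totals = defaultdict(float)
    for usage_date, usage_type, cost in rows:
        totals[(usage_date.strftime("%Y-%m"), usage_type)] += cost
    return dict(totals)


def share_by_usage_type(totals, month):
    """Return each usage type's fraction of the given month's bill."""
    month_totals = {t: c for (m, t), c in totals.items() if m == month}
    grand_total = sum(month_totals.values())
    if not grand_total:
        return {}
    return {t: c / grand_total for t, c in month_totals.items()}


# Made-up sample rows for illustration.
rows = [
    (date(2022, 4, 1), "overage-cloud services", 10.0),
    (date(2022, 4, 2), "overage-cloud services", 20.0),
    (date(2022, 4, 1), "compute", 70.0),
    (date(2022, 5, 1), "compute", 50.0),
]
totals = monthly_cost_by_usage_type(rows)
print(share_by_usage_type(totals, "2022-04"))
```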

We discovered the issue in April and took some action to try to reduce it. If you want to do the same analysis, I posted the notebook here. I’d love to build a flowchart of the types of things you should be looking at when analyzing your Snowflake bill and add it to the Jupyter notebook.

Being Alerted to When Costs Go Haywire

Metaplane is a tool that looks at your data systems and lets you know when something is broken. It could be that tables have stopped being updated, that they’re not being updated at the same rate, or that someone is spamming events to your system. It has many different kinds of tests, but ultimately it helps to ensure that you know when things are going haywire so you can fix them before they cause problems or cause others to lose confidence in the reliability of your data and systems.

We configured Metaplane to run this query every day for both cloud services and compute spend, which are the vast majority of our spend. If it deviates from the historical norm it alerts the team so we can dig into what is increasing our usage and our bill. A big thanks to Ian Whitestone in the dbt community who helped with suggestions on this query as well as other ideas.

-- Total cloud services overage cost for the previous full day
SELECT
  SUM(usage_in_currency) AS total_cost
FROM SNOWFLAKE.ORGANIZATION_USAGE.USAGE_IN_CURRENCY_DAILY
WHERE usage_type = 'overage-cloud services'
  AND usage_date >= CURRENT_DATE - INTERVAL '1 day'  -- start of yesterday
  AND usage_date < CURRENT_DATE                      -- exclude today's partial data
Metaplane monitoring compute credits over time; our usage spiked in the last data point to 2X the normal range
The Slack notification when spend deviates from what we expect
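
Metaplane's detection model isn't described here, so the following is not how it works internally. It is a minimal Python sketch of the general idea of alerting when a daily value deviates from the historical norm, using a simple z-score over a trailing window as a stand-in. The threshold of 3 standard deviations and the sample numbers are arbitrary assumptions.

```python
from statistics import mean, stdev


def is_cost_anomaly(history, latest, threshold=3.0):
    """Return True when `latest` is more than `threshold` standard
    deviations away from the mean of the trailing `history` values."""
    if len(history) < 2:
        return False  # not enough data to estimate a norm
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # flat history: any change is a deviation
    return abs(latest - mu) / sigma > threshold


# Made-up daily costs for illustration.
history = [100, 102, 98, 101, 99, 100, 103, 97]
print(is_cost_anomaly(history, 200))  # roughly 2X the usual value
print(is_cost_anomaly(history, 101))  # within the usual range
```

A real monitor would also need to handle seasonality and gradual trends, which a plain z-score ignores.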

Example Use

In the above image you can see how there was a big spike in usage that was much higher than the previous sixty days. Our team was notified, looked into the issue, and this was the result:

found two queries made in the Snowflake UI that had been running for 20 and 40 hours respectively. I force canceled the queries and let the user know what happened. We also decided to update the query timeout account wide from its default value of 48 hours down to 1 hour.

One of the data engineers on the team
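
The quote doesn't say which setting was changed. Snowflake's STATEMENT_TIMEOUT_IN_SECONDS parameter has a documented default of 172800 seconds (48 hours), which matches the description, so the sketch below assumes that is the one. The helper function is hypothetical; it only builds the statement text and does not connect to Snowflake.

```python
# 48 hours in seconds, matching the default mentioned in the quote.
DEFAULT_TIMEOUT_SECONDS = 48 * 60 * 60


def timeout_statement(hours):
    """Build an account-level ALTER statement for a timeout given in hours.

    Assumes the relevant setting is STATEMENT_TIMEOUT_IN_SECONDS.
    """
    seconds = int(hours * 3600)
    if seconds <= 0:
        raise ValueError("timeout must be positive")
    return f"ALTER ACCOUNT SET STATEMENT_TIMEOUT_IN_SECONDS = {seconds};"


print(DEFAULT_TIMEOUT_SECONDS)
print(timeout_statement(1))
```

Changing an account-level parameter requires an appropriately privileged role, so this is the kind of statement an account administrator would run.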

This is a great example of how Metaplane ensures that we’re on top of anything that is unexpected and can affect our team. It may be data quality issues, reporting issues, delays, logic issues, or something like costs. Examples like this confirm that data observability is a must-have that all companies will have in the next couple of years.

Next Steps

Ultimately, we didn’t think it was worth our team’s time to try to optimize our spend on Snowflake due to our Metabase issues. I’d rather the team spend time on creating new value for our company and our customers than spend a lot of time to save $1,000 / month. There will come a point where it’s worth it for someone on our team to look at this and a myriad of other areas for improvement, but until then we’ll be monitoring our spend closely to ensure it doesn’t change materially from where it is today. If anyone has any tips or insight into how to address the Metabase problems, let me know!

Setting up a simple on-call schedule in Slack and Zapier

The Reforge data team doesn’t require anyone to “carry a pager” and be available during certain periods to respond to issues at any time of day or night. That said, we do have a rotating responsibility for responding to Slack forums and looking at alerts from our data observability tool, Metaplane.

One of the issues is that we ask people to be available for a week at a time to be the point person on responding to questions or issues that come up. We don’t expect everyone to be the one to fix / respond to an issue, but we ask that they be the point person to help move it forward.

I built a really simple reminder workflow in Slack to remind people of what they should be doing when they are in this role because it’s easy to forget. I thought I’d share how easy it was to set up and see if anyone had any other tools / processes they use.

We already had a Google Doc that listed the dates that each person was responsible for, and I built a quick mapping table that takes a person’s name and puts in their Slack ID. It’s easy to get someone’s Slack ID.

The Google Doc has a row for each day, listing the person responsible and their Slack ID

I then built a workflow in Zapier that:

  • Runs every day at a standard time (12pm ET)
  • Checks that it’s a weekday (not a weekend)
  • Looks up the row in the spreadsheet for the current day
  • Only continues if there’s a Slack ID for that person that day (if someone doesn’t want the reminders they can delete their Slack ID from the row)
  • Sends them a direct message on Slack from a reminder bot
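
The workflow itself lives in Zapier, not in code, but the decision logic in the steps above can be sketched in Python. The schedule rows, names, Slack IDs, and reminder text below are hypothetical. The function only decides whether to send and what to send, and the daily trigger and the actual Slack delivery are left out.

```python
from datetime import date

# Hypothetical schedule mirroring the spreadsheet: one row per day.
SCHEDULE = {
    date(2022, 11, 7): {"name": "Alex", "slack_id": "U0000001"},
    date(2022, 11, 8): {"name": "Sam", "slack_id": ""},  # ID deleted to opt out
}


def build_reminder(today, schedule):
    """Return the DM to send for `today`, or None if no reminder is due."""
    if today.weekday() >= 5:  # Monday is 0, so 5 and 6 are the weekend
        return None
    row = schedule.get(today)  # look up the row for the current day
    if row is None or not row["slack_id"]:
        return None  # no row, or the person removed their Slack ID
    return {
        "slack_id": row["slack_id"],
        "text": f"Hi {row['name']}, you are the data team point person today.",
    }


print(build_reminder(date(2022, 11, 7), SCHEDULE))   # a Monday with an ID
print(build_reminder(date(2022, 11, 8), SCHEDULE))   # opted out
print(build_reminder(date(2022, 11, 12), SCHEDULE))  # a Saturday
```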

This is the DM that people get in Slack:

How do others in the data space ensure that the team rotates through different responsibilities, gets reminded of what they’re supposed to do, and knows how to do it? I don’t want to go full-blown pager service and this felt like the right level of reminder for our stage of company (the team has 5 analysts, a manager, a data engineer, and myself).

Hiring a Senior Data Engineer at Reforge

It is an exciting time to be joining the data team at Reforge. In the past 16 months, we created the data team and scaled it to 8 people. On the team today, there are five analysts, a data engineer, an analytics manager, and me (I’m the Head of Data at Reforge).

We’ve been busy in this time. Some of the work that we’ve done:

  • Migrated our data warehouse from Postgres to Snowflake
  • Rebuilt our entire data warehouse schema organization
  • Built countless models to model our programs
  • Enabled the switch to a subscription model (reporting on subscriptions, change over time, renewals, etc)
  • Worked with our R&D team to adopt and use a behavioral analytics system
  • Defined our data dictionary
  • Hired and onboarded 8 people
  • Built data pipelines to enrich data in our system via Airflow
  • Migrated our email automation system

Reforge as a company has also been doing very well during this time. Our business has grown tremendously, we raised our $60M series B round, and we have more than doubled the team in that time.

We had the good fortune of having raised our Series B earlier this spring before the recent economic turmoil, and we have the money, time, and resources now to invest heavily in creating new experiences and products. It’s an exciting time to be investing and building when it feels like many companies are cutting back or deprioritizing investments.

Data engineering will be a critical component to our upcoming growth and I’m particularly excited about the projects we are undertaking:

  • We will build data pipelines to enrich data about our prospects and customers. This will be the foundation for personalization and recommendations in the future. Having a reliable and performant system that can help us understand who someone is, where they work, and build classification models for their attributes will be critical for future product experiences as well as analyzing how our business is performing.
  • We will build more sophisticated foundational reporting for our growing analyst team to work with. Our analysts work with stakeholders in every department to help find insight in data, build visualizations, and impact our strategy and future roadmap. You will have a ton of leverage in creating impact – the data sources you create will be used by every department.
  • We will invest in additional data syncs between our systems to ensure that every customer interaction is excellent. Reforge is teaching how the best technology companies execute, and we aspire to be the best example of our own frameworks. Our customers are the leading practitioners in the tech industry and love to hold us accountable.
  • We will revamp the infrastructure that enables reporting for our marketing organization. When the data team was spun up we didn’t have anyone in the marketing org, yet we were growing 100% year over year due to word of mouth. The team has grown a lot since then, and we are becoming much more sophisticated in how we approach our marketing efforts. We need to empower our analysts and the marketing team with the data they need to understand how well we’re attracting people to Reforge, where our best returns on investment are, and what best explains the value of Reforge.

We use best of breed tools in our data stack. Here’s a video of our current data stack. Most of these tools have been implemented in the past year and we will continue to invest heavily as our business grows and we become even more sophisticated.

If you know of anyone who wants more responsibility and has a strong vision for how data engineering should work today, please send them this post or let me know if I should reach out to them. This is a critical role for us to fill before the end of the year.

Apply here for the role.

The Reforge Data Team is hiring

We are hiring for three analysts on the Data Team at Reforge. We’re looking for a Product Analyst, a Senior Product Analyst, and a Senior Marketing Analyst.

It’s a very exciting time for Reforge and for the Data Team.

Helping People do the Best Work of Their Careers

Reforge is an amazing opportunity to work on a product that has a significant impact on people’s lives. We are taking the untapped knowledge, frameworks, and practices from industry leaders and making it accessible to our customers. Our goal is to help our members do the best work of their career by unlocking insights and then helping them apply it back to their jobs right away.

We routinely hear from our members that it was the best educational experience they’ve had (better than their MBA if they got one), and that it has helped them be more confident in their role and drive a big impact for their business.

$21 Million Series A Investment

We raised our Series A investment from A16Z in February. At the time, our CEO Brian Balfour wrote about:

  • The history of Reforge
  • Why there’s a real need for our offering
  • The solution we are building

I was lucky to be one of the earliest employees that took this from a nascent concept through today. It was an incredible time to be at the company, but today’s inflection point is even more exciting. While we bootstrapped the company to eight figures of revenue, our recent fundraising round gives us the capital to invest even more aggressively.

We are the rare startup that has real revenue with generous margins and profits, yet also has raised venture capital and the ambition to continue growing 100% each year.

The Data Team at Reforge

The data team is just getting started here at Reforge. We’re looking for people who want to be a part of a small team that is growing fast and are comfortable with ambiguity and fast paced change. It’s a great opportunity for people who want to be a part of building a strong data practice at a company that sees data as invaluable in operations as well as strategic decisions.

The nature of our product is that it’s cross functional. We create content, host events, and build relationships between people through marketing, community, operations, product, design, and engineering.

We have a central data team so that every part of the org has access to all of the data to deliver an exceptional experience. Every part of the experience should be tailored to your role, your company’s business model, where you’re located, how senior you are, what you’ve done in the platform, and what your goals are today. The data team is here to make sure teams have the information they need to deliver and iterate on an exceptional experience.

The data org owns the growth model for the business. It’s our job to help bring together the metrics for the whole business to understand how we’re trending towards our goals, what is performing well, where there is opportunity to improve, and what the greatest points of leverage are. This is especially important for a cross functional product like ours where opportunities require multiple groups within the company to collaborate.

What we’re looking for

We’re looking for someone who can take ambiguous questions, run independent analyses, and then clearly communicate their findings.

  • Analysts will be required to field questions from many potential sources: the leadership team, PMs, Marketers, Engineering, Design, Operations, or support. Being able to listen to the questions of teams, clarify their goals, and then come up with the right way to analyze and summarize findings is critical.
  • Oftentimes the initial question asked is not the right one or needs to be clarified and adjusted. It is not an analyst’s job to mindlessly give teams the answers to their questions, but to push back when necessary and collaborate with their teams to gain insights that will push the product experience forward and improve the business.

An analyst should be deeply quantitative but naturally curious. This person should have opinions about software experiences and be passionate about finding ways to quantify value to end users, impact to the business, areas for improvement, and insights into behavior.

  • Teams may not always know the right question to ask and the best analysts have their own opinions that they pursue.
  • Analysts should understand how the product works, the user psychology of end users, and how the product is different from its alternatives. The quantitative metrics that are used to measure its performance are a direct result of finding ways to express a deep understanding of the product.
  • The best analysts are looking around corners to help uncover pockets of strength, quantify areas that are under-performing, and are thinking about the best ways to influence people about what’s important and should be focused on.

An analyst should be an excellent communicator, story teller, and consultant.

  • We are not looking for analysts that are simply responsible for producing charts. They will need to be able to understand the motivation behind a request, suggest alternatives, and have the conviction to push back when they disagree to foster debate.
  • We want analysts who are able to tell a story with the data, explain the technical elements of an analysis but communicate why it’s important.
  • We want analysts to be consultative with their peers – they need to understand what will be impactful and resonate with an audience and tailor the summary accordingly.

The Tools we Use

As much as possible, we use the best of breed technology stack available today. We have roots as a profitable, bootstrapped company so we haven’t upgraded all of our tools yet but we are constantly evaluating each of the tools to ensure we’re using the best tool for the job.

Our core sources of data:

  • We have a read-replica of our production database so that we can query the latest and greatest production data in real time.
  • We have a data warehouse that is populated hourly by our customer data platform, Segment.
  • We are able to seamlessly run queries that mix and match data between the two systems so that we can use the latest production data with an analysis that uses raw event based data or models computed in the data warehouse.

Some of the tools we use:

  • We use dbt for modeling and data transformation in our data warehouse
  • We use GitHub for our dbt models and all of our code
  • We use Metaplane to ensure that we are the first to know of any structural changes to our schemas as well as unexpected changes or trends within our databases.
  • We use Iterable as our email marketing tool
  • We use Segment for our event pipeline, reverse ETL tool, and to populate our data warehouse with event level information from sources of data.
  • We use Airflow for any jobs that connect to external services and to create more sophisticated models and jobs that can’t be created in pure SQL.
  • We use Amplitude as our behavioral analytics tool to get insights about how people are using our product, key conversion paths and funnels, and retention behavior.
  • We use Metabase as our BI tool. I’ve written about examples of how we use it here and here.
A simplified version of our data systems

How Teams are Structured

We are rapidly building our teams and how they are structured, but our strategy is a product led experience. That means that our product team is responsible for the core experience of our members. We are structuring teams in pods that own discrete elements of the experience. While every pod may vary, they will typically have:

  • A product manager
  • A product designer
  • A Tech Lead
  • Multiple software engineers
  • A marketer
  • A product analyst

It will be the product analyst’s responsibility to:

  • Be the expert on the team about data
  • Empower the team to do their own reporting and answer the vast majority of their own questions
  • Do deeper analyses than any other team member can do

If this is interesting to you, please apply here.

© 2022 Dan Wolchonok
