Gleb Mezhanskiy on Creating a Smoother Path For Data Systems
The CEO and Cofounder of Datafold set out to empower data teams to build reliable data products faster—a need he saw in previous roles at Lyft and Autodesk.
As data pipelines become more advanced, they become more complex—and introduce new opportunities for error. Gleb Mezhanskiy saw this firsthand when pushing an approved fix to production took down Lyft’s data platform for the better part of a day, completely stumping the senior data engineers trying to put the pieces back together again.
Now, he’s building Datafold, a two-year-old platform helping prevent these types of issues from ever making their way to production.
I spoke with Gleb to dive into what Datafold is, how it differs from its competitors, and how one catastrophic mistake led to its creation.
Let's dive in with you sharing a bit about what Datafold does.
Datafold automates the testing of data pipelines and data applications for analytics and data engineers.
Today companies are applying many steps of transformations to the data to make it more useful. The typical pattern is that data gets into a warehouse from events, third-party sources, and OLTP data sources. We have transformations using tools like dbt and Airflow. Then we feed this data into BI applications, machine learning applications, and data activation.
That's a multi-step process that gets pretty complex, and the more data-driven the business, the more complicated it gets. So the problem that Datafold solves is essentially testing that entire pipeline. We help data-driven companies minimize the number of errors that they're getting in their applications of data, as well as dramatically improve the productivity of people working with data.
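The multi-step pattern described above—raw data landing in a warehouse, then passing through transformation layers before reaching BI or ML—can be sketched in miniature. The step names and toy data below are purely illustrative, not Datafold's or dbt's actual structure:

```python
# Toy version of a multi-step pipeline: each function stands in for one
# transformation layer, the kind a tool like dbt or Airflow would run.

raw_events = [
    {"user": "a", "amount": "10.0"},
    {"user": "b", "amount": "5.5"},
    {"user": "a", "amount": None},   # bad record from an upstream source
]

def clean(events):
    """Staging layer: drop malformed rows, cast types."""
    return [{"user": e["user"], "amount": float(e["amount"])}
            for e in events if e["amount"] is not None]

def aggregate(events):
    """Mart layer: per-user totals ready for a BI dashboard."""
    totals = {}
    for e in events:
        totals[e["user"]] = totals.get(e["user"], 0.0) + e["amount"]
    return totals

user_totals = aggregate(clean(raw_events))
print(user_totals)  # {'a': 10.0, 'b': 5.5}
```

Each layer is a place where a small code change can silently alter the data that every downstream application consumes—which is the surface Datafold tests.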
How does Datafold differ from Monte Carlo?
I would say Monte Carlo is a leader in what's called data monitoring or data observability. I think where Monte Carlo is most effective is in catching issues that occur in production. For example, we have a warehouse and we have jobs that are computed daily or hourly that then feed into different applications. When there are issues with quality, such as fewer rows than expected or a certain part of the data missing, Monte Carlo can detect that and alert the team. So it's like a Datadog, but for data applications.
Datafold is focused on catching data quality issues before they get into production. Datafold integrates deeply with GitHub or GitLab CI/CD tools and data transformation tools, and we help catch issues while the developer is working on the code or staging changes to promote to production. In a way, you can think of the tools as complementary. They focus on different parts of the development workflow.
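The pre-production check described here can be pictured as diffing the staging version of a table against production before a merge. This is a minimal sketch—the `diff_tables` helper and toy rows are hypothetical, not Datafold's actual API:

```python
# Hypothetical sketch of a CI data-quality check: compare the table a
# code change would produce against the current production version.

def diff_tables(prod_rows, staging_rows, key="id"):
    """Return primary keys of rows added, removed, or changed."""
    prod = {r[key]: r for r in prod_rows}
    staging = {r[key]: r for r in staging_rows}
    return {
        "added": [k for k in staging if k not in prod],
        "removed": [k for k in prod if k not in staging],
        "changed": [k for k in staging if k in prod and staging[k] != prod[k]],
    }

# Production version of a toy `rides` table vs. the staged version.
prod = [
    {"id": 1, "fare": 12.5},
    {"id": 2, "fare": 8.0},
]
staging = [
    {"id": 1, "fare": 12.5},
    {"id": 2, "fare": 0.0},   # a hot fix accidentally zeroed a fare
    {"id": 3, "fare": 6.5},
]

diff = diff_tables(prod, staging)
print(diff)  # {'added': [3], 'removed': [], 'changed': [2]}
```

A CI job could fail the pull request when the diff is unexpectedly large, surfacing the problem before the change reaches production.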
Who are your biggest competitors?
In terms of the actual capabilities of the product, we don't have a direct competitor. I'd say business-wise, we are still, in a way, fighting for attention. This is just because in the modern data stack world, there are way more vendors and way more solutions than teams have attention for. We definitely have overlap with a number of players in the space, and some features of the product may look similar to what others are offering.
For example, Datafold also has capabilities for column-level lineage, which helps teams trace dependencies in their data platform, understand where data comes from, and learn how any given column is computed. So, if we're looking at any column in the data pipeline, we can know exactly who is using it and whether a certain column in a given table feeds into an executive dashboard or a machine-learning model. These capabilities exist in a number of products. The question is how they are applied. We apply them to do preventative data quality testing, versus PII or governance compliance.
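Column-level lineage like this can be modeled as a dependency graph and traversed to find every downstream consumer of a column. A small sketch, with made-up table and column names:

```python
from collections import deque

# Hypothetical column-level lineage: edges point from a source column
# to the downstream columns computed from it.
LINEAGE = {
    "raw_events.amount": ["stg_payments.amount_usd"],
    "stg_payments.amount_usd": ["exec_dashboard.revenue",
                                "churn_model.spend_feature"],
    "exec_dashboard.revenue": [],
    "churn_model.spend_feature": [],
}

def downstream(column):
    """BFS over the lineage graph: all columns affected by `column`."""
    seen, queue = set(), deque(LINEAGE.get(column, []))
    while queue:
        col = queue.popleft()
        if col not in seen:
            seen.add(col)
            queue.extend(LINEAGE.get(col, []))
    return sorted(seen)

print(downstream("raw_events.amount"))
# ['churn_model.spend_feature', 'exec_dashboard.revenue', 'stg_payments.amount_usd']
```

The same traversal answers both "what breaks if this column changes?" (for testing) and "where does this data end up?" (for governance)—the difference is in how the result is used.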
What is your background? Why was Datafold started?
I spent about seven years in data engineering. I was the first hire in Data Platform at Autodesk Consumer Group, which was a division at Autodesk that was focused on consumer apps. That was in 2014, and then I joined Lyft when they were at a hypergrowth stage as one of the founding members of the Data team.
At Lyft, I saw firsthand the challenges of what it takes to actually ship reliable data products. I was responsible for making sure data was delivered on time. One day, I was woken up by PagerDuty because of an incident. The pipeline was stalled because of bad data that had entered it. So I made some hot fixes to filter out that data and everything looked good. I got a plus one from my buddy on the team and merged everything to production. Everything was green. Or so it seemed.
The crazy part is not that I blew up the entire data platform and corrupted hundreds of essential analytical tables, as we found out the next day. The crazy part is that it took us six hours in the war room full of senior data engineers to actually be able to pinpoint the anomalies we were seeing to the tiny code change that I made the previous night. Luckily for me, Lyft didn't fire me, but actually put me in charge of building tools to prevent the very same issues.
Datafold is pretty much a natural continuation of that. I decided that I wanted to build a toolkit that would be useful for even a very successful company. We started Datafold in 2020, and the reason why we focused on the problem of testing data before production is because I've had this painful experience myself. It's also a huge productivity suck.
How did you start growing? What do you do to try to continue this growth?
We started two years ago with the assumption that we were going to work with large companies and large tech companies because they have big data platforms. Our first check was from Thumbtack. That was when we were in Y Combinator’s Summer 2020 batch. We then discovered that while the problem we are solving is definitely more valuable to solve at larger companies with larger scale, it is also very acute even for smaller teams. We started seeing companies that just hired their first data person, or a brick-and-mortar business that wanted to get smarter about its operations, coming our way and adopting the platform. So we’ve been steadily letting more and more people start using the product.
How big is the company now?
We have 32 people as of today and we are all remote. We are globally distributed across 11 countries. We’ve embraced remote from the very beginning, and it's been working really well for us.
What do you do to stay close to your user and their needs? How does that help you stay on top of the evolving product roadmap?
I’ve been part of the sales calls for the majority of our customers. We also have shared Slack channels with most of them. Slack channels are actually a really cool tool because that's where teams now spend most of their time. We are pretty much one click away if they want to ask us a question or provide feedback.
The other part, which has been helpful on the more strategic level for both current customers and prospective customers, is we are running a community called Data Quality Meetup. This is a virtual event that happens quarterly. It brings in data practitioners and sometimes open-source vendors. We have four or five lightning talks on different topics and then a panel discussion. We've actually gotten a lot of insight from that and it's turned out to be quite effective. It’s much less of a lift than a full-blown conference, and we can host it quarterly. It’s been a great way to engage with people, get feedback, and get new ideas.
What's the pitch to someone who wants to come work with you?
We are solving one of the most painful bottlenecks in the data engineering workflow, and the reason it matters is because we believe data is going to eat the world. That's not news to anyone, but I think the less obvious part is that while there are so many fancy applications of data, whether in machine learning, advanced algorithms, or BI tools, they all depend on the same foundation.
Ultimately, the biggest bottleneck out there is in being able to serve high-quality, reliable data, which is ultimately what data engineers and analytics engineers do. So by empowering them, improving their productivity, and helping them ship high-quality data to the downstream use cases, we are enabling amazing applications of data for companies, non-profits, and anyone in the world. It's a great way to make an impact.