Harnessing the Power of Unstructured Data
From the CIA to machine learning, the Founder and CEO of Unstructured walks us through his projections for the future of unstructured data modeling
Data isn’t just information across spreadsheets, evaluations, and metrics. Information, attitudes, and findings lurk in emails, drives, and PowerPoints, and Brian Raymond wants companies to better see, read, and learn from these unstructured sources of knowledge. With Unstructured, he’s compressing the amount of time it takes to go from raw data to machine learning-ready by about 99%.
I sat down with Brian to dig deeper into his past across policy work, the Middle East, San Francisco, and more to understand his perspective on the power of mining, curating, and learning from companies’ unstructured data.
Let’s start with how you got to Unstructured.
About 15 years ago, while I was working on a PhD in Northern California, I was recruited by the CIA. I worked for them in the Middle East in the Directorate of Intelligence—now it’s called the Directorate of Analysis. I ended up moving from intelligence down into a policymaking role on the National Security Council. I then worked for President Obama for a couple of years through some of the turbulence we had in the Middle East with ISIS.
What was the PhD about? And how does one get recruited by the CIA?
Well, it’s a funny story. It was researching comparative politics at UC Davis—in particular, constitutional design and electoral rules. I was trying to understand how to architect the rules of democracy to maximize the likelihood of success.
There are a lot of lessons learned from East Asia as well as Eastern Europe after the fall of the Soviets. They were really the only recent lessons available for places like Afghanistan and Iraq as they rebuilt their governments. But at the time, some of the Iraqi parties were manipulating esoteric rules, which would have had profound implications for the long-term U.S. relationship with the Iraqi government. I was brought in not as a Middle East expert, but as a mechanics of democracy expert. My role grew from there.
How do you move from thinking about the structure of a constitution to processes like extraction, transformation, and loading data for large language models?
My PhD was a quantitative-heavy program. In fact, the first year and a half was almost entirely quantitative research methods. It was all econometrics and research design. I took that “toolkit” with me to the agency. I saw really interesting advanced analytics initiatives, including big data, AI, and machine learning in the early days, and found a home working on projects and initiatives that were capitalizing on some of these emerging technologies.
By 2015, I was starting to feel the pull to find a new mission and intellectual stimulation that would advance my career in other areas. I ultimately did an MBA at Dartmouth, then landed at Primer.ai, a small Series A company focused on natural language processing, right as transformer-based models began to emerge. I was the second non-technical hire, and I ended up spending the next four and a half years there building a fantastic business and learning everything there was to learn about operationalizing NLP for enterprise applications.
That’s really what led me to Unstructured. It was the lived experience of taking these models we were fine-tuning and pipelining, and building applications on top of them. We would get so close to the problem that we’d see the wrinkles. We would spend years just doing data preparation for some customers.
What was involved in that data preparation?
Sometimes we’d show a demo on LexisNexis or Twitter—natural language data—and then build pre-processing pipelines so you can run all sorts of fabulous machine learning capabilities on top of it. They'd say, "We love it. We want it on our data." We’d apply those pipelines to over 20 years of data on SharePoint: PowerPoints, scraped HTML, and emails. There's a lot of knowledge contained in these files that organizations might want to use for fine-tuning or inference, but the pipelines are super fragile and very expensive to get up and running in the first place. You have to strip out advertisements, headers, footers, and all sorts of stuff that would pollute the knowledge graph you’re generating from the model. Building it is also completely manual: zero tooling, all regex, all Python scripts, all off-the-shelf OCR, all duct-taped together and hard-coded to specific document templates. If a space emerges or things shift a little to the left, a little to the right, everything breaks.
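To give a sense of how brittle that style of pipeline is, here's a minimal sketch of the kind of hand-rolled cleanup script involved; the template, regexes, and function names are hypothetical illustrations, not Unstructured's actual code:

```python
import re

# Hypothetical rules hard-coded to one document template; any layout change
# (an extra space, a renamed header) silently breaks them.
HEADER_RE = re.compile(r"^ACME CORP CONFIDENTIAL\s*$", re.MULTILINE)
FOOTER_RE = re.compile(r"^Page \d+ of \d+\s*$", re.MULTILINE)
AD_RE = re.compile(r"ADVERTISEMENT.*?END ADVERTISEMENT", re.DOTALL)

def clean_document(raw_text: str) -> str:
    """Strip headers, footers, and ads from one known template."""
    text = HEADER_RE.sub("", raw_text)
    text = FOOTER_RE.sub("", text)
    text = AD_RE.sub("", text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```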
And is that what Unstructured is trying to solve?
Yes. That’s the problem we're tackling. We're compressing the amount of time to go from raw data to machine learning-ready by about 99%.
How does it work?
We have almost 400 Python libraries, or open-source repos, integrated together alongside the stuff that we've developed internally. Conceptually, it's pretty straightforward. Let’s say you have an S3 bucket of a hundred thousand files of every file extension and document layout. It's the whole knowledge of an organization for the last five years. When you feed that in, an auto strategy will recognize file extensions and route each file to the appropriate transformation pipeline. Doing that optimizes speed and accuracy because we're not jamming everything through expensive models.
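In the open-source Python library, that flow boils down to a single call. Here's a minimal sketch, with the caveat that module paths, strategy names, and element categories can vary across versions, and the filename is just an illustration:

```python
# pip install "unstructured[all-docs]"  # package extras vary by version
from unstructured.partition.auto import partition

# The auto strategy inspects the file type and routes it to an appropriate
# partitioner, so cheap rule-based paths handle what they can and only the
# harder documents escalate to model-based processing.
elements = partition(filename="quarterly_report.pdf", strategy="auto")

# Each element comes back tagged with a category (Title, NarrativeText,
# ListItem, Header, Footer, ...), which is what makes the filtering
# described below so easy.
body_text = [el.text for el in elements if el.category == "NarrativeText"]
```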
We can then route some things to extremely cheap models and escalate others to more expensive ones. Everything comes out the other end in clean JSON without artifacts. We’ll then classify titles, body text, list items, headers and footers, and so on. It then becomes effortless to ask it for only body text and rip out everything else. Irrespective of what came in, you've now got metadata tags that you can organize, study, group, and extract. But we raised our seed round from scratch in July of 2022. By November, ChatGPT dropped.
Maybe you could have been six months earlier and it probably would've been okay.
Yeah, there are a couple of things there. One, our initial approach assumed we were going to need thousands of layout-specific pre-processing pipelines, which we would empower data scientists to build and then share with one another through composable bricks we developed. The introduction of LLMs meant that you could tolerate a little more noise in the data, so it didn't need to be as sterile.
Second, data scientists hate doing this data engineering crap. They don't want to spend any time on it. They don't derive social or professional capital from sharing a cool pre-processing pipeline the way they do from models they share on Hugging Face. What they want to do is share a cool RAG architecture or a fine-tuned model they made. And so, by December, we said, "Look, we're going to need a different approach here. We need to flip this around. So, instead of the bricks-based approach, we need to distill this down to a single API endpoint so that anyone can just throw anything at it and end up with a populated, crystal-clean vector database."
Nobody had done that before because it was a nightmare. But we had the team to do it. We had the experience. Everyone else is going after sexier problems; we're going to go after the ugly stuff that everyone hates, but that will give 80% of data scientists' time back.
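In practice, that single endpoint is just an HTTP call. A minimal sketch, assuming the hosted API's public endpoint and response shape at the time; the URL, header name, filename, and key below are illustrative and may have changed:

```python
import requests

# Hypothetical file and key; the endpoint reflects Unstructured's public
# hosted API as documented at the time and may differ today.
with open("any_messy_document.pdf", "rb") as f:
    response = requests.post(
        "https://api.unstructured.io/general/v0/general",
        headers={"unstructured-api-key": "YOUR_API_KEY"},
        files={"files": f},
    )
response.raise_for_status()

# The response is a list of JSON elements (type, text, metadata) that are
# ready to chunk, embed, and load into a vector database.
for element in response.json():
    print(element["type"], "->", element["text"][:60])
```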
It reminds me of Paul Graham’s essay “Schlep Blindness,” which points out the advantage startups gain by tackling problems others have ignored because of how much of a schlep it would be to fix them.
Totally. I'm blown away all the time by our engineers. We have computer vision folks, natural language processing folks, and depth on the backend infrastructure side. Many have done customer-facing work using regex and Python, so we can pragmatically integrate all of that to deliver a fantastic user experience. We have around 30 engineers banging away on this problem area.
Where are you along the journey right now?
We're a little over 14 months in. We released a Python library last February, then upgraded it in March. We pushed out containers as quickly as we could, and a free API in July. It’s great, but the number of dependencies is through the roof. The feedback continues to be, "We'd like it simpler, simpler, simpler."
That said, we just crossed two million downloads since April.
Two million downloads—how many users is that?
Well, that’s the challenge of being an open source company. We have GitHub stars, but not a ton. We’ve got a big consumer community but not a big contributor community, so that’s a little under 3,000 stars. Slack has about a thousand members, and there are a lot more people hitting our API. I think we're integrated into somewhere between 7,000 and 7,500 repos now. We also look at pip installs and the rate of pip installs. In total, though, we've processed probably tens of millions of pages since the API went live in July.
What will the business look like over time, and how much time do you spend thinking about that today?
I spend a lot of time deliberately thinking about it, even if it makes me anxious. What do we care about at the end of the day? To me, I care about people successfully using RAG applications and fine-tuning or pre-training models. Adoption is the number one thing that matters—not just for prototypes but for production. We want people to move prototypes into production. We find this almost everywhere in the industry: the 80 to 85% solution is not hard to get to, but the 99% solution, the one that's going to delight Howard in HR at some company in Illinois, is tough.
The objective of the commercial offerings we'll be rolling out starting next month and continuing through Q4 is to have a production-grade product ready once those prototypes are ready to graduate.
There's a little bit of cognitive dissonance in the context of ChatGPT, Midjourney, and a handful of other application hits. The reality is the vast majority of work that has been done is still stuck in prototype.
The backdrop of this is that over the last decade, 80 to 90% of AI and machine learning initiatives have failed. How do we change that?
Well, another way to think about that is that a 10% hit rate is enough to change the world radically over time. Facebook and Uber’s machine learning programs did wonders for them.
Yeah. It's a synergistic and complementary ecosystem right now. I'm planning with my peers at places like Weaviate, LangChain, LlamaIndex, and Arize AI.
How do you think about privacy and security?
There are a number of different approaches, or frameworks, being tested out right now. It's been instructive to watch the declining usage of ChatGPT, or organizations banning it entirely. I think that's a testament to the value and sensitivity of the natural language data that those companies are producing, even if it's just prompts.
I'm probably closer in my views to Clem at Hugging Face than I am to Sam Altman on this. The only way to go over the medium term, for most of the economy, will be small models that can actually perform and own narrow tasks, deployed in their own regions. Microsoft's taking a hybrid approach, but I just don't know how many instances of GPT-4 will work from a security standpoint.
And, as far as I’m aware, running larger models is just way too expensive right now.
Yes. For massive models, you can’t afford to host multiple parallel instances in different cloud regions. It's just cost prohibitive, and we're not yet fully cognizant of the total cost of ownership. If you want the security of running these models in your region, you're going to have to redeploy them, and you're going to have to host them. The GPU math on how to scale with users is scary. It's almost a GPU per user, or else jobs back up significantly in the queue and the models' ability to supplement workflows in a seamless fashion declines.
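To make that back-of-envelope concrete, here's a rough sketch; every number in it is an illustrative assumption, not a figure from the interview:

```python
# Illustrative, order-of-magnitude assumptions only; none of these numbers
# come from the interview.
params = 70e9                # a large open-weight model
bytes_per_param = 2          # fp16 weights
weight_gb = params * bytes_per_param / 1e9   # ~140 GB of weights alone

gpu_mem_gb = 80              # one high-end datacenter GPU
gpus_for_weights = weight_gb / gpu_mem_gb    # ~2 GPUs before KV cache or batching headroom

# Interactive users expect roughly 20-30 tokens/s to feel responsive, while a
# deployment this size might sustain on the order of 100 tokens/s in total, so
# a handful of concurrent heavy users saturates it and jobs start to queue.
print(f"{weight_gb:.0f} GB of weights -> at least {gpus_for_weights:.0f} GPUs per regional instance")
```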
What is the first principle around your passion for unstructured data?
I come from a national security background. I've spent a lot of time in that domain, and I take a lot of lessons from Palantir and what they did to unlock the power of companies' structured data right as we started going to the cloud in the mid to late 2000s. They’ve done incredible things for the customers they serve and for their peers in the data integration space. But there’s been no corollary yet for the 80% of data that organizations generate that's unstructured. We’re relying only on the structured data, the 20%—one fifth of organizations' information. Unlocking the rest is going to be transformative, like what Snowflake, Fivetran, and others have done on the structured side.