Everyone knows the big Web 2.0 companies use hundreds of data points to determine which ad we might prefer. And yet, in the deathmatch against disease, we reduce human health to single variables.
Granted, this has partially been due to immature technology and infrastructure; after all, an assembly line of PhDs can only annotate the genome so quickly. There is also a hard limit on a human’s ability to find patterns within the noise.
In the last couple of years, however, a few trends have reshaped the landscape for startups working at the intersection of computer science and biology:
1) the hardware layer of the genomics stack has been commoditized,
2) the cost of genomic sequencing has fallen below the threshold required for routine reads,
3) data storage is effectively free, and
4) sophisticated computational tools, including deep learning, have matured, allowing us to apply strategies that were not possible before.
Once in a while, there is an inflection point that completely changes the rules of the game. We saw this in the early 2000s, for example, when suddenly you didn’t need a big check to build your own servers and infrastructure just to get a website up and running.
What this shift enables is a new generation of biotechnology companies, very distinct from their predecessors, with characteristics not unlike those of the software and machine learning companies we are familiar with.
The characteristics that make software startups so appealing (you can test your idea cheaply, de-risk early, and scale quickly) will also be found in this new generation of biology companies. In fact, many of these startups should really be thought of as machine learning/software companies with domain knowledge in biology. Just as we saw an explosion of web startups running many experiments at a low cost in the mid-2000s, we expect to see a similar phenomenon in the biology space.
And clearly genomics is a big-data problem, arguably the biggest today. The thing is, most people think of the genome as a static, tell-all dataset. In reality, even your somatic DNA changes at an astonishing rate; in fact, we can predict your age to within about five years from your genome. That would not be possible if your genome were static. So we need to reframe the genome as a dynamic, real-time data stream of what is happening in the body. Then, of course, we also need to couple longitudinal genomic datasets with time-series biomarker data before we can use our new tools to understand human health a little better.
We have been excited to meet teams that are fully leveraging the promise of this new era. A couple of weeks ago, for example, we met the cancer diagnostics startup Freenome, which uses cell-free DNA from liquid biopsies to detect cancer at an early stage. If that sounds scary, at a very high level it is just a machine learning classification algorithm. What is exciting is that they have essentially taken an agnostic approach to the problem. Healthcare is a notoriously slow-moving industry, but imagine a future in which new findings are simply incorporated through a software update.
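To make the "just a machine learning algorithm" framing concrete, here is a minimal sketch of a binary classifier in plain Python. It is emphatically not Freenome's method, which we have not described here. The two numeric features, the synthetic "healthy" and "disease" samples, and every function name below are assumptions invented purely for illustration, and logistic regression is simply the plainest classifier we could pick.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_samples(n_per_class, seed=0):
    # Hypothetical data: two made-up features per sample.
    # Label 0 ("healthy") clusters near 0, label 1 ("disease") near 3.
    rng = random.Random(seed)
    samples = []
    for label, center in ((0, 0.0), (1, 3.0)):
        for _ in range(n_per_class):
            features = [rng.gauss(center, 1.0), rng.gauss(center, 1.0)]
            samples.append((features, label))
    rng.shuffle(samples)
    return samples


def predict_probability(weights, bias, features):
    # Estimated probability that a sample belongs to class 1.
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def train_logistic_regression(samples, learning_rate=0.1, epochs=500):
    # Full-batch gradient descent on the logistic (cross-entropy) loss.
    n_features = len(samples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in samples:
            error = predict_probability(weights, bias, features) - label
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Train on one synthetic cohort, evaluate on a separately generated one.
train_set = make_synthetic_samples(100, seed=0)
test_set = make_synthetic_samples(100, seed=1)
weights, bias = train_logistic_regression(train_set)

correct = sum(
    (predict_probability(weights, bias, features) >= 0.5) == bool(label)
    for features, label in test_set
)
accuracy = correct / len(test_set)
print(f"held-out accuracy: {accuracy:.2f}")
```

The "software update" idea maps onto this picture fairly directly: when new labeled samples or new features become available, you retrain and ship a new set of weights, while the code that makes predictions stays the same.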
Beyond disease diagnosis, we have seen startups working on agricultural genomics, drug response, and even the design of a new genomic programming language, all of which have captivated our imagination.
It will be tempting at times to dismiss these startups as naive; after all, many of them are tackling highly complex problems that generations of scientists have given blood, sweat, and tears to, only to make tiny contributions. And indeed, there are many technical and commercial bottlenecks we have yet to overcome (more on those in the next post). However, we have seen impressive real-world results, and we are excited about what is to come.