Since Truveta’s inception 5 years ago, we have been tirelessly working to build a complete, clean, and timely dataset reflective of US health. Truveta Data now includes 120M de-identified patients’ medical records updated daily. It’s been incredible to work with the CDC, Bayer, Boston Scientific, Boehringer Ingelheim, Stanford, Johns Hopkins, and 100+ other innovative partners who use Truveta to understand the safety and effectiveness of all kinds of drugs and devices used every day and improve insight into immediate public health concerns.
But there’s been something missing from this large “phenotypic database.” Health outcomes today are a product of our biology, our environment, and our medicine. At Truveta, we integrate socioeconomic data to understand the environment, while the EHR data captures our medicine. But how do we get to the biology — aka our “genotype” — at the core of each of us? This data is not collected during the standard course of our care — it just doesn’t exist at any scale, and not for me or anyone in my family. We haven’t been collecting this data during care today, not only because it’s expensive to do so, but because we don’t know enough yet about the data for it to be useful.
The US NIH has invested over $3 billion to create a dataset of ~300,000 patients. The UK government and philanthropy funded the UK Biobank to the scale of ~500,000 British people, yet over 80% of them are of European ancestry, which limits large-scale research across all populations. While both of these valuable but relatively small datasets enable high-quality research today, they are insufficient to feed today’s AI and deliver the breakthroughs in medicine that all of our diverse communities deserve. And that is why we are introducing the Truveta Genome Project. Our plan is to build a database of tens of millions of consented, de-identified Americans. At over ten times the scale of previous endeavors, this groundbreaking project ensures comprehensive representation across ancestries, ethnicities, genders, and other social drivers of health.
It’s hard not to overdramatize the potential of understanding the genetics of disease. We don’t know why some people get lung cancer without ever smoking. We don’t know why some people drink like a fish and never have any liver issues, while others drink socially with deleterious effect. Some folks must manage every calorie while others eat freely without gaining weight. We spend billions on colonoscopies for people who will never get colon cancer. When we have this genetic data linked with longitudinal health information, we will be able to better understand and hopefully prevent disease, not just treat it. When we have this data, we will finally begin to address the cost of healthcare through early detection and prevention rather than chasing the never-ending challenges of chronic disease management. The impact will be profound.
One example really hits home for me. We learn the connection between one gene and a rare form of hearing loss. A therapy is developed, and this little girl can now hear.
Introducing the Truveta Genome Project
Our goal with the Truveta Genome Project is to sequence the exomes of tens of millions of consented and de-identified volunteers, creating the largest and most diverse database of genotypic and phenotypic information ever assembled. We will bank biospecimens to do further multi-omic sequencing of patients when research requires it.
In the Truveta Genome Project, our health system members work with their patients to collect biospecimens for clinical lab tests. With patients’ consent to participate in the project, the leftover biospecimens from those lab tests and the associated de-identified clinical data are shared with the Regeneron Genetics Center (RGC). RGC performs the genetic sequencing and shares the data with Truveta. Truveta then makes the de-identified data available to academic, public health, and life science researchers.
Unleashing the power of AI
AI is the backbone of Truveta Data. Today, the Truveta Language Model transforms millions of unstructured clinical events in the US into de-identified, clean real-world data for research. Genetic data opens an exciting new frontier, but it brings challenges in terms of scale and semantics. Truveta AI is here to tackle these challenges and make genomic data, and all omics data, useful for research.
Firstly, consider the scale. A whole exome sequence consists of 25 GB of genetic data (“AGCTGTCAGT…”). Whole genome sequences are 50 times larger, and other multi-omics sequences are even more complex and not human-readable. AI is essential for processing genetic data to interpret and integrate knowledge from various formats with different levels of detail.
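To make that scale concrete, here is a back-of-the-envelope sketch in Python. It uses only the two figures quoted above (25 GB per exome, whole genomes 50 times larger) plus one assumption of ours: a cohort of 10 million participants, chosen as the low end of "tens of millions". It ignores compression, quality scores, and derived files, so treat the results as rough orders of magnitude rather than a storage plan.

```python
# Figures taken from the text above.
EXOME_GB = 25             # raw size of one whole exome sequence, in GB
GENOME_MULTIPLIER = 50    # whole genome is ~50x the size of an exome

# ASSUMPTION (not from the text): 10 million participants,
# the low end of "tens of millions".
ASSUMED_PARTICIPANTS = 10_000_000


def total_petabytes(per_sample_gb, participants):
    """Total storage in decimal petabytes (1 PB = 1,000,000 GB)."""
    return per_sample_gb * participants / 1_000_000


exome_pb = total_petabytes(EXOME_GB, ASSUMED_PARTICIPANTS)
genome_gb = EXOME_GB * GENOME_MULTIPLIER
genome_pb = total_petabytes(genome_gb, ASSUMED_PARTICIPANTS)

print(f"Exomes for the assumed cohort:  {exome_pb:,.0f} PB")
print(f"One whole genome:               {genome_gb:,} GB")
print(f"Genomes for the assumed cohort: {genome_pb:,.0f} PB")
```

Even under these simplified assumptions, the exome data alone lands in the hundreds of petabytes, which is why manual review is not an option.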
The second challenging dimension of genetic data is its semantics. The same genes can produce a wide range of phenotypes in different environments. Currently, genome-wide association studies (GWAS) are performed to identify genetic variants linked to diseases or traits by finding variants at specific DNA locations that occur more frequently in individuals with a particular disease than in those without it. While the manually engineered algorithms used in GWAS have advanced our understanding of genetic mutations linked to both common and rare diseases, a significant portion of heritability remains unexplained. Effective data integration is crucial to unraveling the complex interplay within the genome, as well as interactions with environmental factors, all of which influence phenotype expression.
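For readers who have not seen one, the core comparison in a GWAS can be sketched for a single variant as follows. This is a toy illustration, not Truveta's or RGC's pipeline: the function name `allele_chi_square` and the allele counts are hypothetical, and real studies repeat this kind of test across millions of variants while adjusting for ancestry, covariates, and multiple testing.

```python
import math


def allele_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 table of
    allele counts: rows = cases/controls, columns = alt/ref allele."""
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            if expected == 0:
                raise ValueError("a row or column of the table is empty")
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: the alt allele makes up 30% of alleles in cases
# and 20% of alleles in controls.
chi2, p_value = allele_chi_square(300, 700, 200, 800)
print(f"chi2 = {chi2:.2f}, p = {p_value:.2e}")
```

A small p-value here only says the allele is more common in cases than chance alone would predict. It says nothing about mechanism or environment, which is the gap the paragraph above describes.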
By using AI to build the world’s largest linked genotypic-phenotypic database, we will help researchers pose and test causal hypotheses using real-world clinical data. We’re thrilled to collaborate with partners in public health, life sciences, and academia, deploying modern AI to unlock the biology behind diseases and improve health outcomes. Clean, complete data is essential, and we’re excited to provide it.
How did this come together?
At the core, the Truveta Genome Project is made possible through the mission and economic alignment of our 30 health system members and our work since 2020 to build Truveta together.
We need this data to enable precision medicine, improving the health of our communities. And just like Truveta Data today unlocks value in the data health systems accumulate, the Truveta Genome Project unlocks value in the remnant biospecimens health systems currently pay to dispose of.
In February 2021, Len Schleifer, CEO of Regeneron, personally reached out to learn more about Truveta. Looking back, this was an amazing act of foresight.
In January 2023, Regeneron became a customer of Truveta to study rare diseases. Through that work, we learned about RGC and its mission to sequence as many people as possible so they could understand the underlying science behind health outcomes. We began collaborating on a fundamental question: what is a long-term, sustainable business model that would allow this work to scale to all Americans? The key became reimbursing health systems and RGC for their efforts whenever de-identified data is used by others.
A breakthrough meeting took place in December 2023, when Michael Dowling and Larry Kraemer of Northwell Health; Rod Hochman of Providence; Len Schleifer, George Yancopoulos, Aris Baras, and Andy Deubler of Regeneron; and Ryan Ahern and I from Truveta decided to really dig in. Later, in September 2024, a key dinner brought together Dan Roth of Trinity Health, Larry, George, Aris, Andy, Ryan, and me as we were getting clarity on our respective commitments to make this happen. My family recalls passionate discussions from the Ngorongoro Crater in Tanzania last summer, and more on Christmas Eve, Christmas Day, New Year’s Eve, and New Year’s Day 2025 to finish things up.
Along the way, we had many discussions with Illumina and Microsoft about how best to deliver these capabilities. None of this would be possible without the incredible Microsoft Azure and Illumina NovaSeq technologies.
To support Truveta’s growth into genomics, Regeneron and Illumina both invested in Truveta. Their investments come with no board seats, no governance over Truveta, and no access to any customer confidential information.
And now here we are in January 2025 after incredible work over two years by so many people across 30 health systems, Illumina, Microsoft, Regeneron, and Truveta. It’s incredibly gratifying to launch the Truveta Genome Project — to build the most ambitious biobank in the world and the largest database of clinical and genetic information to unlock the science of humanity.
Thank you!
Participation in the Truveta Genome Project requires each volunteer to receive care at one of our member health systems and consent to their leftover de-identified biospecimens and data being used for medical research. No additional needle prick or blood draw is needed. We’re simply asking that you let us use any leftovers after the test your doctor has ordered is completed. I think of this like the organ donation decision each of us faces as we get our driver’s license: a simple consent that makes the world a better place for all of us. Organ donation saved more than 46,000 lives in the US in 2023.
To those of you who participate, thank you for contributing to the mission of Saving Lives with Data. Imagine the impact we could have together.