Big Data and Personalization
Throughout this series, we have looked at some of the biological reasons for using RNA as a biomarker/diagnostic tool, the relevance of RNA in diseases such as cancer, and how RNA-seq and molecular diagnostics in general are changing patient care. As we close, we will review where the field of RNA-seq has come from and where it may be going.
The overarching question in the field of molecular diagnostics is how to incorporate individual experiments or test results into the larger whole, and then how to use that larger whole to inform intervention at the individual level. It is an issue of “big data” – that annoying and often uninformative cliche – versus personalization. Next-generation sequencing is spinning off enormous volumes of information. Billions of bases per individual genome, all read for $1000 in a few days, is an extraordinary accomplishment not just for science but for humanity. It also leaves us with roughly 700 megabytes per genome just to store the string of bases. (Reid Robinson does the math in a great post on Medium.) And that doesn’t begin to capture the full problem, as each NGS run produces massive amounts of additional data, including all the reads required for appropriate depth and their associated Phred quality scores. The reality is that each full genome takes up slightly less than 200 gigabytes. A commentary published in 2013 estimated that annual global sequencing capacity could produce “15 petabytes of compressed genetic data.” So there’s the issue of storage – which is probably the least daunting of the issues facing data scientists.
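If you want to check those figures yourself, a rough back-of-the-envelope calculation gets you there. This sketch assumes a ~3.1 billion base genome, 2-bit base encoding for the bare sequence, 30x coverage, and roughly 2 bytes per sequenced base once quality scores and overhead are included; the exact numbers vary by pipeline.

```python
# Back-of-the-envelope genome storage math (illustrative assumptions,
# not exact figures from any specific sequencing pipeline).
GENOME_BASES = 3.1e9   # approximate length of the human genome
BITS_PER_BASE = 2      # A/C/G/T fits in 2 bits

# The bare sequence, 2-bit encoded:
sequence_mb = GENOME_BASES * BITS_PER_BASE / 8 / 1e6
print(f"Bare sequence: ~{sequence_mb:.0f} MB")  # ~775 MB, near the 700 MB figure

# A real run also keeps the raw reads and their Phred quality scores.
# Assume 30x coverage and ~2 bytes per base once qualities, read names,
# and overhead are included (FASTQ-like, uncompressed):
COVERAGE = 30
BYTES_PER_BASE = 2
run_gb = GENOME_BASES * COVERAGE * BYTES_PER_BASE / 1e9
print(f"Full run with reads + qualities: ~{run_gb:.0f} GB")  # ~186 GB
```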
Moore’s law should mean that warehousing information isn’t that big of a deal. But finding useful information has nothing to do with semiconductors and ever-smaller chips. While “big data” has become a buzzword in many industries, it is only useful insofar as the data is, well, useful. This plays out on multiple levels. Broadly speaking, compiling data across many samples allows researchers to discover novel, clinically relevant genetic signatures in various diseases. From there, disease states can be diagnosed and even re-defined. What may have looked like two cases of the same cancer based on traditional methods could, in fact, turn out to be two unique subtypes (or even sub-subtypes) based on highly detailed genetic signatures. Thus, large-scale data provides powerful statistical patterns that can be applied at the other end of the spectrum, the individual level.
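To make the subtype idea concrete, here is a minimal, purely illustrative sketch using simulated data: two groups of samples that share a diagnosis but differ in a block of expressed genes separate cleanly when clustered. Real subtype discovery involves far more careful statistics than this.

```python
# Hypothetical sketch: clustering expression profiles can split what looks
# like one disease into distinct molecular subtypes. Data here is simulated.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_genes = 500

# Two hidden subtypes: same diagnosis, different expression signatures.
subtype_a = rng.normal(loc=0.0, scale=1.0, size=(40, n_genes))
subtype_b = rng.normal(loc=0.0, scale=1.0, size=(40, n_genes))
subtype_b[:, :50] += 3.0   # 50 genes shifted upward in subtype B

samples = np.vstack([subtype_a, subtype_b])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(samples)
print(labels)  # the samples separate cleanly into two groups
```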
Ultimately, the individual level is the only thing that matters. Intervening in ways that change the course of a patient’s disease is the only reason racks of digital storage exist in the first place. Cofactor Founder and CEO Jarret Glasscock points to this relationship between aggregate data and individual data as a focal point for the industry in general, and Cofactor in particular. Actionable insights cannot be derived from an individual patient’s RNA signature by looking at that signature alone. Instead, the data must be put into a larger context of health and disease. Comparing a single sample profile to thousands of others provides enough statistical power to offer prognoses and, potentially, therapeutic direction. Cofactor is taking this approach. With its new services, the team can compare RNA profiles of clinical tumor samples to a repository of over 10,000 others. The software and informatics developed by Cofactor scientists are robust enough to perform these comparative searches in seconds, providing an example of big data made accessible.
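Why can a search like that run in seconds? The core idea, sketched below in a deliberately simplified form (this is not Cofactor’s actual software), is that comparing one profile against thousands reduces to a single matrix-vector product, which modern hardware handles almost instantly. The repository size and gene count here are made up.

```python
# A minimal sketch of profile-vs-repository comparison using cosine
# similarity. Not Cofactor's actual software; all data is random.
import numpy as np

def top_matches(patient, repository, k=5):
    """Return indices of the k repository profiles most similar to `patient`."""
    # Normalize rows so that dot products become cosine similarities.
    repo_norm = repository / np.linalg.norm(repository, axis=1, keepdims=True)
    patient_norm = patient / np.linalg.norm(patient)
    similarity = repo_norm @ patient_norm      # one matrix-vector product
    return np.argsort(similarity)[::-1][:k]    # indices, most similar first

# 10,000 reference profiles x 2,000 genes (real panels may use far more):
repository = np.random.rand(10_000, 2_000)
patient = np.random.rand(2_000)
print(top_matches(patient, repository))
```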
While the informatics tools provide a mechanism to interact with the data, RNA itself makes that data useful in clinical settings. Since the early 2000s, clinicians have ordered DNA testing on patients to look for mutations. Many of these – take BRCA, MTHFR, or cytochrome P450 as easy examples – are well established and provide valuable predictive information for patients. Unfortunately, predictions aren’t guarantees, as every financial advertisement is legally required to remind us. Fortunately, RNA makes advances in genome science more actionable than risk-assessing DNA tests can be on their own. RNA-seq reveals which genes are actually turned on (or off) in a diseased tissue. That, as Glasscock says, “might have much more weight to it than the other mutations that are actually not in expressed genes.” RNA-seq therefore helps its users avoid being distracted by shiny objects in the form of interesting mutations that turn out to have no clinical relevance in that specific patient. As such, it is a complementary tool to DNA-based tests, both validating and adding to information derived from other diagnostics.
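A toy example makes the DNA-plus-RNA logic tangible. In this hedged sketch, a DNA test has flagged several mutated genes, and an RNA-seq expression table is used to prioritize only the ones the tumor is actually expressing; the gene names, TPM values, and cutoff are all invented for illustration.

```python
# Illustrative only: filter DNA-flagged mutations by RNA expression,
# keeping mutations that fall in genes the tissue is actually expressing.
expression_tpm = {"TP53": 85.2, "BRCA1": 12.7, "GENE_X": 0.1, "KRAS": 40.3}
candidate_mutations = ["TP53", "GENE_X", "KRAS"]  # flagged by a DNA test

EXPRESSED_TPM = 1.0  # assumed threshold for calling a gene "on"

actionable = [gene for gene in candidate_mutations
              if expression_tpm.get(gene, 0.0) >= EXPRESSED_TPM]
print(actionable)  # ['TP53', 'KRAS'] -- GENE_X is mutated but not expressed
```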
Coming back around, though, finding these nuggets of actionable intelligence requires ever-evolving tools to sift through those annual petabytes. Exciting projects are already underway toward this end. For example, in the previous post about the human side of RNA-seq, we mentioned the case of Lukas Wartman at Washington University. In addition to being a cancer survivor aided by RNA-seq and personalized medicine, Wartman is involved in a program using artificial intelligence technology from IBM to apply big data to individual cases. The goal is to let Watson, IBM’s famous AI system, do the work of sifting through databases of genomic data to find clinically relevant insights and identify potential therapeutic interventions. (For more on this project, see posts from Wartman and Forbes.)
IBM is only one blue-chip company applying big data analytics to genomic science. Apple has opened collaborations that could add some level of genetic sequencing to its iPhone HealthKit suite. Google now offers Google Genomics, designed to “organize the world’s genomic information and make it accessible and useful.” And many smaller companies are working to apply artificial intelligence to healthcare. Though not all of these efforts are related to genomics, advances in the field in general could easily prove broadly applicable. For example, on a recent StartupHealthNow Podcast, Pinaki Dasgupta discussed his company’s mission to apply AI and predictive analytics to healthcare. Hindsait is focused on using AI to help providers and payers cut down on unnecessary services, minimize medical errors, and reduce fraud. Their work is outside the scope of RNA-seq, but the tools – AI and machine learning – are relevant for mining the vast reservoirs of genomic data inundating the scientific community.
As Glasscock pointed out, the struggle over the past decade has been to “get our head wrapped around how to use that data…what were the hurdles involved?” Now, both genomic science and data science are developing to a point where we can predict real outcomes and create novel treatments that will dramatically change the course of cancer and other diseases. Here at Cofactor, we feel uniquely positioned to contribute in this area. Visit our Clinical Assay page to learn more about the products we’re developing, and connect with us to continue the conversation.