Why medical research still seems miles away from using genomics for large-scale breakthroughs, and why open source might help accelerate things.
You often hear someone at a conference chatter on about the great possibilities of genomics for advancing medical research. At least I do. But even with the great promises of precision medicine and the use of genomics, are we really as close to large-scale groundbreaking research as some claim?
To find out what’s really going on we had to dig deeper, look at the flow of genomics data for medical research and speak to someone who might know a bit more. Spoiler alert: a potential solution for speeding up medical research with genomics might be an open source one.
Costs of Sequencing
Graphic: Costs of Sequencing
Data source: Genome.gov/sequencingcosts. Graphics from nuviun’s investigative data journalism unit.
Description: Via the Human Genome Project, which was declared complete in 2003, the first whole human genome was sequenced more than a decade ago. Since then, costs have fallen much more rapidly than would be expected under Moore’s Law.
Who Says We Are Close to a Genomic Revolution?
At the end of last year, billionaire Dr. Patrick Soon-Shiong, according to Forbes the world’s richest doctor and owner of the bioinformatics company NantHealth, claimed that he would create a solution that allows analyzing a genome in 47 seconds. A new app, he said at the Forbes Healthcare Summit, should allow doctors to review a cancer patient’s genes on a BlackBerry phone.
Yet there were doubts. One of the doubters was Forbes writer Matthew Herper, who published on the topic and wrote that the claim of analyzing a genome in 47 seconds struck him as profoundly misleading.
Research from last year at the University of Chicago does show progress in terms of analysis speed. Whole genome analysis requires the alignment and comparison of raw sequence data, and the researchers’ motivation was to fight a computational bottleneck that stems from a limited ability to analyze multiple genomes simultaneously.
The researchers deployed a Cray XE6 supercomputer to analyze 240 genomes in approximately 50 hours. Their trick for speeding up the process was a parallelization approach that analyzes multiple full genomes concurrently. Broken down, this works out to around 12.5 minutes per full genome.
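That per-genome figure is simple arithmetic on the batch numbers; here is a quick back-of-the-envelope check, using only the figures reported above:

```python
# Back-of-the-envelope check of the per-genome analysis time
# reported for the University of Chicago supercomputer study.
genomes = 240
batch_hours = 50

minutes_per_genome = batch_hours * 60 / genomes
print(f"~{minutes_per_genome:.1f} minutes per genome")  # ~12.5 minutes
```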
According to an article published this January in Genome Biology, researchers at Nationwide Children’s Hospital developed an even faster approach. The new system, called Churchill, promises to analyze a whole genome sequence in as little as 90 minutes, going from the raw FASTQ text-based format through to high-confidence variant calls by spreading each analysis step across multiple computing instances.
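Churchill’s speed comes from this kind of scatter-gather parallelization: split the genome into regions, process them concurrently, and merge the results. The sketch below illustrates the general idea only, not Churchill’s actual implementation; the region list and the `call_variants_in_region` stub are hypothetical placeholders:

```python
from multiprocessing import Pool

def call_variants_in_region(region):
    # Hypothetical stand-in for one pipeline step (e.g. alignment
    # or variant calling) run on a single genomic region.
    return f"variants for {region}"

# Split the genome into independently processable chunks; real
# pipelines balance chunks by workload, not just by chromosome.
regions = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]

if __name__ == "__main__":
    with Pool(processes=8) as pool:
        # Scatter: analyze all regions concurrently.
        results = pool.map(call_variants_in_region, regions)
    # Gather: merge per-region outputs into one result set.
    print(f"merged {len(results)} region results")
```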
There are different types of genome sequencing: full genome sequencing and the exome sequencing alternative. The latter covers only the protein-coding genes in a genome. The exome consists of all of the genome’s exons, the coding portions of genes.
Mark Bartlett, founder and Managing Director of Geneix, a precision medicine startup in the UK, says that whole genome sequencing is fast becoming the gold standard in the academic community, driven for example by government projects, and that a lot of information is missed by exome sequencing alone. In the UK, for example, Genomics England aims to sequence 100,000 genomes via the 100,000 Genomes Project.
With about 180,000 exons making up only around 1.5% of the human genome, or around 30 million base pairs (compared to the roughly 3.2 billion base pairs of a full human genome), exome sequencing offers greater speed and lower cost than full genome sequencing. This matters because mutations in the exome are much more likely to have severe consequences than mutations in the rest of the genome.
Companies such as Illumina market their technology on the premise that the exome contains around 85% of known disease-causing variants, making whole-exome sequencing a cost-effective alternative to whole-genome sequencing.
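The size argument behind that cost case is easy to sanity-check; a minimal sketch using the approximate figures quoted above:

```python
# Rough size comparison using the approximate figures cited above.
genome_bp = 3_200_000_000  # ~3.2 billion base pairs, whole genome
exome_bp = 30_000_000      # ~30 million base pairs, exome

fraction = exome_bp / genome_bp
# Prints ~0.9%, on the order of the 1-2% usually quoted; estimates
# vary with how the exome is defined and which capture kit is used.
print(f"exome is ~{fraction:.1%} of the genome")
```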
The Genomic Timeline: How Did We Get To Now?
The Human Genome Project gave the field a massive push, with the main goal of providing a complete and accurate sequence of the more than three billion DNA base pairs of the human genome. The project started in 1990 and was completed in 2003, ahead of schedule. Since then, much has happened.
Interactive Graphic: The Genomic Timeline
According to humangenes.org, full or whole genome sequencing has advanced to cover almost 95% of the DNA, but it should not be confused with DNA profiling (which helps determine where genetic material came from and establish genetic relationships) or SNP genotyping. The sequencing technology has been around since 2000 (while researchers could already sequence small genomes back in 1979).
Our Genome is Sequenced, Now What?
We managed to reduce costs and accelerate the speed of sequencing. Now we have the data. What’s next? With the Angelina Jolie-Pitt cancer prevention case, which we reported on in an earlier post, people might have started to understand how genome sequencing can shape patients’ own choices.
But what does this mean for medical research? Whole genome analysis has had considerable consequences for cancer research. As Tuna and Amos write in their scientific article: “Whole genome analysis has impacted research of complex diseases including cancer by allowing the systematic analysis of entire genomes in a single experiment…”.
But there are challenges. Jill Adams writes in her research that pyrosequencing methods have drastically cut the cost of sequencing and may eventually give every person access to personalized genome information.
The next stage will require considerable work to generate, understand, organize, and apply this massive amount of data to human disease. Might the real challenge be that researchers are not well equipped to deal with the huge amounts of data from the sequencing?
The challenge: finding what’s in the data before knowing what to look for...
To find out whether that’s the case, nuviun interviewed genomics expert David Mittelman, who joined the genomics analysis startup Tute Genomics as CSO early this year. In an email to nuviun, Mittelman writes that he thinks a big bottleneck lies in figuring out what to do with the DNA differences:
“How can we quickly annotate the variants (differences) and then move toward interpretation and classification of the variants? This is what Tute is trying to help scale. Large sequencing groups and test providers that I have worked with all point to reporting as the bottleneck these days.” - David Mittelman
The company offers a cloud-based clinical genome interpretation platform. The founders raised $2.3 million in Series A funding last December, and the company was recently in the news with the announcement of the Startup Health - AngelList syndicate.
One of the main challenges Tute Genomics wants to help with is the unwieldy size of the data offered to researchers, who sometimes lack the skills to handle the data and produce results in a meaningful way. To use analyzed sequence data effectively, a researcher needs the knowledge and training to store, query, and analyze the information.
According to a paper by Hsi-Yang Fritz et al., data storage costs have become an appreciable proportion of the total cost of creating and analyzing DNA sequence data. A fully sequenced genome weighs in at around 100 GB.
To offer some context, this is approximately 190 hours of uncompressed CD-quality audio. In an interview with Gigaom, Andreas Sundquist, founder of DNAnexus, which provides a global network for sharing and managing genomic data and tools for medicine, said that the data grows to about 1 terabyte after analysis - about 1,500 CD-ROMs. DNAnexus just announced that it will provide its technology to the Garvan Institute of Medical Research in Sydney, Australia to support its genomics-based research initiatives, an example that confirms the demand for such systems across the research landscape.
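Those comparisons are straightforward unit conversions; a small sketch, assuming uncompressed CD audio at 176,400 bytes per second and a classic 650 MB CD-ROM (published hour counts vary a bit with the byte conventions and rounding used):

```python
# Unit-conversion sanity check for the storage comparisons above.
genome_raw_bytes = 100e9      # ~100 GB raw genome, as cited
genome_analyzed_bytes = 1e12  # ~1 TB after analysis, as cited

cd_audio_bytes_per_sec = 176_400  # 44.1 kHz * 16 bit * 2 channels
cd_rom_bytes = 650e6              # classic 650 MB CD-ROM

audio_hours = genome_raw_bytes / cd_audio_bytes_per_sec / 3600
cd_roms = genome_analyzed_bytes / cd_rom_bytes

print(f"raw genome ~ {audio_hours:.0f} hours of CD audio")  # ~157
print(f"analyzed genome ~ {cd_roms:.0f} CD-ROMs")           # ~1538
```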
As we sequence ever more genomes, costs could also rise significantly if we don’t find solutions for smartly distributing the data at the right size. One of those solutions might come from companies such as Tute Genomics that understand the value of open source and open access.
Video: Tute Genomics
As Mittelman explains, Tute Genomics offers a solution to this big data mess. The platform takes data fresh from the sequencer, either the raw reads or the variant calls that capture what makes one human genetically different from another. Once the data is on the platform, it basically works like a search engine, allowing users to cross-reference it against anything known in the public literature or other data sources. Aggregating information this way turns raw differences in human DNA into usable knowledge.
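Conceptually, that cross-referencing step is a lookup of each variant call against curated annotation sources. Here is a minimal sketch of the idea only; the `ANNOTATIONS` table, the variant key format, and the example entry are hypothetical placeholders, not Tute’s actual data model:

```python
# Hypothetical annotation table mapping a variant key
# (chromosome, position, reference base, alternate base)
# to what public sources say about it.
ANNOTATIONS = {
    ("chr17", 41245466, "G", "A"): {
        "gene": "BRCA1",
        "note": "example entry for illustration only",
    },
}

def annotate(variant_calls):
    """Cross-reference variant calls against the annotation table."""
    for variant in variant_calls:
        yield variant, ANNOTATIONS.get(variant, "no annotation found")

calls = [("chr17", 41245466, "G", "A"), ("chr1", 12345, "C", "T")]
for variant, annotation in annotate(calls):
    print(variant, "->", annotation)
```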
The platform is aimed at medical research and healthcare professionals to support better decision-making. Mittelman says the tool is designed to help researchers become smarter about how they crunch data and to draw relationships and observations between mutations and phenotypes. The software platform can also help clinicians in particular situations: if they are building clinical reports, for example, it can help guide clinical decision-making and support the generation of diagnostic reports, Mittelman explains in an interview with nuviun.
Regarding how researchers can access the data, and how Tute aims to get the right information to the people who need it, Mittelman says the company recently teamed up with Google Genomics to make all of this information available in Google’s environment. Google Genomics wants to support the life sciences community in organizing genomic information. The partnership certainly makes sense: Tute Genomics does the analysis of the data and uses Google’s environment to give people access.
Mittelman says that the cool thing about Google’s environment is that the platform allows “big querying”, meaning users can run fast searches and quickly intersect the data with specific information. From a researcher’s perspective, he explains, Tute took all the data it had and, with Google Genomics, structured it so that a researcher can easily access and query it.
So if, for example, a researcher wants to look at specific genomes and find out how many people carry a particular mutation, or how many people with that mutation show a specific phenotype or consequence, this is now possible. The Google Genomics partnership lets researchers essentially pose these complex questions as queries.
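Since Google Genomics exposed variant data through BigQuery, a question like “how many samples carry this mutation?” becomes a single query. Below is a sketch against Google’s public 1000 Genomes variants table as a stand-in; the exact table name, schema, and coordinates are illustrative assumptions, not Tute’s actual data:

```python
# Hedged sketch: count samples carrying a given variant via BigQuery.
# Table and fields follow Google's public 1000 Genomes dataset; treat
# the exact schema and the example coordinates as assumptions.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT COUNT(DISTINCT c.call_set_name) AS carriers
FROM `genomics-public-data.1000_genomes.variants` AS v,
     UNNEST(v.call) AS c
WHERE v.reference_name = '17'
  AND v.start = 41245465  -- 0-based position, example only
  AND EXISTS (SELECT 1 FROM UNNEST(c.genotype) AS g WHERE g = 1)
"""

for row in client.query(sql):
    print(f"{row.carriers} samples carry this variant")
```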
But that’s not all. Another way Tute Genomics makes it easier for medical researchers and others to access genomics data is via a new feature called “Expert Review”. It was launched at the recent American College of Medical Genetics and Genomics (ACMG) conference and helps researchers generate a report on Tute’s platform that surfaces other researchers’ findings on specific variants, along with public data sources.
The next step for a researcher who generates a clinical report is to classify what they find. Tute Genomics now offers an expert review process to help with that classification inside the application. This helps researchers find data that might be clinically relevant for their work, and it establishes a standard for categorizing the massive amount of information.
In short, what Mittelman described is that cloud-based solutions can solve some significant speed and cost problems in genomics. A clever value chain of analytics and data storage can increase the output of the medical research community and support continued research successes. Other companies have made similar progress in offering smart access to genomics data, but not in an open data fashion like his company, Mittelman says.
Is Open Source The Answer for The Future of Medical Research?
Open source and open data could heavily influence how genomic medical research is handled in the future. Open approaches are also enhanced by crossing different data streams and data sources, giving researchers more opportunities to spot patterns and make breakthrough findings through such collaboration.
In the interview, Mittelman also mentioned that interfacing with electronic medical records (EMRs) might be an option in the future. This is already a primary goal of the Electronic Medical Records and Genomics (eMERGE) Network, which started in 2007. eMERGE aims to develop, disseminate, and apply approaches that combine DNA biorepositories with electronic medical record systems to enable large-scale, high-throughput genetic research.
There are more old-school solutions out there, says Bartlett. But even Illumina has an open source software approach for its community of researchers and scientists, and Thermo Fisher has ties to open source via Thermo Fisher Scientific Open Biosystems, whose business model is inspired by the success of the open source movement in software.
Bartlett’s startup, Geneix; Tute Genomics; and a company called Seven Bridges Genomics stand out with innovative approaches to analyzing whole genome data, says Bartlett. Seven Bridges Genomics, for example, provides researchers with 400 open source applications for analyzing DNA and RNA data. Some initiatives are entirely open source. The Open Genomics Engine, for example, is an open source project for analyzing and interpreting high-throughput sequencing data; according to Open Health News, the framework is free, easy to install, and enables comparative analysis of cancer genomes.
Another example is the Personal Genome Project in the UK. The initiative aims to create public UK genome, health, and trait data and make it accessible for research. The approach is to invite participants to share their personal data for the greater good and for scientific discovery. The initiative also argues that the genome is just one part of the puzzle: genes interact with the environment to form traits. So the organization also welcomes other public data that participants donate to build public records of their health and traits, helping to drive scientific discovery.
3 Takeaways from this Article:
- Whether open source wins is not yet fully apparent to me. I highlighted a few examples suggesting that genomics-related medical research might go that way.
- With the rapid development of human genomics for medical research over the last decade, one of the key challenges is now the interpretation of the sequenced data, no longer the sequencing itself.
- From what I have seen so far, there are good arguments that we will see more innovative companies in the human genomics field concentrating on the interpretation, analysis, and visualization of data, rather than on the sequencing part.