On DNA Day, an ode to genomics
The 25th of April was first designated as DNA Day in 2003, the year that marked two significant milestones in the genomics revolution — the 50th anniversary of the discovery of the double helix structure of DNA and the official completion of the Human Genome Project.
The discovery of DNA's structure revolutionized medicine and genetic research and formed the basis for modern biotechnology. The Human Genome Project, the first successful example of “big science” in biomedical research, paved the way to a new era of high-throughput computational biology and opened the doors to the next generation of bioinformaticians and computational biologists. However, the technological limitations of the time meant that the project could only produce an essentially complete sequence, one that covered 92% of the human genome and contained 400 gaps of unknown sequence. It would take another 19 years, until 2022, before computational methods and tools could finally deliver a truly complete, gapless human genome. This next-generation human genome sequence is expected to have a significant impact on personalized medicine, population genome analysis, and genome editing.
So as we approach the 20th anniversary of the advent of big science genomics, here’s a quick look at the current state of computational genomics.
Data Analytics in Genomics
If the Human Genome Project laid the big science foundation for genomics research, the field has since expanded significantly into the multimodal interdisciplinary discipline of functional multiomics. Today a range of omics technologies, including transcriptomics, proteomics, epigenomics, and metabolomics, generates a variety of data types pertaining to distinct levels of complex biological systems. This space continues to expand with emerging technologies, such as single-cell sequencing and spatial analysis, affording new insights into hitherto unexplored layers of biology.
The integrated analysis of these multiomics data types is now the norm in biomedical research and has been critical to understanding the relationships between different omics levels and to generating a holistic, system-level perspective of biological networks. Modern biomedical research has evolved into truly big data science, with advanced data analytics as the key driver across numerous research areas.
Today genomics research data is being generated at an exponential pace, with an estimated 2 to 40 exabytes (billion gigabytes) produced every year. By 2025, healthcare data is expected to double every 73 days, a compound annual growth rate that eclipses other data-intensive sectors like media & entertainment, financial services, and manufacturing.
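To put that doubling-time figure in perspective, here is the quick arithmetic it implies (an illustrative back-of-the-envelope calculation, not from any cited source): data that doubles every 73 days doubles five times a year, a roughly 32-fold annual increase.

```python
# Back-of-the-envelope: what "doubling every 73 days" means annually.
doubling_period_days = 73
doublings_per_year = 365 / doubling_period_days   # 5 doublings per year
annual_growth_factor = 2 ** doublings_per_year    # 2^5 = 32-fold growth

print(doublings_per_year)     # 5.0
print(annual_growth_factor)   # 32.0
```

By comparison, a sector growing at even 40% per year multiplies by only about 1.4 annually, which is why the 73-day figure dwarfs other data-intensive industries.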
With data volumes growing far faster than traditional data analytics methods can cope with, biomedical research has transitioned to new computational frameworks that can turn raw sequence data into actionable intelligence.
ML/AI algorithms in genomics
The phenomenal increase in the volume and variety of complex, high-dimensional omics data has necessitated the use of AI technologies like machine learning and deep learning in genomics research. In fact, the evolving capabilities of these technologies have helped extend the data landscape of biomedical research beyond just omics datasets to include a diverse range of structured and unstructured data sources, including bioimaging/medical imaging, clinical records, and biomedical text. It is this ability to integratively analyze high-throughput sequence data with relevant non-omics structured and unstructured data that has delivered a new impetus to the practice of precision medicine and precision drug development.
The field of intelligent bioinformatics is now entering the deep genomics era, with advances in deep learning showing the potential to outperform even state-of-the-art methodologies on specific genomics tasks. This approach is also powering new applications in genome functional annotation, the prediction of 3D genome features, and the generation of novel artificial genomes.
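Before a deep learning model can work on genomic sequence, the raw string of bases has to become numbers. A common first step, sketched below in plain Python (an illustrative example, not the method of any particular tool mentioned here), is one-hot encoding: each base maps to a four-element vector, turning a sequence into a numeric matrix a neural network can consume.

```python
def one_hot_encode(seq):
    """Encode a DNA string as a list of 4-element vectors in A, C, G, T order."""
    mapping = {
        "A": [1, 0, 0, 0],
        "C": [0, 1, 0, 0],
        "G": [0, 0, 1, 0],
        "T": [0, 0, 0, 1],
    }
    # Ambiguous bases (e.g. 'N' in gap regions) become the all-zero vector.
    return [mapping.get(base, [0, 0, 0, 0]) for base in seq.upper()]

matrix = one_hot_encode("ACGTN")
print(matrix)
# [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0]]
```

In practice such matrices are stacked into tensors and fed to convolutional or transformer models, which learn sequence motifs directly from the encoding rather than from hand-crafted features.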
Moving genomics forward
The discovery of the DNA double helix and the human reference genome made possible by the Human Genome Project have undoubtedly opened up a range of new opportunities to explore and understand biological processes. However, the genomics revolution is only just beginning.
Over the past two decades, the human reference genome has been the open-access keystone of genomics research. However, a more diverse set of genomic samples will be critical to realizing the potential of precision drug discovery and personalized medicine. As a result, the reference genome is currently in the middle of a major update. The Human Pangenome Reference Consortium has set out “to sequence and assemble genomes from individuals from diverse populations in order to better represent the genomic landscape of diverse human populations.” Their goal is to construct the highest-possible quality human pangenome reference that will serve as a more accurate and representative genetic resource for biomedical research and precision medicine, and lay the foundation for more equitable genomics research.
The second critical factor for a more productive and inclusive approach to genomics research is the open sharing of scientific data, a key principle espoused by the Human Genome Project. To this end, the Global Biodata Coalition (GBC) has created a forum to stabilize and sustain crucial biodata resources worldwide. The coalition currently comprises 37 open-access Global Core Biodata Resources: biological, life science, and biomedical databases that are crucial to sustaining the broader biodata infrastructure.
And finally, even as AI/ML technologies become pervasive in predictive genomics data analysis, there is a vital need for explainable AI (xAI) and interpretable machine learning (iML), so that researchers can understand the reasoning behind a model's predictions rather than accept them at face value.
Happy DNA Day!