Prior to 2020, the fastest vaccine development program had taken four years from viral sampling to approval, a record set by the mumps vaccine in the 1960s. That record changed dramatically in 2020, when the first vaccine candidates against COVID-19 entered clinical trials barely two months after the causative coronavirus had first been identified.
This unprecedented achievement was the result of a sustained, orchestrated, and multifaceted effort by a global community of researchers from a wide range of disciplines, including medicine, biology, epidemiology, immunology, public health, bioinformatics, biostatistics, computer science, and more.
This successful global collaboration has also triggered a phase shift in research practices that has transformed multiple aspects of computational biology, data science and public health.
Three key levels showcase how the coordinated global response to the pandemic transformed conventional research processes and practices: the biological, the medical, and the public-health level.
At the biological level, high-throughput next-generation sequencing (NGS) technologies provided the technical infrastructure to trace the origins and evolution of SARS-CoV-2, reconstruct its genomic sequence, and create voluminous databases of SARS-CoV-2 genomes and variants. In terms of bioinformatics tools and pipelines (the key to translating NGS data into knowledge and actionable insights), existing solutions were adapted to this new paradigm, or new tools were developed specifically to accelerate the detection of SARS-CoV-2, the analysis of sequencing data, and the discovery of potential drug targets.
Comprehensive pathway enrichment analysis workflows also emerged, designed to find possible targets in biological pathways. At the same time, virus database repositories launched a variety of SARS-CoV-2-specific data services that enhanced the accessibility of viral sequence data and improved its integration with clinical and genotype data.
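At the heart of most pathway enrichment analysis workflows is an over-representation test: given a list of genes of interest, how unlikely is their overlap with a known pathway by chance alone? A minimal sketch of the one-sided hypergeometric test commonly used for this, with hypothetical gene symbols and an assumed 20,000-gene background:

```python
from math import comb

def enrichment_p(pathway, hits, background_size):
    """One-sided hypergeometric p-value: probability of observing at
    least this much overlap between the hit list and the pathway."""
    k = len(pathway & hits)           # observed overlap
    K, n = len(pathway), len(hits)    # pathway size, hit-list size
    total = comb(background_size, n)
    return sum(comb(K, i) * comb(background_size - K, n - i)
               for i in range(k, min(K, n) + 1)) / total

# Toy example; gene names here are illustrative, not a curated pathway.
pathway = {"ACE2", "TMPRSS2", "IL6", "TNF"}
hits = {"ACE2", "TMPRSS2", "IL6", "STAT1"}
p = enrichment_p(pathway, hits, 20000)
```

In a real workflow this test is repeated over thousands of pathways, so the resulting p-values must also be corrected for multiple testing.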
Around this core effort, an entire online ecosystem emerged, offering various types of resources that could accelerate research, including a curated list of Virus Bioinformatics Tools targeted at coronaviruses, high-quality SARS-CoV-2 genome sequence data and metadata, and other COVID-related services.
A similar transformation was also unfolding at the medical level with dedicated tools and resources to, for instance, identify biomarkers and drug targets in COVID-19 or repurpose existing drugs. The focus on repurposing existing FDA-approved drugs was especially critical given that conventional drug development processes are not designed to account for the time sensitivity of a global pandemic.
Researchers leveraged the full potential of technologies like CADD (computer-aided drug design) to accelerate the identification of potential drug candidates. Novel AI-based bioinformatics systems have also been proposed for their ability to interrogate existing drugs and identify potential therapeutic agents that could be repurposed for COVID-19.
And finally, at the public-health level, epidemiology, biostatistics, survey science, and clinical trials research, working in collaboration with virology, immunology, and infectiology, made significant contributions to understanding virus dynamics and the immunological response, to developing diagnostic tools, and to gauging the epidemiological impact and societal side effects of the pandemic.
All this scientific productivity has understandably generated a trove of COVID-19 literature, with an estimated 50,000 papers published since the pandemic began and hundreds more added each day. To streamline the search, discovery, visualization, and summarization of this literature, several automated text mining applications are now available that enable the global research community to stay abreast of developments in the fight against the pandemic.
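The core of such text mining tools is retrieval: scoring each document against a query so the most relevant papers surface first. A minimal sketch of TF-IDF ranking, the simplest version of this idea, over hypothetical abstracts (real systems add tokenization, stemming, and far richer models):

```python
import math
from collections import Counter

def rank_abstracts(query, abstracts):
    """Return document indices sorted by a simple TF-IDF score for the query."""
    docs = [a.lower().split() for a in abstracts]
    n = len(docs)
    df = Counter()                    # document frequency of each term
    for d in docs:
        df.update(set(d))
    terms = query.lower().split()
    def score(d):
        tf = Counter(d)               # term frequency within one document
        return sum(tf[t] * math.log(n / df[t]) for t in terms if df[t])
    return sorted(range(n), key=lambda i: score(docs[i]), reverse=True)

# Toy corpus; the abstracts are invented for illustration.
abstracts = [
    "spike protein binds the ace2 receptor",
    "vaccine trial results were reported today",
    "spike mutation alters spike conformation",
]
order = rank_abstracts("spike", abstracts)
```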
The global response to COVID-19 has been mounted on a strong foundation of international, cross-functional collaboration, improved and innovative research tools and workflows, and shared access to valuable data. This is a remarkable achievement that is still in progress and has the potential to positively influence the future trajectory of medical research and computational biology. However, the crisis has also exposed a few challenges that still need to be addressed as we move forward as a global community.
The rapid development and broadening applications of NGS technologies have resulted in the exponential growth of publicly available datasets. However, even a simple task such as using reference sequence databases to classify sample data and identify possible origins often forces a restrictive choice between reference datasets that are carefully curated but limited, and those that are inclusive but messy.
Neither option supports efficient and productive research. The ideal approach, therefore, would be a centralised data repository that provides curated, comprehensive, and regularly updated resources covering all publicly available data, enabling quick, efficient, accurate, and productive analysis.
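To make the classification task concrete: a common approach matches the k-mers of each read against an index built from the reference database, so the quality of the result depends directly on the references available. A minimal sketch of this idea, with invented reference labels and deliberately short toy sequences (production tools such as Kraken use large exact-match indexes and taxonomy-aware tie-breaking):

```python
from collections import Counter

def kmers(seq, k):
    """All substrings of length k in the sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_index(references, k=4):
    """Map each k-mer to the set of reference labels containing it."""
    index = {}
    for label, seq in references.items():
        for km in kmers(seq, k):
            index.setdefault(km, set()).add(label)
    return index

def classify(read, index, k=4):
    """Assign a read to the reference sharing the most k-mers, or None."""
    votes = Counter()
    for km in kmers(read, k):
        for label in index.get(km, ()):
            votes[label] += 1
    return votes.most_common(1)[0][0] if votes else None
```

A read drawn from a genome absent from the index simply goes unclassified, which is precisely why a curated-but-limited reference set hurts discovery.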
Read more: Making All Omics Data Computable
The utility of the exponential volumes of data generated by NGS technologies is limited only by the capabilities of the downstream bioinformatics solutions used to convert data into knowledge. In the context of virus discovery, the ideal bioinformatics pipeline would automatically analyse NGS data and identify reads related to viruses.
Unfortunately, as an article in the March 2021 issue of Briefings in Bioinformatics phrased it, there is no fully integrated bioinformatics pipeline available today. Instead, virus discovery workflows consist of a series of discrete steps to ensure data quality, remove host/rRNA reads, assemble the remaining reads, classify them taxonomically, and verify the assembled virus genome.
A fully automated and integrated ingestion-to-insight pipeline would be key to ironing out the inefficiencies of the current step-by-step approach to virus discovery.
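The discrete steps above can be sketched as a single chained pipeline. The step functions below are stand-in lambdas, not real tools; a production pipeline would wrap an actual trimmer, host aligner, assembler, and classifier behind the same interface:

```python
# Sketch of chaining discrete virus-discovery steps into one pipeline.
def run_pipeline(reads, steps):
    """Pass the read set through each named step in order, logging counts."""
    for name, step in steps:
        reads = step(reads)
        print(f"{name}: {len(reads)} reads remain")
    return reads

steps = [
    # Drop reads containing ambiguous bases (stand-in for QC trimming).
    ("quality_filter", lambda rs: [r for r in rs if "N" not in r]),
    # Drop reads flagged as host-derived (stand-in for host/rRNA removal).
    ("host_removal", lambda rs: [r for r in rs if not r.startswith("HOST:")]),
]
```

Because every step consumes and produces the same data shape, adding assembly, taxonomic classification, and genome verification stages is just a matter of appending to the list, which is the essence of the ingestion-to-insight integration argued for above.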
And finally, most conventional solutions are not designed to scale to the volume of data embedded in full-length SARS-CoV-2 genome sequences. As a result, the dataset has to be divided into multiple sub-datasets, aligned independently, and then recombined. What is needed now are modern cloud- and container-based bioinformatics solutions that can scale across multiple dimensions, including data volume, and thereby accelerate COVID-19 research.
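The split-align-merge workaround described here can be sketched in a few lines. The aligner below is a stand-in that merely upper-cases its input; in practice each chunk would be handed to a real alignment tool, and the workers would be containers scaled out across a cluster rather than local threads:

```python
from concurrent.futures import ThreadPoolExecutor

def align_chunk(chunk):
    """Stand-in for running a real aligner on one sub-dataset."""
    return [seq.upper() for seq in chunk]

def split_align_merge(sequences, chunk_size, workers=4):
    """Split a large dataset, process the chunks in parallel, and merge."""
    chunks = [sequences[i:i + chunk_size]
              for i in range(0, len(sequences), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        aligned = pool.map(align_chunk, chunks)   # preserves chunk order
    return [seq for chunk in aligned for seq in chunk]
```

The bookkeeping this requires, choosing chunk boundaries, tracking partial results, and re-merging, is exactly the overhead that natively scalable, cloud-based solutions would eliminate.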
The global research community – and specifically the medical, biological, and public health research sciences – has been able to successfully design and deliver a response to the overwhelming challenge presented by the coronavirus pandemic.
The challenge is ongoing, but the lessons learned from this experience can be used to create the collaborative, data-driven and computationally sophisticated systems and infrastructure required not only to tackle the next challenge but also to augment scientific research in general.