The availability of high-quality genome, transcriptome and protein assemblies is crucial for many fields of study, including human health, disease diagnosis and prognosis, and crop improvement.
Genomes come in widely different sizes, ranging from viruses such as HIV, with a genome of 9.7 kbp, to humans with 3 Gbp, all the way up to Paris japonica, a plant with a genome of about 150 Gbp. Moreover, differences in ploidy and sex chromosome organization make genomes difficult to study in detail and to compare.
Where would you go to find a nucleotide or amino acid sequence of interest?
There are thousands of multi-omic databases, tools, and other resources freely accessible on the Internet. Over the last decade, advances in sequencing technologies have driven a phenomenal growth in sequence data. NCBI's GenBank in the U.S., EMBL in Europe, and the DDBJ in Japan are the standard databases that accept direct submissions of biological sequences from individual researchers, sequencing projects and patent applications across the globe.
The exponential increase in submissions to these repositories makes it harder to maintain accuracy and accessibility. Furthermore, there is an urgent need for methods that can visualise and search data sets of this size in a practical way.
Various efforts have been undertaken to build archives, databases and analysis tools for better understanding the genomes of different species. Although these resources have been useful and have solved many issues, the ever-growing amount and variety of data will exacerbate the challenges the research community is already facing.
For example, genome-wide association studies generate an enormous amount of data that could provide insights into the multifactorial genetic origins of disease, evolution, and more, but archiving these data and making them available to other researchers is still a challenge. Failing to address these issues leaves the data unusable, which may slow knowledge creation and delay breakthroughs in biology.
Lastly, the vast number of multi-omics databases, analysis tools and other resources available on the web makes it daunting for researchers to use them effectively and efficiently for hypothesis building and testing. Despite the availability of documentation and advanced training, researchers often struggle to locate these resources or to use the analysis tools without the help of trained specialists and bioinformaticians.
In spite of the challenges arising from the growth and subsequent maintenance of data and databases, we have a wealth of data available - unimaginable just a couple of decades ago - enabling new discoveries every day.
As a direct result of advancing technology over the past decade, sequencing costs have plummeted, which in turn has produced an avalanche of data. As technologies continue to evolve and be optimised, drug discovery, targeted therapies and crop improvement have accelerated.
In 2018 alone, over 40% of FDA-approved drugs had the potential to be personalised based solely on the available genomics data. This trend is likely to continue and is very unlikely to slow down anytime soon.
For the storage and integration of large omic datasets, cloud computing has emerged as a viable solution that offers users flexible access. Thanks to the inherent elasticity of cloud resources, research institutes and companies can scale their computational resources in line with the amount of data being generated. Moreover, hardware and software are pre-configured according to the user’s needs, which supports reproducibility.
Yet the price of cloud computing can be significant. Furthermore, even though cloud computing and hardware optimisations are widely used, omics data analysis still cannot keep pace with sequence data generation.
The multi-omics revolution is here! We need more robust tools and applications to harness the sequencing information being generated and to boost innovation in genetic research.
BioStrand came up with an innovative solution: a research companion capable of fast and in-depth comparisons across all the publicly available omics databases, with the possibility of combining them with the user's own data. The basis of this solution is HYFTs™, patented concepts representing anchor points in genetic code that have unique capabilities. HYFTs™ make it possible, for example, to combine textual data with biological sequence data.
The BioStrand technology is similar to facial recognition technology (FRT), in which multiple features are identified and translated into a unique code representing an individual signature. With an FRT-like approach to omics data analysis, nucleotide or amino acid sequences can be identified in an instant and translated into patterns, which can then be compared to other omics data to uncover similarities and differences between subjects and species, in mere seconds and with unprecedented accuracy.
Using HYFTs™, BioStrand transformed the current methodology in multi-omics analysis into an indexing problem, so that a Google-like approach ensures identification of exact-match sequences across vast omics data sets. This next-generation omics analysis technology provides real scalability. It is innovative and interactive, and it supports powerful downstream applications. By using indexing and exact matching, results can be delivered faster, with high accuracy and at a lower cost.
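To make the idea of treating sequence search as an indexing problem more concrete, here is a minimal Python sketch. It is not BioStrand's method: HYFTs are patented and their construction is not described here, so the sketch substitutes a generic k-mer inverted index, a well-known textbook technique. The function names, the toy sequences and the choice of k=4 are all assumptions made purely for illustration.

```python
from collections import defaultdict


def build_kmer_index(sequences, k):
    """Map every length-k substring to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for offset in range(len(seq) - k + 1):
            index[seq[offset:offset + k]].append((seq_id, offset))
    return index


def find_exact_matches(index, sequences, query, k):
    """Return sorted (sequence id, offset) pairs where the whole query occurs exactly."""
    if len(query) < k:
        raise ValueError("query must be at least k characters long")
    hits = []
    # Look up only the first k-mer of the query in the index,
    # then verify the rest of the query at each candidate position.
    for seq_id, offset in index.get(query[:k], []):
        if sequences[seq_id].startswith(query, offset):
            hits.append((seq_id, offset))
    return sorted(hits)


# Hypothetical toy "database" of two short nucleotide sequences.
toy_db = {
    "seqA": "ATGGCGTACGTTAGC",
    "seqB": "TTACGTTAGCATGGC",
}
kmer_index = build_kmer_index(toy_db, k=4)
matches = find_exact_matches(kmer_index, toy_db, "ACGTTAGC", k=4)
print(matches)
```

The index is built once and reused for every query, so each lookup only touches the positions that share the query's first k-mer instead of scanning every sequence. That is the general reason an index-based approach can scale better than repeated pairwise scanning, whatever the specific anchoring scheme.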
BioStrand’s Retrieve and Relate application provides a comprehensive report of your searches by combining different visual elements. The operations can be extended to genome mapping and comparison. Multiple matched sequences can be aligned to detect conserved regions, which are then extrapolated to protein structural annotations. The platform lets you adjust every step of your analysis, keeping it simple to use yet powerful for understanding biological significance.
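As a rough illustration of what detecting conserved regions in aligned sequences involves, the Python sketch below scans a set of already-aligned sequences column by column and reports the stretches where every sequence carries the same residue. It is a simplified, generic example and not the Retrieve and Relate implementation; the function name, the `min_length` threshold and the toy alignment are hypothetical.

```python
def conserved_regions(aligned, min_length=3):
    """Return half-open (start, end) column ranges where all aligned sequences agree."""
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")

    regions = []
    start = None  # first column of the conserved run currently being extended
    for col in range(length):
        column = {seq[col] for seq in aligned}
        # A column is conserved if every sequence has the same non-gap symbol.
        same = len(column) == 1 and "-" not in column
        if same and start is None:
            start = col
        elif not same and start is not None:
            if col - start >= min_length:
                regions.append((start, col))
            start = None
    # Close a run that extends to the end of the alignment.
    if start is not None and length - start >= min_length:
        regions.append((start, length))
    return regions


# Hypothetical toy alignment of three protein fragments ("-" marks a gap).
alignment = [
    "MKTAYIAK-QR",
    "MKTGYIAKLQR",
    "MKTAYIAKMQR",
]
regions = conserved_regions(alignment)
print(regions)
```

Requiring strict identity in every column is the simplest possible criterion. Real analyses usually score columns with substitution matrices or entropy measures, so this should be read as a starting point only.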
The upcoming advanced version of ‘Retrieve and Relate’ adds a more robust recommendation engine that enables sequence annotation alongside multi-domain comparison searches. Researchers will be able to identify the domains in a sequence and simultaneously correlate these domains with other known databases.
The knowledge acquired from this kind of omics research can be applied in different settings, including medicine, biotechnology and the social sciences. Examples include vaccine and drug discovery, the diagnosis and prognosis of rare diseases, and, in synthetic biology and bioengineering, the creation of partially synthetic species of bacteria.
Conservationists have used genomics data to evaluate key factors that might be involved in the conservation of species. Insights derived from these patterns could potentially aid the human species and enable us to thrive in the future.
Next-generation omics data analysis technology such as BioStrand's enables researchers to speed up the next breakthroughs in biology without unnecessary delays and with high precision and accuracy.