Today, the integrative computational analysis of multi-omics data has become a central tenet of the big data-driven approach to biological research. And yet, there is still a lack of gold standards when it comes to evaluating and classifying integration methodologies that can be broadly applied across multi-omics analysis.
More importantly, the lack of a cohesive or universal approach to big data integration is also creating new challenges in the development of novel computational approaches for multi-omics analysis.
One aspect of sequence search and comparison, however, has changed very little: a biological sequence in a predefined, accepted data format is still the primary input in most research. This approach is arguably valid in most real-world research scenarios.
Take machine learning (ML) models, for instance, which are increasingly playing a central role in the analysis of genomic big data. Biological data presents several unique challenges, such as missing values and precision variations across omics modalities, that simply expand the gamut of integration strategies required to address each specific challenge.
For example, omics datasets often contain missing values, which can hamper downstream integrative bioinformatics analyses. These incomplete datasets require an additional imputation step to infer the missing values before statistical analyses can be applied. Then there is the high-dimension, low-sample-size (HDLSS) problem, where variables significantly outnumber samples, leading ML algorithms to overfit these datasets and generalise poorly to new data.
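To make the two problems concrete, here is a minimal Python sketch, assuming a toy matrix with samples as rows and features as columns. Column-mean imputation is only one of many possible strategies and is used here purely for illustration; the function names `impute_column_means` and `is_hdlss` are hypothetical.

```python
from statistics import mean


def impute_column_means(rows):
    """Replace None entries with the mean of the observed values in that column."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Assumption: a column with no observed values falls back to 0.0.
        col_means.append(mean(observed) if observed else 0.0)
    return [
        [col_means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]


def is_hdlss(rows):
    """Return True when features (columns) outnumber samples (rows)."""
    return len(rows[0]) > len(rows)


# Toy matrix: 2 samples x 4 features, with missing values marked as None.
expression = [
    [1.0, None, 3.0, 4.0],
    [3.0, 2.0, None, 8.0],
]
imputed = impute_column_means(expression)
print(imputed)
print("HDLSS:", is_hdlss(imputed))
```

Real omics matrices usually call for more careful imputation than a column mean, but even this toy case shows why the step has to happen before any statistical model sees the data.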
In addition, there are multiple challenges inherent to all biological data, irrespective of analytical methodology or framework. To start with, there is the sheer heterogeneity of omics data: datasets originate from a range of data modalities and have completely different data types and distributions, each of which has to be handled appropriately.
Integrating heterogeneous multi-omics data presents a cascade of challenges involving the unique data scaling, normalisation, and transformation requirements of each individual dataset. Any effective integration strategy will also have to account for the regulatory relationships between datasets from different omics layers in order to accurately and holistically reflect the nature of this multidimensional data.
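As a rough illustration of per-dataset scaling, the sketch below standardises each feature of each dataset independently (z-scores), so that layers measured on very different scales become comparable. It assumes samples as rows and features as columns; the helper name `zscore_columns` and the toy values are hypothetical, and real pipelines often need modality-specific transformations first.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardise every column to mean 0 and (population) standard deviation 1."""
    n_cols = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_cols)]
    mus = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]
    # Assumption: a constant column (zero spread) is mapped to 0.0.
    return [
        [(row[j] - mus[j]) / sds[j] if sds[j] > 0 else 0.0 for j in range(n_cols)]
        for row in rows
    ]


# Two toy datasets on very different scales, scaled independently.
rna_counts = [[10.0, 200.0], [30.0, 400.0]]
methylation = [[0.25], [0.75]]

rna_scaled = zscore_columns(rna_counts)
methylation_scaled = zscore_columns(methylation)
print(rna_scaled)
print(methylation_scaled)
```

After scaling, both datasets contribute values of the same order of magnitude, which keeps one layer from dominating the other purely because of its units.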
Furthermore, there is the issue of integrating omics and non-omics (OnO) data, such as clinical, epidemiological or imaging data, in order to enhance analytical productivity and to access richer insights into biological events and processes. Currently, the large-scale integration of non-omics data with high-throughput omics data is extremely limited due to a range of factors, including heterogeneity and the presence of subphenotypes.
The crux of the matter is that without effective and efficient data integration, multi-omics analysis will only tend to become more complex and resource-intensive without any proportional or even significant augmentation in productivity, performance, or insight generation.
Early approaches to multi-omics analysis involved the independent analysis of different data modalities and combining results for a quasi-integrated view of molecular interactions. But the field has evolved significantly since then into a broad range of novel, predominantly algorithmic meta-analysis frameworks and methodologies for the integrated analysis of multi-dimensional multi-omics data.
It is therefore essential to understand the fundamental conceptual principles, rather than any specific method or framework, that define multi-omics data integration.
Multi-omics datasets are broadly organised as horizontal or vertical, corresponding to the complexity and heterogeneity of multi-omics data.
Horizontal datasets are typically generated from one or two technologies, for a specific research question and from a diverse population, and represent a high degree of real-world biological and technical heterogeneity. Horizontal or homogeneous data integration, therefore, involves combining data from across different studies, cohorts or labs that measure the same omics entities.
Vertical data refers to data generated using multiple technologies, probing different aspects of the research question, and traversing the possible range of omics variables including the genome, metabolome, transcriptome, epigenome, proteome, microbiome, etc. Vertical, or heterogeneous, data integration involves multi-cohort datasets from different omics levels, measured using different technologies and platforms.
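The difference between the two can be sketched in a few lines of Python. This is a simplified illustration, assuming each dataset is a dictionary that maps a sample ID to its feature values; the function names, sample IDs and values are hypothetical. Horizontal integration stacks samples from different cohorts over the features they share, while vertical integration joins different omics layers on the samples they share.

```python
def integrate_horizontal(*cohorts):
    """Stack samples from several cohorts, keeping only features measured in all."""
    all_samples = [features for cohort in cohorts for features in cohort.values()]
    shared = sorted(set.intersection(*(set(f) for f in all_samples)))
    merged = {}
    for cohort in cohorts:
        # Assumption: sample IDs are unique across cohorts.
        for sample_id, features in cohort.items():
            merged[sample_id] = {name: features[name] for name in shared}
    return merged


def integrate_vertical(**layers):
    """Join omics layers on the sample IDs present in every layer."""
    shared_samples = sorted(set.intersection(*(set(layer) for layer in layers.values())))
    merged = {}
    for sample_id in shared_samples:
        row = {}
        for layer_name, layer in layers.items():
            for feature, value in layer[sample_id].items():
                # Prefix features with the layer name to avoid name clashes.
                row[f"{layer_name}:{feature}"] = value
        merged[sample_id] = row
    return merged


# Horizontal: same omics entity (gene expression), different cohorts.
cohort_a = {"a1": {"TP53": 5.1, "BRCA1": 2.0}, "a2": {"TP53": 4.8, "BRCA1": 2.4}}
cohort_b = {"b1": {"TP53": 6.0, "BRCA1": 1.9, "MYC": 7.2}}
horizontal = integrate_horizontal(cohort_a, cohort_b)

# Vertical: same samples, different omics layers.
transcriptome = {"s1": {"TP53": 5.1}, "s2": {"TP53": 4.8}}
proteome = {"s1": {"p53": 0.7}, "s3": {"p53": 0.4}}
vertical = integrate_vertical(rna=transcriptome, protein=proteome)

print(horizontal)
print(vertical)
```

Note how the two operations work along different axes of the data: one grows the number of samples, the other grows the number of features per sample.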
Vertical integration techniques cannot be applied to horizontal integrative analysis, and vice versa. This opens up an opportunity for conceptual innovation: data integration techniques that enable the integrative analysis of both horizontal and vertical multi-omics datasets.
Of course, each of these broad categories can be further broken down into a range of approaches based on utility and efficiency.
A 2021 mini-review of general approaches to vertical data integration for ML analysis defined five distinct integration strategies – Early, Mixed, Intermediate, Late and Hierarchical – based not just on the underlying mathematics but on a variety of factors including how they were applied.
Here’s a quick rundown of each approach.
Early integration is a simple and easy-to-implement approach that concatenates all omics datasets into a single large matrix. This increases the number of variables without altering the number of observations, which results in a complex, noisy and high-dimensional matrix that ignores differences in dataset size and data distribution.
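In code, early integration amounts to little more than a column-wise concatenation. The following is a minimal sketch, assuming every block has the same samples in the same row order; the function name `early_integration` and the toy values are hypothetical.

```python
def early_integration(*blocks):
    """Concatenate omics blocks feature-wise; rows must be the same samples in the same order."""
    n_samples = len(blocks[0])
    if any(len(block) != n_samples for block in blocks):
        raise ValueError("all blocks must contain the same number of samples")
    return [
        [value for block in blocks for value in block[i]]
        for i in range(n_samples)
    ]


# Toy blocks: 3 samples each, with 4 and 2 features respectively.
transcriptomics = [
    [5.1, 2.0, 7.2, 0.3],
    [4.8, 2.4, 6.9, 0.1],
    [6.0, 1.9, 7.5, 0.4],
]
methylation = [
    [0.25, 0.80],
    [0.30, 0.75],
    [0.20, 0.85],
]

combined = early_integration(transcriptomics, methylation)
print(len(combined), "samples x", len(combined[0]), "features")
```

The number of samples stays at 3 while the feature count grows to 6, and the larger block contributes twice as many columns as the smaller one, which is exactly the imbalance described above.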
Mixed integration addresses the limitations of the early model by separately transforming each omics dataset into a new representation and then combining them for analysis. This approach reduces noise, dimensionality, and dataset heterogeneities.
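One way to picture mixed integration is with kernels: each omics block is first turned into a sample-by-sample similarity matrix, and only those matrices are combined. The sketch below is an illustrative assumption rather than the method of any particular tool; it uses a cosine-normalised linear kernel and a plain average, and the function names are hypothetical.

```python
import math


def linear_kernel(rows):
    """Sample-by-sample similarity matrix (dot products between rows)."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]


def cosine_normalise(kernel):
    """Rescale a kernel so that every sample has self-similarity 1."""
    # Assumption: no sample is an all-zero row (which would divide by zero).
    norms = [math.sqrt(kernel[i][i]) for i in range(len(kernel))]
    n = len(kernel)
    return [[kernel[i][j] / (norms[i] * norms[j]) for j in range(n)] for i in range(n)]


def mixed_integration(*blocks):
    """Transform each block into its own kernel, then average the kernels."""
    kernels = [cosine_normalise(linear_kernel(block)) for block in blocks]
    n = len(kernels[0])
    return [
        [sum(k[i][j] for k in kernels) / len(kernels) for j in range(n)]
        for i in range(n)
    ]


# Toy blocks with different numbers of features but the same 3 samples.
rna = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [2.0, 0.0, 4.0]]
protein = [[1.0, 1.0], [1.0, 0.0], [2.0, 2.0]]

similarity = mixed_integration(rna, protein)
for row in similarity:
    print([round(value, 3) for value in row])
```

Because every block becomes a 3 x 3 matrix regardless of how many features it started with, differences in dataset size no longer dominate the combined representation.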
Intermediate integration simultaneously integrates multi-omics datasets to output multiple representations, one common and some omics-specific. However, this approach often requires robust pre-processing due to potential problems arising from data heterogeneity.
Late integration circumvents the challenges of assembling different types of omics datasets by analysing each omics dataset separately and combining the final predictions. Because each omics layer is modelled in isolation, this approach does not capture inter-omics interactions.
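A minimal sketch of the combining step, assuming each single-omics model has already produced a probability per sample for the same binary outcome. The averaging rule, the optional weights and the 0.5 threshold are illustrative choices, and the predicted values are made up for the example.

```python
def late_integration(per_omics_probabilities, weights=None):
    """Combine per-sample probabilities from separate single-omics models by weighted average."""
    names = list(per_omics_probabilities)
    if weights is None:
        # Assumption: every omics model is trusted equally by default.
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    n_samples = len(per_omics_probabilities[names[0]])
    return [
        sum(weights[name] * per_omics_probabilities[name][i] for name in names) / total
        for i in range(n_samples)
    ]


# Hypothetical outputs of three independently trained single-omics models.
predictions = {
    "transcriptome": [0.9, 0.2, 0.6],
    "proteome": [0.7, 0.4, 0.2],
    "methylome": [0.8, 0.3, 0.1],
}

combined = late_integration(predictions)
labels = [probability >= 0.5 for probability in combined]
print([round(p, 3) for p in combined])
print(labels)
```

Nothing in this step ever sees features from two layers at once, which is why interactions between omics layers are invisible to a late-integration model.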
Hierarchical integration focuses on the inclusion of prior regulatory relationships between different omics layers so that analysis can reveal the interactions across layers. Though this strategy truly embodies the intent of trans-omics analysis, this is still a nascent field with many hierarchical methods focusing on specific omics types, thereby making them less generalisable.
The sheer number of conceptual approaches to multi-omics data integration, each with its own scope and limitations in terms of throughput, performance, and accuracy, represents one of the biggest bottlenecks to downstream analysis and biological innovation.
Researchers often spend more time mired in the tedium of data munging and wrangling than they do extracting knowledge and novel insights. Most conventional approaches to data integration, moreover, seem to involve some form of compromise involving either the integrity of high-throughput multi-omics data or achieving true trans-omics analysis.
There has to be a new approach to multi-omics data integration that can 1) enable the one-click integration of all omics and non-omics data, and 2) preserve biological consistency, in terms of correlations and associations across different regulatory datasets, for integrative multi-omics analysis in the process.
At BioStrand we took a lateral approach to the challenge of biological data integration. Rather than start with a technological framework that could be customised for the complexity and heterogeneity of multi-omics data, we set out to decode the atomic units of all biological information that we call HYFTs™.
HYFTs™ are essentially the building blocks of biological information, which means that they enable the tokenisation of all biological data, irrespective of species, structure, or function, to a common omics data language.
We then built the framework to identify, collate, and index HYFTs™ from sequence data. This enabled us to create a proprietary pangenomic knowledge database of over 660 million HYFTs™, each containing comprehensive information about variation, mutation, structure, etc., from over 450 million sequences available across 12 popular public databases.
With the BioStrand platform, researchers and bioinformaticians have instant access to all the data from some of the most widely used omics data sources. Plus, our unique HYFTs™ framework gives researchers the convenience of one-click normalisation and integration of all their proprietary omics data and metadata.
Based on our biological discovery, we were able to normalise and integrate all publicly available omics data, including patent data, at scale, and render them multi-omics analysis-ready. The same HYFT™ IP can also be applied to normalise and integrate proprietary omics data.
The transversal language of HYFTs™ enables the instant normalisation and integration of multi-omics research-relevant data and metadata into a single source of truth. With the BioStrand approach to multi-omics data integration, it no longer matters whether research data is horizontal or vertical, homogeneous or heterogeneous, text or sequence, omics or non-omics.
If it is data that is relevant to your research, BioStrand enables you to integrate it with just one click.