The metadata challenge in biological research
One of the key highlights of the global response to COVID-19 has been the importance of effective, ethical, and equitable data sharing and how it can exponentially accelerate outbreak research. However, data sharing cannot just be an incident response strategy.
With the volume of biological data increasing exponentially year on year, the public sharing of experimental multi-omics data needs to become part of the culture of biological and life sciences research. The ability to assemble data from across domains and disciplines will pave the way for more integrated multi-omics and cross-disciplinary research and expand the potential for more sophisticated insights into biological systems.
There are already concerted efforts within the industry to evolve towards an open data paradigm in biological research. However, the public availability of data will not automatically translate into enhanced value in terms of insights and knowledge. It has to be accompanied by a radical rethink of how data is generated, stored, shared and accessed. And, equally importantly, data management frameworks will have to account for the value of metadata.
The challenges of metadata
Metadata – essentially data that describes data – provides context and provenance to raw data and is crucial to both data discovery and validation. For instance, metadata can describe a sample in terms of the biomaterial it was derived from, how the sample was handled, and the processes used for sample purification, profiling, and quantification. It can also provide detailed information on the experimental set-up and procedure.
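To make this concrete, here is a minimal sketch of what a structured sample metadata record could look like. The class name, field names, and example values are all hypothetical, chosen only to mirror the categories listed above; they do not follow any particular metadata standard.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class SampleMetadata:
    """Hypothetical sample record; field names are illustrative, not a standard."""
    sample_id: str
    biomaterial: str      # what the sample was derived from
    handling: str         # how the sample was handled
    purification: str     # process used for sample purification
    profiling: str        # process used for profiling
    quantification: str   # process used for quantification
    experimental_setup: dict = field(default_factory=dict)


# Made-up values, for illustration only.
record = SampleMetadata(
    sample_id="S001",
    biomaterial="whole blood",
    handling="stored at -80 C",
    purification="column-based RNA extraction",
    profiling="RNA-seq",
    quantification="read counts per gene",
    experimental_setup={"replicates": 3},
)

# A plain dictionary is easy to serialise and share alongside the raw data.
record_dict = asdict(record)
```

Keeping each category in its own named field, rather than in free text, is what allows both humans and machines to search and validate the record later.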
Integrating data from multiple analyses and experiments enables high-level research that can address more complex questions in the life sciences.
However, several case studies indicate that even current efforts to prioritise open data have not been able to catalyse open analysis at a proportionate scale. There are several reasons for this lag. A key challenge in this context is the fact that many research groups have adopted community-specific conventions that are not easily scalable for multidisciplinary research.
Many researchers still use no formal reporting conventions or completely exclude the metadata critical to the interpretation and reuse of data. Often, the metadata included with open datasets is incomplete and/or poorly annotated.
When it comes to data integration, conventional methods fall into one of two categories – multi-staged or meta-dimensional analysis. Though meta-dimensional analysis is capable of incorporating all data into a comprehensive metadata matrix, combining data from different datasets remains a significant challenge.
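As a rough illustration of the meta-dimensional idea, the sketch below uses concatenation, one simple way of combining several data layers into a single matrix. The function name, layer names, and values are hypothetical, and real pipelines would also need normalisation and careful handling of missing data.

```python
def concatenate_layers(layers):
    """Combine omics layers into one sample-by-feature matrix.

    `layers` maps a layer name to {sample_id: {feature: value}}.
    Only samples present in every layer are kept.
    """
    if not layers:
        return [], [], []
    layer_names = sorted(layers)
    # Samples shared by all layers, in a stable order.
    samples = sorted(set.intersection(*(set(layers[n]) for n in layer_names)))
    # Every feature seen in each layer, in a stable order.
    features = {
        n: sorted({feat for row in layers[n].values() for feat in row})
        for n in layer_names
    }
    # Prefix features with the layer name so columns stay unambiguous.
    columns = [f"{n}:{feat}" for n in layer_names for feat in features[n]]
    # Features missing for a given sample are recorded as None.
    matrix = [
        [layers[n][s].get(feat) for n in layer_names for feat in features[n]]
        for s in samples
    ]
    return samples, columns, matrix


# Made-up values, for illustration only.
layers = {
    "transcriptomics": {
        "S1": {"geneA": 2.0, "geneB": 0.5},
        "S2": {"geneA": 1.0, "geneB": 1.5},
        "S3": {"geneA": 3.0, "geneB": 2.5},
    },
    "proteomics": {
        "S1": {"protX": 10.0},
        "S2": {"protX": 12.0},
    },
}
samples, columns, matrix = concatenate_layers(layers)
```

Note that the join only works because both layers use the same sample identifiers. Sample S3 is dropped because it has no proteomics record, which is exactly the kind of loss that inconsistent or incomplete metadata causes at scale.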
Data integration is further complicated by the lack of user-friendly tools for researchers with limited bioinformatics, biostatistics, and programming expertise.
Open data without metadata adds no discernible new value to the research process. Though the biomedical community is making a concerted effort to share omics data, there is still a lack of consistency among researchers in ensuring that this data is backed by complete, annotated and usable metadata. So, if open analysis is to catch up with open data, there needs to be a universal standard for reporting and sharing data together with its metadata.
Standardising biological metadata
Though metadata has been acknowledged as a key component of research infrastructure design, there is still no universal standard for reporting and sharing metadata. Instead, there have been numerous initiatives for the development of hundreds of metadata standards with diverse characteristics.
However, there has been a conceptual consensus on the three types of metadata standards – descriptive, administrative, and structural. There is also general agreement that metadata is key to supporting FAIR principles in order to overcome obstacles to data discovery and reuse for both humans and machines.
As a result, there has been some change in the profile of public data, with many repositories showing some degree of “FAIRness” and several new projects emerging with FAIR as a central objective. In addition, many scientific journals are now urging researchers to make data shareable and public even as they endorse public data repositories that implement FAIR principles.
Notwithstanding that, much of the data in public repositories is far from being perfectly FAIR. One limited study of engineered nanomaterial databases found that even though a majority met FAIR criteria, one of the potential areas of improvement was the use of a standard schema for metadata. Another study, which evaluated the completeness of metadata for nine clinical phenotypes in public omics data, reported large variability in both the number and consistency of reported clinical phenotypes.
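A completeness check of this kind can be sketched in a few lines. The example below is a simplified illustration, not the method used in the studies mentioned above; the function name, the phenotype fields, the placeholder values treated as missing, and the records are all hypothetical.

```python
# Values treated as "not reported"; this list is an assumption.
MISSING = (None, "", "NA")


def completeness(records, required_fields):
    """Return, per field, the fraction of records with a usable value."""
    total = len(records)
    result = {}
    for name in required_fields:
        filled = sum(1 for rec in records if rec.get(name) not in MISSING)
        result[name] = filled / total if total else 0.0
    return result


# Made-up records, for illustration only.
records = [
    {"age": 54, "sex": "F"},
    {"age": None, "sex": "M"},
    {"sex": ""},
    {"age": 61, "sex": "F"},
]
scores = completeness(records, ["age", "sex"])
```

Here `scores` reports that age is available for half of the records and sex for three quarters, which gives a quick, comparable measure of how much of a dataset is actually reusable.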
Even coordinated efforts, like MIAME, designed to encourage metadata sharing have had a limited impact given the fact that they define the content but not the format for this information.
The creation of a unified framework for metadata continues to be a significant challenge, with the public data landscape still characterised by diverse databases and standards that require users to devise and manage compatibility themselves. It is therefore going to require a monumental and orchestrated effort to ensure data and metadata quality adherence across the universe of public data repositories.
The primary reason for this is that genomic data organisation is essentially a fraught endeavour. For instance, files come in multiple formats whose semantics differ too widely to fit neatly into a predefined universal framework. More importantly, there is no commonly accepted standard for a general yet basic data unit that can represent the heterogeneous and multi-dimensional data assets that are central to biological research.
HYFTs™ – the atomic units of biological data
At BioStrand, we applied advanced NLP techniques to protein and DNA sequences to transcribe the universal language of all omics data. In doing so, we were able to decode the atomic units of information, called HYFTs™, that are the building blocks of biological information.
With HYFTs™, all biological data, irrespective of species, structure, or function, can be tokenised to a common omics data language. In addition, these atomic data units are also extremely efficient carriers of biological information. Each HYFT™ pattern represents a unique signature sequence in DNA, RNA, and amino acid (AA) sequences and integrates data and metadata across all omics layers.
The transversal language of HYFTs™ enables the unification, standardisation, and normalisation of all data, across species and domains, to create a single source of truth. The default integration of omics layers and associated metadata combined with the BioStrand platform’s hyper-scalable technology and unified analytical framework enables truly integrated multi-omics research.
Standardised metadata is the key to the usability and reproducibility of public data. With BioStrand, all public data is usable, and all biological research is reproducible.