The advantage of agnostic technology in bioinformatics: general solutions for specific problems
We live in an era of abundant data and abundant tools. New sequencing and mass spectrometry techniques are giving researchers more and more ways to generate large amounts of data, and online repositories such as GitHub make sharing of analysis methods easier than ever.
However, data and tools are not always in sync, and a discrepancy is growing between data complexity and highly specific tools. Meanwhile, general purpose analysis software also suffers from drawbacks in terms of speed and accuracy. The need for bioinformatics analysis to become “agnostic” presents a clear challenge to life sciences research all over the world.
Agnostic technology presents a new paradigm for data analysis, one that places no restrictions on the format or source of the data. Tools able to process anything, without bias, without error propagation, and without cutting corners, are immensely helpful to researchers in well-established fields, and they also open up entirely new areas of investigation.
Specialised tools can limit the scope of research
Specialised bioinformatics tools are developed every day, often by research groups that need to see questions addressed for which no software is available yet or by companies that seek to develop proprietary analysis methods for their own products. These tools often only read one type of input data (e.g. a custom-formatted column-based text file), generated from a specific technique (e.g. differential proteomics), sometimes using hardware from a single manufacturer. On top of that, they may be species- or disease-specific.
While tools that answer a specific question can be extremely useful, their user base is necessarily limited, sometimes to the developers themselves. When faced with a research question for which none of these specialised tools is available, researchers face the challenge of using sub-optimal methods or developing analysis methods themselves. This requires trained personnel and more time and budget than is usually available.
Alternatively, researchers can use general-purpose technology, but even such packages are rarely “agnostic”. For software to be truly agnostic, it needs to be able to read any type of data, generated on any kind of hardware by processing biological materials from any kind of organism, tissue, or cell-line. On top of that, it needs to match the resulting data against all available databases. This is a challenging task, especially when we consider that despite its broad scope, agnostic technology needs to be just as fast and reliable as specialised bioinformatics tools.
Species-agnostic tools open up research into non-model organisms
The potential for agnostic technology to open up new research domains, however, is great. Research into non-model organisms is notoriously difficult, not only when it comes to culturing the organisms, but also when it comes to genomic, transcriptomic, or proteomic analysis. If no genome is available for your species of interest, any next-generation sequencing (NGS)-based technique becomes challenging. But even when a genome is available, not all methods may be feasible.
For an overview of which sequencing technologies and data analysis methods are best suited to non-model organisms, see da Fonseca et al. (2016) below. In this review, the authors present sequencing technologies and the corresponding high-throughput data analysis pipelines as they apply to non-model organism biology. These approaches include RAD-seq, targeted sequencing, and standard RNA-seq applied to non-model species.
Increasingly, researchers are turning towards non-model organisms to investigate many of the remaining unanswered questions in life sciences. These organisms bring with them several challenges, chief among which is the absence of tools, both wet-lab and in silico. The development of specific bioinformatics tools for each of these non-model organisms is not feasible.
In essence, the absence of bioinformatics tools gives researchers only two options: focus their research elsewhere or develop the tools themselves. There is, however, a third option: a truly agnostic piece of technology that can analyse data regardless of its source, which would be a great help for research into non-model organism biology.
An extra advantage of such a piece of technology would be that it makes comparisons across multiple species extremely easy, as they can all be analysed in the same way. As non-model organism researchers often rely on comparative biology to draw meaningful conclusions, agnostic technology should become the gold standard when making comparisons between different species.
Heuristic algorithms overspecify and oversimplify
Problems with specialised bioinformatics tools are not limited to species-specificity, however. Heuristic methods such as BLAST, Clustal, T-Coffee, and Bowtie have shortcomings of their own. First of all, they look at subsequences instead of complete sequence libraries. While this greatly reduces computation time, it also means they may simply miss some sequences and return a sub-optimal result. Every assumption made by a heuristic reduces the data under consideration, which makes the analysis faster but can also make the outcome less accurate.
There is no guarantee, for example, that an alignment returned by the BLAST algorithm between a large number of sequences is the best possible alignment. While the chance of a sub-optimal alignment is small or insignificant for a single search, it compounds as the complexity of the query increases. As data becomes easier to generate, the analyses demanded become more complex, and errors in the heuristics stack up.
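As a rough illustration of how a small per-search error rate compounds, the sketch below assumes that each search independently returns a sub-optimal result with probability p. Both the independence assumption and the figure of 0.1% are hypothetical; they are not measurements of BLAST or any other tool.

```python
def prob_any_suboptimal(p, n):
    """Chance that at least one of n independent searches is sub-optimal.

    p is a hypothetical per-search probability of a sub-optimal result.
    """
    return 1 - (1 - p) ** n


# Hypothetical per-search error rate of 0.1%, for growing numbers of searches.
for n in (1, 100, 10_000):
    print(f"{n:>6} searches: {prob_any_suboptimal(0.001, n):.4f}")
```

Under these assumptions, an error rate that is negligible for a single search makes at least one sub-optimal result very likely once the number of searches reaches the thousands.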
When great accuracy is necessary, researchers are currently limited to dynamic programming methods, which consider full sequences and thus do not “cut corners”. These methods, however, also require large amounts of processing power and time, which can make execution of the task impractical or downright impossible.
Again, the solution is a truly agnostic technology that does not make assumptions about the data to cut corners, but still maintains the speed and efficiency of a thoroughly optimised heuristic algorithm. In fact, today even heuristic algorithms are reaching their limit in terms of speed. As data becomes easier to generate and higher in complexity, once-speedy heuristics become slow. New methods inspired by modern computer science, using indexing and machine learning, show great promise but are still in development.
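The trade-off between the two approaches can be sketched with a toy example. The code below compares a full dynamic programming local alignment score (in the style of Smith-Waterman) with a simple exact k-mer seed filter of the kind that seed-and-extend heuristics start from. The scoring values, the sequences, and the function names are illustrative assumptions; this is not how BLAST or any other named tool is actually implemented.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score, found by filling the full DP matrix.

    Runs in O(len(a) * len(b)) time; scoring values are illustrative.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def shared_seeds(a, b, k):
    """Exact k-mers found in both sequences (a toy seed filter)."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[j:j + k] for j in range(len(b) - k + 1)}
    return kmers_a & kmers_b


# Hypothetical sequences that differ at every fourth position.
query = "ACGTACGTACGT"
subject = "ACGAACGAACGA"

print("seeds with k=4:", shared_seeds(query, subject, 4))
print("seeds with k=3:", shared_seeds(query, subject, 3))
print("full DP score:", smith_waterman_score(query, subject))
```

With a seed length of 4, the two sequences share no exact seed, so a filter of this kind would never examine the pair, even though the dynamic programming pass finds a clearly positive alignment score. Shortening the seed recovers the hit at the cost of more candidate pairs to extend, which is the speed versus sensitivity trade-off described above.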
The pipeline problem: chaining tools propagates errors
Perhaps one of the biggest problems of using specialised bioinformatics tools is that a single one will rarely suffice. One tool may be used for filtering and quality control, followed by another for data normalisation, another for genome mapping, another for statistical analysis, and a final one for gene ontology analysis. Chaining all of these tools together may seem smart, but it poses significant problems.
Each tool assumes that the analysis in the previous tool was done optimally or even perfectly. Errors made during an earlier analysis step are disregarded, so they propagate and lead to false positives or negatives near the end of the analysis pipeline. This happens because the specialised tools do not “talk” to one another about their assumptions. For example, if a gene ontology program does not have knowledge of the heuristics used to map reads onto the genome, it cannot take them into account.
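A back-of-the-envelope sketch shows how quickly such errors can accumulate. It assumes that each step handles a given record correctly with some fixed probability and that errors in different steps are independent; the step names follow the example pipeline above, but every accuracy figure is hypothetical.

```python
from math import prod


def pipeline_accuracy(step_accuracies):
    """Probability that a record passes every step correctly.

    Assumes errors in different steps are independent.
    """
    return prod(step_accuracies)


# Hypothetical per-step accuracies for a five-tool pipeline.
steps = {
    "quality control": 0.99,
    "normalisation": 0.98,
    "genome mapping": 0.95,
    "statistical analysis": 0.97,
    "gene ontology analysis": 0.96,
}

overall = pipeline_accuracy(steps.values())
print(f"weakest single step: {min(steps.values()):.2f}")
print(f"whole pipeline:      {overall:.3f}")
```

Under these assumptions, the pipeline as a whole is less reliable than its weakest individual step, even though no single tool looks problematic on its own.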
The problem of error propagation in multi-tool pipelines can be addressed by using a single all-in-one software package. While some of these are available, they are often proprietary and only work with hardware manufactured by a single company. Another downside of many of these all-in-one packages is their tendency to function as a “black box”. This means that users simply input their data and receive the end result, with hardly any customisation options.
While software packages such as these are definitely easy to use and may look deceptively simple, users often do not have accurate knowledge of what is happening “under the hood” of the program, and it is therefore easy to misjudge the significance or relevance of the result. Additionally, all-in-one programs tend to be difficult to troubleshoot. Error messages may be cryptic and hard to decipher, leaving researchers with no idea what exactly went wrong in the extensive pipeline running behind the scenes. Proprietary customer support is often the only option left in such cases.
The solution: a species-agnostic non-preselecting all-in-one software package
Today, few all-in-one software packages for the analysis of biological data come close to being agnostic. They rely extensively on heuristics and only query part of the available datasets. There are, however, notable exceptions that use modern indexing methods and patterns embedded in DNA to enable fast and scalable searching of multiple databases simultaneously.
Such methods have the capacity to lay the groundwork for the first truly agnostic software packages in bioinformatics. The advantages of technology that is broadly applicable yet extremely precise are obvious for researchers working with non-model organisms, but even researchers working with human samples or mouse models could benefit from a standardised all-in-one piece of agnostic technology.
As data becomes easier to generate, sample size and setup complexity skyrocket, but data analysis often lags behind, relying on overly specific tools meant for a very specialised audience, or all-in-one software packages that are operated like a black box. Agnostic technology attempts to find a middle ground between the two, allowing for great flexibility in data source and format, while offering maximum transparency and control to end-users.
da Fonseca RR, Albrechtsen A, Themudo GE, Ramos-Madrigal J, Sibbesen JA, Maretty L, et al. Next-generation biology: Sequencing and data analysis approaches for non-model organisms. Mar Genomics. 2016;30: 3–13. doi:10.1016/j.margen.2016.04.012