I recently came across a 2009 article titled A Quick Guide for Developing Effective Bioinformatics Programming Skills at PLOS, a non-profit, Open Access publisher for researchers in science and medicine. The gist of the article was that formal bioinformatics training programs focus too strongly on bioinformatics methodology, to the detriment of the “self-educational and experiential” aspects of the science.
With that as context, the article proceeded to catalogue several key principles of bioinformatics programming that could facilitate the career development of bioinformaticians: a comprehensive understanding of the available computational tools; mastery of at least one programming language; a grasp of pertinent software tools, libraries, and applications; proficiency in the use and maintenance of Web server systems; a working knowledge of UNIX systems; the ability to structure data; and the aptitude to embrace complex paradigms like parallel computing.
Not exactly a narrow skillset – but ten years ago (or two generations ago in technology time), bioinformaticians had no option but to be their own software engineers, programmers, hardware experts, and data specialists if they were to make anything of their biological degrees and interests. And since bioinformatics is, by definition, the application of computational tools to large volumes of biological data, perhaps that intensive, applied knowledge of both biology and computer science is implicit in the discipline, even if only arguably so.
However, today, computation is an integral aspect of every vocation – despite the inability of most users to build their own computing environments, applications, and workflows. For example, the AI revolution in medicine is not predicated on doctors’ abilities to master deep learning, build data pipelines, and train algorithms.
And yet, bioinformaticians are still expected to possess specialist-grade capabilities – not only in their core mission of biological research, but also in the design and deployment of the systems, processes, and applications required to execute their mission productively.
First of all, how does this DIY approach to bioinformatics scale?
And second, why, in the XaaS age of “Everything as a Service,” does bioinformatics lack even a rudimentary form of organized, mainstream technological development and support?
Let’s address these issues, in that order.
To be clear, there is no dearth of computational solutions in bioinformatics. The reality of every user designing their own solutions to drive their research has left us with a tyranny of choice.
As a 2020 article titled Community Curation Of Bioinformatics Software And Data Resources describes it, the rapidly expanding corpus of bioinformatics tools and databases is making it a challenge for scientists to select the tools that fit their purpose. In response, ELIXIR, an intergovernmental organisation, was established to unify life science resources across Europe, along with a portal for bioinformatics tools and databases. Even so, helping researchers find the right software addresses only one challenge – they still need specialist training to use these tools.
All of which brings us to the numerous drawbacks of this discrete and fragmented approach to technological development in bioinformatics.
The need for specialisation: In recent years, citizen data scientists have emerged as the new fulcrum for the development of advanced analytics. Bioinformatics, however, continues to require deep specialisation across two inherently complex disciplines, even as biological data builds up exponentially in the background.
Technology eclipses biological research: Even assuming we manage to find a large enough pool of human capital with this dual specialisation, most of the tools and pipelines available have been designed and built for specific research projects. Once those projects are completed and the resources have been submitted to a central community registry, the code is often no longer updated and/or lacks adequate support documentation. The inevitable result is that researchers end up spending more time and energy troubleshooting technology than doing actual biological research.
Scalability/portability challenges: Many of these legacy tools and solutions are not designed to scale – at least not to modern big data specifications. Neither are many of them easy to port across different computational environments. Engineering these capabilities and ensuring that different components communicate across the pipeline is a significant challenge.
Accumulation of errors: Chaining together a set of specialised bioinformatics point solutions to build a pipeline introduces a real risk of error propagation. Because these components do not communicate adequately with one another, errors made early in the process compound, cumulatively resulting in false positives or negatives at the end of the pipeline.
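The compounding effect is easy to make concrete with a back-of-the-envelope sketch. Assuming each stage of a pipeline corrupts results independently (a simplification), even a small per-stage error rate multiplies across the chain – the per-stage rates and stage names below are purely illustrative:

```python
# Back-of-the-envelope model of error propagation in a chained pipeline.
# Assumes stages introduce errors independently -- a simplification --
# and uses illustrative, not measured, error rates.

def compound_error(per_step_error: float, n_steps: int) -> float:
    """Probability that at least one stage corrupts a given result."""
    return 1 - (1 - per_step_error) ** n_steps

# A hypothetical five-stage pipeline (e.g. QC -> trimming -> alignment ->
# variant calling -> annotation) where each stage mislabels just 1% of calls:
print(round(compound_error(0.01, 5), 3))  # roughly 5% of final calls affected
```

In other words, five stages that are each 99% reliable leave only about 95% of results untouched – and real per-stage error rates, amplified by format mismatches between tools, can be considerably worse.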
The bioinformatics industry has built up an extensive collection of software and tools over the years. However, there are significant challenges and shortcomings involved in adapting these solutions to bespoke bioinformatics projects. There is currently a limited supply of all-in-one software solutions – and most of those that are available require proprietary hardware, incorporate opaque workflows that are hard to understand or troubleshoot, and are severely limited in terms of customisation.
It’s been a decade since venture capitalist Marc Andreessen delivered the “software is eating the world” axiom, along with the observation that every major industry – from movies to agriculture to national defence – is now run by software delivered as a service. Somehow, that mainstream trend seems to have bypassed a critical scientific research discipline – bioinformatics.
As early as 1999, a presidential advisory committee in the U.S. warned that software engineering practices were not being applied effectively to scientific computing. Since then, several authors have periodically reminded us that the ‘chasm’ between software engineering and scientific programming poses a serious risk to scientific progress. A 2017 review of the issue concluded that the situation persists, with no straightforward solution in sight.
And yet, we have to find a way to change the computational status quo in bioinformatics. To paraphrase a 2020 article called The Democratization Of Bioinformatics: A Software Engineering Perspective, bioinformatics researchers face daily impediments to productivity because they have to improvise solutions and deal with implementation details that distract from their primary focus.
The authors propose a widely available fix: leverage the power of cloud computing to democratise access to scalable and reproducible bioinformatics engineering for generalist bioinformaticians and biologists.
It’s time for Bioinformatics-as-a-Service to bridge the chasm and bring the discipline to the digital age.