Democratising Omics Data Analysis with Citizen Data Scientists

 

 

“Which would you rather have: a newly-minted Ph.D. data scientist or 20 people who can conduct basic analyses in their current jobs?”

Faced with that hypothetical choice, almost all companies opt for the latter, say the authors of a recent article in the Harvard Business Review. And if you did too, then there’s a good chance that you are advocating for the democratisation of data science.


The citizen data scientist imperative

Democratising data science has become a cross-industry rallying cry in recent times, even making it to the top three in Gartner’s 10 strategic technology trends for 2020. Broadly, it refers to the pressing need to enable business users, domain specialists, and other non-data scientists within the enterprise to use data science in their day-to-day routines.

The global shortage of data scientists is a significant premise for the democratisation argument, but not the only one. As the HBR article tells it, truly transformational data science will require enterprise-wide teams of citizen data scientists – people who use advanced analytics without the requisite exposure to statistics and analytics – with access to self-service tools that empower them to identify and address business challenges on a daily basis.


A snapshot of multi-omics research

Genomics analysis and research is currently almost a pure data scientist play, and these specialised skills are indispensable across the value chain, from data preparation, integration and normalisation to insight extraction.

Most data relevant to multi-omics analysis is characteristically diverse and distributed across domain-specific silos. This makes data ingestion a task reserved for skilled data scientists and bioinformaticians.

The tool environment is often unique to each data type and domain, which means that the choice of the right toolset for analysis requires years of experience and in-depth expertise.

And finally, even foundational processes, such as mapping new genome sequences and subsequences against available databases, are powered by decades-old methodologies like dynamic programming and heuristic algorithms. These techniques are not suited to the speed and scalability demands of the Big Data era, nor do they reflect the universal usability principles required to promote citizen data scientist engagement.
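To make the scaling problem concrete, here is a minimal Python sketch of one classic dynamic programming method, Smith-Waterman local alignment. It illustrates the general technique only and is not any particular tool's implementation; the function name and the scoring values (match, mismatch, gap) are assumptions chosen for the example.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between two sequences.

    Scoring values are illustrative assumptions, not a standard matrix.
    """
    rows, cols = len(query) + 1, len(target) + 1
    # score[i][j] holds the best score of a local alignment that ends
    # at query[i - 1] and target[j - 1]; row 0 and column 0 stay at 0.
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            same = query[i - 1] == target[j - 1]
            diagonal = score[i - 1][j - 1] + (match if same else mismatch)
            up = score[i - 1][j] + gap      # gap in the target
            left = score[i][j - 1] + gap    # gap in the query
            # A local alignment may restart anywhere, hence the floor at 0.
            score[i][j] = max(0, diagonal, up, left)
            best = max(best, score[i][j])
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))  # 6: the shared "ACG"
```

Every cell of the table has to be filled, so the work grows with the product of the two sequence lengths. That is manageable for a single pair and costly when one query is compared against millions of database sequences, which is why heuristic shortcuts were layered on top of the exact method.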

This need for specialisation across the genomics research value chain not only increases the time-to-value of research projects, it also contributes to the already substantial gap between genetic data generation and data analysis.

Now, this is a situation that can be adequately addressed by reimagining the entire omics research process with citizen data scientists as the focal point of the transformation.

And if that sounds like academic wheel reinvention, at BioStrand we have actually demonstrated the technological feasibility and practical viability of an end-to-end self-service genomics analysis and research solution that plays well with both seasoned and citizen data scientists.

How BioStrand democratises data science

One HYFT™ to bind them all

The first step towards democratising omics workflows is to address the diversity of sources, formats, and types in omics data.

Our strategic priority, therefore, was to develop a universal organising principle to parse, store and provide access to data across multiple platforms, techniques, and tools. So, we developed an innovative technique to index cellular blueprints and building blocks into a proprietary pattern called HYFT™.

HYFT™ patterns serve as biological fingerprints for signature sequences in DNA, RNA, and amino acid (AA) sequences. The HYFT™ for any sequence contains multiple layers of information, relating to function, structure, and position, that interlink DNA, RNA, and proteins. Using the BioStrand parser, we retrieved HYFTs™ from over 350 million sequences available across 11 public databases to create a proprietary pan-genomic knowledge database of over 660 million HYFT™ patterns comprising information about variation, mutation, structure, etc.

This is an ongoing process that will update and expand our knowledge base in line with that of the public databases. This process is also what allows any citizen data scientist to easily normalise and integrate their own datasets with a single click.

Just Google it

Sequence comparisons are often the simple foundations for even complex omics research, and we wanted to make this fundamental process as simple as Googling it.

Now that we had a precomputed knowledge base of the world’s genomic data, we went ahead and augmented it with Google-like indexing and exact matching functionality. Citizen data scientists can use plain text – for example, “similar coronavirus”, where “similar” is the unique identifier – to search through 660 million HYFT™ patterns and retrieve all information about aligned, similar, and variant matches, sorted by DNA, RNA, and AA, in less than 3 seconds. Users can also launch a sequence comparison by searching with a sequence supplied as a FASTA file.
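The general idea of a precomputed index with exact matching can be sketched in a few lines of Python. This is a generic k-mer lookup table, not the HYFT™ index or the BioStrand search engine, whose internals are not described here; the function names, the choice of k, and the sample records are all hypothetical.

```python
from collections import defaultdict


def parse_fasta(text):
    """Parse FASTA-formatted text into a {record_id: sequence} dict."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the record id.
            parts = line[1:].split()
            current_id = parts[0] if parts else f"record_{len(records) + 1}"
            records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line.upper()
    return records


def build_kmer_index(records, k):
    """Precompute every k-length substring -> list of (record_id, position)."""
    index = defaultdict(list)
    for record_id, sequence in records.items():
        for pos in range(len(sequence) - k + 1):
            index[sequence[pos:pos + k]].append((record_id, pos))
    return index


def find_exact(index, records, query, k):
    """Return sorted (record_id, position) pairs where query occurs exactly."""
    query = query.upper()
    if len(query) < k:
        raise ValueError("query must be at least k characters long")
    hits = []
    # Look up the first k characters, then confirm the full query in place.
    for record_id, pos in index.get(query[:k], []):
        if records[record_id].startswith(query, pos):
            hits.append((record_id, pos))
    return sorted(hits)


# Hypothetical sample data, for illustration only.
fasta_text = ">seq1 demo\nACGTACGT\nTTGA\n>seq2\nGGACGTAA\n"
records = parse_fasta(fasta_text)
index = build_kmer_index(records, k=4)
print(find_exact(index, records, "ACGTT", k=4))  # [('seq1', 4)]
```

The cost of scanning the sequences is paid once, when the index is built. Each later query is a dictionary lookup plus a short verification step, which is what makes search-engine-style response times plausible for exact matches.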

This simple, one-click search interface is the anchor point for BioStrand’s Retrieve & Relate (R&R) platform. It is intuitive enough to get any research off the blocks in as little as 3 seconds, and the results are comprehensive enough to enable a more detailed and accurate understanding of sequence similarities, alignments and annotations. Users get out-of-the-box access to all public databases, and can bring their own datasets or even integrate R&R functionality into their own workflows through a simple API.

Your research, your pathway

Research pathways in omics analysis have long been a function of users’ ability to choose appropriate algorithms and build complex pipelines. We believe that’s a rather warped approach to research, one that deprioritises domain knowledge in favour of technological aptitude.

Technology has to level the playing field – not skew it. To that end, the BioStrand solution features an extensive array of easy-to-use, versatile and powerful tools that give users complete control over the scope, focus, and pace of their research. Our technology supports whichever research pathway you think has the maximum potential for a breakthrough, and allows you to fine-tune and continually focus your progress based on the combination of variables and parameters most pertinent to your research objectives. With our powerful, built-in, state-of-the-art AI tools, you can intuitively synthesise knowledge from a multitude of data sources.


The platform for Citizen Omics Data Scientists

The BioStrand SaaS platform represents a revolutionary new approach to omics analysis and research. It also underlines our commitment to make advanced omics data analytics accessible to everyone – data science skills are optional. We believe that limiting omics research and analysis to data experts alone is not a productive or sustainable strategy, especially in genomics, where incoming data outstrips insight generation by an ever-widening margin. In this context, citizen omics data scientists represent the best approach to managing the data-value gap in genomics.
