Advancing our understanding of biology is the only way to boost developments in biotechnology and discover the drivers of the diseases we aim to treat, strive to prevent, and hope, one day, to cure.
Here, at BioStrand, we are on a mission to create a truly effective, powerful, and convenient omics data analysis solution that will empower life sciences researchers to revolutionise genetic research, ramp up the speed and effectiveness of R&D lifecycles, and bring personalised treatments and precision medicine to the next level.
We are blessed with a truly talented and innovative team of researchers, developers, engineers, and data scientists who have not only made this solution a reality, but also work tirelessly every day to enhance our technology, improve its speed, accuracy, and ease of use, and develop valuable new functionalities for you, our users.
In this new series, we’re excited to introduce you to each member of our esteemed team so you can learn more about what projects they’re working on, and what tools, technologies, and techniques they use to bring the BioStrand platform to life.
We start with Sébastien Lemal, PhD. Sébastien has been with us since June 2020, continuing his career as a Data Scientist from his previous research positions at the University of Liège. He is responsible for the whole development process – from whiteboard to production-ready models.
I am currently involved with the integration of BioStrand’s HYFT™ technology into our state-of-the-art workflow for omics analysis. The HYFT™ technology is currently implemented within the BioStrand Omics Parser, which extracts HYFT™ patterns from biological sequences. For people familiar with Natural Language Processing (NLP), this can be seen as an elaborate way to tokenize biological sequences.
In the same way that raw text can be tokenized into information-carrying entities, the HYFT™ patterns are the tokens: small snippets of sequence that carry information. These HYFT™ patterns can be the basis for multiple downstream tasks, such as indexing, clustering, mapping, etc.
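To make the tokenization analogy concrete, here is a minimal sketch in Python. It does not implement HYFT™ pattern extraction, whose details are not described here. As a stand-in, it uses plain overlapping k-mers (fixed-length substrings), a common and much simpler way to tokenize sequences. The function names, the toy sequences, and the choice of `k` are all assumptions made for illustration.

```python
from collections import defaultdict


def kmer_tokenize(sequence: str, k: int = 3) -> list[tuple[int, str]]:
    """Return (position, token) pairs for every overlapping k-mer.

    Stand-in tokenizer only: k-mers are NOT HYFT patterns.
    """
    sequence = sequence.upper()
    return [(i, sequence[i:i + k]) for i in range(len(sequence) - k + 1)]


def build_index(sequences: dict[str, str], k: int = 3) -> dict[str, list[tuple[str, int]]]:
    """Map each token to the (sequence id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos, token in kmer_tokenize(seq, k):
            index[token].append((seq_id, pos))
    return dict(index)


# Toy, made-up sequences used purely for illustration.
sequences = {"seq_a": "MKTAYIAK", "seq_b": "GGMKTAYL"}
index = build_index(sequences, k=4)

# Tokens found in more than one sequence hint at shared regions,
# which is the kind of signal indexing, clustering, or mapping can use.
shared = {
    token: hits
    for token, hits in index.items()
    if len({seq_id for seq_id, _ in hits}) > 1
}
print(shared)
```

Once sequences are reduced to tokens, an inverted index like this one turns "which sequences share this snippet?" into a simple dictionary lookup, which is why tokenization is a natural first step before the downstream tasks mentioned above.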
I am also working on the integration of textual data into the BioStrand platform. More specifically, I am developing a tool – NLP Link – to enrich differential gene expression data with information derived from scientific literature. The objective is to significantly speed up research on the biological pathways involved in certain diseases and treatments, and to help researchers formulate their hypotheses.
NLP Link started as a simple use case to exploit BioStrand’s text analytics platform and omics platform together. In the early stages of the project, it was difficult to assess how this would fit in the market.
But as I discussed the matter with clients and collaborators, it turned out that there is not only a need for an integrative environment for textual data and omics data of any kind, but also many potential applications, such as enrichment, automatic annotation, cross-mapping between different datasets, and more.
The scope of this project alone makes it very exciting, as does the opportunity to make an impact on the market.
Given the nature of BioStrand’s technology, I pay close attention to the progress made in the fields of bioinformatics and NLP. Recently, NLP has seen a breakthrough with the advent of deep learning transformer architectures [1], such as Google’s BERT [2], which significantly improved performance on NLP tasks, including Named Entity Recognition, Relation Extraction, Question Answering, etc.
More spectacularly, in my opinion, transformers and their attention mechanisms are at the heart of a significant breakthrough in molecular biology, with their implementation in DeepMind’s AlphaFold2 model, which is able to predict the 3D structures of proteins just from their amino-acid sequences [3]. Finally, considering my work at BioStrand, I read with great interest this review [4] of the intersection of NLP, deep learning-based methods, and protein structure.
It’s a great read for anybody who wants to have some high-level understanding of the progress made and the challenges encountered in the field over recent years.
As a Data Scientist, my go-to tool for processing data is the Python programming language, together with its wide variety of libraries (pandas, Spark, etc.). For visualisation, I particularly enjoy Plotly and its sub-libraries for interactive plots, as well as Seaborn for static ones.
To help me juggle different projects, I like to code within notebooks in the JupyterLab environment. It is rudimentary and lacks the sophistication of PyCharm, but I prefer its simplicity. I also use Gephi or Cytoscape to visualise network-based data, typically after some processing, as data tends to be stored within relational tables (as in Microsoft Excel).
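As an illustration of the kind of processing meant here, the sketch below turns a small relational table into a weighted edge list that a network tool can load. It is only one possible approach, using the Python standard library. The table contents, the column names `gene` and `pathway`, and the helper names are hypothetical. The `Source`, `Target`, and `Weight` headers follow the convention commonly used for edge tables in Gephi's spreadsheet import; Cytoscape lets you pick source and target columns when importing a table.

```python
import csv
import io
from collections import Counter

# Hypothetical relational table: one row per (gene, pathway) association.
rows = [
    {"gene": "GENE_A", "pathway": "pathway_1"},
    {"gene": "GENE_B", "pathway": "pathway_1"},
    {"gene": "GENE_A", "pathway": "pathway_2"},
    {"gene": "GENE_C", "pathway": "pathway_2"},
    {"gene": "GENE_A", "pathway": "pathway_1"},  # repeated association
]


def table_to_edges(rows, source_col, target_col):
    """Collapse table rows into unique edges, counting repeats as weight."""
    counts = Counter((row[source_col], row[target_col]) for row in rows)
    return [
        {"Source": source, "Target": target, "Weight": weight}
        for (source, target), weight in sorted(counts.items())
    ]


def edges_to_csv(edges):
    """Serialise the edge list as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["Source", "Target", "Weight"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(edges)
    return buffer.getvalue()


edges = table_to_edges(rows, "gene", "pathway")
edge_csv = edges_to_csv(edges)
print(edge_csv)
```

Writing `edge_csv` to a file gives an edge table that can then be imported into the network tool of choice.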
In my opinion, patience is a soft skill that is hard to acquire yet serves a significant purpose, both in work and in life. When I start working on a project, I never know whether it is going to lead to something good, or how much time it is going to take before coming to fruition. For some people, this can be frustrating, and they would prefer working on shorter-term projects.
However, if I want to work toward a greater objective, I have to accept both the risks and the time that the work requires. A lack of patience so often leads to disappointment.
Photo credit: Georgios Triantopoulos
1 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).
2 Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
3 Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., ... & Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583-589.
4 Ofer, D., Brandes, N., & Linial, M. (2021). The language of proteins: NLP, machine learning & protein sequences. Computational and Structural Biotechnology Journal.