Bioengineer, philosopher, programmer, bioinformatician and teacher, Christophe Van Neste is a truly multidisciplinary and multi-skilled researcher and data scientist at BioStrand.
Christophe’s journey towards becoming a fully-fledged bioinformatician and data scientist started with a five-year master's course in bioengineering.
After successfully completing this, and not feeling quite ready yet for the job market, Christophe embarked on another five-year master’s degree, this time in philosophy. Here he discovered that the pervasive use of logic in philosophy had a lot in common with his personal passion for (bio)programming.
With two master's degrees, self-taught proficiency in programming, and his sights set on a career in bioinformatics, Christophe realized that many of the specific bio IT opportunities he was interested in required a PhD. So, back to academia he went for a bioinformatics PhD on "Porting forensic DNA analysis to deep sequencing," where his exposure to multiple disciplines further honed his skills in biological data processing.
Christophe has been a valued member of the BioStrand data science team since January 2021. He brings his extensive experience in bioengineering and bioinformatics, and his abiding interest in programming, to his main responsibilities of preparing data for processing on the BioStrand Genetic Research Information Platform and helping customers with data ingestion, grouping and analysis.
Christophe’s synthesist expertise across biology, technology and programming enables him to apply systems thinking approaches to expanding and growing our platform capabilities and helping our customers realize novel insights and quick wins in their research.
I used Perl quite extensively during my bioengineering studies and my doctorate. But it's kind of a messy language, with multiple approaches to the same thing. I then moved to Python and settled on it because of its simplicity, clean code and logical framework for doing things. That was 10 years ago, and I have been using Python as my general framework ever since.
Of course, within any language, there are a lot of tools for data analysis. For visualization, I use the Matplotlib library; pandas for preprocessing data; SciPy for scientific analysis; and scikit-learn for ML in Python. For neural network analysis, I use TensorFlow.
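As a minimal sketch of how this kind of stack fits together (a toy dataset standing in for real biological features, not any actual BioStrand workflow): pandas holds and preprocesses the tabular data, and scikit-learn trains and evaluates a model on it.

```python
# Illustrative sketch only: a typical pandas + scikit-learn round trip.
# The feature table below is invented toy data.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data standing in for a preprocessed biological feature table.
df = pd.DataFrame({
    "feature_a": [0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.7, 0.05],
    "feature_b": [1.0, 0.9, 0.8, 0.1, 0.2, 0.95, 0.3, 1.1],
    "label":     [0, 0, 0, 1, 1, 0, 1, 0],
})

# Hold out a quarter of the rows, fit a simple classifier, and score it.
X_train, X_test, y_train, y_test = train_test_split(
    df[["feature_a", "feature_b"]], df["label"],
    test_size=0.25, random_state=0,
)
model = LogisticRegression().fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```

Matplotlib and TensorFlow would slot into the same flow, for plotting the features and replacing the linear model with a neural network, respectively.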
In bioinformatics, it is almost impossible for anyone, even those with vast experience and expertise, to quickly or easily understand every algorithm and every single step in a process. We typically combine a lot of tools and libraries created by others to perform multiple tasks, and it can be quite challenging to decode the functional nuances or internal logic of every software tool that we use.
However, it is extremely important not to treat the tools we deploy as a complete black box. We need to understand how these systems work, because every task has significant implications for downstream data analysis outcomes. It is important to look at this responsibility as a continuous learning process: figuring out how each of these applications performs on an individual basis and how they work together as a system.
As a bioinformatician, if you are not producing your own data, then your main need is to stay informed about the cutting-edge platforms and applications in bioinformatics and biological data analysis. If you are a data producer yourself, however, then you have specific biological questions, and your area of interest shifts to the state of the art for the specific type of data that you are using.
In both cases, the common theme is technology. So probably the most significant area of interest in bioinformatics and genomics research currently is how technology can help any researcher with any type of data analytics requirement, without the trouble of building their own platform from an overwhelming range of tools, libraries, etc.
I am currently working on a fun personal project of honing my home barista skills by applying data science to the art of brewing coffee. I guess it was my way of coping with being stuck at home during the pandemic.
I am roasting my own beans and tracking the temperature and other key data points to see how the process evolves. I am also looking forward to doing some ML on the process. But it's not like my work interests take up all of my free time. I also like to go for a run, go skiing, and read philosophy. So, it's not like I'm thinking about data all the time.
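A roast log of the kind described above might be analyzed along these lines (a hypothetical sketch with invented readings, not Christophe's actual setup): pandas can derive the "rate of rise" that home roasters often watch, directly from timestamped bean-temperature samples.

```python
# Hypothetical sketch of a coffee-roast log analysis. The readings are
# invented; a real log would come from a thermocouple on the roaster.
import pandas as pd

log = pd.DataFrame({
    "seconds": [0, 30, 60, 90, 120, 150],
    "bean_temp_c": [25.0, 80.0, 120.0, 150.0, 172.0, 188.0],
})

# Rate of rise: temperature change per 30-second sampling interval.
log["rate_of_rise"] = log["bean_temp_c"].diff()
```

From a table like this, a next step could be fitting a model to predict when the roast will reach a target temperature.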
As a bioinformatician who was not generating my own research data but only processing data, I initially used to feel kind of stuck in the middle. When I was doing my PhD, there were other doctoral students who were producing their own data and feeling more ownership of that data.
But on the other hand, I also realized that I had a strategic involvement in that I had to work with different kinds of researchers in defining and achieving the objectives of a particular project. So, that helped me develop the soft skills required to deal with the dynamics of working with different people to become a team player. And that is a skill that is as important at work as it is elsewhere in life.
I'm most proud of the project that I'm currently working on, which is developing the BioStrand engine for launching bioinformatics pipelines and easing the development of the pipelines themselves. I get to work closely with the other BioStrand data scientists to make their work easier.
At BioStrand, we are currently working on variant calling and on optimizing the report formats that will make it super easy for a genetic consultant to quickly solve the puzzle without having to wrestle with the data. As a result, I have been going through variant calling resources to help me stay current on the process.
I came across this very interesting paper about a customizable analysis pipeline that helps identify clinically relevant genetic variants in NGS data. Clinical laboratories typically have big backlogs of patient data to be analyzed and, therefore, there is a lot of effort currently going into optimizing the analysis process.
The paper discusses in detail the challenges facing these laboratories in terms of digesting the information generated by the analysis of patient data. It also describes an interesting framework that classifies genes into tiers of importance.
It is an interesting approach because it does not reduce the data that is presented in the report, but rather presents it in a format that significantly reduces the time required to extract insights from a patient's report.
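As a hypothetical illustration of that tiered-report idea (the gene names and tier assignments below are invented for illustration and are not from the paper): ordering variants by gene tier surfaces the most clinically relevant entries first while keeping every entry in the report.

```python
# Hypothetical sketch of tier-based report ordering. Gene-to-tier
# assignments here are invented, not a real clinical classification.
GENE_TIERS = {"BRCA1": 1, "TP53": 1, "MLH1": 2, "TTN": 3, "MUC16": 3}

variants = [
    {"gene": "TTN",   "variant": "c.100A>G"},
    {"gene": "BRCA1", "variant": "c.68_69delAG"},
    {"gene": "MLH1",  "variant": "c.350C>T"},
]

# Sort by tier; genes without a tier sink to the bottom.
# Nothing is dropped from the report, only reordered.
report = sorted(variants, key=lambda v: GENE_TIERS.get(v["gene"], 99))
```

The consultant then reads top-down, reaching tier-1 findings immediately instead of scanning the full list.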
Photo credit: Georgios Triantopoulos