How to identify species related to a sequence

The taxonomic identification of species is a fundamental process in biological research that allows us to map how different species relate to one another, identify new genetic material and understand evolution. As a result, species identification is a foundational requirement for many real-world applications including forensics, conservation, marketplace regulation, food production, disease control, ecosystem monitoring and research.

However, species identification is often hindered by the lack of taxonomically informative structures and by the fact that an estimated 90% of all multicellular species still await description.

This issue is now being addressed by international initiatives, such as the Consortium for the Barcode of Life (CBOL) and the Global Taxonomy Initiative (GTI), to develop DNA barcoding as a global standard for the identification of biological species. With this approach, taxonomic identification can be accomplished by analysing short segments of the genome against reference sequence libraries linked to specimens identified to a species level.

But even in cases where reference sequence libraries are already available, sequence-based species identification is still hampered by several procedural and computational challenges.

In this walkthrough, we will demonstrate how the BioStrand Retrieve & Relate (R&R) platform simplifies and accelerates species identification using just a sequence fragment to retrieve accurate matches from multiple databases in parallel.


Techniques for species identification from a random sequence

Currently, BLAST (Basic Local Alignment Search Tool) is the go-to platform for species identification using either a proprietary reference database or a public genetic repository like GenBank. Executing a BLAST search involves selecting the appropriate nucleotide BLAST program, pasting the query sequence into the search box, selecting a database and clicking the BLAST button. This returns a list of matches, with the top hit signifying the best match, supplemented by several statistical indicators of the quality of each match.

Each result is further qualified by values such as Max/Total score (the alignment scores), Query coverage (the proportion of the query sequence aligned with the reference sequence), E value (the number of matches of similar quality expected by chance, so lower is better) and Max Ident (the proportion of identical nucleotides in the aligned sequences). The best estimate of species identity, i.e. the top hit, is the result with near-complete query coverage, very high sequence identity and a very low E value.
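To make the selection rule concrete, here is a minimal Python sketch of how a list of hits could be screened for near-complete coverage, high identity and a low E value. The `Hit` class, the threshold values and the three example hits are all assumptions made for illustration; they are not real BLAST output, and suitable cut-offs depend on the marker and the reference library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    description: str
    max_score: float
    query_coverage: float  # percent of the query aligned, 0-100
    e_value: float         # matches of this quality expected by chance
    identity: float        # percent identical positions, 0-100


def best_hit(hits, min_coverage=90.0, min_identity=97.0, max_e_value=1e-5):
    """Return the hit with the lowest E value that passes all thresholds, or None.

    The default thresholds are illustrative assumptions, not recommendations.
    """
    passing = [
        h for h in hits
        if h.query_coverage >= min_coverage
        and h.identity >= min_identity
        and h.e_value <= max_e_value
    ]
    # Lowest E value first; ties broken by higher identity, then higher coverage.
    return min(
        passing,
        key=lambda h: (h.e_value, -h.identity, -h.query_coverage),
        default=None,
    )


# Made-up hits for illustration only.
hits = [
    Hit("reference A", 1800.0, 99.0, 0.0, 99.2),
    Hit("reference B", 950.0, 60.0, 1e-120, 99.8),   # fails coverage
    Hit("reference C", 400.0, 98.0, 1e-30, 82.0),    # fails identity
]

top = best_hit(hits)
print(top.description if top else "no confident match")
```

Returning `None` when nothing passes is a deliberate choice in this sketch: a weak top hit is reported as "no confident match" rather than as a species call.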

The primary limitation of this approach is that only one reference database can be integrated into the analysis. More importantly, there needs to be a more deterministic model for sequence-based taxonomic identification in the age of DNA barcoding.

With the BioStrand R&R platform, you can integrate all relevant databases in one go to aggregate all pertinent information and generate a universal list of all relevant matching sequences. In addition, the platform gives you the flexibility to explore the results across multiple parameters including alignment score, shared sequences and other associated metadata.
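The general idea of querying several databases at once and merging the results into a single list can be sketched in a few lines of Python. This is not the R&R implementation, which is not described in this article: the three in-memory "databases", the record ids and the exact-substring match are all hypothetical stand-ins for real repositories and a real similarity search.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-ins for reference databases: {record_id: sequence}.
DATABASES = {
    "db_one": {"rec1": "ACGTACGTAA", "rec2": "TTTTGGGGCC"},
    "db_two": {"rec3": "GGACGTACGT", "rec1": "ACGTACGTAA"},
    "db_three": {"rec4": "CCCCCCCCCC"},
}


def search_database(name, query):
    """Return (database, record_id) pairs whose sequence contains the query."""
    return [
        (name, rec_id)
        for rec_id, seq in DATABASES[name].items()
        if query in seq  # exact substring match: a placeholder for a real search
    ]


def search_all(query):
    """Query every database concurrently and merge the hits by record id."""
    with ThreadPoolExecutor() as pool:
        # pool.map preserves the order of DATABASES, so merging is deterministic.
        results = list(pool.map(lambda name: search_database(name, query), DATABASES))
    merged = {}
    for hits in results:
        for name, rec_id in hits:
            merged.setdefault(rec_id, []).append(name)
    return merged


print(search_all("ACGTACGT"))
```

Merging by record id means a sequence found in two databases appears once in the combined list, with both sources attached.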


Research Objective

In this workflow, we demonstrate how you can start with any random sequence fragment to:

1. Simultaneously retrieve all matching sequences from multiple databases, and

2. Use a range of built-in filters and features to identify the best taxonomic match for your query sequence.

More broadly, we will demonstrate how you do not have to compromise between search effectiveness and computational efficiency in the extraction of sequences with shared motifs.


The BioStrand sequence-based species identification workflow

We start with a randomly selected sequence fragment:

TCTCGACTCCGTCAGATTTGCGGCACCTCTTTACAAAAATGGTCCTGTATCTATACGACTATACATTGGCGGTCGTGTCTAAGGACAACACAATGCACACAGCAGATGATATAAGTTTCGACTTACGACCGCACTTGCCTTCTGACGCCTTGTTAGCGACCGTGATCAACCACGACATGGTGATCGATGGGGATCTCTTCACGCCTGAGCAGATATCACTGCTTTGTCTTGCAGGGCAGCAGTACCCGAGCGTGTGGTATGCCGGCGAAGGCAACATATACAATTCATGCAACATGGTCGCCGATGATCTGGTGGTGGTAAGCTCCGGGAGGCTCACCACAGACAGTGCTTTCACATGGGGGTCACCTGACAAGCTCTACAACATGATGTGGACAATAGCCCAGAAGCTGAATGGCGTGAGCTGTCTGATGTATGCCCTCGAGAGTATGCGTGGAAAGTGCAAGATGATGAGCGATATCGTTGCCAAGACTGACTGCAGAGAGGTCAATGCTATGATACCTAGGAGTTACTGCATGAGCACAGCCTTTGGACAAATTAGGGAGAAGCAGATTGTGGTCAAAATGCCAGGTTACTTCTCAACGAGCATCGGAATGTTATCCGACCTGATGTATGGTATGACATTCAAGGCCGTCGCTTCGTGCGTGGCCGAGACTCTGGGCGCTATGGGCACGATTGTCTCATCGTCGACGCCACGCACGAATCCTACCATCAATGGGCTTATGAGAGATTACGGTTTGCAGCACACGAACGCCTGGGACAATTTCATGCTTAGAAACTTTGAAATGGTCACACGGCGACCAACCCAGTGGGACATAGGACAGCACATGAAAGAGTATGCCTTGGCGCTTGCCGAGCATGTCATGCTAGGGTATGATATTGAGATGCCATCCATCTTACTCACCATACCCGCTCTGACCGCAGTCAATACTGCCTACGGACTCACGAGGGGGTGGTACGGAGGTGGGAGCACCTTGGATATGGACAAAAAGCAGCGGAAAGAATCTACAGATGCCTTATGCGCTGTGGGGTGGATGTGCGGGCTGCGGCAATGCCGACCCCAAGTATTCCGGAACCGAGCAGGCAAAAAGCAAGTGATGGTTAATGCAGCAGAGCGTAAGCTAAGAGCCGAGGCGGGGGATGACTGCAGGATAAGGGATGTGGAGTTCTGGCTAGAGGACACGCCCGGCGGCAGGGTCGACGAGAACGAGGAATCTGCCCCTAACTTGTATAAGACGGAGTTCAGCGGGACCAAATGTGCTATGGTGTTCAATTACGAGATGGGTATGTGGATTGAGGCCCGTCAGATGGACTACGACCGACTTAAGAGAGAAACTTTCTCAGGAGACCTGACTAAAAAAGAGCGGTATACAATGAGTAAAGTGTCTGCAATGCCTATACATTGGGGCCCACCCCCAAACCATAAGGCTAAGCTGGAGGCCAGCCTGGAACACATGAAAAGTATAAGCCGCGGGAACGCGATTGTACCCACCCGAGAGCCGAAGCATGTGCGCATTAACTCGCAGAGTATGGCCGTTGTGCCCAAATATGTCAAGGACGGTGTCGAAGAGGAAAAGTATGTCCACTACGAACGGCCAGCGATCGAGGAGGGAGATACGATCCGCTTCAGCGAGATAGACGTTCCAGGGGATGGTTCATGCGGGATACACGCTATGGTGAAGGACCTGACAGTGCATGGTAGGTTATCGCCGCACGAGGCTGCCAAGGCTACCGAGCTGTTTAGCACGGATACAGCCTCAAAAAAGTTCCATGACGCTGCTGAACTTGCAGCACAGTGCCAGCTGTGGGGCATGGGAATGGACCTGATCGACAAGGGGAGCAACCGTGTGACCCGGTATGGGCCTGAAGACAGTGAGTACAGCATCACTATCATCCGTGATGGGGGTCACTTCAGAGCCGGGTTGATA


STEP 1: Define the right translation frame

As we are using a sequence fragment, we first need to define the right translation frame.

In order to do so, we paste our query sequence into the Search bar and hit enter. This launches a search with default settings across 12 public databases and across all omics layers, translations and relevant coding sequences. This means that you get a more complete overview of preliminary results that are relevant to the queried sequence.

Next, we click on the info button on the right of the grey bar. The information at the bottom of the screen allows us to assess the sequence codes. As you can see, there are quite a few asterisk symbols (*), which mark stop codons and show up as gaps in the translation. This means that this Translation frame does not code very well.

Define the right translation frame, image 1

To adapt the reading frame, we go back to the search screen, relaunch the query from history and select Translation frame 3.

Define the right translation frame, image 2

When we click on the info button on the right of the grey bar this time, the result is a fully coded protein sequence with no gaps whatsoever.

Define the right translation frame, image 3
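The same check can be reproduced outside the platform. The sketch below translates a nucleotide sequence in the three forward reading frames using the standard genetic code and counts the stop codons (asterisks) in each. It assumes that frames 1 to 3 correspond to offsets 0 to 2 from the first base, which may not match the platform's numbering, and it leaves out the three reverse-strand frames for brevity. The short `demo` sequence is made up for illustration.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq, offset=0):
    """Translate seq from the given offset; '*' is a stop, 'X' an unknown codon."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def stop_counts(seq):
    """Map forward frame number (1-3) to the number of stop codons in it."""
    return {frame: translate(seq, frame - 1).count("*") for frame in (1, 2, 3)}


def best_frame(seq):
    """Return the forward frame with the fewest stop codons."""
    counts = stop_counts(seq)
    return min(counts, key=counts.get)


demo = "TAGCTAAGGGATGC"  # made-up example, not the query fragment
print(stop_counts(demo))
print(best_frame(demo), translate(demo, offset=2))
```

Passing the query fragment from this walkthrough to `stop_counts` should show which forward frame is free of stop codons, under the same numbering assumption.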


STEP 2: Assessing the relationship between the (fully coded) sequence and the related metadata

When we shift to the Quick Filter view, we are presented with a comprehensive overview of all the possibilities:

Assessing the relationship between the (fully coded) sequence and the related meta data, image 1

We can now assess the search results on the basis of their alignment scores by switching to the List View.

By default, the BioStrand R&R platform ranks the most specific sequences on the basis of the proportion of HYFT patterns the sequences share.
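HYFT patterns are specific to the platform and are not defined in this article, so we cannot reproduce that ranking here. As a rough analogy only, the Python sketch below ranks candidates by the fraction of the query's overlapping k-mers (substrings of length k) that they also contain. The k-mer stand-in, the value of `k` and the toy protein strings are all assumptions for illustration.

```python
def kmers(seq, k=4):
    """Return the set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_fraction(query, candidate, k=4):
    """Fraction of the query's k-mers that also occur in the candidate."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        return 0.0  # query shorter than k
    return len(query_kmers & kmers(candidate, k)) / len(query_kmers)


def rank_candidates(query, candidates, k=4):
    """Return (name, score) pairs sorted by descending shared fraction."""
    scores = {
        name: shared_fraction(query, seq, k)
        for name, seq in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up sequences for illustration only.
candidates = {
    "partial": "MKVLTTT",
    "unrelated": "GGGGGGG",
    "identical": "MKVLAAG",
}
print(rank_candidates("MKVLAAG", candidates))
```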

Assessing the relationship between the (fully coded) sequence and the related meta data, image 2

As we can see from the Description field, the top matches for our query sequence point to chrysovirus. We can now select all the top sequences associated with chrysovirus to view the metadata associated with them.

Assessing the relationship between the (fully coded) sequence and the related meta data, image 3

In the Quick Filter view, we get an exhaustive overview of all relevant information including the Top Concepts derived from all the returned sequence descriptions, the Taxonomy and the associated Gene Ontology terms.

Assessing the relationship between the (fully coded) sequence and the related meta data, image 4

We can then switch to the List View and the Alignment View to confirm that the best species match for our query sequence is the Penicillium chrysogenum virus.


Conclusions based on the BioStrand motif retrieval workflow

Using discretionary combinations of translation frames, ranking parameters and associated filters, we have been able to establish that our random sequence fragment is best aligned with the Penicillium chrysogenum virus.


The BioStrand R&R taxonomic identification advantage

With the BioStrand R&R platform, you are no longer forced to comb through one database at a time to aggregate all the results that are potentially aligned with your queried sequence. With a single click, you retrieve all the preliminary results relevant to your search from 440 million sequences across 12 databases. Moreover, the search also includes all omics layers, translations and relevant coding sequences. Furthermore, for each query a “word cloud” or “Top Concept” list is created from all the returned sequence descriptions.

With a universal shortlist of potential matches in place, you can then use a variety of features and filters built into the R&R platform to explore, validate and quickly extract knowledge from the preliminary results. For instance, the taxonomy overview can be used as a filter at all times. You can use the taxonomy filter to evaluate matches with multiple family/genus associations and quickly identify whether they relate best to a single species or to a whole family.
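The question of whether a set of matches relates to a single species or to a whole family comes down to finding the most specific taxonomic rank on which all matches agree. A minimal Python sketch of that idea follows; the three-rank lineage dictionaries and the placeholder names (`FamilyA`, `GenusA` and so on) are hypothetical and do not reflect how the platform stores taxonomy.

```python
RANKS = ("family", "genus", "species")  # from least to most specific


def consensus_rank(lineages):
    """Return (rank, name) for the most specific rank shared by all hits, or None."""
    consensus = None
    for rank in RANKS:
        names = {lineage.get(rank) for lineage in lineages}
        # Stop at the first rank where hits disagree or a label is missing.
        if len(names) != 1 or None in names:
            break
        consensus = (rank, names.pop())
    return consensus


# Placeholder lineages for illustration only.
same_genus = [
    {"family": "FamilyA", "genus": "GenusA", "species": "species a1"},
    {"family": "FamilyA", "genus": "GenusA", "species": "species a2"},
]
same_species = [
    {"family": "FamilyA", "genus": "GenusA", "species": "species a1"},
    {"family": "FamilyA", "genus": "GenusA", "species": "species a1"},
]

print(consensus_rank(same_genus))    # agreement stops at genus level
print(consensus_rank(same_species))  # agreement down to species level
```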

