How to retrieve shared motifs across homologous sequences

Motifs are short, typically fixed-length patterns that recur in exact or approximate form across biological sequences and correspond to functionally or structurally important elements. Discovering new motifs and determining their functional or structural correlations is therefore an enduring challenge in bioinformatics.

This walkthrough, however, is not about discovering novel motifs and decoding their underlying correlations. Here we focus on the search and retrieval of specific query motifs that are shared across a group of homologous sequences.

The ability to quickly and easily retrieve shared motifs across a homologous family of sequences allows researchers to group/filter sequences on the basis of their shared characteristics. For instance, identifying common structural motifs in a protein dataset can provide valuable structural and functional insights about uncharacterised proteins, reveal similarities between proteins, and serve as fingerprints for spatial configurations of amino acids.

As a result, the real-time retrieval of motifs of interest from vast biological sequence datasets is a critical capability in modern bioinformatics.


Techniques for motif retrieval

Determining whether a group of homologous sequences shares a specific motif remains a challenging problem. Take proteins, for example: conventional methods that cluster sequences by similarity to identify shared motifs can be computationally slow given the sheer size of protein databases.

In fact, current algorithms to identify and retrieve specific motifs from biological sequences fall into one of two broad types: exact algorithms, which can identify all sequences with the shared motif but are computationally intensive, and approximate algorithms, which are less effective at retrieving shared motif sequences but are computationally efficient.

The BioStrand approach to motif retrieval eliminates the need for a trade-off between effectiveness and efficiency. With the BioStrand Retrieve & Relate (R&R) platform, you can search for complex patterns across extremely large datasets with the added functionality to even exclude specific motifs.


Research objective

In this workflow, we will demonstrate how you can retrieve, in real time, all biological sequences in a dataset that share a specific motif.

We detail the workflows for two distinct scenarios:

1. Identifying all sequences that contain the exact pattern of the queried motif of interest

2. Identifying all sequences that contain known complex motifs with gaps and variable positions

More broadly, we will demonstrate how you do not have to compromise between search effectiveness and computational efficiency in the extraction of sequences with shared motifs.


The BioStrand shared motif retrieval workflow

SCENARIO ONE

The objective of this workflow is to identify all sequences in a dataset that contain the exact pattern of the queried motif of interest.

We therefore start with:

1. A search sequence: In this case, we have a myoglobin sequence:

MVLSEGEWQLVLHVWAKVEADVAGHGQDIEIRLFKSHPETLEKFDRFKHLKTEAEMKASEDLKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG

2. A motif of interest: In this case, we are looking for a well-conserved helix pattern associated with oxygen binding:

EAELKPLAQSHAT

In this workflow, we will identify all sequences homologous to our myoglobin query that share the exact pattern of our oxygen-binding motif of interest.
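
Before running the search, it can be useful to confirm offline that the motif really occurs in the query. The following is a minimal Python sketch of exact matching, not the BioStrand implementation; `find_exact` is a hypothetical helper that returns every 0-based position at which the motif starts in the myoglobin sequence shown above.

```python
# Minimal sketch of exact motif matching. `find_exact` is a hypothetical
# helper for illustration only; it is not part of the BioStrand platform.

MYOGLOBIN = "MVLSEGEWQLVLHVWAKVEADVAGHGQDIEIRLFKSHPETLEKFDRFKHLKTEAEMKASEDLKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG"
MOTIF = "EAELKPLAQSHAT"


def find_exact(sequence, motif):
    """Return the 0-based start of every exact occurrence (overlaps allowed)."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        # Resume one position further on so overlapping hits are not missed.
        start = sequence.find(motif, start + 1)
    return positions


hits = find_exact(MYOGLOBIN, MOTIF)
print(f"{MOTIF} found at 0-based positions: {hits}")
```

An empty list would mean the motif is absent from the query, in which case filtering homologues by that exact pattern would be unlikely to return the query itself.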


STEP 1: Myoglobin Sequence Search

We copy and paste the myoglobin sequence into the BioStrand Search Bar and hit return.

[Screenshot: Myoglobin sequence search bar]

The Quick Filter view provides us with a comprehensive overview of all relevant results, including similar protein sequences and corresponding coding nucleotide sequences, retrieved from 14 different databases.

[Screenshot: Myoglobin sequence search, Quick Filter view]

We then shift to the Alignment view for a detailed list of the 1,256 sequences that match our search query.

[Screenshot: Myoglobin sequence search, Alignment view]


STEP 2: Motif of interest Search

We then paste our pattern of interest into the ‘Highlight pattern’ motif search field just above the Alignment view results.

(Quick Tip: Apply the “exclude pattern” filter if the objective is to retrieve all sequences that DO NOT SHARE the queried motif)

[Screenshot: Motif of interest search]

Finally, we click on the “Use as filter” icon to obtain the complete set of sequences that contain the exact pattern of our queried motif, with the motif highlighted in red.

[Screenshot: Motif of interest search used as a filter]

In just two easy steps, we retrieved 1,256 sequences homologous to our myoglobin sequence and then narrowed these down to the 529 matches that contain the exact pattern of our oxygen-binding motif of interest.
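
The filter and exclude behaviour described in this scenario can be pictured with a small Python sketch. This is an illustration under stated assumptions, not the platform's code: `filter_by_motif` is a hypothetical helper, and the record IDs and sequences are made-up toy data.

```python
# Sketch of "use as filter" and "exclude pattern" on a toy data set.
# The record IDs and sequences below are invented for illustration.

MOTIF = "EAELKPLAQSHAT"

toy_records = {
    "seq_A": "GHHEAELKPLAQSHATKHK",  # contains the motif
    "seq_B": "GHHEAELKPLAQSHASKHK",  # one substitution (T -> S): no exact match
    "seq_C": "KKEAELKPLAQSHATKK",    # contains the motif
}


def filter_by_motif(records, motif, exclude=False):
    """Keep records that contain `motif`, or that lack it when exclude=True."""
    return {
        seq_id: seq
        for seq_id, seq in records.items()
        if (motif in seq) != exclude
    }


with_motif = filter_by_motif(toy_records, MOTIF)
without_motif = filter_by_motif(toy_records, MOTIF, exclude=True)
print("share the motif:", sorted(with_motif))
print("do not share the motif:", sorted(without_motif))
```

The two result sets are complementary, so every homologous sequence lands in exactly one of them.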

The versatile motif search functionality of the BioStrand R&R platform can be used to search and sort results based on specific variations that enable you to quickly and efficiently isolate sequences that are most relevant to your research.


SCENARIO TWO

In this scenario, we want to identify all sequences that include known complex motifs. The key difference, in this case, is that the queried motif of interest is not an exact pattern: it contains gaps and variable positions.

We therefore start with:

1. A search sequence: For this workflow, we will start with a high-level plain-text search, rather than a specific sequence, to retrieve all homologous sequences. In this case, we will search for sequences that match the string “Similar CaM” where ‘similar’ is the unique identifier for text searches and CaM is calmodulin, a highly conserved, ubiquitous eukaryotic protein that binds Ca2+ ions with high affinity.

2. A motif of interest: We are looking for the IQ calmodulin-binding motif, an amino acid sequence motif with the consensus pattern:

[FILV]Qxxx[RK]Gxxx[RK]xx[FILVWY]


STEP 1: High-level plain text search for “Similar CaM”

First, we simply type “Similar CaM” (without quotes) into the BioStrand Search Bar and hit return to retrieve all sequences with CaM in their domain.

[Screenshot: High-level plain-text search for “Similar CaM”]

The ability to use high-level text to simultaneously search through multiple databases means that you can collate all associated sequences – nearly 73,000 in this instance – with just a single click. Aggregating only research-relevant data at one shot and in one place allows you to focus on the filters and parameters that are most pertinent to the objectives of your study.


STEP 2: Search for the IQ calmodulin-binding motif

The ‘x’s in the IQ calmodulin-binding motif [FILV]Qxxx[RK]Gxxx[RK]xx[FILVWY] represent positions that can contain any amino acid, while each bracketed group lists the residues allowed at that position.
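
One way to read this consensus notation is as a regular expression. The sketch below is an assumption-laden illustration rather than the platform's own matching logic: `consensus_to_regex` is a hypothetical helper, it assumes lowercase ‘x’ is the only wildcard, and the three candidate strings are invented test cases.

```python
import re

IQ_CONSENSUS = "[FILV]Qxxx[RK]Gxxx[RK]xx[FILVWY]"


def consensus_to_regex(pattern):
    """Compile a consensus pattern into a regular expression.

    Assumes lowercase 'x' is the only wildcard and that bracketed groups
    contain uppercase residue letters only, so a plain replace is safe.
    [A-Z] is a loose stand-in for "any amino acid letter".
    """
    return re.compile(pattern.replace("x", "[A-Z]"))


iq_regex = consensus_to_regex(IQ_CONSENSUS)

# Invented candidates: the first follows the consensus, the second has D
# where [RK] is required, the third starts with A instead of [FILV].
candidates = ["IQRAWRGHLARKRY", "IQRAWDGHLARKRY", "AQRAWRGHLARKRY"]
results = {c: bool(iq_regex.fullmatch(c)) for c in candidates}
for candidate, matched in results.items():
    print(candidate, matched)
```

Bracketed groups map directly onto regex character classes, which is why only the ‘x’ positions need translating.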

On the BioStrand platform, ‘?’ is the character used to denote a fixed gap, with each ‘?’ standing for exactly one position. For example, to launch a search for the motif instance IQxxxKGxxxRxxY, we simply replace the ‘x’s with ‘?’s and input IQ???KG???R??Y into the motif search bar.

Our platform also allows you to define more fuzzy boundaries by using an asterisk (*) to indicate that the length may vary. The flexibility to choose between fixed gaps or variable lengths makes it much easier to spot even small variations in motifs.
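
The difference between fixed gaps and variable-length gaps can be sketched in Python by translating a query into a regular expression. This is not the platform's implementation: `wildcard_to_regex` is a hypothetical helper, and treating ‘*’ as "zero or more residues" is an assumption, since the text above only says that the length may vary.

```python
import re


def wildcard_to_regex(query):
    """Translate a '?' / '*' motif query into a compiled regular expression.

    '?' -> exactly one residue (fixed gap)
    '*' -> zero or more residues (ASSUMED meaning of a variable-length gap)
    Any other character must match literally.
    """
    parts = []
    for char in query:
        if char == "?":
            parts.append("[A-Z]")
        elif char == "*":
            parts.append("[A-Z]*")
        else:
            parts.append(re.escape(char))
    return re.compile("".join(parts))


fixed = wildcard_to_regex("IQ???KG???R??Y")
flexible = wildcard_to_regex("IQ*KG")

# Invented test strings for illustration.
print(bool(fixed.search("MMIQAAAKGAAARAAYMM")))  # three residues between Q and K
print(bool(fixed.search("IQAAKGAAARAAY")))       # only two residues in the first gap
print(bool(flexible.search("IQAAAAAKG")))        # gap length is free to vary
```

A fixed-gap query rejects a sequence whose spacing is off by even one residue, whereas the variable-length form tolerates it.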

We launch our motif search by shifting to the Alignment view and entering a specific instance of the IQ motif, IQ???KG???K??L, into the motif search bar.

As you can see, we have instantly narrowed down the sequence results from nearly 73,000 to just one sequence containing our exact motif of interest.

[Screenshot: Search for the IQ calmodulin-binding motif]

Orchestrated multi-database analytics is one of the key differentiators of the BioStrand platform. However, we also understand that multi-database research is only as good as the data filtering and visualization tools that enhance the intelligibility and comprehensibility of the results.

By simply clicking on the sequence, we can then generate a concise report with all salient details, including the original data source of the sequence and all associated metadata such as high-level ontology, taxonomy, CATH topology, etc. By clicking ‘open externally’, we go directly to the original database of this sequence.

[Screenshot: Sequence report for the IQ calmodulin-binding motif search result]


The BioStrand R&R Motif Retrieval Advantage

With the BioStrand R&R platform, the effectiveness and efficiency of your research are no longer mutually exclusive. A simple two-step process will, in milliseconds, get you from your reference search sequence to a comprehensive list of accurate results of homologous sequences that include the queried motif of interest.

In addition, you have the option of searching for exact patterns, for motifs with fixed gaps that could contain a range of variables, or for motifs with more fuzzy boundaries. You can also choose to launch with a high-level text search for homologous sequences and then use motifs as an additional filter to qualitatively narrow down the results.

And finally, our simultaneous multi-database search functionality enables you to retrieve all relevant results at once, thereby enhancing productivity and accelerating time-to-insight.

See for yourself how effective and efficient the BioStrand Retrieve & Relate platform is –
start your Free Trial here.

 
