
MUSA Help



Index

  1. Summary
  2. The Algorithm
  3. Input
  4. Parameters
  5. Output
  6. References



1. Summary


MUSA is an algorithm that searches for simple and structured motifs in a given set of non-coding DNA sequences. Unlike other motif finders, MUSA does not require details about the structure of the complex motif being searched for. This makes MUSA useful not only for finding the motifs themselves, but also for estimating the parameters to be used in other motif finders.

When the algorithm finishes, an email containing a link to a web page is sent to the user. The complete list of motifs found can be downloaded from this page. The motifs identified are also clustered into families, each one represented by a Position Weight Matrix (PWM). Each of these PWMs can be selected for comparison with the Transcription Factor Binding Sites (TFBS) described in the YEASTRACT database, which allows similar known motifs to be identified. Figure 1 illustrates the complete process.


Figure 1: Summary

2. The Algorithm


The method used by MUSA to extract motifs is depicted in Figure 2. To search for motifs, MUSA builds a matrix of co-occurrences. Given a length λ, all possible sequences of this length (λ-mers) over the alphabet {A, T, G, C} are enumerated. The matrix of co-occurrences then records, for each pair of λ-mers, the ε-tolerant score of the most common configuration of that pair in the input sequences. Finally, the algorithm extracts motifs by applying a biclustering approach to the matrix of co-occurrences. Further details can be found in [1].


Figure 2: How MUSA finds motifs
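
To make the idea of a matrix of co-occurrences more concrete, the following Python sketch enumerates the λ-mers and scores each pair of λ-mers. It is an illustration only and not the MUSA implementation described in [1]: it assumes that a "configuration" is the distance between the start positions of two non-overlapping λ-mers, and that the ε-tolerant score is the number of input sequences in which the pair occurs at that distance, give or take ε. The function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

ALPHABET = "ATGC"


def all_lambda_mers(lam):
    """Enumerate every sequence of length lam over the alphabet."""
    return ["".join(p) for p in product(ALPHABET, repeat=lam)]


def cooccurrence_scores(sequences, lam, eps=0):
    """Score each pair of lambda-mers (simplified reading, see lead-in).

    Returns a dict mapping (a, b) to the largest number of sequences in
    which b starts d positions after a, for some distance d +/- eps.
    """
    # support[(a, b)][d] = indices of the sequences showing that distance
    support = defaultdict(lambda: defaultdict(set))
    for idx, seq in enumerate(sequences):
        positions = defaultdict(list)
        for i in range(len(seq) - lam + 1):
            positions[seq[i:i + lam]].append(i)
        for a, starts_a in positions.items():
            for b, starts_b in positions.items():
                for i in starts_a:
                    for j in starts_b:
                        d = j - i
                        if d >= lam:  # assumption: b lies downstream, no overlap
                            support[(a, b)][d].add(idx)

    scores = {}
    for pair, by_dist in support.items():
        best = 0
        # Try every window centre between the smallest and largest distance.
        for centre in range(min(by_dist), max(by_dist) + 1):
            seqs = set()
            for d in range(centre - eps, centre + eps + 1):
                seqs |= by_dist.get(d, set())
            best = max(best, len(seqs))
        scores[pair] = best
    return scores


promoters = ["ACGTTTGCA", "ACGTTTTGCA", "GGGGGGGGG"]
strict = cooccurrence_scores(promoters, lam=3, eps=0)
tolerant = cooccurrence_scores(promoters, lam=3, eps=1)
```

In this toy input the pair (ACG, GCA) occurs at distance 6 in the first sequence and at distance 7 in the second, so the two occurrences only count as the same configuration when ε is greater than zero.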


3. Input


Figure 3 shows the user interface for the MUSA algorithm. Three fields are mandatory: an email address to which the results will be sent, a short description of the job, and a list of ORFs/Gene Names. The ORFs/Gene Names must follow the SGD nomenclature, as their promoter sequences are obtained automatically from the YEASTRACT database.


Figure 3: Input


4. Parameters


Although MUSA does not require the user to specify the characteristics of the motifs, it still needs a few parameters in order to run. The default values can be changed by clicking the box entitled Specify Parameters in the user interface, as shown in Figure 4.


Figure 4: Default parameter values

The parameters are briefly described below.

  • Search both strands: Should be selected if the user wishes to search for motifs in both the forward and reverse strands of the promoters of the input genes.
  • Minimum number of sequences (sieve rate): Minimum percentage of the input genes that must contain the motifs in their promoters.
  • Lambda parameter (λ, size of λ-mers): Size of the small sequences used to build the matrix of co-occurrences.
  • Epsilon parameter (ε, distance tolerance): Distance tolerance in the configuration of a pair of λ-mers. Using an ε greater than zero allows the configurations of λ-mers to vary slightly from sequence to sequence.
  • Maximum p-value for assessment of relevant motifs for family generation: Maximum p-value of the subset of motifs that will be considered to generate the families.
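
As a rough illustration of how the strand option and the sieve rate interact, the Python sketch below checks whether a motif is present in a sufficient percentage of promoters. It is not part of MUSA: the function names are hypothetical, exact substring matching stands in for motif matching, and the promoter strings are made up for the example.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[base] for base in reversed(seq))


def passes_sieve(motif, promoters, sieve_rate, both_strands=False):
    """True if at least sieve_rate percent of the promoters contain motif.

    With both_strands=True a promoter also counts when the motif is found
    on its reverse strand.
    """
    hits = 0
    for promoter in promoters:
        found = motif in promoter
        if both_strands and not found:
            found = motif in reverse_complement(promoter)
        hits += int(found)
    return 100.0 * hits / len(promoters) >= sieve_rate


# Made-up promoters: the motif is on the forward strand of the first one
# and on the reverse strand of the second one.
promoters = ["TTAACCTT", "TTGGTTAA", "GGGGGGGG", "TTTTTTTT"]
forward_only = passes_sieve("AACC", promoters, sieve_rate=50)
both = passes_sieve("AACC", promoters, sieve_rate=50, both_strands=True)
```

With a sieve rate of 50, the motif is rejected when only the forward strand is searched (1 of 4 promoters) and accepted when both strands are searched (2 of 4 promoters).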


5. Output


An email will be sent to the given address once the algorithm finishes. This email contains information about the input parameters and a link to a web page, as shown in Figure 5.


Figure 5: Example of the email content

A sample output page is shown in Figure 6, below. This page contains a link to the motif finder's output and a table listing the families of motifs. Each entry contains a logo depicting the PWM (Position Weight Matrix) of the family, the p-value of the root and a link to a file containing the list of motifs in the family and the PWM itself.


Figure 6: Sample output page

The families obtained are displayed in a number of pages, and these can be viewed by using the links shown in Figure 7, below.


Figure 7: Jumping between pages of results

Each of the PWMs in this output file can be compared to the TFBS contained in the YEASTRACT database, by following the steps depicted in Figure 8:

  1. Select a family's PWM in the column labelled "Select";
  2. Select the parameters to be used in the alignment of the PWMs;
  3. Press the button labelled "Match!", to compare the selected PWM with the database's TFBS.

The default metric is "Sum of the Squared Distances", and the input PWM can also be trimmed. Trimming removes the columns at the edges of the PWM that have an information content below the selected threshold.


Figure 8: Comparing the output families with YEASTRACT TFBS
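
The trimming step can be sketched as follows. This is an illustration under stated assumptions rather than the code behind the "Match!" button: the information content of a column is taken to be 2 + Σ p·log2(p) bits (that is, relative to a uniform background), each PWM column is a list of four probabilities, and the function names are hypothetical.

```python
import math


def information_content(column):
    """Information content of one PWM column, in bits (uniform background)."""
    return 2.0 + sum(p * math.log2(p) for p in column if p > 0)


def trim_pwm(pwm, threshold):
    """Drop columns at both edges whose information content is below threshold.

    Columns inside the PWM are kept even if they are uninformative.
    """
    start, end = 0, len(pwm)
    while start < end and information_content(pwm[start]) < threshold:
        start += 1
    while end > start and information_content(pwm[end - 1]) < threshold:
        end -= 1
    return pwm[start:end]


uniform = [0.25, 0.25, 0.25, 0.25]
pwm = [uniform, [1.0, 0.0, 0.0, 0.0], uniform, [0.0, 0.0, 1.0, 0.0], uniform]
trimmed = trim_pwm(pwm, threshold=0.5)
```

Here the two uniform edge columns are removed, while the uniform column in the middle is kept because trimming only acts on the edges.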

The comparison of the input PWM with the TFBS of the YEASTRACT database follows a procedure described in [2]. First, the TFBS of the YEASTRACT database are converted to PWMs, using the IUPAC rules and assuming equiprobability between the nucleotides. A few examples are shown in Figure 9. The input PWM is then locally aligned with each of the TFBS PWMs using the Smith-Waterman local alignment algorithm, with the selected column distance metric used to score the alignment. Four distance metrics were implemented:

  • Sum of the squared distances;
  • Average Kullback-Leibler divergence;
  • Pearson Correlation coefficient;
  • Average log-likelihood ratio.

For the average log-likelihood ratio distance metric, the nucleotide background frequencies were corrected for the GC-content (38%) of S. cerevisiae promoter DNA.
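
Three of these column metrics can be sketched in a few lines of Python. The definitions below are assumptions made for illustration, not the exact formulas used by YEASTRACT or in [2]: in particular, the average Kullback-Leibler divergence is taken here as the mean of the two directed divergences, and a small pseudocount guards against division by zero. The average log-likelihood ratio is left out because it additionally needs the background frequencies and the number of sites behind each column.

```python
import math


def sum_squared_distance(p, q):
    """Sum of squared differences between two columns (0 = identical)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def average_kl_divergence(p, q, pseudo=1e-9):
    """Mean of KL(p||q) and KL(q||p); the pseudocount avoids log(0)."""
    def kl(x, y):
        return sum(a * math.log((a + pseudo) / (b + pseudo))
                   for a, b in zip(x, y) if a > 0)
    return 0.5 * (kl(p, q) + kl(q, p))


def pearson_correlation(p, q):
    """Pearson correlation of two columns; 0.0 if either column is flat."""
    mean_p = sum(p) / len(p)
    mean_q = sum(q) / len(q)
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_q = sum((b - mean_q) ** 2 for b in q)
    if var_p == 0 or var_q == 0:
        return 0.0
    return cov / math.sqrt(var_p * var_q)


col_a = [1.0, 0.0, 0.0, 0.0]
col_c = [0.0, 1.0, 0.0, 0.0]
mostly_a = [0.7, 0.1, 0.1, 0.1]
```

Note that the first two functions are distances (smaller means more similar), whereas the correlation is a similarity (larger means more similar), so they have to be turned into a common scoring scheme before being used in an alignment.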


Figure 9: Transforming an IUPAC TFBS into a PWM
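
The conversion illustrated in Figure 9 can be sketched as below, using the standard IUPAC nucleotide codes. The function name, the A, C, G, T column order and the example site are choices made for this illustration only.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
BASE_ORDER = "ACGT"  # column layout assumed in this sketch


def iupac_to_pwm(site):
    """Convert an IUPAC string to a PWM, one [A, C, G, T] column per symbol.

    The bases allowed by each symbol share the probability equally.
    """
    pwm = []
    for symbol in site.upper():
        bases = IUPAC[symbol]
        pwm.append([1.0 / len(bases) if base in bases else 0.0
                    for base in BASE_ORDER])
    return pwm


example = iupac_to_pwm("ARN")  # arbitrary example site
```

For the site "ARN", the first column gives all the weight to A, the second splits it equally between A and G, and the third is uniform.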

Each of these metrics compares two columns, evaluating their similarity numerically and either favouring or penalizing their alignment. After the alignments are performed between the input PWM and all the TFBS, they are ordered by score, and the twenty top scoring alignments are displayed in the results table.
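
The alignment and ranking step can be sketched with a small Smith-Waterman implementation over PWM columns. This is a simplified illustration, not the procedure of [2] as implemented in YEASTRACT: the column score (1 minus the sum of squared distances), the linear gap penalty, the A, C, G, T column order and all function names are assumptions.

```python
def column_score(p, q):
    """Hypothetical similarity: 1 - SSD, positive for similar columns."""
    return 1.0 - sum((a - b) ** 2 for a, b in zip(p, q))


def local_align(pwm_a, pwm_b, gap=-1.0):
    """Best Smith-Waterman local alignment score between two PWMs."""
    n, m = len(pwm_a), len(pwm_b)
    h = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            h[i][j] = max(
                0.0,  # start a new local alignment
                h[i - 1][j - 1] + column_score(pwm_a[i - 1], pwm_b[j - 1]),
                h[i - 1][j] + gap,
                h[i][j - 1] + gap,
            )
            best = max(best, h[i][j])
    return best


def reverse_complement(pwm):
    """Reverse the columns and swap A<->T, C<->G ([A, C, G, T] order)."""
    return [column[::-1] for column in reversed(pwm)]


def top_matches(query, database, k=20):
    """Rank database PWMs by their best forward or reverse alignment score."""
    results = []
    for name, pwm in database.items():
        fwd = local_align(query, pwm)
        rev = local_align(query, reverse_complement(pwm))
        score, strand = (fwd, "forward") if fwd >= rev else (rev, "reverse")
        results.append((score, name, strand))
    results.sort(key=lambda r: r[0], reverse=True)
    return results[:k]


A, C, G, T = ([1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0, 0], [0, 0, 0, 1.0])
query = [A, C, G]
database = {"site1": [T, A, C, G, T], "site2": [T, T, T], "site3": [C, G, T]}
ranking = top_matches(query, database)
```

In this toy database, "site1" contains the query on its forward strand, "site3" is its reverse complement, and "site2" only matches a single column, so it ranks last.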

An example of the output obtained when a PWM of the family of motifs is compared with the YEASTRACT TFBS is shown in Figure 10. The results table contains TFBS that were found to be similar to the input PWM. Each row contains the TFBS, the TF it belongs to, whether the input PWM aligns on the forward or the reverse strand of the TFBS and the alignment of the input PWM with the PWM of the TFBS. An example of one local alignment is shown in Figure 11.


Figure 10: Top-scoring TFBS when aligned with the input PWM




Figure 11: Example of the alignment of an input PWM with a TFBS


6. References


[1] Mendes ND, Casimiro AC, Santos PM, Sá-Correia I, Oliveira AL, Freitas AT (2006) MUSA: A parameter free algorithm for the identification of biologically significant motifs. Bioinformatics, 22: 2996-3002.

[2] Mahony S, Auron PE, Benos PV (2007) DNA familial binding profiles made easy: Comparison of various motif alignment and clustering strategies. PLoS Comput Biol, 3(3): e61. doi:10.1371/journal.pcbi.0030061
