RISO Help



Index

  1. Summary
  2. The Algorithm
  3. Input
  4. Parameters
  5. Output
  6. References



1. Summary


RISO is an algorithm that searches for simple and structured motifs in a given set of DNA sequences. The motifs found by RISO satisfy constraints specified by the user: the number and size of the boxes that form the structured motif, the distances between the boxes and the minimum quorum (the percentage of input genes whose promoters must contain the motif). A maximum number of substitutions can also be specified for each box.
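To make the notion of a structured motif concrete, the following Python sketch checks whether a known structured motif occurs in a sequence and computes the percentage of sequences that contain it. It is only an illustration of the constraints listed above, not RISO itself, which infers the motifs rather than testing given ones. The function names (`occurs`, `quorum_percent`) and the example boxes are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def occurs(sequence, boxes, spacers, substitutions):
    """True if the structured motif occurs in `sequence`.

    boxes:         list of box strings, e.g. ["TTGACA", "TATAAT"]
    spacers:       list of (min, max) distances between consecutive boxes
    substitutions: maximum number of mismatches allowed for each box
    """
    def match_from(box_index, start):
        box = boxes[box_index]
        end = start + len(box)
        if end > len(sequence):
            return False
        if hamming(sequence[start:end], box) > substitutions[box_index]:
            return False
        if box_index == len(boxes) - 1:
            return True
        shortest, longest = spacers[box_index]
        # Try every allowed spacer length before the next box.
        return any(match_from(box_index + 1, end + gap)
                   for gap in range(shortest, longest + 1))

    return any(match_from(0, start) for start in range(len(sequence)))

def quorum_percent(sequences, boxes, spacers, substitutions):
    """Percentage of sequences that contain the structured motif."""
    hits = sum(occurs(s, boxes, spacers, substitutions) for s in sequences)
    return 100.0 * hits / len(sequences)
```

A motif with two boxes separated by 16 to 18 positions and one substitution allowed per box would be passed as `boxes=["TTGACA", "TATAAT"]`, `spacers=[(16, 18)]`, `substitutions=[1, 1]`.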

When the algorithm finishes, an email containing a link to a web page is sent to the user. The complete list of motifs found can be downloaded from this page. The motifs identified are also clustered into families, each one represented by a Position Weight Matrix (PWM). Each PWM can be selected for comparison with the TF Binding Sites (TFBS) described in the YEASTRACT database, which allows similar known motifs to be identified. Figure 1 illustrates the complete process.


Figure 1: Summary

2. The Algorithm


RISO searches for motifs by building a factor tree from all the input sequences. A factor tree is a kind of suffix tree pruned at a given depth, which here is defined by the length of the boxes of the structured motif. RISO also uses a new data structure, the box-link, to extract structured motifs efficiently from the promoters of the input genes. The box-link stores the information needed to jump from box to box of the structured motif. Further details can be found in [1].
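The idea of a factor tree can be sketched in a few lines of Python. The version below is a simplified assumption, not RISO's implementation: it stores the tree as nested dictionaries, handles exact matches only (no substitutions) and omits the box-link structure entirely. The names `build_factor_tree` and `factors_meeting_quorum` are hypothetical.

```python
def build_factor_tree(sequences, depth):
    """Build a trie of every substring (factor) of at most `depth` letters.

    Each node records the indices of the sequences in which its factor occurs.
    """
    root = {"children": {}, "seqs": set()}
    for index, sequence in enumerate(sequences):
        for start in range(len(sequence)):
            node = root
            for letter in sequence[start:start + depth]:
                node = node["children"].setdefault(
                    letter, {"children": {}, "seqs": set()})
                node["seqs"].add(index)
    return root

def factors_meeting_quorum(root, length, quorum, n_sequences):
    """Factors of exactly `length` letters found in at least `quorum` percent
    of the sequences, in alphabetical order."""
    needed = quorum / 100.0 * n_sequences
    found = []
    stack = [("", root)]
    while stack:
        word, node = stack.pop()
        if len(word) == length:
            if len(node["seqs"]) >= needed:
                found.append(word)
            continue
        for letter, child in node["children"].items():
            stack.append((word + letter, child))
    return sorted(found)
```

Pruning at the box length keeps the tree small: no path is longer than `depth`, regardless of how long the promoter sequences are.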


Figure 2: How RISO finds motifs


3. Input


Figure 3 shows the user interface for the RISO algorithm. Three fields are mandatory: an email address to which the results will be sent, a short description of the job and a list of ORFs/Gene Names. The ORFs/Gene Names must follow the SGD nomenclature, as their promoter sequences will be obtained automatically from the database.
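Before submitting a job, it can help to check the list of names locally. The Python sketch below is an assumption about what a reasonable pre-check might look like, not part of YEASTRACT. The two patterns cover only the most common SGD forms (systematic ORF names such as YGL035C and standard gene names such as GAL4) and will not recognize every valid SGD identifier. The function name `classify_names` is hypothetical.

```python
import re

# Systematic ORF name: Y, chromosome letter A-P, arm L/R, three digits,
# strand W/C, and an optional suffix such as -A.
ORF_PATTERN = re.compile(r"^Y[A-P][LR]\d{3}[WC](-[A-Z])?$")
# Standard gene name: three letters followed by one to three digits.
GENE_PATTERN = re.compile(r"^[A-Z]{3}\d{1,3}$")

def classify_names(raw_text):
    """Split free text into names and sort them into three groups."""
    names = [n.upper() for n in re.split(r"[\s,;]+", raw_text) if n]
    groups = {"orf": [], "gene": [], "unrecognized": []}
    for name in names:
        if ORF_PATTERN.match(name):
            groups["orf"].append(name)
        elif GENE_PATTERN.match(name):
            groups["gene"].append(name)
        else:
            groups["unrecognized"].append(name)
    return groups
```

Names that land in the `unrecognized` group are not necessarily wrong; they are simply worth checking against SGD before the job is submitted.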


Figure 3: Input


4. Parameters


The parameters that RISO requires are described below:

  • Quorum: Minimum percentage of genes that must contain the motifs in their promoters.
  • Family significance level cutoff: Specifies the maximum p-value of the set of motifs that are to be used in the generation of the families of motifs.
  • Substitutions: Maximum number of substitutions allowed for the corresponding box of the structured motif.
  • Minimum/maximum length: Minimum/maximum length of the corresponding box of the structured motif.
  • Add one box/remove last box: Should be clicked when one wishes to add/remove one box to/from the structured motif, respectively.
  • Minimum/maximum spacer length: Minimum/maximum length of the space between the boxes of the structured motif.
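The parameters above can be pictured as a small configuration object. The following Python sketch is purely illustrative: the class and field names (`Box`, `Spacer`, `RisoJob`) are hypothetical and do not correspond to any YEASTRACT interface, and the consistency checks are assumptions about what sensible values look like.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Box:
    min_length: int
    max_length: int
    substitutions: int = 0      # maximum substitutions allowed in this box

@dataclass
class Spacer:
    min_length: int
    max_length: int

@dataclass
class RisoJob:
    quorum: float               # percentage of genes, 0 < quorum <= 100
    family_pvalue_cutoff: float # maximum p-value for family generation
    boxes: List[Box]
    spacers: List[Spacer] = field(default_factory=list)

    def problems(self):
        """Return a list of human-readable inconsistencies (empty if none)."""
        issues = []
        if not 0 < self.quorum <= 100:
            issues.append("quorum must be a percentage in (0, 100]")
        if not 0 < self.family_pvalue_cutoff <= 1:
            issues.append("p-value cutoff must be in (0, 1]")
        if not self.boxes:
            issues.append("at least one box is required")
        elif len(self.spacers) != len(self.boxes) - 1:
            issues.append("there must be one spacer between each pair of boxes")
        for number, box in enumerate(self.boxes, start=1):
            if not 1 <= box.min_length <= box.max_length:
                issues.append(f"box {number}: need 1 <= minimum <= maximum length")
            if box.substitutions < 0:
                issues.append(f"box {number}: substitutions cannot be negative")
        for number, spacer in enumerate(self.spacers, start=1):
            if not 0 <= spacer.min_length <= spacer.max_length:
                issues.append(f"spacer {number}: need 0 <= minimum <= maximum length")
        return issues
```

A simple motif is the special case with a single box and no spacers; each additional box brings one more spacer with it.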

Figure 4 shows how to add boxes to (and remove boxes from) the structured motif.


Figure 4: Adding boxes to and removing boxes from the structured motif


5. Output


An email will be sent to the given address once the algorithm finishes. This email contains information about the input parameters and a link to a web page, as shown in Figure 5.


Figure 5: Example of the email content

A sample output page is shown in Figure 6, below. This page contains a link to the motif finder's output and a table listing the families of motifs. Each entry contains a logo depicting the PWM (Position Weight Matrix) of the family, the p-value of the root and a link to a file containing the list of motifs in the family and the PWM itself.


Figure 6: Sample output page

The families obtained are displayed in a number of pages, and these can be viewed by using the links shown in Figure 7, below.


Figure 7: Jumping between pages of results

Each of the PWMs in this output file can be compared to the TFBS contained in the YEASTRACT database, by following the steps depicted in Figure 8:

  1. Select a family's PWM in the column labelled "Select";
  2. Select the parameters to be used in the alignment of the PWMs;
  3. Press the button labelled "Match!", to compare the selected PWM with the database's TFBS.

The default metric is "Sum of the Squared Distances", and the input PWM can also be trimmed. Trimming removes the columns at the edges of the PWM that have an information content below the selected threshold.
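Trimming can be sketched as follows. This Python example assumes that information content is measured in bits against a uniform background (2 bits for a fully conserved column, 0 bits for a uniform one); the page does not state the exact formula used, so this is an assumption. Columns are lists of probabilities in the order A, C, G, T, and the function names are hypothetical.

```python
import math

def information_content(column):
    """Information content of one PWM column, in bits (uniform background)."""
    return 2.0 + sum(p * math.log2(p) for p in column if p > 0)

def trim_pwm(pwm, threshold):
    """Drop leading and trailing columns whose information content is
    below `threshold`; columns inside the motif are never removed."""
    start, end = 0, len(pwm)
    while start < end and information_content(pwm[start]) < threshold:
        start += 1
    while end > start and information_content(pwm[end - 1]) < threshold:
        end -= 1
    return pwm[start:end]
```

Only the edges are affected, so a weakly conserved position in the middle of a motif survives trimming.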


Figure 8: Comparing the output families with YEASTRACT TFBS

The input PWM is compared with the TFBS of the YEASTRACT database using a procedure described in [2]. First, the TFBS of the YEASTRACT database are converted to PWMs, using the IUPAC rules and assuming that the nucleotides represented by each code are equiprobable. A few examples are shown in Figure 9. The input PWM is then locally aligned with each of the TFBS PWMs using the Smith-Waterman algorithm, with the selected column distance metric used to score the alignment. Four distance metrics were implemented:

  • Sum of the squared distances;
  • Average Kullback-Leibler divergence;
  • Pearson Correlation coefficient;
  • Average log-likelihood ratio.

For the average log-likelihood ratio distance metric, the nucleotide background frequencies were corrected for the GC-content (38%) of S. cerevisiae promoter DNA.
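The four metrics can be sketched in Python as functions of two PWM columns. These are textbook formulations written under several assumptions: columns are probability lists in the order A, C, G, T, a small pseudocount avoids taking the logarithm of zero, and the average log-likelihood ratio treats both columns as built from the same number of sites. The exact formulas used by YEASTRACT follow [2] and may differ in detail; all function names are hypothetical.

```python
import math

def background_from_gc(gc_content):
    """Background frequencies (A, C, G, T) for a given GC fraction."""
    at = (1.0 - gc_content) / 2.0
    gc = gc_content / 2.0
    return [at, gc, gc, at]

def smooth(column, pseudo=1e-3):
    """Add a pseudocount so that no probability is exactly zero."""
    total = sum(column) + 4 * pseudo
    return [(p + pseudo) / total for p in column]

def sum_squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def average_kl(p, q):
    """Symmetrized Kullback-Leibler divergence (0 for identical columns)."""
    p, q = smooth(p), smooth(q)
    kl_pq = sum(a * math.log(a / b) for a, b in zip(p, q))
    kl_qp = sum(b * math.log(b / a) for a, b in zip(p, q))
    return (kl_pq + kl_qp) / 2.0

def pearson(p, q):
    """Pearson correlation; defined here as 0 if either column is uniform."""
    mean_p, mean_q = sum(p) / len(p), sum(q) / len(q)
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_q = sum((b - mean_q) ** 2 for b in q)
    if var_p == 0 or var_q == 0:
        return 0.0
    return cov / math.sqrt(var_p * var_q)

def average_llr(p, q, background):
    """Average log-likelihood ratio, assuming equal site counts."""
    p, q = smooth(p), smooth(q)
    p_under_q = sum(a * math.log(b / g) for a, b, g in zip(p, q, background))
    q_under_p = sum(b * math.log(a / g) for a, b, g in zip(p, q, background))
    return (p_under_q + q_under_p) / 2.0
```

With a GC-content of 38%, `background_from_gc(0.38)` gives frequencies of about 0.31 for A and T and 0.19 for C and G. Note that the first two functions are distances (smaller means more similar), whereas the last two are similarities (larger means more similar).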


Figure 9: Transforming an IUPAC TFBS into a PWM
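The conversion of an IUPAC consensus into a PWM can be sketched in Python as follows. The table lists the standard IUPAC nucleotide codes, and each code is expanded into equal probabilities for the nucleotides it represents, as described above. The function name `iupac_to_pwm` is hypothetical, and columns are ordered A, C, G, T.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

def iupac_to_pwm(site):
    """Convert an IUPAC string into a list of columns [pA, pC, pG, pT],
    giving equal probability to every base allowed by each code."""
    pwm = []
    for code in site.upper():
        bases = IUPAC[code]
        pwm.append([1.0 / len(bases) if base in bases else 0.0
                    for base in "ACGT"])
    return pwm
```

For example, the code R (A or G) becomes the column `[0.5, 0.0, 0.5, 0.0]`, and N becomes a uniform column.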

Each of these metrics compares two columns, evaluating their similarity numerically and either favouring or penalizing their alignment. After the alignments are performed between the input PWM and all the TFBS, they are ordered by score, and the twenty top scoring alignments are displayed in the results table.
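The alignment and ranking step can be sketched in Python as a Smith-Waterman recurrence over PWM columns. Several choices here are assumptions made for illustration: the column score is derived from the sum of squared distances as `1 - SSD` (so identical columns score 1 and completely different ones score -1), the gap penalty is fixed at 1, and only the score is returned rather than the alignment itself. The function names are hypothetical.

```python
def column_score(p, q):
    """Similarity of two columns: positive favours, negative penalizes."""
    return 1.0 - sum((a - b) ** 2 for a, b in zip(p, q))

def local_alignment_score(pwm_a, pwm_b, gap=1.0):
    """Best Smith-Waterman local alignment score between two PWMs."""
    rows, cols = len(pwm_a), len(pwm_b)
    table = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    best = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            table[i][j] = max(
                0.0,                                              # start afresh
                table[i - 1][j - 1]
                + column_score(pwm_a[i - 1], pwm_b[j - 1]),       # align columns
                table[i - 1][j] - gap,                            # gap in pwm_b
                table[i][j - 1] - gap,                            # gap in pwm_a
            )
            best = max(best, table[i][j])
    return best

def rank_matches(query, database, top_n=20):
    """Score `query` against every PWM in `database` (name -> PWM) and
    return the `top_n` best (name, score) pairs, highest score first."""
    scored = [(name, local_alignment_score(query, pwm))
              for name, pwm in database.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]
```

Because the alignment is local, a short input PWM can match well inside a longer TFBS without being penalized for the unaligned flanking columns.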

An example of the output obtained when a PWM of the family of motifs is compared with the YEASTRACT TFBS is shown in Figure 10. The results table contains TFBS that were found to be similar to the input PWM. Each row contains the TFBS, the TF it belongs to, whether the input PWM aligns on the forward or the reverse strand of the TFBS and the alignment of the input PWM with the PWM of the TFBS. An example of one local alignment is shown in Figure 11.
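Comparing against the reverse strand amounts to aligning with the reverse complement of a PWM. The short Python sketch below shows one way to compute it, assuming columns are ordered A, C, G, T. How YEASTRACT handles strands internally is not described here, so this is only an illustration, and the function name is hypothetical.

```python
def reverse_complement_pwm(pwm):
    """Reverse complement of a PWM whose columns are ordered A, C, G, T.

    Reversing the column order reverses the motif; reversing each column
    swaps A with T and C with G.
    """
    return [list(reversed(column)) for column in reversed(pwm)]
```

Applying the function twice returns the original PWM, which is a quick sanity check for the column ordering.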


Figure 10: Top-scoring TFBS when aligned with the input PWM




Figure 11: Example of the alignment of an input PWM with a TFBS


6. References


[1] Mendes ND, Casimiro AC, Santos PM, Sá-Correia I, Oliveira AL, Freitas AT (2006) MUSA: A parameter free algorithm for the identification of biologically significant motifs. Bioinformatics, 22: 2996-3002.

[2] Mahony S, Auron PE, Benos PV (2007) DNA familial binding profiles made easy: Comparison of various motif alignment and clustering strategies. PLoS Comput Biol, 3(3): e61. doi:10.1371/journal.pcbi.0030061
