This exercise illustrates how to generate a pattern that describes a protein family.

Let's assume that we are interested in characterizing the following protein sequence from Saccharomyces cerevisiae: RPE_YEAST.
  • To retrieve the sequence in FASTA format, click on "P46969 in FASTA format" at the end of the page.

  • First, you will perform a protein BLAST search to identify related proteins. Paste the protein sequence and perform a search against SwissProt. Expand the Algorithm parameters window and set the Max target sequences parameter to 50.
    Click here to view pre-computed results.

    The results indicate that there are a considerable number of proteins in SwissProt that are similar in sequence and that have been annotated as Ribulose-phosphate 3-epimerase. They can be considered a family.

  • Then retrieve the sequences of all proteins identified in the BLAST search. First select the first 30 sequences and then click on the "get selected sequences" button. You will be redirected to an Entrez NCBI page.

  • Select FASTA in the 'Display' pull-down menu, and then click the send button with "all to file" selected in the next pull-down menu to save the sequences on your PC.
    Click here for FASTA formatted sequences retrieval.
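Before aligning, it can be useful to check that the saved file really contains the sequences you selected. The following is a minimal sketch of a FASTA reader in plain Python; the function name `parse_fasta` and the toy records are illustrative assumptions, not part of the exercise files.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict {identifier: sequence}.

    The identifier is the first whitespace-delimited word after '>'.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # Fall back to a generated name if the header line is empty.
            name = header[0] if header else f"unnamed_{len(records) + 1}"
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[name].append(line)
    return {n: "".join(chunks) for n, chunks in records.items()}


# Toy input (made-up sequences, for illustration only).
toy_text = """>toy1 first made-up record
MKAFHVD
IMDGHF
>toy2 second made-up record
MKAFHLDVMD
"""
toy_records = parse_fasta(toy_text)
print(len(toy_records), "sequences read")
```

For the file saved in this step you would read its contents with `open(path).read()` and pass the text to `parse_fasta`; the number of records should match the number of sequences you selected.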

  • Then, align them with ClustalW. Part of the resulting multiple alignment should look like the one shown below. You should be able to recognize some conserved regions.




    ' * ' signs indicate positions where the residue is identical in all sequences.
    ' : ' signs indicate positions with a high degree of similarity (amino acids with strongly similar properties).
    ' . ' signs indicate positions with a lower degree of similarity (amino acids with weakly similar properties).
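As a rough illustration of how such a conservation line is derived, the sketch below marks only the columns in which every sequence carries the same residue. It is a simplification: the ' : ' and ' . ' marks depend on amino acid similarity groups that are not reproduced here, and the function name and toy alignment are assumptions made for this example.

```python
def conservation_line(aligned):
    """Return a string with '*' under fully conserved columns, ' ' elsewhere.

    `aligned` is a list of equal-length aligned sequences ('-' marks a gap).
    """
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    marks = []
    for column in zip(*aligned):
        residues = set(column)
        # A column is marked only if it holds one residue type and no gap.
        fully_conserved = len(residues) == 1 and "-" not in residues
        marks.append("*" if fully_conserved else " ")
    return "".join(marks)


# Toy alignment (made-up sequences, for illustration only).
toy_alignment = ["HVDIMD-GHF", "HLDVMDNGHF", "HIDIMDDGYF"]
for seq in toy_alignment:
    print(seq)
print(conservation_line(toy_alignment))
```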


    The block marked with the red rectangle is clearly more conserved than the rest of the aligned segments and therefore corresponds to a protein Motif that could be characteristic of the RPE protein family.
    As you know, protein Motifs can be efficiently represented as simple patterns or regular expressions, if they correspond to sequences that are not too long. Therefore, we will now produce regular expressions that represent the Motif we found, and we will try to find other sequences that also contain it.


  • You now have two options: you can generate your own pattern MANUALLY, or use an AUTOMATIC algorithm to do it.

  • First, we will try MANUALLY:
    Choose some of the conserved segments and deduce patterns or regular expressions that could represent them. To do so, follow these simple, standard rules:
    Basic syntax for regular expression formulation:
    [AS] = A or S allowed
    D = Only D allowed
    x = Any symbol
    x4 = Four arbitrary symbols
    {PG} = Any symbol except P and G
    [FY]2 = Two positions where F or Y is allowed
    x(3,7) = Minimum 3 and maximum 7 arbitrary symbols
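The manual procedure can be sketched in a few lines of Python: each alignment column becomes one pattern element, depending on how many different residues it contains. This is only one possible heuristic, not the method used by any particular tool; the threshold `max_alternatives`, the function names, and the toy block are assumptions made for this example.

```python
def column_to_element(column, max_alternatives=2):
    """Turn one alignment column into a pattern element."""
    residues = sorted(set(column))
    if "-" in residues or len(residues) > max_alternatives:
        return "x"                      # too variable: any symbol
    if len(residues) == 1:
        return residues[0]              # fully conserved position
    return "[" + "".join(residues) + "]"  # a few alternatives allowed


def block_to_pattern(block, max_alternatives=2):
    """Build a pattern from a gap-free block of aligned segments."""
    elements = [column_to_element(col, max_alternatives) for col in zip(*block)]
    merged = []
    run = 0  # length of the current run of consecutive 'x' elements
    for element in elements + [None]:   # None flushes a trailing run
        if element == "x":
            run += 1
            continue
        if run:
            merged.append("x" if run == 1 else f"x({run})")
            run = 0
        if element is not None:
            merged.append(element)
    return "-".join(merged)


# Toy block (made-up segments, for illustration only).
toy_block = ["FHVDIMDGHFVPN", "LHLDVMDNQYAKN", "VHIDIMDDRFSEN"]
print(block_to_pattern(toy_block))
```

Raising `max_alternatives` gives a more specific pattern with more bracketed positions; lowering it gives a more permissive one. Choosing that trade-off by eye is exactly what you do when building the pattern manually.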

  • Then try to find other sequences that contain the same Motif, using ScanProsite.
    You can compare your results with those obtained with the following pattern:
    [FVLI]-H-x-D-[IVM]-M-D-x(2)-[FY]-x(2)-N
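To see what a pattern like this means in practice, the sketch below translates the notation described in this exercise into a Python regular expression and scans a few sequences with it. It covers only the elements listed in the basic syntax rules, not everything ScanProsite accepts. The element before [FY] is taken to mean two arbitrary residues, and the function names and toy sequences are assumptions made for this example.

```python
import re

# One pattern element: a core (x, a residue, [..] or {..}) plus an optional
# repeat written either as (n), (n,m) or as a bare number such as x4.
ELEMENT = re.compile(
    r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\)|(\d+))?"
)


def prosite_to_regex(pattern):
    """Translate a dash-separated pattern into a Python regex string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        core, low, high, bare = match.groups()
        if core == "x":
            regex = "."                        # any symbol
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"    # any symbol except these
        else:
            regex = core                       # residue or [..] set
        if bare:
            regex += "{" + bare + "}"
        elif low and high:
            regex += "{" + low + "," + high + "}"
        elif low:
            regex += "{" + low + "}"
        parts.append(regex)
    return "".join(parts)


def scan(pattern, sequences):
    """Return {name: [(1-based start, matched segment), ...]}."""
    regex = re.compile(prosite_to_regex(pattern))
    return {
        name: [(m.start() + 1, m.group()) for m in regex.finditer(seq)]
        for name, seq in sequences.items()
    }


# Toy sequences (made-up, for illustration only).
toy_sequences = {
    "toy_hit": "AAFHVDIMDGHFVPNKK",
    "toy_miss": "AAFHVDIPDGHFVPNKK",
}
hits = scan("[FVLI]-H-x-D-[IVM]-M-D-x(2)-[FY]-x(2)-N", toy_sequences)
print(hits)
```

Note that `re.finditer` reports non-overlapping matches only, which is enough for a quick check but may differ from the behavior of a dedicated scanning server.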


  • Now, we will try to automatically generate the pattern, using Pratt.
    We will use the same segment of the alignment that we chose before to define patterns manually.

    To trim the alignment and keep only a given segment, you will need an MSA viewer and editor, such as Belvu, Pfaat or JalView.

    If you do not succeed in extracting a portion of the alignment, continue with the sequences in the file segment.fasta.

    The Pratt server takes sets of unaligned sequences as input. Just copy the unaligned sequences from the file segment.fasta and paste them into the Pratt input box.
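If you prefer to prepare the Pratt input programmatically, the trimming step amounts to cutting the same column range out of every aligned sequence and then removing the gap characters. The sketch below assumes the alignment is already available as a dictionary of aligned strings; the function names, column numbers, and toy alignment are illustrative assumptions.

```python
def trim_and_ungap(alignment, start, end):
    """Cut columns start..end (1-based, inclusive) and remove gaps.

    `alignment` maps sequence names to aligned sequences ('-' marks a gap).
    Sequences that contain only gaps in the chosen range are dropped.
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    trimmed = {}
    for name, aligned in alignment.items():
        segment = aligned[start - 1:end].replace("-", "")
        if segment:
            trimmed[name] = segment
    return trimmed


def to_fasta(sequences):
    """Format {name: sequence} as FASTA text, one sequence per record."""
    return "".join(f">{name}\n{seq}\n" for name, seq in sequences.items())


# Toy alignment (made-up sequences, for illustration only).
toy_alignment = {
    "seq1": "MK-FHVDIMDGHF--A",
    "seq2": "MKAFHLDVMD-HFKKA",
}
segment = trim_and_ungap(toy_alignment, 4, 13)
print(to_fasta(segment), end="")
```

The printed text is unaligned FASTA and can be pasted into an input box that expects unaligned sequences.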

    The results should look like those reproduced below:



  • Now, we will try to identify Prosite patterns in your target sequence.
    So far, you have identified protein Motifs that are characteristic of a group of aligned sequences, and found new sequences in SwissProt that also contain the Motif. According to their annotations, most of those sequences seem to have a similar function, so you could consider that you have discovered a Motif that is characteristic of the RPE family of proteins. That Motif is represented by the patterns that you either constructed manually or generated automatically with Pratt.

    It is therefore interesting to check whether a similar pattern (or Motif) has been deposited in Prosite and, if so, whether the pattern identifies a family of proteins that includes the one we have identified.

    You can follow several approaches:
    -- Use ScanProsite to scan the amino acid sequence of RPE_YEAST against the database of patterns and profiles of Prosite.
    -- Search Prosite with the expression "Ribulose-phosphate 3-epimerase".
    -- Find the RPE_YEAST entry in SwissProt and find out whether there are cross references to the Prosite database.

  • You should be able to find that the Ribulose-phosphate 3-epimerase family has been documented in Prosite, and has accession number PS01085.
    Also, Prosite contains information about patterns that are characteristic of this family.

    Following the same idea, we will now try Pfam.
  • We want to know whether there is a profile (HMM profile) in Pfam that identifies the RPE family.
    As before, you can:
    -- Use the RPE_YEAST sequence to perform a Protein Search against Pfam.
    -- Find the RPE_YEAST entry in SwissProt and find out whether there are cross references to the Pfam database.
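Cross references can also be read directly from the text version of a SwissProt entry, where each one is written on a line beginning with DR. The sketch below extracts the references to a given database from such text. The sample lines are hand-written in the style of an entry and are NOT copied from the real RPE_YEAST record: the Prosite entry name and the Pfam accession shown are placeholders, and the function name is an assumption made for this example.

```python
def cross_references(entry_text, database):
    """Return the DR-line fields that refer to `database`.

    Each result is a tuple of the fields that follow the database name.
    """
    refs = []
    for line in entry_text.splitlines():
        if not line.startswith("DR "):
            continue
        # Drop the 'DR' line code and the trailing period, then split fields.
        fields = [f.strip() for f in line[2:].strip().rstrip(".").split(";")]
        if fields and fields[0].lower() == database.lower():
            refs.append(tuple(fields[1:]))
    return refs


# Hand-written sample in the style of an entry (placeholders, not real data):
# EXAMPLE_NAME and PF00000 are stand-ins, only PS01085 and Ribul_P_3_epim
# come from the text of this exercise.
sample_entry = """ID   RPE_YEAST
DR   PROSITE; PS01085; EXAMPLE_NAME; 1.
DR   Pfam; PF00000; Ribul_P_3_epim; 1.
"""
print(cross_references(sample_entry, "PROSITE"))
print(cross_references(sample_entry, "Pfam"))
```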

    You should be able to find that there is a Pfam profile that describes the family Ribul_P_3_epim.
    As expected, the Pfam profile describes a protein segment that is longer than the one described by the Prosite patterns. It is considered a conserved Domain rather than a conserved Motif.



