The goal of this exercise is to show you the principles used in cleaning, clustering and assembling EST sequences.

This is a small example with only 4 sequences; in real projects, all of these steps are automated in pipelines.
Here are 4 mouse sequences that you will analyze and clean.
Copy them into a local file (you will need to edit them manually).

Vector contamination checking and cleaning
  • Analyze each of your sequences for vector contamination using VecScreen at NCBI.
  • Remove the vector contaminations from the sequences using your text editor or word processor.
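The manual removal step can also be sketched in a few lines of Python. This is only an illustration: the function name `remove_vector` and the example coordinates are hypothetical, and it assumes you have noted the contaminated segments as 1-based, inclusive start/end positions read off the VecScreen report.

```python
def remove_vector(seq, vector_spans):
    """Return seq with the given spans cut out.

    vector_spans is a list of (start, end) pairs, 1-based and inclusive
    (assumed to be copied by hand from the VecScreen report).
    """
    kept = []
    pos = 0  # 0-based index of the next base that has not been handled yet
    for start, end in sorted(vector_spans):
        kept.append(seq[pos:start - 1])  # bases before this vector segment
        pos = max(pos, end)              # skip the vector segment itself
    kept.append(seq[pos:])               # bases after the last segment
    return "".join(kept)


# Hypothetical example: vector at the 5' end (1-4) and at the 3' end (9-12).
cleaned = remove_vector("AAAACCCCGGGG", [(1, 4), (9, 12)])
print(cleaned)
```

For the 4 sequences of this exercise, editing by hand is just as fast; a function like this only becomes useful when many sequences have to be cleaned.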
Interspersed and simple repeat masking
  • Open new browser window with
  • Click Services/Repeatmasking
  • Copy and paste the sequences cleaned of vector contamination
  • Select html as the return format
  • Click Submit sequence and wait for the process to finish
  • RepeatMasker summary (pre-computed)
  • Copy the masked sequences into a local file (pre-computed)
Look at the summary page
  • What is the total length of your input sequence?
  • What is the GC level (percentage of G+C nucleotides)?
  • How many bases were masked? What percentage of the input is that?
  • How many LINE elements were found and how many bases do they cover?
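To cross-check the first three answers, the numbers can be recomputed from the sequences themselves. The sketch below is not RepeatMasker's own calculation: the function names are hypothetical, it assumes masked bases are written as N or X, and it assumes the masked sequence has the same length as the unmasked one. The values on the summary page may differ slightly if the program counts ambiguous bases differently.

```python
def gc_percent(seq):
    """Percentage of G+C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def masked_summary(original, masked, mask_chars="NX"):
    """Return (total length, masked bases, percent masked).

    A position counts as masked when it holds a mask character in the
    masked sequence but not in the original one.
    """
    if len(original) != len(masked):
        raise ValueError("original and masked sequences differ in length")
    n_masked = sum(
        1
        for o, m in zip(original.upper(), masked.upper())
        if m in mask_chars and o not in mask_chars
    )
    total = len(original)
    percent = 100.0 * n_masked / total if total else 0.0
    return total, n_masked, percent


# Toy sequences, not the exercise data.
print(gc_percent("ACGTACGTAC"))
print(masked_summary("ACGTACGTAC", "ACGTNNNNAC"))
```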
Check whether the 4 sequences could be in the same cluster

Here, 2 sequences are placed in the same cluster if they are more than 96% identical over a window of 150 bases.
  • Use a pairwise comparison program such as bl2seq.
  • Can you already tell something about the result of the clustering?
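The clustering criterion can be made concrete with a small sketch. It assumes you have taken one pairwise alignment from the comparison program and written it as two strings of equal length with "-" for gaps; the function names are hypothetical, and the choice not to count masked bases (N) as matches is an assumption rather than part of the stated rule.

```python
def best_window_identity(aln_a, aln_b, window=150):
    """Highest percent identity found in any run of `window` alignment columns.

    Returns None when the alignment is shorter than the window.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    # 1 for an identical, unambiguous base in both sequences, otherwise 0.
    matches = [
        1 if a == b and a in "ACGT" else 0
        for a, b in zip(aln_a.upper(), aln_b.upper())
    ]
    if len(matches) < window:
        return None
    current = sum(matches[:window])
    best = current
    # Slide the window one column at a time, updating the match count.
    for i in range(window, len(matches)):
        current += matches[i] - matches[i - window]
        best = max(best, current)
    return 100.0 * best / window


def same_cluster(aln_a, aln_b, window=150, threshold=96.0):
    """Apply the rule: more than `threshold` percent identity over `window` bases."""
    identity = best_window_identity(aln_a, aln_b, window)
    return identity is not None and identity > threshold


# Toy alignment: 200 identical columns.
a = "ACGT" * 50
print(same_cluster(a, a))
```

An alignment shorter than 150 columns can never satisfy the rule, which is why the function returns None in that case instead of a percentage.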
Use sequence assembly software to assemble sequence reads into contiguous sequences.
  • Open a new browser window to access the cap3 program
  • Cut and paste the 4 previously masked sequences (the corresponding file is on your Desktop) into the text area
  • Click the SUBMIT button
  • Click here to view cap3 standard output
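To get an intuition for what "assembling reads into contiguous sequences" means, here is a deliberately simplified sketch that joins two reads when the end of one exactly matches the start of the other. It is not how cap3 works internally: real assemblers tolerate mismatches and gaps in the overlap and handle many reads at once. The function names and the toy reads are hypothetical.

```python
def longest_overlap(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when no exact overlap of at least `min_overlap` bases exists.
    """
    max_len = min(len(left), len(right))
    for k in range(max_len, min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def merge_reads(left, right, min_overlap=3):
    """Join two reads into one contig, or return None if they do not overlap."""
    k = longest_overlap(left, right, min_overlap)
    if k == 0:
        return None
    return left + right[k:]  # keep the shared overlap only once


# Toy reads that share the 4 bases TGCA.
contig = merge_reads("ACGTTGCA", "TGCAGGTT", min_overlap=4)
print(contig)
```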

