The goal of this exercise is to show you the principles used in the cleaning,
clustering and assembling of EST sequences.
This is a small example with 4 sequences; in practice, all of these steps are automated in pipelines.
Here are 4 mouse sequences
that you will analyze and clean.
Copy them into a local file (you will need to edit them manually).
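The manual editing mentioned above mostly amounts to cutting out the stretch of a sequence that a vector screen flags as contamination. The following is a minimal sketch in Python of that cut, assuming the flagged region is given as 1-based inclusive coordinates (an assumption about how you read the report, not something this exercise specifies). The function name `remove_vector` is hypothetical, and keeping the longer flanking piece is only an illustrative rule of thumb.

```python
def remove_vector(seq, vec_start, vec_end):
    """Cut a flagged vector region out of seq and keep the longer remainder.

    vec_start and vec_end are assumed to be 1-based, inclusive coordinates.
    """
    if not (1 <= vec_start <= vec_end <= len(seq)):
        raise ValueError("vector coordinates fall outside the sequence")
    left = seq[:vec_start - 1]   # bases before the vector match
    right = seq[vec_end:]        # bases after the vector match
    # Vector usually sits at one end of an EST, so one side is often empty.
    return left if len(left) >= len(right) else right


if __name__ == "__main__":
    toy = "AAAACCCCGGGGTTTTACGT"          # made-up 20-base sequence
    print(remove_vector(toy, 1, 8))       # vector at the 5' end
    print(remove_vector(toy, 15, 20))     # vector at the 3' end
```

If a match falls in the middle of a sequence, it is worth inspecting it by hand rather than trusting a rule like this one.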
Vector contamination checking and cleaning
- Analyze each of your sequences for vector contamination using VecScreen at
- Remove the vector contamination from the sequences using your text editor
or word processor.
Interspersed and simple repeat masking
- Open a new browser window at http://www.repeatmasker.org/
- Click Services/Repeatmasking
- Copy and paste the sequences cleaned of vector contamination
- Select html as the return format
- Click Submit sequence and wait for the process to finish
- RepeatMasker summary (pre-computed)
- Copy the masked sequences into a local file (pre-computed)
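Once the masked sequences are in a local file, the headline figures of a masking summary (total length, bases masked, GC level) can be cross-checked with a few lines of Python. This is a rough sketch under stated assumptions: masked bases are taken to appear as `N` or `X`, and the GC percentage is computed over unmasked bases only, so it may differ slightly from a figure computed on the original, unmasked sequence. The helper names `parse_fasta` and `masking_stats` are hypothetical.

```python
def parse_fasta(text):
    """Return {name: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            # Fall back to a generated name if the header is empty.
            name = fields[0] if fields else "seq%d" % (len(records) + 1)
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def masking_stats(seq, mask_chars="NX"):
    """Length, masked bases and GC level of one masked sequence."""
    total = len(seq)
    masked = sum(1 for base in seq if base.upper() in mask_chars)
    gc = sum(1 for base in seq if base.upper() in "GC")
    unmasked = total - masked
    return {
        "length": total,
        "masked": masked,
        "masked_pct": 100.0 * masked / total if total else 0.0,
        # Assumption: GC level is taken over unmasked bases only.
        "gc_pct": 100.0 * gc / unmasked if unmasked else 0.0,
    }


if __name__ == "__main__":
    toy = ">est1 made-up example\nACGTNNNNGC\nGGCC\n"
    for name, seq in parse_fasta(toy).items():
        print(name, masking_stats(seq))
```

To run it on your own file, read the file contents into a string and pass that to `parse_fasta`.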
Look at the summary page
- What is the total length of your input sequence?
- What is the GC level (percentage of G+C nucleotides)?
- How many bases were masked? What percentage is that?
- How many LINE elements were found and how many bases do they cover?
Check if the 4 sequences could be in the same cluster
i.e. 2 sequences are clustered if they are more than 96% identical over a window of 150 bases.
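The clustering criterion used in this exercise (more than 96% identity over a window of 150 bases) can be made concrete with a sliding-window check on a pairwise alignment. The sketch below is in Python and assumes you already have the two aligned strings, of equal length and with `-` for gaps, for instance copied from a pairwise alignment report. The function names are hypothetical, and treating gaps and masked positions (`N`, `X`) as mismatches is an assumption, not a rule stated here.

```python
def window_identities(aln_a, aln_b, window=150):
    """Percent identity of every full window along a pairwise alignment."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    # 1 where both sequences carry the same real base, 0 otherwise.
    matches = [
        int(a == b and a not in "-NX")
        for a, b in zip(aln_a.upper(), aln_b.upper())
    ]
    if len(matches) < window:
        return []  # alignment too short to contain a full window
    count = sum(matches[:window])
    result = [100.0 * count / window]
    for i in range(window, len(matches)):
        # Slide the window by one column: add the new one, drop the oldest.
        count += matches[i] - matches[i - window]
        result.append(100.0 * count / window)
    return result


def could_cluster(aln_a, aln_b, window=150, min_identity=96.0):
    """True if any window is strictly more than min_identity percent identical."""
    return any(pct > min_identity
               for pct in window_identities(aln_a, aln_b, window))


if __name__ == "__main__":
    a = "ACGT" * 50                      # made-up 200-base sequence
    b = list(a)
    for i in range(0, len(b), 10):       # introduce a mismatch every 10 bases
        b[i] = "A" if b[i] != "A" else "C"
    b = "".join(b)
    print(could_cluster(a, a), could_cluster(a, b))
```

Real clustering tools work from their own alignments and scoring, so this only illustrates the threshold itself.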
- Use a pairwise comparison program such as bl2seq.
- Can you already tell something about the result of the clustering?
Use sequence assembly software to assemble sequence reads into contiguous sequences.
- Open a new browser window to access the cap3 program
- Cut and paste the 4 previously masked sequences (the corresponding file is on your Desktop) into the text area
- Click the SUBMIT button
- Click here to view cap3 standard output
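To get a feel for what assembling reads into a contiguous sequence means, here is a deliberately simplified Python sketch that merges two reads when the end of one exactly matches the start of the other. It is not how cap3 works internally: real assemblers use error-tolerant alignments and handle both strands. The function names and the `min_overlap` parameter are hypothetical.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest exact overlap: suffix of left, prefix of right."""
    longest_possible = min(len(left), len(right))
    for size in range(longest_possible, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left, right, min_overlap=3):
    """Merge two reads into one contig, or return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    # Keep the shared bases once and append the rest of the right read.
    return left + right[size:]


if __name__ == "__main__":
    print(merge_reads("ACGTACGGA", "CGGATTTC"))  # reads sharing "CGGA"
    print(merge_reads("AAAA", "CCCC"))           # no overlap
```

Comparing the contigs in the cap3 output with your pairwise comparisons shows how overlaps like this are chained across more than two sequences.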