I'm working with a fungal genome that is at least 40% repeats, so I'm trying to follow the advanced repeat construction protocol.
So far, so good, but I have doubts about how to build the protein database described at the end of the page. As I understand it, the steps are:
1. Get the SwissProt and RefSeq fungal proteins.
2. Run tblastn with the proteins from step 1 against the NCBI EST database and keep the proteins with matches.
3. Run blastp with the output of step 2 against the transposase protein database and remove the proteins with matches.
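The filtering in steps 2 and 3 can be sketched as follows. This is only a minimal illustration, assuming the BLAST results were written in tabular format (-outfmt 6), where the first column is the query ID, and that this ID equals the first whitespace-delimited token of each FASTA header. The function names and the toy sequences and hits are hypothetical and not part of the protocol.

```python
def query_ids_with_hits(blast_tab_lines):
    """Collect query IDs (first column) from BLAST tabular output (-outfmt 6)."""
    ids = set()
    for line in blast_tab_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        ids.add(line.split("\t")[0])
    return ids


def filter_fasta(fasta_lines, ids, keep=True):
    """Return the FASTA lines of records whose ID is in `ids` (keep=True)
    or is not in `ids` (keep=False)."""
    out = []
    selected = False
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            parts = line[1:].split()
            record_id = parts[0] if parts else ""
            selected = (record_id in ids) == keep
        if selected:
            out.append(line)
    return out


# Toy data (hypothetical): three proteins, two with EST hits,
# one of which also hits the transposase database.
proteins = [
    ">protA example protein", "MKTAYIAK",
    ">protB example protein", "MSLLTEVE",
    ">protC example protein", "MGGSHHHH",
]
est_hits = [
    "protA\test1\t98.0\t8\t0\t0\t1\t8\t10\t33\t1e-20\t50.0",
    "protB\test7\t95.0\t8\t0\t0\t1\t8\t40\t63\t1e-18\t47.0",
]
transposase_hits = [
    "protB\ttpase3\t90.0\t8\t1\t0\t1\t8\t1\t8\t1e-15\t40.0",
]

# Step 2: keep only the proteins with an EST match.
step2 = filter_fasta(proteins, query_ids_with_hits(est_hits), keep=True)
# Step 3: drop the proteins that match a transposase.
step3 = filter_fasta(step2, query_ids_with_hits(transposase_hits), keep=False)
print("\n".join(step3))
```

In practice the hit lists would first be filtered by e-value or identity; those thresholds are not set here.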
From here on, though, I'm a bit lost. The protocol continues:
"Finally, the rice protein sequences were compared with verified
transposons (such as Pack-MULEs) in the rice genome. If the protein
sequence matched a transposon perfectly and was the only perfect match
in the genome, the relevant protein sequence was excluded. Although
elements such as Pack-MULEs contain true gene sequences, the annotation
(the protein sequence in the database) often extends to non-gene
sequences such as terminal inverted repeat or sub-terminal repeat, which
are not true plant proteins and would cause great complications. As a
result, it is essential to exclude them."
Are the proteins kept at the end of step 3 the 'protein database'?
Could you provide a bit more detail on how to tackle this?
Thank you in advance,
Xabier Vázquez-Campos, PhD
Research Associate
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA