question about regular expression for parent feature

Monica Poelchau
Hi Tripal people,

I have a set of protein sequences that I would like to associate with the transcript sequences that they are derived from. The IDs (unique names) follow this format:

transcript: SGGG000001-RA
protein: SGGG000001-PA

Has anyone dealt with a similar situation before (the above format doesn't seem to be that uncommon), and what regex did you use in the 'Import a multi-FASTA file' page under 'Regular expression for the parent' that would allow the protein ID to be associated with the transcript ID? 

Thanks!

Monica
--
Monica Poelchau
Postdoctoral Fellow, University of Maryland
Research Affiliate, Georgetown University
[hidden email]
[hidden email]

_______________________________________________
Gmod-tripal mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-tripal
Re: question about regular expression for parent feature

Stephen Ficklin-2
Hi Monica,

The FASTA importer's regular expressions can only match information that is already in the definition lines of the FASTA file. So I would suggest editing your protein FASTA file to add the transcript ID to each definition line. Then you can use a regular expression in the loader to pull out the parent.

If you are comfortable running UNIX one-liners, a short perl command may be able to do this for you. For example, if your definition lines contain only the protein name (like this):

>SGGG000001-PA

Then the following command can transform all of the definition lines in your FASTA file to include the parent ID. This command assumes that all proteins are suffixed with '-PA', that all transcripts are suffixed with '-RA', and that the protein ID is the only text in the definition line:

perl -pe 's/^>(.*?)-PA$/>$1-PA mRNA:$1-RA/' [filename] > [new filename]
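
For anyone who would rather not use perl, the same rewrite can be sketched in Python. This is only an illustration under the assumptions stated above ('-PA' proteins, '-RA' transcripts, nothing else on the definition line); the function names add_parent and rewrite_fasta are made up for this example and are not part of Tripal.

```python
import re

# Assumed convention from this thread: protein IDs end in -PA, transcript
# IDs end in -RA, and the ID is the only text on the definition line.
DEFLINE = re.compile(r"^>(.*?)-PA\s*$")


def add_parent(line):
    """Append ' mRNA:<transcript ID>' to a protein definition line.

    Sequence lines and definition lines that do not end in -PA are
    returned unchanged.
    """
    match = DEFLINE.match(line)
    if match is None:
        return line
    stem = match.group(1)
    return f">{stem}-PA mRNA:{stem}-RA\n"


def rewrite_fasta(in_path, out_path):
    """Copy a FASTA file, rewriting protein definition lines on the way."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(add_parent(line))
```

Calling rewrite_fasta with the input and output file names writes a new file and leaves the original untouched, so the result can be inspected before loading.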

You could then use this regular expression in the FASTA loader to make the association:

^>.*?mRNA:(.*?)$
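
Before running the loader, it may be worth checking what that expression captures from a rewritten definition line. The sketch below uses Python's re module as a stand-in; the loader itself does not run Python, so treat this as a sanity check of the pattern only. The helper name parent_id is hypothetical.

```python
import re

# The parent expression suggested above, copied as-is.
PARENT_RE = re.compile(r"^>.*?mRNA:(.*?)$")


def parent_id(defline):
    """Return the text captured after 'mRNA:', or None if there is no match."""
    match = PARENT_RE.search(defline.rstrip("\r\n"))
    return match.group(1) if match else None


print(parent_id(">SGGG000001-PA mRNA:SGGG000001-RA"))  # SGGG000001-RA
print(parent_id(">SGGG000001-PA"))                     # None
```

A definition line without the 'mRNA:' tag yields no parent, which is why the protein file has to be rewritten first.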


Hope that helps,
Stephen

On 1/24/2014 1:43 PM, Monica Poelchau wrote:
Hi Tripal people,

I have a set of protein sequences that I would like to associate with the transcript sequences that they are derived from. The IDs (unique names) follow this format:

transcript: SGGG000001-RA
protein: SGGG000001-PA

Has anyone dealt with a similar situation before (the above format doesn't seem to be that uncommon), and what regex did you use in the 'Import a multi-FASTA file' page under 'Regular expression for the parent' that would allow the protein ID to be associated with the transcript ID? 

Thanks!

Monica
--
Monica Poelchau
Postdoctoral Fellow, University of Maryland
Research Affiliate, Georgetown University
[hidden email]
[hidden email]

