Analyzing Targeted Resequencing data with Galaxy


Analyzing Targeted Resequencing data with Galaxy

Lali
Hi!
I am having problems with my sequencing results, but I am a newbie at this; so I am thinking there is something wrong with my analysis. So far, I've tried Galaxy and CLC Workbench, but with CLC I could not align to the whole genome, only to individual chromosomes (maybe there is a way, but by the time the trial ended I had not found it).

I used SureSelect capture kit and did single end sequencing on an Illumina. The files the lab sent me are FastQ Illumina 1.5 files, my samples were indexed, and I got a series of files each representing an Index.

What would be the standard workflow for this kind of data?
Which tools/settings?

Does anyone have an example Galaxy workflow for preparing (clipping adapters, quality trimming) and mapping Targeted Resequencing Data?

Is there a way to obtain a coverage report through Galaxy?

Is it possible to ignore/discard the reads mapped when the coverage is below a certain threshold?

I know, I know, a lot of things, but I am very lost.
Any help is appreciated.

L
___________________________________________________________
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using "reply all" in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

  http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

  http://lists.bx.psu.edu/

MAF

Laura Iacolina
Dear all,
I’m analysing SNP data for the first time. I tried the few software packages I found in the literature, but they can only manage small datasets. I am currently trying the “genetics” package in R, but the Geno function handles one marker at a time. Considering I have to analyse 200 samples with 50K markers, is there any way to tell R to analyse each SNP one after the other?
 
Thank you very much for the help.
 
Laura
 
 


Re: MAF

fubar
Laura,

What kind of data do you have, and what would you like to achieve?

There are some Galaxy wrappers for plink
(http://pngu.mgh.harvard.edu/~purcell/plink/) that may be useful for
some kinds of analysis available in the rgenetics tools if you have
linkage pedigree genotype and map files.




--
Ross Lazarus MBBS MPH
Associate Professor, HMS; Director of Bioinformatics, Channing Laboratory;
181 Longwood Ave., Boston MA 02115, USA. Tel: +1 617 505 4850
Head, Medical Bioinformatics, BakerIDI;
PO Box 6492, St Kilda Rd Central; Melbourne, VIC 8008, Australia; Tel:
+61 385321444


Re: MAF

Eccles, David
On Tue, Apr 5, 2011 at 5:19 AM, Laura Iacolina <[hidden email]> wrote:
> Considering I have to analyse 200 samples with 50K markers is there any way
> to tell R to analyse each SNP one after the other?

From: Ross [mailto:[hidden email]]
> There are some Galaxy wrappers for plink
> (http://pngu.mgh.harvard.edu/~purcell/plink/) that may be useful for
> some kinds of analysis available in the rgenetics tools if you have
> linkage pedigree genotype and map files.

I would also advise using plink for this. Calculating SNP marker statistics
[1] is one of the things that it has been designed to do. The main
problem is getting data into a format supported by plink, either linkage (one
line per individual), or transposed pedigree (one line per marker). There are
details on these formats in the plink documentation [2].

[1] http://pngu.mgh.harvard.edu/~purcell/plink/summary.shtml#freq
[2] http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#tr
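
To illustrate what "one marker after the other" means for a transposed pedigree file, here is a minimal Python sketch that computes a minor allele frequency per line. It is not how plink is implemented. It assumes biallelic SNPs, whitespace-separated fields (chromosome, SNP id, genetic distance, position, then two alleles per individual) and "0" as the missing-allele code; the function name is hypothetical.

```python
from collections import Counter


def maf_from_tped_line(line, missing="0"):
    """Return (snp_id, minor allele frequency) for one tped marker line.

    Assumption: the first four fields are chromosome, SNP id, genetic
    distance and base-pair position; every following field is one allele.
    """
    fields = line.split()
    snp_id = fields[1]
    # Drop missing calls so they do not count towards the total.
    alleles = [a for a in fields[4:] if a != missing]
    counts = Counter(alleles)
    if not counts:
        return snp_id, None   # no genotypes called for this marker
    if len(counts) == 1:
        return snp_id, 0.0    # monomorphic marker
    total = sum(counts.values())
    # For a biallelic SNP the rarer allele is the minor allele.
    return snp_id, min(counts.values()) / total


def maf_for_all_markers(tped_lines):
    """Process the markers one line at a time, skipping blank lines."""
    return [maf_from_tped_line(line) for line in tped_lines if line.strip()]
```

Because each marker sits on its own line, 50K markers for 200 samples can be streamed through a loop like this without loading everything into memory at once.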

--
David Eccles (gringer)


Re: Analyzing Targeted Resequencing data with Galaxy

Anton Nekrutenko
In reply to this post by Lali
Lali:

In your case the workflow for capture re-sequencing should look like this:

1. QC data (groom fastq files and plot quality distribution)
2. Map the reads (use bwa)
3. Generate and filter pileup
4. Intersect pileup with coordinates of SureSelect baits.
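
As a rough illustration of steps 3 and 4 outside Galaxy, the Python sketch below keeps only pileup lines that fall inside a bait interval and meet a minimum depth. It assumes tab-separated pileup lines with chromosome in column 1 and a 1-based position in column 2, and bait coordinates in BED format (0-based start, exclusive end). The depth column index is an assumption (column 4, as in the simple six-column pileup) and all function names are hypothetical; this is not the code behind the Galaxy tools.

```python
import bisect
from collections import defaultdict


def load_baits(bed_lines):
    """Return {chrom: sorted, merged list of (start, end)} from BED lines."""
    raw = defaultdict(list)
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        raw[chrom].append((int(start), int(end)))
    merged = {}
    for chrom, intervals in raw.items():
        intervals.sort()
        out = [intervals[0]]
        for start, end in intervals[1:]:
            last_start, last_end = out[-1]
            if start <= last_end:
                # Overlapping or adjacent baits become one interval.
                out[-1] = (last_start, max(last_end, end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


def in_bait(baits, chrom, pos1):
    """True if the 1-based position pos1 lies inside a bait on chrom."""
    intervals = baits.get(chrom, [])
    pos0 = pos1 - 1  # BED coordinates are 0-based
    # Index of the last interval whose start is <= pos0.
    i = bisect.bisect_right(intervals, (pos0, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos0 < intervals[i][1]


def filter_pileup(pileup_lines, baits, min_depth=0, depth_col=3):
    """Yield pileup lines on target with depth >= min_depth.

    depth_col is a 0-based column index and is an assumption about the
    pileup layout; adjust it to match the file actually produced.
    """
    for line in pileup_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= depth_col:
            continue
        chrom, pos = fields[0], int(fields[1])
        if int(fields[depth_col]) >= min_depth and in_bait(baits, chrom, pos):
            yield line
```

Setting min_depth is one way to discard positions whose coverage is below a chosen threshold, which was one of the original questions.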

However, before you dive in please understand basic Galaxy functionality by taking a look at http://usegalaxy.org/galaxy101 and watching *all* Illumina-related Galaxy quickies (black boxes on the front page on Galaxy). Next, take a look at http://usegalaxy.org/heteroplasmy.

Note that we are working on bringing "industrial-strength" diploid genotyping functionality to Galaxy in the next two to three months. It will include more sophisticated genotypers, recalibration and realignment tools, and novel visualization approaches.

Thanks for using Galaxy.

anton
galaxy team  




Anton Nekrutenko
http://nekrut.bx.psu.edu
http://usegalaxy.org





Re: MAF

Anton Nekrutenko
In reply to this post by Laura Iacolina
Laura:

SNP identification and analysis is a very complex subject, and without knowing what you are trying to do it is very difficult to point you in the right direction. Perhaps a good place to start would be the supplement to last year's report from the 1000 Genomes Consortium (Nature. 467(7319): p. 1061-1073). You can perform some of the steps through Galaxy; others are still in development.

Thanks!

anton
galaxy team



Anton Nekrutenko
http://nekrut.bx.psu.edu
http://usegalaxy.org





Re: Analyzing Targeted Resequencing data with Galaxy

Anton Nekrutenko
In reply to this post by Anton Nekrutenko
Lali:

Please always CC the mailing list when you reply.

> My only problem with Galaxy is that I have to keep on clearing my cache in order to get the history to display correctly, is there another way of solving this issue?

Which browser/OS are you using?

Thanks,

anton
galaxy team

On Apr 5, 2011, at 11:25 AM, Lali wrote:

Thanks so much for the tips Anton!
I am very excited about the newer developments.
I did watch the quickies and they were very useful for a beginner like me, I actually did my first try at the alignment by following the Illumina single-end tutorial video step by step, but you need to watch the paired-end too, for some of the first steps, which are explained better on that one.
I have been playing around a lot with Galaxy, and I have several workflows, my department just started doing sequencing, so we don't have standard procedures set in place. I was assigned to evaluate Galaxy and CLC, and so far CLC has not impressed me, except for the fact that it can generate reports easily.
I think Galaxy is the way to go for me (us, if I can convince them to run a local server), since I am not a bioinformatician, and just the fact that you can queue up actions and just walk away is fantastic (amongst other things).
But because I am a beginner, I am not 100% sure of the settings I have chosen and my data is not looking too good so far. I am having a bioinformatician come over and help me on Thursday, and I think your tips will be of help.
My only problem with Galaxy is that I have to keep on clearing my cache in order to get the history to display correctly, is there another way of solving this issue?

Best regards,

L


Anton Nekrutenko
http://nekrut.bx.psu.edu
http://usegalaxy.org








Re: Analyzing Targeted Resequencing data with Galaxy

Lali
Ohh sorry about that!
I am using both Windows XP and Ubuntu and I usually use Google Chrome.



Re: Analyzing Targeted Resequencing data with Galaxy

Mike Dufault

Hi all,

 

Like many people on this e-mail chain, I have been looking for advice on how to process Exome data. Below, I have described in detail what I have done with the hope of getting some clarification. Hopefully it will be helpful to many of us!

 

I have SureSelect Exome captured data. The data was delivered to me as two separate files (/1) & (/2). Each file has ~33 million reads; 7.2 GB each. I am looking for SNPs from a family with cancer. Eventually I plan to compare the data from multiple members of the same family to find a related disease SNP.

 

Below is the workflow that I used to process my data. I adapted it from the screencast titled "Mapping Illumina Reads: Paired Ends Example." I used all of the same default parameters as in the screencast.

 

At the end of step 13, I had ~4,700,000 SNPs. This seemed like a lot so in step 14, I filtered on column 7 (c7) which I believe is the Quality SNP value. I set the filter as C7>=1 to remove all of the 0 (zero) values for Quality SNP. I figured that if they have a value of zero, they must not be real SNPs. This left me with ~180,000 SNPs.

 

1: Get Data: Illumina 1.3+ file (/1)
2: Get Data: Illumina 1.3+ file (/2)
3: FASTQ Groomer on data 1
4: FASTQ Groomer on data 2
5: FASTQ Summary Statistics on data 3
6: FASTQ Summary Statistics on data 4
7: Box plot on data 5
8: Box plot on data 6
9: Map with Bowtie for Illumina on data 4 and data 3: mapped reads
10: Filter Sam on data 9
11: SAM-to-BAM on data 10: converted to BAM
12: Generate pileup on data 11: converted pileup
13: Filter pileup on data 12
14: Filter data on 13 (c7>=1)
15: Sort on data 14 (c7; descending order)
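
For anyone who wants to reproduce the last two steps of the list outside Galaxy, here is a minimal Python sketch of "keep rows where column 7 is at least 1, then sort by column 7 in descending order" on tab-separated lines. Whether column 7 really holds the SNP quality depends on how the pileup was generated, so the column number is left as a parameter; the function name is hypothetical and this is not the code behind the Galaxy Filter and Sort tools.

```python
def filter_and_sort(lines, col=7, min_value=1):
    """Keep rows whose 1-based column `col` is >= min_value, sorted descending.

    Rows that are too short or hold a non-numeric value in that column
    are skipped rather than raising an error.
    """
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < col:
            continue
        try:
            value = float(fields[col - 1])
        except ValueError:
            continue
        if value >= min_value:
            rows.append((value, fields))
    # Highest value first, mirroring the descending sort in the last step.
    rows.sort(key=lambda row: row[0], reverse=True)
    return [fields for _, fields in rows]
```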

 

First, if anyone has ideas on how to improve the workflow, I would be open to suggestions; especially from people experienced with Galaxy.

 

Second, I am concerned that many/most of the SNPs are known. Should I filter my data against the known SNPs in dbSNP? If so, how can I do this in Galaxy (in Bowtie?)

 

Third, as suggested in the screencast, I did not trim or filter my FASTQ Groomed data because I was interested in SNPs and could filter on quality later in the workflow. Would implementing a filtering step on phred quality (~20) at this stage save me the step of filtering later on? Currently it takes multiple hours (~16) to process the data from start to finish. Would filtering at this step reduce the processing time? Presumably, there would be less data to process. I do this on the AWS Cloud and time is money!
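
To make the idea of an early phred filter concrete, here is a minimal Python sketch that drops reads whose mean base quality is below 20. It assumes four-line FASTQ records already groomed to Sanger encoding (ASCII offset 33). Using the mean quality is only one possible criterion, the function names are hypothetical, and Galaxy's own FASTQ filtering tools may apply different rules.

```python
def mean_phred(quality_string, offset=33):
    """Mean phred score of a quality string (Sanger offset 33 assumed)."""
    return sum(ord(c) - offset for c in quality_string) / len(quality_string)


def filter_fastq(lines, min_mean_q=20, offset=33):
    """Yield (header, sequence, plus, quality) for reads passing the cutoff.

    `lines` is any iterable of FASTQ lines; an incomplete trailing record
    is silently ignored.
    """
    it = iter(lines)
    # Consume the input four lines at a time: header, sequence, '+', quality.
    for header, seq, plus, qual in zip(it, it, it, it):
        qual = qual.rstrip("\n")
        if qual and mean_phred(qual, offset) >= min_mean_q:
            yield header.rstrip("\n"), seq.rstrip("\n"), plus.rstrip("\n"), qual
```

For paired-end data the two files would have to be filtered together so that mates stay in sync; this sketch does not handle that.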

 

Fourth, when using Galaxy on the AWS cloud, does adding additional cores or using High-CPU instances (or both) shorten the time to process the data? When I set up extra cores, it appeared that some of them were idle, and I don't want to pay for idle cores. If anyone could share information on how best to manage the cloud, it would be appreciated.

 

Finally, what is the difference between “stopping” an instance and “terminating” an instance on the cloud? Would I still get charged by AWS if I just stop an instance? Any clarification in this area would also be much appreciated. Again, time is money!

I hope this helps many of us!

 

Unfortunately, I will not be in Pitt to ask these questions in person.

 

Thanks in advance!!!

 

Mike


Re: Analyzing Targeted Resequencing data with Galaxy

Anton Nekrutenko
Mike:

Which parameters did you use at step 13? (If you used the main site to perform these analyses, you can share your history with me.)

Thanks,

anton


On Apr 5, 2011, at 2:22 PM, Mike Dufault wrote:

Hi all,

 

Like many people on this e-mail chain, I have been looking for advice on how to process Exome data. Below, I have described in detail what I have done with the hope of getting some clarification. Hopefully it will be helpful to many of us!
 
I have SureSelect Exome captured data. The data was delivered to me as two separate files (/1) & (/2). Each file has ~33 million reads; 7.2 GB each. I am looking for SNPs from a family with cancer. Eventually I plan to compare the data from multiple members of the same family to find a related disease SNP.
 
Below is the workflow that I used to process my data. I adapted it from the screencast titled "Mapping Illumina Reads: Paired Ends Example." I used all of the same default parameters as in the screencast.

 

At the end of step 13, I had ~4,700,000 SNPs. This seemed like a lot so in step 14, I filtered on column 7 (c7) which I believe is the Quality SNP value. I set the filter as C7>=1 to remove all of the 0 (zero) values for Quality SNP. I figured that if they have a value of zero, they must not be real SNPs. This left me with ~180,000 SNPs.
 
1: Get Data: Illumina 1.3+ file (/1)
2: Get Data: Illumina 1.3+ file (/2)
3: FASTQ Groomer on data 1
4: FASTQ Groomer on data 2
5: FASTQ Summary Statistics on data 3
6: FASTQ Summary Statistics on data 4
7: Box plot on data 5
8: Box plot on data 6
9: Map with Bowtie for Illumina on data 4 and data 3: mapped reads
10: Filter Sam on data 9
11: SAM-to-BAM on data 10: converted to BAM
12: Generate pileup on data 11: converted pileup
13: Filter pileup on data 12
14: Filter data on 13 (c7>=1)
15: Sort on data 14 (c7; descending order)
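
For readers who want to reproduce steps 14 and 15 outside Galaxy, here is a minimal Python sketch of "keep rows where c7 >= 1, then sort on c7 in descending order". It assumes whitespace-separated rows laid out like the step 15 output Mike posts in this thread, with column 7 holding the SNP quality. Two of the rows below are copied from that output; the two "chrTest" rows are made up purely to show the filter and the sort doing something. The function name is hypothetical and is not a Galaxy API.

```python
# Sketch of workflow steps 14-15: filter on c7 >= 1, then sort on c7 descending.
# Column numbers are 1-based to match Galaxy's "c7" notation.

SNP_QUALITY_COLUMN = 7

EXAMPLE_ROWS = """\
chr1 100316588 100316589 A G 255 255 60 141 0 0 137 0 137
chrTest 10 11 A C 40 0 60 12 6 6 0 0 12
chrTest 20 21 G T 90 87 60 30 0 0 10 20 30
chr1 100672059 100672060 T C 255 255 60 225 1 220 0 0 221
"""
# The two chr1 rows come from the output posted in this thread.
# The two chrTest rows are invented for illustration only.


def filter_and_sort(lines, min_snp_quality=1):
    """Return rows (as lists of strings) with c7 >= min_snp_quality, best first."""
    rows = [line.split() for line in lines if line.strip()]
    index = SNP_QUALITY_COLUMN - 1
    kept = [row for row in rows if int(row[index]) >= min_snp_quality]
    # sorted() is stable, so rows with equal quality keep their input order.
    return sorted(kept, key=lambda row: int(row[index]), reverse=True)


if __name__ == "__main__":
    for row in filter_and_sort(EXAMPLE_ROWS.splitlines()):
        print("\t".join(row))
```

On a full exome the same logic works line by line, but Galaxy's own Filter and Sort tools are the more practical choice for datasets of this size.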
 
First, if anyone has ideas on how to improve the workflow, I would be open to suggestions; especially from people experienced with Galaxy.
 
Second, I am concerned that many/most of the SNPs are known. Should I filter my data against the known SNPdb? If so, how can I do this in Galaxy (in Bowtie?)
 
Third, as suggested in the screencast, I did not trim or filter my FASTQ Groomed data because I was interested in SNPs and I could filter on quality later in the workflow. Would implementing a filtering step on phred quality (~20) at this step save me the step of filtering later on? Currently it takes multiple hours (~16) to process the data from start to finish. Would filtering at this step reduce the processing time? Presumably, there would be less data to process. I do this on the AWS Cloud and time is money!
 
Fourth, when using Galaxy on the AWS cloud, does adding additional cores or adding High CPU (or both) shorten the time to process the data? When I set up extra cores, it appeared that some of them were idle, and I don't want to pay for idle cores. If anyone could share information on how best to manage the cloud, it would be appreciated.
 
Finally, what is the difference between “stopping” an instance and “terminating” an instance on the cloud? Would I still get charged by AWS if I just stop an instance? Any clarification in this area would also be much appreciated. Again, time is money!
I hope this helps many of us!
 
Unfortunately, I will not be in Pitt to ask these questions in person.

 

Thanks in advance!!!
 
Mike

--- On Tue, 4/5/11, Lali <[hidden email]> wrote:

From: Lali <[hidden email]>
Subject: Re: [galaxy-user] Analyzing Targeted Resequencing data with Galaxy
To: "Anton Nekrutenko" <[hidden email]>
Cc: "galaxy-user" <[hidden email]>
Date: Tuesday, April 5, 2011, 11:50 AM

Ohh sorry about that!
I am using both Windows XP and Ubuntu and I usually use Google Chrome.


On Tue, Apr 5, 2011 at 5:33 PM, Anton Nekrutenko <anton@...> wrote:
Lali:

Please always CC the mailing list when you reply.

> My only problem with Galaxy is that I have to keep on clearing my cache in order to get the history to display correctly, is there another way of solving this issue?

Which browser/OS are you using?

Thanks,

anton
galaxy team

On Apr 5, 2011, at 11:25 AM, Lali wrote:

Thanks so much for the tips Anton!
I am very excited about the newer developments.
I did watch the quickies, and they were very useful for a beginner like me. I actually made my first attempt at the alignment by following the Illumina single-end tutorial video step by step, but you need to watch the paired-end one too, because some of the first steps are explained better there.
I have been playing around a lot with Galaxy, and I have several workflows. My department just started doing sequencing, so we don't have standard procedures in place. I was assigned to evaluate Galaxy and CLC, and so far CLC has not impressed me, except for the fact that it can generate reports easily.
I think Galaxy is the way to go for me (or us, if I can convince them to run a local server), since I am not a bioinformatician, and just the fact that you can queue up actions and walk away is fantastic (amongst other things).
But because I am a beginner, I am not 100% sure of the settings I have chosen, and my data is not looking too good so far. I am having a bioinformatician come over to help me on Thursday, and I think your tips will be of help.
My only problem with Galaxy is that I have to keep clearing my cache to get the history to display correctly. Is there another way of solving this issue?

Best regards,

L

On Tue, Apr 5, 2011 at 3:56 PM, Anton Nekrutenko <anton@...> wrote:
Lali:

In your case the workflow for capture re-sequencing should look like this:

1. QC data (groom fastq files and plot quality distribution)
2. Map the reads (use bwa)
3. Generate and filter pileup
4. Intersect pileup with coordinates of SureSelect baits.
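
Step 4 can be illustrated with a small Python sketch. It assumes both the filtered pileup and the bait list are whitespace-separated "chrom start end ..." intervals (0-based start, exclusive end) and keeps only pileup rows that overlap a bait. The two pileup rows are copied from the output Mike posts in this thread; the bait coordinates are hypothetical and do not come from any real SureSelect design. Function names are illustrative, not part of Galaxy.

```python
# Sketch: keep only variant rows that overlap a capture bait.
from collections import defaultdict

PILEUP_ROWS = [
    "chr1 100316588 100316589 A G 255 255 60 141 0 0 137 0 137",
    "chr1 100575932 100575933 G A 255 255 60 89 89 0 0 0 89",
]

# Hypothetical bait intervals, invented for illustration only.
HYPOTHETICAL_BAITS = [
    "chr1 100316500 100316700",
    "chr1 104000000 104000100",
]


def load_intervals(lines):
    """Parse 'chrom start end' lines into {chrom: [(start, end), ...]}."""
    by_chrom = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        by_chrom[fields[0]].append((int(fields[1]), int(fields[2])))
    return by_chrom


def overlaps_any(chrom, start, end, intervals):
    """True if [start, end) overlaps at least one interval on chrom."""
    return any(b_start < end and start < b_end
               for b_start, b_end in intervals.get(chrom, []))


def intersect_with_baits(pileup_lines, bait_lines):
    """Return the pileup lines whose interval overlaps a bait."""
    baits = load_intervals(bait_lines)
    kept = []
    for line in pileup_lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        if overlaps_any(fields[0], int(fields[1]), int(fields[2]), baits):
            kept.append(line)
    return kept


if __name__ == "__main__":
    for line in intersect_with_baits(PILEUP_ROWS, HYPOTHETICAL_BAITS):
        print(line)
```

The linear scan over baits is fine for a demonstration but slow for a whole exome; a dedicated interval tool would be the sensible choice for real data.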

However, before you dive in please understand basic Galaxy functionality by taking a look at http://usegalaxy.org/galaxy101 and watching *all* Illumina-related Galaxy quickies (black boxes on the front page on Galaxy). Next, take a look at http://usegalaxy.org/heteroplasmy.

Note that we are working on bringing "industrial-strength" diploid genotyping functionality to Galaxy in the next two to three months. It will include more sophisticated genotypers, recalibration and realignment tools, and novel visualization approaches.

Thanks for using Galaxy.

anton
galaxy team



On Apr 5, 2011, at 2:44 AM, Lali wrote:

> Hi!
> I am having problems with my sequencing results, but I am a newbie at this; so I am thinking there is something wrong with my analysis. So far, I've tried Galaxy and CLC Workbench, but with CLC I could not align to the whole genome, only to individual chromosomes (maybe there is a way, but by the time the trial ended I had not found it).
>
> I used SureSelect capture kit and did single end sequencing on an Illumina. The files the lab sent me are FastQ Illumina 1.5 files, my samples were indexed, and I got a series of files each representing an Index.
>
> What would be the standard workflow for this kind of data?
> Which tools/settings?
>
> Does anyone have an example Galaxy workflow for preparing (clipping adapters, quality trimming) and mapping Targeted Resequencing Data?
>
> Is there a way to obtain a coverage report through Galaxy?
>
> Is it possible to ignore/discard the reads mapped when the coverage is below a certain threshold?
>
> I know, I know, a lot of things, but I am very lost.
> Any help is appreciated.
>
> L

Anton Nekrutenko
http://nekrut.bx.psu.edu
http://usegalaxy.org









Re: Analyzing Targeted Resequencing data with Galaxy

Mike Dufault
Hi Anton,
 
The conditions are given below. Currently, I don't have access to the AWS cloud, so I cannot share my history at the moment.
 

Select dataset:
which contains: Pileup with ten columns (with consensus)
Do not consider read bases with quality lower than: 20 (no variants with quality below this value will be reported)
Do not report positions with coverage lower than: 3 (pileup lines with coverage lower than this value will be skipped)
Only report variants?: Yes
Convert coordinates to intervals?: Yes
Print total number of differences?: No
Print quality and base string?: No

I did save the output from step 15 to my USB stick and I have provided a bit of it below for what it is worth.

 
chr1 100316588 100316589 A G 255 255 60 141 0 0 137 0 137
chr1 100575932 100575933 G A 255 255 60 89 89 0 0 0 89
chr1 100617886 100617887 C T 255 255 60 113 0 0 0 111 111
chr1 100672059 100672060 T C 255 255 60 225 1 220 0 0 221
chr1 101203826 101203827 G A 255 255 60 106 105 0 0 0 105
chr1 103461507 103461508 T A 255 255 60 87 82 0 0 0 82
chr1 104166495 104166496 T C 255 255 60 168 0 157 0 5 162
chr1 104256477 104256478 T A 255 255 60 84 82 0 0 0 82
 
Thanks for your help!
Mike


Re: Analyzing Targeted Resequencing data with Galaxy

Sean Davis
In reply to this post by Mike Dufault
Hi, Mike.  See my couple of comments below....

Sean

On Tue, Apr 5, 2011 at 2:22 PM, Mike Dufault <[hidden email]> wrote:

> Hi all,
>
> Like many people on this e-mail chain, I have been looking for advice on how to process Exome data. Below, I have described in detail what I have done with the hope of getting some clarification. Hopefully it will be helpful to many of us!
>
> I have SureSelect Exome captured data. The data was delivered to me as two separate files (/1) & (/2). Each file has ~33 million reads; 7.2 GB each. I am looking for SNPs from a family with cancer. Eventually I plan to compare the data from multiple members of the same family to find a related disease SNP.
>
> Below is the workflow that I used to process my data. I adapted it from the screencast titled "Mapping Illumina Reads: Paired Ends Example." I used all of the same default parameters as in the screencast.
>
> At the end of step 13, I had ~4,700,000 SNPs. This seemed like a lot so in step 14, I filtered on column 7 (c7) which I believe is the Quality SNP value. I set the filter as C7>=1 to remove all of the 0 (zero) values for Quality SNP. I figured that if they have a value of zero, they must not be real SNPs. This left me with ~180,000 SNPs.
>
> 1: Get Data: Illumina 1.3+ file (/1)
> 2: Get Data: Illumina 1.3+ file (/2)
> 3: FASTQ Groomer on data 1
> 4: FASTQ Groomer on data 2
> 5: FASTQ Summary Statistics on data 3
> 6: FASTQ Summary Statistics on data 4
> 7: Box plot on data 5
> 8: Box plot on data 6
> 9: Map with Bowtie for Illumina on data 4 and data 3: mapped reads

This might not be the best choice, as bowtie does not allow gapped alignment.  See here for a discussion of indels and SNV calling:


You will probably also want to consider local realignment around indels and potentially quality score recalibration.

> 10: Filter Sam on data 9
> 11: SAM-to-BAM on data 10: converted to BAM
> 12: Generate pileup on data 11: converted pileup
> 13: Filter pileup on data 12
> 14: Filter data on 13 (c7>=1)
> 15: Sort on data 14 (c7; descending order)
>
> First, if anyone has ideas on how to improve the workflow, I would be open to suggestions; especially from people experienced with Galaxy.
>
> Second, I am concerned that many/most of the SNPs are known. Should I filter my data against the known SNPdb? If so, how can I do this in Galaxy (in Bowtie?)

Keep in mind that, depending on the version of dbSNP, there are many cancer-associated SNPs contaminating the database.

> Third, as suggested in the screencast, I did not trim or filter my FASTQ Groomed data because I was interested in SNPs and I could filter on quality later in the workflow. Would implementing a filtering step on phred quality (~20) at this step save me the step of filtering later on? Currently it takes multiple hours (~16) to process the data from start to finish. Would filtering at this step reduce the processing time? Presumably, there would be less data to process. I do this on the AWS Cloud and time is money!

Adding a gapped alignment algorithm, indel realignment, and quality recalibration can easily increase this time to a couple of days per sample.

> Fourth, when using Galaxy on the AWS cloud, does adding additional cores or adding High CPU (or both) shorten the time to process the data? When I set up extra cores, it appeared that some of them were idle, and I don't want to pay for idle cores. If anyone could share information on how best to manage the cloud, it would be appreciated.
>
> Finally, what is the difference between “stopping” an instance and “terminating” an instance on the cloud? Would I still get charged by AWS if I just stop an instance? Any clarification in this area would also be much appreciated. Again, time is money!
>
> I hope this helps many of us!
>
> Unfortunately, I will not be in Pitt to ask these questions in person.
>
> Thanks in advance!!!
>
> Mike



Re: Analyzing Targeted Resequencing data with Galaxy

Anton Nekrutenko
In reply to this post by Mike Dufault
Mike:

You have fairly deep coverage, so increasing the quality cutoff to 25-30 and the coverage cutoff to at least 20 will dramatically decrease the number of SNPs. To see which SNPs are in dbSNP, simply obtain dbSNP data from UCSC (Get Data -> UCSC main) and join it with the pileup you've generated (Operate on Genomic Intervals -> Join).
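
As a rough illustration of the coverage cutoff and the dbSNP join, here is a small Python sketch. It makes several assumptions: rows are laid out like the filtered pileup Mike posted, with column 9 read as coverage (inferred from the posted rows, not from tool documentation); known SNPs are keyed by exact (chrom, start, end); and the dbSNP record and the "chrTest" row are invented for illustration. The base-quality cutoff is not reproduced here, because per-base qualities are no longer present in this output, so that part means re-running Filter pileup with the higher setting.

```python
# Sketch: apply a coverage cutoff, then split variants into known and novel
# by joining on (chrom, start, end) against a table of known SNPs.

PILEUP_ROWS = [
    "chr1 100316588 100316589 A G 255 255 60 141 0 0 137 0 137",
    "chr1 100575932 100575933 G A 255 255 60 89 89 0 0 0 89",
    # Invented low-coverage row, for illustration only:
    "chrTest 10 11 A C 40 35 60 8 4 4 0 0 8",
]

# Hypothetical known-SNP table; the identifier is not a real rsID.
HYPOTHETICAL_DBSNP = {
    ("chr1", 100575932, 100575933): "rs_hypothetical_1",
}

COVERAGE_COLUMN = 9  # 1-based; assumed to hold read depth


def parse_row(line):
    """Turn one pileup line into a small dict of the fields used here."""
    f = line.split()
    return {
        "chrom": f[0],
        "start": int(f[1]),
        "end": int(f[2]),
        "ref": f[3],
        "consensus": f[4],
        "coverage": int(f[COVERAGE_COLUMN - 1]),
    }


def split_known_novel(pileup_lines, known_snps, min_coverage=20):
    """Return (known, novel): known is a list of (row, snp_id) pairs."""
    known, novel = [], []
    for line in pileup_lines:
        if not line.strip():
            continue
        row = parse_row(line)
        if row["coverage"] < min_coverage:
            continue  # below the coverage cutoff
        key = (row["chrom"], row["start"], row["end"])
        if key in known_snps:
            known.append((row, known_snps[key]))
        else:
            novel.append(row)
    return known, novel


if __name__ == "__main__":
    known, novel = split_known_novel(PILEUP_ROWS, HYPOTHETICAL_DBSNP)
    print("known:", [(r["chrom"], r["start"], snp) for r, snp in known])
    print("novel:", [(r["chrom"], r["start"]) for r in novel])
```

An exact-key join only matches single-base records whose coordinates use the same convention in both files, so it is worth checking a few known positions by hand before trusting the split.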


To add to the excellent comments by Sean: realignment and recalibration tools are coming by this summer, together with more sophisticated genotypers.

Tx,

anton
galaxy team


On Apr 5, 2011, at 3:11 PM, Mike Dufault wrote:

> Hi Anton,
>  
> The conditions are give below. Currently, I don't have access to the AWS cloud so I can not share my history at the moment.
>  
> Select dataset:
>  
>  
> which contains:
> Pileup with ten columns (with consensus)
>  
> See "Types of pileup datasets" below for examples
> Do not consider read bases with quality lower than:
> 20
>  
> No variants with quality below this value will be reported
> Do not report positions with coverage lower than:
> 3
>  
> Pileup lines with coverage lower than this value will be skipped
>  
> Only report variants?:
> Yes
> See "Examples 1 and 2" below for explanation
>  
> Convert coordinates to intervals?:
> Yes
> See "Output format" below for explanation
>  
> Print total number of differences?:
> No
> See "Example 3" below for explanation
>  
> Print quality and base string?:
> No
> See "Example 4" below for explanation
>  
>  
> I did save the output from step 15 to my USB stick and I have provided a bit of it below for what it is worth.
>  
> chr1 100316588 100316589 A G 255 255 60 141 0 0 137 0 137
> chr1 100575932 100575933 G A 255 255 60 89 89 0 0 0 89
> chr1 100617886 100617887 C T 255 255 60 113 0 0 0 111 111
> chr1 100672059 100672060 T C 255 255 60 225 1 220 0 0 221
> chr1 101203826 101203827 G A 255 255 60 106 105 0 0 0 105
> chr1 103461507 103461508 T A 255 255 60 87 82 0 0 0 82
> chr1 104166495 104166496 T C 255 255 60 168 0 157 0 5 162
> chr1 104256477 104256478 T A 255 255 60 84 82 0 0 0 82
>  
> Thanks for your help!
> Mike
>
> --- On Tue, 4/5/11, Anton Nekrutenko <[hidden email]> wrote:
>
> From: Anton Nekrutenko <[hidden email]>
> Subject: Re: [galaxy-user] Analyzing Targeted Resequencing data with Galaxy
> To: "Mike Dufault" <[hidden email]>
> Cc: "Lali" <[hidden email]>, "galaxy-user" <[hidden email]>
> Date: Tuesday, April 5, 2011, 2:33 PM
>
> Mike:
>
> Which parameters did you use at step 13 (if you used main site to perform these analyses you can share your history with me).
>
> Thanks,
>
> anton
>
>
> On Apr 5, 2011, at 2:22 PM, Mike Dufault wrote:
>
>> Hi all,
>>  
>> Like many people on this e-mail chain, I have been looking for advice on how to process Exome data. Below, I have described in detail what I have done with the hope of getting some clarification. Hopefully it will be helpful to many of us!
>>  
>> I have SureSelect Exome captured data. The data was delivered to me as two separate files (/1) & (/2). Each file has ~33 million reads; 7.2 GB each. I am looking for SNPs from a family with cancer. Eventually I plan to compare the date from multiple members of the same family to find a related disease SNP.
>>  
>> Below is the workflow that I used to process my data. I adapted it from the Screencast titles: "Mapping Illumina Reads: Paired Ends Example." I used all of the same default parameters as in the screencast.
>>  
>> At the end of step 13, I had ~4,700,000 SNPs. This seemed like a lot so in step 14, I filtered on column 7 (c7) which I believe is the Quality SNP value. I set the filter as C7>=1 to remove all of the 0 (zero) values for Quality SNP. I figured that if they have a value of zero, they must not be real SNPs. This left me with ~180,000 SNPs.
>>  
>> 1: Get Data: Illumina 1.3+ file (/1)
>> 2: Get Data: Illumina 1.3+ file (/2)
>> 3: FASTQ Groomer on data 1
>> 4: FASTQ Groomer on data 2
>> 5: FASTQ Summary Statistics on data 3
>> 6: FASTQ Summary Statistics on data 4
>> 7: Box plot on data 5
>> 8: Box plot on data 6
>> 9: Map with Bowtie for Illumina on data 4 and data 3: mapped reads
>> 10: Filter Sam on data 9
>> 11: SAM-to-BAM on data 10: converted to BAM
>> 12: Generate pileup on data 11: converted pileup
>> 13: Filter pileup on data 12
>> 14: Filter data on 13 (c7>=1)
>> 15: Sort on data 15 (C7; descending order)
>>  
>> First, if anyone has ideas on how to improve the workflow, I would be open to suggestions; especially from people experienced with Galaxy.
>>  
>> Second, I am concerned that many/most of the SNPs are known. Should I filter my data against the known SNPdb? If so, how can I do this in Galaxy (in Bowtie?)
>>  
>> Third, as suggested in the screencast, I did not trim or filter my FASTQ Groomed data because I was interested in SNPs and I could filter on Quality later in the workflow. Would implementing a filtering step on phred quality (~20) at this step save me the step of filtering later on. Currently it takes multiple hours (~16) to process the data from start to finish, would filtering at this step reduce the amount of time that it takes to process my data? Presumably, there would be less data to process. I do this on the AWS Cloud and time is money!
>>  
>> Fifth, when using Galaxy on the AWS cloud, does adding additional cores or adding High CPU ( or both) shorten the time to process the data? When I set up extra cores, it appeared that some of them are idle and I don't want to pay for idle cores. If anyone could share information on how best to manage the cloud, it would be appreciated.
>>  
>> Finally, what is the difference between “stopping” an instance and “terminating” an instance on the cloud? Would I still get charged by AWS if I just stop an instance? Any clarification in this area would also be much appreciated. Again, time is money!
>> I hope this helps many of us!
>>  
>> Unfortunatly, I will not be in Pitt to ask these questions in person.
>>  
>> Thanks in advance!!!
>>  
>> Mike
>>
>> --- On Tue, 4/5/11, Lali <[hidden email]> wrote:
>>
>> From: Lali <[hidden email]>
>> Subject: Re: [galaxy-user] Analyzing Targeted Resequencing data with Galaxy
>> To: "Anton Nekrutenko" <[hidden email]>
>> Cc: "galaxy-user" <[hidden email]>
>> Date: Tuesday, April 5, 2011, 11:50 AM
>>
>> Ohh sorry about that!
>> I am using both Windows XP and Ubuntu and I usually use Google Chrome.
>>
>>
>> On Tue, Apr 5, 2011 at 5:33 PM, Anton Nekrutenko <[hidden email]> wrote:
>> Lali:
>>
>> Please, always CC mailing list when you reply.
>>
>>> My only problem with Galaxy is that I have to keep on clearing my cache in order to get the history to display correctly, is there another way of solving this issue?
>>
>> Which browser/OS are your using?
>>
>> Thanks,
>>
>> anton
>> galaxy team
>>
>> On Apr 5, 2011, at 11:25 AM, Lali wrote:
>>
>>> Thanks so much for the tips Anton!
>>> I am very excited about the newer developments.
>>> I did watch the quickies and they were very useful for a beginner like me, I actually did my first try at the alignment by following the Illumina single-end tutorial video step by step, but you need to watch the paired-end too, for some of the first steps, which are explained better on that one.
>>> I have been playing around a lot with Galaxy, and I have several workflows, my department just started doing sequencing, so we don't have standard procedures set in place. I was assigned to evaluate Galaxy and CLC, and so far CLC has not impressed me, except for the fact that it can generate reports easily.
>>> I think Galaxy is the way to go for me (us, if I can convince them to run a local server), since I am not a bioinformatician, and just the fact that you can queue up actions and just walk away is fantastic (amongst other things).
>>> But because I am a beginner, I am not 100% sure of the settings I have chosen and my data is not looking too good so far. I am having a bioinformatician come over and help me on Thursday, and I think your tips will be of help.
>>> My only problem with Galaxy is that I have to keep on clearing my cache in order to get the history to display correctly, is there another way of solving this issue?
>>>
>>> Best regards,
>>>
>>> L
>>>
>>> On Tue, Apr 5, 2011 at 3:56 PM, Anton Nekrutenko <[hidden email]> wrote:
>>> Lali:
>>>
>>> In your case the workflow for capture re-sequencing should look like this:
>>>
>>> 1. QC data (groom fastq files and plot quality distribution)
>>> 2. Map the reads (use bwa)
>>> 3. Generate and filter pileup
>>> 4. Intersect pileup with coordinates of SureSelect baits.
>>>
>>> However, before you dive in please understand basic Galaxy functionality by taking a look at http://usegalaxy.org/galaxy101 and watching *all* Illumina-related Galaxy quickies (black boxes on the front page on Galaxy). Next, take a look at http://usegalaxy.org/heteroplasmy.
>>>
>>> Note that we are working on bringing "industrial-strength" diploid genotyping functionality to Galaxy in the next two to three months; it will include more sophisticated genotypers, recalibration and realignment tools, and novel visualization approaches.
>>>
>>> Thanks for using Galaxy.
>>>
>>> anton
>>> galaxy team
>>>
>>>
>>>
>>> On Apr 5, 2011, at 2:44 AM, Lali wrote:
>>>
>>> > Hi!
>>> > I am having problems with my sequencing results, but I am a newbie at this; so I am thinking there is something wrong with my analysis. So far, I've tried Galaxy and CLC Workbench, but with CLC I could not align to the whole genome, only to individual chromosomes (maybe there is a way, but by the time the trial ended I had not found it).
>>> >
>>> > I used SureSelect capture kit and did single end sequencing on an Illumina. The files the lab sent me are FastQ Illumina 1.5 files, my samples were indexed, and I got a series of files each representing an Index.
>>> >
>>> > What would be the standard workflow for this kind of data?
>>> > Which tools/settings?
>>> >
>>> > Does anyone have an example Galaxy workflow for preparing (clipping adapters, quality trimming) and mapping Targeted Resequencing Data?
>>> >
>>> > Is there a way to obtain a coverage report through Galaxy?
>>> >
>>> > Is it possible to ignore/discard the reads mapped when the coverage is below a certain threshold?
>>> >
>>> > I know, I know, a lot of things, but I am very lost.
>>> > Any help is appreciated.
>>> >
>>> > L

Anton Nekrutenko
http://nekrut.bx.psu.edu
http://usegalaxy.org
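
Step 4 of the capture workflow quoted above (intersecting the pileup with the SureSelect bait coordinates) can be sketched outside Galaxy roughly as follows. This is only an illustration, not what the Galaxy tool does internally: the bait intervals are made-up placeholders (a real list would come from the kit's BED file), the variant positions are the start coordinates of the first three sample rows posted in this thread, coordinates are assumed to be 0-based half-open as in BED, and the helper names `build_index` and `in_bait` are hypothetical.

```python
from bisect import bisect_right


def build_index(baits):
    """Group bait intervals by chromosome and sort them by start."""
    index = {}
    for chrom, start, end in baits:
        index.setdefault(chrom, []).append((start, end))
    for intervals in index.values():
        intervals.sort()
    return index


def in_bait(index, chrom, pos):
    """Return True if position pos lies inside a bait on chrom.

    Assumes baits on one chromosome do not overlap each other, so only
    the last interval starting at or before pos has to be checked.
    """
    intervals = index.get(chrom, [])
    i = bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos < intervals[i][1]


# Hypothetical bait intervals (chrom, start, end); not real SureSelect data.
baits = [
    ("chr1", 100316500, 100316700),
    ("chr1", 100575000, 100575500),
]
index = build_index(baits)

# Variant positions taken from the sample rows posted in this thread.
variants = [("chr1", 100316588), ("chr1", 100575932), ("chr1", 100617886)]
on_target = [v for v in variants if in_bait(index, *v)]
print(on_target)
```

With these placeholder baits only the first position is kept; the other two fall outside every interval and would be treated as off-target.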





Re: Analyzing Targeted Resequencing data with Galaxy

Lali
Thanks for all the tips and advice, I will get back to this thread after I have tried it out :)


On Tue, Apr 5, 2011 at 10:52 PM, Anton Nekrutenko <[hidden email]> wrote:
Mike:

You have fairly deep coverage, so increasing the quality cutoff to 25-30 and the coverage cutoff to at least 20 will dramatically decrease the number of SNPs. To see which SNPs are already in dbSNP, simply obtain the dbSNP data from UCSC (Get Data -> UCSC Main) and join it with the pileup you've generated (Operate on Genomic Intervals -> Join).
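
For anyone who wants to try the stricter cutoffs on an already-exported table, here is a minimal sketch. It rests on several assumptions: the rows are whitespace-separated like the sample output posted in this thread, column 7 is the SNP quality (as Mike believes) and column 9 is the coverage. The quality cutoff in the Filter pileup tool applies to individual read bases inside the tool, which cannot be redone from the summarized rows, so thresholding column 7 here is only a stand-in. The last two rows are made up to show rows being dropped, and `filter_variants` is a hypothetical helper name.

```python
def filter_variants(lines, min_quality=25, min_coverage=20):
    """Keep rows whose column 7 and column 9 meet the given cutoffs."""
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) < 9:
            continue  # skip blank or malformed lines
        snp_quality = int(fields[6])  # column 7 (assumed SNP quality)
        coverage = int(fields[8])     # column 9 (assumed coverage)
        if snp_quality >= min_quality and coverage >= min_coverage:
            kept.append(fields)
    return kept


rows = [
    # Two rows copied from the sample output in this thread.
    "chr1\t100316588\t100316589\tA\tG\t255\t255\t60\t141\t0\t0\t137\t0\t137",
    "chr1\t100672059\t100672060\tT\tC\t255\t255\t60\t225\t1\t220\t0\t0\t221",
    # Two made-up rows: one with low quality, one with low coverage.
    "chr1\t100700000\t100700001\tC\tT\t30\t12\t60\t45\t0\t20\t0\t25\t45",
    "chr1\t100800000\t100800001\tG\tA\t40\t40\t60\t8\t8\t0\t0\t0\t8",
]

kept = filter_variants(rows)
for fields in kept:
    print("\t".join(fields))
```

Only the two rows from the thread survive the defaults used here; the made-up rows fail on quality and coverage respectively.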


To add to the excellent comments by Sean: realignment and recalibration tools are coming by this summer, together with more sophisticated genotypers.

Tx,

anton
galaxy team
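
The dbSNP comparison suggested in the reply above boils down to an interval join. A rough sketch of the idea, under stated assumptions: variants and dbSNP entries are both 0-based half-open intervals, the dbSNP names below are placeholders rather than real rsIDs, and `annotate_with_dbsnp` is a hypothetical helper. It is not claimed to match the exact behaviour of Galaxy's Join tool.

```python
def annotate_with_dbsnp(variants, dbsnp):
    """Attach the names of overlapping dbSNP entries to each variant.

    variants: iterable of (chrom, start, end)
    dbsnp:    iterable of (chrom, start, end, name)
    """
    by_chrom = {}
    for chrom, start, end, name in dbsnp:
        by_chrom.setdefault(chrom, []).append((start, end, name))

    annotated = []
    for chrom, start, end in variants:
        # Two half-open intervals overlap when each starts before the other ends.
        names = [
            name
            for s, e, name in by_chrom.get(chrom, [])
            if s < end and start < e
        ]
        annotated.append((chrom, start, end, names))
    return annotated


# Coordinates from the sample rows in this thread.
variants = [
    ("chr1", 100316588, 100316589),
    ("chr1", 100575932, 100575933),
]
# Placeholder dbSNP entries; the names are NOT real rsIDs.
dbsnp = [
    ("chr1", 100316588, 100316589, "rs_example_1"),
    ("chr1", 100575000, 100575001, "rs_example_2"),
]

annotated = annotate_with_dbsnp(variants, dbsnp)
known = [v for v in annotated if v[3]]
novel = [v for v in annotated if not v[3]]
print(len(known), "known;", len(novel), "not found in the dbSNP subset")
```

Variants with an empty name list are the ones worth a closer look when hunting for a disease SNP, since they do not overlap any known entry in the table that was loaded.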


On Apr 5, 2011, at 3:11 PM, Mike Dufault wrote:

> Hi Anton,
>
> The conditions are given below. Currently, I don't have access to the AWS cloud, so I cannot share my history at the moment.
>
> Select dataset:
> which contains: Pileup with ten columns (with consensus)
> Do not consider read bases with quality lower than: 20
> Do not report positions with coverage lower than: 3
> Only report variants?: Yes
> Convert coordinates to intervals?: Yes
> Print total number of differences?: No
> Print quality and base string?: No
>
>
> I did save the output from step 15 to my USB stick and I have provided a bit of it below for what it is worth.
>
> chr1  100316588       100316589       A       G       255     255     60      141     0       0       137     0       137
> chr1  100575932       100575933       G       A       255     255     60      89      89      0       0       0       89
> chr1  100617886       100617887       C       T       255     255     60      113     0       0       0       111     111
> chr1  100672059       100672060       T       C       255     255     60      225     1       220     0       0       221
> chr1  101203826       101203827       G       A       255     255     60      106     105     0       0       0       105
> chr1  103461507       103461508       T       A       255     255     60      87      82      0       0       0       82
> chr1  104166495       104166496       T       C       255     255     60      168     0       157     0       5       162
> chr1  104256477       104256478       T       A       255     255     60      84      82      0       0       0       82
>
> Thanks for your help!
> Mike
>
> --- On Tue, 4/5/11, Anton Nekrutenko <[hidden email]> wrote:
>
> From: Anton Nekrutenko <[hidden email]>
> Subject: Re: [galaxy-user] Analyzing Targeted Resequencing data with Galaxy
> To: "Mike Dufault" <[hidden email]>
> Cc: "Lali" <[hidden email]>, "galaxy-user" <[hidden email]>
> Date: Tuesday, April 5, 2011, 2:33 PM
>
> Mike:
>
> Which parameters did you use at step 13? (If you used the main site to perform these analyses, you can share your history with me.)
>
> Thanks,
>
> anton
>
>
> On Apr 5, 2011, at 2:22 PM, Mike Dufault wrote:
>
>> Hi all,
>>
>> Like many people on this e-mail chain, I have been looking for advice on how to process Exome data. Below, I have described in detail what I have done with the hope of getting some clarification. Hopefully it will be helpful to many of us!
>>
>> I have SureSelect Exome captured data. The data was delivered to me as two separate files (/1) & (/2). Each file has ~33 million reads and is 7.2 GB. I am looking for SNPs from a family with cancer. Eventually I plan to compare the data from multiple members of the same family to find a related disease SNP.
>>
>> Below is the workflow that I used to process my data. I adapted it from the screencast titled "Mapping Illumina Reads: Paired Ends Example." I used all of the same default parameters as in the screencast.
>>
>> At the end of step 13, I had ~4,700,000 SNPs. This seemed like a lot so in step 14, I filtered on column 7 (c7) which I believe is the Quality SNP value. I set the filter as C7>=1 to remove all of the 0 (zero) values for Quality SNP. I figured that if they have a value of zero, they must not be real SNPs. This left me with ~180,000 SNPs.
>>
>> 1: Get Data: Illumina 1.3+ file (/1)
>> 2: Get Data: Illumina 1.3+ file (/2)
>> 3: FASTQ Groomer on data 1
>> 4: FASTQ Groomer on data 2
>> 5: FASTQ Summary Statistics on data 3
>> 6: FASTQ Summary Statistics on data 4
>> 7: Box plot on data 5
>> 8: Box plot on data 6
>> 9: Map with Bowtie for Illumina on data 4 and data 3: mapped reads
>> 10: Filter Sam on data 9
>> 11: SAM-to-BAM on data 10: converted to BAM
>> 12: Generate pileup on data 11: converted pileup
>> 13: Filter pileup on data 12
>> 14: Filter data on 13 (c7>=1)
>> 15: Sort on data 14 (c7; descending order)
>>
>> First, if anyone has ideas on how to improve the workflow, I would be open to suggestions; especially from people experienced with Galaxy.
>>
>> Second, I am concerned that many/most of the SNPs are known. Should I filter my data against the known SNPs in dbSNP? If so, how can I do this in Galaxy (in Bowtie)?
>>
>> Third, as suggested in the screencast, I did not trim or filter my FASTQ Groomed data because I was interested in SNPs and I could filter on quality later in the workflow. Would implementing a filtering step on Phred quality (~20) at this step save me the step of filtering later on? Currently it takes multiple hours (~16) to process the data from start to finish. Would filtering at this step reduce that time? Presumably, there would be less data to process. I do this on the AWS Cloud and time is money!
>>


Re: Analyzing Targeted Resequencing data with Galaxy

Lali
Btw Anton, you never answered anything about the bug with the history not loading properly until I clear my cache.
I use Windows XP - Firefox
Ubuntu - Google Chrome

Any ideas?





Re: Analyzing Targeted Resequencing data with Galaxy

James Taylor-2
Lali, we don't have an answer yet because we have never seen this and can't reproduce it. Are you using a proxy server or anything unusual?

-- jt 

(composed on my phone)

On Apr 6, 2011, at 9:49 AM, Lali <[hidden email]> wrote:

Btw Anton, you never answered anything about the bug with the history not loading properly until I clear my cache.
I use Windows XP - Firefox
Ubuntu - Google Chrome

Any ideas?



On Wed, Apr 6, 2011 at 2:15 PM, Lali <[hidden email]> wrote:
Thanks for all the tips and advice, I will get back to this thread after I have tried it out :)



On Tue, Apr 5, 2011 at 10:52 PM, Anton Nekrutenko <[hidden email]> wrote:
Mike:

You have fairly deep coverage, so increasing the quality cutoff to 25-30 and the coverage cutoff to at least 20 will dramatically decrease the number of SNPs. To see which SNPs are in dbSNP, simply obtain the dbSNP data from UCSC (Get Data -> UCSC main) and join it with the pileup you've generated (Operate on Genomic Intervals -> Join).
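
As a rough illustration of the stricter cutoffs suggested above, the sketch below re-filters rows shaped like the interval-format Filter pileup output quoted further down in this thread. It assumes that column 7 is the SNP quality and that the last column (c14) is the number of bases that passed the base-quality cutoff; the function name, the default thresholds, and the last two rows are made up for the example. The base-quality cutoff itself is applied inside Filter pileup and cannot be redone on this table, so the sketch only filters on SNP quality and coverage.

```python
def filter_variants(rows, min_snp_quality=25, min_coverage=20):
    """Keep whitespace-separated pileup-interval rows that pass both cutoffs.

    Assumed layout (1-based): c7 = SNP quality, c14 = quality-adjusted coverage.
    """
    kept = []
    for row in rows:
        fields = row.split()
        snp_quality = int(fields[6])
        coverage = int(fields[13])
        if snp_quality >= min_snp_quality and coverage >= min_coverage:
            kept.append(row)
    return kept


rows = [
    # First two rows are copied from the output posted in this thread.
    "chr1 100316588 100316589 A G 255 255 60 141 0 0 137 0 137",
    "chr1 100575932 100575933 G A 255 255 60 89 89 0 0 0 89",
    # Hypothetical rows: one with SNP quality 0, one with low coverage.
    "chr1 1000 1001 A G 30 0 60 25 0 0 22 0 22",
    "chr1 2000 2001 C T 40 40 60 5 0 0 0 4 4",
]

kept = filter_variants(rows)
for row in kept:
    print(row)
```

In Galaxy itself the same effect would come from the Filter pileup settings plus a Filter step on the relevant columns; the script is only meant to make the logic explicit.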


To add to the excellent comments by Sean -> realignment and recalibration tools are coming by this Summer together with more sophisticated genotypers.

Tx,

anton
galaxy team


On Apr 5, 2011, at 3:11 PM, Mike Dufault wrote:

> Hi Anton,
>
> The conditions are given below. Currently, I don't have access to the AWS cloud, so I cannot share my history at the moment.
>
> Select dataset which contains: Pileup with ten columns (with consensus)
> Do not consider read bases with quality lower than: 20 (no variants with quality below this value will be reported)
> Do not report positions with coverage lower than: 3 (pileup lines with coverage lower than this value will be skipped)
> Only report variants?: Yes
> Convert coordinates to intervals?: Yes
> Print total number of differences?: No
> Print quality and base string?: No
>
>
> I did save the output from step 15 to my USB stick and I have provided a bit of it below for what it is worth.
>
> chr1  100316588       100316589       A       G       255     255     60      141     0       0       137     0       137
> chr1  100575932       100575933       G       A       255     255     60      89      89      0       0       0       89
> chr1  100617886       100617887       C       T       255     255     60      113     0       0       0       111     111
> chr1  100672059       100672060       T       C       255     255     60      225     1       220     0       0       221
> chr1  101203826       101203827       G       A       255     255     60      106     105     0       0       0       105
> chr1  103461507       103461508       T       A       255     255     60      87      82      0       0       0       82
> chr1  104166495       104166496       T       C       255     255     60      168     0       157     0       5       162
> chr1  104256477       104256478       T       A       255     255     60      84      82      0       0       0       82
>
> Thanks for your help!
> Mike
>
> --- On Tue, 4/5/11, Anton Nekrutenko <[hidden email]> wrote:
>
> From: Anton Nekrutenko <[hidden email]>
> Subject: Re: [galaxy-user] Analyzing Targeted Resequencing data with Galaxy
> To: "Mike Dufault" <[hidden email]>
> Cc: "Lali" <[hidden email]>, "galaxy-user" <[hidden email]>
> Date: Tuesday, April 5, 2011, 2:33 PM
>
> Mike:
>
> Which parameters did you use at step 13 (if you used main site to perform these analyses you can share your history with me).
>
> Thanks,
>
> anton
>
>
> On Apr 5, 2011, at 2:22 PM, Mike Dufault wrote:
>
>> Hi all,
>>
>> Like many people on this e-mail chain, I have been looking for advice on how to process Exome data. Below, I have described in detail what I have done with the hope of getting some clarification. Hopefully it will be helpful to many of us!
>>
>> I have SureSelect Exome captured data. The data was delivered to me as two separate files (/1) & (/2). Each file has ~33 million reads; 7.2 GB each. I am looking for SNPs from a family with cancer. Eventually I plan to compare the data from multiple members of the same family to find a related disease SNP.
>>
>> Below is the workflow that I used to process my data. I adapted it from the screencast titled "Mapping Illumina Reads: Paired Ends Example." I used all of the same default parameters as in the screencast.
>>
>> At the end of step 13, I had ~4,700,000 SNPs. This seemed like a lot so in step 14, I filtered on column 7 (c7) which I believe is the Quality SNP value. I set the filter as C7>=1 to remove all of the 0 (zero) values for Quality SNP. I figured that if they have a value of zero, they must not be real SNPs. This left me with ~180,000 SNPs.
>>
>> 1: Get Data: Illumina 1.3+ file (/1)
>> 2: Get Data: Illumina 1.3+ file (/2)
>> 3: FASTQ Groomer on data 1
>> 4: FASTQ Groomer on data 2
>> 5: FASTQ Summary Statistics on data 3
>> 6: FASTQ Summary Statistics on data 4
>> 7: Box plot on data 5
>> 8: Box plot on data 6
>> 9: Map with Bowtie for Illumina on data 4 and data 3: mapped reads
>> 10: Filter Sam on data 9
>> 11: SAM-to-BAM on data 10: converted to BAM
>> 12: Generate pileup on data 11: converted pileup
>> 13: Filter pileup on data 12
>> 14: Filter data on 13 (c7>=1)
>> 15: Sort on data 15 (C7; descending order)
>>
>> First, if anyone has ideas on how to improve the workflow, I would be open to suggestions; especially from people experienced with Galaxy.
>>
>> Second, I am concerned that many/most of the SNPs are known. Should I filter my data against the known SNPdb? If so, how can I do this in Galaxy (in Bowtie?)
>>
>> Third, as suggested in the screencast, I did not trim or filter my FASTQ Groomed data because I was interested in SNPs and could filter on quality later in the workflow. Would filtering on phred quality (~20) at this step save me from filtering later on? It currently takes multiple hours (~16) to process the data from start to finish; would filtering at this step reduce that time? Presumably there would be less data to process. I do this on the AWS Cloud and time is money!
>>
>> Fourth, when using Galaxy on the AWS cloud, does adding cores, choosing High CPU instances, or both shorten the time to process the data? When I set up extra cores, it appeared that some of them were idle, and I don't want to pay for idle cores. If anyone could share information on how best to manage the cloud, it would be appreciated.
>>
>> Finally, what is the difference between “stopping” an instance and “terminating” an instance on the cloud? Would I still get charged by AWS if I just stop an instance? Any clarification in this area would also be much appreciated. Again, time is money!
>> I hope this helps many of us!
>>
>> Unfortunately, I will not be in Pitt to ask these questions in person.
>>
>> Thanks in advance!!!
>>
>> Mike
>>
>> --- On Tue, 4/5/11, Lali <[hidden email]> wrote:
>>
>> From: Lali <[hidden email]>
>> Subject: Re: [galaxy-user] Analyzing Targeted Resequencing data with Galaxy
>> To: "Anton Nekrutenko" <[hidden email]>
>> Cc: "galaxy-user" <[hidden email]>
>> Date: Tuesday, April 5, 2011, 11:50 AM
>>
>> Ohh sorry about that!
>> I am using both Windows XP and Ubuntu and I usually use Google Chrome.
>>
>>
>> On Tue, Apr 5, 2011 at 5:33 PM, Anton Nekrutenko <[hidden email]> wrote:
>> Lali:
>>
>> Please, always CC mailing list when you reply.
>>
>>> My only problem with Galaxy is that I have to keep on clearing my cache in order to get the history to display correctly, is there another way of solving this issue?
>>
>> Which browser/OS are you using?
>>
>> Thanks,
>>
>> anton
>> galaxy team
>>
>> On Apr 5, 2011, at 11:25 AM, Lali wrote:
>>
>>> Thanks so much for the tips Anton!
>>> I am very excited about the newer developments.
>>> I did watch the quickies and they were very useful for a beginner like me, I actually did my first try at the alignment by following the Illumina single-end tutorial video step by step, but you need to watch the paired-end too, for some of the first steps, which are explained better on that one.
>>> I have been playing around a lot with Galaxy, and I have several workflows, my department just started doing sequencing, so we don't have standard procedures set in place. I was assigned to evaluate Galaxy and CLC, and so far CLC has not impressed me, except for the fact that it can generate reports easily.
>>> I think Galaxy is the way to go for me (us, if I can convince them to run a local server), since I am not a bioinformatician, and just the fact that you can queue up actions and just walk away is fantastic (amongst other things).
>>> But because I am a beginner, I am not 100% sure of the settings I have chosen and my data is not looking too good so far. I am having a bioinformatician come over and help me on Thursday, and I think your tips will be of help.
>>> My only problem with Galaxy is that I have to keep on clearing my cache in order to get the history to display correctly, is there another way of solving this issue?
>>>
>>> Best regards,
>>>
>>> L
>>>
>>> On Tue, Apr 5, 2011 at 3:56 PM, Anton Nekrutenko <[hidden email]> wrote:
>>> Lali:
>>>
>>> In your case the workflow for capture re-sequencing should look like this:
>>>
>>> 1. QC data (groom fastq files and plot quality distribution)
>>> 2. Map the reads (use bwa)
>>> 3. Generate and filter pileup
>>> 4. Intersect pileup with coordinates of the SureSelect baits.
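
Step 4 above can be sketched outside Galaxy as a point-in-interval lookup. This is only an illustration under stated assumptions: the bait coordinates are hypothetical, intervals are taken to be 0-based and half-open as in BED files, baits on one chromosome are assumed not to overlap each other, and the helper names `build_index` and `on_target` are invented for the example.

```python
from bisect import bisect_right


def build_index(baits):
    """Group half-open (start, end) bait intervals per chromosome, sorted by start."""
    index = {}
    for chrom, start, end in baits:
        index.setdefault(chrom, []).append((start, end))
    for intervals in index.values():
        intervals.sort()
    return index


def on_target(index, chrom, pos):
    """Return True if 0-based position pos lies inside a bait on chrom.

    Assumes the baits on one chromosome do not overlap each other.
    """
    intervals = index.get(chrom, [])
    # Locate the last interval whose start is <= pos.
    i = bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos < intervals[i][1]


# Hypothetical bait coordinates, for illustration only.
baits = [("chr1", 100, 200), ("chr1", 500, 620), ("chr2", 50, 170)]
index = build_index(baits)

# Hypothetical variant positions taken from a filtered pileup.
variants = [("chr1", 150), ("chr1", 200), ("chr1", 500), ("chr3", 10)]
kept = [v for v in variants if on_target(index, *v)]
print(kept)
```

Within Galaxy, the Intersect tool under Operate on Genomic Intervals performs this kind of operation on interval datasets.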
>>>
>>> However, before you dive in please understand basic Galaxy functionality by taking a look at http://usegalaxy.org/galaxy101 and watching *all* Illumina-related Galaxy quickies (black boxes on the front page on Galaxy). Next, take a look at http://usegalaxy.org/heteroplasmy.
>>>
>>> Note, that we are working on bringing "industrial-strength" diploid genotyping functionality in Galaxy in the next two-three months that will include more sophisticated genotypers, recalibration and realignment tools, and novel visualization approaches.
>>>
>>> Thanks for using Galaxy.
>>>
>>> anton
>>> galaxy team
>>>
>>>
>>>
>>> On Apr 5, 2011, at 2:44 AM, Lali wrote:
>>>
>>> > Hi!
>>> > I am having problems with my sequencing results, but I am a newbie at this; so I am thinking there is something wrong with my analysis. So far, I've tried Galaxy and CLC Workbench, but with CLC I could not align to the whole genome, only to individual chromosomes (maybe there is a way, but by the time the trial ended I had not found it).
>>> >
>>> > I used SureSelect capture kit and did single end sequencing on an Illumina. The files the lab sent me are FastQ Illumina 1.5 files, my samples were indexed, and I got a series of files each representing an Index.
>>> >
>>> > What would be the standard workflow for this kind of data?
>>> > Which tools/settings?
>>> >
>>> > Does anyone have an example Galaxy workflow for preparing (clipping adapters, quality trimming) and mapping Targeted Resequencing Data?
>>> >
>>> > Is there a way to obtain a coverage report through Galaxy?
>>> >
>>> > Is it possible to ignore/discard the reads mapped when the coverage is below a certain threshold?
>>> >
>>> > I know, I know, a lot of things, but I am very lost.
>>> > Any help is appreciated.
>>> >
>>> > L
Reply | Threaded
Open this post in threaded view
|

Re: Analyzing Targeted Resequencing data with Galaxy

Lali
No proxy, I have accessed Galaxy both from home and the office and it is the same thing. I clear the cache, relogin and things are fine for maybe 1 or 2 workflows and then it starts messing up again.
It is not a problem that is there all the time, it happens sometimes only, and the solution I've found is clearing the cache.

This is what I do:
1-Make a new history - Galaxy shows a new blank history
2-Select a saved data set and send to history
3-Old history loads (last history I worked on, maybe days ago)
4-Hit refresh
5-New history with newly loaded dataset appears
6- Select a workflow, set my loaded dataset as input
7-Click on send to new history and set a name for it
8-Click ok, history remains the same (step 5)
9-Hit refresh
10-Nothing happens, same history from step 5
11-Close browser
12-Wait a few hours
13-Open Galaxy again
14-Same history from step 5
15-Click on saved histories and click on the history made with the workflow
16-Loads ok

or

Clear the cache and do steps 1 to 6: no problem, the history loads as it should.

Also, when I save workflows, some steps get jumbled like:

1- Groom
2-filter artifacts
3-clip
4-another clip
5-quality trim
6-map

saved workflow:
1-groom
2-filter artifacts
3-clip
4-quality trim
5-another clip
6-map

-L


Reply | Threaded
Open this post in threaded view
|

Re: Analyzing Targeted Resequencing data with Galaxy

Dannon Baker
> Also, when I save workflows, some steps get jumbled like:
>
> 1- Groom
> 2-filter artifacts
> 3-clip
> 4-another clip
> 5-quality trim
> 6-map
>
> saved workflow:
> 1-groom
> 2-filter artifacts
> 3-clip
> 4-quality trim
> 5-another clip
> 6-map

Regarding the workflows, step ordering should be consistent when re-saving unless you're moving things around on the screen.  If you're finding this not to be the case, please share the workflow with me and I'll look into it.

-Dannon


Reply | Threaded
Open this post in threaded view
|

Re: Analyzing Targeted Resequencing data with Galaxy

Mike Dufault
In reply to this post by Sean Davis
Sean, Anton and Jen,
 
Thanks for all of the suggestions (in separate replies) on how to better analyze my SureSelect captured Exome data. My original work-flow is below in the e-mail string.
 
Based on the suggestions, I plan to change my work-flow by raising my quality filter from 20 to 25-30 and my minimum coverage from 3x to ~20x. Since I am looking for a hereditary disease, I will use the Join function to find the SNPs that two family members have in common and so narrow down the candidates. Then I will use the Join function again with the SNPs from dbSNP build 131 to characterize them.
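
The two Join steps described above can be sketched as follows. This is a simplified illustration, not what the Galaxy Join tool does internally: Join works on interval overlap, while the sketch matches single-base intervals exactly. All rows, the set of known positions, and the function names are hypothetical placeholders.

```python
def snp_keys(rows):
    """Map (chrom, start, end) -> consensus base for interval-format SNP rows."""
    keys = {}
    for row in rows:
        chrom, start, end, _ref, consensus = row.split()[:5]
        keys[(chrom, int(start), int(end))] = consensus
    return keys


def shared_snps(member_a, member_b):
    """Positions where both family members carry the same consensus call."""
    a, b = snp_keys(member_a), snp_keys(member_b)
    return {pos: base for pos, base in a.items() if b.get(pos) == base}


def split_by_known(shared, known_positions):
    """Separate shared SNPs into those at known positions and the rest."""
    known = {p: b for p, b in shared.items() if p in known_positions}
    novel = {p: b for p, b in shared.items() if p not in known_positions}
    return known, novel


# Hypothetical SNP rows (chrom, start, end, reference, consensus).
member_a = ["chr1 100 101 A G", "chr1 200 201 C T", "chr1 300 301 G A"]
member_b = ["chr1 100 101 A G", "chr1 200 201 C Y", "chr1 300 301 G A",
            "chr1 400 401 T C"]

# Hypothetical stand-in for dbSNP coordinates downloaded from UCSC.
known_positions = {("chr1", 100, 101)}

shared = shared_snps(member_a, member_b)
known, novel = split_by_known(shared, known_positions)
print("shared:", shared)
print("known:", known)
print("novel:", novel)
```

Whether a heterozygous call in one member should count as matching a homozygous call in the other is a design decision the sketch leaves out; here only identical consensus calls are treated as shared.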
 
Sean suggested realignment around indels and potentially quality score recalibration. Is that even possible with Galaxy at the moment?
 
Where in the flow can I perform Indel analysis? Will I need to process my data separately for SNPs and Indel analysis, or can they be done sequentially in the same linear work-flow? I am still a little unsure of the best way to handle this.
 
Please let me know if you have any more suggestions or comments before I re-launch the analysis later this evening. Once I get a flow that works, I hope to be able to publish it for everyone to benefit from.
 
Thanks to the Galaxy team for an outstanding platform and support!
 
Mike
--- On Tue, 4/5/11, Sean Davis <[hidden email]> wrote:

From: Sean Davis <[hidden email]>
Subject: Re: [galaxy-user] Analyzing Targeted Resequencing data with Galaxy
To: "Mike Dufault" <[hidden email]>
Cc: "galaxy-user" <[hidden email]>
Date: Tuesday, April 5, 2011, 4:39 PM

Hi, Mike.  See my couple of comments below....

Sean

On Tue, Apr 5, 2011 at 2:22 PM, Mike Dufault <dufaultm@...> wrote:

Hi all,

 

Like many people on this e-mail chain, I have been looking for advice on how to process Exome data. Below, I have described in detail what I have done with the hope of getting some clarification. Hopefully it will be helpful to many of us!

 

I have SureSelect Exome captured data. The data was delivered to me as two separate files (/1) & (/2). Each file has ~33 million reads; 7.2 GB each. I am looking for SNPs from a family with cancer. Eventually I plan to compare the data from multiple members of the same family to find a related disease SNP.

 

Below is the workflow that I used to process my data. I adapted it from the screencast titled "Mapping Illumina Reads: Paired Ends Example." I used all of the same default parameters as in the screencast.

 

At the end of step 13, I had ~4,700,000 SNPs. This seemed like a lot so in step 14, I filtered on column 7 (c7) which I believe is the Quality SNP value. I set the filter as C7>=1 to remove all of the 0 (zero) values for Quality SNP. I figured that if they have a value of zero, they must not be real SNPs. This left me with ~180,000 SNPs.

 

1: Get Data: Illumina 1.3+ file (/1)

2: Get Data: Illumina 1.3+ file (/2)

3: FASTQ Groomer on data 1

4: FASTQ Groomer on data 2

5: FASTQ Summary Statistics on data 3

6: FASTQ Summary Statistics on data 4

7: Box plot on data 5

8: Box plot on data 6

9: Map with Bowtie for Illumina on data 4 and data 3: mapped reads


This might not be the best choice, as bowtie does not allow gapped alignment.  See here for a discussion of indels and SNV calling:


You will probably also want to consider local realignment around indels and potentially quality score recalibration.  
 

10: Filter Sam on data 9

11: SAM-to-BAM on data 10: converted to BAM

12: Generate pileup on data 11: converted pileup

13: Filter pileup on data 12

14: Filter data on 13 (c7>=1)

15: Sort on data 14 (c7; descending order)
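
Steps 14 and 15 above (keep rows where c7 >= 1, then sort descending on c7) amount to the following. This is a minimal Python sketch, assuming the filtered pileup is a tab-separated table; the function name and the toy rows are hypothetical, and what column 7 actually holds depends on the pileup layout, so it is worth checking against the tool's documentation before relying on it as the SNP quality.

```python
def filter_and_sort(rows, column=7, minimum=1):
    """Keep rows whose 1-based `column` is >= `minimum`, sorted descending on it."""
    idx = column - 1  # Galaxy's c7 is 1-based; Python lists are 0-based
    kept = [r for r in rows if float(r[idx]) >= minimum]
    return sorted(kept, key=lambda r: float(r[idx]), reverse=True)


# Toy rows; only the value in the seventh field matters for this example.
lines = [
    "chr1\t100\tA\tG\t30\t30\t0\t12",
    "chr1\t250\tC\tT\t45\t45\t37\t20",
    "chr2\t40\tG\tA\t60\t60\t60\t25",
]
rows = [line.split("\t") for line in lines]

result = filter_and_sort(rows, column=7, minimum=1)
for r in result:
    print("\t".join(r))
```

With these toy rows the first line is dropped because its seventh field is 0, and the remaining two come out with the chr2 row first.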

 

First, if anyone has ideas on how to improve the workflow, I would be open to suggestions; especially from people experienced with Galaxy.

 

Second, I am concerned that many/most of the SNPs are known. Should I filter my data against the known SNPs in dbSNP? If so, how can I do this in Galaxy (in Bowtie?)


Keep in mind that, depending on the version of dbSNP, there are many cancer-associated SNPs contaminating the database.

 

Third, as suggested in the screencast, I did not trim or filter my FASTQ Groomed data because I was interested in SNPs and I could filter on Quality later in the workflow. Would implementing a filtering step on phred quality (~20) at this step save me the step of filtering later on? Currently it takes multiple hours (~16) to process the data from start to finish. Would filtering at this step reduce the amount of time that it takes to process my data? Presumably, there would be less data to process. I do this on the AWS Cloud and time is money!

 


Adding a gapped alignment algorithm, indel realignment, and quality recalibration can easily increase this time to a couple of days per sample.
 

Fifth, when using Galaxy on the AWS cloud, does adding additional cores or adding High CPU (or both) shorten the time to process the data? When I set up extra cores, it appeared that some of them were idle, and I don't want to pay for idle cores. If anyone could share information on how best to manage the cloud, it would be appreciated.

 

Finally, what is the difference between “stopping” an instance and “terminating” an instance on the cloud? Would I still get charged by AWS if I just stop an instance? Any clarification in this area would also be much appreciated. Again, time is money!

I hope this helps many of us!

 

Unfortunately, I will not be in Pitt to ask these questions in person.

 

Thanks in advance!!!

 

Mike



Re: Analyzing Targeted Resequencing data with Galaxy

Sean Davis


On Fri, Apr 8, 2011 at 7:42 AM, Mike Dufault <[hidden email]> wrote:
Sean, Anton and Jen,
 
Thanks for all of the suggestions (in separate replies) on how to better analyze my SureSelect captured Exome data. My original workflow is below in the e-mail string.
 
Based on the suggestions, I plan to change my workflow by increasing my quality filter from 20 to 25-30 and increasing my minimum coverage from 3x to ~20x. Since I am looking for a hereditary disease, I will use the Join function to narrow the results down to the SNPs that the samples from two family members have in common. Then I will use the Join function again with the SNPs from dbSNP build 131 to characterize the SNPs.

Since you are looking only for variants in common, you can be more lenient per sample (that is, tolerate more false positives), so I would not raise the coverage threshold that high and would rely more on the SNP quality filter.
 
 
Sean suggested realignment around indels and potentially quality score recalibration. Is that even possible with Galaxy at the moment?
 

I do not think so.
 
Where in the flow can I perform Indel analysis? Will I need to process my data separately for SNPs and Indel analysis, or can they be done sequentially in the same linear workflow? I am still a little unsure of the best way to handle this.
 

This depends on the software being used.  Pileup can call both indels and SNVs. 
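
If a single pileup holds both kinds of calls, they can be separated afterwards. The sketch below is a minimal Python illustration, assuming the samtools consensus pileup layout in which indel rows carry `*` in the reference-base column (column 3) and the consensus call sits in column 4; the function name and the toy lines are hypothetical, so check a few rows of your own file before relying on it.

```python
def split_pileup(lines):
    """Split consensus-pileup lines into (SNV rows, indel rows).

    Assumes column 3 is the reference base ('*' on indel rows) and
    column 4 is the consensus call. Rows where the consensus equals the
    reference are treated as non-variant and dropped.
    """
    snvs, indels = [], []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip malformed or empty lines
        ref, consensus = fields[2], fields[3]
        if ref == "*":
            indels.append(fields)
        elif consensus.upper() != ref.upper():
            snvs.append(fields)
    return snvs, indels


# Toy lines: one SNV, one reference-matching site, one insertion.
lines = [
    "chr1\t100\tA\tG\t30\t30\t60\t12",
    "chr1\t101\tC\tC\t40\t0\t60\t12",
    "chr1\t156\t*\t+AG/+AG\t71\t252\t99\t11",
]

snvs, indels = split_pileup(lines)
print(len(snvs), len(indels))
```

Keeping the two lists separate makes it possible to apply different quality and coverage thresholds to SNVs and indels without running the upstream steps twice.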
 
Please let me know if you have any more suggestions or comments before I re-launch the analysis later this evening. Once I get a flow that works, I hope to be able to publish it for everyone to benefit from.
 
Thanks to the Galaxy team for an outstanding platform and support!
 
Mike