AED calculations using the MAKER pipeline

AED calculations using the MAKER pipeline

vkrishna
Hi,

We have been using the MAKER pipeline here at JCVI to calculate AED scores by feeding in our annotation set as `model_gff` and the protein and EST evidence as `protein_gff` and `est_gff` respectively. Here is the issue we are having:

When running the above pipeline with protein2genome and est2genome evidence generated earlier by MAKER, there are no problems calculating the AED score. Normally this pipeline takes a little over 12 hours to complete.

But if we use our own evidence (AAT- and Genewise-aligned proteins for `protein_gff` and PASA-assembled ESTs for `est_gff`), the same pipeline runs very slowly, and the intermediary *.gff.ann file has many chunks (separated by '###') that are completely empty. Our evidence is formatted in the same way as est2genome or protein2genome output (a GFF file with "expressed_sequence_match::match_part" or "protein_match::match_part" features, respectively).
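
One way to sanity-check that externally produced evidence follows the same parent/`match_part` layout is to tally feature types per source and look for `Parent` values that never appear as an `ID`. The following is a minimal Python sketch, assuming standard nine-column GFF3; the function names and the sample records are illustrative and are not part of MAKER.

```python
from collections import Counter


def parse_attributes(column9):
    """Turn a GFF3 attribute string such as 'ID=x;Parent=y' into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def summarize_evidence(lines):
    """Count (source, type) pairs and list Parent IDs that are never defined."""
    type_counts = Counter()
    ids = set()
    parents = []
    for line in lines:
        if line.startswith("##FASTA"):
            break  # sequence section follows; no more features
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        type_counts[(cols[1], cols[2])] += 1
        attrs = parse_attributes(cols[8])
        if "ID" in attrs:
            ids.add(attrs["ID"])
        if "Parent" in attrs:
            parents.extend(attrs["Parent"].split(","))
    # GFF3 does not require ordering, so parents are checked after the full pass.
    orphans = sorted({p for p in parents if p not in ids})
    return type_counts, orphans


# Hypothetical records for illustration only.
sample = [
    "##gff-version 3\n",
    "chr1\tgenewise\tprotein_match\t100\t500\t.\t+\t.\tID=m1;Name=P1\n",
    "chr1\tgenewise\tmatch_part\t100\t200\t.\t+\t.\tID=m1.1;Parent=m1\n",
    "chr1\tgenewise\tmatch_part\t300\t500\t.\t+\t.\tID=m2.1;Parent=m2\n",
]
counts, orphans = summarize_evidence(sample)
for (source, ftype), n in sorted(counts.items()):
    print(source, ftype, n)
print("Parent IDs with no matching ID:", orphans)
```

To check a real file, pass an open file handle instead of the sample list, for example `summarize_evidence(open(path))`. An empty orphan list and the expected feature types are necessary but not sufficient for the file to behave like MAKER's own output.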

The input to my pipeline is 8 chromosomes and ~2200 scaffolds, and I use the default value of the `max_dna_len` parameter, which splits the large assemblies into chunks.
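
As a rough guide to how many chunks a single sequence produces, the arithmetic can be sketched in Python. The value 100,000 bp for `max_dna_len` is an assumption here (check your own `maker_opts.ctl`), the sequence lengths are made up, and the estimate ignores any boundary adjustments MAKER may apply when it splits a sequence.

```python
import math


def chunk_count(seq_len, max_dna_len=100_000):
    """Estimate the number of chunks for one sequence (simple ceiling division)."""
    return math.ceil(seq_len / max_dna_len)


# Hypothetical lengths: a 50 Mb chromosome and a 40 kb scaffold.
print(chunk_count(50_000_000))
print(chunk_count(40_000))
```

Under these assumptions a chromosome-sized sequence yields hundreds of chunks, while a short scaffold is processed as a single chunk.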

The master_datastore.log shows me that the scaffolds run through without any issues and that the chromosomes are still being processed.
For any of the chromosomes, the 'run.log' file, one level above 'theVoid', shows how many "final.section" jobs were started and how many finished, and for all of the chromosomes it tells me that everything that was started has finished. The 'log.child.*' files within `theVoid` are all empty. Also within `theVoid`, the "raw.section" and "evidence_*.gff" files are not empty. One surprising thing is that, of all the "final.section" files, only the one pertaining to the last chunk is very large (proportional to the size of the evidence); the rest are all exactly the same size (331 bytes).
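
The size pattern described here can be tabulated with a short script. This is a sketch only: the glob pattern `*final.section*` is an assumption about how the files are named inside `theVoid`, and the helper names are hypothetical, so adjust both to what is actually on disk.

```python
from collections import Counter
from pathlib import Path


def section_sizes(void_dir, pattern="*final.section*"):
    """Map each matching file name in void_dir to its size in bytes."""
    return {p.name: p.stat().st_size for p in sorted(Path(void_dir).glob(pattern))}


def split_by_common_size(sizes):
    """Return the most common size and the files whose size differs from it."""
    if not sizes:
        return None, {}
    common_size, _ = Counter(sizes.values()).most_common(1)[0]
    outliers = {name: size for name, size in sizes.items() if size != common_size}
    return common_size, outliers


def report(void_dir):
    """Print a short summary for one theVoid directory."""
    sizes = section_sizes(void_dir)
    common_size, outliers = split_by_common_size(sizes)
    print(f"{len(sizes)} files, most common size: {common_size} bytes")
    for name, size in outliers.items():
        print(f"  differs: {name} ({size} bytes)")
```

Calling `report()` on the `theVoid` directory of one chromosome would, in the situation described, show a single most common size of 331 bytes and one large outlier for the last chunk.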

I'm running MAKER in MPI mode spawning 48 processes on a high memory machine with 64 available cores and 1TB of RAM.

I hope I've been able to explain my situation clearly in this email.

Any help is appreciated.
Thank you.

Vivek
_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Re: AED calculations using the MAKER pipeline

Carson Holt-3
In the current MAKER download there is an issue when using GFF3
passthrough: everything is done at the very last step.  This of course
leads to a memory spike and a very slow last step, which seems to be
similar to what you are describing. It should be resolved in what will
become version 2.28. I can give you access to the pre-release code so you
can check that the issue is resolved for you.  I'll send details in a
separate e-mail.

Also, the ### lines are printed after every ~100,000 bp of assembly
processed by MAKER.  You can ignore them, but they do have a meaning in
GFF3: everything between two ### lines is fully resolved.  This
allows programs that read GFF3 to parallelize file loading, or to load only
sections of a file, because they can rapidly identify "safe chunks".  Without
them, the entire file must be loaded into memory in order to be certain
that all parts of a feature are present (as GFF3 does not require sorting
or any particular order).
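
A reader that takes advantage of the ### directive can be sketched in a few lines of Python. This is an illustration of the idea rather than MAKER's own code; the function name and the sample lines are made up. Empty chunks are yielded as well, which makes it easy to count how many sections of a file contain no features.

```python
def iter_resolved_chunks(lines):
    """Yield lists of feature lines, one list per ###-delimited section."""
    chunk = []
    for line in lines:
        stripped = line.rstrip("\n")
        if stripped == "###":
            # Everything collected so far is fully resolved and can be handed off.
            yield chunk
            chunk = []
        elif stripped.startswith("##FASTA"):
            break  # embedded sequence follows; no more features
        elif stripped and not stripped.startswith("#"):
            chunk.append(stripped)
    if chunk:
        yield chunk  # trailing features with no closing ###


# Hypothetical input for illustration only.
sample = [
    "##gff-version 3",
    "chr1\tsrc\tmatch\t1\t10\t.\t+\t.\tID=a",
    "###",
    "###",
    "chr1\tsrc\tmatch\t20\t30\t.\t+\t.\tID=b",
    "chr1\tsrc\tmatch_part\t20\t30\t.\t+\t.\tID=b.1;Parent=b",
    "###",
]
chunks = list(iter_resolved_chunks(sample))
print("features per chunk:", [len(c) for c in chunks])
print("empty chunks:", sum(1 for c in chunks if not c))
```

Because each yielded chunk is self-contained, a caller can process it and discard it before reading further, instead of holding the whole file in memory.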

log.child files will always be empty unless you run analyses such as SNAP
or BLAST.

Thanks,
Carson

Re: AED calculations using the MAKER pipeline

Town, Christopher D.
Thanks. Is there any way of guesstimating when this final step might be completed? We are in a time crunch here to get this analysis finished and the data/annotation out.

Best

Chris

Re: AED calculations using the MAKER pipeline

Mark Yandell
In reply to this post by vkrishna
Hi Vivek,

Sounds like it may be a problem with the protein2genome GFF file. Can you send us a sample file that is known to produce the problem?

cheers,

--mark


Mark Yandell
Professor of Human Genetics
H.A. & Edna Benning Presidential Endowed Chair
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
ph:801-587-7707

Re: AED calculations using the MAKER pipeline

Mark Yandell
In reply to this post by Town, Christopher D.
Whoops, looks like Carson has got this one already. Thanks!



Re: AED calculations using the MAKER pipeline

Carson Holt-2
In reply to this post by Town, Christopher D.
In the few cases where I found this (if it is the same issue you are
experiencing), it was very much dependent on the total size of the
evidence database and the length of the contigs.  For me it took about
25-50% longer, but used up 10-15x as much RAM (primarily because the
contigs were very long, > 50 Mb each).  The issue was unnoticeable on the
short contigs that are more typical of de novo annotation.

Thanks,
Carson