Further split genome questions

Jeanne Wilbrandt

Hi Carson,

I ran into more puzzling behavior running maker 2.31 on a genome that is split into
20 parts, using the -g flag and the same basename.
Most of the jobs ran simultaneously on the same node. 17 seemed to finish normally, while
the remaining three appeared to be stalled and produced 0 B of output. Do you have any
suggestion why this is happening?

After I stopped these stalled jobs, I checked the index.log and found that of the 38,384
scaffolds mentioned, 154 appear only once in the log. The surprise is that 2/3 of these
appear only as FINISHED (the rest only started). There are no models for these 'finished'
scaffolds stored in the .db, and they are distributed over all parts of the genome (i.e.,
each of the 20 jobs contained scaffolds that 'did not start' but 'finished').
Should this be a cause for concern?
It might be an NFS lock problem, as NFS is heavily loaded, but the NFS files look good, so
we suspect something fishy is going on...

Hope you can help,
best wishes,
Jeanne Wilbrandt

zmb // ZFMK // University of Bonn



Re: Further split genome questions

Daniel Ence-2
Hi Jeanne, what’s the average length of those 154 scaffolds that only appeared once in the log? Is the length pretty consistent?

~Daniel



Re: Further split genome questions

Carson Holt-2
In reply to this post by Jeanne Wilbrandt
If you are starting and restarting, or running multiple jobs, then the log can be partially rebuilt. On a rebuild, only the FINISHED entries are added. If there is a GFF3 result file for the contig, then it is marked FINISHED. FASTA files will only exist for contigs that have gene models, and small contigs will rarely contain models.
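
A quick way to list scaffolds that appear only once in the log would be something like this (a rough sketch, assuming the master datastore index log is tab-delimited with the contig name in the first column and STARTED/FINISHED in the last; the file name is a placeholder):

  $ cut -f1 my_genome_master_datastore_index.log | sort | uniq -c | awk '$1 == 1' | wc -l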

--Carson

Sent from my iPhone


Re: Further split genome questions

Jeanne Wilbrandt
In reply to this post by Jeanne Wilbrandt

Aha, so that explains it.
Daniel, the average is 5,930.37 bp, ranging from ~50 bp to more than 60,000 bp, with roughly
half of the sequences shorter than 3,000 bp.

What do you think about this weird 'running but not really doing anything' behavior?


Thanks a lot!
Jeanne




Re: Further split genome questions

Carson Holt-2
I think the freezing is because you are starting too many simultaneous jobs. You should try to use MPI to parallelize instead. The concurrent-job way of doing things can start to cause problems if you are running 10 or more jobs in the same directory. You could try splitting them into different directories.
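
An MPI launch would look something like this (a rough sketch, assuming an MPI-enabled MAKER build; the process count, genome file, and base name are placeholders):

  $ mpiexec -n 40 maker -g genome.fasta -base my_genome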

--Carson

Sent from my iPhone


Re: Further split genome questions

Jeanne Wilbrandt
In reply to this post by Jeanne Wilbrandt


We are using MPI as well; each of the 20 parts gets assigned 4 threads. Our admin reports,
however, that the processes seem to spawn more threads than they are allowed. It is
not Blast (which is set to 1 cpu in the opts.ctl). Do you have a suggestion why?

If I start the jobs in the same directory, how can I make sure they write to the same
directory (which, I think, is required to put the pieces together in the end)? Does -basename
take paths?



Re: Further split genome questions

Carson Holt-2
Is your admin counting processes or CPU usage? Each system call creates a separate process, so you can expect multiple processes per instance, but only a single CPU's worth of usage. Use different directories if you are running that many jobs. You can concatenate the separate results when you're done; use the gff3_merge script to concatenate the separate GFF3 files generated by the separate jobs.
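
The separate-directory layout could look roughly like this (a sketch, assuming the three .ctl files are available in each directory; all names are placeholders):

  $ mkdir part_01 && cd part_01            # one directory per genome part
  $ maker -g ../genome_part_01.fasta -base part_01
  $ cd .. && mkdir part_02 && cd part_02   # ...and likewise for the other parts
  $ maker -g ../genome_part_02.fasta -base part_02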

--Carson

Sent from my iPhone


Re: Further split genome questions

Jeanne Wilbrandt
In reply to this post by Jeanne Wilbrandt

Our admin counts processes. Do I understand you right that one CPU handles several
processes?

I'm still confused by the different directories (and I made a mistake when asking last
time; I wanted to say 'If I do NOT start the jobs in the same directory...').
So, if I start each piece of the genome in its own directory (for example), then it gets a
unique basename (because the output will be separate from all other pieces anyway), and I
will not run dsindex but instead use gff3_merge for each piece's output and then once
again to merge all resulting gff3 files?

Hope I got you right :)

Thanks for your help!
Jeanne




Re: Further split genome questions

Daniel Ence
Hi Jeanne,

I believe that's right. You can pass gff3_merge either a list of gff3 files or a MAKER-created datastore index file. To compile the pieces for each of your different runs, you would give gff3_merge the datastore index file. To put those resulting gff3 files together, you would pass gff3_merge the list of gff3 files that you want to merge.
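
The two-step merge would look roughly like this (a sketch; file names and paths are placeholders, with -d pointing at each run's datastore index and -o naming the output file):

  $ # step 1: one merged GFF3 per run, built from each run's datastore index
  $ gff3_merge -d part_01.maker.output/part_01_master_datastore_index.log -o part_01.all.gff3
  $ gff3_merge -d part_02.maker.output/part_02_master_datastore_index.log -o part_02.all.gff3
  $ # step 2: combine the per-run GFF3 files into one
  $ gff3_merge -o merged_all.gff3 part_01.all.gff3 part_02.all.gff3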

~Daniel



Daniel Ence
Graduate Student
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

Re: Further split genome questions

Carson Holt-2
In reply to this post by Jeanne Wilbrandt
Yes. One CPU will have several processes; most are helper processes that
will use 0% CPU almost all of the time (for example, there is a shared
variable manager process that launches with MAKER but also shows up as
'maker' under top, because it is technically a child of MAKER and not a
separate script). Also, system calls will launch a new process that uses
the CPU while the calling process drops to 0% CPU until the call
finishes.

Yes, your explanation is correct. You then use gff3_merge to merge the
GFF3 files.
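
To see the difference between the process count and the actual CPU usage on a node, something like this should do (a sketch using standard Linux tools, not MAKER-specific; many 'maker' processes will be listed, but most should sit near 0% CPU):

  $ ps -u $USER -o pid,ppid,pcpu,comm | grep maker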

--Carson




Re: Further split genome questions

Jeanne Wilbrandt
In reply to this post by Jeanne Wilbrandt

Thank you so much!

However, I'm still struggling, I'm afraid: I tried this 'two-step merging' approach with
a subset of scaffolds and got duplicate IDs.

Here is what I did:
- divided the input scaffolds into two files
- ran maker separately on these files (-> separate output dirs)
-- additional input: the maker-generated gff3 from a previous (single) run
-- repeatmasking, snaphmm, gmhmm, augustus_species are given
-- map_forward=0 / 1 (I tried both, to the same effect)
- ran gff3_merge twice, once per run, using the index log
- ran gff3_merge on these two gff3 files

$ grep -P "\tgene\t" merged_all.gff3 | cut -f9 | cut -f1 -d ";" | sort | uniq -c | sort -n | tail
      2 ID=snap_masked-scf7180005140699-processed-gene-0.19
      2 ID=snap_masked-scf7180005140699-processed-gene-0.22
      2 ID=snap_masked-scf7180005140699-processed-gene-1.36
      2 ID=snap_masked-scf7180005140713-processed-gene-0.4
      2 ID=snap_masked-scf7180005140744-processed-gene-0.4
      2 ID=snap_masked-scf7180005140744-processed-gene-0.6
      2 ID=snap_masked-scf7180005140754-processed-gene-0.14
      2 ID=snap_masked-scf7180005140754-processed-gene-0.15
      2 ID=snap_masked-scf7180005140754-processed-gene-0.19
      2 ID=snap_masked-scf7180005181475-processed-gene-0.3

$ grep snap_masked-scf7180005181475-processed-gene-0.3 merged_all.gff3 | grep "\sgene"
scf7180005181475 maker gene 9050 9385 . - . ID=snap_masked-scf7180005181475-processed-gene-0.3;Name=snap_masked-scf7180005181475-processed-gene-0.3
scf7180005181475 maker gene 846 1088 . - . ID=snap_masked-scf7180005181475-processed-gene-0.3;Name=snap_masked-scf7180005181475-processed-gene-0.3

- found duplicates! i.e., the same ID for gene annotations in different areas of the same
scaffold (of 655 gene annotations, 51 appear twice)
-- this happens not only with gene annotations, but also with CDS and mRNA annotations, as far as I can
see (here, in one example, non-overlapping but nearby CDS snippets got the same ID).


I suspected this might have to do with the map_forward flag, but I get the same problem
again (with genes at the same locations).
I attached one of the ctl files in case you want to have a look; the other is
analogous. Do you need anything else?

What did I miss? This should not happen, right?





Attachment: maker_opts_Lclav_splitrun_problem_01_mapfwd.ctl (7K)

Re: Further split genome questions

Carson Holt-2
What version of MAKER are you using? I'd also need to see the GFF3 files
before the merge. You may also need to turn off map_forward, since you are
passing in GFF3 with MAKER names, creating new models with MAKER names, and
then moving names from old models forward onto new ones (which may force
names to be used twice).
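
In the meantime, a quick check of each per-run GFF3 for duplicate gene IDs before merging could look like this (a sketch reusing the pipeline from your previous message; the file name is a placeholder):

  $ grep -P "\tgene\t" part_01.all.gff3 | cut -f9 | cut -f1 -d ";" | sort | uniq -d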

--Carson



Re: Further split genome questions

Jeanne Wilbrandt
In reply to this post by Jeanne Wilbrandt

It is version 2.31.

My first try was done with map_forward=0, and (I just noticed) the duplicates are already
present in the separate gff3s in this case as well (one is attached).

Could this have something to do with the gff3 from the first run that I fed it?




On Thu, 14 Aug 2014 15:46:44 +0000
 Carson Holt <[hidden email]> wrote:

>What version of MAKER are you using? I'd also need to see the GFF3 files
>before the merge.  You may also need to turn off map_forward since you are
>passing in GFF3 with MAKER names, creating new models with MAKER names and
>then moving names from old models forward onto new ones (which may force
>names to be used twice).
>
>--Carson
>
>
>On 8/14/14, 9:40 AM, "Jeanne Wilbrandt" <[hidden email]> wrote:
>
>>
>>Thank you so much!
>>
>>However, I'm still, struggling, I'm afraid: I tried this 'two-step
>>merging' approach with
>>a subset of scaffolds and got duplicate IDs.
>>
>>Here is what I did:
>>- divided input scaffolds in two files
>>- run maker separately on these files (-> separate output dirs)
>>-- additional input: maker-generated gff3 from previous (singular) run
>>-- repeatmasking, snaphmm, gmhmm, augustus_species are given
>>-- map_forward=0 / 1 (I tried both, to the same effect)
>>- gff3_merge two times using index-log
>>- gff3_merge these two gff3 files
>>
>>$
>>grep -P "\tgene\t" merged_all.gff3 | cut -f9 | cut -f1 -d ";" | sort |
>>uniq -c | sort -n
>>| tail
>>      2 ID=snap_masked-scf7180005140699-processed-gene-0.19
>>      2 ID=snap_masked-scf7180005140699-processed-gene-0.22
>>      2 ID=snap_masked-scf7180005140699-processed-gene-1.36
>>      2 ID=snap_masked-scf7180005140713-processed-gene-0.4
>>      2 ID=snap_masked-scf7180005140744-processed-gene-0.4
>>      2 ID=snap_masked-scf7180005140744-processed-gene-0.6
>>      2 ID=snap_masked-scf7180005140754-processed-gene-0.14
>>      2 ID=snap_masked-scf7180005140754-processed-gene-0.15
>>      2 ID=snap_masked-scf7180005140754-processed-gene-0.19
>>      2 ID=snap_masked-scf7180005181475-processed-gene-0.3
>>
>>$ grep snap_masked-scf7180005181475-processed-gene-0.3 merged_all.gff3 |
>>grep "\sgene"
>>scf7180005181475 maker gene 9050 9385 . - . ID=snap_masked-scf718000518147
>>5-processed-gene-0.3;Name=snap_masked-scf7180005181475-processed-gene-0.3
>>scf7180005181475 maker gene 846 1088 . - . ID=snap_masked-scf7180005181475
>>-processed-gene-0.3;Name=snap_masked-scf7180005181475-processed-gene-0.3
>>
>>- found duplicates! i.e. the same ID for gene annotations in different
>>areas of the same
>>scaffold (of 655 gene annotations, 51 appear twice)
>>-- this happens not only with gene, but also CDS and mRNA annotations, as
>>far as I can
>>see (here, in one example, non-everlapping but close CDS snippets got the
>>same ID).
>>
>>
>>I suspected this might have to do with the map_forward flag, but I get
>>the same problem
>>again (with genes at the same locations).
>>I attached one of the ctl files for you in case you want to have a look,
>>the other is
>>analogous. Do you need something else?
>>
>>What did I miss? This should not happen, right?
>>
>>
>>
>>
>>On Wed, 13 Aug 2014 15:52:34 +0000
>> Carson Holt <[hidden email]> wrote:
>>>Yes. One cpu will have several processes, most are helper processes that
>>>will use 0% CPU almost all of the time (for example there is a shared
>>>variable manager process that will launch with MAKER but will also be
>>>called 'maker' under top because it is technically its child and not a
>>>separate script).  Also system calls will launch a new process that will
>>>use all CPU while the process calling it will drop to 0% CPU until it
>>>finishes.
>>>
>>>Yes.  Your explanation is correct. You then use gff3_merge to merge the
>>>GFF3 file.
>>>
>>>--Carson
>>>
>>>
>>>
>>>On 8/13/14, 3:32 AM, "Jeanne Wilbrandt" <[hidden email]> wrote:
>>>
>>>>
>>>>Our admin counts processes. Do I understand you right, that one CPU
>>>>handles several
>>>>processes?
>>>>
>>>>I'm still confused by the different directories (and I made a mistake
>>>>when asking last
>>>>time, I wanted to say 'If I do NOT start the jobs in the same
>>>>directory...').
>>>>So, if I start each piece of a genome in its own directory (for
>>>>example),
>>>>then it gets a
>>>>unique basename (because the output will be separate from all other
>>>>pieces anyway) and I
>>>>will not run dsindex but instead use gff3_merge for each piece's output
>>>>and then once
>>>>again to merge all resulting gff3-files?
>>>>
>>>>Hope I got you right :)
>>>>
>>>>Thanks for your help!
>>>>Jeanne
>>>>
>>>>
>>>>
>>>>On Wed, 6 Aug 2014 15:45:56 +0000
>>>> Carson Holt <[hidden email]> wrote:
>>>>>Is your admin counting processes or cpu usage?  Because each system
>>>>>call
>>>>>creates a
>>>>>separate process, so you can expect multiple processes (each system
>>>>>call
>>>>>generates a new
>>>>>process) but only a single cpu of usage per instance.  Use different
>>>>>directories if you
>>>>>are running that many jobs.  You can concatenate the separate results
>>>>>when you're done.
>>>>> Use gff3_merge script to help concatenate the separate GFF3 files
>>>>>generated from
>>>>>separate jobs.
>>>>>
>>>>>--Carson
>>>>>
>>>>>Sent from my iPhone
>>>>>
>>>>>> On Aug 6, 2014, at 9:33 AM, "Jeanne Wilbrandt" <[hidden email]>
>>>>>>wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> We are using MPI as well, each of the 20 parts gets assigned 4
>>>>>>threads. Our admin
>>>>>reports
>>>>>> however, that the processes seem to assemble more threads than they
>>>>>>are allowed. It is
>>>>>> not Blast (which is set to 1 cpu in the opts.ctl). Do you have a
>>>>>>suggestion why?
>>>>>>
>>>>>> If I start the jobs in the same directory, how can I make sure they
>>>>>>write to the same
>>>>>> directory (as, I think is required to put the pieces together in the
>>>>>>end?)? Does
>>>>>-basename
>>>>>> take paths?
>>>>>>
>>>>>>
>>>>>> On Wed, 6 Aug 2014 15:12:50 +0000
>>>>>> Carson Holt <[hidden email]> wrote:
>>>>>>> I think the freezing is because you are starting too many
>>>>>>>simultaneous jobs.  You
>>>>>should
>>>>>>> try and use MPI to parallelize instead.  The concurrent job way of
>>>>>>>doing things can
>>>>>>> start to cause problems if you are running 10 or more jobs in the
>>>>>>>same directory. You
>>>>>>> could try splitting them into different directories.
>>>>>>>
>>>>>>> --Carson
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>>> On Aug 6, 2014, at 9:01 AM, "Jeanne Wilbrandt"
>>>>>>>><[hidden email]>
>>>>>>>>wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> aha, so this explains that.
>>>>>>>> Daniel, the average is 5930.37 bp, but ranging from ~ 50 to more
>>>>>>>>than 60,000,
>>>>>roughly
>>>>>>>> half of the sequences being shorter than 3,000 bp.
>>>>>>>>
>>>>>>>> What do you think about this weird 'I am running but not really
>>>>>>>>doing
>>>>>>> anything'-behavior?
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks a lot!
>>>>>>>> Jeanne
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, 6 Aug 2014 14:16:52 +0000
>>>>>>>> Carson Holt <[hidden email]> wrote:
>>>>>>>>> If you are starting and restarting, or running multiple jobs then
>>>>>>>>>the log can be
>>>>>>>>> partially rebuilt.  On rebuild only the FINISHED entries are
>>>>>>>>>added.
>>>>>>>>> If there is a
>>>>>>> GFF3
>>>>>>>>> result file for the contig, then it is FINISHED. FASTA files will
>>>>>>>>>only exist for
>>>>>the
>>>>>>>>> contigs that have gene models. Small contigs will rarely contain
>>>>>>>>>models.
>>>>>>>>>
>>>>>>>>> --Carson
>>>>>>>>>
>>>>>>>>> Sent from my iPhone
>>>>>>>>>
>>>>>>>>>> On Aug 6, 2014, at 6:40 AM, "Jeanne Wilbrandt"
>>>>>>>>>><[hidden email]> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi Carson,
>>>>>>>>>>
>>>>>>>>>> I ran into more conspicuous behavior running maker 2.31 on a
>>>>>>>>>>genome which is split
>>>>>>>>> into
>>>>>>>>>> 20 parts, using the -g flag and the same basename.
>>>>>>>>>> Most of the jobs ran simultaneously on the same node, 17 seemed
>>>>>>>>>>to
>>>>>>>>>>finish
>>>>>normally,
>>>>>>>>> while
>>>>>>>>>> the remaining three seemed to be stalled and produced 0B of
>>>>>>>>>>output. Do you have
>>>>>any
>>>>>>>>>> suggestion why this is happening?
>>>>>>>>>>
>>>>>>>>>> After I stopped these stalled jobs, I checked the index.log and
>>>>>>>>>>found that of
>>>>>38.384
>>>>>>>>>> mentioned scaffolds, 154 appear only once in the log. The
>>>>>>>>>>surprise
>>>>>>>>>>is, that 2/3 of
>>>>>>>>> these
>>>>>>>>>> only appear as FINISHED (the rest only started). There are no
>>>>>>>>>>models for these
>>>>>>>>> 'finished'
>>>>>>>>>> scaffolds stored in the .db and they are distributed over all
>>>>>>>>>>parts of the genome
>>>>>>>>> (i.e.,
>>>>>>>>>> each of the 20 jobs contained scaffolds that 'did not start' but
>>>>>>>>>>'finished')
>>>>>>>>>> Should this be an issue of concern?
>>>>>>>>>> It might be a NFS lock problem, as NFS is heavily loaded, but the
>>>>>>>>>>NFS files look
>>>>>>> good,
>>>>>>>>> so
>>>>>>>>>> we suspect something fishy going on...
>>>>>>>>>>
>>>>>>>>>> Hope you can help,
>>>>>>>>>> best wishes,
>>>>>>>>>> Jeanne Wilbrandt
>>>>>>>>>>
>>>>>>>>>> zmb // ZFMK // University of Bonn
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> maker-devel mailing list
>>>>>>>>>> [hidden email]
>>>>>>>>>>
>>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-la
>>>>>>>>>>b.
>>>>>>>>>>org
>>>>>>
>>>>
>>>
>>>
>>
>
>

_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

splitrun_problem_01_all.gff3 (6M) Download Attachment
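
For reference, here is a minimal sketch of the two-step merge workflow described above, assuming MAKER's bundled gff3_merge script and the -g/-base command line flags; the directory names, basenames, and file names below are placeholders, not Jeanne's actual setup:

$ # one working directory per genome piece, each holding that piece's fasta and copies of the three ctl files
$ (cd part1 && maker -g part1.fasta -base part1)
$ (cd part2 && maker -g part2.fasta -base part2)
$
$ # step 1: merge each run's per-contig GFF3s via its master datastore index log
$ gff3_merge -d part1/part1.maker.output/part1_master_datastore_index.log -o part1.all.gff3
$ gff3_merge -d part2/part2.maker.output/part2_master_datastore_index.log -o part2.all.gff3
$
$ # step 2: merge the per-part GFF3s into one file
$ gff3_merge -o merged_all.gff3 part1.all.gff3 part2.all.gff3
$
$ # quick duplicate-ID check on the result (same idea as the grep pipeline quoted above)
$ grep -P "\tgene\t" merged_all.gff3 | cut -f9 | cut -f1 -d ";" | sort | uniq -c | sort -n | tail

A related check for the earlier stalled-job question is to list scaffolds that appear only once in a master datastore index log (the layout assumed here is the contig name in the first tab-separated column, the STARTED/FINISHED status in the last):

$ cut -f1 part1/part1.maker.output/part1_master_datastore_index.log | sort | uniq -c | awk '$1 == 1 {print $2}'
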
Reply | Threaded
Open this post in threaded view
|

Re: Further split genome questions

Carson Holt-2
Which 2.31?  Current is 2.31.6.

--Carson


On 8/14/14, 9:53 AM, "Jeanne Wilbrandt" <[hidden email]> wrote:

>
>It is version 2.31.
>
>My first try was done with map_forward=0, and (I just noticed) the
>duplicates are already present in the separate gff3s in that case as well
>(one is attached).
>
>Does this have something to do with the first-run gff3 I fed it?
>
>
>
>
>On Thu, 14 Aug 2014 15:46:44 +0000
> Carson Holt <[hidden email]> wrote:
>>What version of MAKER are you using? I'd also need to see the GFF3 files
>>before the merge.  You may also need to turn off map_forward since you
>>are
>>passing in GFF3 with MAKER names, creating new models with MAKER names
>>and
>>then moving names from old models forward onto new ones (which may force
>>names to be used twice).
>>
>>--Carson
>>
>>
>>On 8/14/14, 9:40 AM, "Jeanne Wilbrandt" <[hidden email]> wrote:
>>
>>>
>>>Thank you so much!
>>>
>>>However, I'm still struggling, I'm afraid: I tried this 'two-step
>>>merging' approach with
>>>a subset of scaffolds and got duplicate IDs.
>>>
>>>Here is what I did:
>>>- divided input scaffolds in two files
>>>- run maker separately on these files (-> separate output dirs)
>>>-- additional input: maker-generated gff3 from previous (singular) run
>>>-- repeatmasking, snaphmm, gmhmm, augustus_species are given
>>>-- map_forward=0 / 1 (I tried both, to the same effect)
>>>- gff3_merge two times using index-log
>>>- gff3_merge these two gff3 files
>>>
>>>$ grep -P "\tgene\t" merged_all.gff3 | cut -f9 | cut -f1 -d ";" | sort | uniq -c | sort -n | tail
>>>      2 ID=snap_masked-scf7180005140699-processed-gene-0.19
>>>      2 ID=snap_masked-scf7180005140699-processed-gene-0.22
>>>      2 ID=snap_masked-scf7180005140699-processed-gene-1.36
>>>      2 ID=snap_masked-scf7180005140713-processed-gene-0.4
>>>      2 ID=snap_masked-scf7180005140744-processed-gene-0.4
>>>      2 ID=snap_masked-scf7180005140744-processed-gene-0.6
>>>      2 ID=snap_masked-scf7180005140754-processed-gene-0.14
>>>      2 ID=snap_masked-scf7180005140754-processed-gene-0.15
>>>      2 ID=snap_masked-scf7180005140754-processed-gene-0.19
>>>      2 ID=snap_masked-scf7180005181475-processed-gene-0.3
>>>
>>>$ grep snap_masked-scf7180005181475-processed-gene-0.3 merged_all.gff3 | grep "\sgene"
>>>scf7180005181475 maker gene 9050 9385 . - . ID=snap_masked-scf7180005181475-processed-gene-0.3;Name=snap_masked-scf7180005181475-processed-gene-0.3
>>>scf7180005181475 maker gene 846 1088 . - . ID=snap_masked-scf7180005181475-processed-gene-0.3;Name=snap_masked-scf7180005181475-processed-gene-0.3
>>>
>>>- found duplicates! i.e. the same ID for gene annotations in different
>>>areas of the same
>>>scaffold (of 655 gene annotations, 51 appear twice)
>>>-- this happens not only with gene, but also CDS and mRNA annotations,
>>>as
>>>far as I can
>>>see (here, in one example, non-overlapping but close CDS snippets got
>>>the
>>>same ID).
>>>
>>>
>>>I suspected this might have to do with the map_forward flag, but I get
>>>the same problem
>>>again (with genes at the same locations).
>>>I attached one of the ctl files for you in case you want to have a look,
>>>the other is
>>>analogous. Do you need something else?
>>>
>>>What did I miss? This should not happen, right?
>>>
>>>
>>>
>>>
>>>On Wed, 13 Aug 2014 15:52:34 +0000
>>> Carson Holt <[hidden email]> wrote:
>>>>Yes. One cpu will have several processes, most are helper processes
>>>>that
>>>>will use 0% CPU almost all of the time (for example there is a shared
>>>>variable manager process that will launch with MAKER but will also be
>>>>called 'maker' under top because it is technically its child and not a
>>>>separate script).  Also system calls will launch a new process that
>>>>will
>>>>use all CPU while the process calling it will drop to 0% CPU until it
>>>>finishes.
>>>>
>>>>Yes.  Your explanation is correct. You then use gff3_merge to merge the
>>>>GFF3 file.
>>>>
>>>>--Carson
>>>>
>>>>
>>>>
>>>>On 8/13/14, 3:32 AM, "Jeanne Wilbrandt" <[hidden email]> wrote:
>>>>
>>>>>
>>>>>Our admin counts processes. Do I understand you right, that one CPU
>>>>>handles several
>>>>>processes?
>>>>>
>>>>>I'm still confused by the different directories (and I made a mistake
>>>>>when asking last
>>>>>time, I wanted to say 'If I do NOT start the jobs in the same
>>>>>directory...').
>>>>>So, if I start each piece of a genome in its own directory (for
>>>>>example),
>>>>>then it gets a
>>>>>unique basename (because the output will be separate from all other
>>>>>pieces anyway) and I
>>>>>will not run dsindex but instead use gff3_merge for each piece's
>>>>>output
>>>>>and then once
>>>>>again to merge all resulting gff3-files?
>>>>>
>>>>>Hope I got you right :)
>>>>>
>>>>>Thanks for your help!
>>>>>Jeanne
>>>>>
>>>>>
>>>>>
>>>>>On Wed, 6 Aug 2014 15:45:56 +0000
>>>>> Carson Holt <[hidden email]> wrote:
>>>>>>Is your admin counting processes or cpu usage?  Because each system
>>>>>>call
>>>>>>creates a
>>>>>>separate process, so you can expect multiple processes (each system
>>>>>>call
>>>>>>generates a new
>>>>>>process) but only a single cpu of usage per instance.  Use different
>>>>>>directories if you
>>>>>>are running that many jobs.  You can concatenate the separate results
>>>>>>when you're done.
>>>>>> Use gff3_merge script to help concatenate the separate GFF3 files
>>>>>>generated from
>>>>>>separate jobs.
>>>>>>
>>>>>>--Carson
>>>>>>
>>>>>>Sent from my iPhone
>>>>>>
>>>>>>> On Aug 6, 2014, at 9:33 AM, "Jeanne Wilbrandt"
>>>>>>><[hidden email]>
>>>>>>>wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> We are using MPI as well, each of the 20 parts gets assigned 4
>>>>>>>threads. Our admin
>>>>>>reports
>>>>>>> however, that the processes seem to assemble more threads than they
>>>>>>>are allowed. It is
>>>>>>> not Blast (which is set to 1 cpu in the opts.ctl). Do you have a
>>>>>>>suggestion why?
>>>>>>>
>>>>>>> If I start the jobs in the same directory, how can I make sure they
>>>>>>>write to the same
>>>>>>> directory (as, I think is required to put the pieces together in
>>>>>>>the
>>>>>>>end?)? Does
>>>>>>-basename
>>>>>>> take paths?
>>>>>>>
>>>>>>>
>>>>>>> On Wed, 6 Aug 2014 15:12:50 +0000
>>>>>>> Carson Holt <[hidden email]> wrote:
>>>>>>>> I think the freezing is because you are starting too many
>>>>>>>>simultaneous jobs.  You
>>>>>>should
>>>>>>>> try and use MPI to parallelize instead.  The concurrent job way of
>>>>>>>>doing things can
>>>>>>>> start to cause problems if you are running 10 or more jobs in the
>>>>>>>>same directory. You
>>>>>>>> could try splitting them into different directories.
>>>>>>>>
>>>>>>>> --Carson
>>>>>>>>
>>>>>>>> Sent from my iPhone
>>>>>>>>
>>>>>>>>> On Aug 6, 2014, at 9:01 AM, "Jeanne Wilbrandt"
>>>>>>>>><[hidden email]>
>>>>>>>>>wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> aha, so this explains that.
>>>>>>>>> Daniel, the average is 5930.37 bp, but ranging from ~ 50 to more
>>>>>>>>>than 60,000,
>>>>>>roughly
>>>>>>>>> half of the sequences being shorter than 3,000 bp.
>>>>>>>>>
>>>>>>>>> What do you think about this weird 'I am running but not really
>>>>>>>>>doing
>>>>>>>> anything'-behavior?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks a lot!
>>>>>>>>> Jeanne
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, 6 Aug 2014 14:16:52 +0000
>>>>>>>>> Carson Holt <[hidden email]> wrote:
>>>>>>>>>> If you are starting and restarting, or running multiple jobs
>>>>>>>>>>then
>>>>>>>>>>the log can be
>>>>>>>>>> partially rebuilt.  On rebuild only the FINISHED entries are
>>>>>>>>>>added.
>>>>>>>>>> If there is a
>>>>>>>> GFF3
>>>>>>>>>> result file for the contig, then it is FINISHED. FASTA files
>>>>>>>>>>will
>>>>>>>>>>only exist for
>>>>>>the
>>>>>>>>>> contigs that have gene models. Small contigs will rarely contain
>>>>>>>>>>models.
>>>>>>>>>>
>>>>>>>>>> --Carson
>>>>>>>>>>
>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>
>>>>>>>>>>> On Aug 6, 2014, at 6:40 AM, "Jeanne Wilbrandt"
>>>>>>>>>>><[hidden email]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi Carson,
>>>>>>>>>>>
>>>>>>>>>>> I ran into more conspicuous behavior running maker 2.31 on a
>>>>>>>>>>>genome which is split
>>>>>>>>>> into
>>>>>>>>>>> 20 parts, using the -g flag and the same basename.
>>>>>>>>>>> Most of the jobs ran simultaneously on the same node, 17 seemed
>>>>>>>>>>>to
>>>>>>>>>>>finish
>>>>>>normally,
>>>>>>>>>> while
>>>>>>>>>>> the remaining three seemed to be stalled and produced 0B of
>>>>>>>>>>>output. Do you have
>>>>>>any
>>>>>>>>>>> suggestion why this is happening?
>>>>>>>>>>>
>>>>>>>>>>> After I stopped these stalled jobs, I checked the index.log and
>>>>>>>>>>>found that of
>>>>>>38.384
>>>>>>>>>>> mentioned scaffolds, 154 appear only once in the log. The
>>>>>>>>>>>surprise
>>>>>>>>>>>is, that 2/3 of
>>>>>>>>>> these
>>>>>>>>>>> only appear as FINISHED (the rest only started). There are no
>>>>>>>>>>>models for these
>>>>>>>>>> 'finished'
>>>>>>>>>>> scaffolds stored in the .db and they are distributed over all
>>>>>>>>>>>parts of the genome
>>>>>>>>>> (i.e.,
>>>>>>>>>>> each of the 20 jobs contained scaffolds that 'did not start'
>>>>>>>>>>>but
>>>>>>>>>>>'finished')
>>>>>>>>>>> Should this be an issue of concern?
>>>>>>>>>>> It might be a NFS lock problem, as NFS is heavily loaded, but
>>>>>>>>>>>the
>>>>>>>>>>>NFS files look
>>>>>>>> good,
>>>>>>>>>> so
>>>>>>>>>>> we suspect something fishy going on...
>>>>>>>>>>>
>>>>>>>>>>> Hope you can help,
>>>>>>>>>>> best wishes,
>>>>>>>>>>> Jeanne Wilbrandt
>>>>>>>>>>>
>>>>>>>>>>> zmb // ZFMK // University of Bonn
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> maker-devel mailing list
>>>>>>>>>>> [hidden email]
>>>>>>>>>>>
>>>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-
>>>>>>>>>>>la
>>>>>>>>>>>b.
>>>>>>>>>>>org
>>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>



_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
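
As a side note, the MPI route Carson recommends earlier in the thread is normally a single MAKER invocation launched under mpiexec rather than many concurrent jobs. A sketch, assuming MAKER was compiled with MPI support (the process count and file names are placeholders):

$ mpiexec -n 20 maker -g genome_all.fasta -base mygenome

With that layout there is one output directory and one master datastore index log, so a single gff3_merge -d pass produces the final GFF3 without a second merge step.
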
Reply | Threaded
Open this post in threaded view
|

Re: Further split genome questions

Carson Holt-2
In reply to this post by Jeanne Wilbrandt
For the file you just sent me, is that from the first run with
map_forward=0 or with map_forward=1?

--Carson

On 8/14/14, 9:53 AM, "Jeanne Wilbrandt" <[hidden email]> wrote:

>
>It is version 2.31.
>
>My first try was done with map_forward=0, and (I just noticed) the
>duplicates are already present in the separate gff3s in that case as well
>(one is attached).
>
>Does this have something to do with the first-run gff3 I fed it?
>
>
>
>
>On Thu, 14 Aug 2014 15:46:44 +0000
> Carson Holt <[hidden email]> wrote:
>>What version of MAKER are you using? I'd also need to see the GFF3 files
>>before the merge.  You may also need to turn off map_forward since you
>>are
>>passing in GFF3 with MAKER names, creating new models with MAKER names
>>and
>>then moving names from old models forward onto new ones (which may force
>>names to be used twice).
>>
>>--Carson
>>
>>
>>On 8/14/14, 9:40 AM, "Jeanne Wilbrandt" <[hidden email]> wrote:
>>
>>>
>>>Thank you so much!
>>>
>>>However, I'm still struggling, I'm afraid: I tried this 'two-step
>>>merging' approach with
>>>a subset of scaffolds and got duplicate IDs.
>>>
>>>Here is what I did:
>>>- divided input scaffolds in two files
>>>- run maker separately on these files (-> separate output dirs)
>>>-- additional input: maker-generated gff3 from previous (singular) run
>>>-- repeatmasking, snaphmm, gmhmm, augustus_species are given
>>>-- map_forward=0 / 1 (I tried both, to the same effect)
>>>- gff3_merge two times using index-log
>>>- gff3_merge these two gff3 files
>>>
>>>$ grep -P "\tgene\t" merged_all.gff3 | cut -f9 | cut -f1 -d ";" | sort | uniq -c | sort -n | tail
>>>      2 ID=snap_masked-scf7180005140699-processed-gene-0.19
>>>      2 ID=snap_masked-scf7180005140699-processed-gene-0.22
>>>      2 ID=snap_masked-scf7180005140699-processed-gene-1.36
>>>      2 ID=snap_masked-scf7180005140713-processed-gene-0.4
>>>      2 ID=snap_masked-scf7180005140744-processed-gene-0.4
>>>      2 ID=snap_masked-scf7180005140744-processed-gene-0.6
>>>      2 ID=snap_masked-scf7180005140754-processed-gene-0.14
>>>      2 ID=snap_masked-scf7180005140754-processed-gene-0.15
>>>      2 ID=snap_masked-scf7180005140754-processed-gene-0.19
>>>      2 ID=snap_masked-scf7180005181475-processed-gene-0.3
>>>
>>>$ grep snap_masked-scf7180005181475-processed-gene-0.3 merged_all.gff3 | grep "\sgene"
>>>scf7180005181475 maker gene 9050 9385 . - . ID=snap_masked-scf7180005181475-processed-gene-0.3;Name=snap_masked-scf7180005181475-processed-gene-0.3
>>>scf7180005181475 maker gene 846 1088 . - . ID=snap_masked-scf7180005181475-processed-gene-0.3;Name=snap_masked-scf7180005181475-processed-gene-0.3
>>>
>>>- found duplicates! i.e. the same ID for gene annotations in different
>>>areas of the same
>>>scaffold (of 655 gene annotations, 51 appear twice)
>>>-- this happens not only with gene, but also CDS and mRNA annotations,
>>>as
>>>far as I can
>>>see (here, in one example, non-overlapping but close CDS snippets got
>>>the
>>>same ID).
>>>
>>>
>>>I suspected this might have to do with the map_forward flag, but I get
>>>the same problem
>>>again (with genes at the same locations).
>>>I attached one of the ctl files for you in case you want to have a look,
>>>the other is
>>>analogous. Do you need something else?
>>>
>>>What did I miss? This should not happen, right?
>>>
>>>
>>>
>>>
>>>On Wed, 13 Aug 2014 15:52:34 +0000
>>> Carson Holt <[hidden email]> wrote:
>>>>Yes. One cpu will have several processes, most are helper processes
>>>>that
>>>>will use 0% CPU almost all of the time (for example there is a shared
>>>>variable manager process that will launch with MAKER but will also be
>>>>called 'maker' under top because it is technically its child and not a
>>>>separate script).  Also system calls will launch a new process that
>>>>will
>>>>use all CPU while the process calling it will drop to 0% CPU until it
>>>>finishes.
>>>>
>>>>Yes.  Your explanation is correct. You then use gff3_merge to merge the
>>>>GFF3 file.
>>>>
>>>>--Carson
>>>>
>>>>
>>>>
>>>>On 8/13/14, 3:32 AM, "Jeanne Wilbrandt" <[hidden email]> wrote:
>>>>
>>>>>
>>>>>Our admin counts processes. Do I understand you right, that one CPU
>>>>>handles several
>>>>>processes?
>>>>>
>>>>>I'm still confused by the different directories (and I made a mistake
>>>>>when asking last
>>>>>time, I wanted to say 'If I do NOT start the jobs in the same
>>>>>directory...').
>>>>>So, if I start each piece of a genome in its own directory (for
>>>>>example),
>>>>>then it gets a
>>>>>unique basename (because the output will be separate from all other
>>>>>pieces anyway) and I
>>>>>will not run dsindex but instead use gff3_merge for each piece's
>>>>>output
>>>>>and then once
>>>>>again to merge all resulting gff3-files?
>>>>>
>>>>>Hope I got you right :)
>>>>>
>>>>>Thanks for your help!
>>>>>Jeanne
>>>>>
>>>>>
>>>>>
>>>>>On Wed, 6 Aug 2014 15:45:56 +0000
>>>>> Carson Holt <[hidden email]> wrote:
>>>>>>Is your admin counting processes or cpu usage?  Because each system
>>>>>>call
>>>>>>creates a
>>>>>>separate process, so you can expect multiple processes (each system
>>>>>>call
>>>>>>generates a new
>>>>>>process) but only a single cpu of usage per instance.  Use different
>>>>>>directories if you
>>>>>>are running that many jobs.  You can concatenate the separate results
>>>>>>when you're done.
>>>>>> Use gff3_merge script to help concatenate the separate GFF3 files
>>>>>>generated from
>>>>>>separate jobs.
>>>>>>
>>>>>>--Carson
>>>>>>
>>>>>>Sent from my iPhone
>>>>>>
>>>>>>> On Aug 6, 2014, at 9:33 AM, "Jeanne Wilbrandt"
>>>>>>><[hidden email]>
>>>>>>>wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> We are using MPI as well, each of the 20 parts gets assigned 4
>>>>>>>threads. Our admin
>>>>>>reports
>>>>>>> however, that the processes seem to assemble more threads than they
>>>>>>>are allowed. It is
>>>>>>> not Blast (which is set to 1 cpu in the opts.ctl). Do you have a
>>>>>>>suggestion why?
>>>>>>>
>>>>>>> If I start the jobs in the same directory, how can I make sure they
>>>>>>>write to the same
>>>>>>> directory (as, I think is required to put the pieces together in
>>>>>>>the
>>>>>>>end?)? Does
>>>>>>-basename
>>>>>>> take paths?
>>>>>>>
>>>>>>>
>>>>>>> On Wed, 6 Aug 2014 15:12:50 +0000
>>>>>>> Carson Holt <[hidden email]> wrote:
>>>>>>>> I think the freezing is because you are starting too many
>>>>>>>>simultaneous jobs.  You
>>>>>>should
>>>>>>>> try and use MPI to parallelize instead.  The concurrent job way of
>>>>>>>>doing things can
>>>>>>>> start to cause problems if you are running 10 or more jobs in the
>>>>>>>>same directory. You
>>>>>>>> could try splitting them into different directories.
>>>>>>>>
>>>>>>>> --Carson
>>>>>>>>
>>>>>>>> Sent from my iPhone
>>>>>>>>
>>>>>>>>> On Aug 6, 2014, at 9:01 AM, "Jeanne Wilbrandt"
>>>>>>>>><[hidden email]>
>>>>>>>>>wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> aha, so this explains that.
>>>>>>>>> Daniel, the average is 5930.37 bp, but ranging from ~ 50 to more
>>>>>>>>>than 60,000,
>>>>>>roughly
>>>>>>>>> half of the sequences being shorter than 3,000 bp.
>>>>>>>>>
>>>>>>>>> What do you think about this weird 'I am running but not really
>>>>>>>>>doing
>>>>>>>> anything'-behavior?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks a lot!
>>>>>>>>> Jeanne
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, 6 Aug 2014 14:16:52 +0000
>>>>>>>>> Carson Holt <[hidden email]> wrote:
>>>>>>>>>> If you are starting and restarting, or running multiple jobs
>>>>>>>>>>then
>>>>>>>>>>the log can be
>>>>>>>>>> partially rebuilt.  On rebuild only the FINISHED entries are
>>>>>>>>>>added.
>>>>>>>>>> If there is a
>>>>>>>> GFF3
>>>>>>>>>> result file for the contig, then it is FINISHED. FASTA files
>>>>>>>>>>will
>>>>>>>>>>only exist for
>>>>>>the
>>>>>>>>>> contigs that have gene models. Small contigs will rarely contain
>>>>>>>>>>models.
>>>>>>>>>>
>>>>>>>>>> --Carson
>>>>>>>>>>
>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>
>>>>>>>>>>> On Aug 6, 2014, at 6:40 AM, "Jeanne Wilbrandt"
>>>>>>>>>>><[hidden email]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi Carson,
>>>>>>>>>>>
>>>>>>>>>>> I ran into more conspicuous behavior running maker 2.31 on a
>>>>>>>>>>>genome which is split
>>>>>>>>>> into
>>>>>>>>>>> 20 parts, using the -g flag and the same basename.
>>>>>>>>>>> Most of the jobs ran simultaneously on the same node, 17 seemed
>>>>>>>>>>>to
>>>>>>>>>>>finish
>>>>>>normally,
>>>>>>>>>> while
>>>>>>>>>>> the remaining three seemed to be stalled and produced 0B of
>>>>>>>>>>>output. Do you have
>>>>>>any
>>>>>>>>>>> suggestion why this is happening?
>>>>>>>>>>>
>>>>>>>>>>> After I stopped these stalled jobs, I checked the index.log and
>>>>>>>>>>>found that of
>>>>>>38.384
>>>>>>>>>>> mentioned scaffolds, 154 appear only once in the log. The
>>>>>>>>>>>surprise
>>>>>>>>>>>is, that 2/3 of
>>>>>>>>>> these
>>>>>>>>>>> only appear as FINISHED (the rest only started). There are no
>>>>>>>>>>>models for these
>>>>>>>>>> 'finished'
>>>>>>>>>>> scaffolds stored in the .db and they are distributed over all
>>>>>>>>>>>parts of the genome
>>>>>>>>>> (i.e.,
>>>>>>>>>>> each of the 20 jobs contained scaffolds that 'did not start'
>>>>>>>>>>>but
>>>>>>>>>>>'finished')
>>>>>>>>>>> Should this be an issue of concern?
>>>>>>>>>>> It might be a NFS lock problem, as NFS is heavily loaded, but
>>>>>>>>>>>the
>>>>>>>>>>>NFS files look
>>>>>>>> good,
>>>>>>>>>> so
>>>>>>>>>>> we suspect something fishy going on...
>>>>>>>>>>>
>>>>>>>>>>> Hope you can help,
>>>>>>>>>>> best wishes,
>>>>>>>>>>> Jeanne Wilbrandt
>>>>>>>>>>>
>>>>>>>>>>> zmb // ZFMK // University of Bonn
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> maker-devel mailing list
>>>>>>>>>>> [hidden email]
>>>>>>>>>>>
>>>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-
>>>>>>>>>>>la
>>>>>>>>>>>b.
>>>>>>>>>>>org
>>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>



_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
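
For readers following the map_forward discussion, the two maker_opts.ctl keys at issue look like this (the key names are MAKER's own; the values and comments are illustrative, not taken from Jeanne's ctl files):

maker_gff=previous_run.all.gff3  #example: GFF3 from an earlier MAKER run passed back in as input
map_forward=0  #example: 0 = do not map old MAKER names/attributes forward onto new models, 1 = map them forward
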