PARALLELIZED DE NOVO GENOME ANNOTATION WITHOUT MPI


Quanwei Zhang
Hello:

I am doing genome annotation with MAKER on our high-performance computing cluster (HPC). Because of some issues with MPI, I instead submitted several MAKER jobs to the HPC under the same working directory. Following the example in the protocol (shown below), I run every job after the first as a background process with "&". Is this necessary when submitting jobs to an HPC? The run is taking far longer than I expected (based on a test with a smaller data set), and I am wondering whether running the processes in the background is the cause.

The example in the protocol:
% maker 2> maker1.error
% maker 2> maker2.error &
% maker 2> maker3.error &
......
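For reference, inside a single batch script the protocol's commands would look roughly like this sketch (here every process is backgrounded and then waited on, since in a script a foreground command would block the later ones from starting):

#!/bin/bash
# Sketch: launch several MAKER processes in the same working directory,
# as in the protocol; they divide the contigs among themselves.
maker 2> maker1.error &
maker 2> maker2.error &
maker 2> maker3.error &
wait   # block until every MAKER process has exited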

By the way, will annotating a short contig (e.g., 500 bp) take roughly 1/100 of the time needed for a 50,000 bp contig? I am using SNAP for ab initio prediction, with an RNA-seq assembly and protein sequences as evidence. More than half of my contigs are shorter than 300 bp, but their total length is only about 5% of the assembly. I would like to know whether ignoring those short contigs would save about half of the runtime, or only about 5% of it.

 Thanks

Best
Quanwei

_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Re: PARALLELIZED DE NOVO GENOME ANNOTATION WITHOUT MPI

Carson Holt-2
If you submit too many simultaneous MAKER runs, the file locks will start to collide and each run will slow down the others. You should submit fewer simultaneous jobs and instead use MPI (MAKER must be configured and compiled with MPI support).

An example MPI launch command for running on 200 CPUs on a cluster:
mpiexec -n 200 maker 2> maker_mpi1.error
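If you go through a scheduler, the launch would be wrapped in a batch script. A rough sketch for SLURM (the resource values are placeholders to adjust for your cluster):

#!/bin/bash
#SBATCH --job-name=maker_mpi
#SBATCH --ntasks=200        # one MPI rank per requested CPU
#SBATCH --time=48:00:00     # placeholder wall time
# Launch MAKER across all allocated ranks; stderr goes to the log file.
mpiexec -n 200 maker 2> maker_mpi1.error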

—Carson

Re: PARALLELIZED DE NOVO GENOME ANNOTATION WITHOUT MPI

Quanwei Zhang
Thank you. I have submitted my jobs to our server. My plan is this: (1) split the contigs into 50 files; (2) for each contig file, collect the annotations into a GFF file and the protein sequences into FASTA format; (3) manually merge the 50 GFF files and the 50 protein FASTA files. Is this approach also correct?
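For step (1), I split the assembly with something like the following (a sketch; I am assuming seqkit here, but any FASTA splitter would do):

% seqkit split --by-part 50 contigs.fasta   # writes 50 roughly equal FASTA files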

Best
Quanwei

2017-03-01 15:54 GMT-05:00 Carson Holt <[hidden email]>:
If you split the genome into separate files, you can use the -g option to select the input file, together with the -base option so that all output goes to the same directory. Because the runs technically have different input files, this avoids the file-locking issue. At the end you have to run with the -dsindex option to rebuild the datastore index, so the output looks like a single job. But that is one way to get around the issue.
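As a sketch (the chunk and base names are just examples), each chunk can be submitted as its own cluster job:

% maker -g contigs_part01.fasta -base mygenome 2> part01.error
% maker -g contigs_part02.fasta -base mygenome 2> part02.error
......
% maker -base mygenome -dsindex   # run last, once every chunk has finished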

—Carson



On Mar 1, 2017, at 1:52 PM, Quanwei Zhang <[hidden email]> wrote:

Thank you. But I ran into some problems with MPI on our server, so for now I have split my contigs into several files and am annotating them separately. Once the annotation of each file finishes, I will merge the results.

Thank you for your explanation!

Best
Quanwei

Re: PARALLELIZED DE NOVO GENOME ANNOTATION WITHOUT MPI

Carson Holt-2
That will work.

—Carson

Re: PARALLELIZED DE NOVO GENOME ANNOTATION WITHOUT MPI

Quanwei Zhang
Hi Carson:

I split my contigs into 50 files and annotated them in parallel. After the annotation finished, I used "gff3_merge -d" and "fasta_merge -d" to produce the GFF and FASTA files for each of the 50 runs. Now I am trying to merge those GFF files into a single GFF, but I found that the contig sequences are appended after the annotation records in each GFF file, so I don't think I can simply merge them with "cat file1.gff file2.gff ... file50.gff > merged.gff". I am considering two ways to merge the files; could you please tell me which one works?
(1) If the contig sequences are not needed for downstream functional annotation, remove all contig sequences from the GFF files, then merge the annotation-only files with "cat" (see the sketch below).
(2) Merge the annotation sections and the contig-sequence sections of the 50 GFF files separately, then concatenate the two, appending all contig sequences after all annotation records.
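For option (1), the contig sequences sit after the ##FASTA directive in each GFF3 file, so I would strip them before concatenating, roughly like this sketch:

% for f in file*.gff; do awk '/^##FASTA/{exit} {print}' "$f"; done > merged.gff

(each file is printed only up to its ##FASTA line; the repeated "##gff-version 3" header lines would still need to be removed from all but the first file)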

Thanks

Re: PARALLELIZED DE NOVO GENOME ANNOTATION WITHOUT MPI

Carson Holt-2
Use gff3_merge again without the -d option. Just give it all 50 files.
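For example (the file names are placeholders; check gff3_merge's usage message if your copy names the output option differently):

% gff3_merge -o merged.all.gff file1.gff file2.gff ... file50.gff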

—Carson