About loading files to chado database.


About loading files to chado database.

kentnf .
Hi All,

I am new to Tripal and have recently started learning how to use it to build a genome website. I have a simple question about loading GFF files into Chado using the Tripal module.
We plan to load several plant genomes into the database. It seems that loading GFF files gets slower as the database grows. I remember that loading the 1st genome took about one night. Now there are 4 genomes in our database, and loading the 5th genome is taking about 4-5 days. Are there any suggestions for loading new genomes more quickly?
BTW, our server is an old machine with 1 CPU (4 cores), 8GB of memory, and a 300GB SCSI hard disk. Would loading be faster if we upgraded the server? Thanks.




Best,
Yi Zheng

------------------------------------------------------------------------------
Developer Access Program for Intel Xeon Phi Processors
Access to Intel Xeon Phi processor-based developer platforms.
With one year of Intel Parallel Studio XE.
Training and support from Colfax.
Order your platform today. http://sdm.link/xeonphi
_______________________________________________
Gmod-tripal mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-tripal

Re: About loading files to chado database.

Stephen Ficklin-2

Hi Yi,

Thanks for your question!

Unfortunately, importing a very large GFF file does take time, as you've discovered. I suspect that folks with more powerful servers could load the same files faster, but we find it takes at least a day to load these big files. Even the original Chado Perl loaders take quite a while to load a GFF file into Chado. The issue is that every row in the GFF file triggers several referential integrity checks that must pass before the row can be inserted. Publishing your gene pages will also take some time. The good news is that once those two steps are done, you don't have to repeat them unless you add new data. Loading large genomic data into a highly normalized database like Chado simply takes time.
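To illustrate the referential-integrity point, here is a minimal sketch using Python's built-in sqlite3 module rather than PostgreSQL. The two tables are heavily simplified stand-ins loosely named after Chado's feature and cvterm tables; they are assumptions for illustration and not the real schema. Each insert into the child table makes the database look up the referenced row first, which is the kind of per-row work a loader pays for every GFF line.

```python
import sqlite3


def demo_foreign_key_check():
    """Show a foreign-key check accepting one row and rejecting another."""
    conn = sqlite3.connect(":memory:")
    # SQLite leaves foreign-key enforcement off unless asked.
    conn.execute("PRAGMA foreign_keys = ON")
    # Hypothetical, simplified tables (not the real Chado definitions).
    conn.execute(
        "CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, name TEXT UNIQUE)"
    )
    conn.execute(
        "CREATE TABLE feature ("
        " feature_id INTEGER PRIMARY KEY,"
        " uniquename TEXT NOT NULL,"
        " type_id INTEGER NOT NULL REFERENCES cvterm(cvterm_id))"
    )
    conn.execute("INSERT INTO cvterm (cvterm_id, name) VALUES (1, 'gene')")

    # type_id 1 exists, so the check passes and the row is stored.
    conn.execute("INSERT INTO feature (uniquename, type_id) VALUES ('gene0001', 1)")

    # type_id 2 does not exist, so the database refuses the row.
    try:
        conn.execute(
            "INSERT INTO feature (uniquename, type_id) VALUES ('mRNA0001', 2)"
        )
        rejected = False
    except sqlite3.IntegrityError:
        rejected = True

    stored = conn.execute("SELECT COUNT(*) FROM feature").fetchone()[0]
    conn.close()
    return rejected, stored
```

Calling demo_foreign_key_check() returns whether the second insert was rejected and how many feature rows were stored. In a real Chado load the lookups run against much larger, indexed tables, so their cost depends on how well those indexes fit in memory.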

There are probably some tweaks we could make to the GFF loader to speed it up. Also, I suspect that a server with 8GB of RAM may be too small for several plant genomes. For PostgreSQL to work quickly, it likes to do index searching and sorting in RAM. As the database tables grow larger and RAM stays limited, things slow down because PostgreSQL can't fit what it wants into RAM. When you are loading your genomes, are you reaching the memory limit on your machine? Are the web server and database server on the same machine? How fast are the drives behind your 300GB of storage (7200 RPM, 10K RPM, 15K RPM), and do you have only one hard drive in the machine?
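As a rough way to reason about the RAM question, the sketch below turns a machine's total memory into starting values for two PostgreSQL settings, shared_buffers and effective_cache_size. The 25% and 50% fractions are general rules of thumb for a dedicated database server and are an assumption here, not a Tripal or Chado recommendation; a machine that also runs the web server would likely need smaller values. The function name is hypothetical.

```python
def suggest_pg_memory(total_ram_gb, buffers_fraction=0.25, cache_fraction=0.50):
    """Return rule-of-thumb PostgreSQL memory settings as postgresql.conf strings.

    The default fractions are assumed starting points, to be tuned by measurement.
    """
    if total_ram_gb <= 0:
        raise ValueError("total_ram_gb must be positive")
    total_mb = int(total_ram_gb * 1024)
    return {
        # Memory PostgreSQL reserves for its own page cache.
        "shared_buffers": f"{int(total_mb * buffers_fraction)}MB",
        # Planner hint: how much data the OS and PostgreSQL together can cache.
        "effective_cache_size": f"{int(total_mb * cache_fraction)}MB",
    }


if __name__ == "__main__":
    for ram in (8, 32, 64):
        print(ram, "GB:", suggest_pg_memory(ram))
```

Comparing the output for 8GB against a larger machine gives a feel for how much more index and table data could stay in memory after an upgrade.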

Stephen  


(In reply to the message from kentnf, 1/12/2017 2:43 PM.)


Re: About loading files to chado database.

kentnf .
Hi Stephen,

Thank you for the quick reply. I didn't monitor the memory usage when loading the GFF files, but I saw memory info like: "Parsing Line 128750 (44.54%). Memory: 28,614,936 bytes." You are right, the web server and database server are on the same machine. My old SCSI drives should be 10K RPM in RAID 5. We plan to buy a new server. How about using SSDs in the new server? Please give me some suggestions about the memory size for the new server. Thanks.
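For reference, a progress line like the one quoted above can be parsed with a short script. This is a minimal sketch that assumes the line format is exactly as quoted; the function name is hypothetical. Note that the reported figure is only about 27 MiB, so it presumably describes the loader process itself rather than PostgreSQL's memory use, which would have to be checked separately with system tools.

```python
import re

# Pattern written against the single quoted log line; other formats are not handled.
PROGRESS_RE = re.compile(
    r"Parsing Line (?P<line>\d+) \((?P<pct>[\d.]+)%\)\. Memory: (?P<mem>[\d,]+) bytes"
)


def parse_progress(text):
    """Extract line number, percent done, and memory from a progress line."""
    match = PROGRESS_RE.search(text)
    if match is None:
        return None
    line = int(match.group("line"))
    pct = float(match.group("pct"))
    mem_bytes = int(match.group("mem").replace(",", ""))
    return {
        "line": line,
        "percent": pct,
        "mem_bytes": mem_bytes,
        "mem_mib": mem_bytes / (1024 * 1024),
        # Rough size of the whole file, extrapolated from the percentage.
        "est_total_lines": round(line / (pct / 100)) if pct > 0 else None,
    }


if __name__ == "__main__":
    sample = "Parsing Line 128750 (44.54%). Memory: 28,614,936 bytes."
    print(parse_progress(sample))
```

Logging these values with timestamps over a long load would show whether the lines-per-hour rate drops as the database grows.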

Best,
Yi     

(In reply to the message from Stephen Ficklin, Fri, Jan 13, 2017 at 2:09 AM.)