SNP experience

SNP experience

Joe Carlson
Hi Julie,

You had mentioned in a previous email that you had done some work loading SNPs into a mine. I was wondering if you could tell me a little about that dataset.

I've just tried to load a big set. Unfortunately, I ran out of disk space before it was fully loaded. All hope is not lost, since I can continue where I got cut off after I clean up a little. I'm curious about the size of your data, how it was structured, and the performance for loading and retrievals.

The set I'm working on now is ~550 individuals with ~30M SNPs. Using a SimpleObject for storing the genotype information is absolutely essential, as I've mentioned. I'm loading that as part of a post-processing step and can load it in ~3 days. But I'd like to do it much faster, since we may be increasing the number of individuals in the next year or so.

I’d appreciate hearing of your experiences,

Thanks,

Joe

Re: SNP experience

Julie Sullivan
Hi Joe

Here are some stats for the human InterMine (http://human.intermine.org/)

SNP Count: 60,487,836

Build time (SNP source): 6 hours
Build time (total): 28 hours
DB size: 552 GB

You can test the performance of the webapp for yourself. (They are all
created as sequence alterations, just FYI.) I am happy with the query
responsiveness, but I do think you get an error if you try to select all
60 million sequence alterations in the QueryBuilder.

The data loading performance docs are here:

http://intermine.readthedocs.org/en/latest/database/performance/data-loading/

I believe we spoke about this earlier: have you turned off data tracking
for some classes? That will help a bit.

Also, we've added a speed test on the beta branch:

http://intermine.readthedocs.org/en/latest/database/performance/data-loading/#performance-test

It would be interesting to see your results. What are the stats of your
current server?

Julie

Re: SNP experience

Joe Carlson

I've synced up my GATK results data loader on GitHub if you want to take
a look at it. My basic idea is to load the SNPs as part of the
integration step but link the SNPs to the samples in the post-processing
step. (This turns out to be a lot faster than making the links at the
integration step.)

For one organism, we have 32M SNPs and 559 individuals. There are simply
too many records to insert the SNP->sample links at the integration step.

The table linking the SNP to the sample is a simple object and I can
load that at a pretty decent rate. I think I may not be doing things in
a very InterMine-y way: in the post-processing step I first query to get
all the sample names and IDs, then the SNP names and IDs. I re-read the
VCF file and create a plain JDBC CopyManager in a separate thread to do
the insertions. It's about a 3-day job. I think I can get further
parallelization with an indexed VCF file.
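
The guts of the insertion thread is basically the stock CopyManager
pattern. Stripped down, and with the table and column names replaced by
placeholders, it looks something like this:

import java.io.StringReader;
import java.sql.Connection;
import java.sql.DriverManager;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class SnpSampleCopySketch {
    public static void main(String[] args) throws Exception {
        // Plain JDBC connection to the production db (placeholder URL and credentials).
        Connection conn = DriverManager.getConnection(
            "jdbc:postgresql://localhost/production_db", "intermine", "secret");

        // The driver's bulk-copy API, outside the InterMine ObjectStore entirely.
        CopyManager copyManager = conn.unwrap(PGConnection.class).getCopyAPI();

        // One tab-separated row per observed genotype: snp id, sample id, genotype.
        StringBuilder rows = new StringBuilder();
        rows.append("1001\t2001\t0/1\n");
        rows.append("1001\t2002\t1/1\n");

        // COPY ... FROM STDIN is far faster than row-by-row INSERTs.
        long copied = copyManager.copyIn(
            "COPY snpsamplelink (snpid, sampleid, genotype) FROM STDIN",
            new StringReader(rows.toString()));
        System.out.println("copied " + copied + " rows");

        conn.close();
    }
}

In the real loader the rows are streamed out of the VCF in a separate
thread rather than built up in memory like this.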

This was working OK, but I ran out of disk space partway through. I had
tried to estimate the space I would need - I was thinking it would come
in at ~400 GB - and thought I was going to be OK, but I had neglected a
couple of huge issues: the 23-byte per-record overhead and the sizes of
the indexes. Given those overheads I'm starting to rethink the whole
approach. The space was more like 900 GB for the table and another
750 GB for the indexes.
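
To put rough numbers on it (worst case, a row for every SNP in every
sample):

   32M SNPs x 559 samples ~= 1.8e10 rows
   1.8e10 rows x 23 bytes ~= 410 GB

so the per-record overhead by itself is about what I had budgeted for
the whole table, before any ids, genotype values or indexes.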

So I've started thinking of different ways to do the linking. Given the
record overhead, creating a separate record for every observed SNP is
killing us. I'm kicking around some sort of JSON store for the sample
information. This will involve getting into the guts of the query
building, but it may be the only way to allow us to scale up to an even
larger number of samples.
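
Just as a sketch of what I mean (all table and column names here are
hypothetical): one row per SNP, with the per-sample genotypes packed
into a single jsonb value.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.LinkedHashMap;
import java.util.Map;

public class SnpGenotypeJsonSketch {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
            "jdbc:postgresql://localhost/production_db", "intermine", "secret");

        // Hypothetical layout: one row per SNP, genotypes keyed by sample name.
        Map<String, String> genotypes = new LinkedHashMap<>();
        genotypes.put("sample_001", "0/1");
        genotypes.put("sample_002", "1/1");

        // Build the JSON by hand here; real code would use a JSON library for escaping.
        StringBuilder json = new StringBuilder("{");
        for (Map.Entry<String, String> e : genotypes.entrySet()) {
            if (json.length() > 1) {
                json.append(',');
            }
            json.append('"').append(e.getKey()).append("\":\"")
                .append(e.getValue()).append('"');
        }
        json.append('}');

        // The ?::jsonb cast lets postgres parse the string into a jsonb column.
        PreparedStatement ps = conn.prepareStatement(
            "INSERT INTO snpgenotypes (snpid, genotypes) VALUES (?, ?::jsonb)");
        ps.setInt(1, 1001);
        ps.setString(2, json.toString());
        ps.executeUpdate();

        ps.close();
        conn.close();
    }
}

Pulling out one sample's genotype then means SQL along the lines of
genotypes->>'sample_001', which is where getting into the guts of the
query building comes in.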

Joe

Re: SNP experience

Julie Sullivan
Hi Joe!

1. What are the stats of your hardware? I am getting much better loading
times than you and am trying to figure out why.

2. Did you try the performance test? The test may help us eliminate
hardware or configuration as a cause of slow loading.

3. What is the average speed for your build? The SNP source?

The log file is intermine.log in phytomine/integrate and the lines look
something like:

2014-09-02 15:42:24 INFO  org.intermine.dataconversion.FullXmlConverter
     - Processed 580000 rows - running at 3846153 (200000 avg 3164556)
(avg 1320582) rows per minute

4. How big is your data tracker table? Have you turned off the data
tracker for data types that aren't merged?

Here are my stats for the current human InterMine. You can see the
tracker is the largest table.

                              relname                            | pg_size_pretty
-----------------------------------------------------------------+----------------
  tracker                                                         | 72 GB
  intermineobject                                                 | 62 GB
  tracker_objectid                                                | 38 GB
  consequence                                                     | 14 GB
  sequencealteration                                              | 13 GB
  sequencefeature                                                 | 12 GB
  snv                                                             | 11 GB
  bioentity                                                       | 9511 MB
  consequence__description_like                                   | 8998 MB
  consequence__description_equals                                 | 8998 MB
  clob                                                            | 7460 MB

5. Maybe you can send me the list of your tables and sizes?
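
Something along these lines will pull it. This is just a sketch - the
embedded query can equally be pasted straight into psql - and the LIMIT
is only there to keep the output short:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class RelationSizes {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
            "jdbc:postgresql://localhost/production_db", "intermine", "secret");
        Statement stmt = conn.createStatement();

        // Largest relations (tables and indexes) in the current database.
        ResultSet rs = stmt.executeQuery(
            "SELECT relname, pg_size_pretty(pg_relation_size(oid)) "
            + "FROM pg_class "
            + "ORDER BY pg_relation_size(oid) DESC "
            + "LIMIT 15");
        while (rs.next()) {
            System.out.printf("%-65s| %s%n", rs.getString(1), rs.getString(2));
        }

        rs.close();
        stmt.close();
        conn.close();
    }
}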

6. Is this a typo:

https://github.com/JoeCarlson/intermine/blob/dev/phytomine/project.xml#L350-L371

7. Data sources

GATK

https://github.com/JoeCarlson/intermine/blob/dev/bio/postprocess/main/src/org/intermine/bio/postprocess/SingletonSequenceTransfer.java

and postprocess

https://github.com/JoeCarlson/intermine/blob/dev/bio/postprocess/main/src/org/intermine/bio/postprocess/SingletonSequenceTransfer.java

Is that the correct version? I'll take a look!

8. There are a few other ways we can optimise the build times and
database sizes.

"From humanmine SNP example data estimate ~14% reduction in table size
for intermineobject if class names replaced by unqualified class names":
https://github.com/intermine/intermine/issues/675
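
(On the 62 GB intermineobject table above, 14% works out to roughly
8-9 GB.)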

Using an in-memory database instead of the items database, to save the
time spent writing to disk:
https://github.com/intermine/intermine/issues/731

We have other ideas too, some good, some bad. But maybe you and I could
discuss the different strategies and prioritise based on your requirements.

9. Would you be open to a Skype call next week?

You are obviously free to update your mine so it meets your needs!
However, we are confronting similar issues, and it makes sense to
coordinate our efforts. Do you agree?

The less "different" your mine is from the core code, the more I'll be
able to help you and the more you can share code and take advantage of
the work we're doing.

Cheers
Julie
