Memory Database


Scooter Willis
Trying to get a feel for the performance tradeoff of loading all GFF3 features into memory, where the machine has 16 GB of memory, 8 cores, etc., versus loading everything into a database. The genome is 40 MB with currently 20 MB of feature files, and I would expect everything to fit in under 200 MB of memory.

Is everything loaded into a hash keyed on feature name for in-memory searches? If so, that should be as fast as executing a SQL query. If the in-memory implementation reads everything from disk per request, then an indexed database would be faster.

When selecting features that cover a scaffold and coordinate range, is the in-memory implementation's search efficient, i.e., does it use some sort of binary search? This could give a performance advantage to the database if the indexing is set up properly to deal with ranges. Typically a SQL query for rows above and below a range boundary benefits from the start and end fields being indexed, since a tree index is used to find the desired range (a sketch of that kind of query follows below). Just curious whether anyone has benchmarked a MySQL database versus in-memory for finding features in a range.
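For illustration, here is a minimal sketch of the kind of indexed range query described above, using Python's sqlite3 module purely as a stand-in; the table layout, column names, and values are made up, not taken from any GBrowse schema:

```python
import sqlite3

# Toy schema with indexed start/end coordinates (names are illustrative only).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE feature (
        id     INTEGER PRIMARY KEY,
        seqid  TEXT,                -- scaffold / contig name
        fstart INTEGER,
        fend   INTEGER,
        name   TEXT
    );
    CREATE INDEX idx_feature_start ON feature (seqid, fstart);
    CREATE INDEX idx_feature_end   ON feature (seqid, fend);
""")
con.executemany(
    "INSERT INTO feature (seqid, fstart, fend, name) VALUES (?, ?, ?, ?)",
    [("scaffold_1", 100, 900, "geneA"), ("scaffold_1", 5000, 7000, "geneB")],
)

# Naive overlap query: a feature overlaps [qstart, qend] if it starts before the
# query ends and ends after the query starts.  A B-tree index can only satisfy
# one of the two bounds efficiently, which is the limitation that binning
# schemes (discussed in the reply below) work around.
qstart, qend = 500, 6000
rows = con.execute(
    "SELECT name, fstart, fend FROM feature"
    " WHERE seqid = ? AND fstart <= ? AND fend >= ?",
    ("scaffold_1", qend, qstart),
).fetchall()
print(rows)   # both geneA and geneB overlap the 500-6000 window
```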

If running on a 64-bit OS, are there any application memory limits that can be tweaked to allow the use of more memory?

When an update is made to a config file or to a GFF3 file backing the database, how are the updates processed? Does the application poll for file time/date/size changes and then automatically reload?

How much server-side caching occurs for regions that have already been viewed? I'm trying to get a handle on requests that sometimes take a long time to retrieve even with the FastCGI (fgb2) option, while at other times performance seems good. If server-side image caching is being used for the regions being queried, is it faster in memory than in a database to determine whether any data elements have changed?

The lab will be working on manually curating the gene predictions. It seems that if I leave everything in GFF3 files, and they use their favorite tool (Artemis, Apollo, etc.) to access the GFF3 files via a file share, then I can greatly reduce the complexity of updating databases, and avoid being forced into an application that knows how to connect to the Chado database.



Any wisdom on the pros/cons/limits/expectations of memory vs. database would be greatly appreciated.

Thanks

Scooter

 

Re: Memory Database

Lincoln Stein
There is going to be a big initial performance penalty when parsing the file into memory for the first time, which is why it is important to use FastCGI or mod_perl so that the hit is one-time only. The in-memory implementation does use hashes to speed up searches by genomic position, sequence name, and feature type, but it falls back to a linear search when searching on attributes. It is quite fast for most browser-related tasks; in particular, the overlapping-range search is a lot faster than a naive SQL range search because it bins the search space (a sketch of the binning idea is below). However, the SQL adaptors use the same binning scheme.
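For readers unfamiliar with binning, here is a minimal sketch of the general idea in Python, using a single fixed bin width rather than the hierarchical bins the real adaptors use; the class and method names are made up for illustration and are not GBrowse's API:

```python
from collections import defaultdict

BIN_SIZE = 10_000   # illustrative fixed bin width; real adaptors use hierarchical bins

class FeatureIndex:
    """Toy range index: each feature is registered in every bin it overlaps, so an
    overlap query only scans the bins covering the query range instead of every
    feature on the scaffold."""

    def __init__(self):
        self.bins = defaultdict(list)          # (seqid, bin number) -> [features]

    def add(self, seqid, start, end, name):
        feature = (seqid, start, end, name)
        for b in range(start // BIN_SIZE, end // BIN_SIZE + 1):
            self.bins[(seqid, b)].append(feature)

    def overlapping(self, seqid, qstart, qend):
        seen, hits = set(), []
        for b in range(qstart // BIN_SIZE, qend // BIN_SIZE + 1):
            for feature in self.bins[(seqid, b)]:
                _, start, end, _ = feature
                if start <= qend and end >= qstart and feature not in seen:
                    seen.add(feature)
                    hits.append(feature)
        return hits

idx = FeatureIndex()
idx.add("scaffold_1", 100, 900, "geneA")
idx.add("scaffold_1", 50_000, 70_000, "geneB")
print(idx.overlapping("scaffold_1", 500, 60_000))   # both features overlap this window
```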

As far as memory requirements go, I have never benchmarked the in-memory adaptor, because I tend to think of it as something to get people started. I would expect each feature to eat up at least 100 bytes of memory, possibly more.
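As a rough back-of-the-envelope check against those numbers (the average line length here is assumed purely for illustration, and nothing was measured on the real dataset):

```python
# Assumed numbers for illustration only.
gff3_bytes        = 20 * 1024 ** 2   # the 20 MB feature file mentioned above
avg_line_bytes    = 100              # assumed average GFF3 line length
bytes_per_feature = 100              # the "at least 100 bytes" per-feature floor

features     = gff3_bytes // avg_line_bytes       # ~210,000 feature lines
low_estimate = features * bytes_per_feature       # ~20 MB at the 100-byte floor

print(features, low_estimate / 1024 ** 2)
# Actual per-object overhead will push this several times higher, but it still
# sits far below the 16 GB available on the machine in question.
```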

The application does check the modification time of the source files and reloads the data if they have been updated.
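A minimal sketch of that kind of modification-time check, written in Python rather than the Perl the application itself uses; `parse_gff3` below is a hypothetical loader callback:

```python
import os

class ReloadingStore:
    """Reload a source file whenever its on-disk modification time changes."""

    def __init__(self, path, loader):
        self.path = path          # path to the GFF3 or config source file
        self.loader = loader      # callable that parses the file into memory
        self.mtime = None
        self.data = None

    def get(self):
        mtime = os.stat(self.path).st_mtime
        if self.mtime is None or mtime > self.mtime:
            self.data = self.loader(self.path)   # re-parse only when the file changed
            self.mtime = mtime
        return self.data

# store = ReloadingStore("features.gff3", parse_gff3)   # parse_gff3 is hypothetical
```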

> How much server-side caching occurs for regions that have already been viewed? I'm trying to get a handle on requests that sometimes take a long time to retrieve even with the FastCGI (fgb2) option, while at other times performance seems good. If server-side image caching is being used for the regions being queried, is it faster in memory than in a database to determine whether any data elements have changed?

The images are cached to disk, so reloading a page or turning tracks off and on again is fast (a sketch of this kind of disk cache is below). When scrolling around, though, you rarely scroll back to exactly the same place you started, so image caching doesn't help much there. If you are seeing big fluctuations in request time, it may be due to other load on the machine. I can show you how to turn on debugging if you want to track down the bottlenecks.
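As a hedged illustration of disk-based image caching keyed on the rendering request (this shows the general pattern only, not GBrowse's actual cache layout or key; `render` is a hypothetical callback and the cache directory is made up):

```python
import hashlib
import os

CACHE_DIR = "/tmp/browser_image_cache"   # illustrative cache location

def cache_path(source, seqid, start, end, tracks):
    """Derive a stable cache filename from everything that affects the image."""
    key = f"{source}|{seqid}:{start}-{end}|{','.join(sorted(tracks))}"
    digest = hashlib.sha1(key.encode()).hexdigest()
    return os.path.join(CACHE_DIR, digest + ".png")

def cached_image(source, seqid, start, end, tracks, data_mtime, render):
    """Return a cached PNG if it is newer than the data; otherwise re-render."""
    path = cache_path(source, seqid, start, end, tracks)
    if os.path.exists(path) and os.path.getmtime(path) >= data_mtime:
        with open(path, "rb") as fh:
            return fh.read()            # cache hit: data has not changed since render
    os.makedirs(CACHE_DIR, exist_ok=True)
    image = render(seqid, start, end, tracks)   # hypothetical rendering callback
    with open(path, "wb") as fh:
        fh.write(image)
    return image
```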

There's a terrific application called "DNA Subway" that the CSHL DNA Learning Center has put together, which ties together GBrowse, Apollo, and several sequence analysis tools. It's possibly general enough to be used in a research environment: http://dnasubway.iplantcollaborative.org/

Lincoln





--
Lincoln D. Stein
Director, Informatics and Biocomputing Platform
Ontario Institute for Cancer Research
101 College St., Suite 800
Toronto, ON, Canada M5G0A3
416 673-8514
Assistant: Renata Musa <[hidden email]>

------------------------------------------------------------------------------


_______________________________________________
Gmod-devel mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-devel