Direct data loading from a db

Direct data loading from a db

joe carlson
I've made a pull request for a direct data loader that takes its data
from another database. It's a pretty thin layer on top of the
DirectDataLoader. I found it useful when I was loading data from Chado.
(The next Phytozome has 175 organisms in the mine. This cut the loading
time down from a couple of weeks to a couple of days.)

It took me a little while to work out how to do direct loading using the
gradle build system. I put examples of what worked (for me) in
https://github.com/JoeCarlson/demo-sources

Joe

_______________________________________________
dev mailing list
[hidden email]
https://lists.intermine.org/mailman/listinfo/dev
Re: Direct data loading from a db

Paulo Nuin
Hi Joe

Does loading directly from another database save time because of faster SQL transactions (assuming you are using Postgres for your mine), or is it because the data is already structured in some way?

Thanks

Paulo



Re: Direct data loading from a db

joe carlson
Hi Paulo,

It saves time because you're not going through the items-<mine> database.
Loading data is normally a two-step process: item information is
loaded into the first database, then a second operation takes the data
from the items database, merges it (if necessary) with records in the
production database, and inserts it.

For us, loading organisms from our chado db happened pretty quickly at
first, but as we added more and more, the merging operation slowed down
considerably.

With the direct data loader, you put things into the production database
in one step. It works best in those cases where you are not merging data
with things already in the production database and you do not need to
make references to existing data.
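
As a rough illustration of that trade-off, here is a toy model in Python. It is not InterMine code: the function names, the `primary_id` key, and the record fields are all made up for the example, and plain lists stand in for the items and production databases. It only shows why skipping the merge step is safe when nothing overlaps, and what happens when something does.

```python
"""Toy model of the two loading paths (hypothetical names, not InterMine's API)."""


def staged_load(items, production, key="primary_id"):
    """Two-step path: stage the items, then merge each one into production."""
    # Step 1: copy the incoming records into a stand-in "items" store.
    staging = [dict(item) for item in items]
    # Step 2: look every staged record up against what production already holds.
    index = {row[key]: row for row in production}
    merged = 0
    for item in staging:
        existing = index.get(item[key])
        if existing is None:
            production.append(item)
            index[item[key]] = item
        else:
            existing.update(item)  # fold the new fields into the existing record
            merged += 1
    return merged  # number of records merged rather than inserted


def direct_load(items, production):
    """One-step path: insert straight into production, with no merge lookup."""
    rows = [dict(item) for item in items]
    production.extend(rows)
    return len(rows)  # number of records inserted


genes = [{"primary_id": "G1", "name": "abc"}, {"primary_id": "G2", "name": "def"}]
update = [{"primary_id": "G1", "length": 1200}]

via_staging = []
staged_load(genes, via_staging)
staged_load(update, via_staging)  # merged into the existing G1 record

via_direct = []
direct_load(genes, via_direct)
direct_load(update, via_direct)  # stored as a second, separate G1 record

print(len(via_staging), len(via_direct))  # prints "2 3"
```

In this sketch the direct path never consults existing records, so it only gives the same result as the staged path when the incoming data does not overlap with what is already loaded.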

Joe

Re: Direct data loading from a db

Paulo Nuin
Hi Joe

Thanks. Indeed, the amount of time spent checking the items is quite high. It makes sense to bypass that step when not actually merging things.

Cheers
Paulo



Re: Direct data loading from a db

Sam Hokin-3
Sounds attractively dangerous to me. :)
