Loading a large (21 GB) XML file


Loading a large (21 GB) XML file

Pengcheng Yang
Hi Miners,

I am running InterMine on a server with 128 GB of memory. The server became
unresponsive after all the memory was used up while I was loading a 21 GB XML
item file; the item file was produced from a GFF file. The source was made
with intermine-items-large-xml-file. I plan to split the GFF file into small
parts, regenerate the item files, and load them individually.
Why does the loading process need so much memory? Is there an alternative method?
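
For example, a rough way to split the GFF file while keeping the header/directive
lines with every part is sketched below. This is only a sketch: the part size and
output names are placeholders, it assumes the "#" lines sit at the top of the file
(no embedded ##FASTA section), and it makes no attempt to keep a gene's child
features in the same part.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Split a large GFF file into parts of roughly LINES_PER_PART feature lines,
 *  repeating the header/directive lines at the top of every part so that each
 *  part is still a readable GFF file on its own. */
public class SplitGff {
    private static final int LINES_PER_PART = 2_000_000; // placeholder chunk size

    public static void main(String[] args) throws IOException {
        String input = args[0]; // path to the big GFF file
        List<String> header = new ArrayList<>();
        PrintWriter out = null;
        int part = 0;
        int linesInPart = 0;
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream(input), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith("#")) {
                    header.add(line); // keep directives so every part gets them
                    continue;
                }
                if (out == null || linesInPart >= LINES_PER_PART) {
                    if (out != null) {
                        out.close();
                    }
                    out = new PrintWriter(new FileWriter(input + ".part" + (++part)));
                    for (String h : header) {
                        out.println(h);
                    }
                    linesInPart = 0;
                }
                out.println(line);
                linesInPart++;
            }
        } finally {
            if (out != null) {
                out.close();
            }
        }
    }
}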

Best,

Pengcheng


Re: Loading a large (21 GB) XML file

Julie Sullivan-2
No idea, 128 GB should be enough! The large XML source was specifically
created for larger XML files; it uses the items database instead of loading
directly into the target database.

The build does slurp everything into memory, so we would need at least 21
GB. Here is the code:

https://github.com/intermine/intermine/blob/dev/intermine/integrate/main/src/org/intermine/dataconversion/FullXmlConverter.java

You can see that all it does is parse the XML and then store the resulting objects.
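
For comparison, a streaming read that stores items in fixed-size batches instead
of keeping the whole file in memory would have roughly the shape below. This is
only a sketch, not the InterMine code: Item and ItemStore are placeholders,
BATCH_SIZE is arbitrary, and it assumes the usual items XML layout with
<item id="..." class="..."> elements.

import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

/** Sketch: parse one <item> element at a time with StAX and flush to storage in
 *  batches, so memory use is bounded by the batch size rather than the file size. */
public class StreamingItemLoader {

    interface ItemStore {
        void store(List<Item> batch) throws Exception;
    }

    static class Item {
        String id;
        String className;
    }

    static final int BATCH_SIZE = 10_000; // placeholder

    public static void load(String xmlPath, ItemStore store) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(new FileInputStream(xmlPath));
        List<Item> batch = new ArrayList<>();
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "item".equals(reader.getLocalName())) {
                Item item = new Item();
                item.id = reader.getAttributeValue(null, "id");
                item.className = reader.getAttributeValue(null, "class");
                // ... read the child <attribute>/<reference> elements here ...
                batch.add(item);
                if (batch.size() >= BATCH_SIZE) {
                    store.store(batch); // flush and start a new batch
                    batch = new ArrayList<>();
                }
            }
        }
        if (!batch.isEmpty()) {
            store.store(batch);
        }
        reader.close();
    }
}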


Re: Loading a large (21 GB) XML file

Pengcheng Yang
Hi Julie,

I have tested loading the sub-parts of the big XML file individually, but it
failed at the sixth part (nine parts in total, each ~2.5 GB and taking ~2180 s
to load). I am now reloading the data, with a dump action after loading the
fifth part; then I will kill the project_build process and rerun from the
dump point.

Best,

Pengcheng



Re: Loading a large (21 GB) XML file

Pengcheng Yang
Hi all,

I have tested loading the parts individually. The dump action was performed
after loading the fifth part. However, when I ran project_build with
"-l -v localhost" [1], the memory was used up while the dump was still being
loaded [2]. It seems I am stuck at this step: all of the previously loaded
data appears to be read into memory again. I had to kill project_build so
that it would not hang the server.

You can see that the 128 GB of memory has been used up [2], and this server
was only running project_build; no other application programs were running
at the same time.

Could anyone help me with this problem?

Best,

Pengcheng


1. ================= log information for project_build =================
restarting using database: locustminebeta:locust-repeat-gff-denovo-5

Tue May  8 23:20:22 CST 2018


running: dropdb -U postgres -h localhost locustminebeta

Tue May  8 23:20:23 CST 2018


running: createdb -E SQL_ASCII -U postgres -h localhost -T
locustminebeta:locust-repeat-gff-denovo-5 locustminebeta

2. ================ remaining memory while loading the dump
"locustminebeta:locust-repeat-gff-denovo-5" ==============
-bash-4.2$ free
              total       used       free     shared    buffers cached
Mem:     131961972  131557096     404876     233260       2036 128404960
-/+ buffers/cache:    3150100  128811872
Swap:     16457724          0   16457724



Re: Loading a large (21 GB) XML file

Julie Sullivan-2
Last time, the issue was that you had a webapp pointed at the database. Can
you verify that's no longer the case?


Re: Loading a large (21 GB) XML file

Pengcheng Yang
Hi Julie,

This is a different problem. Actually, I am now running the complete version
of locustmine, i.e. with all of the genome data loaded. Previously, I had only
tested the data model and widgets using a few scaffolds.

The configuration and data model have been updated at:
https://github.com/pengchy/intermine/tree/master/locustminebeta

The genome size is 6.5 Gb and the GFF file has ~16 million entries.

I have just restarted my server and am rerunning from the previously dumped
point. I still don't know why the server got stuck.

Best,

Pengcheng



Re: Loading a large (21 GB) XML file

Julie Sullivan-2
So you are not running a webapp?


Re: Loading a large (21 GB) XML file

Julie Sullivan-2
In reply to this post by Pengcheng Yang
For future me, who will be googling for the solution to this problem: this
was solved by creating one database for building and a separate database for
hosting the webapp.
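
Concretely, that means the webapp's properties point at its own copy of the
database rather than the one project_build is writing to. A minimal sketch,
assuming the standard mine.properties layout (the database names and the
".webapp" suffix are only examples; check the property names against your
own file):

# ~/.intermine/locustminebeta.properties -- used by project_build
db.production.datasource.serverName=localhost
db.production.datasource.databaseName=locustminebeta-build
db.production.datasource.user=postgres

# ~/.intermine/locustminebeta.properties.webapp -- used by the deployed webapp,
# pointed at a finished copy of the build database
db.production.datasource.serverName=localhost
db.production.datasource.databaseName=locustminebeta
db.production.datasource.user=postgres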
