Any way to remove a single organism (and related data) from an InterMine db?


Any way to remove a single organism (and related data) from an InterMine db?

Sam Hokin
Apologies if this is in the docs or was already hit on the dev list. I'm loading seven legumes into an InterMine db from a
tripal.chado database. Each load takes an hour or so. So, I'm on organism #5, and I realize that I need to tweak my version of the
chado SequenceProcessor for this organism (I've got organism-specific tweaks to match primaryIdentifier with a UniProt/NCBI/etc.
identifier when I can).

If I update my Java and rerun ant -Dsource=chado-db for this organism, after already having run it, it errors out as follows:

load:
      [echo]
      [echo]       Loading chado-db-3847 (chado-db) tgt items into production DB
      [echo]
[Finalizer] INFO com.zaxxer.hikari.pool.HikariPool - HikariCP pool db.production is shutting down.
[Finalizer] INFO com.zaxxer.hikari.pool.HikariPool - HikariCP pool db.tripal is shutting down.
[integrate] [main] INFO com.zaxxer.hikari.HikariDataSource - HikariCP pool db.production is starting.
[integrate] [main] INFO com.zaxxer.hikari.HikariDataSource - HikariCP pool db.common-tgt-items is starting.

BUILD FAILED
/home/shokin/intermine-ncgr-sh/imbuild/integrate.xml:54: The following error occurred while executing this line:
/home/shokin/intermine-ncgr-sh/imbuild/source.xml:330: java.lang.RuntimeException: Exception while dataloading - to allow multiple
errors, set the property "dataLoader.allowMultipleErrors" to true
Problem while loading item identifier 0_1 because
There is already an equivalent in the database from this source (<Source: name="chado-db-3847", type="chado-db", skeleton=false>)
from a *previous* run; object from source in this run: "Ontology [id=1, name="Sequence Ontology",
url="http://www.sequenceontology.org"]", object from database: "Ontology [id=1000000, name="Sequence Ontology",
url="http://www.sequenceontology.org"]"; noticed problem while merging field "url" originally read from source: <Source:
name="chado-db-3847", type="chado-db", skeleton=false>

It looks like the ant task can't succeed because of a merge failure against records for this organism that already exist in the production db.

SO, is there a SQL script or ant task that will remove all data for a specific organism? Otherwise I have to clean and rebuild the
db and start all over again. That isn't the end of the world, but it's a time-waster when the four organisms I've already loaded are
fine - four hours down the drain.

OR, is this error actually benign, such that I can set dataLoader.allowMultipleErrors=true to get around it and still get the data
loaded into the production DB?
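
For reference, I assume that property can go either on the ant command line or in the mine's properties file - these are untested guesses:

    # on the command line, alongside the source selection:
    ant -Dsource=chado-db -DdataLoader.allowMultipleErrors=true

    # or as a line in the mine's properties file:
    dataLoader.allowMultipleErrors=true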


Re: Any way to remove a single organism (and related data) from an InterMine db?

Joe Carlson
Hi Sam,

I'm not from the Cambridge group, but I can weigh in while they're
asleep. (Literally: it's night there now.)

The official response will be 'drop the database and start over'.

The basic issue is that there are records in the tracker table showing
that you have already used the source chado-db-3847. You can only use a
source name once. If you were to try to reload from a source with the
same name, you would need control over which information takes precedence.
Without a way to say 'this source has higher priority than that one',
there is no well-defined way to resolve conflicts.

If you just delete the entries in the tracker table and try to restart,
the integration step will get all bothered by the fact that there are
items with no record of where they came from.

One workaround is to make another entry in project.xml that is a
duplicate of chado-db-3847 - call it 'chado-db-3847-redo' - and try
loading that data source (see the sketch below). Since there is probably
going to be different information in some of the fields when you reload,
you'll need to specify in the priorities files that chado-db-3847-redo
has a higher priority than chado-db-3847 for the fields that differ. And
getting the merge keys correct is essential, or else you'll end up with
duplicated items. Note that this will not fix any case where the first
load loaded data that you did not want; it can only add information.
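
A sketch of the two pieces - the stanza properties are just whatever your existing chado-db-3847 entry uses, and the priorities line follows the ClassName.fieldName = source-list format, if I remember it right:

    <!-- project.xml: hypothetical duplicate of the original source -->
    <source name="chado-db-3847-redo" type="chado-db">
      <!-- ...same <property> elements as the chado-db-3847 stanza... -->
    </source>

    # <mine>_priorities.properties: the redo source wins where fields conflict
    Ontology.url = chado-db-3847-redo, chado-db-3847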

4 hours down the drain isn't bad. It takes 2 weeks to build our mine.
Imagine how I feel when I need to restart that.

Joe Carlson

On 09/24/2015 11:33 AM, Sam Hokin wrote:

> Apologies if this is in the docs or was already hit on the dev list....



Re: Any way to remove a single organism (and related data) from an InterMine db?

Sam Hokin
Fair enough, Joe, thanks. Just thought I'd ask; once I get everything locked down it can be an overnight job, but if there were a
way to back out an errant data source load, I'd use it.

On 09/24/2015 04:34 PM, Joe Carlson wrote:
> Hi Sam,
>
> I'm not from the Cambridge group, but I can weigh in while they're asleep. (Literally. it's night there now.)
>
> The official response will be 'drop the database and start over'....


Re: Any way to remove a single organism (and related data) from an InterMine db?

vkrishna
Hi Sam,

As Joe has already implied, a few hours lost is not that bad (in the grand scheme of things). Our complete ThaleMine builds normally take around 14-15 hours, which is obviously nowhere close to the 2 weeks taken for Joe’s PhytoMine builds.

Restarting the build from the top is always the safest/cleanest route.

An alternative approach, which might help you recover from a faulty data source, is to set up several checkpoints at various points along the build by editing project.xml and setting `dump="true"` in the source stanza. For example, refer to the source configuration on this page: http://intermine.readthedocs.org/en/latest/database/data-sources/library/chado/?highlight=dump

These intermediary `dumps` are fully compliant InterMine databases, against which you can stand up a webapp and inspect the contents.
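
In your case that would just mean adding the attribute to each organism's stanza, something like:

    <!-- project.xml: dump="true" checkpoints the production db
         after this source finishes loading -->
    <source name="chado-db-3847" type="chado-db" dump="true">
      <!-- ...existing <property> elements unchanged... -->
    </source>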

Hope this helps!

Thank you and Regards,
Vivek

> On Sep 24, 2015, at 5:51 PM, Sam Hokin <[hidden email]> wrote:
>
> Fair enough, Joe, thanks. Just thought I'd ask....


Re: Any way to remove a single organism (and related data) from an InterMine db?

vkrishna
Also, the intermediary dump files work well in conjunction with the InterMine project_build script. Please see the documentation here: http://intermine.readthedocs.org/en/latest/database/database-building/build-script/?highlight=dump
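
From memory (please double-check against the script's usage output), a typical invocation from the mine directory looks like this - the dump-file prefix is just an example:

    # build everything, writing checkpoint dumps on localhost under the prefix
    ../bio/scripts/project_build -b -v localhost ~/dumps/legumemine-dump

    # after fixing a failure, restart from the latest checkpoint rather
    # than from scratch (the -l flag, if I recall correctly)
    ../bio/scripts/project_build -b -l -v localhost ~/dumps/legumemine-dump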

Thank you.
Vivek

> On Sep 24, 2015, at 6:13 PM, Krishnakumar, Vivek <[hidden email]> wrote:
>
> Hi Sam,
>
> As Joe has already implied, a few hours lost is not that bad....


Re: Any way to remove a single organism (and related data) from an InterMine db?

Justin Clark-Casey
In reply to this post by Joe Carlson
I'm from the Cambridge group but have much less experience than Joe :)

But yes, I would say the official response is drop and start over.  I think one issue with simple deletion is that if data source B has merged with data
source A, then on deleting data source B there is currently no way of restoring any values from data source A that were overwritten.

Setting up the project XML to output checkpoint SQL dumps can be very useful in these situations, as Vivek says elsewhere.  Dumps can be made within the
database itself or to an external file, and the data load can then be restarted with the project_build script from any point where you have a checkpoint dump.
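
For the external-file case the checkpoint is, as far as I know, an ordinary Postgres dump, so a manual restore outside project_build would be something like this (names invented; use pg_restore instead of psql if the dump turns out to be in custom format):

    # recreate a scratch database from a checkpoint taken after organism #4
    createdb legumemine_checkpoint
    psql legumemine_checkpoint < legumemine-dump_after_organism4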

Best,

--
Justin Clark-Casey, Synbiomine/InterMine Software Developer
http://synbiomine.org
http://twitter.com/justincc

On 24/09/15 22:34, Joe Carlson wrote:

> Hi Sam,
>
> The official response will be 'drop the database and start over'....


Re: Any way to remove a single organism (and related data) from an InterMine db?

Justin Clark-Casey
Oh, and I believe setting dataLoader.allowMultipleErrors=true will still blow up after about 100 errors, so it's not a handy way of simply overwriting a
previous load.

--
Justin Clark-Casey, Synbiomine/InterMine Software Developer
http://synbiomine.org
http://twitter.com/justincc

On 28/09/15 13:29, Justin Clark-Casey wrote:

> I'm from the Cambridge group but have much less experience than Joe :)
>
> But yes, I would say the official response is drop and start over....
