Shouldn't unique keys be enforced during data source load into production database? (They're not.)

Shouldn't unique keys be enforced during data source load into production database? (They're not.)

Sam Hokin-3
I've finally resolved some extreme confusion about enforcement of the unique keys defined in DATASOURCE_keys.properties. What I've
found is that the production database can be loaded with two records that violate those key definitions without any error being
thrown; however, if you then run _another_ data source that attempts to merge with those records, you get an error like this:

Problem while loading item identifier 2_1100 because
Duplicate objects from the same data source; o1 = "QTL:10012630" (in database), o2 = "QTL:10011241" (in database), source1 =
"<Source: name="file-cmap", type="null", skeleton=false>", source2 = "<Source: name="file-cmap", type="null", skeleton=false>"

Fair enough, but file-cmap was run two sources ago! And, yes, there ARE two records that violate the key definitions IN the
production database from file-cmap:

soymine=> select * from qtl where secondaryidentifier='Pod maturity 22-3';
-[ RECORD 1 ]--------------+-------------------------------------
id                         | 10011241
primaryidentifier          | GmComposite2003_H_Pod maturity 22-3
secondaryidentifier        | Pod maturity 22-3
sequenceontologytermid     | 10012811
organismid                 | 10000001
class                      | org.intermine.model.bio.QTL
-[ RECORD 2 ]--------------+-------------------------------------
id                         | 10012630
primaryidentifier          | GmComposite2003_C2_Pod maturity 22-3
secondaryidentifier        | Pod maturity 22-3
sequenceontologytermid     | 10012811
organismid                 | 10000001
class                      | org.intermine.model.bio.QTL

Here are the key definitions that resulted in the error being thrown well after the culprit datasource was run:

QTL.key_primaryidentifier=primaryIdentifier
QTL.key_secondaryidentifier=secondaryIdentifier,organism

The second of these keys is violated by the two records above: they share the same secondaryIdentifier ('Pod maturity 22-3') and the
same organism (organismid 10000001).

This seems fundamentally wrong (and very confusing) to me. I should not be able to load duplicate objects into the production
database which violate the unique keys that I've defined.

Is there a reason why the uniqueness rules defined in DATASOURCE_keys.properties aren't applied when loading data from a datasource
into the production database, but rather only throw an error during a merge later on? Is there something I can change so that my
uniqueness rules are enforced during data loading, so I get an informative error when it happens, rather than down the line when I'm
running a totally different data source?
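
To make concrete what I expected: the second key above is effectively a uniqueness rule, which at the database level would look
something like this (illustration only; InterMine does not create such constraints, and I haven't tried adding one by hand, so it
may well conflict with the loader):

soymine=> alter table qtl add constraint qtl_secondaryidentifier_organism_key unique (secondaryidentifier, organismid);

With that in place, the second file-cmap record above would have failed at insert time instead of surfacing two sources later.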

Re: Shouldn't unique keys be enforced during data source load into production database? (They're not.)

Joe Carlson
Sam,

I'm sure the InterMine folks will have an "official" response, but here's my 2 cents' worth.

Calling 2 things the "same" is a fluid concept. In some data loader contexts it means the same name, in others the same primaryIdentifier, in others the same primaryAccession, and so on. So your concept of 'sameness' must be specified in every data loader that touches the record. (This was my mental roadblock for a while: I was saying that 2 things were the same in one data loader and expecting them to be the same in another.)

What happens if you do not say 2 things are the same? You get 2 records. This is exactly what you asked for, so you should be happy! That looks like what happened when you loaded file-cmap.

Now, what do you want to happen when a new data source loads more information into the QTL record? You want the integrator to merge the QTL information with an existing record if it has the same key. But you already have 2. How can it do this? Do you want it to try to merge the 2 existing records into 1, then integrate the new information from this data source? It would be nice if that happened, but I don't think the integrator can do it. (That would involve chasing down all the records that reference the QTL records, making sure those should not be merged as well, and so forth.) It's better to be consistent in specifying the relevant primary keys at every step.

So, if you want 2 QTLs to be treated as the same based on secondaryIdentifier and organism, specify this whenever you're loading secondaryIdentifier and organism information.
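
Concretely, that means the same QTL key lines have to appear in the keys file of every such source, e.g. (file names other than
file-cmap_keys.properties are hypothetical):

# in file-cmap_keys.properties and in the keys file of every later source that touches QTLs
QTL.key_primaryidentifier=primaryIdentifier
QTL.key_secondaryidentifier=secondaryIdentifier,organism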

You may ask why InterMine does not just impose database constraints for uniqueness, foreign keys, and so forth. It's a performance thing: when you're doing huge inserts, it's much faster to enforce constraints at the application layer than to have the database check them on every row.

Joe




Re: Shouldn't unique keys be enforced during data source load into production database? (They're not.)

Sam Hokin-3
On 04/12/2016 01:59 PM, Joe Carlson wrote:
> Sam,
> ...
> So, if you want 2 QTLs to be treated as the same based on secondaryIdentifier and organism, specify this whenever you're loading
> secondaryIdentifier and organism information.

I did. And it wasn't enforced. That's my point. I'm using the same key defs for all of my data loaders. In fact, I'm using a single
source-wide properties file for all of my sources, legfed_keys.properties. But, just to be sure, I copied the relevant key
definitions to file-cmap_keys.properties as well. Same result.

The data loader whose keys file defines those two QTL objects as identical loaded them both into the production database without
error. I see no way to get the loader to throw an error on duplicate objects, even though I've defined a key that should have
triggered one. And I've seen this many times, which is what caused a lot of confusion on my part until I realized what was going on.

It's not the end of the world, but it sure seems like a bug to me! I can write my loaders to throw exceptions if duplicate records
come up (I always load everything into maps before storing them), but I was hoping that DATASOURCE_keys.properties would handle that
during items->production loading.
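
In the meantime, the closest thing I have to enforcement is a manual check after each source runs, e.g. for the second QTL key
(a sketch using the column names from the record dump above; InterMine doesn't run anything like this for you):

soymine=> select secondaryidentifier, organismid, count(*) from qtl group by secondaryidentifier, organismid having count(*) > 1;

That query would have flagged the two file-cmap records immediately after they were loaded.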


Re: Shouldn't unique keys be enforced during data source load into production database? (They're not.)

Julie Sullivan
In reply to this post by Sam Hokin-3
Sam

Yes, you are right. The keys are more "integration" keys than true keys.
It's up to each source to make sure the data is unique. This is not
ideal or obvious, and it can cause problems. I've made a ticket:

        https://github.com/intermine/intermine/issues/1347

Thanks
Julie
