doc on primary keys

doc on primary keys

joe carlson
I’ve been looking at the documentation on primary keys

http://intermine.readthedocs.org/en/latest/database/database-building/primary-keys/

Is this up to date?

I’ve been wondering exactly what the purpose of genomic_keyDefs.properties in <mine>/dbmodel/resources/ is. From what I see in the doc, this file must contain the description of all keys that will be used by any of the data loaders. And the key properties files in bio/source/<sourcename>/resources/ refer to these keys by key name with the syntax Classname = keyname. Is this just an older way of doing it that has been replaced by the form Classname.keyname=field1, field2 in the individual loaders?

I see that the doc says the central key file is an older way of defining keys. If I have no cases of specifying keys by name in any of my sources, is the genomic_keyDefs.properties file used anywhere?

Thanks,

Joe
_______________________________________________
dev mailing list
[hidden email]
https://lists.intermine.org/mailman/listinfo/dev

Re: doc on primary keys

Julie Sullivan-2


On 27/04/16 20:05, Joe Carlson wrote:
> I’ve been looking at the documentation on primary keys
>
> http://intermine.readthedocs.org/en/latest/database/database-building/primary-keys/
>
> Is this up to date?

Yes

> I’ve been wondering exactly what was the purpose of
> genomic_keyDefs.properties in <mine>/dbmodel/resources/. From what I
> see in the doc, this file must contain the description of all keys
> that will be used by any of the data loaders. And the key properties
> files in bio/source/<sourcename>/resources/ refer to these keys by key
> name with the syntax Classname = keyname. Is this just an older way of
> doing it and has been replaced by the form
> Classname.keyname=field1, field2 in the individual loaders?

Yes, exactly. We thought it was easier to define keys in each data
source, so we added the new way. However, some people disagreed and
wanted to keep the old way, so we left both.

> I see that the doc says the central key file is an older way of
> defining keys. If I have no cases of specifying keys by name in any of
> my sources, is the genomic_keyDefs.properties file used anywhere?

No. If you've defined keys in each data source, the central keys file is
not used for integration.
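
The two syntaxes discussed above can be sketched as a small resolver. This is only an illustration, not InterMine's own loading code: the function names, the sample key names, and the assumption that the central genomic_keyDefs.properties file uses the same Classname.keyname = fields layout are all hypothetical, and class inheritance is ignored.

```python
def parse_properties(text):
    """Parse simple 'name = value' lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        props[name.strip()] = value.strip()
    return props


def split_fields(value):
    """Split a comma-separated value into a list of trimmed names."""
    return [item.strip() for item in value.split(",")]


def resolve_keys(source_props, central_props):
    """Return {class_name: {key_name: [fields]}} for one source."""
    resolved = {}
    for name, value in source_props.items():
        if "." in name:
            # Newer form: Classname.keyname = field1, field2
            class_name, key_name = name.split(".", 1)
            resolved.setdefault(class_name, {})[key_name] = split_fields(value)
        else:
            # Older form: Classname = keyname, looked up in the central file
            # (assumed layout: Classname.keyname = field1, field2).
            for key_name in split_fields(value):
                fields = split_fields(central_props[name + "." + key_name])
                resolved.setdefault(name, {})[key_name] = fields
    return resolved


# Hypothetical sample content, for illustration only.
CENTRAL = "Gene.key_primaryidentifier = primaryIdentifier"
SOURCE_OLD = "Gene = key_primaryidentifier"
SOURCE_NEW = "Gene.key_ids = primaryIdentifier, organism"

old_style = resolve_keys(parse_properties(SOURCE_OLD), parse_properties(CENTRAL))
new_style = resolve_keys(parse_properties(SOURCE_NEW), {})
```

In this sketch the newer form never touches the central file, which matches the answer above: if every source defines its keys inline, the central file plays no part.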

> Thanks,
>
> Joe

Re: doc on primary keys

Sam Hokin-3
I had a bit of confusion re. keys between data _sources_ (individual entries in project.xml) versus the data _type_ to which they
belong (given by the "type" field in the project.xml entry). I've got a whole bundle of processors, each in a separate data
_source_, but they're all under bio/source/legfed; here are four of them:

     <source name="chado-genomics" type="legfed" dump="true">...</source>
     <source name="chado-genetics" type="legfed" dump="true">...</source>
     <source name="chado-featureprop" type="legfed" dump="true">...</source>
     <source name="chado-go" type="legfed" dump="true">...</source>

I thought from the docs that I could use a single keys file /bio/source/legfed/resources/legfed_keys.properties rather than an
individual one for each source. And it seemed to hold up fairly well for a while. But then I discovered that changes to
legfed_keys.properties didn't "take" (despite ant clean, etc.), while if I created a separate <source>_keys.properties file for each
_source_ in legfed, they were read correctly during integration.

It may just have been my misunderstanding of the docs, but I was certainly confused on this one. I do see the advantage of separate
keys files for each source (for example, when a particular data source provides the secondaryIdentifier, not the primaryIdentifier
to merge on), but when that's not the case, it'd be super handy to be able to define a default keys file for all _sources_ within a
_type_. Not a huge deal, but it crossed me up for quite a while.

A second aspect of this is that you have to duplicate keys files when you're using the same processor but a different source
database - because the source name is different. I'm merging data from two different chado databases (one specifically for peanuts,
the other for other legumes). So, even though the only difference in the data source definitions is the organisms and the db.name, I
have to create a new keys file for each re-used source, for example:

     <!-- chado genomics - bean, soybean -->
     <source name="chado-genomics" type="legfed" dump="true">
       <property name="source.db.name" value="tripal"/>
       <property name="organisms" value="3847 3885 3398"/>
        ...
       <property name="processors" value="org.intermine.bio.dataconversion.SequenceProcessor"/>
     </source>

     <!-- chado genomics - peanut -->
     <source name="chado-genomics2" type="legfed" dump="true">
       <property name="source.db.name" value="peanutbase"/>
       <property name="organisms" value="3398 3817 3818 130453 130454"/>
        ...
       <property name="processors" value="org.intermine.bio.dataconversion.SequenceProcessor"/>
     </source>

This requires a chado-genomics2_keys.properties file as well as the original chado-genomics_keys.properties. If there were a
properly-working default keys file under bio/sources/legfed/resources, I'd not have to duplicate the keys file, provided the default
were sufficient.
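
The naming rule described above (one <source>_keys.properties per source name, regardless of the shared type) can be sketched as follows. This is an illustration under assumptions: the sample XML is a cut-down version of the entries above with a hypothetical <sources> wrapper, and expected_keys_files is not an InterMine function.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the relevant part of project.xml (assumed wrapper).
PROJECT_XML = """
<sources>
  <source name="chado-genomics" type="legfed" dump="true">
    <property name="source.db.name" value="tripal"/>
  </source>
  <source name="chado-genomics2" type="legfed" dump="true">
    <property name="source.db.name" value="peanutbase"/>
  </source>
</sources>
"""


def expected_keys_files(project_xml):
    """List (source name, type, expected keys file name) for each source."""
    root = ET.fromstring(project_xml)
    expected = []
    for source in root.iter("source"):
        name = source.get("name")
        expected.append((name, source.get("type"), name + "_keys.properties"))
    return expected


for name, type_name, keys_file in expected_keys_files(PROJECT_XML):
    print(name, type_name, keys_file)
```

Both sources share the type legfed, yet each source name yields its own expected file name, which is why the keys file ends up duplicated.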

Picky, picky, I know. But I like to share my pain with y'all. :)

Re: doc on primary keys

joe carlson
In reply to this post by Julie Sullivan-2
Thanks, Julie.

The reason I was doing the close reading is that I noticed that I had
messed up something when building a mine. It was going extremely slowly
and I was puzzled. During the integration step, things were getting
queried based on a secondaryIdentifier even though I had never specified
that in any (known) key properties file.

I later realized it was an ant thing. (Thus reinforcing my general
formiphobia.) I had played around with using the secondaryIdentifier as
the key field in one of my loaders, but then later decided against it.
Even though I had changed the key properties file in my
<source>/resources/ directory, the version in <source>/build was not
regenerated with the next build. Running an 'ant clean' solved this
issue, but it looks like there are some dependencies not captured in the
build files.
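
One way to spot this kind of stale copy before integrating is to compare the two files directly. The sketch below is a hypothetical helper, not part of the InterMine build; it assumes the built copy sits directly under <source>/build/ with the same file name, which may not match the real build layout.

```python
import filecmp
from pathlib import Path


def stale_build_copies(source_dir):
    """Return names of keys files whose build/ copy is missing or differs."""
    source_dir = Path(source_dir)
    stale = []
    for original in sorted((source_dir / "resources").glob("*_keys.properties")):
        # Assumed location of the built copy: <source>/build/<same name>.
        built = source_dir / "build" / original.name
        # shallow=False forces a byte-by-byte comparison instead of
        # trusting file size and modification time.
        if not built.exists() or not filecmp.cmp(original, built, shallow=False):
            stale.append(original.name)
    return stale
```

A non-empty result would suggest running 'ant clean' in that source directory before the next build.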

Joe

Re: doc on primary keys

Sam Hokin-3
Oh. Hrm. This is exactly what happened to me when my changes to <source>/resources/<source>_keys.properties didn't "take". But my
ant clean, which had no effect, was in the integrate dir, not in <source>. So my skrieg yesterday about not having a reliable
default keys file under <source>/resources may be misplaced; sounds like it's a build bug as Joe suggests. I'll try again with a
single keys file but run ant clean under <source> whenever I change it and see how it goes.

Re: doc on primary keys

Sam Hokin-3
Just an update/unskrieg: it does turn out my keys problem was the same as Joe's problem; I need to ant clean in <type> and
<type>/main if I change the "default" (as I call it) <type>/resources/<type>_keys.properties file, which is used in lieu of all the
individual files I'd need to have for my sundry sources under this type. It does work correctly if I do the cleaning after changes,
before I integrate. The integration still throws an error about the missing foobar_keys.properties file for the specific foobar
datasource under that type, but it does respect the <type>_keys.properties definitions all the same. (I'm using "type" here rather
than "source" to correspond to the definitions in project.xml. There can be many datasources under a single "type" tree, as I have,
and this particular issue relates to the fact that my "type" bio/sources/legfed has a dozen sources/processors, all individual
datasource entries in project.xml.)
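
The behaviour observed above (a source-specific file is looked for first, and the <type>_keys.properties file is still respected when it is missing) can be pictured as a simple fallback. This sketch only mirrors that observation; choose_keys_file is a hypothetical helper and is not taken from InterMine's integration code.

```python
from pathlib import Path


def choose_keys_file(resources_dir, source_name, type_name):
    """Pick the source-specific keys file if present, else the type-level one."""
    resources_dir = Path(resources_dir)
    candidates = (
        source_name + "_keys.properties",  # e.g. foobar_keys.properties
        type_name + "_keys.properties",    # the "default" file for the type
    )
    for candidate in candidates:
        path = resources_dir / candidate
        if path.is_file():
            return path
    return None  # neither file exists
```

With only legfed_keys.properties present, every source under the legfed type would resolve to that one file in this sketch.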
