failure to merge entities from two processors with same primaryIdentifier

Sam Hokin-3
Hi, devs. I know this is an old topic, but I've not been able to figure out how to fix it in my case. I've got a data source,
chado-legfed, with two processors running off of a chado database: GeneticMapProcessor, which processes genetic data, and a spin of
Kim Rutherford's SequenceProcessor, which processes genomic data for the specific and rather weird chado that we're running (Legume
Federation).

The one common element between genetics and genomics is GeneticMarker, which extends SequenceFeature as follows:

<!-- genetic markers are both genetic and genomic -->
<class name="GeneticMarker" extends="SequenceFeature" is-interface="true">
     <!-- these from featurepos -->
     <collection name="linkageGroupPositions" referenced-type="LinkageGroupPosition"/>
</class>

It's not important for this discussion, but if you're curious, I'm locating genetic markers on linkage groups (pieces of genetic
maps) using a LinkageGroupPosition entity which defines a centiMorgan position on the underlying LinkageGroup. These are purely
genetic data:

<class name="LinkageGroupPosition" is-interface="true">
     <!-- these from featurepos -->
     <attribute name="position" type="java.lang.Double"/>
     <reference name="linkageGroup" referenced-type="LinkageGroup"/>
</class>

<class name="LinkageGroup" extends="BioEntity" is-interface="true">
     <attribute name="length" type="java.lang.Double"/>
     <reference name="geneticMap" referenced-type="GeneticMap" reverse-reference="linkageGroups"/>
     <collection name="geneticMarkers" referenced-type="GeneticMarker"/>
     <reference name="sequenceOntologyTerm" referenced-type="SOTerm"/>
</class>

So, I run GeneticMapProcessor, which creates the geneticmarker records along with the geneticmarkerlinkagegrouppositions records
flawlessly. There is no issue with GeneticMapProcessor that I can see. (It does other genetic stuff as well, all fine.)

Then, I run SequenceProcessor, which in addition to all the strictly genomic data, creates geneticmarker records with sequences and
locations on chromosomes, as it should.

I've declared the primary key for GeneticMarker in chado-legfed/resources/chado-legfed_keys.properties as follows:

GeneticMap.key_primaryidentifier=primaryIdentifier

The duplicate geneticmarker records indeed have the same primaryidentifier value; sequenceid is missing from the
GeneticMapProcessor-created records as expected and name is missing from the SequenceProcessor-created records:

   id    |   primaryidentifier    |   secondaryidentifier    |   name   | sequenceid
---------+------------------------+--------------------------+----------+------------
 1009888 | 118M3-phavu            | 118M3                    | 118M3    |
 2026865 | 118M3-phavu            | 118M3                    |          |    2026866
 1009889 | 11M1-phavu             | 11M1                     | 11M1     |
 2026868 | 11M1-phavu             | 11M1                     |          |    2026869
 1009082 | 11M-Gm-phavu           | 11M-Gm                   | 11M-Gm   |
 2024452 | 11M-Gm-phavu           | 11M-Gm                   |          |    2024453

This is after all post-processing.
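
For anyone wanting to list such unmerged duplicates systematically, here is a minimal Python sketch. It assumes the geneticmarker rows have already been exported as (id, primaryidentifier) pairs; the function name is made up for illustration and is not part of InterMine.

```python
from collections import defaultdict

def find_duplicate_identifiers(rows):
    """Return {primaryidentifier: [ids]} for identifiers held by more than one record."""
    ids_by_identifier = defaultdict(list)
    for record_id, primary_identifier in rows:
        ids_by_identifier[primary_identifier].append(record_id)
    # Keep only the identifiers that appear on two or more records.
    return {ident: ids for ident, ids in ids_by_identifier.items() if len(ids) > 1}

# Sample rows copied from the geneticmarker table above.
rows = [
    (1009888, "118M3-phavu"),
    (2026865, "118M3-phavu"),
    (1009889, "11M1-phavu"),
    (2026868, "11M1-phavu"),
]
duplicates = find_duplicate_identifiers(rows)
print(duplicates)
```

An empty result would indicate that records sharing a primaryIdentifier have been merged.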

What else do I need to get geneticmarker records merged from the two processor outputs? Preserving, of course, the relationships in
geneticmarkerlinkagegrouppositions (from GeneticMapProcessor) and DNA sequences and locations (from SequenceProcessor). I do have
genomic priorities set as follows:

GeneticMarker.name = bean-genetics, *
GeneticMarker.secondaryIdentifier = bean-genetics, *
GeneticMarker.length = bean-genomics, *

But that has no effect.

Thanks! And Happy New Year!
Sam Hokin

_______________________________________________
dev mailing list
[hidden email]
http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
Re: failure to merge entities from two processors with same primaryIdentifier

Justin Clark-Casey
Hi Sam.  This is a bit of a mystery.  The integration would be done at the end of the Legume step, so I would expect chado-legfed_keys.properties to be the
critical file here.  Priorities come into play later and would only trigger errors, never a failure to do integration at all, if duplicate primary key entries
were found.

You may already have checked this many times, but are you sure that the keys file is being picked up?  A quick way to check would be to change the value to
an invalid field name, e.g.

GeneticMap.key_primaryidentifier=primaryIdentifierBAD

and look for the error in the rebuild.

If it's definitely being picked up, could you send me the full -v log?  I'm not sure this will tell us much, but it would be the starting point for investigating
further.

Best,

--
Justin Clark-Casey, Synbiomine/InterMine Developer
http://synbiomine.org
http://twitter.com/justincc

On 30/12/15 21:34, Sam Hokin wrote:

> [...]

Re: failure to merge entities from two processors with same primaryIdentifier

Sam Hokin-3
Well, Justin, as is often the case, the mystery is solved by discovering that the problem was between keyboard and chair. When I
looked at your reply I noticed that I'd entered "GeneticMap.key_primaryidentifier=primaryIdentifier" in chado-legfed_keys.properties
(which is legit on its own) rather than "GeneticMarker.key_primaryidentifier=primaryIdentifier"! Genetics dyslexia.
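
A small check along these lines could catch this kind of slip earlier. This is only a sketch: it assumes simple `Class.key_name=field` lines with `#` comments, the function names are hypothetical, and the list of classes expected to merge has to be supplied by hand.

```python
def classes_with_keys(keys_text):
    """Collect class names that have at least one key line in a *_keys.properties text."""
    classes = set()
    for line in keys_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not a key=value line.
        if not line or line.startswith("#") or "=" not in line:
            continue
        left = line.split("=", 1)[0].strip()
        # Key lines are assumed to look like ClassName.key_name=field1, field2
        class_name, _, key_name = left.partition(".")
        if key_name:
            classes.add(class_name)
    return classes

def missing_keys(keys_text, expected_classes):
    """Return the expected classes that have no key defined, sorted by name."""
    return sorted(set(expected_classes) - classes_with_keys(keys_text))

# The line as originally entered, and the same file with the intended line added.
broken = "GeneticMap.key_primaryidentifier=primaryIdentifier\n"
fixed = broken + "GeneticMarker.key_primaryidentifier=primaryIdentifier\n"

missing_before = missing_keys(broken, ["GeneticMap", "GeneticMarker"])
missing_after = missing_keys(fixed, ["GeneticMap", "GeneticMarker"])
print(missing_before, missing_after)
```

Checking against the model alone would not have helped here, since GeneticMap is a valid class; the comparison has to be against the classes one actually expects to be merged.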

So fixing that did indeed merge equal-primaryIdentifier entries correctly, although it revealed yet another issue that arises
because I've also apparently managed to muck up the primaryIdentifier between the two sources:

   id    | name  | primaryidentifier | secondaryidentifier | sequenceid
---------+-------+-------------------+---------------------+------------
 1000720 | BM151 | BM151             | BM151               |
 1008147 | BM151 | BM151-phavu       | BM151               |    2014459
 1000413 | BM152 | BM152             | BM152               |
 1008148 | BM152 | BM152-phavu       | BM152               |    2014461

The chado "name" is BM151 while the chado "uniquename" is BM151-phavu.

So, in conclusion, InterMine is working great. Happy 2016! (And thanks for jarring my brain with your reply!!!)

On 01/04/2016 11:19 AM, Justin Clark-Casey wrote:

> [...]

Re: failure to merge entities from two processors with same primaryIdentifier

Justin Clark-Casey
No problem, glad you found the issue :)  Happy 2016 to you (and the rest of the list) too.

--
Justin Clark-Casey, Synbiomine/InterMine Developer
http://synbiomine.org
http://twitter.com/justincc

On 04/01/16 22:24, Sam Hokin wrote:

> [...]
