Re: Handling duplicate features in GFF3

Julie Sullivan
Hi Deepak

No, your logic is exactly right. However, the key you set is an
"integration" key: it determines what counts as a unique object when
the data from your GFF3 source are merged into the database.

Is the problem happening within your GFF3 source itself? Or is the error
thrown when you try to merge the second GFF3 source into the database
(which already holds the data from your first GFF3 source)? I wouldn't
expect merge problems because, as you say, the tRNA data are from
different organisms. You might have an error *in* the source if two
tRNAs share the same location for some reason.

Can you verify what the error message says?

Thanks!
Julie

On 25/11/14 20:42, Deepak Unni wrote:

> Hi Julie,
>
> I was trying to load a GFF3 file that has many tRNAs sharing the same
> GeneID but located on different parts of a chromosome.
> Since I use the GeneID as the primaryIdentifier, I ran into the
> problem of duplicate entities when loading the GFF3.
>
> So I created a custom GFF3 source and modified the keys.properties file
> by adding,
>
> SequenceFeature.key = ChromosomeLocation
>
>
> This worked for one GFF3 file containing such duplicate tRNAs.
> Now when I use the same source for a GFF3 with tRNAs from a different
> organism, the source gives an error saying there is a duplicate.
>
> I was hoping that it would consider an entity to be a duplicate only if
> there is already an entity that has the same SequenceFeature.key value.
>
> Did I miss something in the configuration or is the assumption I made
> about the keys wrong?
>
>
> Thanks,
>
> Deepak
>
> Research Analyst
> S104A Animal Science Research Center,
> University of Missouri, Columbia

_______________________________________________
dev mailing list
[hidden email]
http://mail.intermine.org/cgi-bin/mailman/listinfo/dev

Re: Handling duplicate features in GFF3

Deepak Unni
Hi Julie,

You are right. No integration ever happens, since the two GFF3 files are from different organisms.

First, I loaded a GFF3 containing duplicate tRNAs, i.e. tRNAs with the same GeneID but different locations.
Then, using the same source as before, I loaded a GFF3 for another organism, which also has tRNAs with the same GeneID but different locations.

But this time I get an ObjectStore error:

Caused by: org.intermine.objectstore.ObjectStoreException: Duplicate objects found for pk org.intermine.model.bio.Gene.key_primaryIdentifier: Gene [briefDescription="null", chromosome=172, chromosomeLocation=2580, description="null", downstreamIntergenicRegion=null, id="616", length="73", name="null", organism=1, primaryIdentifier="102606533", score="null", scoreType="null", secondaryIdentifier="null", sequence=null, sequenceOntologyTerm=551, symbol="Trnaf-gaa", upstreamIntergenicRegion=null]


The funny thing is, if I reverse the order in which the GFF3 files are loaded, the second GFF3 loads and the first GFF3 now fails.
It seems that when loading the first GFF3, the tRNAs with the same GeneID but different locations are treated as unique (based on the key discussed earlier).
But it breaks when loading another GFF3 using the same source.

Thanks,

Deepak

Re: Handling duplicate features in GFF3

Julie Sullivan
I believe the build doesn't look at keys until it needs them, e.g.
when it's about to do a merge. So it makes sense that the error occurs
on the second source.

Have you checked the data for duplicates? E.g. do you have two tRNAs at
the same location in the same file? If so, you won't be able to use
location as an integration key.
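
One way to run this check is a short script. The following is a minimal Python sketch, not part of InterMine: it groups GFF3 feature lines by (seqid, start, end, strand) and reports any location shared by more than one feature. The function names and the sample lines are made up for illustration; the only thing assumed about the data is the standard nine-column, tab-separated GFF3 layout.

```python
from collections import defaultdict


def features_by_location(lines, feature_type=None):
    """Group GFF3 feature lines by (seqid, start, end, strand).

    `lines` is any iterable of text lines (a list or an open file).
    If `feature_type` is given, only features of that type are kept.
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\r\n")
        # Skip blank lines, comments and directives such as ##gff-version.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Anything that is not a nine-column feature line is ignored.
        if len(cols) != 9:
            continue
        seqid, _source, ftype, start, end, _score, strand, _phase, attrs = cols
        if feature_type is not None and ftype != feature_type:
            continue
        groups[(seqid, int(start), int(end), strand)].append(attrs)
    return groups


def duplicate_locations(lines, feature_type=None):
    """Return only the locations that hold more than one feature."""
    groups = features_by_location(lines, feature_type)
    return {loc: attrs for loc, attrs in groups.items() if len(attrs) > 1}


# Invented sample data, for illustration only.
sample = [
    "##gff-version 3",
    "chr1\tRefSeq\ttRNA\t100\t172\t.\t+\t.\tID=rna1",
    "chr1\tRefSeq\ttRNA\t100\t172\t.\t+\t.\tID=rna2",
    "chr1\tRefSeq\ttRNA\t500\t572\t.\t-\t.\tID=rna3",
]

for location, attrs in duplicate_locations(sample, "tRNA").items():
    print(location, attrs)
```

To check a real file, pass an open file handle instead of the sample list, e.g. `duplicate_locations(open(path), "tRNA")`. An empty result means no two features of that type share a location, so location alone would be unique within the file.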

Re: Handling duplicate features in GFF3

Deepak Unni
Hi Julie,

I checked the GFF3 files again and didn't find any two features with the same chromosome, start, stop, and strand.

For reference, I am attaching the two GFF3 files and the source used to load them.



Attachments:
dataset1_noncoding.gff3 (1M)
dataset2_noncoding.gff3 (3M)
noncoding-gff-source.tar.gz (16K)

Re: Handling duplicate features in GFF3

Julie Sullivan
Hi Deepak

Do you need to merge features from different files, e.g. is the same
gene in different GFF3 files? Or is each file unique to an organism?

If you don't need to merge items, you should remove the integration key:

        SequenceFeature.key = ChromosomeLocation

I know you added this key because you had duplicate keys and got an
error. Can you remove this integration key, run your build and send me
the error message?

Here is the doc for the GFF3 source:

http://intermine.readthedocs.org/en/latest/database/data-sources/library/gff/

Julie


Re: Handling duplicate features in GFF3

Deepak Unni
Hi Julie,

I don't expect any integration between the two GFF3 files, since they are from different organisms and hence should be mutually exclusive.

I removed the SequenceFeature key and tried loading the two GFF3 files, and I got the same error as before.

Here is the stack trace:
https://gist.github.com/deepakunni3/fada098bb4cbdda24bb8

I think I understand where the keys come into play.

But I am surprised by the error. There are no common GeneIDs (primaryIdentifiers) between the two GFF3 files, since they belong to two different organisms.
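
A claim like this can be checked mechanically. Below is a minimal Python sketch, not taken from InterMine, that counts how often each identifier value occurs among features of one type, then reports values repeated inside a file and values shared between two files. The attribute name `Name` and the feature type `gene` are assumptions about how the GeneID is stored in these files, and all function names and sample lines are invented for illustration.

```python
from collections import Counter


def parse_attributes(column9):
    """Parse a GFF3 attributes column ('tag=value;tag=value') into a dict.

    Values are kept as written; URL-style escapes are not decoded.
    """
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            tag, value = pair.split("=", 1)
            attrs[tag.strip()] = value.strip()
    return attrs


def identifier_counts(lines, feature_type="gene", attribute="Name"):
    """Count each value of `attribute` among features of `feature_type`."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\r\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        value = parse_attributes(cols[8]).get(attribute)
        if value is not None:
            counts[value] += 1
    return counts


def repeated_within(counts):
    """Identifier values that occur more than once in a single file."""
    return sorted(value for value, n in counts.items() if n > 1)


def shared_between(counts_a, counts_b):
    """Identifier values that occur in both files."""
    return sorted(set(counts_a) & set(counts_b))


# Invented sample data standing in for two GFF3 files.
file_a = [
    "##gff-version 3",
    "chr1\tRefSeq\tgene\t100\t172\t.\t+\t.\tID=geneA;Name=1001",
    "chr1\tRefSeq\tgene\t900\t972\t.\t-\t.\tID=geneB;Name=1001",
    "chr2\tRefSeq\tgene\t50\t120\t.\t+\t.\tID=geneC;Name=1002",
]
file_b = [
    "##gff-version 3",
    "chrX\tRefSeq\tgene\t10\t82\t.\t+\t.\tID=geneD;Name=2001",
]

counts_a = identifier_counts(file_a)
counts_b = identifier_counts(file_b)
print("repeated in file A:", repeated_within(counts_a))
print("shared between files:", shared_between(counts_a, counts_b))
```

With real data, each list would be replaced by an open file handle. A non-empty "repeated" list points to duplicates inside one file, while a non-empty "shared" list points to identifiers that could collide across files.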



Re: Handling duplicate features in GFF3

Julie Sullivan
The error is happening within a single source. It's almost impossible
to tell because the error messages are so cryptic, but the message
would mention the source if there were a conflict.

I do see two genes with name=102606533 in each file (gene412 and
gene413 in dataset1_noncoding.gff3).

They have different locations though. You have a few options:

1. Choose another key that is unique.

2. Remove the keys, because features aren't being merged.

   Keys may also be defined here:

   https://github.com/intermine/intermine/blob/beta/humanmine/dbmodel/resources/genomic_keyDefs.properties#L9-L11

   If you remove gene from this file, no genes will ever merge.

3. Remove one of the genes manually from the data file.

   Not a good plan long term, but it works if you just want to get
   your data loaded.

4. Add organism to your gene key. That way genes from different
   organisms won't try to merge. We do this for symbol because symbol
   isn't unique across organisms:

        Gene.key_symbol_org=symbol, organism
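
To illustrate why a two-field key such as `symbol, organism` keeps genes from different organisms apart, here is a toy Python model. It is only a sketch of the idea of grouping records by key fields, not InterMine's integration code; the record layout, the organism labels, the identifiers and the `group_by_key` helper are all hypothetical.

```python
from collections import defaultdict


def group_by_key(records, key_fields):
    """Group records that have identical values for every field in key_fields.

    Records that land in the same group are the ones a key built from
    those fields would treat as the same object.
    """
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in key_fields)
        groups[key].append(record)
    return dict(groups)


# Two hypothetical genes with the same symbol in different organisms.
genes = [
    {"symbol": "Trnaf-gaa", "organism": "organism A", "primaryIdentifier": "id-1"},
    {"symbol": "Trnaf-gaa", "organism": "organism B", "primaryIdentifier": "id-2"},
]

by_symbol = group_by_key(genes, ["symbol"])
by_symbol_org = group_by_key(genes, ["symbol", "organism"])

print(len(by_symbol))      # 1: both genes fall under the same key
print(len(by_symbol_org))  # 2: the organism field separates them
```

In this toy model, two records that agree on every key field still share a key, so a composite key only separates records that differ in at least one of its fields.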

On 27/11/14 14:48, Deepak Unni wrote:

> Hi Julie,
>
> I don't expect an integration between the two GFF3 since they are from
> different organism and hence should be mutually exclusive.
>
> I removed the SequenceFeature key and tried loading the two GFF3s and I
> got the same error as before.
>
> Here is the stack trace,
> https://gist.github.com/deepakunni3/fada098bb4cbdda24bb8
>
> I think I understood where the keys come into play.
>
> But I am surprised at the error. There are no common GeneIDs
> (primaryIdentifiers) between the two GFF3s since they belong to two
> different organism.
>
>
>
> On Thu, Nov 27, 2014 at 5:39 AM, Julie Sullivan <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     Hi Deepak
>
>     Do you need to merge features from different files? e.g. is the same
>     gene in different GFF3 files? Or is each file unique to an organism?
>
>     If you don't need to merge items, you should remove the integration key:
>
>              SequenceFeature.key = ChromosomeLocation
>
>     I know you added this key because you had duplicate keys and got an
>     error. Can you remove this integration key, run your build and send
>     me the error message?
>
>     Here is the doc for the GFF3 source:
>
>     http://intermine.readthedocs.__org/en/latest/database/data-__sources/library/gff/
>     <http://intermine.readthedocs.org/en/latest/database/data-sources/library/gff/>
>
>     Julie
>
>
>     On 26/11/14 16:28, Deepak Unni wrote:
>
>         Hi Julie,
>
>         I checked the GFF3 again and I didn't find any feature having
>         the same
>         chromosome, start, stop, and strand.
>
>         For reference I am attaching the two GFF3 and the source that is
>         used
>         for loading the GFF3.
>
>
>
>         On Wed, Nov 26, 2014 at 9:36 AM, Julie Sullivan
>         <[hidden email] <mailto:[hidden email]>
>         <mailto:[hidden email] <mailto:[hidden email]>>> wrote:
>
>              I believe the build doesn't care about keys until it needs
>         them, e.g
>              when it's going to do a merge. So that makes sense the
>         error is on
>              the second source.
>
>              Have you checked the data for duplicates? e.g. do you have
>         two tRNAs
>              at the same location in the same file? If so you won't be
>         able to
>              use that as an integration key.
>
>              On 26/11/14 14:35, Deepak Unni wrote:
>
>                  Hi Julie,
>
>                  You are right. There is never an integration that
>         happens since
>                  the two
>                  GFF3 are from different organism.
>
>                  First, I loaded a GFF3 containing duplicate tRNAs,
>         having same
>                  GeneID
>                  but different location.
>                  Then I loaded a GFF3 for another organism having tRNAs,
>         with
>                  same GeneID
>                  but different location, using the same source as before.
>
>                  But this time I get an ObjectStore error,
>
>                  Caused by:
>         org.intermine.objectstore.____ObjectStoreException:
>                  Duplicate
>                  objects found for pk
>                  org.intermine.model.bio.Gene.____key_primaryIdentifier:
>                  Gene [briefDescription="null", chromosome=172,
>                  chromosomeLocation=2580,
>                  description="null",
>         downstreamIntergenicRegion=____null, id="616",
>                  length="73", name="null", organism=1,
>         primaryIdentifier="102606533",
>                  score="null", scoreType="null", secondaryIdentifier="null",
>                  sequence=null, sequenceOntologyTerm=551,
>         symbol="Trnaf-gaa",
>                  upstreamIntergenicRegion=null]
>
>
>                  The funny thing is, if I reverse the order in which the
>                  GFF3 files are loaded, then the second GFF3 loads and
>                  the first GFF3 fails.
>                  It seems that when loading the first GFF3, the tRNAs
>                  having the same GeneID but different locations are
>                  treated as unique (based on the key discussed earlier).
>                  But it breaks when loading another GFF3 using the same
>                  source.
>
>                  Thanks,
>
>                  Deepak
>


Re: Handling duplicate features in GFF3

Deepak Unni
Hi Julie,

Your suggestion of adding organism to the Gene key worked. :)

Now I can load both GFF3 files, and several others, without any errors.

Thanks,

Deepak
 

On Thu, Nov 27, 2014 at 9:04 AM, Julie Sullivan <[hidden email]> wrote:
The error is happening within a single source. It's almost impossible to tell because the error messages are so cryptic, but the message would mention the source if there were a conflict.

I do see two genes with name=102606533 in each file (gene412 and gene413 in dataset1_noncoding.gff3).

They have different locations, though. You have a few options:

1. Choose another key that is unique.

2. Remove the keys, because the features aren't being merged. Keys may also be defined here:

   https://github.com/intermine/intermine/blob/beta/humanmine/dbmodel/resources/genomic_keyDefs.properties#L9-L11

   If you remove Gene from this file, no genes will ever merge.

3. Remove one of the genes manually from the data file. This is not a good plan long term, but it works if you just want to get your data loaded.

4. Add organism to your Gene key. That way genes from different organisms won't try to merge. We do this for symbol because symbol isn't unique across organisms:

        Gene.key_symbol_org=symbol, organism
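
To illustrate why option 4 helps, here is a minimal Python sketch of the idea behind a composite key. It is not InterMine's integration code; the function name `find_key_collisions` and the sample records are hypothetical and exist only for illustration. Records that collide on `symbol` alone stop colliding once `organism` is part of the key.

```python
from collections import defaultdict


def find_key_collisions(records, key_fields):
    """Group records by the given key fields; return groups with more than one member."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in key_fields)
        groups[key].append(record)
    return {key: members for key, members in groups.items() if len(members) > 1}


# Hypothetical records: the same symbol used in two different organisms.
genes = [
    {"symbol": "Trnaf-gaa", "organism": "organism A", "primaryIdentifier": "A1"},
    {"symbol": "Trnaf-gaa", "organism": "organism B", "primaryIdentifier": "B1"},
]

# Keyed on symbol alone, the two records look like the same object.
print(find_key_collisions(genes, ["symbol"]))

# Keyed on (symbol, organism), they are distinct and nothing collides.
print(find_key_collisions(genes, ["symbol", "organism"]))
```

Note that a composite key only separates records that differ in at least one of the key fields; two features with the same identifier in the same organism would still collide under such a key.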

On 27/11/14 14:48, Deepak Unni wrote:
Hi Julie,

I don't expect any integration between the two GFF3 files, since they are
from different organisms and hence should be mutually exclusive.

I removed the SequenceFeature key and tried loading the two GFF3s, and I
got the same error as before.

Here is the stack trace:
https://gist.github.com/deepakunni3/fada098bb4cbdda24bb8

I think I understand where the keys come into play.

But I am surprised by the error. There are no common GeneIDs
(primaryIdentifiers) between the two GFF3s, since they belong to two
different organisms.



On Thu, Nov 27, 2014 at 5:39 AM, Julie Sullivan <[hidden email]
<mailto:[hidden email]>> wrote:

    Hi Deepak

    Do you need to merge features from different files? e.g. is the same
    gene in different GFF3 files? Or is each file unique to an organism?

    If you don't need to merge items, you should remove the integration key:

             SequenceFeature.key = ChromosomeLocation

    I know you added this key because you had duplicate keys and got an
    error. Can you remove this integration key, run your build and send
    me the error message?

    Here is the doc for the GFF3 source:

    http://intermine.readthedocs.org/en/latest/database/data-sources/library/gff/

    Julie


    On 26/11/14 16:28, Deepak Unni wrote:

        Hi Julie,

        I checked the GFF3 again and didn't find any features sharing the
        same chromosome, start, stop, and strand.

        For reference, I am attaching the two GFF3 files and the source
        used to load them.



        On Wed, Nov 26, 2014 at 9:36 AM, Julie Sullivan
        <[hidden email] <mailto:[hidden email]>
        <mailto:[hidden email] <mailto:[hidden email]>>> wrote:

             I believe the build doesn't care about keys until it needs
             them, e.g. when it's going to do a merge. So it makes sense
             that the error is on the second source.

             Have you checked the data for duplicates? E.g. do you have
             two tRNAs at the same location in the same file? If so, you
             won't be able to use that as an integration key.
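
One way to run the duplicate check suggested above is a short script over the GFF3 file. The following is a rough sketch, not part of InterMine: the function names are hypothetical, the choice of the `Name` attribute as the identifier is an assumption (your files may carry the GeneID in a different attribute), and the coordinates in the sample lines are made up.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Parse the GFF3 attributes column (tag=value pairs separated by ';')."""
    attrs = {}
    for part in column9.strip().split(";"):
        if "=" in part:
            tag, value = part.split("=", 1)
            attrs[tag.strip()] = value.strip()
    return attrs


def duplicate_features(gff3_lines, attribute="Name", feature_type="gene"):
    """Return {attribute value: [(seqid, start, end, strand), ...]} for values seen more than once."""
    seen = defaultdict(list)
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        value = parse_attributes(cols[8]).get(attribute)
        if value is not None:
            seen[value].append((cols[0], int(cols[3]), int(cols[4]), cols[6]))
    return {value: locs for value, locs in seen.items() if len(locs) > 1}


# Made-up sample lines; only the IDs and the shared name come from this thread.
sample = [
    "##gff-version 3",
    "chr1\texample\tgene\t100\t172\t.\t+\t.\tID=gene412;Name=102606533",
    "chr1\texample\tgene\t900\t972\t.\t-\t.\tID=gene413;Name=102606533",
    "chr1\texample\tgene\t2000\t2500\t.\t+\t.\tID=gene500;Name=999",
]

for name, locations in duplicate_features(sample).items():
    print(name, locations)
```

For a real file, pass an open file handle instead of the sample list, e.g. `duplicate_features(open("dataset1_noncoding.gff3"))`. Identifiers reported here share a value but may still sit at different locations, which is the situation described in this thread.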

             On 26/11/14 14:35, Deepak Unni wrote:

                 Hi Julie,

                 You are right. No integration ever happens, since the
                 two GFF3 files are from different organisms.

                 First, I loaded a GFF3 containing duplicate tRNAs that
                 have the same GeneID but different locations.
                 Then I loaded a GFF3 for another organism, also with
                 tRNAs that share a GeneID but differ in location, using
                 the same source as before.

                 But this time I get an ObjectStore error:

                 Caused by: org.intermine.objectstore.ObjectStoreException:
                 Duplicate objects found for pk
                 org.intermine.model.bio.Gene.key_primaryIdentifier:
                 Gene [briefDescription="null", chromosome=172,
                 chromosomeLocation=2580, description="null",
                 downstreamIntergenicRegion=null, id="616",
                 length="73", name="null", organism=1,
                 primaryIdentifier="102606533",
                 score="null", scoreType="null", secondaryIdentifier="null",
                 sequence=null, sequenceOntologyTerm=551,
                 symbol="Trnaf-gaa",
                 upstreamIntergenicRegion=null]


                 The funny thing is, if I reverse the order in which the
                 GFF3 files are loaded, then the second GFF3 loads and
                 the first GFF3 fails.
                 It seems that when loading the first GFF3, the tRNAs
                 having the same GeneID but different locations are
                 treated as unique (based on the key discussed earlier).
                 But it breaks when loading another GFF3 using the same
                 source.

                 Thanks,

                 Deepak

