The problem with human identifiers

The problem with human identifiers

Julie Sullivan
Hello devs,

I need your brains to solve a problem.

Human genes can have zero-to-many Ensembl identifiers. They also can
have zero-to-many NCBI identifiers.

Currently we are using Ensembl identifiers for the primary identifier.
However, lots of people use NCBI identifiers. Maybe, given the way we do
things here at InterMine, sticking with Ensembl is the best thing to do.
But it would be nice if we could accommodate everyone.

I've attached the PDF, but here's my original doc with Gos's notes
(request access):
https://docs.google.com/document/d/1iEUlLLrYNQOtGAFBqVtN67yum3luafiesp7QNqtW1l4/edit


_______________________________________________
dev mailing list
[hidden email]
http://mail.intermine.org/cgi-bin/mailman/listinfo/dev

Attachment: Humanidentifiers.pdf (203K)

Re: The problem with human identifiers

Gail Binkley-2
Hi Julie,

I haven't seen any response to this email yet, so I'll pipe up. I understand the reluctance to create a "new" identifier. There are already too many identifiers, and mapping between them is a nightmare. However, unique identifiers are necessary for a database to function properly. If you create a unique InterMine identifier, it should be just a number, with no other information encoded in it. Ideally it will be a number that is only used internally at InterMine and never gets out into the public ID space.

- Gail
-----
Gail Binkley
SGD Project Manager
Database Administrator SGD, ENCODE3, GO, CGD, AspGD
Stanford University
[hidden email]


From: "Julie Sullivan" <[hidden email]>
To: [hidden email]
Sent: Thursday, October 3, 2013 9:00:32 AM
Subject: [InterMine Dev] The problem with human identifiers

Re: The problem with human identifiers

Joel Richardson-2
In reply to this post by Julie Sullivan

Hi Julie,

Sorry this response took a while. I have been thinking about this on and
off for days. Partly that's because it always takes me a long time to
write anything. But mostly it's because I think this is actually a very
deep issue. Rather than simply being a question of what goes into the
primaryIdentifier field, this problem and the questions raised in the
document get at the heart of why we (MODs) do what we do.

To answer one of your questions, I would say yes, GO annotations (and any
other data) associated with an Ensembl ID *would* then also be associated
with the Entrez ID. Otherwise, what's the point?


We have the same situation with mouse genes. Entrez, Ensembl, and
Havana all define gene models that have a great deal of correspondence,
but also significant disagreements. The difference is that MGI takes on
the role of "unifying" all this (including additional cDNAs from RIKEN
and good old heritable phenotypes thrown in for good measure) into a
single nonredundant set, and assigning MGI numbers to the result. This is
a pretty involved process requiring both automation and curation. MGI
does not simply moosh things together, but actively engages the providers
to resolve issues. This has proved extremely valuable, too. Although most
genes are easy (1:1, 1:0, or 0:1), there are lots of cases where the
correspondences are tangled (1:n, n:1, and even n:m). The other big
(bigger!) issue is ensuring the stability of MGI ID assignments as the
providers update their models and (especially) when there is a new genome
build. To summarize, providing a unified, non-redundant genome feature
catalog with stably assigned IDs is one of the core activities of MGI.
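
To make the easy (1:1, 1:0, 0:1) and tangled (1:n, n:1, n:m) cases
concrete, here is a minimal Python sketch. It assumes the correspondences
between two providers are available as simple ID pairs; it is not MGI's
actual unification process, which also involves curation. The function
names and the example IDs ("E*", "N*") are hypothetical.

```python
from collections import defaultdict


def label(n_left, n_right):
    """Name a component by how many IDs each provider contributes."""
    if n_left <= 1 and n_right <= 1:
        return f"{n_left}:{n_right}"  # 1:1, 1:0, or 0:1
    if n_left == 1:
        return "1:n"
    if n_right == 1:
        return "n:1"
    return "n:m"


def classify(pairs, left_ids=(), right_ids=()):
    """Group IDs into connected components and label each one.

    pairs: (left_id, right_id) equivalence claims between two providers.
    left_ids / right_ids: IDs known to each provider, so that unmatched
    IDs show up as 1:0 or 0:1.
    """
    neighbours = defaultdict(set)
    for left, right in pairs:
        neighbours[("L", left)].add(("R", right))
        neighbours[("R", right)].add(("L", left))
    for left in left_ids:
        neighbours[("L", left)]  # register IDs that have no partner
    for right in right_ids:
        neighbours[("R", right)]

    seen, components = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first walk over one tangle
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbours[node] - component)
        seen |= component
        lefts = sorted(i for side, i in component if side == "L")
        rights = sorted(i for side, i in component if side == "R")
        components.append((label(len(lefts), len(rights)), lefts, rights))
    return components


# Hypothetical IDs: "E*" stands for one provider, "N*" for the other.
example = classify(
    pairs=[("E1", "N1"), ("E2", "N2"), ("E2", "N3"), ("E3", "N4"),
           ("E4", "N4"), ("E5", "N5"), ("E5", "N6"), ("E6", "N6")],
    left_ids=["E7"],
    right_ids=["N9"],
)
for kind, lefts, rights in example:
    print(kind, lefts, rights)
```

Treating the pairs as a graph means an n:m tangle comes out as one unit,
which is the piece a curator would then have to look at.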

Of course, all this makes life easy for MouseMine! The MGI ID goes into
the primaryIdentifier field, and everything else is a CrossReference.
The problem for HumanMine is that there is no analogous single
"authority" for human gene IDs. However, by merging the three sources of
human genes into a single set, you become (de facto) that authority. One
question is how "good" the results of a fully automatic approach will be.
For sure, all the same tangled gnarliness that we deal with for mouse
exists for human, and without any kind of curatorial step and feedback to
the providers, I think the result can only be so good. Beyond that, the
lack of stable IDs will be a problem for the user.

If it's at all possible to pick one provider as your authority and hang
other IDs off the genes whenever possible (and ignore the rest for now),
that would be clean and robust, if not complete.
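
As a rough illustration of "pick one authority, hang the other IDs off
it, ignore the rest", here is a minimal Python sketch. It assumes the
mappings arrive as (authority ID, source, other ID) triples and attaches
an ID only when it points at exactly one known authority gene. The Gene
class, build_genes, and all IDs below are hypothetical and are not
InterMine's data model or loader code.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Gene:
    primary_identifier: str
    # source name -> list of IDs from that source
    cross_references: dict = field(default_factory=dict)


def build_genes(authority_ids, mappings):
    """Return (genes keyed by authority ID, list of skipped IDs)."""
    genes = {gene_id: Gene(gene_id) for gene_id in authority_ids}

    # Collect every authority gene that each foreign ID claims to match.
    owners = defaultdict(set)
    for authority_id, source, other_id in mappings:
        owners[(source, other_id)].add(authority_id)

    skipped = []
    for (source, other_id), claimed in sorted(owners.items()):
        only = next(iter(claimed))
        if len(claimed) == 1 and only in genes:
            genes[only].cross_references.setdefault(source, []).append(other_id)
        else:
            # Ambiguous (n:1) or unknown authority gene: ignore for now.
            skipped.append((source, other_id))
    return genes, skipped


# Hypothetical data: ENSG_* is the chosen authority, NCBI IDs hang off it.
genes, skipped = build_genes(
    authority_ids=["ENSG_A", "ENSG_B"],
    mappings=[
        ("ENSG_A", "NCBI", "101"),
        ("ENSG_A", "NCBI", "102"),
        ("ENSG_A", "NCBI", "103"),
        ("ENSG_B", "NCBI", "103"),  # 103 maps to two genes: skipped
        ("ENSG_X", "NCBI", "104"),  # ENSG_X is not in the authority set
    ],
)
```

The skipped list is the "not complete" part: it records what was left
out, so the gaps can be revisited later rather than silently lost.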

That's my 2 cents. (OK, maybe 10 cents!)

Cheers,
Joel


--
Joel E. Richardson, Ph.D.
Sr. Research Scientist
Mouse Genome Informatics
The Jackson Laboratory
600 Main Street
Bar Harbor, Maine 04609
207-288-6435
[hidden email]





Re: The problem with human identifiers

Jayaraman, Pushkala
The RatMine team (Liz and I) concurs with Joel on this.
We do something similar at RGD as well. We have one RGD ID for a gene, and it contains curated information encompassing one or multiple Ensembl IDs and Entrez IDs.
We do have automated pipelines that bring in all the data from EntrezGene and all the other data sources (Ensembl, UniProt, HGNC, MGI, etc.), and our curators go through the painful process of reviewing each gene and "confirming" or "rejecting" the information automatically combined by our complex data pipelines.
Once all the data is combined, our curators then create an RGD ID for that gene and, before you know it, we have a new rat gene!
In RatMine we then postprocess the data a little bit. We make the RGD ID the primary ID for our rat genes, the MGI ID the primary ID for our mouse genes, and the Entrez ID the primary ID for our human genes (using the Complex Ids Resolvers and its cached map), and make all the others cross references to this data, so that we still maintain interoperability with the other mines and the MODs.
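
The per-organism choice described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table, the function name, and the example IDs are hypothetical, and it does not use or reproduce the actual resolver code or its cached map.

```python
# Which source supplies the primary ID for each organism, as described
# above: RGD for rat, MGI for mouse, Entrez for human.
PRIMARY_SOURCE = {"rat": "RGD", "mouse": "MGI", "human": "Entrez"}


def split_identifiers(organism, identifiers):
    """Split one gene's IDs into (primary ID, cross references).

    identifiers maps a source name to that source's ID for the gene.
    If the preferred source is missing, the primary ID is None and
    every ID stays a cross reference.
    """
    source = PRIMARY_SOURCE[organism]
    primary = identifiers.get(source)
    cross_references = {
        name: value for name, value in identifiers.items() if name != source
    }
    return primary, cross_references


# Hypothetical IDs, for illustration only.
primary, cross_references = split_identifiers(
    "human", {"Entrez": "1234", "Ensembl": "ENSG_EXAMPLE", "HGNC": "HGNC:1"}
)
```

Keeping the non-primary IDs as cross references is what lets a query that starts from an Ensembl or HGNC ID still land on the same gene.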

Pushkala

-----Original Message-----
From: [hidden email] [mailto:[hidden email]] On Behalf Of Joel Richardson
Sent: Wednesday, October 09, 2013 2:34 PM
To: Julie Sullivan; [hidden email]
Subject: Re: [InterMine Dev] The problem with human identifiers

Re: The problem with human identifiers

Lax
In reply to this post by Joel Richardson-2
I agree with Joel that it would be good if there were one set of standard IDs that forms a stable reference. MGI IDs are a great example of that! I wish the same could be done for other species as well. It makes the consumer's life so easy. It avoids unnecessary data wrangling on our part, which could lead to spurious results.
Best
Lax.
