Chado, SO, and pangenomes

Chado, SO, and pangenomes

Jim Hu
I've been thinking about this off and on for several years, and we're finally going to try it.  However, I thought I would try to get some input before we get too deep.

High-throughput sequencing is generating multiple complete genome sequences for lots of prokaryotes.  Searching NCBI Genome with

escherichia coli[organism] AND complete genome AND chromosome

already gives 35 records, and 189 if we include whole genome shotgun records.  The idea of a pangenome, the set of all genes in all strains, is starting to spread in the bacterial genomics literature, and we think of EcoliWiki as annotating the E. coli pangenome, as opposed to EcoCyc, which is based on a single E. coli genome (strain MG1655).

The question, then, is how to represent the pangenome with SO and Chado.  My thinking is based on a level of abstraction that I think may not be consistent with the historical guidelines for OBO-based systems, but I hope I'm wrong about that.  The basic idea is that the pangenome is a type of sequence_collection, and strain-specific genomes would be members of (part_of?) the pangenome.  Similarly, individual features like genes would be part_of the pangenome and would have multiple instances.

My inclination is to have a single gene across the pangenome and sets of alleles.  The minimal case would be where there is one wt allele in all strains that have the gene; that allele feature would have a feature relationship to the gene, and multiple featurelocs using different genomes as srcfeature_id.  Annotations such as GO annotations would attach to the canonical gene/protein/RNA.
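
A minimal sketch of the layout described above, assuming a drastically simplified stand-in for three Chado tables (feature, feature_relationship, featureloc) in an in-memory SQLite database. Real Chado stores types as type_id references to cvterm and has more columns; the plain-text type strings, the "allele_of" relationship name, the gene "geneX", the second strain "strain_B", and all coordinates are hypothetical placeholders for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL,
    type       TEXT NOT NULL   -- simplification: Chado uses type_id -> cvterm
);
CREATE TABLE feature_relationship (
    subject_id INTEGER REFERENCES feature(feature_id),
    object_id  INTEGER REFERENCES feature(feature_id),
    type       TEXT NOT NULL   -- simplification, as above
);
CREATE TABLE featureloc (
    feature_id    INTEGER REFERENCES feature(feature_id),
    srcfeature_id INTEGER REFERENCES feature(feature_id),
    fmin INTEGER, fmax INTEGER, strand INTEGER
);
""")

def add_feature(uniquename, type_):
    """Insert a feature row and return its id."""
    cur = conn.execute(
        "INSERT INTO feature (uniquename, type) VALUES (?, ?)",
        (uniquename, type_))
    return cur.lastrowid

def relate(subject_id, rel_type, object_id):
    """Record 'subject rel_type object' in feature_relationship."""
    conn.execute(
        "INSERT INTO feature_relationship VALUES (?, ?, ?)",
        (subject_id, object_id, rel_type))

# The pangenome is a sequence_collection; each strain genome is part_of it.
pangenome = add_feature("Ecoli_pangenome", "sequence_collection")
mg1655 = add_feature("MG1655_chromosome", "chromosome")
strain_b = add_feature("strain_B_chromosome", "chromosome")  # hypothetical strain
relate(mg1655, "part_of", pangenome)
relate(strain_b, "part_of", pangenome)

# One gene across the pangenome, with a single wt allele related to it.
gene = add_feature("geneX", "gene")
relate(gene, "part_of", pangenome)
wt_allele = add_feature("geneX_wt", "allele")
relate(wt_allele, "allele_of", gene)  # placeholder relationship name

# The one allele gets one featureloc per genome that carries it.
conn.executemany(
    "INSERT INTO featureloc VALUES (?, ?, ?, ?, ?)",
    [(wt_allele, mg1655, 100, 1000, 1),
     (wt_allele, strain_b, 5200, 6100, -1)])

def allele_locations(gene_name):
    """Return (genome, fmin, fmax) for every located allele of a gene."""
    return conn.execute("""
        SELECT src.uniquename, loc.fmin, loc.fmax
        FROM feature gene
        JOIN feature_relationship fr ON fr.object_id = gene.feature_id
        JOIN featureloc loc ON loc.feature_id = fr.subject_id
        JOIN feature src ON src.feature_id = loc.srcfeature_id
        WHERE gene.uniquename = ? AND fr.type = 'allele_of'
        ORDER BY src.uniquename
    """, (gene_name,)).fetchall()

print(allele_locations("geneX"))
```

In this sketch, annotations such as GO terms would hang off the single gene row, while strain-specific coordinates live only in featureloc, so adding a newly sequenced strain means adding featureloc rows rather than duplicating the gene.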

Does this sound like it would work?

Jim

p.s. this is, of course, ignoring the whole problem of whether SO:0000704 works for bacteria...

=====================================
Jim Hu
Associate Professor
Dept. of Biochemistry and Biophysics
2128 TAMU
Texas A&M Univ.
College Station, TX 77843-2128
979-862-4054

_______________________________________________
Gmod-schema mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-schema

Re: [SO-devel] Chado, SO, and pangenomes

Judith Blake
Reminds me of old efforts to record all genes in tetraploid plants. The problem being different numbers of genes depending on the ‘ploidy’.  I think the resource was ‘Mendel’.  Maybe some of their thinking would help here.

Judy

Here it is – Carl Price et al.
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC29857/


The Mendel database contains names for plant-wide families of sequenced plant genes. The names have either been approved by the Commission on Plant Gene Nomenclature (CPGN), an organization of the International Society for Plant Molecular Biology (ISPMB), or are identified as provisional or temporary names. Mendel also identifies the corresponding genes in individual species of plants.

On 7/20/10 7:19 PM, "Jim Hu" <jimhu@...> wrote:

[...]

Re: [SO-devel] Chado, SO, and pangenomes

Karen Eilbeck-2
In reply to this post by Jim Hu
Hi Jim,
Thanks for this email.

I may have a kind of solution for you.
We have been working on Genome Variation Format (GVF), a GFF-based format for variants, initially for personal medicine, but it is proving to be useful for a wider audience.
Our manuscript has been provisionally accepted by Genome Biology, pending a couple of revisions, and we have got the Ensembl folks and a few others on board.

It can be used for an individual, a collection of individuals or a population.
It provides a mechanism to capture the differences between a given genome and a reference, and is independent of technology used.
We have used it for SNVs, but it also works for bigger changes.
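
As a rough illustration of what a GFF-based variant record looks like to a program, here is a minimal parsing sketch. It assumes the usual nine tab-separated GFF3 columns with key=value attributes in the last column; the attribute names Reference_seq and Variant_seq are assumed from the published GVF specification and should be checked against it. The example record (sequence name, position, and alleles) is made up.

```python
from urllib.parse import unquote

# The first eight GFF3 columns; the ninth holds the attributes.
GFF_COLUMNS = ("seqid", "source", "type", "start", "end",
               "score", "strand", "phase")

def parse_gvf_line(line):
    """Parse one GFF3-style feature line into a dict.

    Attribute values are returned as lists, since GFF3 allows
    comma-separated multiple values per key.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    record = dict(zip(GFF_COLUMNS, fields[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for pair in fields[8].split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        key, _, value = pair.partition("=")
        attributes[key] = [unquote(v) for v in value.split(",")]
    record["attributes"] = attributes
    return record

# Hypothetical record: a single-nucleotide variant relative to a reference.
example = "\t".join([
    "MG1655_chromosome", "example", "SNV", "1500", "1500", ".", "+", ".",
    "ID=var1;Reference_seq=A;Variant_seq=G",
])
rec = parse_gvf_line(example)
print(rec["type"], rec["start"], rec["attributes"]["Variant_seq"])
```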

The Yandell lab has developed a suite of very cool programs called VAAST that use this format as the currency. I copied Mark in as I think you may have a few things to talk about.

Let me know if you want to know more and I’ll send you the paper and the spec. I really think it could be useful to you.
And we need to talk about poor old 704. Let's arrange something.

--Karen




On 7/20/10 5:19 PM, "Jim Hu" <jimhu@...> wrote:

[...]