[biomart-users] Understand the percentage_gc_content

Yongchao Ge
Hi,

I'm trying to understand how percentage_gc_content, as returned by the R package biomaRt, is computed. I initially found it strange that percentage_gc_content is the same for all transcripts of a gene (see the R code below).
For example, the two transcripts (ENSMUST00000187148, ENSMUST00000115891) of gene ENSMUSG00000000103 both have a percentage_gc_content of 36.56.

I also computed the GC percentage manually from the cDNA sequences and obtained 39.46 and 40.20 for transcripts ENSMUST00000187148 and ENSMUST00000115891, respectively (see the R code below). I obtained the same result from another source that is independent of the R code below.

1. So my question is: what is the exact meaning of percentage_gc_content in biomaRt?

2. While exploring this, I found that getBM appears to have a bug when the attributes include "cdna" or "gene_exon": the column names of the result are shifted (see the printout of the seq variable in the R code below). It would be nice to have this bug fixed.

The following is the R code

## Understand the %GC content obtained from biomaRt
library(biomaRt)
library(Biostrings)
mart <- useMart(biomart = "ensembl", host = "www.ensembl.org",
                dataset = "mmusculus_gene_ensembl")

## GC percentage for one row of the getBM() sequence result.
## Columns are taken by position because the names are shifted:
## column 1 holds the sequence, column 2 the transcript ID.
GCperc <- function(x)
{
    x1 <- DNAString(x[, 1])
    alf <- alphabetFrequency(x1, as.prob = TRUE)
    data.frame(ensembl_transcript_id = x[, 2], length = length(x1),
               GCperc = 100 * sum(alf[c("G", "C")]))
}

## %GC and transcript length as reported by BioMart
t2g <- getBM(filters = "ensembl_gene_id", values = "ENSMUSG00000000103",
             attributes = c("ensembl_transcript_id",
                            "percentage_gc_content", "transcript_length"),
             mart = mart)
## cDNA sequence of each transcript ("gene_exon" is very messy)
seq <- getBM(filters = "ensembl_gene_id", values = "ENSMUSG00000000103",
             attributes = c("ensembl_transcript_id", "cdna"),
             mart = mart)
print(t2g)
GCperc(seq[1, ])
GCperc(seq[2, ])


##and the output
> print(t2g)
  ensembl_transcript_id percentage_gc_content transcript_length
1    ENSMUST00000187148                 36.56              2846
2    ENSMUST00000115891                 36.56              2816
> GCperc(seq[1,])
  ensembl_transcript_id length   GCperc
1    ENSMUST00000115891   2816 40.19886
> GCperc(seq[2,])
  ensembl_transcript_id length   GCperc
1    ENSMUST00000187148   2846 39.45889
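
The manual calculation above can also be sketched in base R, without Biostrings and without relying on the column names or positions that getBM returns. This is only an illustration: the helper names (gc_percent, find_sequence_column) and the toy data frame are made up for this example, and the sequence column is detected by content on the assumption that IDs never consist solely of the letters A, C, G, T and N.

```r
# Base-R sketch: GC percentage of each sequence in a getBM()-style result,
# locating the sequence column by content instead of by name or position.

# Percentage of G and C among all characters of a single sequence string.
gc_percent <- function(s) {
  bases <- strsplit(toupper(s), "", fixed = TRUE)[[1]]
  100 * sum(bases %in% c("G", "C")) / length(bases)
}

# Index of the first column whose values all look like nucleotide sequences.
find_sequence_column <- function(df) {
  is_seq <- vapply(
    df,
    function(col) all(grepl("^[ACGTN]+$", toupper(as.character(col)))),
    logical(1)
  )
  which(is_seq)[1]
}

# Toy stand-in for a getBM() result; IDs and sequences are hypothetical,
# and the column names are deliberately uninformative.
toy <- data.frame(
  V1 = c("GGCCAATT", "ATATATGC"),
  V2 = c("TOY_T1", "TOY_T2"),
  stringsAsFactors = FALSE
)

seq_col <- find_sequence_column(toy)
id_col <- setdiff(seq_along(toy), seq_col)[1]
result <- data.frame(
  id = toy[[id_col]],
  length = nchar(toy[[seq_col]]),
  GCperc = vapply(toy[[seq_col]], gc_percent, numeric(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
print(result)
```

On the toy data this reports 50 for the first sequence and 25 for the second. With a real getBM result, the same two helpers could be applied to the returned data frame in place of toy.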

--
You received this message because you are subscribed to the Google Groups "biomart-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
Visit this group at https://groups.google.com/group/biomart-users.
For more options, visit https://groups.google.com/d/optout.

Re: [biomart-users] Understand the percentage_gc_content

Thomas Maurel
Hello,

1. In the Ensembl gene mart, percentage_gc_content actually corresponds to the Gene %GC content. This is why all the transcripts of gene ENSMUSG00000000103 return the same value, 36.56.
We can rename this attribute on the interface to “Gene % GC content”, and the biomaRt attribute to “percentage_gene_gc_content”, for our next release e!88 if that would make things clearer.
2. The Bioconductor team maintains the biomaRt R package; could you please report this on the Bioconductor support forum: https://support.bioconductor.org/
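
A quick way to tell whether an attribute is gene-level rather than transcript-level is to count its distinct values per gene. The sketch below does this on the values reported earlier in this thread; the data frame is typed in by hand rather than fetched from BioMart, and the helper name distinct_per_gene is made up for this example.

```r
# Values copied by hand from the output posted in this thread.
t2g <- data.frame(
  ensembl_gene_id = "ENSMUSG00000000103",
  ensembl_transcript_id = c("ENSMUST00000187148", "ENSMUST00000115891"),
  percentage_gc_content = c(36.56, 36.56),
  transcript_length = c(2846, 2816),
  stringsAsFactors = FALSE
)

# Number of distinct values of `column` within each gene.
distinct_per_gene <- function(df, column) {
  tapply(df[[column]], df$ensembl_gene_id, function(v) length(unique(v)))
}

gc_distinct <- distinct_per_gene(t2g, "percentage_gc_content")
len_distinct <- distinct_per_gene(t2g, "transcript_length")
print(gc_distinct)   # one distinct value: behaves like a gene-level attribute
print(len_distinct)  # two distinct values: varies per transcript
```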

Hope this helps,
Kind Regards,
Thomas
On 20 Dec 2016, at 19:32, Yongchao Ge <[hidden email]> wrote:

[...]

--
Thomas Maurel
Bioinformatician - Ensembl Production Team
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom


Re: [biomart-users] Understand the percentage_gc_content

Yongchao Ge
Thanks Thomas for the reply,

Can you give more details on how "Gene %GC content" is computed?

For example, how do you define the sequence of a gene, given that a gene is a collection of many transcripts (isoforms) and different isoforms have different nucleotide sequences?

From an initial guess and a manual check on the example gene ENSMUSG00000000103, it appears that you take all the nucleotide bases between the gene start position and the gene end position, including positions that are intronic in every transcript. If that is the case, I'm wondering in which applications the "Gene %GC content" would be useful.
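
To make the guess above concrete, here is a toy sketch of how a GC percentage taken over the whole gene span can differ from one taken over the spliced exons only. It illustrates the hypothesis and is not a confirmed description of how Ensembl computes the value; the sequence, the exon coordinates and the helper name gc_percent are all made up.

```r
# Toy illustration: GC% over a whole gene span versus over spliced exons only.

# Percentage of G and C among all characters of a single sequence string.
gc_percent <- function(s) {
  bases <- strsplit(toupper(s), "", fixed = TRUE)[[1]]
  100 * sum(bases %in% c("G", "C")) / length(bases)
}

# Hypothetical 30-base "gene": two GC-rich exons around an AT-rich intron.
genomic <- "GCGCGCGCGCATATATATATGCGCGCGCGC"
exons <- data.frame(start = c(1, 21), end = c(10, 30))  # 1-based, inclusive

# Spliced transcript: concatenate the exon substrings in order.
transcript <- paste(substring(genomic, exons$start, exons$end), collapse = "")

gene_gc <- gc_percent(genomic)           # whole span, intron included
transcript_gc <- gc_percent(transcript)  # exons only
cat(sprintf("gene span: %.1f%%, spliced transcript: %.1f%%\n",
            gene_gc, transcript_gc))
```

As in the real example, where the gene-level value (36.56) is lower than both transcript-level values, the intron pulls the whole-span figure below the exon-only figure.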

Yongchao



On Wed, Dec 21, 2016 at 5:44 AM, Thomas Maurel <[hidden email]> wrote:
[...]

Re: [biomart-users] Understand the percentage_gc_content

Thomas Maurel
Dear Yongchao,

I am afraid that is beyond my knowledge; could you please contact the Ensembl Helpdesk: http://www.ensembl.org/Help/Contact. Someone there should be able to tell you how we generate the “Gene %GC content” in Ensembl.

Hope this helps,
Kind Regards,
Thomas
On 21 Dec 2016, at 14:15, Yongchao Ge <[hidden email]> wrote:

[...]
