[biomart-users] Understand the meaning of percentage_gc_content

Yongchao Ge
Somehow my previous message was not sent, so I am sending the same message again.

Hi,

I'm trying to understand the meaning of the percentage_gc_content attribute returned by biomaRt. I found it quite strange that all transcripts of a gene have the same percentage_gc_content. For example, the two transcripts (ENSMUST00000187148 and ENSMUST00000115891) of gene ENSMUSG00000000103 both have a percentage_gc_content of exactly 36.56 (see the R code below).

I did the computation manually, and the GC percentages for the two transcripts (ENSMUST00000187148 and ENSMUST00000115891) are 39.46 and 40.20, respectively (see the R code below).
These two numbers are also confirmed by another source that is independent of the R code below.
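
For anyone who wants to repeat the manual check without Biostrings, the same quantity can be computed in base R. This is only a sketch: `gc_percent` is a hypothetical helper name (not part of biomaRt or Biostrings), and it assumes the sequence is a plain character string. Every character counts toward the denominator, so N or other ambiguity codes would lower the result.

```r
# Hypothetical helper: percentage of G and C among all characters of a sequence.
gc_percent <- function(sequence) {
    # Split the string into single upper-case characters
    bases <- strsplit(toupper(sequence), "")[[1]]
    # Fraction of characters that are G or C, as a percentage
    100 * sum(bases %in% c("G", "C")) / length(bases)
}

# Toy input: 4 of the 6 bases are G or C
gc_percent("ATGCGC")
```

Applied to the cDNA string of a transcript, this should agree with the Biostrings-based GCperc function in the R code below, provided the sequence contains only A, C, G and T.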

1. So my question is, what is the percentage_gc_content that is generated by biomaRt?

2. While exploring the getBM function of the biomaRt package, I found what looks like a bug: requesting the attribute "cdna" or "gene_exon" shifts the column names; see a printout of the variable seq in the R code below.
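
Until the column-name issue is understood, one defensive option is to locate the ID column by its content rather than by its name or position. The following is a minimal sketch under stated assumptions: `find_id_column` is a hypothetical helper, the pattern assumes mouse transcript IDs of the form ENSMUST followed by digits, and the data frame `toy` is a made-up stand-in with short fake sequences, not real getBM output.

```r
# Hypothetical helper: index of the first column whose values all look like
# Ensembl mouse transcript IDs.
find_id_column <- function(df, pattern = "^ENSMUST[0-9]+$") {
    is_id <- vapply(df, function(col) all(grepl(pattern, col)), logical(1))
    which(is_id)[1]
}

# Made-up example mimicking shifted names: the column labelled
# "ensembl_transcript_id" holds sequences and the column labelled "cdna" holds IDs.
toy <- data.frame(
    ensembl_transcript_id = c("ACGTACGT", "GGCCTTAA"),
    cdna = c("ENSMUST00000115891", "ENSMUST00000187148"),
    stringsAsFactors = FALSE
)

id_col <- find_id_column(toy)                # column that really holds the IDs
seq_col <- setdiff(seq_along(toy), id_col)   # the remaining column holds the sequences
```

With the columns identified this way, downstream code does not depend on which label getBM attaches to which column.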

Thanks,

Yongchao

----------------------------------------
The following is the R code

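library(biomaRt)
library(Biostrings)

# Connect to the Ensembl mouse gene dataset
mart <- useMart(biomart = "ensembl", host = "www.ensembl.org",
                dataset = "mmusculus_gene_ensembl")

# Compute the length and GC percentage for one row of the sequence query.
# Columns are addressed by position because the names come back shifted:
# column 1 holds the sequence and column 2 holds the transcript ID.
GCperc <- function(x)
{
    x1 <- DNAString(x[, 1])
    alf <- alphabetFrequency(x1, as.prob = TRUE)
    data.frame(ensembl_transcript_id = x[, 2], length = length(x1),
               GCperc = 100 * sum(alf[c("G", "C")]))
}

# GC content and transcript length as reported by biomaRt
t2g <- getBM(filters = "ensembl_gene_id", values = "ENSMUSG00000000103",
             attributes = c("ensembl_transcript_id",
                            "percentage_gc_content", "transcript_length"),
             mart = mart)

# cDNA sequence of each transcript ("gene_exon" output is very messy)
seq <- getBM(filters = "ensembl_gene_id", values = "ENSMUSG00000000103",
             attributes = c("ensembl_transcript_id", "cdna"),
             mart = mart)

print(t2g)
GCperc(seq[1, ])
GCperc(seq[2, ])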

--------------------------------------------------------------------
The following is the output

> print(t2g)
  ensembl_transcript_id percentage_gc_content transcript_length
1    ENSMUST00000187148                 36.56              2846
2    ENSMUST00000115891                 36.56              2816
> GCperc(seq[1,])
  ensembl_transcript_id length   GCperc
1    ENSMUST00000115891   2816 40.19886
> GCperc(seq[2,])
  ensembl_transcript_id length   GCperc
1    ENSMUST00000187148   2846 39.45889      
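
Note that the two queries return the transcripts in different row orders, so the results are easier to compare when matched on transcript ID. The sketch below rebuilds both tables by hand from the values printed above (the data frame names `reported` and `manual` are chosen here for illustration) and joins them with base R's merge.

```r
# Values copied from print(t2g) above
reported <- data.frame(
    ensembl_transcript_id = c("ENSMUST00000187148", "ENSMUST00000115891"),
    percentage_gc_content = c(36.56, 36.56),
    transcript_length = c(2846, 2816)
)

# Values copied from the two GCperc() calls above
manual <- data.frame(
    ensembl_transcript_id = c("ENSMUST00000115891", "ENSMUST00000187148"),
    length = c(2816, 2846),
    GCperc = c(40.19886, 39.45889)
)

# Join on transcript ID so that row order does not matter
comparison <- merge(reported, manual, by = "ensembl_transcript_id")
comparison$difference <- comparison$GCperc - comparison$percentage_gc_content
print(comparison)
```

The lengths agree for both transcripts, so the discrepancy lies only in the GC percentage.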
