[biomart-users] Understand the meaning of percentage_gc_content

Yongchao Ge
Somehow my previous message was not sent, so I have redrafted it here.


I'm trying to understand the meaning of the percentage_gc_content attribute returned by biomaRt. I found it quite strange that all transcripts of a gene have the same percentage_gc_content. For example, for the two transcripts (ENSMUST00000187148, ENSMUST00000115891) of gene ENSMUSG00000000103, the percentage_gc_content is exactly the same, 36.56 (see the R code below).

I did the computation manually, and the GC percentages for the two transcripts (ENSMUST00000187148, ENSMUST00000115891) are 39.46 and 40.20, respectively (see the R code below).
These two numbers are also confirmed by another source that is independent of the R code below.
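
To be explicit about what I mean by a manual GC percentage: it is the share of G and C bases among all bases of the transcript's cDNA sequence. Here is a minimal base-R sketch of that definition. The function name gc_percent and the toy sequence are made up for illustration and are not part of biomaRt.

```r
# Hypothetical helper (not part of biomaRt): percentage of G and C bases
# in a DNA sequence given as a single character string.
gc_percent <- function(dna) {
  bases <- strsplit(toupper(dna), "")[[1]]  # split into single characters
  100 * sum(bases %in% c("G", "C")) / length(bases)
}

# Toy sequence for illustration only: 4 of its 6 bases are G or C.
gc_percent("ATGCGC")
```

Applied to each transcript's own cDNA sequence, this kind of calculation should give a different value per transcript whenever the sequences differ.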

1. So my question is, what is the percentage_gc_content that is generated by biomaRt?

2. While exploring the getBM function of the biomaRt package, I found a bug when the attributes "cdna" or "gene_exon" are requested: the column names are shifted. See the printout of the variable seq in the R code below.
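
As a possible workaround for the shifted column names, one could pick the sequence column by its content rather than by its name or position. The sketch below is only an illustration under that assumption: find_seq_col is a hypothetical helper, and the data frame is a made-up stand-in for a getBM() result, not real Ensembl output.

```r
# Hypothetical helper: return the index of the first column whose values
# all look like nucleotide sequences (only A, C, G, T or N).
find_seq_col <- function(df) {
  is_seq <- vapply(
    df,
    function(col) all(grepl("^[ACGTN]+$", toupper(as.character(col)))),
    logical(1)
  )
  which(is_seq)[1]
}

# Toy stand-in for a result with swapped column names (made-up values).
toy <- data.frame(
  ensembl_transcript_id = c("ATGCGC", "GGCCAT"),
  cdna = c("TX_A", "TX_B"),
  stringsAsFactors = FALSE
)

i <- find_seq_col(toy)
names(toy)[i]  # name of the column that actually holds the sequences
```

This does not fix the underlying problem; it only avoids relying on the column labels when extracting the sequences.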



The following is the R code

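library(biomaRt); library(Biostrings)  # assumed: alphabetFrequency() comes from Biostrings
mart <- useMart(biomart = "ensembl", host = "www.ensembl.org", dataset = "mmusculus_gene_ensembl")
GCperc <- function(x) {  # x: one row of seq; sequence in column 1, transcript ID in column 2
    x1 <- DNAString(x[,1])  # assumed definition of x1; the original line is missing
    alf <- alphabetFrequency(x1, as.prob=TRUE)
    data.frame(ensembl_transcript_id=x[,2],length=length(x1),GCperc=100*sum(alf[c("G", "C")]))
}
t2g <- getBM(filters = "ensembl_transcript_id", values = c("ENSMUST00000187148", "ENSMUST00000115891"),  # filter assumed
           attributes = c("ensembl_transcript_id", "percentage_gc_content", "transcript_length"),
           mart = mart)
seq <- getBM(filters = "ensembl_transcript_id", values = t2g$ensembl_transcript_id, mart = mart,  # filter assumed
           attributes=c("ensembl_transcript_id","cdna"))#"gene_exon very messy"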

The following is the output

> print(t2g)
  ensembl_transcript_id percentage_gc_content transcript_length
1    ENSMUST00000187148                 36.56              2846
2    ENSMUST00000115891                 36.56              2816
> GCperc(seq[1,])
  ensembl_transcript_id length   GCperc
1    ENSMUST00000115891   2816 40.19886
> GCperc(seq[2,])
  ensembl_transcript_id length   GCperc
1    ENSMUST00000187148   2846 39.45889      
