[biomart-users] What is the difference between transcript_start, 5_utr_start, exon_chrom_start, and start_position?

[biomart-users] What is the difference between transcript_start, 5_utr_start, exon_chrom_start, and start_position?

Anders Perrone
Hi,

I'm trying to extract the promoter region (5' UTR) and, from there, also get a random upstream region as a control. I need to know the position and length of the 5' UTR relative to the coding sequence. I can't make sense of these start positions, though; I'm having trouble discerning any sort of pattern. Sometimes they're the same, sometimes they're completely different.

       ensembl_gene_id 5_utr_start transcript_start exon_chrom_start genomic_coding_start start_position
1   ENSOANG00000031802        2302               21             2214                 2214             21
7   ENSOANG00000009938        5528             3783             5365                 5365           3783
51  ENSOANG00000011569      104432           104432           104432               104458         104432
105 ENSOANG00000011041       47220            31014            47009                47009          31014
221 ENSOANG00000030544        3485             3485             3485                 3623           3485
223 ENSOANG00000003406       22424             8999            20453                20453           8999
228 ENSOANG00000014845       20160            20160            20160                20176          20160
239 ENSOANG00000029583         322              322              322                  408            322
241 ENSOANG00000008762          68               68               68                  111             68
244 ENSOANG00000028314        3585             3012             3327                 3327           3012
252 ENSOANG00000000738        5831             2472             5550                 5550           2472
292 ENSOANG00000001994       52334            52334            52334                   NA          52334
293 ENSOANG00000001994       83725            52334            83725                83756          52334
301 ENSOANG00000001996      115954           115954           115954               115996         115954
323 ENSOANG00000012886        2673             2673             2673                 2811           2673
328 ENSOANG00000021431        2477             2477             2477                 2526           2477
386 ENSOANG00000007865      630571           588129           630571                   NA         588129
387 ENSOANG00000007865      620068           588129           619927               619927         588129
396 ENSOANG00000031768      713867           693872           713824               713824         693872
424 ENSOANG00000007864      746040           746040           746040               746067         745925


--
You received this message because you are subscribed to the Google Groups "biomart-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
Visit this group at http://groups.google.com/group/biomart-users.
For more options, visit https://groups.google.com/d/optout.

Re: [biomart-users] What is the difference between transcript_start, 5_utr_start, exon_chrom_start, and start_position?

Arek Kasprzyk
Hi Anders,

It has been a while since I was involved in such calculations. Perhaps it would be safer to ask directly at the source: [hidden email]

a.

On 27 October 2015 at 21:57, Anders Perrone <[hidden email]> wrote:
Hi,

I'm trying to extract the promoter region (5' UTR) and, from there, also get a random upstream region as a control. I need to know the position and length of the 5' UTR relative to the coding sequence. I can't make sense of these start positions, though; I'm having trouble discerning any sort of pattern. Sometimes they're the same, sometimes they're completely different.


--


“You have enemies? Good. 
That means you've stood up for something, sometime in your life.”

― Winston Churchill

Re: [biomart-users] What is the difference between transcript_start, 5_utr_start, exon_chrom_start, and start_position?

Amonida Zadissa
Hi Anders and Arek,

We are currently looking into this issue; I was just about to send an
email saying so.

Best regards,
Amonida

On 29/10/2015 11:54, Arek Kasprzyk wrote:

> Hi Anders,
>
> It has been a while since I was involved in such calculations. Perhaps it
> would be safer to ask directly at the source: [hidden email].
>
> a.

Re: [biomart-users] What is the difference between transcript_start, 5_utr_start, exon_chrom_start, and start_position?

Amonida Zadissa
Dear Anders

Sorry for the delay in getting back to you regarding your question. I
have some explanations for the various attributes that you have in your
list. The BioMart interface also shows more descriptive names for these
attributes than the internal names.

Below are the variables that you have in your list along with what they
actually are. Note that the start coordinate is never greater than the
end coordinate, regardless of which strand the gene is on.

* 5_utr_start: genomic start coordinate for the 5' UTR

* transcript_start: genomic start coordinate for the transcript.
   If the transcript is on the -strand, you'll need to look at
   transcript_end to get the genomic coordinate of its 5' end.

* exon_chrom_start: genomic start coordinate for the exon

* genomic_coding_start: genomic start coordinate for the coding part
   of the exon

* start_position: genomic start coordinate for the gene. Again, if the
   gene is on the -strand, its 5' end is given by end_position.
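
A minimal Python sketch of the strand rule described above. The function
name is hypothetical, and strand is assumed to be encoded as 1 or -1 as
in the BioMart 'strand' attribute shown later in this message:

```python
def five_prime_coordinate(start: int, end: int, strand: int) -> int:
    """Return the genomic coordinate of a feature's 5' end.

    BioMart reports start <= end on both strands, so on the reverse
    strand (-1) the 5' end is the *end* coordinate.
    """
    if strand not in (1, -1):
        raise ValueError(f"strand must be 1 or -1, got {strand}")
    return start if strand == 1 else end


# Transcript ENSOANT00000041968 from this thread: 21..2302 on strand -1.
print(five_prime_coordinate(21, 2302, -1))  # 2302
# The same coordinates on the forward strand would give the start.
print(five_prime_coordinate(21, 2302, 1))   # 21
```

The same helper applies to gene coordinates (start_position and
end_position) as well as transcript coordinates.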

On the interface the columns that you have requested are:

ensembl_gene_id:      Ensembl Gene ID
5_utr_start:          5' UTR Start
transcript_start:     Transcript Start (bp)
exon_chrom_start:     Exon Chr Start (bp)
genomic_coding_start: Genomic coding start
start_position:       Gene Start (bp)

A few important things to note are:

* Which strand the gene is on
* Not all transcripts have 5' UTR
* Not all first exons are fully coding

So for getting the position and length of the 5' UTR, you might want to look
at the following variables in addition to what you have. These
additional variables make it easier to fetch the information needed,
and they are required when the gene is on the -strand.

ensembl_exon_id:    Ensembl Exon ID
5_utr_end:          5' UTR end
rank:               Exon Rank in Transcript
genomic_coding_end: Genomic coding end

The new variable, unrelated to the above list, is 'rank'. Exon rank
tells you the position of each exon within its transcript, with rank 1
being the first exon at the 5' end. You can of course also get this
information by looking at the genomic coordinates, but I find the
ranking quicker to use.
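
To illustrate why rank is convenient, here is a small Python sketch that
orders the two exons of ENSOANT00000041968 (values taken from the example
in this message) both by rank and by genomic coordinate. The dictionary
layout is an assumption for illustration, and the end-coordinate key is
assumed to be named exon_chrom_end:

```python
# One dict per exon row; the values come from the example in this thread.
rows = [
    {"ensembl_exon_id": "ENSOANE00000221420",
     "exon_chrom_start": 21, "exon_chrom_end": 245, "rank": 2},
    {"ensembl_exon_id": "ENSOANE00000228402",
     "exon_chrom_start": 2214, "exon_chrom_end": 2302, "rank": 1},
]
strand = -1

# Ordering by rank needs no knowledge of the strand.
by_rank = sorted(rows, key=lambda r: r["rank"])

# Ordering by coordinate must be reversed on the reverse strand,
# because the 5'-most exon then has the highest coordinates.
by_coord = sorted(rows, key=lambda r: r["exon_chrom_start"],
                  reverse=(strand == -1))

first_exon = by_rank[0]
print(first_exon["ensembl_exon_id"])   # ENSOANE00000228402
print([r["rank"] for r in by_coord])   # [1, 2]
```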

So the length of 5' UTR would be:

5_utr_end - 5_utr_start + 1

If your first exon is non-coding then you'll have to consider the next
exon in rank to get the full length of the 5' UTR which will then be:

5_utr_end (for second exon) - 5_utr_start (for first exon) + 1
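
The two length formulas above can be sketched in Python as follows. This
is only an illustration: the function name is hypothetical, empty BioMart
cells are assumed to arrive as None, and min/max are used so that the
span does not depend on the order of the exon rows. Note that the span
across two exons is measured on the genome and therefore includes the
intron between them, so the sketch also returns the summed length of the
individual UTR segments.

```python
from typing import List, Optional, Tuple

# (5_utr_start, 5_utr_end) for one exon row; None means an empty cell.
Segment = Tuple[Optional[int], Optional[int]]


def utr5_lengths(segments: List[Segment]) -> Optional[Tuple[int, int]]:
    """Return (genomic_span, summed_segment_length) of a 5' UTR.

    Coordinates are inclusive, hence the "+ 1". Returns None when no
    exon row carries 5' UTR coordinates.
    """
    present = [(s, e) for s, e in segments if s is not None and e is not None]
    if not present:
        return None
    span = max(e for _, e in present) - min(s for s, _ in present) + 1
    summed = sum(e - s + 1 for s, e in present)
    return span, summed


# ENSOANT00000041968: exon rank 1 has UTR 2302..2302, exon rank 2 has none.
print(utr5_lengths([(2302, 2302), (None, None)]))  # (1, 1)

# Hypothetical forward-strand transcript: the UTR covers 100..150 in the
# first exon and 300..320 in the second exon.
print(utr5_lengths([(100, 150), (300, 320)]))      # (221, 72)
```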

If your gene is on the -strand, then 5_utr_end represents the start
of the 5' UTR when fetching the upstream promoter sequence.

I've taken the first gene in your list as an example:

attribute              exon rank 1         exon rank 2
---------------------  ------------------  ------------------
ensembl_gene_id        ENSOANG00000031802  ENSOANG00000031802
ensembl_transcript_id  ENSOANT00000041968  ENSOANT00000041968
ensembl_exon_id        ENSOANE00000228402  ENSOANE00000221420
start_position         21                  21
end_position           2302                2302
transcript_start       21                  21
transcript_end         2302                2302
strand                 -1                  -1
5_utr_start            2302
5_utr_end              2302
exon_chrom_start       2214                21
exon_chrom_end         2302                245
rank                   1                   2

Length of 5' UTR = 5_utr_end - 5_utr_start + 1 = 2302 - 2302 + 1 = 1
Start of 5' UTR for fetching upstream region: 2302
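
Starting from that coordinate, an upstream window for a promoter or
control region could be computed along these lines. This is a sketch
under assumptions: the function name, the window size, and the 'offset'
parameter (for picking a control region further upstream) are all
hypothetical, and the code does not check that the window stays within
the length of the scaffold.

```python
from typing import Tuple


def upstream_window(anchor: int, strand: int, size: int,
                    offset: int = 0) -> Tuple[int, int]:
    """Return inclusive (start, end), with start <= end, of the `size` bp
    lying upstream of `anchor`, skipping `offset` bp next to the anchor.

    Upstream means lower coordinates on strand 1 and higher coordinates
    on strand -1.
    """
    if strand not in (1, -1):
        raise ValueError("strand must be 1 or -1")
    if size < 1 or offset < 0:
        raise ValueError("size must be >= 1 and offset must be >= 0")
    if strand == 1:
        end = anchor - 1 - offset
        if end < 1:
            raise ValueError("window falls before the start of the sequence")
        start = max(end - size + 1, 1)  # clip at the first base
    else:
        start = anchor + 1 + offset
        end = start + size - 1
    return start, end


# 500 bp immediately upstream of the 5' UTR of ENSOANT00000041968
# (anchor 2302, strand -1).
print(upstream_window(2302, -1, 500))              # (2303, 2802)
# A further 500 bp window beyond that one, e.g. as a control region.
print(upstream_window(2302, -1, 500, offset=500))  # (2803, 3302)
# Hypothetical forward-strand feature starting at 1000.
print(upstream_window(1000, 1, 500))               # (500, 999)
```

On the -strand the fetched sequence still has to be reverse-complemented
to read 5' to 3' relative to the gene.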

I hope this explains the different variables available and their usage.

Cheers,
Amonida

--
Amonida Zadissa
Ensembl Production
EMBL-EBI

On 29/10/2015 12:36, Amonida Zadissa wrote:

> Hi Anders and Arek,
>
> We are currently looking into this issue; I was just about to send an
> email saying so.
>
> Best regards,
> Amonida