[BioMart Users] General query

[BioMart Users] General query

Jaime Tovar
Hello,

I'm trying the new BioMart interface and I have a couple of comments.

I'm trying to download protein sequences for Homo sapiens genes (GRCh37.p3).

In the results I find something like this for multiple genes:

>ENSG00000000003|ENSP00000362111
MASPSRRLQTKPVITCFKSVLLIYTFIFWITGVILLAVGIWGKVSLENYFSLLNEKATNV
PFVLIATGTVIILLGTFGCFATCRASAWMLKLYAMFLTLVFLVELVAAIVGFVFRHEIKN
SFKNNYEKALKQYNSTGDYRSHAVDKIQNTLHCCGVTDYRDWTDTNYYSEKGFPKSCCKL
EDCTPQRDADKVNNEGCFIKVMTIIESEMGVVAGISFGVACFQLIGIFLAYCLSRAITNN
QYEIV*

>ENSG00000000003|ENSP00000409517
MASPSRRLQTKPVITCFKSVLLIYTFIFWITGVILLAVGIWGKVSLENYFSLLNEKATFG
CFATCRASAWMLKLYAMFLTLVFLVELVAAIVGFVFRHEIKNSFKNNYEKALKQYNSTGD
YRSHAVDKIQNTLHCCGVTDYRDWTDTNYYSEKGFPKSCCKLEDCTPQRDADKVNNEGCF
IKVMTIIESEMGVVAGISFGVACFQLIGIFLAYCLSRAITNNQYEIV*

>ENSG00000000003|
Sequence unavailable

>ENSG00000000003|
Sequence unavailable

This makes me think there is a problem when joining the tables, so that some empty protein IDs produce extra rows in the result.

Also, I would like to know whether there is still a need to use the "unique" option from the previous version to show only one result per gene ID|protein ID pair.

These queries also tend to take quite long. As the application works now, I can download the file as txt, but it is extremely slow, and I suspect that for some files a timeout could leave a truncated download that looks fine, with no way to prove otherwise from the data alone. Being able to get the file gz-compressed, either directly from the service or through a link sent by email, would be highly desirable.
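To illustrate the concern about silent truncation, here is a minimal Python sketch of two checks. The first is only a heuristic and assumes peptide output terminated by `*`, as in the records above; a download cut exactly at a record boundary would still pass. The second relies on the CRC-32 and length trailer that the gzip format carries, which is why a compressed download makes truncation detectable. The function names are made up for illustration and are not part of BioMart.

```python
import gzip
import zlib


def looks_complete(fasta_text: str) -> bool:
    """Heuristic check: the final non-empty line should close a record.

    Assumes each peptide ends with '*' (as in the sample output) or that
    the record body is the literal 'Sequence unavailable'.
    """
    lines = [ln.strip() for ln in fasta_text.splitlines() if ln.strip()]
    if not lines:
        return False
    last = lines[-1]
    return last.endswith("*") or last == "Sequence unavailable"


def gzip_is_intact(data: bytes) -> bool:
    """Return True only if the bytes decompress as a complete gzip stream."""
    if not data:
        return False
    try:
        gzip.decompress(data)
        return True
    except (EOFError, OSError, zlib.error):
        # Truncated stream, bad header, failed CRC, or corrupt deflate data.
        return False
```

A plain-text download can only be checked with the heuristic, whereas a `.gz` file that fails `gzip_is_intact` is known to be damaged or incomplete.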

Best regards,

J

_______________________________________________
Users mailing list
[hidden email]
https://lists.biomart.org/mailman/listinfo/users
Re: [BioMart Users] General query

Arek Kasprzyk
Hi Jaime,

thanks for your input.

Re: 'repeated rows'
I'd suggest that you include the transcript ID in your results as well. Seemingly repeated genes are usually a sign of alternative splice variants (different transcripts). It is also worth including the gene type, as there may be some genes that have transcripts but no proteins. So there is a chance that what you are seeing is actually genuinely unique. Someone from Ensembl could probably provide better insight. (cc'ing Rhoda)
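As a rough way to inspect such output, the Python sketch below splits a downloaded FASTA file into records that carry a protein ID and those that do not. It assumes the `gene_id|protein_id` header layout and the literal `Sequence unavailable` body shown in the original message; the function names are hypothetical.

```python
from typing import Iterator, List, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, body) pairs; multi-line sequences are joined."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_protein(text: str) -> Tuple[List[Tuple[str, str, str]], List[str]]:
    """Separate records with a protein ID from those without one."""
    with_protein, without_protein = [], []
    for header, seq in read_fasta(text):
        gene_id, _, protein_id = header.partition("|")
        if protein_id and seq != "Sequence unavailable":
            with_protein.append((gene_id, protein_id, seq))
        else:
            # Likely a transcript with no translation (an assumption).
            without_protein.append(gene_id)
    return with_protein, without_protein
```

Run on the ENSG00000000003 records quoted above, this would place the two ENSP entries in the first list and the two `Sequence unavailable` entries in the second.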

Re: speed.
I remember that the first implementation by Jonathan was very slow, but Jack optimized it quite a bit, and I remember Junjun testing it and being satisfied with the speed, so I am not sure whether this is a server issue now or something has changed since then. (cc'ing Junjun so he can comment)

Re: gz option.
This 0.8 service is still in development. You are right: we'll definitely need the 'gz' by email option as well.


a


On Tue, Nov 22, 2011 at 11:11 AM, Jaime Tovar <[hidden email]> wrote:
[...]

Re: [BioMart Users] General query

Rhoda Kinsella
Hi Jaime
If you take a look at the gene on the Ensembl website (see here:
http://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000000003;r=X:99883667-99894988)
you will see that this gene has four transcripts: two are protein coding
and two are processed transcripts. I agree with Arek that you should
filter by protein coding and add the Ensembl transcript ID to your list
of attributes.
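For anyone scripting this, a query along these lines could be built as in the Python sketch below. It follows the older (0.7-style) BioMart XML query layout; whether the 0.8 service accepts it unchanged is an assumption, as are the dataset, filter, and attribute names (`hsapiens_gene_ensembl`, `biotype`, `ensembl_transcript_id`, and so on), which should be checked against the mart actually in use. The code only builds the XML string and does not contact any server.

```python
import xml.etree.ElementTree as ET


def build_query(dataset: str = "hsapiens_gene_ensembl") -> str:
    """Build a BioMart-style XML query for protein-coding peptide sequences."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "FASTA",
        "header": "0",
        "uniqueRows": "1",  # ask for de-duplicated rows
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # Restrict to protein-coding genes, as suggested above (assumed filter name).
    ET.SubElement(ds, "Filter", {"name": "biotype", "value": "protein_coding"})
    # Include the transcript ID so each row can be told apart (assumed names).
    for attr in ("ensembl_gene_id", "ensembl_transcript_id",
                 "ensembl_peptide_id", "peptide"):
        ET.SubElement(ds, "Attribute", {"name": attr})
    return ET.tostring(query, encoding="unicode")


if __name__ == "__main__":
    print(build_query())
```

With the transcript ID in the header, rows that previously looked like duplicates of the same gene become distinguishable.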
Hope that helps
Regards
Rhoda



Re: [BioMart Users] General query

Junjun Zhang
In reply to this post by Arek Kasprzyk
Hi Jaime,
The new sequence retrieval tool is slower overall than the old one, so getting sequences for the whole genome will take some time. As I suggested to another user, for genome-wide queries it is better to retrieve the data one chromosome per query.
Running queries in the background on the server would be a better option, although this is not supported yet.
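A per-chromosome loop could be sketched in Python as follows. The `fetch` callable is a placeholder for whatever actually submits the query, and the `chromosome_name` and `biotype` filter names are assumptions about the Ensembl mart rather than anything confirmed for the 0.8 service.

```python
from typing import Callable, Dict, Iterable

# Human chromosome names as commonly used by Ensembl (assumed here).
HUMAN_CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y", "MT"]


def fetch_by_chromosome(
    fetch: Callable[[Dict[str, str]], str],
    chromosomes: Iterable[str] = HUMAN_CHROMOSOMES,
) -> Dict[str, str]:
    """Run one query per chromosome and collect the results by name."""
    results = {}
    for chrom in chromosomes:
        filters = {"chromosome_name": chrom, "biotype": "protein_coding"}
        results[chrom] = fetch(filters)
    return results


if __name__ == "__main__":
    # Stand-in fetch function that only echoes the chromosome it was given.
    demo = fetch_by_chromosome(lambda f: "chr" + f["chromosome_name"], ["1", "X"])
    print(demo)
```

Splitting the work this way also means that a failed or truncated download only needs to be repeated for the affected chromosome.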
Hope this helps,
Junjun

 
From: Arek Kasprzyk [mailto:[hidden email]]
Sent: Tuesday, November 22, 2011 12:45 PM
To: Jaime Tovar <[hidden email]>
Cc: [hidden email] <[hidden email]>; Rhoda Kinsella <[hidden email]>; Junjun Zhang
Subject: Re: [BioMart Users] General query
 
[...]