[biomart-users] are empty lines in biomart results normal?

[biomart-users] are empty lines in biomart results normal?

wbazant
I am querying Ensembl's "Mouse genes (GRCm38.p5)" dataset. I had assumed that an empty line would appear only when a gene has no mappings at all, but in fact I sometimes get an odd empty line for a gene that does have mappings.

Do these empty lines have any meaning? Also, is this typical, or is it a property of the dataset that I should report?


<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query  virtualSchemaName = "default" formatter = "TSV" header = "0" uniqueRows = "0" count = "" datasetConfigVersion = "0.6" >
			
	<Dataset name = "mmusculus_gene_ensembl" interface = "default" >
		<Filter name = "ensembl_gene_id" value = "ENSMUSG00000000085"/>
		<Attribute name = "ensembl_gene_id" />
		<Attribute name = "go_id" />
	</Dataset>
</Query>

(results: snip)
Gene stable ID	GO term accession
ENSMUSG00000000085	GO:0016458
ENSMUSG00000000085	
ENSMUSG00000000085	GO:0045892
ENSMUSG00000000085	GO:0010369
ENSMUSG00000000085	GO:0016458

--
You received this message because you are subscribed to the Google Groups "biomart-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
Visit this group at https://groups.google.com/group/biomart-users.
For more options, visit https://groups.google.com/d/optout.

Re: [biomart-users] are empty lines in biomart results normal?

Thomas Maurel
Hello,

This is expected behaviour of the BioMart software. In the Ensembl database the GO xrefs are attached to transcripts, so if you ask only for the gene in your mart query, you get the GO results of each transcript in turn. ENSMUSG00000000085 therefore appears multiple times, with an empty GO column for every transcript that has no GO term attached. You can see this if you add the “Ensembl Transcript ID” to your query:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query  virtualSchemaName = "default" formatter = "TSV" header = "0" uniqueRows = "0" count = "" datasetConfigVersion = "0.6" >
			
	<Dataset name = "mmusculus_gene_ensembl" interface = "default" >
		<Filter name = "ensembl_gene_id" value = "ENSMUSG00000000085"/>
		<Attribute name = "ensembl_gene_id" />
		<Attribute name = "ensembl_transcript_id" />
		<Attribute name = "go_id" />
	</Dataset>
</Query>

You will see that, for example, ENSMUST00000136801 doesn’t have a GO term accession, so its “GO term accession” column is blank.

Gene stable ID	Transcript stable ID	GO term accession
ENSMUSG00000000085	ENSMUST00000000087	GO:0005634
ENSMUSG00000000085	ENSMUST00000000087	GO:0006355
ENSMUSG00000000085	ENSMUST00000000087	GO:0005634
ENSMUSG00000000085	ENSMUST00000000087	GO:0005634
ENSMUSG00000000085	ENSMUST00000000087	GO:0007275
ENSMUSG00000000085	ENSMUST00000000087	GO:0007283
ENSMUSG00000000085	ENSMUST00000000087	GO:0007283
ENSMUSG00000000085	ENSMUST00000000087	GO:0006351
ENSMUSG00000000085	ENSMUST00000000087	GO:0006338
ENSMUSG00000000085	ENSMUST00000000087	GO:0009952
ENSMUSG00000000085	ENSMUST00000000087	GO:0005515
ENSMUSG00000000085	ENSMUST00000000087	GO:0045892
ENSMUSG00000000085	ENSMUST00000000087	GO:0045892
ENSMUSG00000000085	ENSMUST00000000087	GO:0010369
ENSMUSG00000000085	ENSMUST00000000087	GO:0016458
ENSMUSG00000000085	ENSMUST00000106301	GO:0005634
ENSMUSG00000000085	ENSMUST00000106301	GO:0006355
ENSMUSG00000000085	ENSMUST00000106301	GO:0005634
ENSMUSG00000000085	ENSMUST00000106301	GO:0005634
ENSMUSG00000000085	ENSMUST00000106301	GO:0007275
ENSMUSG00000000085	ENSMUST00000106301	GO:0007283
ENSMUSG00000000085	ENSMUST00000106301	GO:0007283
ENSMUSG00000000085	ENSMUST00000106301	GO:0006351
ENSMUSG00000000085	ENSMUST00000106301	GO:0006338
ENSMUSG00000000085	ENSMUST00000106301	GO:0009952
ENSMUSG00000000085	ENSMUST00000106301	GO:0005515
ENSMUSG00000000085	ENSMUST00000106301	GO:0045892
ENSMUSG00000000085	ENSMUST00000106301	GO:0045892
ENSMUSG00000000085	ENSMUST00000106301	GO:0010369
ENSMUSG00000000085	ENSMUST00000106301	GO:0016458
ENSMUSG00000000085	ENSMUST00000136801	
ENSMUSG00000000085	ENSMUST00000134375	
ENSMUSG00000000085	ENSMUST00000144862	
ENSMUSG00000000085	ENSMUST00000144555	
ENSMUSG00000000085	ENSMUST00000153099	
ENSMUSG00000000085	ENSMUST00000132116	
ENSMUSG00000000085	ENSMUST00000151345	
ENSMUSG00000000085	ENSMUST00000122860	
ENSMUSG00000000085	ENSMUST00000133290	
ENSMUSG00000000085	ENSMUST00000133169	

This is the default behaviour of MySQL, which the BioMart software uses in the background.
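
As a rough illustration of that join behaviour, the sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL and assumes the mart query is built with an outer (LEFT) join. The two tables and their column names are invented for the example and are not the real mart schema. The point is only that a LEFT JOIN keeps a transcript that has no GO xref and returns NULL for its GO column, which a TSV formatter would print as an empty cell.

```python
import sqlite3

# Hypothetical miniature schema: sqlite3 stands in for MySQL, and the table
# and column names are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transcript (gene_id TEXT, transcript_id TEXT);
    CREATE TABLE go_xref (transcript_id TEXT, go_id TEXT);
    INSERT INTO transcript VALUES
        ('ENSMUSG00000000085', 'ENSMUST00000000087'),
        ('ENSMUSG00000000085', 'ENSMUST00000136801');
    INSERT INTO go_xref VALUES
        ('ENSMUST00000000087', 'GO:0045892'),
        ('ENSMUST00000000087', 'GO:0010369');
""")


def gene_go_rows(connection):
    """Return (gene_id, go_id) rows, keeping transcripts that have no GO term."""
    cursor = connection.execute("""
        SELECT t.gene_id, g.go_id
        FROM transcript AS t
        LEFT JOIN go_xref AS g ON g.transcript_id = t.transcript_id
        ORDER BY t.transcript_id, g.go_id
    """)
    # NULL comes back as None; here it is rendered as an empty string,
    # the way an empty TSV cell would look.
    return [(gene, go or "") for gene, go in cursor.fetchall()]


rows = gene_go_rows(conn)
for gene, go in rows:
    print(f"{gene}\t{go}")
```

The last row printed has an empty GO column, like the blank lines in the results above. Replacing LEFT JOIN with JOIN in this toy query drops that row.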
If you only want back data attached to a GO term, you can add the “with GO ID(s)” filter set to “only”:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query  virtualSchemaName = "default" formatter = "TSV" header = "0" uniqueRows = "0" count = "" datasetConfigVersion = "0.6" >
			
	<Dataset name = "mmusculus_gene_ensembl" interface = "default" >
		<Filter name = "ensembl_gene_id" value = "ENSMUSG00000000085"/>
		<Filter name = "with_go" excluded = "0"/>
		<Attribute name = "ensembl_gene_id" />
		<Attribute name = "ensembl_transcript_id" />
		<Attribute name = "go_id" />
	</Dataset>
</Query>
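
If the query itself cannot be changed, the blank rows can also be dropped after download. This is a minimal sketch, assuming the TSV output includes a header row (header = "1", unlike the queries here) with the column names shown in the results above. rows_with_go is a hypothetical helper, not part of BioMart.

```python
import csv
import io


def rows_with_go(tsv_text, go_column="GO term accession"):
    """Yield rows (as dicts) whose GO column is non-empty."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # A missing or blank cell means the transcript has no GO term.
        if (row.get(go_column) or "").strip():
            yield row


# Two sample rows copied from the results in this thread.
sample = (
    "Gene stable ID\tTranscript stable ID\tGO term accession\n"
    "ENSMUSG00000000085\tENSMUST00000000087\tGO:0016458\n"
    "ENSMUSG00000000085\tENSMUST00000136801\t\n"
)
kept = list(rows_with_go(sample))
for row in kept:
    print(row["Transcript stable ID"], row["GO term accession"])
```

Filtering on the server with “with GO ID(s)” is still preferable when possible, since it avoids transferring rows that will be discarded.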

Hope this helps,
Kind Regards,
Thomas

On 22 Sep 2017, at 08:19, [hidden email] wrote:

--
Thomas Maurel
Bioinformatician - Ensembl Production Team
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom