Bgee Ingest Guide
Source Information
InfoRes ID: infores:bgee
Description: Bgee is a database for retrieval and comparison of gene expression patterns across multiple animal species. It provides an intuitive answer to the question 'where is a gene expressed?' and supports research in cancer and agriculture, as well as evolutionary biology.
Citations: - Frederic B Bastian, Julien Roux, Anne Niknejad, Aurélie Comte, Sara S Fonseca Costa, Tarcisio Mendes de Farias, Sébastien Moretti, Gilles Parmentier, Valentine Rech de Laval, Marta Rosikiewicz, Julien Wollbrett, Amina Echchiki, Angélique Escoriza, Walid H Gharib, Mar Gonzales-Porta, Yohan Jarosz, Balazs Laurenczy, Philippe Moret, Emilie Person, Patrick Roelli, Komal Sanjeev, Mathieu Seppey, Marc Robinson-Rechavi, The Bgee suite: integrated curated expression atlas and comparative transcriptomics in animals, Nucleic Acids Research, Volume 49, Issue D1, 8 January 2021, Pages D831–D847, https://doi.org/10.1093/nar/gkaa793
Data Access Locations: - None
Data Provision Mechanisms: file_download
Data Formats: tsv.gz
Data Versioning and Releases: v15.0
Additional Notes: None
Ingest Information
Ingest Categories: primary_knowledge_provider
Utility: Bgee indicates where genes are expressed, in terms of specific cell types and anatomical structures, across many organisms.
Scope: Gene expression in anatomical entities and cells identified by UBERON and CL CURIEs. Bgee provides several metrics (call quality, FDR, expression score, expression rank) to help researchers assess the reliability of each call.
Relevant Files
| File Name | Location | Description |
|---|---|---|
| Homo_sapiens_expr_simple.tsv.gz | https://bgee.org/ftp/bgee_v15_0/download/calls/expr_calls | Homo sapiens (human) gene expression information. |
| Rattus_norvegicus_expr_simple.tsv.gz | https://bgee.org/ftp/bgee_v15_0/download/calls/expr_calls | Rattus norvegicus (brown rat) gene expression information. |
| Mus_musculus_expr_simple.tsv.gz | https://bgee.org/ftp/bgee_v15_0/download/calls/expr_calls | Mus musculus (house mouse) gene expression information. |
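The files listed above can be read with the standard library alone. The following is a minimal sketch, not the ingest's actual code: it assumes the full download URL is the Location value joined to the File Name with a slash, and the helper names `file_url` and `read_calls` are hypothetical. The demonstration rows are made up for illustration.

```python
import csv
import gzip
import io

# Base directory taken from the Location column of the table above.
BASE_URL = "https://bgee.org/ftp/bgee_v15_0/download/calls/expr_calls"
SPECIES = ["Homo_sapiens", "Rattus_norvegicus", "Mus_musculus"]


def file_url(species: str) -> str:
    """Assumption: the file sits directly under the Location directory."""
    return f"{BASE_URL}/{species}_expr_simple.tsv.gz"


def read_calls(binary_stream):
    """Yield one dict per data row from a gzip-compressed TSV stream."""
    with gzip.open(binary_stream, mode="rt", encoding="utf-8", newline="") as text:
        yield from csv.DictReader(text, delimiter="\t")


# In-memory demonstration with made-up rows (no network access needed).
sample = gzip.compress(b"Gene ID\tExpression\nENSG00000000003\tpresent\n")
rows = list(read_calls(io.BytesIO(sample)))
print(rows)
```

For a real run, `read_calls` could be given a file opened in binary mode, or the response object returned by `urllib.request.urlopen(file_url("Homo_sapiens"))`, since both expose a binary `read` method.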
Included Content
| File Name | Included Records | Fields Used |
|---|---|---|
| Homo_sapiens_expr_simple.tsv.gz | None | GeneID, Genename, AnatomicalentityID, Anatomicalentityname, Expression, Callquality, FDR, Expressionscore, Expressionrank |
| Rattus_norvegicus_expr_simple.tsv.gz | None | GeneID, Genename, AnatomicalentityID, Anatomicalentityname, Expression, Callquality, FDR, Expressionscore, Expressionrank |
| Mus_musculus_expr_simple.tsv.gz | None | GeneID, Genename, AnatomicalentityID, Anatomicalentityname, Expression, Callquality, FDR, Expressionscore, Expressionrank |
Filtered Content
| File Name | Filtered Records | Rationale |
|---|---|---|
| Homo_sapiens_expr_simple.tsv.gz | 'Expression rank' >= 10,000 OR 'Expression score' <= 70 OR 'FDR' >= 0.05 OR 'Expression' == "absent" | We are specifically targeting gene expression calls which are strongly signaled and highly probable. Any calls which have a low score or are potentially spurious are filtered. |
| Rattus_norvegicus_expr_simple.tsv.gz | 'Expression rank' >= 10,000 OR 'Expression score' <= 70 OR 'FDR' >= 0.05 OR 'Expression' == "absent" | We are specifically targeting gene expression calls which are strongly signaled and highly probable. Any calls which have a low score or are potentially spurious are filtered. |
| Mus_musculus_expr_simple.tsv.gz | 'Expression rank' >= 10,000 OR 'Expression score' <= 70 OR 'FDR' >= 0.05 OR 'Expression' == "absent" | We are specifically targeting gene expression calls which are strongly signaled and highly probable. Any calls which have a low score or are potentially spurious are filtered. |
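The filter rule in the table above can be sketched as a row predicate. This is an illustration under stated assumptions rather than the ingest's actual implementation: rows are assumed to be dicts keyed by the column names quoted in the rule ('Expression rank', 'Expression score', 'FDR', 'Expression'), numeric fields are assumed to arrive as strings, and the function names `is_filtered` and `keep_calls` are hypothetical.

```python
def is_filtered(row: dict) -> bool:
    """Return True when a call matches any condition of the filter rule."""
    # Assumption: numeric columns are strings; tolerate thousands separators.
    rank = float(row["Expression rank"].replace(",", ""))
    score = float(row["Expression score"])
    fdr = float(row["FDR"])
    return (
        rank >= 10_000
        or score <= 70
        or fdr >= 0.05
        or row["Expression"] == "absent"
    )


def keep_calls(rows):
    """Keep only the calls that pass every threshold."""
    return [row for row in rows if not is_filtered(row)]
```

Because the conditions are joined with OR, a call is retained only when all of the following hold: rank below 10,000, score above 70, FDR below 0.05, and an expression value other than "absent".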
Future Content Considerations
edge_content: Filter some organisms that are already filtered by Monarch. More generally, explore the prior Monarch ingest of this source to see what we can learn or borrow, and reach out to F. Bastian to get a better understanding of the source and its utility for Translator use cases. - Relevant files: Caenorhabditis_elegans_expr_simple.tsv.gz, Danio_rerio_expr_simple.tsv.gz, Drosophila_melanogaster_expr_simple.tsv.gz, Mus_musculus_expr_simple.tsv.gz, Rattus_norvegicus_expr_simple.tsv.gz, Xenopus_laevis_expr_simple.tsv.gz
edge_content: Add more model organisms: Danio rerio (zebrafish), Xenopus laevis (African clawed frog), Drosophila melanogaster (fruit fly), Caenorhabditis elegans (roundworm), Canis lupus familiaris (dog), Bos taurus (cattle), Sus scrofa (wild boar), Gallus gallus (red junglefowl). - Relevant files: Danio_rerio_expr_simple.tsv.gz, Xenopus_laevis_expr_simple.tsv.gz, Drosophila_melanogaster_expr_simple.tsv.gz, Caenorhabditis_elegans_expr_simple.tsv.gz, Canis_lupus_familiaris_expr_simple.tsv.gz, Bos_taurus_expr_simple.tsv.gz, Sus_scrofa_expr_simple.tsv.gz, Gallus_gallus_expr_simple.tsv.gz
edge_content: Add qualifying developmental stage information. - Relevant files: all
edge_content: Add edges describing over/under expression of genes in specific tissues, along with information indicating when a gene is ubiquitously expressed. - Relevant files: all
edge_properties: Incorporate Expressionrank, Expressionscore, and FDR as edge properties. - Relevant files: all
edge_content: Learn about meaning/utility of scores, and consider applying filters to focus on subset that is most meaningful/useful for Translator. - Relevant files: all
Target Information
Target InfoRes ID: infores:translator-bgee-kgx
Edge Types
| Subject Categories | Predicate | Object Categories | Knowledge Level | Agent Type | UI Explanation |
|---|---|---|---|---|---|
| biolink:Gene | | biolink:AnatomicalEntity, biolink:Cell, biolink:GrossAnatomicalStructure | knowledge_assertion | automated_agent | Bgee data provide Gene–Tissue/Stage expression assertions generated by uniformly reprocessing healthy RNA-Seq, microarray, EST, and in situ hybridization datasets, applying statistical thresholds to call expression presence/absence within experiments, and integrating evidence across studies and data types - followed by ontology-based propagation - to produce a single consensus expression call with an associated confidence level. |
Node Types
| Node Category | Source Identifier Types | Additional Notes |
|---|---|---|
| biolink:Gene | ENSEMBL | |
| biolink:Cell | CL | |
| biolink:AnatomicalEntity | UBERON | |
| biolink:GrossAnatomicalStructure | UBERON | |
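The identifier-to-category relationship in the table above can be expressed as a prefix lookup. The sketch below is illustrative only: the name `candidate_categories` is hypothetical, and because UBERON appears under two categories, the prefix alone cannot choose between them, so the function returns every candidate. How the ingest picks a single category is not stated in this guide.

```python
# Prefix-to-category mapping copied from the Node Types table.
CATEGORIES_BY_PREFIX = {
    "ENSEMBL": ["biolink:Gene"],
    "CL": ["biolink:Cell"],
    "UBERON": ["biolink:AnatomicalEntity", "biolink:GrossAnatomicalStructure"],
}


def candidate_categories(curie: str) -> list:
    """Return the node categories the table allows for a CURIE's prefix."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not local_id:
        raise ValueError(f"not a CURIE: {curie!r}")
    # Unknown prefixes yield an empty list rather than an error.
    return list(CATEGORIES_BY_PREFIX.get(prefix, []))
```

Returning a list keeps the UBERON ambiguity visible to the caller instead of silently picking one category.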
Provenance Information
Contributors: - Daniel Korn: code author - Kevin Schaper: code support - Evan Morris: code support - Sierra Moxon: code support - Matthew Brush: data modeling, domain expertise
Artifacts: - Ingest Survey: https://docs.google.com/spreadsheets/d/1bx4OSH1_HR69sKXIL1UBbelbUEx8X0b-gZTi8F81ypo/ - Ingest Ticket: https://github.com/NCATSTranslator/Data-Ingest-Coordination-Working-Group/issues/54