Search & lookup terms#
Entities and ontologies can be complex with many different identifiers.
Here we show Bionty’s lookup model for species, genes, proteins and cell markers. You’ll see how to:
- access the reference table via .df()
- look up an entity term via .lookup()
- search for an entity term via .search()
import bionty as bt
.fields: fields of an ontology reference#
gene_bt = bt.Gene()
gene_bt
Gene
Species: human
Source: ensembl, release-109
#terms: 75124
📖 Gene.df(): ontology reference table
🔎 Gene.lookup(): autocompletion of terms
🎯 Gene.search(): free text search of terms
🧐 Gene.inspect(): check if identifiers are mappable
👽 Gene.map_synonyms(): map synonyms to standardized names
🔗 Gene.ontology: Pronto.Ontology object
gene_bt.fields
{'biotype',
'description',
'ensembl_gene_id',
'hgnc_id',
'ncbi_gene_id',
'symbol',
'synonyms'}
Fields can be accessed as attributes for autocompletion (you can pass them to the field parameter of any bionty function instead of strings):
gene_bt.ncbi_gene_id
ncbi_gene_id
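To illustrate why attribute access helps, here is a minimal sketch of an object that exposes a fixed set of field names as attributes. The class name FieldNames is hypothetical and this is not Bionty's implementation; it only mirrors the behavior shown above, where the attribute evaluates to the field name itself.

```python
class FieldNames:
    """Hypothetical helper: expose known field names as attributes."""

    def __init__(self, names):
        self._names = frozenset(names)

    def __getattr__(self, name):
        # __getattr__ only runs when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        if name in self._names:
            return name  # the attribute evaluates to the field name string
        raise AttributeError(f"unknown field: {name!r}")

    def __dir__(self):
        # Lets interactive shells offer the field names for tab completion.
        return sorted(self._names)


fields = FieldNames({"symbol", "ncbi_gene_id", "hgnc_id"})
print(fields.ncbi_gene_id)  # ncbi_gene_id
```

Compared with a plain string, a misspelled attribute such as fields.ncbi_id fails immediately with an AttributeError instead of being passed along silently.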
.df(): reference table#
Data scientists love DataFrames, and every entity has a reference table containing all the fields.
df = gene_bt.df()
df.head()
ensembl_gene_id | symbol | ncbi_gene_id | hgnc_id | biotype | description | synonyms | |
---|---|---|---|---|---|---|---|
0 | ENSG00000000003 | TSPAN6 | 7105 | HGNC:11858 | protein_coding | tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858] | TSPAN-6|T245|TM4SF6 |
1 | ENSG00000000005 | TNMD | 64102 | HGNC:17757 | protein_coding | tenomodulin [Source:HGNC Symbol;Acc:HGNC:17757] | tendin|ChM1L|TEM|myodulin|BRICD4 |
2 | ENSG00000000419 | DPM1 | 8813 | HGNC:3005 | protein_coding | dolichyl-phosphate mannosyltransferase subunit... | CDGIE|MPDS |
3 | ENSG00000000457 | SCYL3 | 57147 | HGNC:19285 | protein_coding | SCY1 like pseudokinase 3 [Source:HGNC Symbol;A... | PACE1|PACE-1 |
4 | ENSG00000000460 | C1orf112 | 55732 | HGNC:25565 | protein_coding | chromosome 1 open reading frame 112 [Source:HG... | FLJ10706 |
To access the information for, say, several gene symbols at once, select the corresponding rows with pandas:
df.set_index("symbol").loc[["LMNA", "TCF7", "BRCA1"]]
symbol | ensembl_gene_id | ncbi_gene_id | hgnc_id | biotype | description | synonyms |
---|---|---|---|---|---|---|
LMNA | ENSG00000160789 | 4000 | HGNC:6636 | protein_coding | lamin A/C [Source:HGNC Symbol;Acc:HGNC:6636] | PRO1|LMNL1|MADA|CMD1A|HGPS|LMN1|LGMD1B |
TCF7 | ENSG00000081059 | 6932 | HGNC:11639 | protein_coding | transcription factor 7 [Source:HGNC Symbol;Acc... | TCF-1 |
BRCA1 | ENSG00000012048 | 672 | HGNC:1100 | protein_coding | BRCA1 DNA repair associated [Source:HGNC Symbo... | PPP1R53|RNF53|FANCS|BRCC1 |
.lookup(): Lookup terms and records with autocompletion#
Terms can be searched with auto-complete using a lookup object.
lookup = gene_bt.lookup()
A dot accessor is provided for normalized terms (lower case, containing only alphanumeric characters and underscores):
lookup.tcf7
Gene(ensembl_gene_id='ENSG00000081059', symbol='TCF7', ncbi_gene_id='6932', hgnc_id='HGNC:11639', biotype='protein_coding', description='transcription factor 7 [Source:HGNC Symbol;Acc:HGNC:11639]', synonyms='TCF-1')
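The normalization described above can be sketched with a regular expression. The helper name normalize_key is hypothetical, and the exact rules Bionty applies may differ (for example, for terms that start with a digit); this version only reproduces the two keys that appear in this section.

```python
import re


def normalize_key(term: str) -> str:
    """Hypothetical helper: lower-case a term and replace each run of
    characters other than letters, digits and underscores with "_"."""
    return re.sub(r"[^0-9a-z_]+", "_", term.lower())


print(normalize_key("TCF7"))        # tcf7
print(normalize_key("HGNC:10478"))  # hgnc_10478
```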
To look up the exact original strings, convert the lookup object to a dict and use the bracket [] accessor, which also supports autocompletion:
lookup_dict = lookup.dict()
lookup_dict["TCF7"]
Gene(ensembl_gene_id='ENSG00000081059', symbol='TCF7', ncbi_gene_id='6932', hgnc_id='HGNC:11639', biotype='protein_coding', description='transcription factor 7 [Source:HGNC Symbol;Acc:HGNC:11639]', synonyms='TCF-1')
By default, the name field is used to generate lookup keys (for genes, which have no name field, the example above shows that the symbol is used). You can specify another field to look up:
lookup = gene_bt.lookup(gene_bt.hgnc_id)
If multiple entries are matched, they are returned as a list:
lookup.hgnc_10478
[Gene(ensembl_gene_id='ENSG00000227322', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
Gene(ensembl_gene_id='ENSG00000206289', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
Gene(ensembl_gene_id='ENSG00000235712', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
Gene(ensembl_gene_id='ENSG00000204231', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
Gene(ensembl_gene_id='ENSG00000231321', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
Gene(ensembl_gene_id='ENSG00000228333', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP')]
lookup_dict = lookup.dict()
lookup_dict["HGNC:10478"]
[Gene(ensembl_gene_id='ENSG00000227322', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
Gene(ensembl_gene_id='ENSG00000206289', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
Gene(ensembl_gene_id='ENSG00000235712', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
Gene(ensembl_gene_id='ENSG00000204231', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
Gene(ensembl_gene_id='ENSG00000231321', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
Gene(ensembl_gene_id='ENSG00000228333', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP')]
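The behavior above, where one key can map to either a single record or a list of records, can be sketched with a plain dictionary. GeneRecord and build_lookup are hypothetical names, and the records carry only three of the fields shown in this section; this is an illustration of the idea, not Bionty's code.

```python
from collections import namedtuple

# Hypothetical record type holding a subset of the fields shown above.
GeneRecord = namedtuple("GeneRecord", ["ensembl_gene_id", "symbol", "hgnc_id"])

records = [
    GeneRecord("ENSG00000081059", "TCF7", "HGNC:11639"),
    GeneRecord("ENSG00000227322", "RXRB", "HGNC:10478"),
    GeneRecord("ENSG00000206289", "RXRB", "HGNC:10478"),
]


def build_lookup(records, field):
    """Group records by the value of `field`; keys with one match map to
    the record itself, keys with several matches map to a list."""
    grouped = {}
    for record in records:
        grouped.setdefault(getattr(record, field), []).append(record)
    return {key: hits[0] if len(hits) == 1 else hits
            for key, hits in grouped.items()}


by_hgnc = build_lookup(records, "hgnc_id")
print(by_hgnc["HGNC:11639"].symbol)   # TCF7
print(len(by_hgnc["HGNC:10478"]))     # 2
```

Because the return type depends on the number of matches, calling code that handles arbitrary keys should check whether it received a list before accessing record attributes.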
.search(): Search a term against a field#
celltype_bt = bt.CellType()
celltype_bt.search("cytotoxic T cells").head(2)
name | ontology_id | definition | synonyms | parents | __ratio__ |
---|---|---|---|---|---|
cytotoxic T cell | CL:0000910 | A Mature T Cell That Differentiated And Acquir... | cytotoxic T-lymphocyte|cytotoxic T lymphocyte|... | [CL:0000911] | 96.969697 |
Tc2 cell | CL:0000918 | A Cd8-Positive, Alpha-Beta Positive T Cell Exp... | Tc2 T lymphocyte|Tc2 T cell|Tc2 T-cell|Th2 CD8... | [CL:0000908] | 76.190476 |
By default, search also matches against each of the synonyms:
celltype_bt.search("P cell").head(2)
name | ontology_id | definition | synonyms | parents | __ratio__ |
---|---|---|---|---|---|
nodal myocyte | CL:0002072 | A Specialized Cardiac Myocyte In The Sinoatria... | cardiac pacemaker cell|myocytus nodalis|P cell | [CL:0002086] | 100.000000 |
pigmented ciliary epithelial cell | CL:0002303 | A Cell That Is Part Of Pigmented Ciliary Epith... | PE cell | [CL:0000529] | 92.307692 |
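To make the role of synonyms concrete, here is a minimal sketch of ranking by a similarity ratio. It uses difflib.SequenceMatcher from the standard library as a stand-in, which is an assumption: the matcher Bionty uses is not stated here and its scores may differ. The function names and the three-entry table (names and synonyms copied from the outputs in this section) are for illustration only.

```python
from difflib import SequenceMatcher

# (name, "|"-separated synonyms), copied from the tables in this section.
CELL_TYPES = [
    ("nodal myocyte", "cardiac pacemaker cell|myocytus nodalis|P cell"),
    ("pigmented ciliary epithelial cell", "PE cell"),
    ("PP cell", "type F enteroendocrine cell"),
]


def ratio(a: str, b: str) -> float:
    """Case-insensitive similarity on a 0-100 scale (difflib stand-in)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def search(query: str, use_synonyms: bool = True):
    """Score each entry by its best match among the name and, optionally,
    each synonym; return (name, score) pairs, best first."""
    scored = []
    for name, synonyms in CELL_TYPES:
        candidates = [name]
        if use_synonyms and synonyms:
            candidates += synonyms.split("|")
        scored.append((name, max(ratio(query, c) for c in candidates)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


print(search("P cell")[0])                         # ('nodal myocyte', 100.0)
print(search("P cell", use_synonyms=False)[0][0])  # PP cell
```

With synonyms enabled, "nodal myocyte" ranks first because one of its synonyms is an exact match; with names only, the closest name wins instead.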
You can turn off synonym matching with synonyms_field=None:
celltype_bt.search("P cell", synonyms_field=None).head(2)
name | ontology_id | definition | synonyms | parents | __ratio__ |
---|---|---|---|---|---|
PP cell | CL:0000696 | A Cell That Stores And Secretes Pancreatic Pol... | type F enteroendocrine cell | [CL:0000167, CL:0000164] | 92.307692 |
peg cell | CL:4033014 | A Small, Narrow, Peg-Shaped Epithelial Cell Wi... | FTE stem-like cell|fallopian tube epithelial s... | [CL:0000066] | 85.714286 |
Match against another field (default is “name”):
celltype_bt.search("CD8+ alpha beta T cells", field=celltype_bt.definition).head(2)
definition | ontology_id | name | synonyms | parents | __ratio__ |
---|---|---|---|---|---|
A T Cell That Expresses An Alpha-Beta T Cell Receptor Complex. | CL:0000789 | alpha-beta T cell | alpha-beta T lymphocyte|alpha-beta T-lymphocyt... | [CL:0000084] | 85.000000 |
A T Cell Expressing An Alpha-Beta T Cell Receptor And The Cd8 Coreceptor. | CL:0000625 | CD8-positive, alpha-beta T cell | CD8-positive, alpha-beta T-cell|CD8-positive, ... | [CL:0000791] | 81.481481 |
By default, search returns all results as a DataFrame ranked by matching ratio. Pass top_hit=True to return only the best match as a record:
celltype_bt.search("P cell", top_hit=True)
CellType(name='nodal myocyte', ontology_id='CL:0002072', definition='A Specialized Cardiac Myocyte In The Sinoatrial And Atrioventricular Nodes. The Cell Is Slender And Fusiform Confined To The Nodal Center, Circumferentially Arranged Around The Nodal Artery.', synonyms='cardiac pacemaker cell|myocytus nodalis|P cell', parents=array(['CL:0002086'], dtype=object))
Tied results are all returned as top hits:
celltype_bt.search("A cell", top_hit=True, synonyms_field=None)
[CellType(name='fat cell', ontology_id='CL:0000136', definition='A Fat-Storing Cell Found Mostly In The Abdominal Cavity And Subcutaneous Tissue Of Mammals. Fat Is Usually Stored In The Form Of Triglycerides.', synonyms='adipocyte|adipose cell', parents=array(['CL:0002320', 'CL:0000325'], dtype=object)),
CellType(name='cap cell', ontology_id='CL:0000676', definition=None, synonyms=None, parents=array(['CL:0000548', 'CL:0000378'], dtype=object))]
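The tie handling can be sketched as follows: compute a score per name, find the maximum, and return every name that reaches it. As before, difflib.SequenceMatcher is a stand-in for whatever matcher Bionty uses, and top_hit and the short name list are hypothetical.

```python
from difflib import SequenceMatcher

# A few cell type names taken from the outputs in this section.
NAMES = ["fat cell", "cap cell", "PP cell", "nodal myocyte"]


def ratio(a: str, b: str) -> float:
    """Case-insensitive similarity on a 0-100 scale (difflib stand-in)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def top_hit(query: str, names):
    """Return the best-scoring name, or a list of names if several tie."""
    scores = {name: ratio(query, name) for name in names}
    best = max(scores.values())
    hits = [name for name, score in scores.items() if score == best]
    return hits[0] if len(hits) == 1 else hits


print(top_hit("A cell", NAMES))  # ['fat cell', 'cap cell']
print(top_hit("P cell", NAMES))  # PP cell
```

As with the lookup dictionary, the result is either a single item or a list, so code that consumes top hits should handle both shapes.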