For data of any kind, from the ingredients list on a bar of chocolate to single-cell RNA-seq data, accurate annotation is indispensable. For biomedical data, however, annotation is complicated. A disease or a tissue can appear under different synonyms and acronyms across ontologies, so comparable datasets end up annotated differently. For instance, insulin-dependent diabetes mellitus may be recorded as type I diabetes, insulin-dependent diabetes, or even diabetes mellitus. This causes confusion and can also affect the quality of R&D efforts.
When searching a database for a disease, this ambiguity and lack of harmonization in how the disease or tissue is reported has a clear drawback:
When querying the datasets for a disease named ‘IBD’, the result should ideally include the datasets annotated with the diseases ‘inflammatory bowel diseases’, ‘inflammatory bowel diseases, Crohn's disease’, and ‘inflammatory bowel diseases 8’. However, this keyword expansion does not happen under the hood, so the query returns fewer valid hits.
Ontology-based recommendations provide more valid hits with less effort. Users can now call the ‘recommend’ function in Polly-python SQL queries to search for datasets annotated with terms related to the keyword passed to the function (discussed in the next section).
The ontology-based recommendation system fetches not only the datasets that match the keyword but also the datasets annotated with its hypernyms, hyponyms, synonyms, and acronyms.
For example, if the user queries the datasets for the disease ‘obesity’, the ontological recommendations would also search for the terms ‘obesity, abdominal’; ‘overweight’; ‘overnutrition’; ‘obesity, metabolically benign’; ‘obesity, morbid’; and ‘diabetes mellitus, obesity’.
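The expansion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `TOY_ONTOLOGY` and `expand_keyword` are hypothetical names, the data is hand-written from the ‘obesity’ example, and the grouping of each term under a relation type is a guess made for the sake of the example. It is not how Polly stores or resolves its ontology.

```python
# Toy ontology: hand-written, hypothetical data based on the 'obesity' example.
# The assignment of terms to relation types is an assumption for illustration.
TOY_ONTOLOGY = {
    "obesity": {
        "synonyms": [],   # left empty in this toy data
        "acronyms": [],   # left empty in this toy data
        "hypernyms": ["overweight", "overnutrition"],
        "hyponyms": [
            "obesity, abdominal",
            "obesity, morbid",
            "obesity, metabolically benign",
        ],
    },
}


def expand_keyword(keyword, ontology=TOY_ONTOLOGY):
    """Return the keyword followed by every related term found for it."""
    normalized = keyword.strip().lower()
    entry = ontology.get(normalized)
    if entry is None:
        # Unknown keyword: search for the keyword as typed, nothing more.
        return [keyword]
    terms = [normalized]
    for relation in ("synonyms", "acronyms", "hypernyms", "hyponyms"):
        terms.extend(entry.get(relation, []))
    # dict.fromkeys drops duplicates while preserving insertion order.
    return list(dict.fromkeys(terms))


if __name__ == "__main__":
    for term in expand_keyword("Obesity"):
        print(term)
```

A keyword that is missing from the toy ontology falls back to a single-term search, which mirrors a plain keyword query without recommendations.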
While curating the data from public sources on Polly, we use our proprietary NLP models to harmonize the reported disease as per MeSH ontology. This enables users to find their data of interest with relative ease.
The ontological recommendation system is integrated into Polly-python SQL queries. The user can call the following function to get recommendations for a keyword.
recommend(field_name, keyword, key)
The third argument, key, is either 'match' or 'related'. In the examples that follow, 'related' expands the keyword to its related terms, while 'match' expands a prefix to the terms that match it. The expansion happens implicitly, which reduces manual intervention.
Before implementation of the feature, users would query for a given tissue and disease, as shown below.
SELECT dataset_id, curated_disease, curated_tissue
FROM geo.datasets
WHERE CONTAINS(curated_disease,'Breast Neoplasms') AND
CONTAINS(curated_tissue,'breast')
This query fetches the datasets annotated with the given disease and tissue combination only.
After implementation of the 'recommend' feature, users can write the query as follows:
SELECT dataset_id, curated_disease, curated_tissue
FROM geo.datasets
WHERE CONTAINS(curated_disease, recommend('curated_disease', 'breast neoplasms', 'related'))
AND CONTAINS(curated_tissue, recommend('curated_tissue', 'breast', 'related'))
The recommend function calls in the query are parsed and expanded to the related terms of the given disease and tissue. Internally, the query is converted into the following SQL after expansion:
SELECT dataset_id, curated_disease, curated_tissue
FROM geo.datasets
WHERE (CONTAINS(curated_disease, 'Breast Neoplasms, Male')
OR CONTAINS(curated_disease, 'Unilateral Breast Neoplasms')
OR CONTAINS(curated_disease, 'Breast Neoplasms')
OR CONTAINS(curated_disease, 'Carcinoma, Ductal, Breast')
OR CONTAINS(curated_disease, 'Inflammatory Breast Neoplasms')
OR CONTAINS(curated_disease, 'Triple Negative Breast Neoplasms'))
AND (CONTAINS(curated_tissue, 'milk fat')
OR CONTAINS(curated_tissue, 'epithelium')
OR CONTAINS(curated_tissue, 'mammary gland')
OR CONTAINS(curated_tissue, 'breast epithelium')
OR CONTAINS(curated_tissue, 'breast')
OR CONTAINS(curated_tissue, 'nipple')
OR CONTAINS(curated_tissue, 'milk')
OR CONTAINS(curated_tissue, 'mammary duct')
OR CONTAINS(curated_tissue, 'thorax')
OR CONTAINS(curated_tissue, 'colostrum')
OR CONTAINS(curated_tissue, 'mammary epithelium'))
The advantage is that the result set now includes datasets curated with other related terms along with the earlier results. The new feature therefore widens the search space for querying datasets.
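To make the rewriting step concrete, here is a minimal sketch of how a `recommend(...)` call inside `CONTAINS` could be replaced by an OR-group of plain `CONTAINS` clauses. The names `RELATED_TERMS`, `RECOMMEND_CALL`, and `expand_query` are hypothetical, the lookup table is a shortened hand-written stand-in for the ontology, and the regular-expression approach is an assumption; the actual parsing inside Polly-python is not described here and may differ.

```python
import re

# Hypothetical stand-in for the ontology lookup (shortened term lists).
RELATED_TERMS = {
    ("curated_disease", "breast neoplasms"): [
        "Breast Neoplasms",
        "Breast Neoplasms, Male",
        "Triple Negative Breast Neoplasms",
    ],
    ("curated_tissue", "breast"): ["breast", "mammary gland"],
}

# Matches: CONTAINS(<column>, recommend('<field>', '<keyword>', '<key>'))
RECOMMEND_CALL = re.compile(
    r"CONTAINS\(\s*(\w+)\s*,\s*recommend\(\s*'([^']*)'\s*,\s*'([^']*)'\s*,"
    r"\s*'(match|related)'\s*\)\s*\)",
    re.IGNORECASE,
)


def expand_query(sql):
    """Replace each recommend() call with an OR-group of CONTAINS clauses."""

    def replace(match):
        column, field_name, keyword, _key = match.groups()
        # This sketch ignores the key and falls back to the keyword itself
        # when the lookup table has no entry for it.
        terms = RELATED_TERMS.get((field_name, keyword.lower()), [keyword])
        clauses = [
            "CONTAINS({}, '{}')".format(column, term.replace("'", "''"))
            for term in terms
        ]
        return "(" + " OR ".join(clauses) + ")"

    return RECOMMEND_CALL.sub(replace, sql)


if __name__ == "__main__":
    query = (
        "SELECT dataset_id FROM geo.datasets "
        "WHERE CONTAINS(curated_disease, "
        "recommend('curated_disease', 'breast neoplasms', 'related'))"
    )
    print(expand_query(query))
```

Wrapping each expanded group in parentheses keeps the OR-group from interfering with any surrounding AND conditions in the original query.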
MeSH provides functionality similar to the ontological recommendations on Polly. Users can query the MeSH API for the children and parents of a term using SPARQL queries. An example query that fetches the children of a term is shown below, where %s is a placeholder for the descriptor of the term to expand:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
PREFIX mesh: <http://id.nlm.nih.gov/mesh/>
PREFIX mesh2015: <http://id.nlm.nih.gov/mesh/2015/>
PREFIX mesh2016: <http://id.nlm.nih.gov/mesh/2016/>
PREFIX mesh2017: <http://id.nlm.nih.gov/mesh/2017/>
PREFIX mesh2018: <http://id.nlm.nih.gov/mesh/2018/>
PREFIX mesh2019: <http://id.nlm.nih.gov/mesh/2019/>
PREFIX mesh2020: <http://id.nlm.nih.gov/mesh/2020/>
PREFIX mesh2021: <http://id.nlm.nih.gov/mesh/2021/>
SELECT DISTINCT ?descriptor ?treeNum ?childTreeNum ?label
FROM <http://id.nlm.nih.gov/mesh>
WHERE {
%s meshv:treeNumber ?treeNum .
?childTreeNum meshv:parentTreeNumber+ ?treeNum .
?descriptor meshv:treeNumber ?childTreeNum .
?descriptor rdfs:label ?label .
}
ORDER BY ?childTreeNum
The overhead in using MeSH queries with Polly is that the MeSH API returns a JSON response. That response must be processed outside Polly-python to extract the child terms (hyponyms) of the term, and the user then has to list all the expanded terms explicitly in the query to widen the dataset search.
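The manual step described above might look like the following sketch. It assumes the response follows the standard W3C SPARQL JSON results layout (`results` → `bindings` → variable → `value`); `SAMPLE_RESPONSE` is a hand-written, trimmed example rather than real API output, and `child_labels` and `contains_clause` are hypothetical helper names.

```python
import json

# Hand-written sample in the W3C SPARQL JSON results layout (not real output).
SAMPLE_RESPONSE = """
{
  "head": {"vars": ["label"]},
  "results": {"bindings": [
    {"label": {"type": "literal", "xml:lang": "en",
               "value": "Breast Neoplasms, Male"}},
    {"label": {"type": "literal", "xml:lang": "en",
               "value": "Triple Negative Breast Neoplasms"}},
    {"label": {"type": "literal", "xml:lang": "en",
               "value": "Breast Neoplasms, Male"}}
  ]}
}
"""


def child_labels(response_text):
    """Extract the distinct ?label values from a SPARQL JSON response."""
    payload = json.loads(response_text)
    labels = []
    for binding in payload["results"]["bindings"]:
        label = binding.get("label", {}).get("value")
        if label and label not in labels:
            labels.append(label)
    return labels


def contains_clause(column, terms):
    """Build the OR-group the user would otherwise have to write by hand."""
    clauses = [
        "CONTAINS({}, '{}')".format(column, term.replace("'", "''"))
        for term in terms
    ]
    return "(" + " OR ".join(clauses) + ")"


if __name__ == "__main__":
    print(contains_clause("curated_disease", child_labels(SAMPLE_RESPONSE)))
```

Every term in the resulting clause has to be carried into the SQL query by the user, which is the manual effort the ‘recommend’ function is meant to remove.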
There are several reasons why users might want to use ontology recommendations over MeSH queries. Let’s explore a few of them:
Ontology Recommendations Require the Least Manual Intervention
Unlike MeSH queries that are used outside Polly-python, ontology recommendation functionality allows the expansion of terms within the queries and provides datasets annotated on expanded terms.
As mentioned, the user’s job is done once they call the ‘recommend’ function in a SQL query. Polly-python then parses the query, looks up the expansions of the keyword, and rewrites the user-written query to include the list of expanded terms.
MeSH Expands a Term with Its Child Terms in Ontology Mapping
Along with the child terms, ontology recommendations in Polly-python include the related terms of a keyword that users might be interested in.
MeSH SPARQL queries are complex and must state precisely what to fetch from the ontology tree. The feature implemented on Polly has the upper hand because it extends the dataset search to the children, parents, and other similar terms of the keyword given by the user.
Ontology Recommendation Provides Auto-complete Functionality
Another drawback of MeSH queries is that the user must know the descriptor ID of the term they want to expand. In contrast, Polly-python now provides auto-complete functionality: the user can pass the prefix of any disease or tissue to the ‘recommend’ function, and the keyword is implicitly expanded to the matching terms.
For instance, a query such as the one mentioned below is completely valid -
SELECT dataset_id, curated_disease, curated_tissue
FROM geo.datasets
WHERE CONTAINS(curated_disease, recommend('curated_disease', 'Liv', 'match'))
In this scenario, the ontology recommendation functionality would expand ‘Liv’ to all diseases whose names contain a word beginning with ‘Liv’. The expansion list might look like this:
‘Liver Diseases’ ; ‘Liver Neoplasms’ ; ‘Fatty Liver’ ; ‘Liver Failure’ ; ‘Adenoma, Liver Cell’
This feature is helpful when Polly users are not sure of the exact disease term they are interested in. The auto-complete feature can widen the search space substantially.
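The prefix expansion can be illustrated with a short sketch. Because the example list includes terms such as ‘Fatty Liver’, the sketch assumes that the prefix is matched against every word of a term and not only its first word; that rule is inferred from the example, and `DISEASE_VOCABULARY` and `match_prefix` are hypothetical names, not part of Polly-python.

```python
import re

# Hypothetical vocabulary; the first five terms come from the 'Liv' example.
DISEASE_VOCABULARY = [
    "Liver Diseases",
    "Liver Neoplasms",
    "Fatty Liver",
    "Liver Failure",
    "Adenoma, Liver Cell",
    "Breast Neoplasms",
]


def match_prefix(prefix, vocabulary=DISEASE_VOCABULARY):
    """Return the terms in which at least one word starts with the prefix."""
    needle = prefix.strip().lower()
    if not needle:
        return []  # an empty prefix would otherwise match every term
    matches = []
    for term in vocabulary:
        # Split on whitespace and commas so 'Adenoma, Liver Cell' yields
        # the words 'Adenoma', 'Liver', and 'Cell'.
        words = re.split(r"[\s,]+", term)
        if any(word.lower().startswith(needle) for word in words):
            matches.append(term)
    return matches


if __name__ == "__main__":
    for term in match_prefix("Liv"):
        print(term)
```

Matching is case-insensitive here, so ‘liv’ and ‘Liv’ give the same list; whether Polly treats case the same way is not stated in this article.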
With the ever-evolving language in the biomedical space, it becomes critical to simplify ontology mapping while maximizing the outcomes. We have seen that even though controlled vocabularies such as MeSH exist, they are not user-friendly.
Ontology recommendation on Polly is an efficient way to enlarge the search space of datasets. It integrates well with Polly-python SQL queries and provides better results with minimal manual effort. Even if users are unsure about the exact ontological terms, they can still query Polly-python using its auto-complete feature to find datasets of interest.