Enhancing Search Through Ontology-Driven Knowledge Graphs

Pawan Verma
November 18, 2022

A vast body of biomedical data is now publicly available from multiple sources. However, a major concern for the majority of users is the challenge of finding the right data for their research. Biomedical datasets are stored in various data repositories that fulfil different functions, so users may need to query several repositories to collect all the information they need. As a result, these biological datasets exist in fragments that do not “talk” to each other, which reduces their value.

The current biomedical data retrieval systems have some major flaws. We, at Elucidata, have identified a few below:

  1. Lack of cross-repository search
  2. Lack of semantic search

As a result, users spend more time searching for datasets in known repositories than doing actual science. The growing amount of heterogeneous data makes it impossible to know for sure where the data of interest will be.

A typical dataset search application is a search engine that returns datasets in response to a user query. Such engines are already in use, including Mendeley Data search (run by Elsevier) and Google Dataset Search.

The drawback of such query-driven search engines is that they rely on the user providing an accurate search phrase.

The main way to make search more user-friendly is to build a recommendation paradigm for search phrases, which improves data findability and reduces false negatives. Moreover, context-driven search is achieved by modelling the relationships between biomedical molecular entities, relationships that are often unknown to the user or not obvious when looking at datasets in silos.

In order to represent such relationships we utilise the capabilities of a network-based approach called a “Knowledge Graph” (KG).

Graph Representation of Biomedical Molecular Data

To represent linked biological entities, we use a schema whose structure follows from how the data is modelled. Most often, such models are defined by relationships that are known to exist between the given entities. For example, a Protein is a “product” of a Gene; this relationship holds regardless of how heterogeneous the data is.

However, not all relationships between biological entities are well defined or “universally” accepted. In several cases, the relationship between biological entities changes depending on how the data is modelled.

To model the “universal” relationships, we take the relationships between entities from existing ontologies (O).

What is an Ontology?

The term “ontology” was borrowed from philosophy by computer science to signify a machine-readable formalisation of a conceptualisation pertaining to a particular domain of knowledge. Simply put, an ontology is a digital object that can be interpreted by both humans and machines.

Below is an example of a linked view of seven molecular entities combined into a consolidated ontology. The relationships depicted between the entities are consistent with the universally accepted ontologies linked to each of the entities (details in the next section).

Figure 1: Neo4j Graph Schema for 7 biological entities. (drug, cell-line, cell-type, gene, disease, tissue, pathway)

Modelling Biological Ontologies in Neo4j

Neo4j is a graph-native database which performs fast, dynamic and transactional queries over graph data. The inherent structure of Neo4j is a label-property graph model.

The information is organised as nodes, relationships and properties.

Nodes are entities in the graph:

  • Nodes can be tagged with labels, representing their different roles in your domain.
  • Nodes can hold any number of key-value pairs, or properties (for example, a disease node can store its name and synonyms).
  • Node labels may also attach metadata (such as index or constraint information) to certain nodes.

Relationships provide directed, named connections between two node entities (e.g. cell-line sampled_from tissue).

  • Relationships always have a direction, a type, a start node, and an end node, and they can have properties, just like nodes.
  • Nodes can have any number or type of relationships without sacrificing performance.
  • Although relationships are always directed, they can be navigated efficiently in any direction.
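The node and relationship concepts above can be illustrated with a small in-memory sketch. This is a toy model for intuition only, not how Neo4j stores data; the class names (`Node`, `Relationship`, `PropertyGraph`) and the sample values are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A graph entity carrying labels and key-value properties."""
    node_id: int
    labels: set[str]
    properties: dict[str, object] = field(default_factory=dict)


@dataclass
class Relationship:
    """A directed, typed connection from `start` to `end`."""
    rel_type: str
    start: int
    end: int
    properties: dict[str, object] = field(default_factory=dict)


class PropertyGraph:
    """Toy label-property graph (hypothetical helper, not a Neo4j API)."""

    def __init__(self) -> None:
        self.nodes: dict[int, Node] = {}
        self.relationships: list[Relationship] = []

    def add_node(self, node_id, labels, **properties) -> Node:
        node = Node(node_id, set(labels), dict(properties))
        self.nodes[node_id] = node
        return node

    def relate(self, start, rel_type, end, **properties) -> Relationship:
        rel = Relationship(rel_type, start, end, dict(properties))
        self.relationships.append(rel)
        return rel

    def neighbours(self, node_id, direction="both"):
        """Yield (relationship type, neighbour) pairs.

        Relationships are stored with a direction, but can be followed
        outwards ("out"), inwards ("in") or either way ("both").
        """
        for rel in self.relationships:
            if direction in ("out", "both") and rel.start == node_id:
                yield rel.rel_type, self.nodes[rel.end]
            if direction in ("in", "both") and rel.end == node_id:
                yield rel.rel_type, self.nodes[rel.start]


# Illustrative data only: a cell line sampled from a tissue.
graph = PropertyGraph()
graph.add_node(1, ["cell_line"], name="HeLa")
graph.add_node(2, ["tissue"], name="cervix")
graph.relate(1, "sampled_from", 2)

for rel_type, node in graph.neighbours(2, direction="in"):
    print(rel_type, node.properties["name"])
```

The sketch scans every relationship on each lookup, which is fine for illustration but is not a statement about how a graph-native database performs traversals.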

Modelling graph data in Neo4j has several key advantages:

  1. It has its own query language, Cypher, which is similar to SQL but optimised for graphs.
  2. Its flexible schema allows rapid materialisation and the addition of new relationships as required.
  3. It enables traversals across a wide range of depth and breadth with moderate compute.
  4. It is offered as a managed service, Neo4j AuraDB, which is a fully managed cloud graph database.

A “native” graph database means that Neo4j implements a true graph model all the way down to the storage level; the data is not stored as a “graph abstraction” over another technology.

Overview of the Architecture

Figure 2: PollyGraph Architecture

Polly is home to over 1.6 million biomedical molecular datasets, standardised for both data and metadata. There are two major advantages to this effort. First, data standardisation using uniform file formats such as GCT, h5ad and VCF enables large-scale consumption of data for analysis. Second, metadata harmonisation lets users find data at scale without having to provide exact keywords for search.

To make search even more powerful for scientists, we introduce PollyGraph to perform semantic searches using several biological concepts.

The architecture comprises four main components: Source, Data Layer, Semantic Layer and Graph Layer.

1. Source: The entities and relationships that form a knowledge graph are available in repositories that maintain domain-specific ontologies. These sources provide ontologically linked data in .obo files, which can be expressed in the Web Ontology Language (OWL).
Certain sources map biological entities and store them in a tabular format but do not maintain an ontology. For such sources, we define rules that formalise the relationships between the entities, thereby making them machine-actionable.

2. Data Layer: There are two key processes in the data layer:
  • Import the entities and relationships from multiple sources and generate data models, or loadings. These loadings contain data that describes an entity using a unique identifier, together with the relationships between entities.
  • Store the data obtained from multiple heterogeneous data sources in a uniform database management system using a well-defined schema.

3. Semantic Layer: The semantic layer organises the mapped entities, stored in multiple tables in the DB store, into a flat file containing RDF triples using a pre-defined ontology. The biomedical ontology (O) used here forms a set of rules but does not contain the traditional graph data represented as a subject-predicate-object “triplet”. To generate these triplets, we define a set of mapping (M) assertions that relate the ontology to a data source.

In order to facilitate this, the Ontology-Based Data Access (OBDA) framework [1] is used, which allows querying arbitrary data sources using SPARQL. The ontology together with the mappings exposes the RDF graph.

4. Graph Layer: The materialised RDF triples cannot be imported directly into Neo4j, because the inherent graph data structure of Neo4j is not the same as RDF. RDF acts as a persistent store for the graph data, but it is not dynamic and is not meant to handle transactional workloads.

Neo4j, however, is a graph-native database that performs well with dynamic datasets and transactional workloads. Here, the graph data is defined as a label-property graph (LPG), where each class acts as a single node in the graph, defined by its properties.

To export our RDF-based graph data into Neo4j, we use Neosemantics (n10s), a plugin provided by Neo4j that enables the use of RDF in Neo4j.

Detailed View of the Semantic Layer

Figure 3: Semantic Layer

The Semantic Model comprises two main components: Ontology (O) and Mappings (M).

  1. Ontology (O): This component contains multiple biomedical ontologies consolidated in a single data model, the Resource Description Framework (RDF). RDF is a W3C standard meant for data interchange and is used for storing graph data in a lossless manner. Semantics defined in RDF are just rules, and no new information is derived from the triples. These rules are called ontologies, and they form an optional layer on top of RDF. A crucial step here is to combine ontologies from multiple sources.
    From existing ontologies:
    (i) Existing biological ontologies such as BTO are created based on W3C standards and can therefore be imported as is. Figure 4a gives an example of the BTO ontology, which already contains data as RDF triples and so need not be modified.
    Defining custom rules between entities:
    (i) Cross-entity relationships are seldom defined in one single biomedical ontology and therefore need to be defined explicitly.
    (ii) The ontology allows new rules to be added and existing rules to be modified when a new relationship is defined.

Therefore, an ontology can be updated as per the requirement or to model a specific consumption journey for users.

Figure 4: Landscape of the BRENDA Tissue Ontology (BTO), highlighting the tissue “midbrain central gray” (left). KG representation of the tissue ontology for “midbrain central gray” after loading the relationships into Neo4j (right).

  2. Mappings (M): The OBDA paradigm exposes a conceptual view of the domain knowledge for querying, so that users do not need to understand the data sources or the relationships between them. Simply put, OBDA provides a high-level querying interface without exposing the structure of the relational database to the users.

The mapping contains a series of declarations. Each declaration has a unique mapping identifier and defines the relationship (predicate) that exists between a subject and an object, together with a query written in the query language of the relational database. The query retrieves the required rows from the DBMS store so that graph triples can be materialised to populate the ontology (O).

Ontologies are used as a conceptual representation of data stored in RDBMSs. The basic approach for mapping databases to ontologies and vice versa is Direct Mapping, which suggests basic mapping rules such as:

i. mapping tables into Classes;

ii. mapping columns into Data Properties;

iii. mapping foreign keys into Object Properties
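As a rough illustration of these three rules, the sketch below turns rows of a single table into triples. The `Table` description, the `cell_line` and `tissue` schema, and the sample row are all hypothetical; the identifier naming (`table/key`, `table#column`, `table#ref-column`) is only loosely modelled on the W3C Direct Mapping convention and is simplified here.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """Minimal description of a relational table (hypothetical helper)."""
    name: str
    primary_key: str
    columns: list[str]
    # Maps a foreign-key column to the table it references.
    foreign_keys: dict[str, str] = field(default_factory=dict)


def direct_mapping(table: Table, rows):
    """Yield (subject, predicate, object) triples for each row."""
    for row in rows:
        subject = f"{table.name}/{row[table.primary_key]}"
        # Rule i: the table becomes the class of every row in it.
        yield (subject, "rdf:type", table.name)
        for column in table.columns:
            value = row[column]
            if value is None:
                continue  # NULLs produce no triple
            if column in table.foreign_keys:
                # Rule iii: a foreign key becomes an object property
                # pointing at the referenced row.
                target = f"{table.foreign_keys[column]}/{value}"
                yield (subject, f"{table.name}#ref-{column}", target)
            else:
                # Rule ii: a plain column becomes a data property.
                yield (subject, f"{table.name}#{column}", value)


# Hypothetical schema and data, for illustration only.
cell_line = Table(
    name="cell_line",
    primary_key="id",
    columns=["id", "name", "tissue_id"],
    foreign_keys={"tissue_id": "tissue"},
)
rows = [{"id": 1, "name": "HeLa", "tissue_id": 7}]

for triple in direct_mapping(cell_line, rows):
    print(triple)
```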

The following is an example of a mapping declaration specified in an .obda file.

[PrefixDeclaration]
:       http://example.org/voc#
ex:     http://example.org/
owl:    http://www.w3.org/2002/07/owl#
rdf:    http://www.w3.org/1999/02/22-rdf-syntax-ns#
xml:    http://www.w3.org/XML/1998/namespace
xsd:    http://www.w3.org/2001/XMLSchema#
foaf:   http://xmlns.com/foaf/0.1/
obda:   https://w3id.org/obda/vocabulary#
rdfs:   http://www.w3.org/2000/01/rdf-schema#

[MappingDeclaration] @collection [[
mappingId   <db_name>-cell_line-tissue
target      :<db_name>/thing/{cell_line} :sampled_from :<db_name>/thing/{tissue} .
source      SELECT * FROM <db_name>.<table>...
]]
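To make the role of the target template concrete, here is a minimal sketch of how one mapping could be materialised into triples: each row returned by the source query fills the `{column}` placeholders in the target. This is not how the OBDA framework is implemented internally; the function names, the `mydb` database name and the sample row are assumptions for illustration.

```python
import re
from urllib.parse import quote

# Default ':' prefix, taken from the prefix declaration in the example.
BASE = "http://example.org/voc#"


def expand(term: str, row: dict, db_name: str) -> str:
    """Fill <db_name> and {column} slots, then expand the ':' prefix."""
    term = term.replace("<db_name>", db_name)
    # Column values are percent-encoded so they are safe inside an IRI.
    term = re.sub(
        r"\{(\w+)\}",
        lambda m: quote(str(row[m.group(1)]), safe=""),
        term,
    )
    return f"<{BASE}{term[1:]}>" if term.startswith(":") else term


def materialise(target: str, rows, db_name: str):
    """Yield one (subject, predicate, object) triple per source row."""
    subject, predicate, obj = target.rstrip(" .").split()
    for row in rows:
        yield tuple(expand(t, row, db_name) for t in (subject, predicate, obj))


target = ":<db_name>/thing/{cell_line} :sampled_from :<db_name>/thing/{tissue} ."
# Hypothetical rows, standing in for the result of the source SQL query.
rows = [{"cell_line": "HeLa", "tissue": "cervix"}]

for triple in materialise(target, rows, db_name="mydb"):
    print(" ".join(triple), ".")
```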

Detailed View of the Graph Layer

The materialised RDF triples are just a flat form of graph data and are non-transactional in nature. To perform dynamic graph operations efficiently, the data needs to be exported to Neo4j using neosemantics (n10s).

n10s gives control over how an ontology serialised in RDF is imported into, and configured in, a Neo4j database.

The main method for importing RDF is n10s.rdf.import.fetch, which imports and persists into Neo4j the triples returned by a URI. For ontologies serialised as RDF, n10s.onto.import.fetch provides more control over the RDF.

Once the ontology has been configured as required, it can be fetched directly using the URI of the OWL file and a serialisation format, with the following command.

CALL n10s.rdf.import.fetch(<RDF_PATH>, <format>);
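In practice, n10s expects a uniqueness constraint on resource URIs and an initialised graph configuration before the first import. The sketch below only assembles the Cypher statements in order; it does not connect to a database. The function name, the example URL and the `handleMultival: 'ARRAY'` setting are assumptions (the latter is a guess based on the array-valued properties seen in the queries in this post), and the constraint syntax shown is the Neo4j 5 form, so older versions may need a different statement.

```python
def n10s_import_plan(rdf_url: str, rdf_format: str = "Turtle"):
    """Return (statement, parameters) pairs to run in order.

    Hypothetical helper: it only builds the statements, so they can be
    handed to whichever Neo4j client is in use.
    """
    return [
        # n10s identifies every imported resource by its URI.
        (
            "CREATE CONSTRAINT n10s_unique_uri IF NOT EXISTS "
            "FOR (r:Resource) REQUIRE r.uri IS UNIQUE",
            {},
        ),
        # Graph configuration; the option below is an assumed setting.
        ("CALL n10s.graphconfig.init({handleMultival: 'ARRAY'})", {}),
        # The import itself, with the location and format as parameters.
        (
            "CALL n10s.rdf.import.fetch($url, $format)",
            {"url": rdf_url, "format": rdf_format},
        ),
    ]


# Placeholder URL, not a real resource.
for statement, params in n10s_import_plan("https://example.org/ontology.ttl"):
    print(statement, params)
```

Passing the location and format as query parameters avoids assembling Cypher by string concatenation.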

After the import, the first thing we notice is that datatype properties in the RDF have been converted into node properties, and object properties are now relationships connecting nodes.
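The conversion rule can be sketched in a few lines: a triple whose object is a literal becomes a property on the subject node, while a triple whose object is another resource becomes a relationship. This is a simplified illustration rather than the n10s implementation; the `IRI` wrapper, the function name and the sample triples are assumptions, and special handling of `rdf:type` is left out.

```python
from typing import NamedTuple


class IRI(NamedTuple):
    """Marks a value as a resource identifier rather than a literal."""
    value: str


def triples_to_lpg(triples):
    """Split triples into node properties and relationships."""
    nodes: dict[str, dict[str, list]] = {}
    relationships: list[tuple[str, str, str]] = []
    for subject, predicate, obj in triples:
        props = nodes.setdefault(subject.value, {})
        if isinstance(obj, IRI):
            # Object property: connect two nodes.
            nodes.setdefault(obj.value, {})
            relationships.append((subject.value, predicate.value, obj.value))
        else:
            # Datatype property: store the literal on the node. Values are
            # kept in a list so repeated predicates are not overwritten.
            props.setdefault(predicate.value, []).append(obj)
    return nodes, relationships


# Illustrative triples only.
triples = [
    (IRI("ex:BRD1"), IRI("ex:gene_symbol"), "BRD1"),
    (IRI("ex:HeLa"), IRI("ex:sampled_from"), IRI("ex:cervix")),
]
nodes, relationships = triples_to_lpg(triples)
print(nodes)
print(relationships)
```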

Querying PollyGraph

One of the main mechanisms behind semantic search is to use publicly available biomedical ontologies, linked by mining the associations between them, so that the knowledge graph can return more meaningful results.

By leveraging ontologies, semantic search is able to provide a suitable response even if the results don’t contain the exact wording of the query.

Find Datasets Related to Brain Injury where BRD1 Gene is Regulated

Note: Here, “Brain Injury” is not a valid MeSH term. However, the query is still able to fetch datasets tagged with the relevant MeSH term for brain injury.

Such searches let users worry less about the specificity of input keywords and run queries that are biology-focussed.

MATCH (n:ns0__disease)
WHERE ANY(x IN n.ns0__name + n.ns0__synonyms WHERE toLower(x) CONTAINS("brain injury"))
WITH n.ns0__name AS disease
UNWIND disease AS d
MATCH (m:ns0__gene {ns0__gene_symbol: ['BRD1']})--(n:ns0__disease {ns0__name: [d]})--(p:dataset)
RETURN m, n, p;
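The first part of this query does the synonym-aware matching: a disease node qualifies if the search phrase appears, ignoring case, in any of its names or synonyms. A minimal Python sketch of the same logic is shown below; the function name and the disease records are made up for illustration and are not taken from the actual graph.

```python
def matching_diseases(term: str, diseases: list[dict]) -> list[str]:
    """Return the names of diseases whose name or synonyms contain `term`."""
    needle = term.lower()
    hits = []
    for disease in diseases:
        # Names and synonyms are both lists, so they can be concatenated
        # and checked together, as in the WHERE ANY(...) clause.
        candidates = disease["name"] + disease["synonyms"]
        if any(needle in candidate.lower() for candidate in candidates):
            hits.extend(disease["name"])
    return hits


# Illustrative records only.
diseases = [
    {"name": ["Brain Injuries"], "synonyms": ["Brain Injury", "Injury, Brain"]},
    {"name": ["Sarcoma"], "synonyms": ["Sarcomas"]},
]

print(matching_diseases("brain injury", diseases))
```

In this example the phrase does not occur in the primary name, so the match comes from the synonym list, which is what lets an inexact phrase resolve to the standard term.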

Fetch Drugs that Up-Regulate Genes that are Down-Regulated in a Given Disease (e.g. Sarcomas)

As in the example above, “sarcoma” is an extremely generic search term, yet the query is able to match the relevant diseases that may be of interest to users, thereby surfacing results that are non-obvious.

MATCH (p:ns0__disease)-[:ns0__downregulates]-(q:ns0__gene)-[:ns0__drug_upregulates]-(r:ns0__drug)
WHERE ANY(x IN p.ns0__name + p.ns0__synonyms WHERE toLower(x) CONTAINS('sarcoma'))
WITH r.ns0__name AS drug, p.ns0__name AS disease, q.ns0__gene_symbol AS gene
RETURN drug, gene, disease

Fetch Other Drugs which Perturb the Expression of Same Genes as Colistin

MATCH (p:ns0__drug)-[r1]-(q:ns0__gene)-[r2]-(r:ns0__drug)
WHERE ANY(x IN p.ns0__name + p.ns0__synonyms WHERE toLower(x) CONTAINS("colistin"))
RETURN p, q, r
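The drug-gene-drug pattern is a two-hop traversal: start at one drug, step to the genes it is connected to, then step to any other drug connected to those genes. The sketch below reproduces that idea over a plain edge list. The function name, the exact-name lookup (the query above uses a case-insensitive substring match), the exclusion of the query drug from the results, and the placeholder drug and gene names are all simplifying assumptions.

```python
from collections import defaultdict


def drugs_sharing_genes(edges, query_drug: str) -> dict[str, list[str]]:
    """Map every other drug to the genes it shares with `query_drug`.

    `edges` is an iterable of (drug, gene) pairs; direction and
    relationship type are ignored, as in the undirected pattern above.
    """
    genes_by_drug: dict[str, set[str]] = defaultdict(set)
    for drug, gene in edges:
        genes_by_drug[drug].add(gene)

    query_genes = genes_by_drug.get(query_drug, set())
    shared = {}
    for drug, genes in genes_by_drug.items():
        common = genes & query_genes
        if drug != query_drug and common:
            shared[drug] = sorted(common)
    return shared


# Placeholder edges, not real drug-gene associations.
edges = [
    ("colistin", "gene_1"),
    ("colistin", "gene_2"),
    ("drug_B", "gene_2"),
    ("drug_C", "gene_3"),
]

print(drugs_sharing_genes(edges, "colistin"))
```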

Building an extensive biomedical knowledge graph is an iterative process. Further developments will allow users to make more complex queries and to cross-reference multiple ontology IDs for a given biological entity. Adding more relationships will make the KG richer in context, which will enable users to execute more robust queries on Polly.

References

  1. Bagosi, T., Calvanese, D., Hardi, J., Komla-Ebri, S., Lanti, D., Rezk, M., … & Xiao, G. (2014, August). The ontop framework for ontology based data access. In Chinese Semantic Web and Web Science Conference (pp. 67–77). Springer, Berlin, Heidelberg.
