In the last few years, graph databases have become increasingly popular for capturing information from unstructured documents, such as scientific publications and technical reports. The main concepts identified within the sentences of the documents are usually represented as nodes of the graph, while the relationships between concepts are represented as edges between nodes. Such graphs make it possible to derive new information by navigating along the relationships between nodes. As an example, consider a geological formation that is associated with a lithology in one report and with a geological age in another. With graph navigation, one can retrieve both the lithology and the geological age of that formation, even if no publication ever reported the two pieces of information together. In this sense, we can speak of Knowledge Graphs, capable of generating knowledge out of the original information.
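The formation example can be illustrated with a minimal sketch. The formation name, attribute values, relation labels and report identifiers below are hypothetical placeholders chosen for illustration; they are not taken from the paper or from the system it describes.

```python
from collections import defaultdict

# Hypothetical facts extracted from two separate reports (placeholders only).
facts = [
    ("Formation A", "has_lithology", "sandstone", "report_1"),
    ("Formation A", "has_age", "Jurassic", "report_2"),
]

def build_graph(facts):
    """Build an adjacency map: node -> list of (relation, neighbour, source)."""
    graph = defaultdict(list)
    for subject, relation, obj, source in facts:
        graph[subject].append((relation, obj, source))
    return graph

def describe(graph, node):
    """Collect every attribute one hop away from the node, whatever its source."""
    return {relation: obj for relation, obj, _ in graph.get(node, [])}

graph = build_graph(facts)
summary = describe(graph, "Formation A")
print(summary)
```

Although neither report states both facts, a single navigation step from the formation node returns the lithology and the age together.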
This paper reports the joint effort by IBM and Eni to develop a fully configurable and flexible environment where Knowledge Graphs can be generated and navigated to exploit the knowledge embedded in geological papers and magazines. The nodes of the Knowledge Graph are created by a number of Annotators, i.e. software components that automatically identify the relevant geological concepts within the paragraphs of the documents by exploiting a variety of techniques, ranging from Natural Language Understanding, dictionaries and ontologies to machine learning methods. The edges of the Knowledge Graph are created by a number of Relationship Aggregators, i.e. software components that automatically link the concepts identified by the Annotators to one another. Users can graphically configure Annotators and Aggregators in a "Data Flow", i.e. a processing pipeline that applies them in the desired order to generate the final Knowledge Graph.
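The idea of a Data Flow can be sketched in a few lines, under strong simplifying assumptions. The dictionary-based annotator, the co-occurrence aggregator, the function names and the sample paragraphs are all hypothetical stand-ins; the actual Annotators and Aggregators, and their interfaces, are not specified here.

```python
import itertools

# Hypothetical dictionary; real Annotators may also rely on NLU, ontologies or ML.
DICTIONARY = {
    "formation": ["Formation A"],
    "lithology": ["sandstone", "shale"],
    "age": ["Jurassic"],
}

def dictionary_annotator(paragraph):
    """Return (concept_type, term) pairs for dictionary terms found in the text."""
    text = paragraph.lower()
    return [
        (concept_type, term)
        for concept_type, terms in DICTIONARY.items()
        for term in terms
        if term.lower() in text
    ]

def cooccurrence_aggregator(annotations):
    """Link concepts of different types that occur in the same paragraph."""
    return [
        (a, b)
        for a, b in itertools.combinations(annotations, 2)
        if a[0] != b[0]
    ]

def run_data_flow(paragraphs, annotators, aggregators):
    """Apply annotators, then aggregators, to each paragraph in order."""
    nodes, edges = set(), set()
    for paragraph in paragraphs:
        annotations = []
        for annotate in annotators:
            annotations.extend(annotate(paragraph))
        nodes.update(annotations)
        for aggregate in aggregators:
            edges.update(aggregate(annotations))
    return nodes, edges

paragraphs = [
    "Formation A consists mainly of sandstone.",
    "Formation A was deposited during the Jurassic.",
]
nodes, edges = run_data_flow(
    paragraphs, [dictionary_annotator], [cooccurrence_aggregator]
)
```

Passing the annotators and aggregators as ordered lists mirrors the configurable pipeline: changing the Data Flow amounts to changing those lists.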
The query engine of the Knowledge Graph supports parallel traversal of the graph on an HPC architecture. Queries can be designed graphically by specifying a workflow, i.e. a series of traversal steps across the graph. The paper presents a number of practical examples in which complex geological queries are executed, such as retrieving the main information about the Petroleum System Events of a basin, e.g. the formations acting as source, seal and reservoir, together with their lithologies and geological ages.
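A workflow of traversal steps can be sketched as follows. This is only an illustration: the toy graph, the basin and formation names, the edge labels and the `run_workflow` function are hypothetical, and a thread pool stands in for the HPC back end, whose actual design is not described here.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

# Hypothetical toy graph: node -> {edge label: [neighbours]}. Placeholders only.
GRAPH = {
    "Basin X": {
        "source": ["Formation A"],
        "reservoir": ["Formation B"],
        "seal": ["Formation C"],
    },
    "Formation A": {"lithology": ["shale"], "age": ["Jurassic"]},
    "Formation B": {"lithology": ["sandstone"], "age": ["Cretaceous"]},
    "Formation C": {"lithology": ["evaporite"], "age": ["Cretaceous"]},
}

def expand(path, labels):
    """Extend one path along every outgoing edge whose label is in `labels`."""
    _, node = path[-1]
    return [
        path + ((label, neighbour),)
        for label, neighbours in GRAPH.get(node, {}).items()
        if label in labels
        for neighbour in neighbours
    ]

def run_workflow(start, steps, max_workers=4):
    """Run traversal steps in order; paths within a step are expanded in parallel."""
    paths = [((None, start),)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for labels in steps:
            batches = pool.map(partial(expand, labels=labels), paths)
            paths = [new_path for batch in batches for new_path in batch]
    return paths

# Step 1: basin -> formations by role; step 2: formation -> lithology and age.
workflow = [{"source", "reservoir", "seal"}, {"lithology", "age"}]
results = run_workflow("Basin X", workflow)

# Group the resulting paths by (role, formation).
summary = {}
for _, (role, formation), (attribute, value) in results:
    summary.setdefault((role, formation), {})[attribute] = value
```

Each step only depends on the paths produced by the previous one, so the expansions within a step are independent and can be distributed across workers.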