Natural Language Processing in RDF graphs

This blog shows how to store text data in a RDF graph and retrieve and analyze information from that graph. Resource Description Format (RDF) graphs are very suitable structures for storing Natural Language Processing (NLP) data. They enable combining NLP data with other data sets in RDF (such as legal entities data from the Global LEI Foundation and the EIOPA register of European insurance undertakings, terminology data, for example Solvency 2 terminology and data from XBRL reports); and they allow adding text semantics in the form of linguistic annotations, which enables NLP analyses simply by executing database queries.

Here is what I did. To get a proper amount of text data I web-scraped the entire website of De Nederlandsche Bank (text in webpages and in PDF documents, including speeches, press releases, research publications, sector information, dnbulletins, and all blogs by Maarten Gelderman and Olaf Sleijpen, consisting of over 4.000 documents). Text extraction from the web pages was done with the Python package newspaper3k (a great tip from my NLP colleagues from the Authority for Consumers and Markets). Text data was then converted to the NLP Annotation Format (NAF), for which I defined a RDF representation (implemented in the Nafigator package) to upload the data in a RDF triple-store. For the triple-store I used Ontotext’s GraphDB, one of the best RDF database currently available. Then, information can be retrieved from the graph database with SPARQL queries for all kinds of NLP analyses.

Using a triple-store for NLP data leads to an efficient retrieval process of text data, especially if you compare that to a process where you search through different annotation files. Triple-stores for RDF (and the new RDF-star) have become efficient and powerful solutions with the equal capabilities as property graphs but with advantages of RDF and ontologies.

I will describe two parts of this process that are not straightforward in detail: the RDF representation of NAF, and retrieving data from the graph database.

The NLP Annotation Format in RDF

The NLP Annotation Format is an easy format for storing text annotations (see here for links to the description). All documents that were scraped from the website were processed with the Python package Nafigator, that is able to convert PDF document and HTML-files to XML-files satisfying the NLP Annotation Format. Standard annotation layers with the raw text, word forms, terms, named entities and dependencies were added using the Stanford Stanza NLP processor.

In this representation every annotation (word forms, terms, named entities, etc.) of every document must have an Uniform Resource Identifier (URI). To do this, I used a prefix doc_xxx for each document in the document set. This prefix can, for example, be set by

@prefix doc_001: <http://rdf.mangosaurus.eu/doc_001/> .

Which in this case is an identifier based on the domain of this blog. For web-scraped documents you might also use the original URL of the document. Furthermore, for the RDF representation of NAF a provisional RDF Schema with prefix naf-base was made with the basic properties en classes of NAF.

The basic structure is set out below. All examples provided below are derived from the file example.pdf in the Nafigator package (the first sentences of the first page starts with: ‘The Nafigator package … ‘).

Document and header

Every document has a header and pages.

doc_001:doc a naf-base:document ;
    naf-base:hasHeader doc_001:nafHeader ;
    naf-base:hasPages ( doc_001:page1 ) .

Here naf-base:document is a RDF Class and naf-base:hasHeader and naf-base:hasPages are RDF Properties. The three lines above state that doc_001:doc is a document with header doc_001:nafHeader and a single page doc_001:page1.

In the header all metadata of the document is stored, including all linguistics processors and models that were used in processing the document. Below you see the metadata of the NAF text layer and the document metadata.

doc_001:nafHeader a naf-base:header ;
    naf-base:hasLinguisticProcessors [ 
        naf-base:hasLayer naf-base:text ;
        naf-base:lp [ 
            naf-base:hasBeginTimestamp "2022-04-10T13:45:43UTC" ;
            naf-base:hasEndTimestamp "2022-04-10T13:45:44UTC" ;
            naf-base:hasHostname "desktop-computer" ;
            naf-base:hasModel "stanza_resources\\en\\tokenize\\ewt.pt" ;
            naf-base:hasName "text" ;
            naf-base:hasVersion "stanza_version-1.2.2" 
        ] 
        ...
    ] ;
    naf-base:hasPublic [ 
        dc:format "application/pdf" ;
        dc:uri "data/example.pdf" 
    ] .

Sentences, paragraphs and pages

Here is an example of a sentence object with properties.

doc_001:sent1 a naf-base:sentence ;
    naf-base:isPartOf doc_001:para1, doc_001:page1 ;
    naf-base:hasSpan ( doc_001:wf1 doc_001:wf2 ...  doc_001:wf29 ) .

These three lines describe the properties of the RDF subject doc_001:sent1. The doc_001:sent1 identifies the RDF subject for the first sentence of the first document; the first line says that the subject doc_001:sent1 is a (rdf:type) sentence. The second line says that this sentence is a part of the first paragraph and the first page of the document. The span of the sentence contains a ordered list of word forms of the sentence: doc_001:wf1, doc_001:wf2 and so on.

Paragraphs and pages to which the sentences refer are defined in a similar way.

Word forms and terms

Of each word form the properties text, length and offset are defined. The word form is a part of a term, sentence, paragraph and page, and that is also defined for every word form. Take for example the the word form doc_001:wf2 defined as:

doc_001:wf2 a naf-base:wordform ;
    naf-base:hasText "Nafigator"^^rdf:XMLLiteral ;
    naf-base:hasLength "9"^^xsd:integer ;
    naf-base:hasOffset "4"^^xsd:integer ;
    naf-base:isPartOf doc_001:page1, 
        doc_001:para1, 
        doc_001:sent1.

In the next layer the terms of the word forms are defined, with their linguistic properties (lemma, grammatical number, part-of-speech and if applicable other properties such as verb voice and verb form). The term that refers to the word form above is

doc_001:term2 a naf-base:term ;
    naf-base:hasLemma "Nafigator"^^rdf:XMLLiteral ;
    naf-base:hasNumber olia:Singular ;
    naf-base:hasPos olia:ProperNoun ;
    naf-base:hasSpan ( doc_001:wf2 ) .

For the linguistic properties the OLiA ontology is used, which stand for Ontologies of Linguistic Annotations, an OWL taxonomy of data categories for linguistic annotations. The ontology contains precise definitions and interrelation between the linguistic categories. In this case the grammatical number (olia:Singular) and the part-of-speech tag (olia:ProperNoun) is included in the properties of this term. Depending of the term other properties are defined, for example verb forms. The span of the term refers back to the word forms (if you create a NAF ontology then you would define this as a transitive relationship, but for now, by including both relations we speed up the retrieval process).

Named entities

Next are the named entities that are stored in another NAF layer and here as separate subjects in the triple-store. An entity refers back to a term and has a certain type (organization, person, product, law, date and so on). The text of the entity is already stored in the term object so there is not need to include it here. External references could be added here, for example references to legal entities from Global LEI Foundation. Here is the example referring to the triples above.

doc_001:entity1 a naf-base:entity ;
    naf-base:hasType naf-entity:product ;
    naf-base:hasSpan ( doc_001:term2 ) .

Dependencies

Powerful NLP models exist that are able to derive relationships between words in within sentences. The dependencies are defined on the level of terms and stored in the dependency layer of NAF. In this RDF representation the dependencies are simply added to the terms.

doc_001:term3 a naf-base:term ;
    naf-rfunc:compound doc_001:term2 ;
    naf-rfunc:det doc_001:term1 .

The second and third line say that term3 (‘package’) forms a compound term with term2 (‘Nafigator’) and has its determinant in term1 (‘The’).

There are more annotation layers in NAF, but these are the most basic ones and if you have these, then many powerful NLP analyses already can be done.

Information retrieval from the RDF graph database

The conversion of text to RDF described above was applied to all webpages and documents of the website of DNB, 4.065 documents in total with 401.832 sentences containing 9.789.818 words. This text data led to over 221 million RDF triples in the tripe-store. I used a local database that was queried via a SPARQL endpoint. These numbers mentioned here can easily be extracted with SPARQL queries, for example to count the number of sentences we can use the SPARQL query:

SELECT (COUNT(?s) AS ?count) WHERE { ?s a naf-base:sentence . }

With this query all RDF subjects (the variable ?s) that are a sentence are counted and the result is stored in the variable ‘count’. The same can be done with other RDF subjects like word forms and documents.

The RDF representation described above allows you to store the content and annotations of a set of documents with their metadata in one single graph. You can then retrieve information from that graph from different perspectives and for different purposes.

Information retrieval

Suppose we want to find all references on the website with relations between ‘DNB’ and the verb ‘supervise’ by looking for sentences where ‘DNB’ is the nominal subject and ‘supervise’ is the lemma of the verb in the sentence. This is done with the following query

SELECT ?text
WHERE {
    ?term naf-base:hasLemma "supervise" .
	?term naf-rfunc:nsubj [naf-base:hasLemma "DNB" ] .
    ?term naf-base:hasSpan [ rdf:first ?wf ] .
    ?wf naf-base:isPartOf [ a naf-base:sentence ; naf-base:hasText ?text ].
}

It’s almost readable 🙂 The first line in the WHERE statement retrieves words that have ‘supervise’ as a lemma (this includes past, present and future tense and different verb forms). The second line narrows the selection down to where the nominal subject of the verb is ‘DNB’ (the lemma of the subject to be precise). The last two lines select the text of the sentences that includes the words that were found.

Execution of this query is done in a few milliseconds (on a desktop computer with a local database, nothing fancy) and results in 22 sentences, such as “DNB supervises adequate management of sustainability risks by financial institutions.”, “DNB supervises the cash payment system by providing information and guidance on the rules and procedures, data collection and examining compliance with the rules.”, and so on.

Term extraction

Terms are often multi-words and can be retrieved by part-of-speech tags and dependencies. Suppose we want to retrieve all two-words terms of the form adjective, common noun. Part-of-speech tags are defined in the terms layer. In the graph also the relation between the terms is defined, in this case by an adjectival modifier (amod) relation (the common noun is modified by an adjective). Then we can define a query that looks for exactly that: two words, an adjective and a common noun, where the mutual relationship is of an adjectival modifier. This is expressed in the first three lines in the WHERE-clause below. The last two lines retrieve the text of the words.

SELECT DISTINCT ?w1 ?w2 (count(*) as ?c)
WHERE {
    ?term1 naf-base:hasPos olia:CommonNoun .
    ?term2 naf-base:hasPos olia:Adjective .
    ?term1 naf-rfunc:amod ?term2 .
    ?term1 naf-base:hasSpan [ rdf:first/naf-base:hasText ?w1 ] .
    ?term2 naf-base:hasSpan [ rdf:first/naf-base:hasText ?w2 ] .
} GROUP BY ?w1 ?w2
ORDER BY DESC(?c)

Note that in the query a count of the number of occurrences of the term in the output and sort the output according to this count has been added.

Most often the term ‘monetary policy’ was found (2.348 times), followed by ‘financial institutions’ (1.734 times) and ‘financiële instellingen’ (Dutch translation of financial institution, 1.519 times), and so on. In total more than 127.000 of these patterns were found on the website (this is a more complicated query and took around 10 seconds). In this way all kinds of term patterns can be found, which can be collected in a termbase (terminology database).

Opinion extraction

I will give here a very simple example of opinion extraction based on part-of-speech tags. Suppose you want to extract sentences that contain the authors (or someone else’s) subjective opinion. You can look a the grammatical subject and the verb in a sentence, but you can also look at whether a sentence contains something like ‘too high’ or ‘too volatile’ (which often indicates a subjective content). In that case we have the word ‘too’ (an adverb) followed by an adjective, with mutual relation of adverbial modifier (advmod). In the Dutch language this has exactly the same form. The following query extracts these sentences.

SELECT ?text
WHERE {
    ?term1 naf-base:hasPos olia:Adjective .
    ?term2 naf-base:hasSpan [ rdf:first/naf-base:hasText "too" ] .
    ?term1 naf-rfunc:advmod ?term2 .
    ?term1 naf-base:hasSpan [ rdf:first ?wf1 ] .
    ?sent1 naf-base:hasSpan [ rdf:rest*/rdf:first ?wf1 ] .
    ?sent1 a naf-base:sentence .
    ?sent1 naf-base:hasText ?text .
}

With the last three lines the text of the sentence that includes the term is found (the output of the query). With the documents of the website of DNB, the output contains sentences like: “It is also clear that CO2 emissions are still too cheap and must be priced higher to sufficiently curtail emissions” and “Firms end up being too large” (in total 343 sentences in 0.3 seconds).

The examples shown here are just for illustrative purposes and do not always lead to accurate results, but they show that information extraction can be done fairly easy (if you know SPARQL) and reasonably quick. Once the data is stored into a graph database, named entities can be matched with other internal or external data sources and lemmas of terms can be matched with concept-based terminology databases. Then you have a graph where the text is not only available on a simple string-level but also, and more importantly, on a conceptual level.

UPDATE: I have written a follow-up on this blog here.

Leave a Reply