Natural Language Processing in RDF graphs

This blog shows how to store text data in an RDF graph and how to retrieve and analyze information from that graph. Resource Description Framework (RDF) graphs are very suitable structures for storing Natural Language Processing (NLP) data. They enable combining NLP data with other RDF data sets (such as legal entity data from the Global LEI Foundation and the EIOPA register of European insurance undertakings, or terminology data, for example Solvency 2 terminology and data from XBRL reports), and they allow adding text semantics in the form of linguistic annotations, which enables NLP analyses simply by executing database queries.

Here is what I did. To get a proper amount of text data I web-scraped the entire website of De Nederlandsche Bank (text in web pages and in PDF documents, including speeches, press releases, research publications, sector information, DNBulletins, and all blogs by Maarten Gelderman and Olaf Sleijpen: over 4,000 documents in total). Text extraction from the web pages was done with the Python package newspaper3k (a great tip from my NLP colleagues at the Authority for Consumers and Markets). The text data was then converted to the NLP Annotation Format (NAF), for which I defined an RDF representation (implemented in the Nafigator package) to upload the data into an RDF triple-store. For the triple-store I used Ontotext’s GraphDB, one of the best RDF databases currently available. Information can then be retrieved from the graph database with SPARQL queries for all kinds of NLP analyses.
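
As an indication, the first step (scraping and text extraction) looks roughly like the sketch below. The URL is illustrative, and the nafigator call is an assumption; see the nafigator documentation for the exact API.

# Sketch of the text-extraction step with newspaper3k
from newspaper import Article

url = "https://www.dnb.nl/en/example-page/"   # illustrative URL
article = Article(url)
article.download()    # fetch the HTML
article.parse()       # extract the main text from the page
text = article.text

# The text is then converted to NAF with Nafigator and Stanza;
# the exact function name below is an assumption:
# naf = nafigator.generate_naf(input=text, engine="stanza", language="en")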

Using a triple-store for NLP data leads to an efficient retrieval process for text data, especially compared to a process where you search through separate annotation files. Triple-stores for RDF (and the new RDF-star) have become efficient and powerful solutions with capabilities equal to those of property graphs, but with the advantages of RDF and ontologies.

I will describe in detail two parts of this process that are not straightforward: the RDF representation of NAF, and retrieving data from the graph database.

The NLP Annotation Format in RDF

The NLP Annotation Format is an easy format for storing text annotations (see here for links to the description). All documents scraped from the website were processed with the Python package Nafigator, which is able to convert PDF documents and HTML files to XML files satisfying the NLP Annotation Format. Standard annotation layers with the raw text, word forms, terms, named entities and dependencies were added using the Stanford Stanza NLP processor.

In this representation every annotation (word forms, terms, named entities, etc.) of every document must have a Uniform Resource Identifier (URI). For this, I used a prefix doc_xxx for each document in the document set. This prefix can, for example, be set by

@prefix doc_001: <http://rdf.mangosaurus.eu/doc_001/> .

This identifier is based on the domain of this blog; for web-scraped documents you might also use the original URL of the document. Furthermore, for the RDF representation of NAF a provisional RDF Schema with prefix naf-base was made, containing the basic properties and classes of NAF.
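
In rdflib, for example, such a document prefix can be created and bound as follows (a small sketch):

from rdflib import Graph, Namespace

DOC_001 = Namespace("http://rdf.mangosaurus.eu/doc_001/")

g = Graph()
g.bind("doc_001", DOC_001)   # prefix is now used in serializations and queries

print(DOC_001["wf2"])        # -> http://rdf.mangosaurus.eu/doc_001/wf2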

The basic structure is set out below. All examples provided below are derived from the file example.pdf in the Nafigator package (the first sentence of the first page starts with: ‘The Nafigator package … ‘).

Document and header

Every document has a header and pages.

doc_001:doc a naf-base:document ;
    naf-base:hasHeader doc_001:nafHeader ;
    naf-base:hasPages ( doc_001:page1 ) .

Here naf-base:document is an RDF class, and naf-base:hasHeader and naf-base:hasPages are RDF properties. The three lines above state that doc_001:doc is a document with header doc_001:nafHeader and a single page doc_001:page1.

The header stores all metadata of the document, including all linguistic processors and models that were used in processing the document. Below you see the metadata of the NAF text layer and the document metadata.

doc_001:nafHeader a naf-base:header ;
    naf-base:hasLinguisticProcessors [ 
        naf-base:hasLayer naf-base:text ;
        naf-base:lp [ 
            naf-base:hasBeginTimestamp "2022-04-10T13:45:43UTC" ;
            naf-base:hasEndTimestamp "2022-04-10T13:45:44UTC" ;
            naf-base:hasHostname "desktop-computer" ;
            naf-base:hasModel "stanza_resources\\en\\tokenize\\ewt.pt" ;
            naf-base:hasName "text" ;
            naf-base:hasVersion "stanza_version-1.2.2" 
        ] 
        ...
    ] ;
    naf-base:hasPublic [ 
        dc:format "application/pdf" ;
        dc:uri "data/example.pdf" 
    ] .

Sentences, paragraphs and pages

Here is an example of a sentence object with properties.

doc_001:sent1 a naf-base:sentence ;
    naf-base:isPartOf doc_001:para1, doc_001:page1 ;
    naf-base:hasSpan ( doc_001:wf1 doc_001:wf2 ...  doc_001:wf29 ) .

These three lines describe the properties of the RDF subject doc_001:sent1, which identifies the first sentence of the first document. The first line says that the subject doc_001:sent1 is a (rdf:type) sentence. The second line says that this sentence is part of the first paragraph and the first page of the document. The span of the sentence contains an ordered list of the word forms of the sentence: doc_001:wf1, doc_001:wf2 and so on.

Paragraphs and pages to which the sentences refer are defined in a similar way.
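
Note that spans are stored as ordered RDF lists (collections). The members of such a list can be retrieved with the property path rdf:rest*/rdf:first, a construct that also appears in the queries further on in this blog. For example, to get the word forms in the span of the first sentence:

SELECT ?wf
WHERE { doc_001:sent1 naf-base:hasSpan [ rdf:rest*/rdf:first ?wf ] . }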

Word forms and terms

For each word form the properties text, length and offset are defined. Each word form is part of a term, a sentence, a paragraph and a page, and these relations are also defined for every word form. Take for example the word form doc_001:wf2, defined as:

doc_001:wf2 a naf-base:wordform ;
    naf-base:hasText "Nafigator"^^rdf:XMLLiteral ;
    naf-base:hasLength "9"^^xsd:integer ;
    naf-base:hasOffset "4"^^xsd:integer ;
    naf-base:isPartOf doc_001:page1, 
        doc_001:para1, 
        doc_001:sent1.

In the next layer the terms of the word forms are defined, with their linguistic properties (lemma, grammatical number, part-of-speech and, if applicable, other properties such as verb voice and verb form). The term that refers to the word form above is

doc_001:term2 a naf-base:term ;
    naf-base:hasLemma "Nafigator"^^rdf:XMLLiteral ;
    naf-base:hasNumber olia:Singular ;
    naf-base:hasPos olia:ProperNoun ;
    naf-base:hasSpan ( doc_001:wf2 ) .

For the linguistic properties the OLiA ontology is used, which stands for Ontologies of Linguistic Annotations, an OWL taxonomy of data categories for linguistic annotations. The ontology contains precise definitions of and interrelations between the linguistic categories. In this case the grammatical number (olia:Singular) and the part-of-speech tag (olia:ProperNoun) are included in the properties of this term. Depending on the term, other properties are defined, for example verb forms. The span of the term refers back to the word forms (if you create a NAF ontology you would define this as a transitive relationship, but for now, by including both relations we speed up the retrieval process).
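
Because the OLiA categories are ordinary RDF resources, you can, for example, count the distribution of part-of-speech tags over all documents with a simple aggregation query (a sketch):

SELECT ?pos (COUNT(?term) AS ?count)
WHERE { ?term naf-base:hasPos ?pos . }
GROUP BY ?pos
ORDER BY DESC(?count)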

Named entities

Next are the named entities, which are stored in another NAF layer and here as separate subjects in the triple-store. An entity refers back to a term and has a certain type (organization, person, product, law, date and so on). The text of the entity is already stored in the term object, so there is no need to include it here. External references could be added, for example references to legal entities from the Global LEI Foundation. Here is the example referring to the triples above.

doc_001:entity1 a naf-base:entity ;
    naf-base:hasType naf-entity:product ;
    naf-base:hasSpan ( doc_001:term2 ) .

Dependencies

Powerful NLP models exist that are able to derive relationships between words within sentences. These dependencies are defined on the level of terms and stored in the dependency layer of NAF. In this RDF representation the dependencies are simply added to the terms.

doc_001:term3 a naf-base:term ;
    naf-rfunc:compound doc_001:term2 ;
    naf-rfunc:det doc_001:term1 .

The second and third lines say that term3 (‘package’) forms a compound with term2 (‘Nafigator’) and has its determiner in term1 (‘The’).

There are more annotation layers in NAF, but these are the most basic ones, and with these in place many powerful NLP analyses can already be done.

Information retrieval from the RDF graph database

The conversion of text to RDF described above was applied to all web pages and documents of the website of DNB: 4,065 documents in total, with 401,832 sentences containing 9,789,818 words. This text data led to over 221 million RDF triples in the triple-store. I used a local database that was queried via a SPARQL endpoint. The numbers mentioned here can easily be extracted with SPARQL queries; for example, to count the number of sentences we can use the query:

SELECT (COUNT(?s) AS ?count) WHERE { ?s a naf-base:sentence . }

With this query all RDF subjects (the variable ?s) that are a sentence are counted, and the result is stored in the variable ‘count’. The same can be done with other RDF subjects, like word forms and documents.
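
From Python, such queries can be sent to the GraphDB endpoint with, for example, the SPARQLWrapper package. A sketch, in which the repository name and the naf-base namespace URI are illustrative:

from SPARQLWrapper import SPARQLWrapper, JSON

# local GraphDB endpoint; the repository name is illustrative
sparql = SPARQLWrapper("http://localhost:7200/repositories/dnb")
sparql.setQuery("""
PREFIX naf-base: <http://rdf.mangosaurus.eu/naf-base/>
SELECT (COUNT(?s) AS ?count) WHERE { ?s a naf-base:sentence . }
""")
sparql.setReturnFormat(JSON)
result = sparql.query().convert()
print(result["results"]["bindings"][0]["count"]["value"])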

The RDF representation described above allows you to store the content and annotations of a set of documents with their metadata in one single graph. You can then retrieve information from that graph from different perspectives and for different purposes.

Information retrieval

Suppose we want to find all references on the website to relations between ‘DNB’ and the verb ‘supervise’, by looking for sentences where ‘DNB’ is the nominal subject and ‘supervise’ is the lemma of the verb in the sentence. This is done with the following query:

SELECT ?text
WHERE {
    ?term naf-base:hasLemma "supervise" .
    ?term naf-rfunc:nsubj [ naf-base:hasLemma "DNB" ] .
    ?term naf-base:hasSpan [ rdf:first ?wf ] .
    ?wf naf-base:isPartOf [ a naf-base:sentence ; naf-base:hasText ?text ].
}

It’s almost readable 🙂 The first line in the WHERE clause retrieves terms that have ‘supervise’ as a lemma (this includes past, present and future tense and different verb forms). The second line narrows the selection down to terms where the nominal subject of the verb is ‘DNB’ (the lemma of the subject, to be precise). The last two lines select the text of the sentences that include the words that were found.

Execution of this query is done in a few milliseconds (on a desktop computer with a local database, nothing fancy) and results in 22 sentences, such as “DNB supervises adequate management of sustainability risks by financial institutions.”, “DNB supervises the cash payment system by providing information and guidance on the rules and procedures, data collection and examining compliance with the rules.”, and so on.

Term extraction

Terms are often multi-word expressions and can be retrieved via part-of-speech tags and dependencies. Suppose we want to retrieve all two-word terms of the form adjective plus common noun. Part-of-speech tags are defined in the terms layer. The graph also defines the relations between terms, in this case the adjectival modifier (amod) relation (the common noun is modified by an adjective). We can then define a query that looks for exactly that: two words, an adjective and a common noun, whose mutual relationship is an adjectival modifier. This is expressed in the first three lines of the WHERE clause below. The last two lines retrieve the text of the words.

SELECT DISTINCT ?w1 ?w2 (count(*) as ?c)
WHERE {
    ?term1 naf-base:hasPos olia:CommonNoun .
    ?term2 naf-base:hasPos olia:Adjective .
    ?term1 naf-rfunc:amod ?term2 .
    ?term1 naf-base:hasSpan [ rdf:first/naf-base:hasText ?w1 ] .
    ?term2 naf-base:hasSpan [ rdf:first/naf-base:hasText ?w2 ] .
} GROUP BY ?w1 ?w2
ORDER BY DESC(?c)

Note that the query counts the number of occurrences of each term and sorts the output in descending order of this count.

The term found most often was ‘monetary policy’ (2,348 times), followed by ‘financial institutions’ (1,734 times) and ‘financiële instellingen’ (the Dutch translation of financial institutions, 1,519 times), and so on. In total more than 127,000 of these patterns were found on the website (this is a more complicated query and took around 10 seconds). In this way all kinds of term patterns can be found, which can be collected in a termbase (terminology database).

Opinion extraction

I will give a very simple example of opinion extraction based on part-of-speech tags. Suppose you want to extract sentences that contain the author’s (or someone else’s) subjective opinion. You can look at the grammatical subject and the verb in a sentence, but you can also look at whether a sentence contains something like ‘too high’ or ‘too volatile’ (which often indicates subjective content). In that case we have the word ‘too’ (an adverb) followed by an adjective, with the mutual relation of adverbial modifier (advmod). In Dutch this has exactly the same form. The following query extracts these sentences.

SELECT ?text
WHERE {
    ?term1 naf-base:hasPos olia:Adjective .
    ?term2 naf-base:hasSpan [ rdf:first/naf-base:hasText "too" ] .
    ?term1 naf-rfunc:advmod ?term2 .
    ?term1 naf-base:hasSpan [ rdf:first ?wf1 ] .
    ?sent1 naf-base:hasSpan [ rdf:rest*/rdf:first ?wf1 ] .
    ?sent1 a naf-base:sentence .
    ?sent1 naf-base:hasText ?text .
}

The last three lines find the text of the sentence that includes the term (the output of the query). With the documents of the website of DNB, the output contains sentences like: “It is also clear that CO2 emissions are still too cheap and must be priced higher to sufficiently curtail emissions” and “Firms end up being too large” (in total 343 sentences in 0.3 seconds).

The examples shown here are just for illustrative purposes and do not always lead to accurate results, but they show that information extraction can be done fairly easily (if you know SPARQL) and reasonably quickly. Once the data is stored in a graph database, named entities can be matched with other internal or external data sources, and lemmas of terms can be matched with concept-based terminology databases. Then you have a graph where the text is available not only on a simple string level but also, and more importantly, on a conceptual level.

EIOPA’s Solvency 2 taxonomy in RDF

To use the metadata from XBRL taxonomies, like labels, hierarchies, template structures and formulas, licensed software is often needed to process the taxonomy and convert the XML content to readable formats. In an earlier blog I showed that it is useful to convert XBRL instance data to a linked data set in RDF and then query that data to retrieve the desired information. In this blog I will show how to do this with taxonomies: with a number of small SPARQL queries the complete Data Point Model (DPM) of (European) XBRL taxonomies can be retrieved.

The main purpose of retrieving metadata in this manner is to be able to use taxonomy metadata in data science environments, for example to apply machine learning models that use taxonomy metadata like hierarchies, or to use concept and element labels from a taxonomy in NLP, for example in Named Entity Recognition tasks to link quantitative reports to (unstructured, or not yet structured) text data.

The lightweight solution that I show here is completely based on open source code, in the form of my xbrl2rdf package. This package converts XBRL instance files and all related taxonomy files (schemas and linkbases) to RDF and RDF-star, and uses only lxml and rdflib for this.

The examples below use the Solvency 2 taxonomy, but other taxonomies work as well. A gist with the notebook containing the code below can be found here.

Importing the data

With the xbrl2rdf package I converted the EIOPA instance example file for quarterly reports for solo undertakings (QRS) to RDF. All taxonomy concepts that are used in that instance are included in the RDF data set. The result can be read into memory with rdflib.

# RDF graph loading
from rdflib import Graph as RDFGraph

path = "../data/rdf/qrs_240_instance.ttl"

g = RDFGraph()
g.parse(path, format='turtle')

print("rdflib Graph loaded successfully with {} triples".format(len(g)))

This returns

rdflib Graph loaded successfully with 1203744 triples

So we have an RDF graph with 1.2 million triples that contains all the data (facts related to concepts in the taxonomy, including all labels, validation rules, template structures, etc.). The original RDF data file is around 64 MB (combining instance and taxonomy triples). Reading and processing this file into an in-memory RDF graph takes some time, but then the graph can easily be queried.

Extracting template URIs

Let’s start with a simple query. Table or template URIs are subjects in the triple “subject xl:type table:table”. To get a list with all templates of an instance (in this case the first five) we run

q = """
  SELECT ?a
  WHERE {
    ?a xl:type table:table .
  }"""
tables = [str(row[0]) for row in g.query(q)]
tables.sort()
tables[0:5]

This returns a list of the URIs of the templates contained in the instance file.

['http://eiopa.europa.eu/xbrl/s2md/fws/solvency/solvency2/2019-07-15/tab/S.01.01.02.01#s2md_tS.01.01.02.01',
 'http://eiopa.europa.eu/xbrl/s2md/fws/solvency/solvency2/2019-07-15/tab/S.01.02.01.01#s2md_tS.01.02.01.01',
 'http://eiopa.europa.eu/xbrl/s2md/fws/solvency/solvency2/2019-07-15/tab/S.02.01.02.01#s2md_tS.02.01.02.01',
 'http://eiopa.europa.eu/xbrl/s2md/fws/solvency/solvency2/2019-07-15/tab/S.05.01.02.01#s2md_tS.05.01.02.01',
 'http://eiopa.europa.eu/xbrl/s2md/fws/solvency/solvency2/2019-07-15/tab/S.05.01.02.02#s2md_tS.05.01.02.02']

Extracting the explicit domains

Next, we extract the explicit domains and related data in the taxonomy. A domain is specific XBRL terminology and denotes a set of elements sharing a specified semantic nature. An explicit domain has its elements enumerated in the taxonomy and can be found as the subject in the triple ‘subject rdf:type model:explicitDomainType’.

q = """
  SELECT DISTINCT ?t ?x1 ?x2 ?x3 ?x4
  WHERE {
    ?t rdf:type model:explicitDomainType .
    ?t xbrli:periodType ?x1 .
    ?t model:creationDate ?x2 .
    ?t xbrli:nillable ?x3 .
    ?t xbrli:abstract ?x4 .
  }"""

The first five domains (of 41 in total) are

index  Domain name  Domain label            period type  creation date  nillable  abstract
0      LB           Lines of businesses     instant      2014-07-07     true      true
1      MC           Main categories         instant      2014-07-07     true      true
2      TI           Time intervals          instant      2014-07-07     true      true
3      AO           Article 112 and 167     instant      2014-07-07     true      true
4      CG           Collaterals/Guarantees  instant      2014-07-07     true      true

Domain names and labels with attributes

So the label of the domain LB is ‘Lines of businesses’; it has been in the taxonomy since the early versions. If a domain is modified, this is also included as a triple in the data set.
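
The results of this query can be collected in a Pandas DataFrame df_domains, which is reused in the next section. A sketch (the domain labels shown above are retrieved separately, with a concept-label query analogous to the one below):

import pandas as pd

# run the domains query defined above; the first column holds the domain
# URIs, which are reused below to retrieve the members of each domain
df_domains = pd.DataFrame(
    data=[list(row) for row in g.query(q)],
    columns=['Domain', 'period type', 'creation date', 'nillable', 'abstract'])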

Extracting domain members

Elements of an explicit domain are called domain members. A domain member (or simply a member) is an enumerated element of an explicit domain. All members of a domain share a certain common nature. To get the members of a domain, we define a function that finds all domain-member relations of a given domain and retrieves the label of each member. In SPARQL this is:

def members(domain):
    q = """
      SELECT DISTINCT ?t ?label
      WHERE {
        ?l arcrole7:domain-member [ xl:from <"""+str(domain)+"""> ;
                                    xl:to ?t ] .
        ?t rdf:type nonnum:domainItemType .
        ?x arcrole3:concept-label [ xl:from ?t ;
                                    xl:to [rdf:value ?label ] ] .
        }"""
    return g.query(q)

All members of all domains can be retrieved by running this function for all domains defined earlier. Based on the output we create a Pandas DataFrame with the results.

from urllib.parse import urldefrag

df_members = pd.DataFrame()
for d in df_domains.iloc[:, 0]:
    data = [[urldefrag(d)[1]]+[urldefrag(row[0])[1]]+list(row[1:]) for row in members(d)]
    columns = ['Domain',
               'Member',
               'Member label']
    df_members = pd.concat([df_members,
                            pd.DataFrame(data=data, columns=columns)])

In total there are 4,879 members across all domains (in this taxonomy).

index  Domain  Member  Member label
0      LB      x0      Total/NA
1      LB      x1      Accident and sickness
2      LB      x2      Motor
3      LB      x3      Fire and other damage to property
4      LB      x4      Aviation, marine and transport

The first five members of the domain LB (Lines of Businesses)

This allows us, for example, to retrieve all facts in a report that are related to the term ‘motor’, because a reported fact contains references to the domain members to which it relates.
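
A sketch of such a query, using the representation of fact contexts shown in the last blog below (member x2 of the LB domain is Motor, and the namespace prefixes are assumed to be bound in the graph):

q = """
  SELECT ?fact ?value
  WHERE {
    ?fact xl:type xbrli:fact ;
          rdf:value ?value ;
          xbrli:context ?context .
    ?context xbrli:scenario [
        xbrldi:explicitMember "s2c_LB:x2"^^rdf:XMLLiteral ] .
  }"""
motor_facts = list(g.query(q))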

Extracting the template structure

Template structures are stored in the taxonomy as a tree of linked elements along the axes x, y and, if applicable, z. The elements have a label and a row-column code (this holds at least for EIOPA and DNB taxonomies), and have a certain depth, i.e. they can be subcategories of other elements. For example, in the Solvency 2 balance sheet template the element ‘Equities’ has subcategories ‘Equities – listed’ and ‘Equities – unlisted’. I have not included the full extraction code here, but with a few lines you can extract the complete template structures as they are stored in the taxonomy (a sketch of the starting point follows after the table below). For example, for the balance sheet (S.02.01.02.01) we get:

index  axis  depth  rc-code  label
1      x     1      C0010    Solvency II value
3      y     1               Assets
4      y     1               Liabilities
5      y     2      R0010    Goodwill
6      y     2      R0020    Deferred acquisition costs
7      y     2      R0030    Intangible assets
8      y     2      R0040    Deferred tax assets
9      y     2      R0050    Pension benefit surplus
10     y     2      R0060    Property, plant & equipment held for own use
11     y     2      R0070    Investments (other than assets held for index-…
12     y     3      R0080    Property (other than for own use)
13     y     3      R0090    Holdings in related undertakings, including pa…
14     y     3      R0100    Equities
15     y     4      R0110    Equities – listed
16     y     4      R0120    Equities – unlisted
17     y     3      R0130    Bonds
18     y     4      R0140    Government Bonds
19     y     4      R0150    Corporate Bonds
20     y     4      R0160    Structured notes
21     y     4      R0170    Collateralised securities
22     y     3      R0180    Collective Investments Undertakings
23     y     3      R0190    Derivatives
24     y     3      R0200    Deposits other than cash equivalents
The first 25 lines of the balance sheet template
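
As a rough indication of the starting point for such an extraction: the tree relations of the templates can be retrieved from the reified breakdown-tree links (a sketch; the arcrole prefix is indicative, and the labels, depths and row-column codes require additional queries):

q = """
  SELECT ?from ?to ?order
  WHERE {
    ?l arcrole:breakdown-tree [ xl:from ?from ;
                                xl:to ?to ;
                                xl:order ?order ] .
  }"""
tree = list(g.query(q))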

With relatively small SPARQL queries it is possible to retrieve metadata from XBRL taxonomies. This works well because we have converted the original taxonomy (in XML) to the linked data format RDF, and this format is especially well suited for representing and querying XBRL data.

The examples above show that it is possible to retrieve the complete Solvency 2 Data Point Model (and more) from the taxonomy in RDF and make it available in a Python environment. This allows the incorporation of metadata in machine learning models and in NLP applications. I hope that this approach will allow more data scientists to use existing metadata from XBRL taxonomies.

Converting XBRL to RDF-star

Lately I have been working on the conversion of XBRL instances and related taxonomy schemas and linkbases to RDF and RDF-star. In these semantic data formats, you can link XBRL data with other data sources and query the data in a fairly easy manner. RDF-star is an extension of RDF that in some situations allows a more compact description of linked data, and thereby narrows the gap between RDF and property graphs. In this blog I will show how this works, using the XBRL taxonomy definitions as an example.

In a previous blog I showed that XBRL instance facts can be converted to RDF and visualized as a network. The same can be done with the related taxonomy elements. An XBRL taxonomy consists of concepts and relations between concepts that define calculations, presentations, labels and definitions. The concepts are laid down (mostly) in XML schemas and the relations in linkbases using XML schemas and XLinks. By converting the XBRL taxonomy to RDF, the XBRL fact data is linked to its corresponding metadata in the taxonomy.

XBRL to RDF

Some work has been done on the conversion of XBRL to RDF, most notably by Dave Raggett. His project xbrlimport, written in C++ and available on SourceForge, converts XBRL data to RDF triples. His approach is clean and straightforward and reuses the original namespaces of the XBRL data (with some obvious elements translated to predicates in RDF namespaces).

I used Raggett’s xbrlimport as a starting point, translated it to Python, added XBRL items that were introduced after the publication of his code, and improved a number of things. The code is now, for example, able to convert all of EIOPA’s Solvency 2 taxonomy elements, with all available metadata, to RDF. This code is available under the same license as xbrlimport (the GNU General Public License) as a Python package on pypi.org. You can take an XBRL instance with its corresponding taxonomy (in the form of a zip file) and convert the contents to RDF and RDF-star. The code will look up any references (URIs) from the XBRL instance to the taxonomy in the zip file and convert the relevant files to RDF.

Let’s look at some examples of the Solvency 2 taxonomy converted to RDF. The RDF triples of an arbitrary XBRL concept from the Solvency 2 taxonomy look like this (in Turtle format):

s2md_met:mi362 
    rdf:type xbrli:monetaryItemType ;
    xbrli:periodType """instant"""^^rdf:XMLLiteral ;
    model:creationDate """2014-07-07"""^^xsd:dateTime ;
    xbrli:substitutionGroup xbrli:item ;
    xbrli:nillable "true"^^xsd:boolean .

This example describes the triples of concept s2md_met:mi362 (a Solvency 2 metric). With these triples we have exactly the same data as in the related XML file but now in the form of triples. Namespaces are derived from the XML file (except rdf:type) and datatypes are transformed to RDF datatypes with proper RDF syntax.

This can be done for all concepts to which the facts of an XBRL instance refer. If you have facts in RDF format, then these concepts are automatically linked with the concepts in the taxonomy, because the URIs of the concepts are the same. This creates a network of facts with all related metadata of the facts.
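
For example, metadata of a concept can then be retrieved together with the facts that refer to it in a single query (a sketch, using the fact representation shown in the last blog below):

SELECT ?fact ?value ?periodType
WHERE {
    ?fact xl:type xbrli:fact ;
          rdf:value ?value ;
          rdf:type ?concept .
    ?concept xbrli:periodType ?periodType .
}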

An XBRL taxonomy also contains links that relate concepts to each other for several purposes (to provide labels, definitions, presentations and calculations). An example of a link is the following.

_:link2 arcrole:concept-label [
    xl:type xl:link ;
    xl:role role2:link ;
    xl:from s2md_met:mi362 ;
    xl:to s2md_met:label_s2md_mi362 ;
    ] .

The link relates concept mi362 with label mi362 by creating a new subject _:link2 with predicate arcrole:concept-label and an object that contains all data about the link (including the xl:from and xl:to and the attributes of the link). This way of introducing a new subject to specify a link between two concepts is called reification, and it is a bit artificial, because you would like to link the concept directly with the label, such as

s2md_met:mi281 arcrole:concept-label s2md_met:label_s2md_mi281

However, you are then unable in RDF to link attributes (like the order and the role) to the predicate. This is one of the disadvantages of the current RDF format. There appears to be no easy way to do this in RDF other than this artificial reification approach (some other solutions exist, like the singleton property approach, but all of them have disadvantages).

The new RDF-star format

Recently, the RDF-star working group published their first Draft Community Report, introducing new RDF-star and SPARQL-star specifications. These specifications, although not yet a W3C standard, enable a more compact description of linked data sets, with simpler graphs and fewer nodes.

Let’s look at what this means for the XBRL linkbases with the following example. Suppose we have the following link definition.

_:link1 arcrole:breakdown-tree [
    xl:from _:s2md_a1 ;
    xl:to _:s2md_a1.root ;
    xl:type xl:link ;
    xl:role tab:S.01.01.02.01 ;
    xl:order "0"^^xsd:decimal ;
    ] .

The subject in this case is _:link1 with predicate arcrole:breakdown-tree, so this link describes part of a table template. It points to a subject with all the information about the link, i.e. from, to, type, role and order from the xl namespace. Note that there is no triple with _:s2md_a1 (the xl:from) as subject and _:s2md_a1.root (the xl:to) as object. So if you want to know the relations of the concept _:s2md_a1, you need to look at the link triples and search for entries where xl:from equals the concept.

With the new RDF-star specifications you can just add the triple and then add properties to the triple as a whole, so the example would read

_:s2md_a1 arcrole:breakdown-tree _:s2md_a1.root .

<<_:s2md_a1 arcrole:breakdown-tree _:s2md_a1.root>> 
    xl:role tab:S.01.01.02.01 ;
    xl:order "0"^^xsd:decimal ;
    .

This is basically what we need to define. If you now want to know the relations of the subject _:s2md_a1, you just look for triples with this subject. In the visual representation of the RDF data set you will see a direct link between the two concepts. The new RDF-star format also implies simplifications of the SPARQL queries.
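
For example, to retrieve the role and order of all breakdown-tree relations you can match on the triples themselves (a sketch in SPARQL-star, which requires an RDF-star-capable store such as GraphDB):

SELECT ?from ?to ?role ?order
WHERE {
    << ?from arcrole:breakdown-tree ?to >> xl:role ?role ;
                                           xl:order ?order .
}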

This blog has become a bit technical, but I hope you see that the RDF-star specification allows a much-needed simplification of RDF triples. I showed that the conversion of XBRL taxonomies to RDF-star leads to a smaller number of triples and to less complex triples. The resulting taxonomy triples lead to less complex graphs and can be used to derive the XBRL labels, template structures, validation rules and definitions, just by using SPARQL queries.

Converting supervisory reports to Semantic Webs: from XBRL to RDF

A growing number of supervisory reports across Europe are based on the XML-based Extensible Business Reporting Language (XBRL) standard. Financial entities such as banks, insurance undertakings and pension institutions are required to submit their reports to their supervisors in this format.

XBRL is a language for modeling, exchanging and automatically processing business and financial information. Reports in this format (called instance documents) are based on metadata (set out in taxonomies) that add semantic meaning to the data points that are reported. You can choose different implementations but overall an XBRL taxonomy provides a semantically rich data model and that has always been one of the main advantages of XBRL.

However, in its raw format (an XML file) each report is basically a machine readable document with a tree structure that does not enable easy integration with related data from other sources or integration with text documents and their contents.

In this blog, I will show that converting XBRL reports to another format allows easier integration and understanding. That other format is based on Semantic Webs. It has been shown that XBRL can be converted to a Semantic Web without any loss of information (see for example this article). So if we convert the XBRL format to a Semantic Web, we keep the structure and the meaning provided by the taxonomy. The result is basically a graph, and this format enables much easier integration with other linked data.

A Semantic Web consists of formats and technologies that are rather old (from a computer science perspective): it originated around the same time as XBRL, some twenty years ago. And because it tried to solve problems (the lack of semantic meaning in the World Wide Web) similar to those addressed by the XBRL standard (the lack of semantic meaning in business and financial data), it is to some extent based on similar concepts. It was, however, developed completely separately from XBRL.

The general concept of a Semantic Web, where data is linked together to provide semantic meaning, is also known as a knowledge graph.

How does a Semantic Web work? One of the formats of the Semantic Web is the Resource Description Framework (RDF), originally designed as a metadata data model. RDF was adopted as a World Wide Web Consortium recommendation in 1999. The RDF 1.0 specification was published in 2004, and RDF 1.1 followed in 2014.

The RDF format is based on expressions in the form of subject-predicate-object, called triples. The subject and object denote (web) resources and the predicate denotes the relationship between the subject and the object. For example the expression ‘Spinoza has written the book Ethica Ordine Geometrico Demonstrata’ in RDF is a triple with a subject denoting “Spinoza”, a predicate denoting “has written”, and an object denoting “the book Ethica Ordine Geometrico Demonstrata”. This is a different approach from, for example, object-oriented models with an entity (Spinoza), attribute (book) and value (Ethica).

The RDF format could potentially solve some problems with the XBRL format. To explain this, I converted an XBRL instance (a test instance file from EIOPA for Solvency 2) to the RDF format.

Below you see the representation of one arbitrary data point in the report (called a fact) in RDF format and visualized as a network (I used the Python package networkx). The predicates contain the complete web resource so I limited the name to the last word to make it readable.

The red node is the starting point of the data point. The red labels on the lines describe the predicates between subjects and objects. You see that the fact (subject) ‘has decimals’ (predicate) 2 (object), and furthermore has unit EUR, has value 838522076.03, has type s2md_met:mi503 (an internal code describing payments for reported but not settled claims) and some other properties.

The data point also has a so-called context that defines the entity to which the fact applies, the period of time the fact is relevant (in this case 2019-12-31) and also a scenario, which consists of additional metadata of the data point. In this case we see that the data point is related to statutory accounts, non-life and health non-STL, direct business and accepted during the period (and a node without a label).

All facts in every XBRL instance are structured in this way, which means that, for example, you can search all facts with the label statutory accounts. Furthermore, because XBRL uses namespaces, you can unambiguously identify predicates and objects in the report. For example, you see that the entity node has an identifier (starting with 0LFF1…) and a scheme (17442). The scheme refers to the web resource of ISO standard 17442, which specifies the Legal Entity Identifier (LEI), so the entity is unambiguously identified by the given (LEI) code. If you add other XBRL instances with references to that entity, the data is automatically linked, because those instances will contain exactly the same entity node.

The RDF representation of the XBRL fact above is:

_:provenance1 xl:instance "filename" .
_:unit_u xbrli:measure iso4217:EUR .
_:fact926
  xl:provenance _:provenance1 ;
  xl:type xbrli:fact ;
  rdf:type s2md_met:mi503 ;
  rdf:value "838522076.03"^^xsd:decimal ;
  xbrli:decimals "2"^^xsd:integer ;
  xbrli:unit _:unit_u ;
  xbrli:context _:context_BLx79_DIx5_IZx1_TBx28_VGx84 .
_:context_BLx79_DIx5_IZx1_TBx28_VGx84
  xl:type xbrli:context ;
  xbrli:entity [
    xbrli:identifier "0LFF1WMNTWG5PTIYYI38" ;
    xbrli:scheme <http://standards.iso.org/iso/17442> ;
  ] ;
  xbrli:scenario [
    xbrldi:explicitMember "s2c_LB:x79"^^rdf:XMLLiteral ;
    xbrldi:explicitMember "s2c_DI:x5"^^rdf:XMLLiteral ;
    xbrldi:explicitMember "s2c_RT:x1"^^rdf:XMLLiteral ;
    xbrldi:explicitMember "s2c_LB:x28"^^rdf:XMLLiteral ;
    xbrldi:explicitMember "s2c_AM:x84"^^rdf:XMLLiteral ;
  ] ;
  xbrli:instant "2019-12-31"^^xsd:date .

Instead of storing the data in separate templates with often unclear code names you can also convert the XBRL data to one large Semantic Web where all facts are linked together. The RDF format thus provides a graph model which allows easier integration and visualization (and, for me at least, easier understanding). It allows adding and linking data from other sources, such as Solvency 2 documents and external data, in the same graph.

Typically, supervisory reports consist of thousands of data points, and supervisors receive reports from many entities each period. How would you store all that information? I think the natural way to store an XBRL instance is not a relational database but a graph database (like GraphDB or Neo4j). These databases can store the facts with all their metadata in a structured way and enable efficient querying of the graph. In the next blog, I will explore graph databases and query languages for XBRL reports converted to the RDF format.