Tag Archives: XBRL

Multilingual termbases with metadata from reporting templates

Published by:

Domain-specific termbases are of great importance to many domain-specific NLP-tasks. They enable identification and annotation of terms in documents in situations where often not enough text is available to use statistical approaches. And more importantly, they form a step towards extracting structured facts from unstructured text data.

This blog shows how to construct and use multilingual termbases to annotate text from supervisory documents in different European languages with references to relevant parts of (quantitative) supervisory templates. By linking qualitative data (text) to quantitative data (numbers) we connect initially unstructured text data to data that are often to a high degree structured and well-defined in data point models and taxonomies. I will do this by constructing termbases that contain terminology data combined with linguistic data and metadata from the supervisory templates from different financial sectors.

For terminology data I will start with the IATE-database. Most terminology that is used in European quantitative reporting templates is based on and derived from European legislation. Having multilingualism as one of its founding principles is, the EU publishes terminology in the IATE-database in all official European languages to provide consistent translation of terms in European legislation. The IATE-database is published in the form of a file in TBX-format (TermBase eXchange). But termbases can also be in form of SKOS (Simple Knowledge Organization System, built upon the RDF-format). Both formats are data models that contain descriptions and properties of concepts and are to some extent interchangeable (see for example here).

For metadata on reporting templates I will use relevant XBRL Taxonomies in RDF-format (see here). Normally, XBRL Taxonomy are developed specifically for a single sector and therefore covers to some extent the financial terminology used within that sector. XBRL Taxonomies contain metadata of all data point in the reporting templates. From a XBRL Taxonomy a Data Point Model can be derived (that is: the taxonomy contains all definitions) and is often published together with the taxonomy which is only computer readable.

For linguistic data I will use the Python NLP package of Stanford Stanza, which provide pretrained NLP-models for all official European languages (in order of becoming an official EU language: Dutch, French, German, Italian (1958), Danish, English (1973), Greek (1981), Portuguese, Spanish (1986), Finnish, Swedish (1995), Czech, Estonian, Hungarian, Latvian, Lithuanian, Maltese, Polish, Slovak, Slovenian (2004), Bulgarian, Irish, Romanian (2007) and Croatian (2013)).

So we add semantic and linguistic structure to a terminology database. The resulting data structure is sometimes called an ontology, a taxonomy or a vocabulary, but these terms have no clear distinctive definitions. And moreover, the XBRL-people use the term taxonomy to refer to a structure that contains concepts with properties for definitions, labels, calculations and (table) presentations. To some extent it contains structured metadata of data points (i.e. the semantics of the data). Because of that you can say that it corresponds to an ontology within a Linked Data context. On the other hand a taxonomy within a Linked Data context (and everywhere else if I might add) is basically a description of concepts with sub-class relationships (concepts with hierarchical information). In the remainder of this blog I will use the term termbase for the resulting structure with semantic, linguistic and terminological data combined.

Constructing a termbase from IATE and XBRL

In a previous blog I have described how to set up a terminology database (termbase) specifically for insurance-related terms. Now I will add links from the concepts and terms in the IATE-database to the data point model of insurance reporting templates (thereby adding semantics to the termbase), and secondly I will add linguistic information at term-level like lemmas and part-of-speech tags to allow for easy usage in NLP-tasks. The TBX-format in which the IATE-database is published allows for storing references as well as linguistic data on term-level, so we can construct the termbase as a standalone file in TBX-format (another solution would be to add the terminology and linguistic information to the XBRL Taxonomy and use that as a basis).

The IATE-database currently contains almost 930.000 concepts and many of them have verbal expressions in multiple languages resulting in over 8.000.000 terms. A single (English) expression of a concept in the IATE-database looks like this.

<conceptEntry id="3539858">
  <descrip type="subjectField">insurance</descrip>
  <langSec xml:lang="en">
      <term>basic own funds</term>
      <termNote type="termType">fullForm</termNote>
      <descrip type="reliabilityCode">9</descrip>

Adding labels from the XBRL Taxonomy

For the termbase, we add every element in the XBRL Taxonomy that has a label (tables, concepts, elements, dimensions and members) to the termbase and we add an external cross reference to the template and the location in that template where the element is used (the row or column within the template). The TBX-format allows for fields called externalCrossReference which refer to a resource that is external to the terminology database. Then you get concept entries like this:

<conceptEntry id="http://eiopa.europa.eu/xbrl/s2md/fws/solvency/solvency2/2021-07-15/tab/s.">
  <descrip type="xbrlTypes">element</descrip>
  <xref type="externalCrossReference">S.,R0020</xref>
  <langSec xml:lang="en">
      <term>Basic own funds</term>
      <termNote type="termType">fullForm</termNote>
      <termNote type="termType">shortForm</termNote>

This means that the termbase concept with the id “http://eiopa.europa.eu/xbrl/…/s.” has English labels “Basic own funds” (full form) and “R0020” (short form). These are the labels of row 0020 of template S., i.e. in the template definition you will see Basic own funds on row 0020.

The template and rc-codes where elements refer to were extracted using SPARQL-queries on the XBRl Taxonomy in RDF-format.

Adding references between IATE- and XBRL-concepts

Now that we have added terms for labelled elements from the XBRL Taxonomy, the next step is to add cross references between the IATE-concepts and the XBRL-concepts. In the TBX-format the crossReference is a pointer to another related location, such as another entry or another term, in this case to a XBRL-concept. Below the references to the XBRL-concepts are added.

<conceptEntry id="3539858">
  <descrip type="subjectField">insurance</descrip>
  <langSec xml:lang="en">
      <term>basic own funds</term>
      <termNote type="termType">fullForm</termNote>
      <descrip type="reliabilityCode">9</descrip>
  <ref type="crossReference">http://eiopa.europa.eu/xbrl/s2c/dict/dom/el#x7</descrip>
  <ref type="crossReference">http://eiopa.europa.eu/xbrl/s2md/fws/solvency/solvency2/2021-07-15/tab/s.</descrip>
  <ref type="crossReference">http://eiopa.europa.eu/xbrl/s2md/fws/solvency/solvency2/2021-07-15/tab/s.</descrip>
  <ref type="crossReference">http://eiopa.europa.eu/xbrl/s2md/fws/solvency/solvency2/2021-07-15/tab/s.</descrip>

The IATE-concept with id 3539858 points to the domain item el#x7 (a domain in XBRL is a set of related items, in this case own funds items), and furthermore to (table) elements s., s. and s. These all refer to a single row or a single column within a template and the last one is given as an example above. It is the row in the table S. with label ‘Basic own funds’. IATE-concepts and XBRL-concepts are considered equal when their lowercase lemmas are the same (except for abbreviations).

For XBRL Taxonomy of Solvency 2 we find in this way 740 unique terms in the IATE-database that are identical to a XBRL-concept (in English), and 1.500 terms that occur in the labels of XBRL-concepts but are not identical, for example ‘net best estimate’ is a part of the label ‘Net Best Estimate of Premium Provisions’. For now I focused on identifying identical terms. How to process long labels with more than one term is a remaining challenge (this describes probably a useful approach).

Adding part-of-speech tags and lemmas

To make the termbase applicable for NLP-tasks we need to add additional linguistic information, such as part-of-speech patterns and lemmas. This improves the annotation process later on. Premium as an adjective has another meaning than premium as a common noun. By matching the PoS-pattern we get better matches (i.e. less false positives). And also the lemmas of a term will find terms irrespective of whether they are in grammatical singular or plural form. This is the concept to which PoS-patterns and lemma are added:

<conceptEntry id="3539858">
  <descrip type="subjectField">insurance</descrip>
  <langSec xml:lang="en">
      <term>basic own funds</term>
      <termNote type="termType">fullForm</termNote>
      <descrip type="reliabilityCode">9</descrip>
      <termNote type="termLemma">basic own fund</termNote>
      <termNote type="partOfSpeech">adj, adj, noun</termNote>
  <langSec xml:lang="it">
      <term>fondi propri di base</term>
      <termNote type="termType">fullForm</termNote>
      <descrip type="reliabilityCode">9</descrip>
      <termNote type="termLemma">fondo proprio di base</termNote>
      <termNote type="partOfSpeech">noun, det, adp, noun</termNote>

Annotating text in different languages

Because the termbase contains multilingual terms with references to templates and locations we are now able to annotate terms in documents in different European languages. Below you find some examples of what you can do with the termbase. I took an identical text from the Solvency 2 Delegated Acts in English, Finnish and Italian (the first part of article 52), converted the text to NAF and added annotations with the termbase (Nafigator has a function that processes a complete termbase and adds annotations to the NAF-file). This results in the following (using a visualizer from the spaCy package):

In English

Article 52 Mortality riskTerm S.,R0100 stress The mortality riskTerm S.,R0100 stress referred to in Article 77b(1)(f) of Directive 2009/138/EC shall be the more adverse of the 1. following two scenarios in terms of its impact on basic own fundsTerm S.,R0020: (a) an instantaneous permanent increase of 15 % in the mortality rates used for the calculation of the best estimateTerm S.,R0540; (b) an instantaneous increase of 0.15 percentage points in the mortality rates (expressed as percentages) which are used in the calculation of technical provisionsTerm S.,R0010 to reflect the mortality experience in the following 12 months.

In Finnish

52 artikla KuolevuusriskiinTerm S.,R0100 liittyvä stressi Direktiivin 2009/138/EY 77 b artiklan 1 kohdan f alakohdassa tarkoitetun, kuolevuusriskiinTerm S.,R0100 liittyvän stressin on 1. oltava seuraavista kahdesta skenaariosta se, jonka epäsuotuisa vaikutus omaan perusvarallisuuteenTerm S.,R0020 on suurempi: a) välitön, pysyvä 15 %:n nousu parhaan estimaatinTerm S.,R0540 laskennassa käytetyssä kuolevuudessa; b) välitön 0,15 prosenttiyksikön nousu prosentteina ilmaistussa kuolevuudessa, jota käytetään vakuutusteknisen vastuuvelanTerm S.,R0010 laskennassa ilmentämään havaittua kuolevuutta seuraavien 12 kuukauden aikana. Sovellettaessa 1 kohtaa kuolevuuden nousua sovelletaan ainoastaan vakuutuksiin, joissa kuolevuuden nousu johtaa

In German

Artikel 52 SterblichkeitsrisikostressTerm S.,R0100 Der in Artikel 77b Absatz 1 Buchstabe f der Richtlinie 2009/138/EG genannte SterblichkeitsrisikostressTerm S.,R0100 ist das im 1. Hinblick auf die Auswirkungen auf die BasiseigenmittelTerm S.,R0020 ungünstigere der beiden folgenden Szenarien: (a) plötzlicher dauerhafter Anstieg der bei der Berechnung des besten Schätzwerts zugrunde gelegten Sterblichkeitsraten um 15 %; (b) plötzlicher Anstieg der bei der Berechnung der versicherungstechnischen RückstellungenTerm S.,R0010 zugrunde gelegten Sterblichkeitsraten (ausgedrückt als Prozentsätze) um 0,15 Prozentpunkte, um die Sterblichkeit in den folgenden zwölf Monaten widerzuspiegeln.

In French

Article 52 Choc de risque de mortalitéTerm S.,R0100 Le choc de risque de mortalitéTerm S.,R0100 visé à l’article 77 ter, paragraphe 1, point f), de la directive 2009/138/CE 1. correspond au plus défavorable des deux scénarios suivants en termes d’impact sur les fonds propres de baseTerm S.,R0020: (a) une hausse permanente soudaine de 15 % des taux de mortalité utilisés pour le calcul de la meilleure estimationTerm S.,R0540; (b) une hausse soudaine de 0,15 point de pourcentage des taux de mortalité (exprimés en pourcentage) qui sont utilisés dans le calcul des provisions techniquesTerm S.,R0010 pour refléter l’évolution de la mortalité au cours des 12 mois à venir.

In Swedish

Artikel 52 DödsfallsriskstressTerm S.,R0100 Den dödsfallsriskstressTerm S.,R0100 som avses i artikel 77b.1 f i direktiv 2009/138/EG ska vara det som är mest negativt av 1. följande två scenarier i fråga om dess påverkan på kapitalbasen: (a) En omedelbar permanent ökning på 15 % av dödligheten som används för beräkning av bästa skattningenTerm S.,R0540. (b) En omedelbar ökning på 0,15 % av dödstalen (uttryckta i procent) som används i beräkningen av försäkringstekniska avsättningarTerm S.,R0010 för att återspegla dödligheten under de följande tolv månaderna.

In Italian

Articolo 52 Stress legato al rischio di mortalitàTerm S.,R0100 Lo stress legato al rischio di mortalitàTerm S.,R0100 di cui all’articolo 77 ter, paragrafo 1, lettera f), della direttiva 2009/138/CE è 1. il più sfavorevole dei due seguenti scenari in termini di impatto sui fondi propri di baseTerm S.,R0020: (a) un incremento permanente istantaneo del 15 % dei tassi di mortalità utilizzati per il calcolo della migliore stima; (b) un incremento istantaneo di 0,15 punti percentuali dei tassi di mortalità (espressi in percentuale) utilizzati nel calcolo delle riserve tecnicheTerm S.,R0010 per tener conto dei dati tratti dall’esperienza relativi alla mortalità nei 12 mesi successivi. Ai fini del paragrafo 1, l’incremento dei tassi di mortalità si applica soltanto alle polizze di assicurazione per le 2. quali tale incremento comporta un aumento delle riserve tecnicheTerm S.,R0010 tenendo conto di tutto quanto segue:

In the Italian text one reference is missing: the IATE-database does not yet contain an Italian translation for the English term best estimate. This happens because the IATE-database is far from complete. Not all terms are available in all languages, and the IATE-database probably does not contain all terminology from the reporting templates. And although the IATE-database is constantly updated, it might be necessary for certain use cases to add additional translations and concepts to the database.

This works for every XBRL Taxonomy, although every taxonomy has its own peculiarities (it’s a “standard” as one of my IT-colleagues likes to say). Following the procedure described above, I made the following termbases for insurance undertakings, credit institutions and Dutch pension funds based on the taxonomy versions mentioned:

  • Termbase of EIOPA Solvency 2, taxonomy version 2.6.0
  • Termbase of EBA CRD IV, taxonomy version
  • Termbase of DNB FTK, taxonomy version 2.3.0

These can be found on data.world.

EIOPA’s Solvency 2 taxonomy in RDF

Published by:

To use the metadata from XBRL taxonomies, like labels, hierarchies, template structures and formulas, often licensed software is needed to process the taxonomy and convert the XML content to readable formats. In an earlier blog I have shown that it is useful to convert XBRL instance data to a linked data set in RDF and then query that data to retrieve the desired information. In this blog I will show how to do this with taxonomies: by using a number of small SPARQL queries the complete Data Point Model (DPM) of (European) XBRL taxonomies can be retrieved.

The main purpose of retrieving metadata in this manner is to be able to use taxonomy metadata in data science environments, for example to be able to apply machine learning models that use taxonomy metadata like hierarchies and to use concept and element labels from a taxonomy in NLP, for example Named Entity Recognition tasks to link quantitative reports to (unstructured, or: not yet structured) text data.

The lightweight solution that I show here is completely based on open source code, in the form of my xbrl2rdf package. This package converts XBRL instance files and all related taxonomy files (schemas and linkbases) to RDF and RDF-star and uses for this lxml and rdflib, and nothing else.

The examples below use the Solvency 2 taxonomy, but other taxonomies works as well. Gist with the notebook with the code below can be found here.

Importing the data

With the xbrl2rdf-package I have converted the EIOPA instance example file for quarterly reports for solo undertaking (QRS) to RDF. All taxonomy concepts that are used in that instance are included in the RDF data set. This result can be read with rdflib into memory.

# RDF graph loading
path = "../data/rdf/qrs_240_instance.ttl"

g = RDFGraph()
g.parse(path, format='turtle')

print("rdflib Graph loaded successfully with {} triples".format(len(g)))

This returns

rdflib Graph loaded successfully with 1203744 triples

So we have a RDF graph with 1.2 million triples that contains all data (facts related to concepts in the taxonomy, including all labels, validation rules, template structures, etc). The original RDF data file is around 64 Mb (combining instance and taxonomy triples). Reading and processing this file into an in-memory RDF graph takes some time, but then the graph can easily be queried.

Extracting template URIs

Let’s start with a simple query. Table or template URIs are subjects in the triple “subject xl:type table:table”. To get a list with all templates of an instance (in this case the first five) we run

q = """
    ?a xl:type table:table .
tables = [str(row[0]) for row in g.query(q)]

This returns a list of the URIs of the templates contained in the instance file.


Extracting the explicit domains

Next, we extract the explicit domains and related data in the taxonomy. A domain is specific XBRL terminology and means a set of elements sharing a specified semantic nature. An explicit domain has its elements enumerated in the taxonomy and can be found with the subject in the triple ‘subject rdf:type model:explicitDomainType’.

q = """
  SELECT DISTINCT ?t ?x1 ?x2 ?x3 ?x4
    ?t rdf:type model:explicitDomainType .
    ?t xbrli:periodType ?x1 .
    ?t model:creationDate ?x2 .
    ?t xbrli:nillable ?x3 .
    ?t xbrli:abstract ?x4 .

The first five domains (of 41 in total) are

indexDomain nameDomain labelperiod typecreation datenillableabstract
0LBLines of businessesinstant2014-07-07truetrue
1MCMain categoriesinstant2014-07-07truetrue
2TITime intervalsinstant2014-07-07truetrue
3AOArticle 112 and 167instant2014-07-07truetrue
Domain names and labels with attributes

So, the label of the domain LB is Lines of businesses; it has been there since the early versions of the taxonomy. If a domain is modified then this is also included as a triple in the data set.

Extracting domain members

Elements of an explicit domain are called domain members. A domain member (or simply a member) is enumerated element of an explicit domain. All members from a domain share a certain common nature. To get the members of a domain, we define a function that finds all domain-members relations of a given domain and retrieve the label of the member. In SPARQL this is:

def members(domain):
    q = """
      SELECT DISTINCT ?t ?label
      WHERE {
        ?l arcrole7:domain-member [ xl:from <"""+str(domain)+"""> ;
                                    xl:to ?t ] .
        ?t rdf:type nonnum:domainItemType .
        ?x arcrole3:concept-label [ xl:from ?t ;
                                    xl:to [rdf:value ?label ] ] .
    return g.query(q)

All members of all domains can be retrieved by running this function for all domains defined earlier. Based on the output we create a Pandas DataFrame with the results.

df_members = pd.DataFrame()
for d in df_domains.iloc[:, 0]:
    data = [[urldefrag(d)[1]]+[urldefrag(row[0])[1]]+list(row[1:]) for row in members(d)]
    columns = ['Domain',
               'Member label']
    df_members = df_members.append(pd.DataFrame(data=data,

In total there are 4.879 members of all domains (in this taxonomy).

index DomainMemberMember label
1LBx1Accident and sickness
3LBx3Fire and other damage to property
4LBx4Aviation, marine and transport
The first five member of the domain LB (Lines of Businesses)

This allows us, for example, to retrieve all facts in a report that are related to the term ‘motor’ because a reported fact contains references to the domain members to which the fact relates.

Extracting the template structure

Template structures are stored in the taxonomy as a tree of linked elements along the axes x, y and, if applicable, z. The elements have a label and a row-column code (this holds at least for EIOPA and DNB taxonomies), and have a certain depth, i.e. they can be subcategories of other elements. For example in the Solvency 2 balance sheet template, the element ‘Equities’ has subcategories ‘Equities – listed’ and ‘Equities unlisted’. I have not included the code here but with a few lines you can extract the complete template structures as they are stored in the taxonomy. For example for the balance sheet (S. we get:

1x1C0010Solvency II value
3y1 Assets
4y1 Liabilities
6y2R0020Deferred acquisition costs
7y2R0030Intangible assets
8y2R0040Deferred tax assets
9y2R0050Pension benefit surplus
10y2R0060Property, plant & equipment held for own use
11y2R0070Investments (other than assets held for index-…
12y3R0080Property (other than for own use)
13y3R0090Holdings in related undertakings, including pa…
15y4R0110Equities – listed
16y4R0120Equities – unlisted
18y4R0140Government Bonds
19y4R0150Corporate Bonds
20y4R0160Structured notes
21y4R0170Collateralised securities
22y3R0180Collective Investments Undertakings
24y3R0200Deposits other than cash equivalents
The first 25 lines of the balance sheet template

With relatively small SPARQL queries, it is possible to retrieve metadata from XBRL taxonomies. This works well because we have converted the original taxonomy (in xml) to the linked data format RDF; and this format is especially well suited for representing and querying XBRL data.

The examples above show that it is possible to retrieve the complete Solvency 2 Data Point Model (and more) from the the taxonomy in RDF and make it available in a Python environment. This allows incorporation of metadata in machine learning models and in NLP applications. I hope that this approach will allow more data scientists to use existing metadata from XBRL taxonomies.

Converting XBRL to RDF-star

Published by:

Lately I have been working on the conversion of XBRL instances and related taxonomy schemas and linkbases to RDF and RDF-star. In these semantic data formats, you can link data in XBRL data with other data sources and you can query the data in a fairly easy manner. RDF-star is an extension of RDF that in some situations allows a more compact description of linked data, and by that it narrows the gap between RDF and property graphs. How this works, I will show in this blog using the XBRL taxonomy definitions as an example.

In a previous blog I showed that XBRL instance facts can be converted to RDF and visualized as a network. The same can be done with the related taxonomy elements. An XBRL taxonomy consists of concepts and relations between concepts that define calculations, presentations, labels and definitions. The concepts are laid down (mostly) in XML schemas and the relations in linkbases using XML schemas and XLinks. By converting the XBRL taxonomy to RDF, the XBRL fact data is linked to its corresponding metadata in the taxonomy.


There has been done some work on the conversion of XBRL to RDF, most notably by Dave Raggett. His project xbrlimport, written in C++ and available on SourceForge, converts XBRL data to RDF triples. His approach is clean and straightforward and reuses the original namespaces of the XBRL data (with some obvious elements translated to predicates with RDF namespaces).

I used Raggett’s xbrlimport as a starting point, translated it to Python, added XBRL items that were introduced after publication of the code and improved a number of things. The code is now for example able to convert all EIOPA’s Solvency 2 taxonomy elements with all metadata available to RDF format. This code is available under the same license as xbrlimport (GNU General Public License) as a Python Package on pypi.org. You can take an XBRL instance with corresponding taxonomy (in the form of a zip-file) and convert the contents to RDF and RDF-star. This code will look up any references (URIs) in the XBRL instance to the taxonomy in the zip-file and convert the relevant files to RDF.

Let’s look at some examples of the Solvency 2 taxonomy converted to RDF. The RDF triple of an arbitrary XBRL concept from the Solvency 2 taxonomy looks like this (in turtle format):

    rdf:type xbrli:monetaryItemType ;
    xbrli:periodType """instant"""^^rdf:XMLLiteral ;
    model:creationDate """2014-07-07"""^^xsd:dateTime ;
    xbrli:substitutionGroup xbrli:item ;
    xbrli:nillable "true"^^xsd:boolean ;

This example describes the triples of concept s2md_met:mi362 (a Solvency 2 metric). With these triples we have exactly the same data as in the related XML file but now in the form of triples. Namespaces are derived from the XML file (except rdf:type) and datatypes are transformed to RDF datatypes with proper RDF syntax.

This can be done with all concepts used to which the facts of an XBRL instance refer. If you have facts in RDF format, then in RDF these concept are automatically linked with the concepts in the taxonomy because the URIs of the concepts are the same. This creates a network of facts with all related metadata of the facts.

An XBRL taxonomy also contains links that relate concepts to each other for several purposes (to provide labels, definitions, presentations and calculations) . An example of a link is the following.

_:link2 arcrole:concept-label [
    xl:type xl:link ;
    xl:role role2:link ;
    xl:from s2md_met:mi362 ;
    xl:to s2md_met:label_s2md_mi362 ;
    ] .

The link relates concept mi362 with label mi362 by creating a new subject _:link2 with predicate arcrole:concept-label and an object which contains all data about the link (including the xl:from and xl:to and the attributes of the link). This way of introducing a new subject to specify a link between two concepts is called reification and a bit artificial because you would like to link the concept directly with the label, such as

s2md_met:mi281 arcrole:concept-label s2md_met:label_s2md_mi281

However, then you are unable in RDF to link the attributes (like the order and the role) to the predicates. It is one of the disadvantages of the current RDF format. There appears to be no easy way to do this in RDF, other than by using this artificial reification approach (some other solutions exist like the singleton property approach, but all of them have disadvantages.)

The new RDF-star format

Recently, the RDF-star working group published their first Draft Community Report. In this report they introduced new RDF-star and SPARQL-star specifications. These new specifications, although not yet a W3C standard, enable more compact specification of linked datasets and simpler graphs and less nodes.

Let’s look what this means for the XBRL linkbases with the following example. Suppose we have the following link definition.

_:link1 arcrole:breakdown-tree [
    xl:from _:s2md_a1 ;
    xl:to _:s2md_a1.root ;
    xl:type xl:link ;
    xl:role tab:S. ;
    xl:order "0"^^xsd:decimal ;
    ] .

The subject in this case is _:link1 with predicate arcrole:breakdown-tree, so this link describes a part of a table template. It points to a subject with all the information of the link, i.e. from, to, type, role and order from the xl namespace. Note that there is no triple with _:s2md_a1 (xl:from) as a subject and _:s2md_a1.root (xl:to) as an object. So if you want to know the relations of the concept _:s2md_a1 you need to look at the link triples and look for entries where xl:from equals the concept.

With the new RDF-star specifications you can just add the triple and then add properties to the triple as a whole, so the example would read

_:s2md_a1 arcrole:breakdown-tree _:s2md_a1.root .

<<_:s2md_a1 arcrole:breakdown-tree _:s2md_a1.root>> 
    xl:role tab:S. ;
    xl:order "0"^^xsd:decimal ;

Which is basically what we need to define. If you now want to know the relations of the subject _:s2md_a1 then you just look for triples with this subject. In the visual presentation of the RDF dataset you will see a direct link between the two concepts. This new RDF format also implies simplifications of the SPARQL queries.

This blog has become a bit technical but I hope you see that the RDF-star specification allows a much needed simplification of RDF triples. I showed that the conversion of XBRL taxonomies to RDF-star leads to a smaller amount of triples and also to less complex triples. The resulting taxonomy triples lead to less complex graphs and can be used to derive the XBRL labels, template structures, validation rules and definitions, just by using SPARQL queries.

Converting supervisory reports to Semantic Webs: from XBRL to RDF

Published by:

A growing number of supervisory reports across Europe are based on the XML Extensible Business Reporting Language standard (XBRL). Financial entities such as banks, insurance undertakings and pension institutions are required to submit their reports to their supervisors in this format.

XBRL is a language for modeling, exchanging and automatically processing business and financial information. Reports in this format (called instance documents) are based on metadata (set out in taxonomies) that add semantic meaning to the data points that are reported. You can choose different implementations but overall an XBRL taxonomy provides a semantically rich data model and that has always been one of the main advantages of XBRL.

However, in its raw format (an XML file) each report is basically a machine readable document with a tree structure that does not enable easy integration with related data from other sources or integration with text documents and their contents.

In this blog, I will show that converting the XBRL reports to another format allows easier integration and understanding. That other format is based on Semantic Webs. It has been shown that XBRL converted to Semantic Webs can be done without any loss of information (see for example this article). So if we convert the XBRL format to a Semantic Web then we keep the structure and the meaning provided by the taxonomy. The result is basically a graph and this format enables integration with other linked data that is much easier.

A Semantic Web consists of formats and technologies that are rather old (from a computer science perspective): it originated around the same time as XBRL, some twenty years ago. And because it tried to solve similar problems (lack of semantic meaning in the World Wide Web) as the XBRL standard (lack of semantic meaning in business and financial data), to some extent it is based on similar concepts. It was however developed completely separate from XBRL.

The general concept of a Semantic Webs, where data is linked together to provide semantic meaning, is also known as a knowledge graph.

How does a Semantic Web work? One of the formats of the Semantic Web is the Resource Description Framework (RDF), originally designed as a metadata data model. RDF was adopted as a World Wide Web Consortium recommendation in 1999. The RDF 1.0 specification was published in 2004, and RDF 1.1 followed in 2014.

The RDF format is based on expressions in the form of subject-predicate-object, called triples. The subject and object denote (web) resources and the predicate denotes the relationship between the subject and the object. For example the expression ‘Spinoza has written the book Ethica Ordine Geometrico Demonstrata’ in RDF is a triple with a subject denoting “Spinoza”, a predicate denoting “has written”, and an object denoting “the book Ethica Ordine Geometrico Demonstrata”. This is a different approach than for example object-oriented models with an entity (Spinoza), attribute (book) and value (Ethica).

The RDF format could potentially solve some problems with the XBRL format. To explain this, I converted an XBRL-instance (a test instance file from EIOPA for Solvency 2) to RDF format.

Below you see the representation of one arbitrary data point in the report (called a fact) in RDF format and visualized as a network (I used the Python package networkx). The predicates contain the complete web resource so I limited the name to the last word to make it readable.

The red node is the starting point of the data point. The red labels on the lines describe the predicate between subject and object. You see that the fact (subject) ‘has decimals’ (predicate) 2 (object), and furthermore has unit EUR, has value 838522076.03, has type metmi503 (an internal code describing Payments for reported but not settled claims) and some other properties.

The data point also has a so-called context that defines the entity to which the fact applies, the period of time the fact is relevant (in this case 2019-12-31) and also a scenario, which consists of additional metadata of the data point. In this case we see that the data point is related to statutory accounts, non-life and health non-STL, direct business and accepted during the period (and a node without a label).

All facts in every XBRL instance are structured in this way, which means that for example you can search all facts with the label statutory accounts. Furthermore, because XBRL uses namespaces you can unambiguously identify predicates and objects in the report. For example, you see that the entity node has an identifier (starting with 0LFF1…) and a scheme (17442). The scheme refers to the web resource for the ISO standard 17442 which specifies the Legal Entity Identifier (LEI), so the entity is unambiguously identified with the given (LEI-)code. If you add other XBRL instances with references to that entity then the data is automatically linked because other instances will contain exactly the same entity node.

The RDF representation of the XBRL fact above is:

_:provenance1 xl:instance "filename".
_:unit_u xbrli:measure iso4217:EUR.
  xl:provenance :provenance1;
  xl:type xbrli:fact; 
  rdf:type s2md_met:mi503;
  rdf:value "838522076.03"^^xsd:decimal;
  xbrli:decimals "2"^^xsd:integer;
  xbrli:unit :unit_u; 
  xbrli:context :context_BLx79_DIx5_IZx1_TBx28_VGx84.
  xl:type xbrli:context;
  xbrli:entity [
    xbrli:identifier "0LFF1WMNTWG5PTIYYI38";
    xbrli:scheme http://standards.iso.org/iso/17442;
  xbrli:scenario [
    xbrldi:explicitMember "s2c_LB:x79"^^rdf:XMLLiteral;
    xbrldi:explicitMember "s2c_DI:x5"^^rdf:XMLLiteral;
    xbrldi:explicitMember "s2c_RT:x1"^^rdf:XMLLiteral;
    xbrldi:explicitMember "s2c_LB:x28"^^rdf:XMLLiteral;
    xbrldi:explicitMember "s2c_AM:x84"^^rdf:XMLLiteral;
  xbrli:instant "2019-12-31"^^xsd:date.

Instead of storing the data in separate templates with often unclear code names you can also convert the XBRL data to one large Semantic Web where all facts are linked together. The RDF format thus provides a graph model which allows easier integration and visualization (and, for me at least, easier understanding). It allows adding and linking data from other sources, such as Solvency 2 documents and external data, in the same graph.

Typically, supervisory reports consists of thousands of data points and supervisors receive reports from many entities each period. How would you store that information? I think that the natural way to store an XBRL instance is not a relational database but a graph database (like graphDB or Neo4j). These databases can store the facts with all the metadata in a structured way and enable to query the graph efficiently. Next blog, I will explore graph databases and query languages for XBRL reports converted to the RDF format.