EIOPA’s Solvency 2 taxonomy in RDF

Tags: , ,

To use the metadata from XBRL taxonomies, like labels, hierarchies, template structures and formulas, often licensed software is needed to process the taxonomy and convert the XML content to readable formats. In an earlier blog I have shown that it is useful to convert XBRL instance data to a linked data set in RDF and then query that data to retrieve the desired information. In this blog I will show how to do this with taxonomies: by using a number of small SPARQL queries the complete Data Point Model (DPM) of (European) XBRL taxonomies can be retrieved.

The main purpose of retrieving metadata in this manner is to be able to use taxonomy metadata in data science environments, for example to be able to apply machine learning models that use taxonomy metadata like hierarchies and to use concept and element labels from a taxonomy in NLP, for example Named Entity Recognition tasks to link quantitative reports to (unstructured, or: not yet structured) text data.

The lightweight solution that I show here is completely based on open source code, in the form of my xbrl2rdf package. This package converts XBRL instance files and all related taxonomy files (schemas and linkbases) to RDF and RDF-star and uses for this lxml and rdflib, and nothing else.

The examples below use the Solvency 2 taxonomy, but other taxonomies works as well. Gist with the notebook with the code below can be found here.

Importing the data

With the xbrl2rdf-package I have converted the EIOPA instance example file for quarterly reports for solo undertaking (QRS) to RDF. All taxonomy concepts that are used in that instance are included in the RDF data set. This result can be read with rdflib into memory.

# RDF graph loading
path = "../data/rdf/qrs_240_instance.ttl"

g = RDFGraph()
g.parse(path, format='turtle')

print("rdflib Graph loaded successfully with {} triples".format(len(g)))

This returns

rdflib Graph loaded successfully with 1203744 triples

So we have a RDF graph with 1.2 million triples that contains all data (facts related to concepts in the taxonomy, including all labels, validation rules, template structures, etc). The original RDF data file is around 64 Mb (combining instance and taxonomy triples). Reading and processing this file into an in-memory RDF graph takes some time, but then the graph can easily be queried.

Extracting template URIs

Let’s start with a simple query. Table or template URIs are subjects in the triple “subject xl:type table:table”. To get a list with all templates of an instance (in this case the first five) we run

q = """
    ?a xl:type table:table .
tables = [str(row[0]) for row in g.query(q)]

This returns a list of the URIs of the templates contained in the instance file.


Extracting the explicit domains

Next, we extract the explicit domains and related data in the taxonomy. A domain is specific XBRL terminology and means a set of elements sharing a specified semantic nature. An explicit domain has its elements enumerated in the taxonomy and can be found with the subject in the triple ‘subject rdf:type model:explicitDomainType’.

q = """
  SELECT DISTINCT ?t ?x1 ?x2 ?x3 ?x4
    ?t rdf:type model:explicitDomainType .
    ?t xbrli:periodType ?x1 .
    ?t model:creationDate ?x2 .
    ?t xbrli:nillable ?x3 .
    ?t xbrli:abstract ?x4 .

The first five domains (of 41 in total) are

indexDomain nameDomain labelperiod typecreation datenillableabstract
0LBLines of businessesinstant2014-07-07truetrue
1MCMain categoriesinstant2014-07-07truetrue
2TITime intervalsinstant2014-07-07truetrue
3AOArticle 112 and 167instant2014-07-07truetrue
Domain names and labels with attributes

So, the label of the domain LB is Lines of businesses; it has been there since the early versions of the taxonomy. If a domain is modified then this is also included as a triple in the data set.

Extracting domain members

Elements of an explicit domain are called domain members. A domain member (or simply a member) is enumerated element of an explicit domain. All members from a domain share a certain common nature. To get the members of a domain, we define a function that finds all domain-members relations of a given domain and retrieve the label of the member. In SPARQL this is:

def members(domain):
    q = """
      SELECT DISTINCT ?t ?label
      WHERE {
        ?l arcrole7:domain-member [ xl:from <"""+str(domain)+"""> ;
                                    xl:to ?t ] .
        ?t rdf:type nonnum:domainItemType .
        ?x arcrole3:concept-label [ xl:from ?t ;
                                    xl:to [rdf:value ?label ] ] .
    return g.query(q)

All members of all domains can be retrieved by running this function for all domains defined earlier. Based on the output we create a Pandas DataFrame with the results.

df_members = pd.DataFrame()
for d in df_domains.iloc[:, 0]:
    data = [[urldefrag(d)[1]]+[urldefrag(row[0])[1]]+list(row[1:]) for row in members(d)]
    columns = ['Domain',
               'Member label']
    df_members = df_members.append(pd.DataFrame(data=data,

In total there are 4.879 members of all domains (in this taxonomy).

index DomainMemberMember label
1LBx1Accident and sickness
3LBx3Fire and other damage to property
4LBx4Aviation, marine and transport
The first five member of the domain LB (Lines of Businesses)

This allows us, for example, to retrieve all facts in a report that are related to the term ‘motor’ because a reported fact contains references to the domain members to which the fact relates.

Extracting the template structure

Template structures are stored in the taxonomy as a tree of linked elements along the axes x, y and, if applicable, z. The elements have a label and a row-column code (this holds at least for EIOPA and DNB taxonomies), and have a certain depth, i.e. they can be subcategories of other elements. For example in the Solvency 2 balance sheet template, the element ‘Equities’ has subcategories ‘Equities – listed’ and ‘Equities unlisted’. I have not included the code here but with a few lines you can extract the complete template structures as they are stored in the taxonomy. For example for the balance sheet (S. we get:

1x1C0010Solvency II value
3y1 Assets
4y1 Liabilities
6y2R0020Deferred acquisition costs
7y2R0030Intangible assets
8y2R0040Deferred tax assets
9y2R0050Pension benefit surplus
10y2R0060Property, plant & equipment held for own use
11y2R0070Investments (other than assets held for index-…
12y3R0080Property (other than for own use)
13y3R0090Holdings in related undertakings, including pa…
15y4R0110Equities – listed
16y4R0120Equities – unlisted
18y4R0140Government Bonds
19y4R0150Corporate Bonds
20y4R0160Structured notes
21y4R0170Collateralised securities
22y3R0180Collective Investments Undertakings
24y3R0200Deposits other than cash equivalents
The first 25 lines of the balance sheet template

With relatively small SPARQL queries, it is possible to retrieve metadata from XBRL taxonomies. This works well because we have converted the original taxonomy (in xml) to the linked data format RDF; and this format is especially well suited for representing and querying XBRL data.

The examples above show that it is possible to retrieve the complete Solvency 2 Data Point Model (and more) from the the taxonomy in RDF and make it available in a Python environment. This allows incorporation of metadata in machine learning models and in NLP applications. I hope that this approach will allow more data scientists to use existing metadata from XBRL taxonomies.

Leave a Reply