Tag Archives: RDF

EIOPA’s Solvency 2 taxonomy in RDF

Published by:

To use the metadata from XBRL taxonomies, like labels, hierarchies, template structures and formulas, often licensed software is needed to process the taxonomy and convert the XML content to readable formats. In an earlier blog I have shown that it is useful to convert XBRL instance data to a linked data set in RDF and then query that data to retrieve the desired information. In this blog I will show how to do this with taxonomies: by using a number of small SPARQL queries the complete Data Point Model (DPM) of (European) XBRL taxonomies can be retrieved.

The main purpose of retrieving metadata in this manner is to be able to use taxonomy metadata in data science environments, for example to be able to apply machine learning models that use taxonomy metadata like hierarchies and to use concept and element labels from a taxonomy in NLP, for example Named Entity Recognition tasks to link quantitative reports to (unstructured, or: not yet structured) text data.

The lightweight solution that I show here is completely based on open source code, in the form of my xbrl2rdf package. This package converts XBRL instance files and all related taxonomy files (schemas and linkbases) to RDF and RDF-star and uses for this lxml and rdflib, and nothing else.

The examples below use the Solvency 2 taxonomy, but other taxonomies works as well. Gist with the notebook with the code below can be found here.

Importing the data

With the xbrl2rdf-package I have converted the EIOPA instance example file for quarterly reports for solo undertaking (QRS) to RDF. All taxonomy concepts that are used in that instance are included in the RDF data set. This result can be read with rdflib into memory.

# RDF graph loading
path = "../data/rdf/qrs_240_instance.ttl"

g = RDFGraph()
g.parse(path, format='turtle')

print("rdflib Graph loaded successfully with {} triples".format(len(g)))

This returns

rdflib Graph loaded successfully with 1203744 triples

So we have a RDF graph with 1.2 million triples that contains all data (facts related to concepts in the taxonomy, including all labels, validation rules, template structures, etc). The original RDF data file is around 64 Mb (combining instance and taxonomy triples). Reading and processing this file into an in-memory RDF graph takes some time, but then the graph can easily be queried.

Extracting template URIs

Let’s start with a simple query. Table or template URIs are subjects in the triple “subject xl:type table:table”. To get a list with all templates of an instance (in this case the first five) we run

q = """
  SELECT ?a
  WHERE {
    ?a xl:type table:table .
  }"""
tables = [str(row[0]) for row in g.query(q)]
tables.sort()
tables[0:5]

This returns a list of the URIs of the templates contained in the instance file.

['http://eiopa.europa.eu/xbrl/s2md/fws/solvency/solvency2/2019-07-15/tab/S.01.01.02.01#s2md_tS.01.01.02.01',
 'http://eiopa.europa.eu/xbrl/s2md/fws/solvency/solvency2/2019-07-15/tab/S.01.02.01.01#s2md_tS.01.02.01.01',
 'http://eiopa.europa.eu/xbrl/s2md/fws/solvency/solvency2/2019-07-15/tab/S.02.01.02.01#s2md_tS.02.01.02.01',
 'http://eiopa.europa.eu/xbrl/s2md/fws/solvency/solvency2/2019-07-15/tab/S.05.01.02.01#s2md_tS.05.01.02.01',
 'http://eiopa.europa.eu/xbrl/s2md/fws/solvency/solvency2/2019-07-15/tab/S.05.01.02.02#s2md_tS.05.01.02.02']

Extracting the explicit domains

Next, we extract the explicit domains and related data in the taxonomy. A domain is specific XBRL terminology and means a set of elements sharing a specified semantic nature. An explicit domain has its elements enumerated in the taxonomy and can be found with the subject in the triple ‘subject rdf:type model:explicitDomainType’.

q = """
  SELECT DISTINCT ?t ?x1 ?x2 ?x3 ?x4
  WHERE {
    ?t rdf:type model:explicitDomainType .
    ?t xbrli:periodType ?x1 .
    ?t model:creationDate ?x2 .
    ?t xbrli:nillable ?x3 .
    ?t xbrli:abstract ?x4 .
  }"""

The first five domains (of 41 in total) are

indexDomain nameDomain labelperiod typecreation datenillableabstract
0LBLines of businessesinstant2014-07-07truetrue
1MCMain categoriesinstant2014-07-07truetrue
2TITime intervalsinstant2014-07-07truetrue
3AOArticle 112 and 167instant2014-07-07truetrue
4CGCollaterals/Guaranteesinstant2014-07-07truetrue
Domain names and labels with attributes

So, the label of the domain LB is Lines of businesses; it has been there since the early versions of the taxonomy. If a domain is modified then this is also included as a triple in the data set.

Extracting domain members

Elements of an explicit domain are called domain members. A domain member (or simply a member) is enumerated element of an explicit domain. All members from a domain share a certain common nature. To get the members of a domain, we define a function that finds all domain-members relations of a given domain and retrieve the label of the member. In SPARQL this is:

def members(domain):
    q = """
      SELECT DISTINCT ?t ?label
      WHERE {
        ?l arcrole7:domain-member [ xl:from <"""+str(domain)+"""> ;
                                    xl:to ?t ] .
        ?t rdf:type nonnum:domainItemType .
        ?x arcrole3:concept-label [ xl:from ?t ;
                                    xl:to [rdf:value ?label ] ] .
        }"""
    return g.query(q)

All members of all domains can be retrieved by running this function for all domains defined earlier. Based on the output we create a Pandas DataFrame with the results.

df_members = pd.DataFrame()
for d in df_domains.iloc[:, 0]:
    data = [[urldefrag(d)[1]]+[urldefrag(row[0])[1]]+list(row[1:]) for row in members(d)]
    columns = ['Domain',
               'Member',
               'Member label']
    df_members = df_members.append(pd.DataFrame(data=data,
                                                columns=columns))

In total there are 4.879 members of all domains (in this taxonomy).

index DomainMemberMember label
0LBx0Total/NA
1LBx1Accident and sickness
2LBx2Motor
3LBx3Fire and other damage to property
4LBx4Aviation, marine and transport
The first five member of the domain LB (Lines of Businesses)

This allows us, for example, to retrieve all facts in a report that are related to the term ‘motor’ because a reported fact contains references to the domain members to which the fact relates.

Extracting the template structure

Template structures are stored in the taxonomy as a tree of linked elements along the axes x, y and, if applicable, z. The elements have a label and a row-column code (this holds at least for EIOPA and DNB taxonomies), and have a certain depth, i.e. they can be subcategories of other elements. For example in the Solvency 2 balance sheet template, the element ‘Equities’ has subcategories ‘Equities – listed’ and ‘Equities unlisted’. I have not included the code here but with a few lines you can extract the complete template structures as they are stored in the taxonomy. For example for the balance sheet (S.02.01.02.01) we get:

 axisdepthrc-codelabel
1x1C0010Solvency II value
3y1 Assets
4y1 Liabilities
5y2R0010Goodwill
6y2R0020Deferred acquisition costs
7y2R0030Intangible assets
8y2R0040Deferred tax assets
9y2R0050Pension benefit surplus
10y2R0060Property, plant & equipment held for own use
11y2R0070Investments (other than assets held for index-…
12y3R0080Property (other than for own use)
13y3R0090Holdings in related undertakings, including pa…
14y3R0100Equities
15y4R0110Equities – listed
16y4R0120Equities – unlisted
17y3R0130Bonds
18y4R0140Government Bonds
19y4R0150Corporate Bonds
20y4R0160Structured notes
21y4R0170Collateralised securities
22y3R0180Collective Investments Undertakings
23y3R0190Derivatives
24y3R0200Deposits other than cash equivalents
The first 25 lines of the balance sheet template

With relatively small SPARQL queries, it is possible to retrieve metadata from XBRL taxonomies. This works well because we have converted the original taxonomy (in xml) to the linked data format RDF; and this format is especially well suited for representing and querying XBRL data.

The examples above show that it is possible to retrieve the complete Solvency 2 Data Point Model (and more) from the the taxonomy in RDF and make it available in a Python environment. This allows incorporation of metadata in machine learning models and in NLP applications. I hope that this approach will allow more data scientists to use existing metadata from XBRL taxonomies.

Converting XBRL to RDF-star

Published by:

Lately I have been working on the conversion of XBRL instances and related taxonomy schemas and linkbases to RDF and RDF-star. In these semantic data formats, you can link data in XBRL data with other data sources and you can query the data in a fairly easy manner. RDF-star is an extension of RDF that in some situations allows a more compact description of linked data, and by that it narrows the gap between RDF and property graphs. How this works, I will show in this blog using the XBRL taxonomy definitions as an example.

In a previous blog I showed that XBRL instance facts can be converted to RDF and visualized as a network. The same can be done with the related taxonomy elements. An XBRL taxonomy consists of concepts and relations between concepts that define calculations, presentations, labels and definitions. The concepts are laid down (mostly) in XML schemas and the relations in linkbases using XML schemas and XLinks. By converting the XBRL taxonomy to RDF, the XBRL fact data is linked to its corresponding metadata in the taxonomy.

XBRL to RDF

There has been done some work on the conversion of XBRL to RDF, most notably by Dave Raggett. His project xbrlimport, written in C++ and available on SourceForge, converts XBRL data to RDF triples. His approach is clean and straightforward and reuses the original namespaces of the XBRL data (with some obvious elements translated to predicates with RDF namespaces).

I used Raggett’s xbrlimport as a starting point, translated it to Python, added XBRL items that were introduced after publication of the code and improved a number of things. The code is now for example able to convert all EIOPA’s Solvency 2 taxonomy elements with all metadata available to RDF format. This code is available under the same license as xbrlimport (GNU General Public License) as a Python Package on pypi.org. You can take an XBRL instance with corresponding taxonomy (in the form of a zip-file) and convert the contents to RDF and RDF-star. This code will look up any references (URIs) in the XBRL instance to the taxonomy in the zip-file and convert the relevant files to RDF.

Let’s look at some examples of the Solvency 2 taxonomy converted to RDF. The RDF triple of an arbitrary XBRL concept from the Solvency 2 taxonomy looks like this (in turtle format):

s2md_met:mi362 
    rdf:type xbrli:monetaryItemType ;
    xbrli:periodType """instant"""^^rdf:XMLLiteral ;
    model:creationDate """2014-07-07"""^^xsd:dateTime ;
    xbrli:substitutionGroup xbrli:item ;
    xbrli:nillable "true"^^xsd:boolean ;

This example describes the triples of concept s2md_met:mi362 (a Solvency 2 metric). With these triples we have exactly the same data as in the related XML file but now in the form of triples. Namespaces are derived from the XML file (except rdf:type) and datatypes are transformed to RDF datatypes with proper RDF syntax.

This can be done with all concepts used to which the facts of an XBRL instance refer. If you have facts in RDF format, then in RDF these concept are automatically linked with the concepts in the taxonomy because the URIs of the concepts are the same. This creates a network of facts with all related metadata of the facts.

An XBRL taxonomy also contains links that relate concepts to each other for several purposes (to provide labels, definitions, presentations and calculations) . An example of a link is the following.

_:link2 arcrole:concept-label [
    xl:type xl:link ;
    xl:role role2:link ;
    xl:from s2md_met:mi362 ;
    xl:to s2md_met:label_s2md_mi362 ;
    ] .

The link relates concept mi362 with label mi362 by creating a new subject _:link2 with predicate arcrole:concept-label and an object which contains all data about the link (including the xl:from and xl:to and the attributes of the link). This way of introducing a new subject to specify a link between two concepts is called reification and a bit artificial because you would like to link the concept directly with the label, such as

s2md_met:mi281 arcrole:concept-label s2md_met:label_s2md_mi281

However, then you are unable in RDF to link the attributes (like the order and the role) to the predicates. It is one of the disadvantages of the current RDF format. There appears to be no easy way to do this in RDF, other than by using this artificial reification approach (some other solutions exist like the singleton property approach, but all of them have disadvantages.)

The new RDF-star format

Recently, the RDF-star working group published their first Draft Community Report. In this report they introduced new RDF-star and SPARQL-star specifications. These new specifications, although not yet a W3C standard, enable more compact specification of linked datasets and simpler graphs and less nodes.

Let’s look what this means for the XBRL linkbases with the following example. Suppose we have the following link definition.

_:link1 arcrole:breakdown-tree [
    xl:from _:s2md_a1 ;
    xl:to _:s2md_a1.root ;
    xl:type xl:link ;
    xl:role tab:S.01.01.02.01 ;
    xl:order "0"^^xsd:decimal ;
    ] .

The subject in this case is _:link1 with predicate arcrole:breakdown-tree, so this link describes a part of a table template. It points to a subject with all the information of the link, i.e. from, to, type, role and order from the xl namespace. Note that there is no triple with _:s2md_a1 (xl:from) as a subject and _:s2md_a1.root (xl:to) as an object. So if you want to know the relations of the concept _:s2md_a1 you need to look at the link triples and look for entries where xl:from equals the concept.

With the new RDF-star specifications you can just add the triple and then add properties to the triple as a whole, so the example would read

_:s2md_a1 arcrole:breakdown-tree _:s2md_a1.root .

<<_:s2md_a1 arcrole:breakdown-tree _:s2md_a1.root>> 
    xl:role tab:S.01.01.02.01 ;
    xl:order "0"^^xsd:decimal ;
    .

Which is basically what we need to define. If you now want to know the relations of the subject _:s2md_a1 then you just look for triples with this subject. In the visual presentation of the RDF dataset you will see a direct link between the two concepts. This new RDF format also implies simplifications of the SPARQL queries.

This blog has become a bit technical but I hope you see that the RDF-star specification allows a much needed simplification of RDF triples. I showed that the conversion of XBRL taxonomies to RDF-star leads to a smaller amount of triples and also to less complex triples. The resulting taxonomy triples lead to less complex graphs and can be used to derive the XBRL labels, template structures, validation rules and definitions, just by using SPARQL queries.

Converting supervisory reports to Semantic Webs: from XBRL to RDF

Published by:

A growing number of supervisory reports across Europe are based on the XML Extensible Business Reporting Language standard (XBRL). Financial entities such as banks, insurance undertakings and pension institutions are required to submit their reports to their supervisors in this format.

XBRL is a language for modeling, exchanging and automatically processing business and financial information. Reports in this format (called instance documents) are based on metadata (set out in taxonomies) that add semantic meaning to the data points that are reported. You can choose different implementations but overall an XBRL taxonomy provides a semantically rich data model and that has always been one of the main advantages of XBRL.

However, in its raw format (an XML file) each report is basically a machine readable document with a tree structure that does not enable easy integration with related data from other sources or integration with text documents and their contents.

In this blog, I will show that converting the XBRL reports to another format allows easier integration and understanding. That other format is based on Semantic Webs. It has been shown that XBRL converted to Semantic Webs can be done without any loss of information (see for example this article). So if we convert the XBRL format to a Semantic Web then we keep the structure and the meaning provided by the taxonomy. The result is basically a graph and this format enables integration with other linked data that is much easier.

A Semantic Web consists of formats and technologies that are rather old (from a computer science perspective): it originated around the same time as XBRL, some twenty years ago. And because it tried to solve similar problems (lack of semantic meaning in the World Wide Web) as the XBRL standard (lack of semantic meaning in business and financial data), to some extent it is based on similar concepts. It was however developed completely separate from XBRL.

The general concept of a Semantic Webs, where data is linked together to provide semantic meaning, is also known as a knowledge graph.

How does a Semantic Web work? One of the formats of the Semantic Web is the Resource Description Framework (RDF), originally designed as a metadata data model. RDF was adopted as a World Wide Web Consortium recommendation in 1999. The RDF 1.0 specification was published in 2004, and RDF 1.1 followed in 2014.

The RDF format is based on expressions in the form of subject-predicate-object, called triples. The subject and object denote (web) resources and the predicate denotes the relationship between the subject and the object. For example the expression ‘Spinoza has written the book Ethica Ordine Geometrico Demonstrata’ in RDF is a triple with a subject denoting “Spinoza”, a predicate denoting “has written”, and an object denoting “the book Ethica Ordine Geometrico Demonstrata”. This is a different approach than for example object-oriented models with an entity (Spinoza), attribute (book) and value (Ethica).

The RDF format could potentially solve some problems with the XBRL format. To explain this, I converted an XBRL-instance (a test instance file from EIOPA for Solvency 2) to RDF format.

Below you see the representation of one arbitrary data point in the report (called a fact) in RDF format and visualized as a network (I used the Python package networkx). The predicates contain the complete web resource so I limited the name to the last word to make it readable.

The red node is the starting point of the data point. The red labels on the lines describe the predicate between subject and object. You see that the fact (subject) ‘has decimals’ (predicate) 2 (object), and furthermore has unit EUR, has value 838522076.03, has type metmi503 (an internal code describing Payments for reported but not settled claims) and some other properties.

The data point also has a so-called context that defines the entity to which the fact applies, the period of time the fact is relevant (in this case 2019-12-31) and also a scenario, which consists of additional metadata of the data point. In this case we see that the data point is related to statutory accounts, non-life and health non-STL, direct business and accepted during the period (and a node without a label).

All facts in every XBRL instance are structured in this way, which means that for example you can search all facts with the label statutory accounts. Furthermore, because XBRL uses namespaces you can unambiguously identify predicates and objects in the report. For example, you see that the entity node has an identifier (starting with 0LFF1…) and a scheme (17442). The scheme refers to the web resource for the ISO standard 17442 which specifies the Legal Entity Identifier (LEI), so the entity is unambiguously identified with the given (LEI-)code. If you add other XBRL instances with references to that entity then the data is automatically linked because other instances will contain exactly the same entity node.

The RDF representation of the XBRL fact above is:

_:provenance1 xl:instance "filename".
_:unit_u xbrli:measure iso4217:EUR.
_:fact926
  xl:provenance :provenance1;
  xl:type xbrli:fact; 
  rdf:type s2md_met:mi503;
  rdf:value "838522076.03"^^xsd:decimal;
  xbrli:decimals "2"^^xsd:integer;
  xbrli:unit :unit_u; 
  xbrli:context :context_BLx79_DIx5_IZx1_TBx28_VGx84.
_:context_BLx79_DIx5_IZx1_TBx28_VGx84
  xl:type xbrli:context;
  xbrli:entity [
    xbrli:identifier "0LFF1WMNTWG5PTIYYI38";
    xbrli:scheme http://standards.iso.org/iso/17442;
    ];
  xbrli:scenario [
    xbrldi:explicitMember "s2c_LB:x79"^^rdf:XMLLiteral;
    xbrldi:explicitMember "s2c_DI:x5"^^rdf:XMLLiteral;
    xbrldi:explicitMember "s2c_RT:x1"^^rdf:XMLLiteral;
    xbrldi:explicitMember "s2c_LB:x28"^^rdf:XMLLiteral;
    xbrldi:explicitMember "s2c_AM:x84"^^rdf:XMLLiteral;
    ];
  xbrli:instant "2019-12-31"^^xsd:date.

Instead of storing the data in separate templates with often unclear code names you can also convert the XBRL data to one large Semantic Web where all facts are linked together. The RDF format thus provides a graph model which allows easier integration and visualization (and, for me at least, easier understanding). It allows adding and linking data from other sources, such as Solvency 2 documents and external data, in the same graph.

Typically, supervisory reports consists of thousands of data points and supervisors receive reports from many entities each period. How would you store that information? I think that the natural way to store an XBRL instance is not a relational database but a graph database (like graphDB or Neo4j). These databases can store the facts with all the metadata in a structured way and enable to query the graph efficiently. Next blog, I will explore graph databases and query languages for XBRL reports converted to the RDF format.