EIOPA’s Solvency 2 taxonomy in RDF

Published by:

To use the metadata from XBRL taxonomies, like labels, hierarchies, template structures and formulas, often licensed software is needed to process the taxonomy and convert the XML content to readable formats. In an earlier blog I have shown that it is useful to convert XBRL instance data to a linked data set in RDF and then query that data to retrieve the desired information. In this blog I will show how to do this with taxonomies: by using a number of small SPARQL queries the complete Data Point Model (DPM) of (European) XBRL taxonomies can be retrieved.

The main purpose of retrieving metadata in this manner is to be able to use taxonomy metadata in data science environments, for example to be able to apply machine learning models that use taxonomy metadata like hierarchies and to use concept and element labels from a taxonomy in NLP, for example Named Entity Recognition tasks to link quantitative reports to (unstructured, or: not yet structured) text data.

The lightweight solution that I show here is completely based on open source code, in the form of my xbrl2rdf package. This package converts XBRL instance files and all related taxonomy files (schemas and linkbases) to RDF and RDF-star and uses for this lxml and rdflib, and nothing else.

The examples below use the Solvency 2 taxonomy, but other taxonomies works as well. Gist with the notebook with the code below can be found here.

Importing the data

With the xbrl2rdf-package I have converted the EIOPA instance example file for quarterly reports for solo undertaking (QRS) to RDF. All taxonomy concepts that are used in that instance are included in the RDF data set. This result can be read with rdflib into memory.

# RDF graph loading
path = "../data/rdf/qrs_240_instance.ttl"

g = RDFGraph()
g.parse(path, format='turtle')

print("rdflib Graph loaded successfully with {} triples".format(len(g)))

This returns

rdflib Graph loaded successfully with 1203744 triples

So we have a RDF graph with 1.2 million triples that contains all data (facts related to concepts in the taxonomy, including all labels, validation rules, template structures, etc). The original RDF data file is around 64 Mb (combining instance and taxonomy triples). Reading and processing this file into an in-memory RDF graph takes some time, but then the graph can easily be queried.

Extracting template URIs

Let’s start with a simple query. Table or template URIs are subjects in the triple “subject xl:type table:table”. To get a list with all templates of an instance (in this case the first five) we run

q = """
    ?a xl:type table:table .
tables = [str(row[0]) for row in g.query(q)]

This returns a list of the URIs of the templates contained in the instance file.


Extracting the explicit domains

Next, we extract the explicit domains and related data in the taxonomy. A domain is specific XBRL terminology and means a set of elements sharing a specified semantic nature. An explicit domain has its elements enumerated in the taxonomy and can be found with the subject in the triple ‘subject rdf:type model:explicitDomainType’.

q = """
  SELECT DISTINCT ?t ?x1 ?x2 ?x3 ?x4
    ?t rdf:type model:explicitDomainType .
    ?t xbrli:periodType ?x1 .
    ?t model:creationDate ?x2 .
    ?t xbrli:nillable ?x3 .
    ?t xbrli:abstract ?x4 .

The first five domains (of 41 in total) are

indexDomain nameDomain labelperiod typecreation datenillableabstract
0LBLines of businessesinstant2014-07-07truetrue
1MCMain categoriesinstant2014-07-07truetrue
2TITime intervalsinstant2014-07-07truetrue
3AOArticle 112 and 167instant2014-07-07truetrue
Domain names and labels with attributes

So, the label of the domain LB is Lines of businesses; it has been there since the early versions of the taxonomy. If a domain is modified then this is also included as a triple in the data set.

Extracting domain members

Elements of an explicit domain are called domain members. A domain member (or simply a member) is enumerated element of an explicit domain. All members from a domain share a certain common nature. To get the members of a domain, we define a function that finds all domain-members relations of a given domain and retrieve the label of the member. In SPARQL this is:

def members(domain):
    q = """
      SELECT DISTINCT ?t ?label
      WHERE {
        ?l arcrole7:domain-member [ xl:from <"""+str(domain)+"""> ;
                                    xl:to ?t ] .
        ?t rdf:type nonnum:domainItemType .
        ?x arcrole3:concept-label [ xl:from ?t ;
                                    xl:to [rdf:value ?label ] ] .
    return g.query(q)

All members of all domains can be retrieved by running this function for all domains defined earlier. Based on the output we create a Pandas DataFrame with the results.

df_members = pd.DataFrame()
for d in df_domains.iloc[:, 0]:
    data = [[urldefrag(d)[1]]+[urldefrag(row[0])[1]]+list(row[1:]) for row in members(d)]
    columns = ['Domain',
               'Member label']
    df_members = df_members.append(pd.DataFrame(data=data,

In total there are 4.879 members of all domains (in this taxonomy).

index DomainMemberMember label
1LBx1Accident and sickness
3LBx3Fire and other damage to property
4LBx4Aviation, marine and transport
The first five member of the domain LB (Lines of Businesses)

This allows us, for example, to retrieve all facts in a report that are related to the term ‘motor’ because a reported fact contains references to the domain members to which the fact relates.

Extracting the template structure

Template structures are stored in the taxonomy as a tree of linked elements along the axes x, y and, if applicable, z. The elements have a label and a row-column code (this holds at least for EIOPA and DNB taxonomies), and have a certain depth, i.e. they can be subcategories of other elements. For example in the Solvency 2 balance sheet template, the element ‘Equities’ has subcategories ‘Equities – listed’ and ‘Equities unlisted’. I have not included the code here but with a few lines you can extract the complete template structures as they are stored in the taxonomy. For example for the balance sheet (S. we get:

1x1C0010Solvency II value
3y1 Assets
4y1 Liabilities
6y2R0020Deferred acquisition costs
7y2R0030Intangible assets
8y2R0040Deferred tax assets
9y2R0050Pension benefit surplus
10y2R0060Property, plant & equipment held for own use
11y2R0070Investments (other than assets held for index-…
12y3R0080Property (other than for own use)
13y3R0090Holdings in related undertakings, including pa…
15y4R0110Equities – listed
16y4R0120Equities – unlisted
18y4R0140Government Bonds
19y4R0150Corporate Bonds
20y4R0160Structured notes
21y4R0170Collateralised securities
22y3R0180Collective Investments Undertakings
24y3R0200Deposits other than cash equivalents
The first 25 lines of the balance sheet template

With relatively small SPARQL queries, it is possible to retrieve metadata from XBRL taxonomies. This works well because we have converted the original taxonomy (in xml) to the linked data format RDF; and this format is especially well suited for representing and querying XBRL data.

The examples above show that it is possible to retrieve the complete Solvency 2 Data Point Model (and more) from the the taxonomy in RDF and make it available in a Python environment. This allows incorporation of metadata in machine learning models and in NLP applications. I hope that this approach will allow more data scientists to use existing metadata from XBRL taxonomies.

Converting XBRL to RDF-star

Published by:

Lately I have been working on the conversion of XBRL instances and related taxonomy schemas and linkbases to RDF and RDF-star. In these semantic data formats, you can link data in XBRL data with other data sources and you can query the data in a fairly easy manner. RDF-star is an extension of RDF that in some situations allows a more compact description of linked data, and by that it narrows the gap between RDF and property graphs. How this works, I will show in this blog using the XBRL taxonomy definitions as an example.

In a previous blog I showed that XBRL instance facts can be converted to RDF and visualized as a network. The same can be done with the related taxonomy elements. An XBRL taxonomy consists of concepts and relations between concepts that define calculations, presentations, labels and definitions. The concepts are laid down (mostly) in XML schemas and the relations in linkbases using XML schemas and XLinks. By converting the XBRL taxonomy to RDF, the XBRL fact data is linked to its corresponding metadata in the taxonomy.


There has been done some work on the conversion of XBRL to RDF, most notably by Dave Raggett. His project xbrlimport, written in C++ and available on SourceForge, converts XBRL data to RDF triples. His approach is clean and straightforward and reuses the original namespaces of the XBRL data (with some obvious elements translated to predicates with RDF namespaces).

I used Raggett’s xbrlimport as a starting point, translated it to Python, added XBRL items that were introduced after publication of the code and improved a number of things. The code is now for example able to convert all EIOPA’s Solvency 2 taxonomy elements with all metadata available to RDF format. This code is available under the same license as xbrlimport (GNU General Public License) as a Python Package on pypi.org. You can take an XBRL instance with corresponding taxonomy (in the form of a zip-file) and convert the contents to RDF and RDF-star. This code will look up any references (URIs) in the XBRL instance to the taxonomy in the zip-file and convert the relevant files to RDF.

Let’s look at some examples of the Solvency 2 taxonomy converted to RDF. The RDF triple of an arbitrary XBRL concept from the Solvency 2 taxonomy looks like this (in turtle format):

    rdf:type xbrli:monetaryItemType ;
    xbrli:periodType """instant"""^^rdf:XMLLiteral ;
    model:creationDate """2014-07-07"""^^xsd:dateTime ;
    xbrli:substitutionGroup xbrli:item ;
    xbrli:nillable "true"^^xsd:boolean ;

This example describes the triples of concept s2md_met:mi362 (a Solvency 2 metric). With these triples we have exactly the same data as in the related XML file but now in the form of triples. Namespaces are derived from the XML file (except rdf:type) and datatypes are transformed to RDF datatypes with proper RDF syntax.

This can be done with all concepts used to which the facts of an XBRL instance refer. If you have facts in RDF format, then in RDF these concept are automatically linked with the concepts in the taxonomy because the URIs of the concepts are the same. This creates a network of facts with all related metadata of the facts.

An XBRL taxonomy also contains links that relate concepts to each other for several purposes (to provide labels, definitions, presentations and calculations) . An example of a link is the following.

_:link2 arcrole:concept-label [
    xl:type xl:link ;
    xl:role role2:link ;
    xl:from s2md_met:mi362 ;
    xl:to s2md_met:label_s2md_mi362 ;
    ] .

The link relates concept mi362 with label mi362 by creating a new subject _:link2 with predicate arcrole:concept-label and an object which contains all data about the link (including the xl:from and xl:to and the attributes of the link). This way of introducing a new subject to specify a link between two concepts is called reification and a bit artificial because you would like to link the concept directly with the label, such as

s2md_met:mi281 arcrole:concept-label s2md_met:label_s2md_mi281

However, then you are unable in RDF to link the attributes (like the order and the role) to the predicates. It is one of the disadvantages of the current RDF format. There appears to be no easy way to do this in RDF, other than by using this artificial reification approach (some other solutions exist like the singleton property approach, but all of them have disadvantages.)

The new RDF-star format

Recently, the RDF-star working group published their first Draft Community Report. In this report they introduced new RDF-star and SPARQL-star specifications. These new specifications, although not yet a W3C standard, enable more compact specification of linked datasets and simpler graphs and less nodes.

Let’s look what this means for the XBRL linkbases with the following example. Suppose we have the following link definition.

_:link1 arcrole:breakdown-tree [
    xl:from _:s2md_a1 ;
    xl:to _:s2md_a1.root ;
    xl:type xl:link ;
    xl:role tab:S. ;
    xl:order "0"^^xsd:decimal ;
    ] .

The subject in this case is _:link1 with predicate arcrole:breakdown-tree, so this link describes a part of a table template. It points to a subject with all the information of the link, i.e. from, to, type, role and order from the xl namespace. Note that there is no triple with _:s2md_a1 (xl:from) as a subject and _:s2md_a1.root (xl:to) as an object. So if you want to know the relations of the concept _:s2md_a1 you need to look at the link triples and look for entries where xl:from equals the concept.

With the new RDF-star specifications you can just add the triple and then add properties to the triple as a whole, so the example would read

_:s2md_a1 arcrole:breakdown-tree _:s2md_a1.root .

<<_:s2md_a1 arcrole:breakdown-tree _:s2md_a1.root>> 
    xl:role tab:S. ;
    xl:order "0"^^xsd:decimal ;

Which is basically what we need to define. If you now want to know the relations of the subject _:s2md_a1 then you just look for triples with this subject. In the visual presentation of the RDF dataset you will see a direct link between the two concepts. This new RDF format also implies simplifications of the SPARQL queries.

This blog has become a bit technical but I hope you see that the RDF-star specification allows a much needed simplification of RDF triples. I showed that the conversion of XBRL taxonomies to RDF-star leads to a smaller amount of triples and also to less complex triples. The resulting taxonomy triples lead to less complex graphs and can be used to derive the XBRL labels, template structures, validation rules and definitions, just by using SPARQL queries.

Europe’s insurance register linked to the GLEIF RDF dataset

Published by:

Number 7 of my New Year’s Resolutions list reads “only use and provide linked data”. So, to start the year well, I decided to do a little experiment to combine insurance undertakings register data with publicly available legal entity data. In concrete terms, I wanted to provide the European insurance register published by EIOPA (containing all licensed insurance undertakings in Europe) as linked data with the Legal Entity data from the Global Legal Entity Identifier Foundation (GLEIF). In this blog I will describe what I have done to do so.

The GLEIF data

The GLEIF data consists of information on all legal entities in the world (entities with a Legal Entity Identifier). A LEI is required by any legal entity who is involved with financial transactions or operates within the financial system. If an organization needs a LEI then it requests one at a local registration agent. For the Netherlands these are the Authority for the Financial Markets (AFM), the Chamber of Commerce (KvK) and the tax authority and others. GLEIF receives data from these agents in each country and makes the collected LEI data available in a number of forms (api, csv, etc).

The really cool thing is that in 2019, together with data.world, GLEIF developed an RDFS/OWL Ontology for Legal Entities, and began in 2020 to publish regularly the LEI data as a linked RDF dataset on data.world (see https://data.world/gleif, you need a (free) account to obtain the data). At the time of this writing, the size of the level 1 data (specifying who is who) is around 10.2 Gb with almost 92 million triples (subject-predicate-object), containing information about entity name, legal form, headquarter and legal address, geographical location, etc. Also related data such as who owns whom is published in this forms.

The EIOPA insurance register

The European Supervisory Authority EIOPA publishes the Register of Insurance undertakings based on information provided by the National Competent Authorities (NCAs). The NCA in each member state is responsible for authorization and registration of the insurance undertakings activities. EIOPA collects the data in the national registers and publishes an European insurance register, which includes more than 3.200 domestic insurance undertakings. The register contains entity data like international and commercial name, name of NCA, addresses, cross border status, registration dates etc. Every insurance undertaking requires a LEI and the LEI is included in the register; this enables us to link the data easily to the GLEIF data.

The EIOPA insurance register is available as CSV and Excel file, without formal naming and clear definitions of column names. Linking the register data with other sources is a tedious job, because it must be done by hand. Take for example the LEI data in the register, which is referred to with the column name ‘LEI’; this is perfectly understandable for humans, but for computers this is just a string of characters. Now that the GLEIF has published its ontologies there is a proper way to refer to a LEI, and that is with the Uniform Resource Identifier (URI) https://www.gleif.org/ontology/L1/LEI, or in a short form gleif-L1:LEI.

The idea is to publish the data in the European insurance register in the same manner; as linked data in RDF format using, where applicable, the GLEIF ontology for legal entities and creating an EIOPA ontology for the data that is unique for the insurance register. This allows users of the data to incorporate the insurance register data into the GLEIF RDF dataset and thereby extending the data available on the legal entities with the data from the insurance register.

Creating triples from the EIOPA register

To convert the EIOPA insurance register to linked data in RDF format, I did the following:

  • extract from the GLEIF RDF level 1 dataset on data.world all insurance undertakings and related data, based on the LEI in the EIOPA register;
  • create a provisional ontology with URIs based on https://www.eiopa.europe.eu/ontology/base (this should ideally be done by EIOPA, as they own the domain name in the URI);
  • transform, with this ontology, the data in the EIOPA register to triples (omitting all data from the EIOPA register that is already included in the GLEIF RDF dataset, like names and addresses);
  • publish the triples for the insurance register in RDF Turtle format on data.world.

Because I used the GLEIF ontology where applicable, the triples I created are automatically linked to the relevant data in the GLEIF dataset. Combining the insurance register dataset with the GLEIF RDF dataset results in a set where you have all the GLEIF level 1 data and all data in the EIOPA insurance register combined for all European insurance undertakings.

Querying the data

Let’s look what we have in this combined dataset. Querying the data in RDF is done with the SPARQL language. Here is an example to return the data on Achmea Schadeverzekeringen.

{ ?s gleif-L1:hasLegalName "Achmea schadeverzekeringen N.V." . 
  ?s ?p ?o .}

The query looks for triples where the predicate is gleif-base:hasLegalName and the object is Achmea Schadeverzekeringen N.V. and returns all data of the subject that satisfies this constraint. This returns (where I omitted the prefix of the objects):

    rdf#type                             LegalEntity
    gleif-base:hasLegalJurisdiction      NL  
    gleif-base:hasEntityStatus           EntityStatusActive  
    gleif-l1:hasLegalName                Achmea Schadeverzekeringen     
    gleif-l1:hasLegalForm                ELF-B5PM
    gleif-L1:hasHeadquartersAddress      L-72450067SU8C745IAV11-LAL  
    gleif-L1:hasLegalAddress             L-72450067SU8C745IAV11-LAL  
    gleif-base:hasRegistrationIdentifier BID-RA000463-08053410
    rdf#type                             InsuranceUndertaking
    eiopa-base:hasRegisterIdentifier     IURI-De-Nederlandsche-Bank-

We see that the rdf#type of this entity is LegalEntity (from the GLEIF data) and the jurisdiction is NL (this has a prefix that refers to the ISO 3166-1 country codes). The legal form refers to another subject called ELF-B5PM. The headquarters and legal address both refer to the same subject that contains the address data of this entity. Then there is a business identifier to the registration data. The last two lines are added by me: a triple to specify that this subject is not only a LegalEntity but also an InsuranceUndertaking (defined in the ontology), and a triple for the Insurance Undertaking Register Identifier (IURI) of this subject (also defined in the ontology).

Let’s look more closely at the references in this list. First the legal form of Achmea (i.e. the predicate and objects of legal form code ELF-B5PM). Included in the GLEIF data is the following (again omitting the prefix of the object):

rdf#type                         EntityLegalForm  
rdf#type                         EntityLegalFormIdentifier  
gleif-base:identifies            ELF-B5PM  
gleif-base:tag                   B5PM  
gleif-base:hasCoverageArea       NL  
gleif-base:hasNameTransliterated naamloze vennootschap  
gleif-base:hasNameLocal          naamloze vennootschap  
gleif-base:hasAbbreviationLocal  NV, N.V., n.v., nv

With the GLEIF data we have this data on all legal entity forms of insurance undertakings in Europe. The local abbreviations are particularly handy as they help us to link an entity’s name extracted from documents or other data sources with its corresponding LEI.

If we look more closely at the EIOPA Register Identifier IURI-De-Nederlandsche-Bank-W1686 then we find the register data of this Achmea entity:

owl:a                         InsuranceUndertakingRegisterIdentifier
gleif-base:identifies         L-72450067SU8C745IAV11
eiopa-base:hasNCA                          De Nederlandsche Bank  
eiopa-base:hasInsuranceUndertakingID       W1686  
eiopa-base:hasEUCountryWhereEntityOperates NL  
eiopa-base:hasCrossBorderStatus            DomesticUndertaking  
eiopa-base:hasRegistrationStartDate        23/12/1991 01:00:00  
eiopa-base:hasRegistrationEndDate          None  
eiopa-base:hasOperationStartDate           23/12/1991 01:00:00  
eiopa-base:hasOperationEndDate             None

The predicate gleif-base:identifies refers back to the subject where gleif-L1:hasLegalName equals the Achmea entity. The other predicates are based on the provisional ontology I made that contains the definitions of the attributes of the EIOPA insurance register. Here we see for example that W1686 is the identifier of this entity in DNB’s insurance register.

Let me give a tiny example of the advantage of using linked data. The GLEIF data contains the geographical location of all legal entities. With the combined dataset it is easy to obtain the location for the insurance undertakings in, for example, the Netherlands. This query returns entity names with latitude and longitude of the legal address of the entity.

SELECT DISTINCT ?name ?lat ?long
WHERE {?sub rdf:type eiopa-base:InsuranceUndertaking ;
            gleif-base:hasLegalJurisdiction CountryCodes:NL ;
            gleif-L1:hasLegalName ?name ;
            gleif-L1:hasLegalAddress/gleif-base:hasCity ?city .
       ?geo gleif-base:hasCity ?city ;
            geo:lat ?lat ; 
            geo:long ?long .}

This result can be plotted on a map, see the link below. If you click on one of the dots then the name of the insurance undertaking will appear.

All queries above and the code to make the map are included in the notebook EIOPA Register RDF datase – SPARQL queries.

The provisional ontology I created is not yet semantically correct and should be improved, for example by incorporating data on NCAs and providing formal definitions. And other data sources could be added, for example the level 2 dataset to identify insurance groups, and the ISIN to LEI relations that are published daily by GLEIF.

By introducing the RDFS/OWL ontologies, the Global LEI Foundation has set an example on how to publish (financial) entity data in an useful manner. The GLEIF RDF dataset reduces the time needed to link the data with other data sources significantly. I hope other organizations that publish financial entity data as part of their mandate will follow that example.

Converting supervisory reports to Semantic Webs: from XBRL to RDF

Published by:

A growing number of supervisory reports across Europe are based on the XML Extensible Business Reporting Language standard (XBRL). Financial entities such as banks, insurance undertakings and pension institutions are required to submit their reports to their supervisors in this format.

XBRL is a language for modeling, exchanging and automatically processing business and financial information. Reports in this format (called instance documents) are based on metadata (set out in taxonomies) that add semantic meaning to the data points that are reported. You can choose different implementations but overall an XBRL taxonomy provides a semantically rich data model and that has always been one of the main advantages of XBRL.

However, in its raw format (an XML file) each report is basically a machine readable document with a tree structure that does not enable easy integration with related data from other sources or integration with text documents and their contents.

In this blog, I will show that converting the XBRL reports to another format allows easier integration and understanding. That other format is based on Semantic Webs. It has been shown that XBRL converted to Semantic Webs can be done without any loss of information (see for example this article). So if we convert the XBRL format to a Semantic Web then we keep the structure and the meaning provided by the taxonomy. The result is basically a graph and this format enables integration with other linked data that is much easier.

A Semantic Web consists of formats and technologies that are rather old (from a computer science perspective): it originated around the same time as XBRL, some twenty years ago. And because it tried to solve similar problems (lack of semantic meaning in the World Wide Web) as the XBRL standard (lack of semantic meaning in business and financial data), to some extent it is based on similar concepts. It was however developed completely separate from XBRL.

The general concept of a Semantic Webs, where data is linked together to provide semantic meaning, is also known as a knowledge graph.

How does a Semantic Web work? One of the formats of the Semantic Web is the Resource Description Framework (RDF), originally designed as a metadata data model. RDF was adopted as a World Wide Web Consortium recommendation in 1999. The RDF 1.0 specification was published in 2004, and RDF 1.1 followed in 2014.

The RDF format is based on expressions in the form of subject-predicate-object, called triples. The subject and object denote (web) resources and the predicate denotes the relationship between the subject and the object. For example the expression ‘Spinoza has written the book Ethica Ordine Geometrico Demonstrata’ in RDF is a triple with a subject denoting “Spinoza”, a predicate denoting “has written”, and an object denoting “the book Ethica Ordine Geometrico Demonstrata”. This is a different approach than for example object-oriented models with an entity (Spinoza), attribute (book) and value (Ethica).

The RDF format could potentially solve some problems with the XBRL format. To explain this, I converted an XBRL-instance (a test instance file from EIOPA for Solvency 2) to RDF format.

Below you see the representation of one arbitrary data point in the report (called a fact) in RDF format and visualized as a network (I used the Python package networkx). The predicates contain the complete web resource so I limited the name to the last word to make it readable.

The red node is the starting point of the data point. The red labels on the lines describe the predicate between subject and object. You see that the fact (subject) ‘has decimals’ (predicate) 2 (object), and furthermore has unit EUR, has value 838522076.03, has type metmi503 (an internal code describing Payments for reported but not settled claims) and some other properties.

The data point also has a so-called context that defines the entity to which the fact applies, the period of time the fact is relevant (in this case 2019-12-31) and also a scenario, which consists of additional metadata of the data point. In this case we see that the data point is related to statutory accounts, non-life and health non-STL, direct business and accepted during the period (and a node without a label).

All facts in every XBRL instance are structured in this way, which means that for example you can search all facts with the label statutory accounts. Furthermore, because XBRL uses namespaces you can unambiguously identify predicates and objects in the report. For example, you see that the entity node has an identifier (starting with 0LFF1…) and a scheme (17442). The scheme refers to the web resource for the ISO standard 17442 which specifies the Legal Entity Identifier (LEI), so the entity is unambiguously identified with the given (LEI-)code. If you add other XBRL instances with references to that entity then the data is automatically linked because other instances will contain exactly the same entity node.

The RDF representation of the XBRL fact above is:

_:provenance1 xl:instance "filename".
_:unit_u xbrli:measure iso4217:EUR.
  xl:provenance :provenance1;
  xl:type xbrli:fact; 
  rdf:type s2md_met:mi503;
  rdf:value "838522076.03"^^xsd:decimal;
  xbrli:decimals "2"^^xsd:integer;
  xbrli:unit :unit_u; 
  xbrli:context :context_BLx79_DIx5_IZx1_TBx28_VGx84.
  xl:type xbrli:context;
  xbrli:entity [
    xbrli:identifier "0LFF1WMNTWG5PTIYYI38";
    xbrli:scheme http://standards.iso.org/iso/17442;
  xbrli:scenario [
    xbrldi:explicitMember "s2c_LB:x79"^^rdf:XMLLiteral;
    xbrldi:explicitMember "s2c_DI:x5"^^rdf:XMLLiteral;
    xbrldi:explicitMember "s2c_RT:x1"^^rdf:XMLLiteral;
    xbrldi:explicitMember "s2c_LB:x28"^^rdf:XMLLiteral;
    xbrldi:explicitMember "s2c_AM:x84"^^rdf:XMLLiteral;
  xbrli:instant "2019-12-31"^^xsd:date.

Instead of storing the data in separate templates with often unclear code names you can also convert the XBRL data to one large Semantic Web where all facts are linked together. The RDF format thus provides a graph model which allows easier integration and visualization (and, for me at least, easier understanding). It allows adding and linking data from other sources, such as Solvency 2 documents and external data, in the same graph.

Typically, supervisory reports consists of thousands of data points and supervisors receive reports from many entities each period. How would you store that information? I think that the natural way to store an XBRL instance is not a relational database but a graph database (like graphDB or Neo4j). These databases can store the facts with all the metadata in a structured way and enable to query the graph efficiently. Next blog, I will explore graph databases and query languages for XBRL reports converted to the RDF format.

Pilot Data Quality Rules

Published by:

Data Quality is receiving more and more attention within the financial sector, and deservedly so. That’s why DNB will start a pilot in September with the insurance sector to enable entities to run locally the required open source code and to evaluate Solvency 2 quantitative reports with our Data Quality Rules.

In the coming weeks we will:

With these tools you are able to assess the data quality of your Solvency 2 quantitative reports before submitting them to DNB. You can do that within your own data science environment.

We worked hard to make this as easy as possible; the only thing you need is Anaconda / Jupyter Notebooks (Python) and Git to clone our repositories from Github (all free and open source software). And of course the data you want to check. We also provide code to evaluate the XBRL instance files.

We are planning workshops to explain how to use the code and validation rules and to go through the process step by step.

Want to join or know more, please let me know (w.j.willemse at dnb.nl).

Code moved to GitHub/DNB

Published by:

It has been a bit quiet here on my side. But I have been busy moving the source code from the notebooks here to the GitHub-account of DNB. And in doing so we improved the code in many areas. Code development can now be done efficiently by using tools for Continuous Integration and Deployment. This enables everyone to work with us to explore new ideas and to test, improve and expand the code.

Let me shortly introduce our repositories (these are all based on code that I wrote for the blogs on this website).

Our insurance repositories


A package for retrieving Solvency 2-data. Our idea is to provide you with one package for all Solvency 2-data retrieval.

Currently it is able to download the Risk Free Interest Term Structures from the EIOPA website, so you don’t need to search on the EIOPA website. And it implements the Smith-Wilson algorithm so you can make your own curves with different parameters.

This code is deployed as a package to the Python Package Index (a.k.a. the cheese shop), so you can install it with pip.


A package aimed to improve data quality of your reports. With this code you can generate and evaluate patterns, and we plan to publish validation and plausibility patterns in addition to the existing ones in the taxonomies. With this package you can evaluate your reports with these patterns.

This package is also deployed to the Python Package Index.


A project with data science applications and tutorials on using the packages above. Currently, we have a data science tutorial using the public Solvency 2 data and a tutorial for the data-patterns package.


Our experimental Natural Language Processing projects with Solvency 2 documents. I already published some results here with NLP (reading the Solvency 2 legislation documents and Word2Vec and Topic Modelling with SFCR documents) and we are planning to provide these and other applications in this repository.

All repositories were made from cookiecutter templates, which is a very easy way to set up your projects.

Take a look at the repositories. If you have suggestions for further improvements or ideas for new features, do not hesitate to raise an issue on the GitHub-site. In the documentation of each repository you can find more information on how to contribute.

Word2vec models for SFCRs

Published by:

Word2vec is a well-known algorithm for natural language processing that often leads to surprisingly good results, if trained properly. It consists of a shallow neural network that maps words to a n-dimensional number space, i.e. it produces vector representations of words (so-called word embeddings). Word2vec does this in a way that words used in the same context are embedded near to each other (their respective vectors are close to each other). In this blog I will show you some of the results of word2vec models trained with Wikipedia and insurance-related documents.

One of the nice properties of a word2vec model is that it allows us to do calculations with words. The distance between two word vectors provides a measure for linguistic or semantic similarity of the corresponding words. So if we calculate the nearest neighbors of the word vector then we find similar words of that word. It is also possible to calculate vector differences between two word vectors. For example, it appears that for word2vec model trained with a large data set, the vector difference between man and woman is roughly equal to the difference between king and queen, or in vector notation kingman + woman = queen. If you find this utterly strange then you are not alone. Besides some intuitive and informal explanations, it is not yet completely clear why word2vec models in general yield these results.

Word2vec models need to be trained with a large corpus of text data in order to achieve word embeddings that allow these kind of calculations. There are some large pre-trained word vectors available, such as the GloVe Twitter word vectors, trained with 2 billion tweets, and the word2vec based on google news (trained with 100 billion words). However, most of them are in the English language and are often trained on words that are generally used, and not domain specific.

So let’s see if we can train word2vec models specifically for non-English European languages and trained with specific insurance vocabulary. A way to do this is to train a word2vec model with Wikipedia pages of a specific language and additionally train the model with sentences we found in public documents of insurance undertakings (SFCRs) and in the insurance legislation. In doing so the word2vec model should be able to capture the specific language domain of insurance.

The Wikipedia word2vec model

Data dumps of all Wikimedia wikis, in the form of a XML-files, are provided here. I obtained the latest Wikipedia pages and articles of all official European languages (bg, es, cs, da, de, et, el, en, fr, hr, it, lv, lt, hu, mt, nl, pl pt, ro, sk, sl, fi, sv). These are compressed files and their size range from 8.6 MB (Maltese) to 16.9 GB (English). The Dutch file is around 1.5 GB. These files are bz2-compressed; the uncompressed Dutch file is about 5 times the compressed size and contains more than 2.5 million Wikipedia pages. This is too large to store into memory (at least on my computer), so you need to use Python generator-functions to process the files without the need to store them completely into memory.

The downloaded XML-files are parsed and page titles and texts are then processed with the nltk-package (stop words are deleted and sentences are tokenized and preprocessed). No n-grams were applied. For the word2vec model I used the implementation in the gensim-package.

Let’s look at some results of the resulting Wikipedia word2vec models. If we get the top ten nearest word vectors of the Dutch word for elephant then we get:

In []: model.wv.most_similar('olifant', topn = 10)
Out[]: [('olifanten', 0.704888105392456),
        ('neushoorn', 0.6430075168609619),
        ('tijger', 0.6399451494216919),
        ('luipaard', 0.6376790404319763),
        ('nijlpaard', 0.6358680725097656),
        ('kameel', 0.5886276960372925),
        ('neushoorns', 0.5880545377731323),
        ('ezel', 0.5879943370819092),
        ('giraf', 0.5807977914810181),
        ('struisvogel', 0.5724758505821228)]

These are all general Dutch names for (wild) animals. So, the Dutch word2vec model appears to map animal names in the same area of the vector space. The word2vec models of other languages appear to do the same, for example norsut (Finnish for elephant) has the following top ten similar words: krokotiilit, sarvikuonot, käärmeet, virtahevot, apinat, hylkeet, hyeenat, kilpikonnat, jänikset and merileijonat. Again, these are all names for animals (with a slight preference for Nordic sea animals).

In the Danish word2vec model, the top 10 most similar words for mads (in Danish a first name derived from matthew) are:

In []: model.wv.most_similar('mads', topn = 10)
Out[]: [('mikkel', 0.6680521965026855),
        ('nicolaj', 0.6564826965332031),
        ('kasper', 0.6114416122436523),
        ('mathias', 0.6102851033210754),
        ('rasmus', 0.6025335788726807),
        ('theis', 0.6013824343681335),
        ('rikke', 0.5957099199295044),
        ('janni', 0.5956574082374573),
        ('refslund', 0.5891965627670288),
        ('kristoffer', 0.5842193365097046)]

Almost all are first names except for Refslund, a former Danish chef whose first name was Mads. The Danish word2vec model appears to map first names in the same domain in the vector space, resulting is a high similarity between first names.

Re-training the Wikipedia Word2vec with SFCRs

The second step is to train the word2vec models with the insurance related text documents. Although the Wikipedia pages for many languages contain some pages on insurance and insurance undertakings, it is difficult to derive the specific language of this domain from these pages. For example the Dutch word for risk margin does not occur in the Dutch Wikipedia pages, and the same holds for many other technical terms. In addition to the Wikipedia pages, we should therefore train the model with insurance specific documents. For this I used the public Solvency and Financial Condition Reports (SFCRs) of Dutch insurance undertakings and the Dutch text of the Solvency II Delegated Acts (here is how to download and read it).

The SFCR sentences are processed in the same manner as the Wikipedia pages, although here I applied bi- and trigrams to be able to distinguish insurance terms rather than separate words (for example technical provisions is a bigram and treated as one word, technical_provisions).

Now the model is able to derive similar words to the Dutch word for risk margin.

In []: model.wv.most_similar('risicomarge')
Out[]: [('beste_schatting', 0.43119704723358154),
        ('technische_voorziening', 0.42812830209732056),
        ('technische_voorzieningen', 0.4108312726020813),
        ('inproduct', 0.409644216299057),
        ('heffingskorting', 0.4008549451828003),
        ('voorziening', 0.3887258470058441),
        ('best_estimate', 0.3886040449142456),
        ('contant_maken', 0.37772029638290405),
        ('optelling', 0.3554660379886627),
        ('brutowinst', 0.3554105758666992)]

This already looks nice. Closest to risk margin is the Dutch term beste_schatting (English: best estimate) and technische_voorziening(en) (English: technical provision, singular and plural). The relation to heffingskorting is strange here. Perhaps the word risk margin is not solely being used in insurance.

Let’s do another one. The acronym skv is the same as scr (solvency capital requirement) in English.

In []: model.wv.most_similar('skv')
Out[]: [('mkv', 0.6492390036582947),
        ('mcr_ratio', 0.4787723124027252),
        ('kapitaalseis', 0.46219778060913086),
        ('mcr', 0.440476655960083),
        ('bscr', 0.4224048852920532),
        ('scr_ratio', 0.41769397258758545),
        ('ðhail', 0.41652536392211914),
        ('solvency_capital', 0.4136047661304474),
        ('mcr_scr', 0.40923237800598145),
        ('solvabiliteits', 0.406883180141449)]

The SFCR documents were sufficient to derive an association between skv and mkv (English equivalent of mcr), and the English acronyms scr and mcr (apparently the Dutch documents sometimes use scr and mcr in the same context). Other similar words are kapitaalseis (English: capital requirement) and bscr. Because they learn from context, the word2vec models are able to learn words that are synonyms and sometimes antonyms (for example we say ‘today is a cold day’ and ‘today is a hot day’, where hot and cold are used in the same manner).

For an example of a vector calculation look at the following result.

In []: model.wv.most_similar(positive = ['dnb', 'duitsland'], 
                             negative = ['nederland'], topn = 5)
Out[]: [('bundesbank', 0.4988047778606415),
        ('bundestag', 0.4865422248840332),
        ('simplesearch', 0.452720582485199),
        ('deutsche', 0.437085896730423),
        ('bondsdag', 0.43249475955963135)]

This function finds the top five similar words of the vector DNBNederland + Duitsland. This expression basically asks for the German equivalent of De Nederlandsche Bank (DNB). The model generates the correct answer: the German analogy of DNB as a central bank is the Bundesbank. I think this is somehow incorporated in the Wikipedia pages, because the German equivalent of DNB as a insurance supervisor is not the Bundesbank but Bafin, and this was not picked up by the model. It is not perfect (the other words in the list are less related and for other countries this does not work as well). We need more documents to find more stable associations. But this to me is already pretty surprising.

There has been some research where the word vectors of word2vec models of two languages were mapped onto each other with a linear transformation (see for example Exploiting Similarities among Languages for Machine Translation, Mikolov, et al). In doing so, it was possible to obtain a model for machine translation. So perhaps it is possible for some European languages with a sufficiently large corpus of SFCRs to generate one large model that is to some extent language independent. To derive the translation matrices we could use the different translations of European legislative texts because in their nature these texts provide one of the most reliable translations available.

But that’s it for me for now. Word2vec is a versatile and powerful algorithm that can be used in numerous natural language applications. It is relatively easy to generate these models in other languages than the English language and it is possible to train these models that can deal with the specifics of insurance terminology, as I showed in this blog.

Pattern discovery in Solvency 2 data (2)

Published by:

Pattern discovery is a general technique to derive patterns from the data itself, without using explicit knowledge about the data. Previously, I described how to discover patterns in Solvency 2 data, for example that values of column x are most of the time or (almost) always equal to the values of column y, or higher, or lower, etc. I also showed how to find columns whose values are the sum of values of other columns, which often occurs in financial reports.

Here I will take pattern discovery one step further and discover constant ratios between values of columns. That is, patterns where the quotient between values of two columns equals a constant.

An example will clarify this. We take the own funds data of Dutch insurance undertakings (sheet 20 of the Excel file) and select the last year of data.

df = get_sheet(20)
df = df.xs(datetime(2017,12,31), axis = 0, level = 1, drop_level = False)

Now we run the generate-function from the insurlib package on Github. The generate-function allows the parameter pattern = “ratio”. Make sure the minimum support is higher than one otherwise you end up with a lot of ratio-patterns. Patterns are generated with the following code, with minimum support 3 and minimum confidence 10%.

rules = patterns.generate(dataframe = df,
                          pattern = "ratio",
                          parameters = {"min_confidence": 0.1, 
                                        "min_support": 3}))
rules = list(rules)

The first two patterns in the list are

['ratio', '9/20', ['mcr , total', 'scr , total']], 19, 120, 0.1367]


['ratio', '1/4', ['mcr , total', 'scr , total']], 36, 103, 0.259]

The two patterns state that the quotient between the values of the columns ‘mcr , total’ and ‘scr , total’ is 9/20 in 19 cases with confidence of almost 14% and is 1/4 (with 120 exception to this rule) is 36 cases with confidence of almost 26% (with 103 exceptions). In Solvency 2, the acronym MCR stands for Minimum Capital Requirement, and SCR stand for Solvency Capital Requirement (the solvency 2 risk-based capital requirement with VaR 99,5%). So the identified ratios are 9/20 (i.e. 45%) and 1/4 (i.e. 25%).

It is not desirable that the MCR is too low or too high in relation to the SCR. Solvency 2 legislation therefore prescribes that the MCR is bounded between 25% and 45% of the SCR. We see that these legal boundaries are found in the data itself. We have 36 insurance undertakings where the ratio is 25%, 19 undertaking where the ratio is 45%. The rest (should) have a ratio somewhere between 25% and 45%.

The algorithm applies a brute-force search for ratios between columns. It uses the fraction module of Python’s standard library to convert every quotient to a fraction, which enables the use of rational number arithmetic. For example

>>> Fraction(9,4)

The fraction expression you see in the pattern definition is the string representation of the fraction object.

A common problem with dealing with quotients is that often the original data set contains quotients like 0.2499999 and 0.2500001 because values are rounded off to euros. In the fraction module, this can be overcome by putting a limit on the denominator (default is 1e7, but you can pass this as a parameter). The result is that the nearest fraction is found given the limit on the denominator.

The other rules that were found in the own funds data are

['ratio', '9/100', ['available and eligible own funds|total eligible own funds to meet the mcr , tier 2', 'scr , total']], 3, 12, 0.2]

['ratio', '1/5', ['available and eligible own funds|total eligible own funds to meet the mcr , tier 2', 'mcr , total']], 12, 3, 0.8]

['ratio', '3/20', ['available and eligible own funds|total eligible own funds to meet the scr , tier 3', 'scr , total']], 6, 15, 0.2857]

['ratio', '1/3', ['available and eligible own funds|total eligible own funds to meet the scr , tier 3', 'mcr , total']], 3, 18, 0.1429]

These ratio-patterns are boundaries on the available and eligible own funds and are prescribed by Solvency 2 legislation. In the Netherlands, they have a relatively low support because a limited number of Dutch insurance undertakings hit the legal boundaries of tier two and tier three capital.

Here we looked at the own funds data. But many more patterns of this type can be found if you run the complete data set of Dutch insurance undertakings. It takes some time but then you will find in total 121 ratio-patterns (with minimum support of 3) covering loss-absorbing capacity of deferred taxes, impact of transitional measures, operational risk capital, etc. And of course other data sets than Solvency 2 data are possible.

These ratio-patterns work well because the Solvency 2 legislation contains legal boundaries that are represented in the data. Also other nonlegal constant ratios can be found, such as entity-specific ratios. You can find ratio-patterns for each insurance undertaking separately and then signal when a pattern is violated. This would work with a relatively low limit on the denominator of the fractions, because we want to find constant ratios with a high confidence. But for this to work well you will need more than two years of data.

A brute-force approach works well for the Dutch public Solvency 2 data set (around 1270 dimensions). But for high dimensional data it will take some time because of the possible combinations. Perhaps smart ways exist to detect ratios more easily, for example via statistical correlations.

The ratio-patterns that were discovered could be related to the Solvency 2 legislation by automatically reading ratios and percents in the text. And in the same manner, entity-specific ratios could be related to the SFCR documents of that entity. But that is for future work.

Pattern discovery in Solvency 2 data (1)

Published by:

This blog describes some results with algorithms for pattern discovery in Solvency 2 quantitative reports. The main idea here is to automatically uncover patterns that are present in these reports. We want to find patterns that represent widely or commonly occurring situations, and possibly represent business rules and relations prescribed by the underlying legislation. If we are able to find these patterns, then we are also able to identify when data satisfy and do not satisfy the patterns.

How can we find these patterns? Initially, I started with association rules. This is a rule-based machine learning approach that leads to transparent and explainable results. However, for this approach the quantitative data has to be encoded to a set of features (for example by replacing every quantitative value with an appropriate nominal value) and with high dimensional data this quickly becomes computationally expensive.

After some increments I decided to program a pattern discovery algorithm especially designed for analyzing the quantitative data points of reports. The goal was to speed up the process, while maintaining the association rules mining approach and performance measures. Below I will give some examples of how and which patterns can be found. The code is part of the insurlib package on Github. I used the public Solvency 2 data of Dutch insurers to find patterns in the quantitative data of these insurers (but applications to other data sets are possible).

First read the pandas and the insurlib package. The patterns-part consists of functions to generate patterns in numerical columns of dataframes.

import pandas as pd
from insurlib import patterns

Now read the public Solvency 2 data as was described in How to analyze public Solvency 2 data of Dutch insurers (by reading the Excel file and defining the read_sheet function). To recall, the Excel consists of the following worksheets:

  • Worksheet 14: balance sheet
  • Worksheet 15: premiums – life
  • Worksheet 16: premiums – non-life
  • Worksheet 17: technical provisions – life
  • Worksheet 18: technical provisions – non-life
  • Worksheet 19: transition and adjustments
  • Worksheet 20: own funds
  • Worksheet 21: solvency capital requirements – 1
  • Worksheet 22: solvency capital requirements – 2
  • Worksheet 23: minimum capital requirements
  • Worksheet 24: additional information life
  • Worksheet 25: additional information non-life
  • Worksheet 26: additional information reinsurance

Example 1: comparing two dataframes

Suppose we want to compare the worksheet balance sheet and the worksheet non-life technical provision and find the relations between the contents in these worksheets.

df1 = get_sheet(14)
df2 = get_sheet(18)
df2.columns = [str(df2.columns[i]) for i in range(len(df2.columns))]

The last line is to convert the multiple level columns to one level (so that we can compare it more easily with other dataframes).

You can generate patterns with the generate-function. Patterns that are found have the ‘association rule’-structure P -> Q. If you input two dataframes (P_dataframe and Q_dataframe) then all columns of P_dataframe are compared to all columns of Q_dataframe. The pattern we are looking for is ‘=’, so patterns with corresponding values are found. You can also use other patterns, such as ‘<‘, ‘<=’ , ‘>’, ‘>=’ and ‘!=’. Also a dict of parameters is used as input, with in this case the minimum confidence and the minimum support.

rules = patterns.generate(P_dataframe = df1,
                          Q_dataframe = df2,
                          pattern = "=",
                          parameters = {"min_confidence": 0.75, 
                                        "min_support": 10}))
rules = list(rules)
print("Number of rules: " + str(len(rules)))
Number of rules: 2

The output is a generator that we can convert to a list of rules. Two rules are found in this case. Let’s look a the first rule.

    'assets|reinsurance recoverables from:|non-life and health 
    similar to non-life , solvency ii value', 
    "('total non-life obligation', 'Technical provisions -
    total|Recoverable from reinsurance contract/SPV and Finite
    Re after the adjustment for expected losses due to counterparty
    - total')"], 

The first pattern states that the value of the reinsurance recoverables on the asset side for non-life on the balance sheet equates the value of the recoverable from reinsurance contracts in the technical provision for non-life obligations in the technical provisions sheet. The rule has confidence of almost 97%, and a support of 127. This means that there are 127 occurrences of this pattern in the data and in 97% of all occurrences (with nonzero data points) the patterns holds. We also see that in four cases the patterns are not present, i.e. the reinsurance recoverables does not equate to the recoverable in the technical provision (these are presumably data errors).

The second rule that was found reads:

    'liabilities|technical provisions – non-life , solvency ii value',
    "('total non-life obligation', 'Technical provisions -
    total|Technical provisions - total')"], 

This rule also has a high confidence. It says that the value of the technical provisions for non-life in the balance sheet equals the value of the total technical provisions in the technical provision sheet. This is a plain consistency rule between the sheets. The six exceptions are presumably data errors.

Both rules were, at the moment of publication, not part of the automatic and predefined validation rules of the Solvency II reports (otherwise the confidence would be 100%), as part of the XBRL-taxonomy. But by analyzing the reports in this manner we were able to uncover them automatically.

Example 2: patterns of sums

Often financial reports contain sums within the report. We can analyze the column names to detect potential sums (often a hierarchy in the columns name can be identified), but we can also find patterns of sums. The following code does that. We input the balance sheet dataframe and let the algorithm search for ‘sum’-patterns. The parameters sum_elements states the maximum elements in the sum (in this case three).

rules = patterns.generate(dataframe = df1,
                          pattern = "sum",
                          parameters = {"sum_elements": 3})
rules = list(rules)
print("Number of rules: " + str(len(rules)))
Number of rules: 7

Let’s take a look at the first rule:

    ['assets|investments (other than assets held for index-linked and 
     unit-linked contracts)|equities|equities - listed , solvency ii 
     'assets|investments (other than assets held for index-linked and 
     unit-linked contracts)|equities|equities - unlisted , solvency ii
    'assets|investments (other than assets held for index-linked and 
     unit-linked contracts)|equities , solvency ii value'], 

The rule states that the sum of the listed and unlisted equities equals to the equities (so equities are either listed or non-listed). This rule has a confidence of 100%, and there is presumably a validation rule within the reports. Six rules in this structure were found in this way. This is however somewhat computationally expensive.

Example 3: patterns with a given value

The last example searches for patterns with specific values. In this case we want to know in how many cases the investments are higher than zero. We can do this in the following way. We input the dataframe like in example 2 and we add a parameter columns and set it to the name of the column we want to investigate (in fact you can input a list of columns).

P = ['assets|investments (other than assets held for index-linked and 
      unit-linked contracts) , solvency ii value']
 rules = patterns.generate(dataframe = df1, 
                           pattern = ">", 
                           columns = P, 
                           value = 0, 
                           parameters = {'min_confidence': 0.75,
                                         'min_support': 1})
 rules = list(rules)
     'assets|investments (other than assets held for index-linked and 
     unit-linked contracts) , solvency ii value', 

The value of investment is, with confidence of 96%, higher than zero. In eleven cases the value is not higher than zero. This rule has a high confidence because, normally, insurers invest premiums collected for insurance policies into a wide range of investment assets. If no list of columns is added, patterns in all numerical columns in the dataframe returned.

The aim of these examples is to give a general idea of pattern discovery in Solvency 2 quantitative data. Numerous patterns can be found in this way by using the complete data set. And by using the measures confidence and support we can find patterns that are not exactly perfect but do provide information about the data, without taking recourse to statistical methods. Data errors and specific situations that lead to exceptions in the data are not expressions of pure randomness and should therefore not be analyzed by statistical methods. With these patterns we are able to reconstruct basic patterns in the data that provide information about the data.

Of course, many improvements are possible in order to find more complex patterns (and that why there is a (1) in the title of this blog). Presumably all existing validation rules can be found in this manner, and much more. Hopefully I will be able to implement these improvements and present them in a new blog.

Text modeling with S2 SFCRs

Published by:

European insurance undertakings are required to publish each year a Solvency and Financial Condition Report (SFCR). These SFCRs are often made available via the insurance undertaking’s website. In this blog I will show some first results of a text modeling exercise using these SFCRs.

Text modeling was done with Latent Dirichlet Allocation (LDA) with the Mallet’s implementation, via the gensim-package (found here: https://radimrehurek.com/gensim/index.html). A description you can find here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/. LDA is an unsupervised learning algorithm that generates latent (hidden) distributions over topics for each document or sentence and a distribution over words for each topic.

To get the data I scraped as many SFCRs (in all European languages) as I could find on the Internet. As a result of this I have a data set of 4.36 GB with around 2,500 SFCR documents in PDF-format (until proven otherwise, I probably have the largest library of SFCR documents in Europe). Among these are 395 SFCRs in the English language, consisting in total of 287,579 sentences and 8.1 million words.

In a SFCR an insurance undertaking publicly discloses information about a number of topics prescribed by the Solvency II legislation, such as its business and performance, system of governance, risk profile, valuation and capital management. Every SFCR therefore contains the same topics.

The LDA algorithm is able to find dominant keywords that represents each topic, given a set of documents. It is assumed that each document is about one topic. We want to use the LDA algorithm to identify the different topics within the SFCRs, such that, for example, we can extracts all sentences about the solvency requirements. To do so, I will run the LDA algorithm with sentences from the SFCRs (and thereby assuming that each sentence is about one topic).

I followed the usual steps; some data preparation to read the pdf files properly. Then I selected the top 9.000 words and I selected the sentences with more than 10 words (it is known that the LDA algorithm does not work very good with words that are not often used and very short documents/sentences). I did not built bigram and trigram models because this did not really change the outcome. Then the data was lemmatized such that only nouns, adjectives, verbs and adverbs were selected. The spacy-package provides functions to tag the data and select the allowed postags.

The main inputs for the LDA algorithm is a dictionary and a corpus. The dictionary contains all lemmatized words used in the documents with a unique id for each word. The corpus is a mapping of word id to word frequency in each sentence. After we generated these, we can run the LDA algorithm with the number of topics as one of the parameters.

The quality of the topic modeling is measured by the coherence score.

The coherence score per number of topics

Therefore, a low number of topics that performs well would be nine topics (0.65) and the highest coherence score is attained at 22 topics (0.67), which is pretty high in general. From this we conclude that nine topics would be a good start.

What does the LDA algorithm produce? It generates for each topic a list of keywords with weights that represent that topic. The weights indicate how strong the relation between the keyword and the topic is: the higher the weight the more representative the word is for that specific topic. Below the first ten keywords are listed with their weights. The algorithm does not classify the topic with one or two words, so per topic I determined a description that more or less covers the topic (with the main subjects of the Solvency II legislation in mind).

Topic 0 ‘Governance’: 0.057*”management” + 0.051*”board” + 0.049*”function” + 0.046*”internal” + 0.038*”committee” + 0.035*”audit” + 0.034*”control” + 0.032*”system” + 0.030*”compliance” + 0.025*”director”

Topic 1 ‘Valuation’: 0.067*”asset” + 0.054*”investment” + 0.036*”liability” + 0.030*”valuation” + 0.024*”cash” + 0.022*”balance” + 0.020*”tax” + 0.019*”cost” + 0.017*”account” + 0.016*”difference”

Topic 2 ‘Reporting and performance’: 0.083*”report” + 0.077*”solvency” + 0.077*”financial” + 0.038*”condition” + 0.032*”information” + 0.026*”performance” + 0.026*”group” + 0.025*”material” + 0.021*”december” + 0.018*”company”

Topic 3 ‘Solvency’: 0.092*”capital” + 0.059*”requirement” + 0.049*”solvency” + 0.039*”year” + 0.032*”scr” + 0.030*”fund” + 0.027*”model” + 0.024*”standard” + 0.021*”result” + 0.018*”base”

Topic 4 ‘Claims and assumptions’: 0.023*”claim” + 0.021*”term” + 0.019*”business” + 0.016*”assumption” + 0.016*”market” + 0.015*”future” + 0.014*”base” + 0.014*”product” + 0.013*”make” + 0.012*”increase”

Topic 5 ‘Undertaking’s strategy’: 0.039*”policy” + 0.031*”process” + 0.031*”business” + 0.030*”company” + 0.025*”ensure” + 0.022*”management” + 0.017*”plan” + 0.015*”manage” + 0.015*”strategy” + 0.015*”orsa”

Topic 6 ‘Risk management’: 0.325*”risk” + 0.030*”market” + 0.027*”rate” + 0.024*”change” + 0.022*”operational” + 0.021*”underwriting” + 0.019*”credit” + 0.019*”exposure” + 0.013*”interest” + 0.013*”liquidity”

Topic 7 ‘Insurance and technical provisions’: 0.049*”insurance” + 0.045*”reinsurance” + 0.043*”provision” + 0.039*”life” + 0.034*”technical” + 0.029*”total” + 0.025*”premium” + 0.023*”fund” + 0.020*”gross” + 0.019*”estimate”

Topic 8 ‘Undertaking’: 0.065*”company” + 0.063*”group” + 0.029*”insurance” + 0.029*”method” + 0.023*”limit” + 0.022*”include” + 0.017*”service” + 0.016*”limited” + 0.015*”specific” + 0.013*”mutual”

To determine the topic of a sentences we calculate for each topic the weight of the words in the sentences. The main topic of the sentence is then expected to be the topic with the highest sum.

If we run the following sentence (found in one of the SFCRs) through the model

"For the purposes of solvency, the Insurance Group’s insurance obligations
are divided into the following business segments: 1. Insurance with profit
participation 2. Unit-linked and index-linked insurance 3. Other life
insurance 4. Health insurance 5. Medical expence insurance for non-life
insurance 6. Income protection insurance for non-life insurance Pension &
Försäkring (Sweden) Pension & Försäkring offers insurance solutions on the
Swedish market within risk and unit-linked insurance and traditional life

then we get the following results per topic:

[(0, 0.08960573476702509), 
(1, 0.0692951015531661),
(2, 0.0692951015531661),
(3, 0.06332138590203108),
(4, 0.08363201911589009),
(5, 0.0692951015531661),
(6, 0.08004778972520908),
(7, 0.3369175627240143),
(8, 0.13859020310633216)]

Topic seven (‘Insurance and technical provisions’) has clearly the highest score 0.34 , followed by topic eight (‘Undertaking’). This suggests that these sentences are about the insurances and technical provisions of the undertaking (that we can verify).

Likewise, for the sentence

"Chief Risk Officer and Risk Function 
The Board has appointed a Chief Risk Officer (CRO) who reports directly to
the Board and has responsibility for managing the risk function and
monitoring the effectiveness of the risk management system."

we get the following results:

[(0, 0.2926447574334898), 
(1, 0.08294209702660407),
(2, 0.07824726134585289),
(3, 0.07824726134585289),
(4, 0.07824726134585289),
(5, 0.08450704225352113),
(6, 0.14866979655712048),
(7, 0.07824726134585289),
(8, 0.07824726134585289)]

Therefore, topic zero (‘Governance’) and topic six (‘Risk management’) have the highest score and this suggests that this sentence is about the governance of the insurance undertaking and to a lesser extent risk management.

The nine topics that were identified reflect fairly different elements in the SFCR, but we also see that some topics consist of several subtopics that could be identified separately. For example, the topic that I described as ‘Valuation’ covers assets and investments but it might be more appropriate to distinguish investment strategies from valuation. The topic ‘Solvency’ covers own funds as well as solvency requirements. If we increase the number of topics then some of the above topics will be split into more topics and the topic determination will be more accurate.

Once we have made the LDA model we can use the results for several applications. First, of course, we can use the model to determine the topics of previously unseen documents and sentences. We can also analyze topic distributions across different SFCRs, we can get similar sentences for any given sentence (based on the distance of the probability scores of the given sentence to other sentences).

In this blog I described first steps in text modeling of Solvency and Financial Condition Reports of insurance undertakings. The coherence scores were fairly high and the identified topics represented genuine topics from the Solvency II legislation, especially with a sufficient number of topics. Some examples showed that the LDA model is able to identify the topic of specific sentences. However, this does not yet work perfectly; an important element of SFCR documents are the numerical information often stored in table form in the PDF. These are difficult to analyze with the LDA algorithm.