Category Archives: Data extraction

Converting supervisory reports to Semantic Webs: from XBRL to RDF

Published by:

A growing number of supervisory reports across Europe are based on the XML Extensible Business Reporting Language standard (XBRL). Financial entities such as banks, insurance undertakings and pension institutions are required to submit their reports to their supervisors in this format.

XBRL is a language for modeling, exchanging and automatically processing business and financial information. Reports in this format (called instance documents) are based on metadata (set out in taxonomies) that add semantic meaning to the data points that are reported. You can choose different implementations but overall an XBRL taxonomy provides a semantically rich data model and that has always been one of the main advantages of XBRL.

However, in its raw format (an XML file) each report is basically a machine readable document with a tree structure that does not enable easy integration with related data from other sources or integration with text documents and their contents.

In this blog, I will show that converting the XBRL reports to another format allows easier integration and understanding. That other format is based on Semantic Webs. It has been shown that XBRL converted to Semantic Webs can be done without any loss of information (see for example this article). So if we convert the XBRL format to a Semantic Web then we keep the structure and the meaning provided by the taxonomy. The result is basically a graph and this format enables integration with other linked data that is much easier.

A Semantic Web consists of formats and technologies that are rather old (from a computer science perspective): it originated around the same time as XBRL, some twenty years ago. And because it tried to solve similar problems (lack of semantic meaning in the World Wide Web) as the XBRL standard (lack of semantic meaning in business and financial data), to some extent it is based on similar concepts. It was however developed completely separate from XBRL.

The general concept of a Semantic Webs, where data is linked together to provide semantic meaning, is also known as a knowledge graph.

How does a Semantic Web work? One of the formats of the Semantic Web is the Resource Description Framework (RDF), originally designed as a metadata data model. RDF was adopted as a World Wide Web Consortium recommendation in 1999. The RDF 1.0 specification was published in 2004, and RDF 1.1 followed in 2014.

The RDF format is based on expressions in the form of subject-predicate-object, called triples. The subject and object denote (web) resources and the predicate denotes the relationship between the subject and the object. For example the expression ‘Spinoza has written the book Ethica Ordine Geometrico Demonstrata’ in RDF is a triple with a subject denoting “Spinoza”, a predicate denoting “has written”, and an object denoting “the book Ethica Ordine Geometrico Demonstrata”. This is a different approach than for example object-oriented models with an entity (Spinoza), attribute (book) and value (Ethica).

The RDF format could potentially solve some problems with the XBRL format. To explain this, I converted an XBRL-instance (a test instance file from EIOPA for Solvency 2) to RDF format.

Below you see the representation of one arbitrary data point in the report (called a fact) in RDF format and visualized as a network (I used the Python package networkx). The predicates contain the complete web resource so I limited the name to the last word to make it readable.

The red node is the starting point of the data point. The red labels on the lines describe the predicate between subject and object. You see that the fact (subject) ‘has decimals’ (predicate) 2 (object), and furthermore has unit EUR, has value 838522076.03, has type metmi503 (an internal code describing Payments for reported but not settled claims) and some other properties.

The data point also has a so-called context that defines the entity to which the fact applies, the period of time the fact is relevant (in this case 2019-12-31) and also a scenario, which consists of additional metadata of the data point. In this case we see that the data point is related to statutory accounts, non-life and health non-STL, direct business and accepted during the period (and a node without a label).

All facts in every XBRL instance are structured in this way, which means that for example you can search all facts with the label statutory accounts. Furthermore, because XBRL uses namespaces you can unambiguously identify predicates and objects in the report. For example, you see that the entity node has an identifier (starting with 0LFF1…) and a scheme (17442). The scheme refers to the web resource for the ISO standard 17442 which specifies the Legal Entity Identifier (LEI), so the entity is unambiguously identified with the given (LEI-)code. If you add other XBRL instances with references to that entity then the data is automatically linked because other instances will contain exactly the same entity node.

The RDF representation of the XBRL fact above is:

_:provenance1 xl:instance "filename".
_:unit_u xbrli:measure iso4217:EUR.
_:fact926
  xl:provenance :provenance1;
  xl:type xbrli:fact; 
  rdf:type s2md_met:mi503;
  rdf:value "838522076.03"^^xsd:decimal;
  xbrli:decimals "2"^^xsd:integer;
  xbrli:unit :unit_u; 
  xbrli:context :context_BLx79_DIx5_IZx1_TBx28_VGx84.
_:context_BLx79_DIx5_IZx1_TBx28_VGx84
  xl:type xbrli:context;
  xbrli:entity [
    xbrli:identifier "0LFF1WMNTWG5PTIYYI38";
    xbrli:scheme http://standards.iso.org/iso/17442;
    ];
  xbrli:scenario [
    xbrldi:explicitMember "s2c_LB:x79"^^rdf:XMLLiteral;
    xbrldi:explicitMember "s2c_DI:x5"^^rdf:XMLLiteral;
    xbrldi:explicitMember "s2c_RT:x1"^^rdf:XMLLiteral;
    xbrldi:explicitMember "s2c_LB:x28"^^rdf:XMLLiteral;
    xbrldi:explicitMember "s2c_AM:x84"^^rdf:XMLLiteral;
    ];
  xbrli:instant "2019-12-31"^^xsd:date.

Instead of storing the data in separate templates with often unclear code names you can also convert the XBRL data to one large Semantic Web where all facts are linked together. The RDF format thus provides a graph model which allows easier integration and visualization (and, for me at least, easier understanding). It allows adding and linking data from other sources, such as Solvency 2 documents and external data, in the same graph.

Typically, supervisory reports consists of thousands of data points and supervisors receive reports from many entities each period. How would you store that information? I think that the natural way to store an XBRL instance is not a relational database but a graph database (like graphDB or Neo4j). These databases can store the facts with all the metadata in a structured way and enable to query the graph efficiently. Next blog, I will explore graph databases and query languages for XBRL reports converted to the RDF format.

Pilot Data Quality Rules

Published by:

Data Quality is receiving more and more attention within the financial sector, and deservedly so. That’s why DNB will start a pilot in September with the insurance sector to enable entities to run locally the required open source code and to evaluate Solvency 2 quantitative reports with our Data Quality Rules.

In the coming weeks we will:

With these tools you are able to assess the data quality of your Solvency 2 quantitative reports before submitting them to DNB. You can do that within your own data science environment.

We worked hard to make this as easy as possible; the only thing you need is Anaconda / Jupyter Notebooks (Python) and Git to clone our repositories from Github (all free and open source software). And of course the data you want to check. We also provide code to evaluate the XBRL instance files.

We are planning workshops to explain how to use the code and validation rules and to go through the process step by step.

Want to join or know more, please let me know (w.j.willemse at dnb.nl).

How to analyze public Solvency 2 data of Dutch insurers

Published by:

In this blog we will use the public Solvency II data of all Dutch insurance undertakings and present it in one large data vector per undertaking. In doing so, we are able to use some easy but powerful machine learning algorithms to analyze the data. The notebook can be found here.

Solvency II data of individual insurance undertakings is published yearly by the Statistics department of DNB. The data represents the financial and solvency situation of each insurance undertaking per the end of each year. Currently, we have two years of data: ultimo 2016 and ultimo 2017. The publication of DNB is in the form of an Excel file with a number of worksheets containing the aggregated data and the individual data. Here we will use the individual data.

We will read the data in a Pandas Data Frame en use Numpy for data manipulations. Furthermore we need the datetime object

import pandas as pd
import numpy as np
from datetime import datetime

You can find the data with the following url

https://statistiek.dnb.nl/en/downloads/index.aspx#/?statistics_type=toezichtdata&theme=verzekeraars

Download the file and make sure that the Excel file is accessible.

data_path = "../../20_local_data/"
xls = pd.ExcelFile(data_path + "Data individual insurers
                                (year).xlsx")

The Excel file contains several worksheets with data. We want to combine all the data together in one Data Frame. To do that we need some data preparation and data cleaning for each worksheet.

In the following function a worksheet is put into a Data Frame and the columns names are set to lower case. Then an index of the data frame is set to the insurance undertaking name and the reporting period. Then we perform some cleaning (the original worksheets contain some process information). In addition, some worksheets in the file contain 2-dimensional data, that need to be pivoted such that we obtain one large vector with all the data per insurance undertaking in one row.

def get_sheet(num):
    # read entire Excel sheet
    df = xls.parse(num)
    # columns names to lower case
    df.columns = [c.lower() for c in df.columns]
    # set index to name and period
    df.set_index(['name', 'period'], inplace = True)
    # data cleaning (the excel sheet contains some
                     additional data that we don't need)
    drop_list = [i for i in df.columns 
                     if 'unnamed' in i or 'selectielijst' in i]
    df.drop(drop_list, axis = 1, inplace = True)
    # pivot data frame
    if "row_name" in df.columns:
        df.drop("row_name", axis = 1, inplace = True)
        df = df.pivot(columns = 'row_header')
    if df.columns.nlevels > 1:
        df.columns = [str(df.columns[i]) for i in
                          range(len(df.columns))]
    return df

Creating one large vector per insurer

With the function above we can read a part of the Excel file and store it in a Pandas data frame. The following worksheets are contained in the Excel file published by DNB.

  • Worksheet 14: balance sheet
  • Worksheet 15: premiums – life
  • Worksheet 16: premiums – non-life
  • Worksheet 17: technical provisions – 1
  • Worksheet 18: technical provisions – 2
  • Worksheet 19: transition and adjustments
  • Worksheet 20: own funds
  • Worksheet 21: solvency capital requirements – 1
  • Worksheet 22: solvency capital requirements – 2
  • Worksheet 23: minimum capital requirements
  • Worksheet 24: additional information life
  • Worksheet 25: additional information non-life
  • Worksheet 26: additional information reinsurance

Let’s read the first worksheet with data and then append the other sheets to it. We shall not read the last three worksheets, because these contain the country specific reports.

df = get_sheet(14)
for l in range(15, 24):
    df_temp = get_sheet(l)
    df = df.join(df_temp, rsuffix = "_"+ str(l))

Let’s get the shape of the Data Frame.

print("Number of rows   : " + str(len(df.index)))
print("Number of columns: " + str(len(df.columns)))

Out: Number of rows   : 286
     Number of columns: 1273

So we have 286 rows of insurance undertakings data for ultimo 2016 and ultimo 2017. And we have 1273 columns with financial and solvency data about each insurance undertaking. Let’s take the data from the year 2017, and select all columns that have floating numbers in them.

df = df.xs(datetime(2017,12,31), 
           axis = 0, 
           level = 1,
           drop_level = True)
df = df[[df.columns[c] for c in range(len(df.columns)) 
          if df.dtypes[c]=='float64']]
df.fillna(0, inplace = True)
print("Number of rows   : " + str(len(df.index)))
print("Number of columns: " + str(len(df.columns)))

Out: Number of rows   : 139
     Number of columns: 1272

Apparently there is one columns with a non-floating values. For ultimo 2017 we have the data of 139 insurance undertaking.

Now we can perform all kinds of numerical analysis. Let’s calculate the total amount of assets of all Dutch insurance undertakings.

df['assets|total assets , solvency ii value'].sum()

Out: 486702222727.1401

That’s almost 487 billion euro at the end of 2017.

Finding similar insurers

Now let’s apply some algorithms to this data set. Suppose we want to know what insurance undertakings are similar with respect to their financial and solvency structure. To do that we can calculate the distances between all the data points of each insurance undertakings. An insurance undertaking with a low distance to another insurance undertaking might be similar to that undertaking.

If we divide each row by the total assets we do as if all insurance undertakings have equal size, and then the distances indicate similarity in financial and solvency structure (and not similarity in size).

X = df.div(df['assets|total assets , solvency ii value'],
              axis = 0)

The scikit-learn package provides numerous algorithms to do calculations with distances. Below we apply the NearestNeighbors algorithm to find the neighbors of each insurance undertaking. Then we get the distances and the indices of the data set and store them.

from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors(n_neighbors = 2, 
                        algorithm = 'brute').fit(X.values)
distances, indices = nbrs.kneighbors(X)

What are the nearest neighbors of the first ten insurance undertakings in the list?

for i in indices[0:10]:
     print(X.index[i[0]] + " --> " + X.index[i[1]])

Out: 
ABN AMRO Captive N.V. --> Rabo Herverzekeringsmaatschappij N.V.
ABN AMRO Levensverzekering N.V. --> Delta Lloyd Levensverzekering N.V.
ABN AMRO Schadeverzekering N.V. --> Ansvar Verzekeringsmaatschappij N.V.
AEGON Levensverzekering N.V. --> ASR Levensverzekering N.V.
AEGON Schadeverzekering N.V. --> Nationale-Nederlanden Schadeverzekering Maatschappij N.V.
AEGON Spaarkas N.V. --> Robein Leven N.V.
ASR Aanvullende Ziektekostenverzekeringen N.V. --> ONVZ Aanvullende Verzekering N.V.
ASR Basis Ziektekostenverzekeringen N.V. --> VGZ Zorgverzekeraar N.V.
ASR Levensverzekering N.V. --> Achmea Pensioen- en Levensverzekeringen N.V.
ASR Schadeverzekering N.V. --> Veherex Schade N.V.

And with the shortest distance between two insurance undertakings we can find the two insurance undertakings that have the highest similarity in their structure.

min_list = np.where(distances[:,1] == distances[:,1].min())
list(X.index[min_list])

Out: ['IZA Zorgverzekeraar N.V.', 'Univé Zorg, N.V.']

If you want to understand the financial performance it is of course handy to know which insurance undertakings are similar. A more general approach when comparing insurance undertakings is to cluster them into a small number of peer groups.

Clustering the insurers

Can we cluster the insurance undertakings based on the 1272-dimensional data? To do this we apply the t-sne algorithm (that we used before).

First we import all the packages required.

import matplotlib.pyplot as pyplot
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

Then we run the t-sne algorithm.

Y = TSNE(n_components = 2, 
         perplexity = 5, 
         verbose = 0, 
         random_state = 1).fit_transform(X.values)

And we plot the results

pyplot.figure(figsize = (7, 7))
pyplot.scatter(x = Y[:, 0], 
               y = Y[:, 1], 
               s = 7)
pyplot.show()

Depending on how you zoom in you see different clusters in this picture. In the above left you see the health insurance undertakings (with more clusters within that set: those offering basic health insurance and other offering additional health insurances, or both). On the right are (mostly) life insurance undertakings, and on the left (middle to below) are non-life insurance undertakings. And both clusters can be divided into several more sub clusters. These clusters can be used in further analysis. For example, you could use these as peer groups of similar insurance undertakings.

Clustering the features

Given that we have a 1272-dimensional vector of each insurance undertaking we might wish somehow to cluster the features in the data set. That is, we want to know which columns belong to each other and what columns are different.

An initial form of clustering were the different worksheets in the original Excel file. The data was clustered around the balance sheet, premiums, technical provisions, etc. But can we also find clusters within the total vector without any prior knowledge of the different worksheets?

A simple and effective way is to transpose the data matrix and feed it into the t-sne algorithm. That is, instead of assuming that each feature provides additional information about an insurance undertaking, we assume that each insurance undertaking provides additional information about a feature.

Let’s do this for only the balance sheet. In a balance sheet it is not immediately straightforward how the left side is related to the right side of the balance sheet, i.e. which assets are related to which liabilities. If you cluster all the data of the balance sheet then related items are clustered (irrespective of whether they are assets or liabilities).

df = get_sheet(14)

Instead of the scaled values we now take whether or not a data point was reported or not, and then transpose the matrix.

X = (df != 0).T

Then we apply the t-sne algorithm. In this case with a lower perplexity.

Y = TSNE(n_components = 2, 
         perplexity = 1.0, 
         verbose = 0, 
         random_state = 0, 
         learning_rate = 20, 
         n_iter = 10000).fit_transform(X.values)

And we plot the result with 15 identified clusters.

pyplot.figure(figsize = (7, 7))
pyplot.scatter(x = Y[:, 0], 
              y = Y[:, 1], 
              s = 5)
kmeans = KMeans(n_clusters = 15, random_state = 0, n_init  = 10).fit(Y)
for i in range(len(kmeans.cluster_centers_)):
    pyplot.scatter(x = kmeans.cluster_centers_[i,0],
                   y = kmeans.cluster_centers_[i,1],
                   s = 1,
                   c = 'yellow')
    pyplot.annotate(str(i), 
                   xy = (kmeans.cluster_centers_[i, 0], kmeans.cluster_centers_[i, 1]), 
                   size = 13)
pyplot.show()

Then we get this.

We see are large number of different clusters.

for i in df.T.loc[kmeans.labels_ == 6].index:
    print(i)

Out:
assets|assets held for index-linked and unit-linked contracts , solvency ii value
assets|loans and mortgages|loans and mortgages to individuals , solvency ii value
assets|reinsurance recoverables from:|life and health similar to life, excl health,index-linked,unit-linked|life excluding health,index-linked,unit-linked , solvency ii value
liabilities|technical provisions – index-linked and unit-linked , solvency ii value
liabilities|technical provisions – index-linked and unit-linked|best estimate , solvency ii value
liabilities|technical provisions – index-linked and unit-linked|risk margin , solvency ii value

So the assets held for index-linked and unit-linked contracts are in the same cluster as the technical provisions for index-linked and unit-linked items (and some other related items are found).

However, the relations found are not always perfect. To improve the clustering we should cluster the data that is related in their changes over time. But because we have just two years available (and so just one yearly difference) we presumably do not have enough data to do that.

How to download and read the Solvency 2 legislation

Published by:

In our first Natural Language Processing project we will read the Solvency II legislation from the website of the European Union and extract the text within the articles by using regular expressions.

For this notebook, we have chosen the text of the Delegated Acts of Solvency II. This part of the Solvency II regulation is directly into force (because it is a Regulation) and the wording of the Delegated Acts is more detailed than the Solvency II Directive and very precise and internally consistent. This makes it suitable for NLP. From the text we are able to extract features and text data on Solvency II for our future projects.

The code of this notebook can be found in here

Step 1: data Retrieval

We use several packages to read and process the pdfs. For reading we use the fitz-package. Furthermore we need the re-package (regular expressions) for cleaning the text data.

import os
import re
import requests
import fitz

We want to read the Delegated Acts in all available languages. The languages of the European Union are Bulgarian (BG), Spanish (ES), Czech (CS), Danish (DA), German (DE), Estonian (ET), Greek (EL), English (EN), French (FR), Croatian (HR), Italian (IT), Latvian (LV), Lithuanian (LT), Hungarian (HU), Maltese (MT), Dutch (NL), Polish (PL), Portuguese (PT), Romanian (RO), Slovak (SK), Solvenian (SL), Finnish (FI), Swedish (SV).

languages = ['BG','ES','CS','DA','DE','ET','EL',
             'EN','FR','HR','IT','LV','LT','HU',
             'MT','NL','PL','PT','RO','SK','SL',
             'FI','SV']

The urls of the Delegated Acts of Solvency 2 are constructed for these languages by the following list comprehension.

urls = ['https://eur-lex.europa.eu/legal-content/' + lang +
        '/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN' 
        for lang in  languages]

The following for loop retrieves the pdfs of the Delegated Acts from the website of the European Union and stores them in da_path.

da_path = 'data/solvency ii/'
for index in range(len(urls)):
    filename = 'Solvency II Delegated Acts - ' + languages[index] + '.pdf'
    if not(os.path.isfile(da_path + filename)):
        r = requests.get(urls[index])
        f = open(da_path + filename,'wb+')
        f.write(r.content) 
        f.close()
 else:
        print("--> already read.")

Step 2: data cleaning

If you look at the pdfs then you see that each page has a header with page number and information about the legislation and the language. These headers must be deleted to access the articles in the text.

DA_dict = dict({
                'BG': 'Официален вестник на Европейския съюз',
                'CS': 'Úřední věstník Evropské unie',
                'DA': 'Den Europæiske Unions Tidende',
                'DE': 'Amtsblatt der Europäischen Union',
                'EL': 'Επίσημη Εφημερίδα της Ευρωπαϊκής Ένωσης',
                'EN': 'Official Journal of the European Union',
                'ES': 'Diario Oficial de la Unión Europea',
                'ET': 'Euroopa Liidu Teataja',           
                'FI': 'Euroopan unionin virallinen lehti',
                'FR': "Journal officiel de l'Union européenne",
                'HR': 'Službeni list Europske unije',         
                'HU': 'Az Európai Unió Hivatalos Lapja',      
                'IT': "Gazzetta ufficiale dell'Unione europea",
                'LT': 'Europos Sąjungos oficialusis leidinys',
                'LV': 'Eiropas Savienības Oficiālais Vēstnesis',
                'MT': 'Il-Ġurnal Uffiċjali tal-Unjoni Ewropea',
                'NL': 'Publicatieblad van de Europese Unie',  
                'PL': 'Dziennik Urzędowy Unii Europejskiej',  
                'PT': 'Jornal Oficial da União Europeia',     
                'RO': 'Jurnalul Oficial al Uniunii Europene', 
                'SK': 'Úradný vestník Európskej únie',        
                'SL': 'Uradni list Evropske unije',            
                'SV': 'Europeiska unionens officiella tidning'})

The following code reads the pdfs, deletes the headers from all pages and saves the clean text to a .txt file.

DA = dict()
files = [f for f in os.listdir(da_path) if os.path.isfile(os.path.join(da_path, f))]    
for language in languages:
    if not("Delegated_Acts_" + language + ".txt" in files):
        # reading pages from pdf file
        da_pdf = fitz.open(da_path + 'Solvency II Delegated Acts - ' + language + '.pdf')
        da_pages = [page.getText(output = "text") for page in da_pdf]
        da_pdf.close()
        # deleting page headers
        header = "17.1.2015\\s+L\\s+\\d+/\\d+\\s+" + DA_dict[language].replace(' ','\\s+') + "\\s+" + language + "\\s+"
        da_pages = [re.sub(header, '', page) for page in da_pages]
        DA[language] = ''.join(da_pages)
        # some preliminary cleaning -> could be more 
        DA[language] = DA[language].replace('\xad ', '')
        # saving txt file
        da_txt = open(da_path + "Delegated_Acts_" + language + ".txt", "wb")
        da_txt.write(DA[language].encode('utf-8'))
        da_txt.close()
    else:
        # loading txt file
        da_txt = open(da_path + "Delegated_Acts_" + language + ".txt", "rb")
        DA[language] = da_txt.read().decode('utf-8')
        da_txt.close()

Step 3: retrieve the text within articles

Retrieving the text within articles is not straightforward. In English we have ‘Article 1 some text’, i.e. de word Article is put before the number. But some European languages put the word after the number and there are two languages, HU and LV, that put a dot between the number and the article. To be able to read the text within the articles we need to know this ordering (and we need of course the word for article in every language).

art_dict = dict({
                'BG': ['Член',      'pre'],
                'CS': ['Článek',    'pre'],
                'DA': ['Artikel',   'pre'],
                'DE': ['Artikel',   'pre'],
                'EL': ['Άρθρο',     'pre'],
                'EN': ['Article',   'pre'],
                'ES': ['Artículo',  'pre'],
                'ET': ['Artikkel',  'pre'],
                'FI': ['artikla',   'post'],
                'FR': ['Article',   'pre'],
                'HR': ['Članak',    'pre'],
                'HU': ['cikk',      'postdot'],
                'IT': ['Articolo',  'pre'],
                'LT': ['straipsnis','post'],
                'LV': ['pants',     'postdot'],
                'MT': ['Artikolu',  'pre'],
                'NL': ['Artikel',   'pre'],
                'PL': ['Artykuł',   'pre'],
                'PT': ['Artigo',    'pre'],
                'RO': ['Articolul', 'pre'],
                'SK': ['Článok',    'pre'],
                'SL': ['Člen',      'pre'],
                'SV': ['Artikel',   'pre']})

Next we can define a regex to select the text within an article.

def retrieve_article(language, article_num):

    method = art_dict[language][1]
    
    if method == 'pre':
        string = art_dict[language][0] + ' ' + str(article_num) + '(.*?)' + art_dict[language][0] + ' ' + str(article_num + 1)
    elif method == 'post':
        string = str(article_num) + ' ' + art_dict[language][0] + '(.*?)' + str(article_num + 1) + ' ' + art_dict[language][0]
    elif method == 'postdot':
        string = str(article_num) + '. ' + art_dict[language][0] + '(.*?)' + str(article_num + 1) + '. ' + art_dict[language][0]

    r = re.compile(string, re.DOTALL)
            
    result = ' '.join(r.search(DA[language])[1].split())
            
    return result

Now we have a function that can retrieve the text of all the articles in the Delegated Acts for each European language.

Now we are able to read the text of the articles from the Delegated Acts. In the following we give three examples (article 292 with states the summary of the Solvency and Financial Conditions Report).

retrieve_article('EN', 292)
"Summary 1. The solvency and financial condition report shall include a clear and concise summary. The summary of the report
shall be understandable to policy holders and beneficiaries. 2. The
summary of the report shall highlight any material changes to the 
insurance or reinsurance undertaking's business and performance, 
system of governance, risk profile, valuation for solvency purposes 
and capital management over the reporting period."
retrieve_article('DE', 292)
'Zusammenfassung 1. Der Bericht über Solvabilität und Finanzlage 
enthält eine klare, knappe Zusammenfassung. Die Zusammenfassung des
Berichts ist für Versicherungsnehmer und Anspruchsberechtigte
verständlich. 2. In der Zusammenfassung werden etwaige wesentliche
Änderungen in Bezug auf Geschäftstätigkeit und Leistung des
Versicherungs- oder Rückversicherungsunternehmens, sein 
Governance-System, sein Risikoprofil, die Bewertung für 
Solvabilitätszwecke und das Kapitalmanagement im Berichtszeitraum 
herausgestellt.'
retrieve_article('EL', 292)
'Περίληψη 1. Η έκθεση φερεγγυότητας και χρηματοοικονομικής
κατάστασης περιλαμβάνει σαφή και σύντομη περίληψη. Η περίληψη της
έκθεσης πρέπει να είναι κατανοητή από τους αντισυμβαλλομένους και
τους δικαιούχους. 2. Η περίληψη της έκθεσης επισημαίνει τυχόν
ουσιώδεις αλλαγές όσον αφορά τη δραστηριότητα και τις επιδόσεις της
ασφαλιστικής και αντασφαλιστικής επιχείρησης, το σύστημα
διακυβέρνησης, το προφίλ κινδύνου, την εκτίμηση της αξίας για τους
σκοπούς φερεγγυότητας και τη διαχείριση κεφαλαίου κατά την περίοδο
αναφοράς.'

How to import the Solvency 2 RFR

Published by:

UPDATE: The code mentioned in this blog has been moved to https://github.com/wjwillemse/solvency2-data

If you want to make insurance calculations you often need an recent interest rate term structure. In this example we show how to import the Solvency 2 Risk Free Rate from the EIOPA website in a convenient Pandas Data Frame, ready to be used for future calculations.

The code can be found on https://github.com/wjwillemse/insurlib.

First we import insurlib.solvency2, the Python package that contains functions to generate the names of the files, import the zip-file from the EIOPA website, extract it to an Excel file (both stored on disk) and read the Excel file in a proper Pandas Data Frame.

In [1]: import pandas as pd
        from datetime import datetime
        from insurlib import solvency2

We have now all the functions we need.

The function that does all this is solvency2.rfr.read, it returns a Python dictionary with all information about the RFR of a certain reference date.

If you do not add a input datetime, i.e. solvency2.rfr.read(), then the function with use today() and you will receive the most recent published RFR.

In [2]: d = solvency2.rfr.read(datetime(2018,1,1))

What information is stored in the dictionary?

In [3]: d.keys()
Out[3]: dict_keys(['input_date', 
           'reference_date', 
           'url', 
           'path_zipfile', 
           'name_zipfile', 
           'path_excelfile', 
           'name_excelfile', 
           'metadata', 
           'RFR_spot_no_VA', 
           'RFR_spot_with_VA'])

Let’s take a look at the individual elements of the dictionary.

The original date by which the function was called is stored in the dictionary as input_date

In [4]: d['input_date']
Out[4]: datetime.datetime(2018, 1, 1, 0, 0)

You can call the function with any date and the function will generate a proper reference date from it. The reference date is the most recent end of the month prior to the input date. So if for example the input is datetime(2018, 1, 1) then the reference date is '20171231', because this the most recent end of the month prior to the input date. The reference date is a string because it is used in the name of the files to be downloaded from the EIOPA-website.

In [5]: d['reference_date']
Out[5]: '20171231'

Furthermore the url, location and filenames are stored in the dictionary.

In [6]: print(d['url'])
        print(d['name_zipfile'])
        print(d['name_excelfile'])
Out[6]: https://eiopa.europa.eu/Publications/Standards/
        EIOPA_RFR_20171231.zip
        EIOPA_RFR_20171231_Term_Structures.xlsx

Now, let’s take a look at the available RFR’s.

In [7]: d['metadata'].columns
Out[7]: Index(['Euro', 'Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 'Czech Republic', 'Denmark', 'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Hungary', 'Iceland', 'Ireland', 'Italy', 'Latvia', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Malta', 'Netherlands', 'Norway', 'Poland', 'Portugal', 'Romania', 'Russia', 'Slovakia', 'Slovenia', 'Spain', 'Sweden', 'Switzerland', 'United Kingdom', 'Australia', 'Brazil', 'Canada', 'Chile', 'China', 'Colombia', 'Hong Kong', 'India', 'Japan', 'Malaysia', 'Mexico', 'New Zealand', 'Singapore', 'South Africa', 'South Korea', 'Taiwan', 'Thailand', 'Turkey', 'United States'],    dtype='object')

To get all the metadata of the French RFR we select metadata from the dictionary.

In [8]: d['metadata'].loc[:,'France']
Out[8]: Info              FR_31_12_2017_SWP_LLP_20_EXT_40_UFR_4.2
Coupon_freq                                             1
LLP                                                    20
Convergence                                            40
UFR                                                   4.2
alpha                                            0.126759
CRA                                                    10
VA                                                      4
reference date                                   20171231
Name: France, dtype: object

That is: the Coupon frequency is one year, the Last Liquid Point is 20 years, the Convergence period is 40 years, the Ultimate Forward Rate is 4.2 (this rate changed at the beginning of 2018 to 4.05), the alpha parameter of the Smith-Wilson algorithm is 0.126759, the Credit Rate Adjustment is 10 basis points and because we have the curve without the Volatility adjustment the VA is 4 basis points. It is identical to the Euro curve.

To get one single item from the metadata we can use the following line (note that this is the UFR at the end of 2017).

In [9]: d['metadata'].loc["UFR",'Germany']
Out[9]: 4.2

To get the euro RFR without Volatility Adjustment (the first ten durations) we use

In [10]: d['RFR_spot_no_VA']['Euro'].head(10)
Out[10]: Duration
1    -0.00358
2     -0.0025
3    -0.00088
4     0.00069
5     0.00209
6     0.00347
7     0.00469
8     0.00585
9     0.00695
10    0.00802
Name: Euro, dtype: object

Now suppose that we want to store the RFR of six consecutive months into one Data Frame. This is how we can do that.

First we define ref_dates with the reference dates we want to acquire.

In [11]: ref_dates = pd.date_range(start='2018-01-01', periods = 6,freq = 'MS')
         ref_dates
Out[11]: DatetimeIndex(['2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01', '2018-06-01'], dtype='datetime64[ns]', freq='MS')

Then we use a Python list comprehension to obtain the RFR’s of the reference dates and we show the Data Frame with the first ten durations.

In [12]: rfr = [solvency2.rfr.read(ref_date)['RFR_spot_no_VA']['Euro'] for ref_date in ref_dates]
         df_euro = pd.DataFrame(data = rfr, index = ref_dates).T
         print(df_euro.head(10))

A list comprehension can also be used for the metadata. The following code obtains the metadata of the UK RFR.

In [13]: rfr = [solvency2.rfr.read(ref_date)['metadata']['United Kingdom'] for ref_date in ref_dates]
         print(pd.DataFrame(data = rfr, index = ref_dates))
Out[13]:           2018-01-01  2018-02-01  2018-03-01  2018-04-01  2018-05-01  \
Duration                                                               
1           -0.00358    -0.00363    -0.00352    -0.00362    -0.00358   
2           -0.00250    -0.00225    -0.00220    -0.00258    -0.00244   
3           -0.00088    -0.00020    -0.00022    -0.00083    -0.00065   
4            0.00069     0.00190     0.00178     0.00104     0.00120   
5            0.00209     0.00380     0.00361     0.00285     0.00286   
6            0.00347     0.00537     0.00521     0.00418     0.00441   
7            0.00469     0.00670     0.00666     0.00556     0.00577   
8            0.00585     0.00791     0.00793     0.00672     0.00698   
9            0.00695     0.00899     0.00906     0.00783     0.00809   
10           0.00802     0.00987     0.01007     0.00884     0.00911   

          2018-06-01  
Duration              
1           -0.00331  
2           -0.00236  
3           -0.00098  
4            0.00057  
5            0.00213  
6            0.00356  
7            0.00490  
8            0.00613  
9            0.00725  
10           0.00824