How to download and read the Solvency 2 legislation

Published by:

In our first Natural Language Processing project we will read the Solvency II legislation from the website of the European Union and extract the text within the articles by using regular expressions.

For this notebook, we have chosen the text of the Delegated Acts of Solvency II. This part of the Solvency II regulation is directly into force (because it is a Regulation) and the wording of the Delegated Acts is more detailed than the Solvency II Directive and very precise and internally consistent. This makes it suitable for NLP. From the text we are able to extract features and text data on Solvency II for our future projects.

The code of this notebook can be found in here

Step 1: data Retrieval

We use several packages to read and process the pdfs. For reading we use the fitz-package. Furthermore we need the re-package (regular expressions) for cleaning the text data.

import os
import re
import requests
import fitz

We want to read the Delegated Acts in all available languages. The languages of the European Union are Bulgarian (BG), Spanish (ES), Czech (CS), Danish (DA), German (DE), Estonian (ET), Greek (EL), English (EN), French (FR), Croatian (HR), Italian (IT), Latvian (LV), Lithuanian (LT), Hungarian (HU), Maltese (MT), Dutch (NL), Polish (PL), Portuguese (PT), Romanian (RO), Slovak (SK), Solvenian (SL), Finnish (FI), Swedish (SV).

languages = ['BG','ES','CS','DA','DE','ET','EL',
             'EN','FR','HR','IT','LV','LT','HU',
             'MT','NL','PL','PT','RO','SK','SL',
             'FI','SV']

The urls of the Delegated Acts of Solvency 2 are constructed for these languages by the following list comprehension.

urls = ['https://eur-lex.europa.eu/legal-content/' + lang +
        '/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN' 
        for lang in  languages]

The following for loop retrieves the pdfs of the Delegated Acts from the website of the European Union and stores them in da_path.

da_path = 'data/solvency ii/'
for index in range(len(urls)):
    filename = 'Solvency II Delegated Acts - ' + languages[index] + '.pdf'
    if not(os.path.isfile(da_path + filename)):
        r = requests.get(urls[index])
        f = open(da_path + filename,'wb+')
        f.write(r.content) 
        f.close()
 else:
        print("--> already read.")

Step 2: data cleaning

If you look at the pdfs then you see that each page has a header with page number and information about the legislation and the language. These headers must be deleted to access the articles in the text.

DA_dict = dict({
                'BG': 'Официален вестник на Европейския съюз',
                'CS': 'Úřední věstník Evropské unie',
                'DA': 'Den Europæiske Unions Tidende',
                'DE': 'Amtsblatt der Europäischen Union',
                'EL': 'Επίσημη Εφημερίδα της Ευρωπαϊκής Ένωσης',
                'EN': 'Official Journal of the European Union',
                'ES': 'Diario Oficial de la Unión Europea',
                'ET': 'Euroopa Liidu Teataja',           
                'FI': 'Euroopan unionin virallinen lehti',
                'FR': "Journal officiel de l'Union européenne",
                'HR': 'Službeni list Europske unije',         
                'HU': 'Az Európai Unió Hivatalos Lapja',      
                'IT': "Gazzetta ufficiale dell'Unione europea",
                'LT': 'Europos Sąjungos oficialusis leidinys',
                'LV': 'Eiropas Savienības Oficiālais Vēstnesis',
                'MT': 'Il-Ġurnal Uffiċjali tal-Unjoni Ewropea',
                'NL': 'Publicatieblad van de Europese Unie',  
                'PL': 'Dziennik Urzędowy Unii Europejskiej',  
                'PT': 'Jornal Oficial da União Europeia',     
                'RO': 'Jurnalul Oficial al Uniunii Europene', 
                'SK': 'Úradný vestník Európskej únie',        
                'SL': 'Uradni list Evropske unije',            
                'SV': 'Europeiska unionens officiella tidning'})

The following code reads the pdfs, deletes the headers from all pages and saves the clean text to a .txt file.

DA = dict()
files = [f for f in os.listdir(da_path) if os.path.isfile(os.path.join(da_path, f))]    
for language in languages:
    if not("Delegated_Acts_" + language + ".txt" in files):
        # reading pages from pdf file
        da_pdf = fitz.open(da_path + 'Solvency II Delegated Acts - ' + language + '.pdf')
        da_pages = [page.getText(output = "text") for page in da_pdf]
        da_pdf.close()
        # deleting page headers
        header = "17.1.2015\\s+L\\s+\\d+/\\d+\\s+" + DA_dict[language].replace(' ','\\s+') + "\\s+" + language + "\\s+"
        da_pages = [re.sub(header, '', page) for page in da_pages]
        DA[language] = ''.join(da_pages)
        # some preliminary cleaning -> could be more 
        DA[language] = DA[language].replace('\xad ', '')
        # saving txt file
        da_txt = open(da_path + "Delegated_Acts_" + language + ".txt", "wb")
        da_txt.write(DA[language].encode('utf-8'))
        da_txt.close()
    else:
        # loading txt file
        da_txt = open(da_path + "Delegated_Acts_" + language + ".txt", "rb")
        DA[language] = da_txt.read().decode('utf-8')
        da_txt.close()

Step 3: retrieve the text within articles

Retrieving the text within articles is not straightforward. In English we have ‘Article 1 some text’, i.e. de word Article is put before the number. But some European languages put the word after the number and there are two languages, HU and LV, that put a dot between the number and the article. To be able to read the text within the articles we need to know this ordering (and we need of course the word for article in every language).

art_dict = dict({
                'BG': ['Член',      'pre'],
                'CS': ['Článek',    'pre'],
                'DA': ['Artikel',   'pre'],
                'DE': ['Artikel',   'pre'],
                'EL': ['Άρθρο',     'pre'],
                'EN': ['Article',   'pre'],
                'ES': ['Artículo',  'pre'],
                'ET': ['Artikkel',  'pre'],
                'FI': ['artikla',   'post'],
                'FR': ['Article',   'pre'],
                'HR': ['Članak',    'pre'],
                'HU': ['cikk',      'postdot'],
                'IT': ['Articolo',  'pre'],
                'LT': ['straipsnis','post'],
                'LV': ['pants',     'postdot'],
                'MT': ['Artikolu',  'pre'],
                'NL': ['Artikel',   'pre'],
                'PL': ['Artykuł',   'pre'],
                'PT': ['Artigo',    'pre'],
                'RO': ['Articolul', 'pre'],
                'SK': ['Článok',    'pre'],
                'SL': ['Člen',      'pre'],
                'SV': ['Artikel',   'pre']})

Next we can define a regex to select the text within an article.

def retrieve_article(language, article_num):

    method = art_dict[language][1]
    
    if method == 'pre':
        string = art_dict[language][0] + ' ' + str(article_num) + '(.*?)' + art_dict[language][0] + ' ' + str(article_num + 1)
    elif method == 'post':
        string = str(article_num) + ' ' + art_dict[language][0] + '(.*?)' + str(article_num + 1) + ' ' + art_dict[language][0]
    elif method == 'postdot':
        string = str(article_num) + '. ' + art_dict[language][0] + '(.*?)' + str(article_num + 1) + '. ' + art_dict[language][0]

    r = re.compile(string, re.DOTALL)
            
    result = ' '.join(r.search(DA[language])[1].split())
            
    return result

Now we have a function that can retrieve the text of all the articles in the Delegated Acts for each European language.

Now we are able to read the text of the articles from the Delegated Acts. In the following we give three examples (article 292 with states the summary of the Solvency and Financial Conditions Report).

retrieve_article('EN', 292)
"Summary 1. The solvency and financial condition report shall include a clear and concise summary. The summary of the report
shall be understandable to policy holders and beneficiaries. 2. The
summary of the report shall highlight any material changes to the 
insurance or reinsurance undertaking's business and performance, 
system of governance, risk profile, valuation for solvency purposes 
and capital management over the reporting period."
retrieve_article('DE', 292)
'Zusammenfassung 1. Der Bericht über Solvabilität und Finanzlage 
enthält eine klare, knappe Zusammenfassung. Die Zusammenfassung des
Berichts ist für Versicherungsnehmer und Anspruchsberechtigte
verständlich. 2. In der Zusammenfassung werden etwaige wesentliche
Änderungen in Bezug auf Geschäftstätigkeit und Leistung des
Versicherungs- oder Rückversicherungsunternehmens, sein 
Governance-System, sein Risikoprofil, die Bewertung für 
Solvabilitätszwecke und das Kapitalmanagement im Berichtszeitraum 
herausgestellt.'
retrieve_article('EL', 292)
'Περίληψη 1. Η έκθεση φερεγγυότητας και χρηματοοικονομικής
κατάστασης περιλαμβάνει σαφή και σύντομη περίληψη. Η περίληψη της
έκθεσης πρέπει να είναι κατανοητή από τους αντισυμβαλλομένους και
τους δικαιούχους. 2. Η περίληψη της έκθεσης επισημαίνει τυχόν
ουσιώδεις αλλαγές όσον αφορά τη δραστηριότητα και τις επιδόσεις της
ασφαλιστικής και αντασφαλιστικής επιχείρησης, το σύστημα
διακυβέρνησης, το προφίλ κινδύνου, την εκτίμηση της αξίας για τους
σκοπούς φερεγγυότητας και τη διαχείριση κεφαλαίου κατά την περίοδο
αναφοράς.'

Dutch insurance market visualization

Published by:

In this example we will analyze the Dutch insurance market with two machine learning algorithms, t-SNE, a technique for dimensionality reduction developed by Laurens van der Maat, combined with KMeans, an algorithm to find clusters in the data.

We use publicly available register data of all Dutch insurance undertakings that we web scraped from the DNB website (public-register). This register contains the specific license and the lines of business (LoB’s) in which an insurance undertaking is allowed to operate.

Insurance undertakings in the European Union have different types of licenses, for example life or non-life and where the undertaking is based and where it is allowed to operate. They sell different kinds of insurance products, i.e. they operate in different LoB’s (for example motor vehicle insurance, general life insurance or health insurance). There are 6 LoB’s for life and 19 LoB’s for non-life.

There are groups of insurance undertakings that, looking at their lines of business, are similar. For example some undertakings are small and specialized and sell products only within a very limited number of lines of business. Other larger general insurance undertakings sell insurance products from all lines of business.

How can we find these clusters (groups) of insurance undertakings that are similar with respect to their set of lines of business? We will create a vector per insurance undertakings with their allowed lines of business. Then we will use the t-SNE algorithm to reduce this vector to a 2d vector such that we can plot it in a 2d plane. Undertakings with similar LoB sets are then plotted near to each other. Then we detect the clusters with the kMeans algorithm.

I won’t give all the Python code of the notebook because it is somewhat cumbersome to obtain the license data from the DNB register, but if you are interested, a part can be found here.

import pandas as pd
import numpy as np
import ast
import matplotlib.pyplot as pyplot
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

Step 1: reading the data

Earlier, we stored the license and lines of business data in a csv file.

filename = 'data/licenses_insurers.csv'
df = pd.read_csv(filename, delimiter = ',', encoding = 'utf-8')

Step 2: data preparation

First we need some data preparation. The .csv contains string data that has to be interpreted. First we extract the data on the licenses and the lines of business. If we look at the available licenses we get:

List of available licenses:
2:27 lid 1 SII-schadeverzekeraar
2:27(1) SII insurer
2:45(1) Carrying on the business of a life insurer (non-EEA)
2:45(1) Carrying on the business of a non-life insurer (non-EEA)
2:45(1) non-EEA-based life insurer providing services to NL
2:45(1) non-EEA-based non-life insurer providing services to NL
2:48(1) life insurer with low magnitude of risk
2:48(1) non-life insurer with low magnitude of risk
2:48(1) pre-paid funeral services insurer with low magnitude of risk
Section 1:104(3) Business being wound up

Apparently not all items have been translated to English.

Insurers with low magnitude of risk are too small for Solvency II regulation and for these insurers there is a (simplified) regime in place. Some entities are active from outside the European Economic Area (EEA). The codes refer to the articles in the Dutch Financial Supervision Act.

Now we can get the list of lines of business:

List of available lines of business:
L01. Life insurance - general
L02. Life insurance related to marriage or birth
L03. Life insurance linked to common funds
L05. Holdings in savings pools
L06. Capitalisation activities
L07. Collective pension funds management
S01. Accident insurance
S02. Health insurance
S03. Motor vehicle insurance
S04. Railway rolling stock insurance
S05. Aircraft hull insurance
S06. Marine hull insurance
S07. Goods-in-transit insurance
S08. Fire and natural forces insurance
S09. Other property damage insurance
S10a. Motor vehicle liability insurance
S10b. Road transport liability insurance
S11. Aircraft liability insurance
S12. Marine liability insurance (sea, lake & river and canal vessels)
S13. General liability insurance
S14. Credit insurance
S15. Suretyship
S16. Pecuniary loss insurance
S17. Legal assistance insurance
S18. Assistance

This list of lines of business is applied across the European Union and was already in place before the Solvency II regulation.

We can obtain a data frame df_entities_lobs with all insurance undertakings with their allowed lines of business (25 possible LoB’s). That will be the input for the t-SNE algorithm.

Step 3: data analysis

We will use the t-SNE algorithm from the package sklearn.manifold. The input for the algorithm is the data frame with dummy encoded lines of business per entity prepared previously.

X = df_entities_lobs
Y = TSNE(n_components = 2, 
         perplexity = 18, 
         verbose = 1, 
         random_state = 1).fit_transform(X)

Next we use the k-means algorithm to determine the clusters in the Dutch insurance market. It appears that there are about eight clusters that are identifiable.

kmeans = KMeans(n_clusters = 8, random_state = 0, n_init  = 10).fit(Y)

Step 4: data visualization

The last step is to visualize the results of the t-SNE algorithm. For this we first produce the labels of the clusters with the average number of lines of business and the two most dominant lines of business. With that information we can describe the basic properties of the cluster.

Results of the tsne algorithm of the lines of business of Dutch insurance undertakings

Cluster 1 and 5 are the two clusters with life insurance undertakings. Cluster 5 consists of general life insurance undertakings with a broad line of products in difference life insurance lines of business. Cluster 1 consists of specialized life insurance undertakings with only one line of business. These undertakings are often relatively small.

The other clusters are consists of non-life insurance undertakings. Cluster 2 is an easily identifiable cluster with health insurance undertakings, with exacly two lines of business (accident and health insurance). Some health insurance undertakings also have some other lines of business; they form more general health insurance undertakings cluster 6 (near the uniform cluster 2).

Then we have cluster 3 with general non-life insurance undertakings operating a large number of lines of business (13 on average). Cluster 7 consists of medium non-life insurance undertakings with less lines of business on average. And cluster 4 and cluster 0 are specialized and often small non-life insurance undertaking with 1 or 2 lines of business on average (cluster 0: specialized property damage and fire insurance undertakings, and cluster 4: specialized general liability and legal assistance insurance undertakings).

We can find the corresponding undertakings for each cluster. For example the specialized small life insurance undertaking (cluster 1):

list(df_entities_lobs.index[kmeans.labels_ == 1])
['DELA Natura- en levensverzekeringen N.V.',
 'Isle of Man Assurance Limited',
 'Monuta Verzekeringen N.V.',
 'N.V. Noordhollandsche van 1816, Levensverzekeringsmaatschappij',
 'Nordben Life and Pension Insurance Co. Limited',
 "Onderling Fonds 'Sliedrecht' B.A.",
 'Tiels Onderling Fonds tot Uitkering bij Overlijden Gustaaf Adolf U.A.',
 'Yarden Uitvaartverzekeringen N.V.',
 'Zurich Life Insurance Company Limited']

How to import the Solvency 2 RFR

Published by:

UPDATE: The code mentioned in this blog has been moved to https://github.com/wjwillemse/solvency2-data

If you want to make insurance calculations you often need an recent interest rate term structure. In this example we show how to import the Solvency 2 Risk Free Rate from the EIOPA website in a convenient Pandas Data Frame, ready to be used for future calculations.

The code can be found on https://github.com/wjwillemse/insurlib.

First we import insurlib.solvency2, the Python package that contains functions to generate the names of the files, import the zip-file from the EIOPA website, extract it to an Excel file (both stored on disk) and read the Excel file in a proper Pandas Data Frame.

In [1]: import pandas as pd
        from datetime import datetime
        from insurlib import solvency2

We have now all the functions we need.

The function that does all this is solvency2.rfr.read, it returns a Python dictionary with all information about the RFR of a certain reference date.

If you do not add a input datetime, i.e. solvency2.rfr.read(), then the function with use today() and you will receive the most recent published RFR.

In [2]: d = solvency2.rfr.read(datetime(2018,1,1))

What information is stored in the dictionary?

In [3]: d.keys()
Out[3]: dict_keys(['input_date', 
           'reference_date', 
           'url', 
           'path_zipfile', 
           'name_zipfile', 
           'path_excelfile', 
           'name_excelfile', 
           'metadata', 
           'RFR_spot_no_VA', 
           'RFR_spot_with_VA'])

Let’s take a look at the individual elements of the dictionary.

The original date by which the function was called is stored in the dictionary as input_date

In [4]: d['input_date']
Out[4]: datetime.datetime(2018, 1, 1, 0, 0)

You can call the function with any date and the function will generate a proper reference date from it. The reference date is the most recent end of the month prior to the input date. So if for example the input is datetime(2018, 1, 1) then the reference date is '20171231', because this the most recent end of the month prior to the input date. The reference date is a string because it is used in the name of the files to be downloaded from the EIOPA-website.

In [5]: d['reference_date']
Out[5]: '20171231'

Furthermore the url, location and filenames are stored in the dictionary.

In [6]: print(d['url'])
        print(d['name_zipfile'])
        print(d['name_excelfile'])
Out[6]: https://eiopa.europa.eu/Publications/Standards/
        EIOPA_RFR_20171231.zip
        EIOPA_RFR_20171231_Term_Structures.xlsx

Now, let’s take a look at the available RFR’s.

In [7]: d['metadata'].columns
Out[7]: Index(['Euro', 'Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 'Czech Republic', 'Denmark', 'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Hungary', 'Iceland', 'Ireland', 'Italy', 'Latvia', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Malta', 'Netherlands', 'Norway', 'Poland', 'Portugal', 'Romania', 'Russia', 'Slovakia', 'Slovenia', 'Spain', 'Sweden', 'Switzerland', 'United Kingdom', 'Australia', 'Brazil', 'Canada', 'Chile', 'China', 'Colombia', 'Hong Kong', 'India', 'Japan', 'Malaysia', 'Mexico', 'New Zealand', 'Singapore', 'South Africa', 'South Korea', 'Taiwan', 'Thailand', 'Turkey', 'United States'],    dtype='object')

To get all the metadata of the French RFR we select metadata from the dictionary.

In [8]: d['metadata'].loc[:,'France']
Out[8]: Info              FR_31_12_2017_SWP_LLP_20_EXT_40_UFR_4.2
Coupon_freq                                             1
LLP                                                    20
Convergence                                            40
UFR                                                   4.2
alpha                                            0.126759
CRA                                                    10
VA                                                      4
reference date                                   20171231
Name: France, dtype: object

That is: the Coupon frequency is one year, the Last Liquid Point is 20 years, the Convergence period is 40 years, the Ultimate Forward Rate is 4.2 (this rate changed at the beginning of 2018 to 4.05), the alpha parameter of the Smith-Wilson algorithm is 0.126759, the Credit Rate Adjustment is 10 basis points and because we have the curve without the Volatility adjustment the VA is 4 basis points. It is identical to the Euro curve.

To get one single item from the metadata we can use the following line (note that this is the UFR at the end of 2017).

In [9]: d['metadata'].loc["UFR",'Germany']
Out[9]: 4.2

To get the euro RFR without Volatility Adjustment (the first ten durations) we use

In [10]: d['RFR_spot_no_VA']['Euro'].head(10)
Out[10]: Duration
1    -0.00358
2     -0.0025
3    -0.00088
4     0.00069
5     0.00209
6     0.00347
7     0.00469
8     0.00585
9     0.00695
10    0.00802
Name: Euro, dtype: object

Now suppose that we want to store the RFR of six consecutive months into one Data Frame. This is how we can do that.

First we define ref_dates with the reference dates we want to acquire.

In [11]: ref_dates = pd.date_range(start='2018-01-01', periods = 6,freq = 'MS')
         ref_dates
Out[11]: DatetimeIndex(['2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01', '2018-06-01'], dtype='datetime64[ns]', freq='MS')

Then we use a Python list comprehension to obtain the RFR’s of the reference dates and we show the Data Frame with the first ten durations.

In [12]: rfr = [solvency2.rfr.read(ref_date)['RFR_spot_no_VA']['Euro'] for ref_date in ref_dates]
         df_euro = pd.DataFrame(data = rfr, index = ref_dates).T
         print(df_euro.head(10))

A list comprehension can also be used for the metadata. The following code obtains the metadata of the UK RFR.

In [13]: rfr = [solvency2.rfr.read(ref_date)['metadata']['United Kingdom'] for ref_date in ref_dates]
         print(pd.DataFrame(data = rfr, index = ref_dates))
Out[13]:           2018-01-01  2018-02-01  2018-03-01  2018-04-01  2018-05-01  \
Duration                                                               
1           -0.00358    -0.00363    -0.00352    -0.00362    -0.00358   
2           -0.00250    -0.00225    -0.00220    -0.00258    -0.00244   
3           -0.00088    -0.00020    -0.00022    -0.00083    -0.00065   
4            0.00069     0.00190     0.00178     0.00104     0.00120   
5            0.00209     0.00380     0.00361     0.00285     0.00286   
6            0.00347     0.00537     0.00521     0.00418     0.00441   
7            0.00469     0.00670     0.00666     0.00556     0.00577   
8            0.00585     0.00791     0.00793     0.00672     0.00698   
9            0.00695     0.00899     0.00906     0.00783     0.00809   
10           0.00802     0.00987     0.01007     0.00884     0.00911   

          2018-06-01  
Duration              
1           -0.00331  
2           -0.00236  
3           -0.00098  
4            0.00057  
5            0.00213  
6            0.00356  
7            0.00490  
8            0.00613  
9            0.00725  
10           0.00824