Daily Archives: 7 Dec 2018

How to download and read the Solvency 2 legislation

Published by:

In our first Natural Language Processing project we will read the Solvency II legislation from the website of the European Union and extract the text within the articles by using regular expressions.

For this notebook, we have chosen the text of the Delegated Acts of Solvency II. This part of the Solvency II regulation is directly into force (because it is a Regulation) and the wording of the Delegated Acts is more detailed than the Solvency II Directive and very precise and internally consistent. This makes it suitable for NLP. From the text we are able to extract features and text data on Solvency II for our future projects.

The code of this notebook can be found in here

Step 1: data Retrieval

We use several packages to read and process the pdfs. For reading we use the fitz-package. Furthermore we need the re-package (regular expressions) for cleaning the text data.

import os
import re
import requests
import fitz

We want to read the Delegated Acts in all available languages. The languages of the European Union are Bulgarian (BG), Spanish (ES), Czech (CS), Danish (DA), German (DE), Estonian (ET), Greek (EL), English (EN), French (FR), Croatian (HR), Italian (IT), Latvian (LV), Lithuanian (LT), Hungarian (HU), Maltese (MT), Dutch (NL), Polish (PL), Portuguese (PT), Romanian (RO), Slovak (SK), Solvenian (SL), Finnish (FI), Swedish (SV).

languages = ['BG','ES','CS','DA','DE','ET','EL',

The urls of the Delegated Acts of Solvency 2 are constructed for these languages by the following list comprehension.

urls = ['https://eur-lex.europa.eu/legal-content/' + lang +
        for lang in  languages]

The following for loop retrieves the pdfs of the Delegated Acts from the website of the European Union and stores them in da_path.

da_path = 'data/solvency ii/'
for index in range(len(urls)):
    filename = 'Solvency II Delegated Acts - ' + languages[index] + '.pdf'
    if not(os.path.isfile(da_path + filename)):
        r = requests.get(urls[index])
        f = open(da_path + filename,'wb+')
        print("--> already read.")

Step 2: data cleaning

If you look at the pdfs then you see that each page has a header with page number and information about the legislation and the language. These headers must be deleted to access the articles in the text.

DA_dict = dict({
                'BG': 'Официален вестник на Европейския съюз',
                'CS': 'Úřední věstník Evropské unie',
                'DA': 'Den Europæiske Unions Tidende',
                'DE': 'Amtsblatt der Europäischen Union',
                'EL': 'Επίσημη Εφημερίδα της Ευρωπαϊκής Ένωσης',
                'EN': 'Official Journal of the European Union',
                'ES': 'Diario Oficial de la Unión Europea',
                'ET': 'Euroopa Liidu Teataja',           
                'FI': 'Euroopan unionin virallinen lehti',
                'FR': "Journal officiel de l'Union européenne",
                'HR': 'Službeni list Europske unije',         
                'HU': 'Az Európai Unió Hivatalos Lapja',      
                'IT': "Gazzetta ufficiale dell'Unione europea",
                'LT': 'Europos Sąjungos oficialusis leidinys',
                'LV': 'Eiropas Savienības Oficiālais Vēstnesis',
                'MT': 'Il-Ġurnal Uffiċjali tal-Unjoni Ewropea',
                'NL': 'Publicatieblad van de Europese Unie',  
                'PL': 'Dziennik Urzędowy Unii Europejskiej',  
                'PT': 'Jornal Oficial da União Europeia',     
                'RO': 'Jurnalul Oficial al Uniunii Europene', 
                'SK': 'Úradný vestník Európskej únie',        
                'SL': 'Uradni list Evropske unije',            
                'SV': 'Europeiska unionens officiella tidning'})

The following code reads the pdfs, deletes the headers from all pages and saves the clean text to a .txt file.

DA = dict()
files = [f for f in os.listdir(da_path) if os.path.isfile(os.path.join(da_path, f))]    
for language in languages:
    if not("Delegated_Acts_" + language + ".txt" in files):
        # reading pages from pdf file
        da_pdf = fitz.open(da_path + 'Solvency II Delegated Acts - ' + language + '.pdf')
        da_pages = [page.getText(output = "text") for page in da_pdf]
        # deleting page headers
        header = "17.1.2015\\s+L\\s+\\d+/\\d+\\s+" + DA_dict[language].replace(' ','\\s+') + "\\s+" + language + "\\s+"
        da_pages = [re.sub(header, '', page) for page in da_pages]
        DA[language] = ''.join(da_pages)
        # some preliminary cleaning -> could be more 
        DA[language] = DA[language].replace('\xad ', '')
        # saving txt file
        da_txt = open(da_path + "Delegated_Acts_" + language + ".txt", "wb")
        # loading txt file
        da_txt = open(da_path + "Delegated_Acts_" + language + ".txt", "rb")
        DA[language] = da_txt.read().decode('utf-8')

Step 3: retrieve the text within articles

Retrieving the text within articles is not straightforward. In English we have ‘Article 1 some text’, i.e. de word Article is put before the number. But some European languages put the word after the number and there are two languages, HU and LV, that put a dot between the number and the article. To be able to read the text within the articles we need to know this ordering (and we need of course the word for article in every language).

art_dict = dict({
                'BG': ['Член',      'pre'],
                'CS': ['Článek',    'pre'],
                'DA': ['Artikel',   'pre'],
                'DE': ['Artikel',   'pre'],
                'EL': ['Άρθρο',     'pre'],
                'EN': ['Article',   'pre'],
                'ES': ['Artículo',  'pre'],
                'ET': ['Artikkel',  'pre'],
                'FI': ['artikla',   'post'],
                'FR': ['Article',   'pre'],
                'HR': ['Članak',    'pre'],
                'HU': ['cikk',      'postdot'],
                'IT': ['Articolo',  'pre'],
                'LT': ['straipsnis','post'],
                'LV': ['pants',     'postdot'],
                'MT': ['Artikolu',  'pre'],
                'NL': ['Artikel',   'pre'],
                'PL': ['Artykuł',   'pre'],
                'PT': ['Artigo',    'pre'],
                'RO': ['Articolul', 'pre'],
                'SK': ['Článok',    'pre'],
                'SL': ['Člen',      'pre'],
                'SV': ['Artikel',   'pre']})

Next we can define a regex to select the text within an article.

def retrieve_article(language, article_num):

    method = art_dict[language][1]
    if method == 'pre':
        string = art_dict[language][0] + ' ' + str(article_num) + '(.*?)' + art_dict[language][0] + ' ' + str(article_num + 1)
    elif method == 'post':
        string = str(article_num) + ' ' + art_dict[language][0] + '(.*?)' + str(article_num + 1) + ' ' + art_dict[language][0]
    elif method == 'postdot':
        string = str(article_num) + '. ' + art_dict[language][0] + '(.*?)' + str(article_num + 1) + '. ' + art_dict[language][0]

    r = re.compile(string, re.DOTALL)
    result = ' '.join(r.search(DA[language])[1].split())
    return result

Now we have a function that can retrieve the text of all the articles in the Delegated Acts for each European language.

Now we are able to read the text of the articles from the Delegated Acts. In the following we give three examples (article 292 with states the summary of the Solvency and Financial Conditions Report).

retrieve_article('EN', 292)
"Summary 1. The solvency and financial condition report shall include a clear and concise summary. The summary of the report
shall be understandable to policy holders and beneficiaries. 2. The
summary of the report shall highlight any material changes to the 
insurance or reinsurance undertaking's business and performance, 
system of governance, risk profile, valuation for solvency purposes 
and capital management over the reporting period."
retrieve_article('DE', 292)
'Zusammenfassung 1. Der Bericht über Solvabilität und Finanzlage 
enthält eine klare, knappe Zusammenfassung. Die Zusammenfassung des
Berichts ist für Versicherungsnehmer und Anspruchsberechtigte
verständlich. 2. In der Zusammenfassung werden etwaige wesentliche
Änderungen in Bezug auf Geschäftstätigkeit und Leistung des
Versicherungs- oder Rückversicherungsunternehmens, sein 
Governance-System, sein Risikoprofil, die Bewertung für 
Solvabilitätszwecke und das Kapitalmanagement im Berichtszeitraum 
retrieve_article('EL', 292)
'Περίληψη 1. Η έκθεση φερεγγυότητας και χρηματοοικονομικής
κατάστασης περιλαμβάνει σαφή και σύντομη περίληψη. Η περίληψη της
έκθεσης πρέπει να είναι κατανοητή από τους αντισυμβαλλομένους και
τους δικαιούχους. 2. Η περίληψη της έκθεσης επισημαίνει τυχόν
ουσιώδεις αλλαγές όσον αφορά τη δραστηριότητα και τις επιδόσεις της
ασφαλιστικής και αντασφαλιστικής επιχείρησης, το σύστημα
διακυβέρνησης, το προφίλ κινδύνου, την εκτίμηση της αξίας για τους
σκοπούς φερεγγυότητας και τη διαχείριση κεφαλαίου κατά την περίοδο

Dutch insurance market visualization

Published by:

In this example we will analyze the Dutch insurance market with two machine learning algorithms, t-SNE, a technique for dimensionality reduction developed by Laurens van der Maat, combined with KMeans, an algorithm to find clusters in the data.

We use publicly available register data of all Dutch insurance undertakings that we web scraped from the DNB website (public-register). This register contains the specific license and the lines of business (LoB’s) in which an insurance undertaking is allowed to operate.

Insurance undertakings in the European Union have different types of licenses, for example life or non-life and where the undertaking is based and where it is allowed to operate. They sell different kinds of insurance products, i.e. they operate in different LoB’s (for example motor vehicle insurance, general life insurance or health insurance). There are 6 LoB’s for life and 19 LoB’s for non-life.

There are groups of insurance undertakings that, looking at their lines of business, are similar. For example some undertakings are small and specialized and sell products only within a very limited number of lines of business. Other larger general insurance undertakings sell insurance products from all lines of business.

How can we find these clusters (groups) of insurance undertakings that are similar with respect to their set of lines of business? We will create a vector per insurance undertakings with their allowed lines of business. Then we will use the t-SNE algorithm to reduce this vector to a 2d vector such that we can plot it in a 2d plane. Undertakings with similar LoB sets are then plotted near to each other. Then we detect the clusters with the kMeans algorithm.

I won’t give all the Python code of the notebook because it is somewhat cumbersome to obtain the license data from the DNB register, but if you are interested, a part can be found here.

import pandas as pd
import numpy as np
import ast
import matplotlib.pyplot as pyplot
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

Step 1: reading the data

Earlier, we stored the license and lines of business data in a csv file.

filename = 'data/licenses_insurers.csv'
df = pd.read_csv(filename, delimiter = ',', encoding = 'utf-8')

Step 2: data preparation

First we need some data preparation. The .csv contains string data that has to be interpreted. First we extract the data on the licenses and the lines of business. If we look at the available licenses we get:

List of available licenses:
2:27 lid 1 SII-schadeverzekeraar
2:27(1) SII insurer
2:45(1) Carrying on the business of a life insurer (non-EEA)
2:45(1) Carrying on the business of a non-life insurer (non-EEA)
2:45(1) non-EEA-based life insurer providing services to NL
2:45(1) non-EEA-based non-life insurer providing services to NL
2:48(1) life insurer with low magnitude of risk
2:48(1) non-life insurer with low magnitude of risk
2:48(1) pre-paid funeral services insurer with low magnitude of risk
Section 1:104(3) Business being wound up

Apparently not all items have been translated to English.

Insurers with low magnitude of risk are too small for Solvency II regulation and for these insurers there is a (simplified) regime in place. Some entities are active from outside the European Economic Area (EEA). The codes refer to the articles in the Dutch Financial Supervision Act.

Now we can get the list of lines of business:

List of available lines of business:
L01. Life insurance - general
L02. Life insurance related to marriage or birth
L03. Life insurance linked to common funds
L05. Holdings in savings pools
L06. Capitalisation activities
L07. Collective pension funds management
S01. Accident insurance
S02. Health insurance
S03. Motor vehicle insurance
S04. Railway rolling stock insurance
S05. Aircraft hull insurance
S06. Marine hull insurance
S07. Goods-in-transit insurance
S08. Fire and natural forces insurance
S09. Other property damage insurance
S10a. Motor vehicle liability insurance
S10b. Road transport liability insurance
S11. Aircraft liability insurance
S12. Marine liability insurance (sea, lake & river and canal vessels)
S13. General liability insurance
S14. Credit insurance
S15. Suretyship
S16. Pecuniary loss insurance
S17. Legal assistance insurance
S18. Assistance

This list of lines of business is applied across the European Union and was already in place before the Solvency II regulation.

We can obtain a data frame df_entities_lobs with all insurance undertakings with their allowed lines of business (25 possible LoB’s). That will be the input for the t-SNE algorithm.

Step 3: data analysis

We will use the t-SNE algorithm from the package sklearn.manifold. The input for the algorithm is the data frame with dummy encoded lines of business per entity prepared previously.

X = df_entities_lobs
Y = TSNE(n_components = 2, 
         perplexity = 18, 
         verbose = 1, 
         random_state = 1).fit_transform(X)

Next we use the k-means algorithm to determine the clusters in the Dutch insurance market. It appears that there are about eight clusters that are identifiable.

kmeans = KMeans(n_clusters = 8, random_state = 0, n_init  = 10).fit(Y)

Step 4: data visualization

The last step is to visualize the results of the t-SNE algorithm. For this we first produce the labels of the clusters with the average number of lines of business and the two most dominant lines of business. With that information we can describe the basic properties of the cluster.

Results of the tsne algorithm of the lines of business of Dutch insurance undertakings

Cluster 1 and 5 are the two clusters with life insurance undertakings. Cluster 5 consists of general life insurance undertakings with a broad line of products in difference life insurance lines of business. Cluster 1 consists of specialized life insurance undertakings with only one line of business. These undertakings are often relatively small.

The other clusters are consists of non-life insurance undertakings. Cluster 2 is an easily identifiable cluster with health insurance undertakings, with exacly two lines of business (accident and health insurance). Some health insurance undertakings also have some other lines of business; they form more general health insurance undertakings cluster 6 (near the uniform cluster 2).

Then we have cluster 3 with general non-life insurance undertakings operating a large number of lines of business (13 on average). Cluster 7 consists of medium non-life insurance undertakings with less lines of business on average. And cluster 4 and cluster 0 are specialized and often small non-life insurance undertaking with 1 or 2 lines of business on average (cluster 0: specialized property damage and fire insurance undertakings, and cluster 4: specialized general liability and legal assistance insurance undertakings).

We can find the corresponding undertakings for each cluster. For example the specialized small life insurance undertaking (cluster 1):

list(df_entities_lobs.index[kmeans.labels_ == 1])
['DELA Natura- en levensverzekeringen N.V.',
 'Isle of Man Assurance Limited',
 'Monuta Verzekeringen N.V.',
 'N.V. Noordhollandsche van 1816, Levensverzekeringsmaatschappij',
 'Nordben Life and Pension Insurance Co. Limited',
 "Onderling Fonds 'Sliedrecht' B.A.",
 'Tiels Onderling Fonds tot Uitkering bij Overlijden Gustaaf Adolf U.A.',
 'Yarden Uitvaartverzekeringen N.V.',
 'Zurich Life Insurance Company Limited']