Pattern discovery in Solvency 2 data (2)

Pattern discovery is a general technique to derive patterns from the data itself, without using explicit knowledge about the data. Previously, I described how to discover patterns in Solvency 2 data, for example that values of column x are most of the time or (almost) always equal to the values of column y, or higher, or lower, etc. I also showed how to find columns whose values are the sum of values of other columns, which often occurs in financial reports.

Here I will take pattern discovery one step further and discover constant ratios between values of columns. That is, patterns where the quotient between values of two columns equals a constant.

An example will clarify this. We take the own funds data of Dutch insurance undertakings (sheet 20 of the Excel file) and select the last year of data.

df = get_sheet(20)
df = df.xs(datetime(2017,12,31), axis = 0, level = 1, drop_level = False)

Now we run the generate-function from the insurlib package on Github. The generate-function accepts the parameter pattern = "ratio". Make sure the minimum support is higher than one, otherwise you end up with a lot of spurious ratio-patterns. Patterns are generated with the following code, with minimum support 3 and minimum confidence 10%.

rules = patterns.generate(dataframe = df,
                          pattern = "ratio",
                          parameters = {"min_confidence": 0.1, 
                                        "min_support": 3}))
rules = list(rules)

The first two patterns in the list are

[['ratio', '9/20', ['mcr , total', 'scr , total']], 19, 120, 0.1367]

and

[['ratio', '1/4', ['mcr , total', 'scr , total']], 36, 103, 0.259]

The two patterns state that the quotient between the values of the columns ‘mcr , total’ and ‘scr , total’ is 9/20 in 19 cases (with 120 exceptions to this rule), with a confidence of almost 14%, and 1/4 in 36 cases (with 103 exceptions), with a confidence of almost 26%. In Solvency 2, the acronym MCR stands for Minimum Capital Requirement, and SCR stands for Solvency Capital Requirement (the Solvency 2 risk-based capital requirement with a VaR of 99.5%). So the identified ratios are 9/20 (i.e. 45%) and 1/4 (i.e. 25%).

It is not desirable that the MCR is too low or too high in relation to the SCR. Solvency 2 legislation therefore prescribes that the MCR is bounded between 25% and 45% of the SCR. We see that these legal boundaries are found in the data itself. We have 36 insurance undertakings where the ratio is 25% and 19 undertakings where the ratio is 45%. The rest (should) have a ratio somewhere between 25% and 45%.

The algorithm applies a brute-force search for ratios between columns. It uses the fractions module of Python’s standard library to convert every quotient to a fraction, which enables the use of rational number arithmetic. For example:

>>> from fractions import Fraction
>>> Fraction(2.25)
Fraction(9, 4)

The fraction expression you see in the pattern definition is the string representation of the fraction object.

A common problem when dealing with quotients is that the original data set often contains quotients like 0.2499999 and 0.2500001, because values are rounded off to euros. With the fractions module this can be overcome by putting a limit on the denominator (in the generate-function the default is 1e7, but you can pass this as a parameter). The result is that the nearest fraction is found given the limit on the denominator.
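As a small illustration (my own example, not part of the insurlib code), the limit_denominator method of the standard library snaps such a rounded quotient to the nearest simple fraction:

>>> from fractions import Fraction
>>> Fraction(0.2499999).limit_denominator(100)
Fraction(1, 4)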

The other rules that were found in the own funds data are

[['ratio', '9/100', ['available and eligible own funds|total eligible own funds to meet the mcr , tier 2', 'scr , total']], 3, 12, 0.2]

[['ratio', '1/5', ['available and eligible own funds|total eligible own funds to meet the mcr , tier 2', 'mcr , total']], 12, 3, 0.8]

[['ratio', '3/20', ['available and eligible own funds|total eligible own funds to meet the scr , tier 3', 'scr , total']], 6, 15, 0.2857]

[['ratio', '1/3', ['available and eligible own funds|total eligible own funds to meet the scr , tier 3', 'mcr , total']], 3, 18, 0.1429]

These ratio-patterns are boundaries on the available and eligible own funds prescribed by Solvency 2 legislation. In the Netherlands they have a relatively low support because only a limited number of Dutch insurance undertakings hit the legal boundaries for tier 2 and tier 3 capital.

Here we looked at the own funds data. But many more patterns of this type can be found if you run the complete data set of Dutch insurance undertakings. It takes some time but then you will find in total 121 ratio-patterns (with minimum support of 3) covering loss-absorbing capacity of deferred taxes, impact of transitional measures, operational risk capital, etc. And of course other data sets than Solvency 2 data are possible.

These ratio-patterns work well because the Solvency 2 legislation contains legal boundaries that are represented in the data. Other, non-legal constant ratios can also be found, such as entity-specific ratios. You can find ratio-patterns for each insurance undertaking separately and then signal when a pattern is violated, as sketched below. This would work best with a relatively low limit on the denominator of the fractions, because we want to find constant ratios with a high confidence. But for this to work well you will need more than two years of data.
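A minimal sketch of that idea, assuming a dataframe with several reporting periods per undertaking (the undertaking name in the first index level) and assuming the generate-function accepts such per-entity dataframes in the same way as above; the thresholds are purely illustrative:

for undertaking in df.index.get_level_values(0).unique():
    df_entity = df.xs(undertaking, axis = 0, level = 0, drop_level = False)
    entity_rules = list(patterns.generate(dataframe = df_entity,
                                          pattern = "ratio",
                                          parameters = {"min_confidence": 0.9,
                                                        "min_support": 2}))
    # entity_rules now holds constant ratios specific to this undertaking;
    # a violation of such a ratio in a new reporting period can be flagged as a signal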

A brute-force approach works well for the Dutch public Solvency 2 data set (around 1270 dimensions). But for high-dimensional data it will take some time because of the number of possible combinations. Perhaps smarter ways exist to detect ratios more easily, for example via statistical correlations.
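One possible (untested) pre-filter along these lines: since a constant ratio between two columns implies an almost perfect linear correlation on their common nonzero entries, a correlation matrix could be used to narrow down the candidate column pairs before the exact fraction search; the threshold is illustrative:

corr = df.corr().abs()
candidates = [(a, b) for i, a in enumerate(corr.columns)
                     for b in corr.columns[i + 1:]
                     if corr.loc[a, b] > 0.99]
# run the exact ratio search only on the column pairs in 'candidates'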

The ratio-patterns that were discovered could be related to the Solvency 2 legislation by automatically reading ratios and percentages in the text. In the same manner, entity-specific ratios could be related to the SFCR documents of that entity. But that is for future work.

Pattern discovery in Solvency 2 data (1)

This blog describes some results with algorithms for pattern discovery in Solvency 2 quantitative reports. The main idea here is to automatically uncover patterns that are present in these reports. We want to find patterns that represent widely or commonly occurring situations, and possibly represent business rules and relations prescribed by the underlying legislation. If we are able to find these patterns, then we are also able to identify when data satisfy and do not satisfy the patterns.

How can we find these patterns? Initially, I started with association rules. This is a rule-based machine learning approach that leads to transparent and explainable results. However, for this approach the quantitative data has to be encoded into a set of features (for example by replacing every quantitative value with an appropriate nominal value), and with high-dimensional data this quickly becomes computationally expensive.

After some iterations I decided to program a pattern discovery algorithm especially designed for analyzing the quantitative data points of reports. The goal was to speed up the process while maintaining the association rules mining approach and performance measures. Below I will give some examples of how and which patterns can be found. The code is part of the insurlib package on Github. I used the public Solvency 2 data of Dutch insurers to find patterns in the quantitative data of these insurers (but applications to other data sets are possible).

First import pandas and the insurlib package. The patterns part consists of functions to generate patterns in the numerical columns of dataframes.

import pandas as pd
from insurlib import patterns

Now read the public Solvency 2 data as was described in How to analyze public Solvency 2 data of Dutch insurers (by reading the Excel file and defining the get_sheet function). To recall, the Excel file consists of the following worksheets:

  • Worksheet 14: balance sheet
  • Worksheet 15: premiums – life
  • Worksheet 16: premiums – non-life
  • Worksheet 17: technical provisions – life
  • Worksheet 18: technical provisions – non-life
  • Worksheet 19: transition and adjustments
  • Worksheet 20: own funds
  • Worksheet 21: solvency capital requirements – 1
  • Worksheet 22: solvency capital requirements – 2
  • Worksheet 23: minimum capital requirements
  • Worksheet 24: additional information life
  • Worksheet 25: additional information non-life
  • Worksheet 26: additional information reinsurance

Example 1: comparing two dataframes

Suppose we want to compare the worksheet balance sheet and the worksheet non-life technical provision and find the relations between the contents in these worksheets.

df1 = get_sheet(14)
df2 = get_sheet(18)
df2.columns = [str(df2.columns[i]) for i in range(len(df2.columns))]

The last line converts the multi-level columns to one level (so that we can compare them more easily with other dataframes).

You can generate patterns with the generate-function. Patterns that are found have the ‘association rule’ structure P -> Q. If you input two dataframes (P_dataframe and Q_dataframe) then all columns of P_dataframe are compared to all columns of Q_dataframe. The pattern we are looking for is '=', so patterns with corresponding values are found. You can also use other patterns, such as '<', '<=', '>', '>=' and '!='. A dict of parameters is also used as input, in this case containing the minimum confidence and the minimum support.

rules = patterns.generate(P_dataframe = df1,
                          Q_dataframe = df2,
                          pattern = "=",
                          parameters = {"min_confidence": 0.75, 
                                        "min_support": 10}))
rules = list(rules)
print("Number of rules: " + str(len(rules)))
Number of rules: 2

The output is a generator that we can convert to a list of rules. Two rules are found in this case. Let’s look at the first rule.

print(rules[0])
[
    ['=', 
    'assets|reinsurance recoverables from:|non-life and health 
    similar to non-life , solvency ii value', 
    "('total non-life obligation', 'Technical provisions -
    total|Recoverable from reinsurance contract/SPV and Finite
    Re after the adjustment for expected losses due to counterparty
    default
    - total')"], 
    127, 
    4, 
    0.9695
]

The first pattern states that the value of the reinsurance recoverables for non-life on the asset side of the balance sheet equals the value of the recoverables from reinsurance contracts in the technical provisions for non-life obligations in the technical provisions sheet. The rule has a confidence of almost 97% and a support of 127. This means that there are 127 occurrences of this pattern in the data, and that in 97% of all occurrences (with nonzero data points) the pattern holds. We also see that in four cases the pattern is not present, i.e. the reinsurance recoverables do not equal the recoverables in the technical provisions (these are presumably data errors).
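The reported numbers are consistent with the confidence being the number of occurrences divided by the occurrences plus the exceptions (my reading of the output, not a documented definition of the package):

support, exceptions = 127, 4
confidence = support / (support + exceptions)
print(round(confidence, 4))   # 0.9695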

The second rule that was found reads:

print(rules[1])
[
    ['=', 
    'liabilities|technical provisions – non-life , solvency ii value',
    "('total non-life obligation', 'Technical provisions -
    total|Technical provisions - total')"], 
    209, 
    6,
    0.9721
]

This rule also has a high confidence. It says that the value of the technical provisions for non-life in the balance sheet equals the value of the total technical provisions in the technical provision sheet. This is a plain consistency rule between the sheets. The six exceptions are presumably data errors.

At the moment of publication, neither rule was part of the automatic and predefined validation rules of the Solvency II reports that are part of the XBRL taxonomy (otherwise the confidence would have been 100%). But by analyzing the reports in this manner we were able to uncover them automatically.

Example 2: patterns of sums

Often financial reports contain sums within the report. We can analyze the column names to detect potential sums (often a hierarchy in the column names can be identified), but we can also find patterns of sums. The following code does that. We input the balance sheet dataframe and let the algorithm search for ‘sum’-patterns. The parameter sum_elements states the maximum number of elements in the sum (in this case three).

rules = patterns.generate(dataframe = df1,
                          pattern = "sum",
                          parameters = {"sum_elements": 3})
rules = list(rules)
print("Number of rules: " + str(len(rules)))
Number of rules: 7

Let’s take a look at the first rule:

print(rules[0])
[
    ['sum', 
    ['assets|investments (other than assets held for index-linked and 
     unit-linked contracts)|equities|equities - listed , solvency ii 
     value', 
     'assets|investments (other than assets held for index-linked and 
     unit-linked contracts)|equities|equities - unlisted , solvency ii
     value'], 
    'assets|investments (other than assets held for index-linked and 
     unit-linked contracts)|equities , solvency ii value'], 
    52, 
    0, 
    1.0
]

The rule states that the sum of the listed and unlisted equities equals the total equities (so equities are either listed or unlisted). This rule has a confidence of 100%, and there is presumably a validation rule for this within the reports. The remaining six rules that were found have the same structure. This is however somewhat computationally expensive.

Example 3: patterns with a given value

The last example searches for patterns with specific values. In this case we want to know in how many cases the investments are higher than zero. We can do this in the following way. We input the dataframe like in example 2 and we add a parameter columns and set it to the name of the column we want to investigate (in fact you can input a list of columns).

P = ['assets|investments (other than assets held for index-linked and unit-linked contracts) , solvency ii value']
rules = patterns.generate(dataframe = df1, 
                          pattern = ">", 
                          columns = P, 
                          value = 0, 
                          parameters = {'min_confidence': 0.75,
                                        'min_support': 1})
rules = list(rules)
print(rules[0])
[
    ['>', 
    'assets|investments (other than assets held for index-linked and 
    unit-linked contracts) , solvency ii value', 
    0], 
    275, 
    11, 
    0.9615
]

The value of investments is, with a confidence of 96%, higher than zero. In eleven cases the value is not higher than zero. This rule has a high confidence because, normally, insurers invest the premiums collected for insurance policies into a wide range of investment assets. If no list of columns is added, patterns are returned for all numerical columns in the dataframe, as in the sketch below.
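A minimal sketch of that variant, with the same parameters as above (whether the generate-function accepts exactly this call without the columns argument is my assumption, based on the description):

rules = patterns.generate(dataframe = df1,
                          pattern = ">",
                          value = 0,
                          parameters = {'min_confidence': 0.75,
                                        'min_support': 1})
rules = list(rules)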

The aim of these examples is to give a general idea of pattern discovery in Solvency 2 quantitative data. Numerous patterns can be found in this way by using the complete data set. And by using the measures confidence and support we can find patterns that are not exactly perfect but do provide information about the data, without having recourse to statistical methods. Data errors and specific situations that lead to exceptions in the data are not expressions of pure randomness and should therefore not be analyzed with statistical methods. With these patterns we are able to reconstruct basic relations in the data that provide information about the reports.

Of course, many improvements are possible in order to find more complex patterns (and that is why there is a ‘(1)’ in the title of this blog). Presumably all existing validation rules can be found in this manner, and much more. Hopefully I will be able to implement these improvements and present them in a new blog.

Text modeling with S2 SFCRs

European insurance undertakings are required to publish each year a Solvency and Financial Condition Report (SFCR). These SFCRs are often made available via the insurance undertaking’s website. In this blog I will show some first results of a text modeling exercise using these SFCRs.

Text modeling was done with Latent Dirichlet Allocation (LDA) using Mallet’s implementation via the gensim package (found here: https://radimrehurek.com/gensim/index.html). A description can be found here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/. LDA is an unsupervised learning algorithm that generates latent (hidden) distributions over topics for each document or sentence, and a distribution over words for each topic.

To get the data I scraped as many SFCRs (in all European languages) as I could find on the Internet. As a result of this I have a data set of 4.36 GB with around 2,500 SFCR documents in PDF-format (until proven otherwise, I probably have the largest library of SFCR documents in Europe). Among these are 395 SFCRs in the English language, consisting in total of 287,579 sentences and 8.1 million words.

In an SFCR an insurance undertaking publicly discloses information about a number of topics prescribed by the Solvency II legislation, such as its business and performance, system of governance, risk profile, valuation and capital management. Every SFCR therefore contains the same topics.

The LDA algorithm is able to find dominant keywords that represent each topic, given a set of documents. It is assumed that each document is about one topic. We want to use the LDA algorithm to identify the different topics within the SFCRs, such that, for example, we can extract all sentences about the solvency requirements. To do so, I run the LDA algorithm on sentences from the SFCRs (thereby assuming that each sentence is about one topic).

I followed the usual steps: some data preparation to read the PDF files properly, then selecting the top 9,000 words and the sentences with more than 10 words (it is known that the LDA algorithm does not work very well with rarely used words and with very short documents/sentences). I did not build bigram and trigram models because this did not really change the outcome. Then the data was lemmatized such that only nouns, adjectives, verbs and adverbs were kept. The spacy package provides functions to tag the data and select the allowed postags.

The main inputs for the LDA algorithm are a dictionary and a corpus. The dictionary contains all lemmatized words used in the documents, with a unique id for each word. The corpus is a mapping of word id to word frequency for each sentence. After we have generated these, we can run the LDA algorithm with the number of topics as one of the parameters.
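A minimal sketch of these steps with gensim, assuming a variable sentences that holds the lemmatized, tokenized sentences; the blog itself used the Mallet wrapper, here the plain gensim LdaModel is shown for simplicity:

from gensim import corpora, models

# sentences: list of lemmatized, tokenized sentences, e.g. [['risk', 'management', ...], ...]
dictionary = corpora.Dictionary(sentences)
corpus = [dictionary.doc2bow(sentence) for sentence in sentences]

lda_model = models.LdaModel(corpus = corpus,
                            id2word = dictionary,
                            num_topics = 9,
                            random_state = 0)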

The quality of the topic modeling is measured by the coherence score.

[Figure: the coherence score per number of topics]

A low number of topics that performs well is nine (coherence score 0.65); the highest coherence score is attained at 22 topics (0.67), which is fairly high in general. From this we conclude that nine topics is a good start.
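The coherence score can be computed with gensim’s CoherenceModel; a sketch reusing the names from the previous snippet (the exact coherence measure used in the blog is my assumption, 'c_v' is a common choice):

from gensim.models import CoherenceModel

coherence_model = CoherenceModel(model = lda_model,
                                 texts = sentences,
                                 dictionary = dictionary,
                                 coherence = 'c_v')
print(coherence_model.get_coherence())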

What does the LDA algorithm produce? It generates for each topic a list of keywords with weights that represent that topic. The weights indicate how strong the relation between the keyword and the topic is: the higher the weight, the more representative the word is for that specific topic. Below the first ten keywords are listed with their weights. The algorithm does not label the topic with one or two words, so for each topic I determined a description that more or less covers the topic (with the main subjects of the Solvency II legislation in mind).

Topic 0 ‘Governance’: 0.057*”management” + 0.051*”board” + 0.049*”function” + 0.046*”internal” + 0.038*”committee” + 0.035*”audit” + 0.034*”control” + 0.032*”system” + 0.030*”compliance” + 0.025*”director”

Topic 1 ‘Valuation’: 0.067*”asset” + 0.054*”investment” + 0.036*”liability” + 0.030*”valuation” + 0.024*”cash” + 0.022*”balance” + 0.020*”tax” + 0.019*”cost” + 0.017*”account” + 0.016*”difference”

Topic 2 ‘Reporting and performance’: 0.083*”report” + 0.077*”solvency” + 0.077*”financial” + 0.038*”condition” + 0.032*”information” + 0.026*”performance” + 0.026*”group” + 0.025*”material” + 0.021*”december” + 0.018*”company”

Topic 3 ‘Solvency’: 0.092*”capital” + 0.059*”requirement” + 0.049*”solvency” + 0.039*”year” + 0.032*”scr” + 0.030*”fund” + 0.027*”model” + 0.024*”standard” + 0.021*”result” + 0.018*”base”

Topic 4 ‘Claims and assumptions’: 0.023*”claim” + 0.021*”term” + 0.019*”business” + 0.016*”assumption” + 0.016*”market” + 0.015*”future” + 0.014*”base” + 0.014*”product” + 0.013*”make” + 0.012*”increase”

Topic 5 ‘Undertaking’s strategy’: 0.039*”policy” + 0.031*”process” + 0.031*”business” + 0.030*”company” + 0.025*”ensure” + 0.022*”management” + 0.017*”plan” + 0.015*”manage” + 0.015*”strategy” + 0.015*”orsa”

Topic 6 ‘Risk management’: 0.325*”risk” + 0.030*”market” + 0.027*”rate” + 0.024*”change” + 0.022*”operational” + 0.021*”underwriting” + 0.019*”credit” + 0.019*”exposure” + 0.013*”interest” + 0.013*”liquidity”

Topic 7 ‘Insurance and technical provisions’: 0.049*”insurance” + 0.045*”reinsurance” + 0.043*”provision” + 0.039*”life” + 0.034*”technical” + 0.029*”total” + 0.025*”premium” + 0.023*”fund” + 0.020*”gross” + 0.019*”estimate”

Topic 8 ‘Undertaking’: 0.065*”company” + 0.063*”group” + 0.029*”insurance” + 0.029*”method” + 0.023*”limit” + 0.022*”include” + 0.017*”service” + 0.016*”limited” + 0.015*”specific” + 0.013*”mutual”

To determine the topic of a sentence we calculate for each topic the weights of the words in the sentence. The main topic of the sentence is then expected to be the topic with the highest score.
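A sketch of how such scores can be obtained from the model, reusing lda_model and dictionary from the snippet above (the helper name is mine):

def topic_scores(sentence_tokens):
    # convert a lemmatized, tokenized sentence to bag-of-words and score it against all topics
    bow = dictionary.doc2bow(sentence_tokens)
    return lda_model.get_document_topics(bow, minimum_probability = 0.0)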

If we run the following sentence (found in one of the SFCRs) through the model

"For the purposes of solvency, the Insurance Group’s insurance obligations
are divided into the following business segments: 1. Insurance with profit
participation 2. Unit-linked and index-linked insurance 3. Other life
insurance 4. Health insurance 5. Medical expence insurance for non-life
insurance 6. Income protection insurance for non-life insurance Pension &
Försäkring (Sweden) Pension & Försäkring offers insurance solutions on the
Swedish market within risk and unit-linked insurance and traditional life
insurance."

then we get the following results per topic:

[(0, 0.08960573476702509), 
(1, 0.0692951015531661),
(2, 0.0692951015531661),
(3, 0.06332138590203108),
(4, 0.08363201911589009),
(5, 0.0692951015531661),
(6, 0.08004778972520908),
(7, 0.3369175627240143),
(8, 0.13859020310633216)]

Topic seven (‘Insurance and technical provisions’) clearly has the highest score (0.34), followed by topic eight (‘Undertaking’). This suggests that these sentences are about the insurance obligations and technical provisions of the undertaking (which we can verify).

Likewise, for the sentence

"Chief Risk Officer and Risk Function 
The Board has appointed a Chief Risk Officer (CRO) who reports directly to
the Board and has responsibility for managing the risk function and
monitoring the effectiveness of the risk management system."

we get the following results:

[(0, 0.2926447574334898), 
(1, 0.08294209702660407),
(2, 0.07824726134585289),
(3, 0.07824726134585289),
(4, 0.07824726134585289),
(5, 0.08450704225352113),
(6, 0.14866979655712048),
(7, 0.07824726134585289),
(8, 0.07824726134585289)]

Here, topic zero (‘Governance’) and topic six (‘Risk management’) have the highest scores, which suggests that this sentence is about the governance of the insurance undertaking and, to a lesser extent, risk management.

The nine topics that were identified reflect fairly different elements in the SFCR, but we also see that some topics consist of several subtopics that could be identified separately. For example, the topic that I described as ‘Valuation’ covers assets and investments but it might be more appropriate to distinguish investment strategies from valuation. The topic ‘Solvency’ covers own funds as well as solvency requirements. If we increase the number of topics then some of the above topics will be split into more topics and the topic determination will be more accurate.

Once we have made the LDA model we can use the results for several applications. First, of course, we can use the model to determine the topics of previously unseen documents and sentences. We can also analyze topic distributions across different SFCRs, and we can retrieve similar sentences for any given sentence (based on the distance between the topic probability scores of the sentences), as sketched below.
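A sketch of that last idea, using the Hellinger distance between two topic distributions and the hypothetical topic_scores helper from above (sentence_a_tokens and sentence_b_tokens are assumed to be preprocessed token lists; the choice of distance is mine):

from gensim.matutils import hellinger

# a smaller distance means the two sentences have more similar topic distributions
distance = hellinger(topic_scores(sentence_a_tokens),
                     topic_scores(sentence_b_tokens))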

In this blog I described the first steps in text modeling of Solvency and Financial Condition Reports of insurance undertakings. The coherence scores were fairly high and the identified topics represented genuine topics from the Solvency II legislation, especially with a sufficient number of topics. Some examples showed that the LDA model is able to identify the topic of specific sentences. However, this does not yet work perfectly; an important element of SFCR documents is the numerical information, often stored in table form in the PDF, which is difficult to analyze with the LDA algorithm.

How to analyze public Solvency 2 data of Dutch insurers

In this blog we will use the public Solvency II data of all Dutch insurance undertakings and present it in one large data vector per undertaking. In doing so, we are able to use some easy but powerful machine learning algorithms to analyze the data. The notebook can be found here.

Solvency II data of individual insurance undertakings is published yearly by the Statistics department of DNB. The data represents the financial and solvency situation of each insurance undertaking per the end of each year. Currently, we have two years of data: ultimo 2016 and ultimo 2017. The publication of DNB is in the form of an Excel file with a number of worksheets containing the aggregated data and the individual data. Here we will use the individual data.

We will read the data into a Pandas Data Frame and use Numpy for data manipulations. Furthermore we need the datetime object.

import pandas as pd
import numpy as np
from datetime import datetime

You can find the data at the following URL:

https://statistiek.dnb.nl/en/downloads/index.aspx#/?statistics_type=toezichtdata&theme=verzekeraars

Download the file and make sure that the Excel file is accessible.

data_path = "../../20_local_data/"
xls = pd.ExcelFile(data_path + "Data individual insurers (year).xlsx")

The Excel file contains several worksheets with data. We want to combine all the data together in one Data Frame. To do that we need some data preparation and data cleaning for each worksheet.

In the following function a worksheet is put into a Data Frame and the column names are set to lower case. Then the index of the data frame is set to the insurance undertaking name and the reporting period. Then we perform some cleaning (the original worksheets contain some process information). In addition, some worksheets in the file contain 2-dimensional data that needs to be pivoted, such that we obtain one large vector with all the data per insurance undertaking in one row.

def get_sheet(num):
    # read entire Excel sheet
    df = xls.parse(num)
    # columns names to lower case
    df.columns = [c.lower() for c in df.columns]
    # set index to name and period
    df.set_index(['name', 'period'], inplace = True)
    # data cleaning (the excel sheet contains some
    # additional data that we don't need)
    drop_list = [i for i in df.columns 
                     if 'unnamed' in i or 'selectielijst' in i]
    df.drop(drop_list, axis = 1, inplace = True)
    # pivot data frame
    if "row_name" in df.columns:
        df.drop("row_name", axis = 1, inplace = True)
        df = df.pivot(columns = 'row_header')
    if df.columns.nlevels > 1:
        df.columns = [str(df.columns[i]) for i in
                          range(len(df.columns))]
    return df

Creating one large vector per insurer

With the function above we can read a part of the Excel file and store it in a Pandas data frame. The following worksheets are contained in the Excel file published by DNB.

  • Worksheet 14: balance sheet
  • Worksheet 15: premiums – life
  • Worksheet 16: premiums – non-life
  • Worksheet 17: technical provisions – 1
  • Worksheet 18: technical provisions – 2
  • Worksheet 19: transition and adjustments
  • Worksheet 20: own funds
  • Worksheet 21: solvency capital requirements – 1
  • Worksheet 22: solvency capital requirements – 2
  • Worksheet 23: minimum capital requirements
  • Worksheet 24: additional information life
  • Worksheet 25: additional information non-life
  • Worksheet 26: additional information reinsurance

Let’s read the first worksheet with data and then append the other sheets to it. We shall not read the last three worksheets, because these contain the country specific reports.

df = get_sheet(14)
for l in range(15, 24):
    df_temp = get_sheet(l)
    df = df.join(df_temp, rsuffix = "_"+ str(l))

Let’s get the shape of the Data Frame.

print("Number of rows   : " + str(len(df.index)))
print("Number of columns: " + str(len(df.columns)))

Out: Number of rows   : 286
     Number of columns: 1273

So we have 286 rows of insurance undertakings data for ultimo 2016 and ultimo 2017. And we have 1273 columns with financial and solvency data about each insurance undertaking. Let’s take the data from the year 2017, and select all columns that have floating numbers in them.

df = df.xs(datetime(2017,12,31), 
           axis = 0, 
           level = 1,
           drop_level = True)
df = df[[df.columns[c] for c in range(len(df.columns)) 
          if df.dtypes[c]=='float64']]
df.fillna(0, inplace = True)
print("Number of rows   : " + str(len(df.index)))
print("Number of columns: " + str(len(df.columns)))

Out: Number of rows   : 139
     Number of columns: 1272

Apparently there is one column with non-floating point values. For ultimo 2017 we have the data of 139 insurance undertakings.

Now we can perform all kinds of numerical analysis. Let’s calculate the total amount of assets of all Dutch insurance undertakings.

df['assets|total assets , solvency ii value'].sum()

Out: 486702222727.1401

That’s almost 487 billion euro at the end of 2017.

Finding similar insurers

Now let’s apply some algorithms to this data set. Suppose we want to know which insurance undertakings are similar with respect to their financial and solvency structure. To do that we can calculate the distances between the data points of all insurance undertakings. An insurance undertaking with a low distance to another insurance undertaking is likely to be similar to that undertaking.

If we divide each row by the total assets we treat all insurance undertakings as if they had equal size; the distances then indicate similarity in financial and solvency structure (and not similarity in size).

X = df.div(df['assets|total assets , solvency ii value'],
              axis = 0)

The scikit-learn package provides numerous algorithms to do calculations with distances. Below we apply the NearestNeighbors algorithm to find the neighbors of each insurance undertaking. Then we get the distances and the indices of the data set and store them.

from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors(n_neighbors = 2, 
                        algorithm = 'brute').fit(X.values)
distances, indices = nbrs.kneighbors(X)

What are the nearest neighbors of the first ten insurance undertakings in the list?

for i in indices[0:10]:
     print(X.index[i[0]] + " --> " + X.index[i[1]])

Out: 
ABN AMRO Captive N.V. --> Rabo Herverzekeringsmaatschappij N.V.
ABN AMRO Levensverzekering N.V. --> Delta Lloyd Levensverzekering N.V.
ABN AMRO Schadeverzekering N.V. --> Ansvar Verzekeringsmaatschappij N.V.
AEGON Levensverzekering N.V. --> ASR Levensverzekering N.V.
AEGON Schadeverzekering N.V. --> Nationale-Nederlanden Schadeverzekering Maatschappij N.V.
AEGON Spaarkas N.V. --> Robein Leven N.V.
ASR Aanvullende Ziektekostenverzekeringen N.V. --> ONVZ Aanvullende Verzekering N.V.
ASR Basis Ziektekostenverzekeringen N.V. --> VGZ Zorgverzekeraar N.V.
ASR Levensverzekering N.V. --> Achmea Pensioen- en Levensverzekeringen N.V.
ASR Schadeverzekering N.V. --> Veherex Schade N.V.

And with the shortest distance between two insurance undertakings we can find the two insurance undertakings that have the highest similarity in their structure.

min_list = np.where(distances[:,1] == distances[:,1].min())
list(X.index[min_list])

Out: ['IZA Zorgverzekeraar N.V.', 'Univé Zorg, N.V.']

If you want to understand the financial performance it is of course handy to know which insurance undertakings are similar. A more general approach when comparing insurance undertakings is to cluster them into a small number of peer groups.

Clustering the insurers

Can we cluster the insurance undertakings based on the 1272-dimensional data? To do this we apply the t-SNE algorithm (which we used before).

First we import all the packages required.

import matplotlib.pyplot as pyplot
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

Then we run the t-sne algorithm.

Y = TSNE(n_components = 2, 
         perplexity = 5, 
         verbose = 0, 
         random_state = 1).fit_transform(X.values)

And we plot the results

pyplot.figure(figsize = (7, 7))
pyplot.scatter(x = Y[:, 0], 
               y = Y[:, 1], 
               s = 7)
pyplot.show()

Depending on how you zoom in you see different clusters in this picture. In the upper left you see the health insurance undertakings (with more clusters within that set: those offering basic health insurance, those offering additional health insurance, or both). On the right are (mostly) life insurance undertakings, and on the left (middle to bottom) are non-life insurance undertakings. Both clusters can be divided into several more sub-clusters. These clusters can be used in further analysis; for example, you could use them as peer groups of similar insurance undertakings.

Clustering the features

Given that we have a 1272-dimensional vector for each insurance undertaking, we might wish to somehow cluster the features in the data set. That is, we want to know which columns belong together and which columns are different.

An initial form of clustering is given by the different worksheets in the original Excel file. The data is clustered around the balance sheet, premiums, technical provisions, etc. But can we also find clusters within the total vector without any prior knowledge of the different worksheets?

A simple and effective way is to transpose the data matrix and feed it into the t-sne algorithm. That is, instead of assuming that each feature provides additional information about an insurance undertaking, we assume that each insurance undertaking provides additional information about a feature.

Let’s do this for the balance sheet only. In a balance sheet it is not immediately straightforward how the left side is related to the right side, i.e. which assets are related to which liabilities. If we cluster all the data of the balance sheet then related items are clustered together (irrespective of whether they are assets or liabilities).

df = get_sheet(14)

Instead of the scaled values we now take whether or not a data point was reported, and then transpose the matrix.

X = (df != 0).T

Then we apply the t-sne algorithm. In this case with a lower perplexity.

Y = TSNE(n_components = 2, 
         perplexity = 1.0, 
         verbose = 0, 
         random_state = 0, 
         learning_rate = 20, 
         n_iter = 10000).fit_transform(X.values)

And we plot the result with 15 identified clusters.

pyplot.figure(figsize = (7, 7))
pyplot.scatter(x = Y[:, 0], 
              y = Y[:, 1], 
              s = 5)
kmeans = KMeans(n_clusters = 15, random_state = 0, n_init  = 10).fit(Y)
for i in range(len(kmeans.cluster_centers_)):
    pyplot.scatter(x = kmeans.cluster_centers_[i,0],
                   y = kmeans.cluster_centers_[i,1],
                   s = 1,
                   c = 'yellow')
    pyplot.annotate(str(i), 
                   xy = (kmeans.cluster_centers_[i, 0], kmeans.cluster_centers_[i, 1]), 
                   size = 13)
pyplot.show()

Then we get this.

We see a large number of different clusters. Let’s look, for example, at the columns in cluster 6:

for i in df.T.loc[kmeans.labels_ == 6].index:
    print(i)

Out:
assets|assets held for index-linked and unit-linked contracts , solvency ii value
assets|loans and mortgages|loans and mortgages to individuals , solvency ii value
assets|reinsurance recoverables from:|life and health similar to life, excl health,index-linked,unit-linked|life excluding health,index-linked,unit-linked , solvency ii value
liabilities|technical provisions – index-linked and unit-linked , solvency ii value
liabilities|technical provisions – index-linked and unit-linked|best estimate , solvency ii value
liabilities|technical provisions – index-linked and unit-linked|risk margin , solvency ii value

So the assets held for index-linked and unit-linked contracts are in the same cluster as the technical provisions for index-linked and unit-linked items (and some other related items are found).

However, the relations found are not always perfect. To improve the clustering we should cluster the data based on how it changes over time. But because we have just two years available (and so just one yearly difference) we presumably do not have enough data to do that.

What is learning in ‘deep learning’?

The process of training an artificial neural network is generally called deep learning. The adjective deep in this case refers to neural networks consisting of several layers with a large number of weights that are calibrated with input and output examples. But what does learning mean here? The question I want to discuss is whether the learning in a deep learning network is learning in the genuine sense: does a deep learning algorithm learn the way human beings do? And if a network learns, it must learn something; what is it, then, that a deep learning network learns?

Suppose you have an artificial neural network with a suitable architecture for image detection and you want to train it to detect cats in pictures. You have a training set of labelled pictures. You can train the network by backpropagating through the network and changing the weights between the nodes to minimize the output error. During the training process the parameters of each node in the network are changed in a similar way as in fitting a (linear) regression model: the output of an error function is minimized by changing the weights, and this is done for each picture in the training set, for each node in the network, and over a number of iterations. Once you have done that, then, with a sufficiently large training set, the network will be able to detect cats in pictures.

The question is: would it be correct to say that the structure of the network with all its hyperparameters somehow learned the basic properties of cats, such that it can detect them in pictures? And to take this one step further, would it be correct to say that the network with trained weights and other parameters somehow knows what cats are?

It depends, of course, on what we understand by learning. If we define the result of learning as the ability to recognize cats in specific pictures, then it qualifies as learning. But if the result of learning is the ability to explain or to know what a cat is, then I don’t think what the network does comes anywhere near this definition. A trained neural network calculates a complex function. There is no knowledge in or above the network about what cats are, at least not in a form that we can easily understand. Nowhere during the learning phase does a magical jump from data to knowledge occur; nowhere does the pile of sand become a heap. And yet the network is able to recognize cats.

What kind of learning is this? Perhaps a better word for the way in which artificial neural networks learn is that they are being conditioned in a certain manner. The basic mode of learning of artificial neural networks, I would say, is conditioning and not learning in a genuine sense (where the network would be able to explain what a cat is). In conditioning an association is formed between action (or output) and reward (or desired output); this is similar to the way animals learn.

This mode of learning results in an ability without knowledge. And a general way to reconstruct the knowledge that is tacitly present in the network has yet to be found. If an artificial network learns, then it learns in such a way that what is learned remains concealed from us (until somehow we are able to extract from the network the knowledge it learned, but that would be our learning and not the network’s learning).

By calling deep learning a form of learning we project a certain idea of learning onto the training process of neural networks. I think the reason for this is the apparent analogy between human brains (and natural neural networks in general) and artificial ones. The reasoning behind this analogy is: because a human brain is able to learn, and because an artificial neural network mimics the processes of the human brain, it is only logical to conclude that artificial neural networks are also able to learn. But there is a difference between human learning and the biological process that takes place during this learning. I myself am able to learn in the sense that I can explain what I have learned (I hope); my brain is conditioned during this learning.

From this perspective a neural network is similar to any non-human brain. During the learning phase it is conditioned for an ability; it does not acquire any knowledge during this process about what it has learned.

What is a black box?

In discussions about the use of machine learning and deep learning algorithms the issue is often raised that many of these algorithms are black boxes and that we do not know how they work. What do we mean by a black box in relation to machine learning algorithms? What is a black box in itself?

Literally, a black box is something that does not emit any light, so that we are unable to see what is going on inside. An initial definition of a black box might be that it is a process or algorithm with observable input and output where the causal mechanism between input and output is unknown. This lack of knowledge implies that we are unable to see and explain what happens within the process, or how the algorithm works.

But that is not a formal and satisfying definition, for we can ask: by whom should this causal mechanism be known, and what constitutes this knowledge? Is it a not-knowing of an arbitrary individual, or a not-knowing of a group of experts in the matter at hand, given a sufficient amount of time and resources? And is knowledge expressed in logic, mathematics and natural language, or do we also count something like intuition as knowledge?

And furthermore, in practice there are different levels of knowledge of a process or algorithm. You might be able to explain only a small part of the process or algorithm in isolation, but not the full process from input to output with all interactions between the parts within the process. You might be able to explain how one particular input resulted in an output or how a set of related inputs led to the outputs. And you might be able to explain how changes in the input or conditions of the process or parameters of the algorithm change the output (the causal mechanism behind changes in input and conditions). This all constitutes some form of knowledge about an algorithm or process.

A definition of a black box based on this lack of knowledge experienced by someone is not a good idea. It depends on who is experiencing the black box. And more importantly, by defining a black box as the absence of something else we have not said what a black box in itself is. So in this way the definition of a black box itself remains hidden.

Another way to look at it is to see a complex deep learning algorithm as a very complex natural process, like a changing atmosphere, the motion of fluids, or a neurological process in a human brain. In these cases we observe input and output, but the internal mechanism depends on many factors and is extremely complex. It is not that we do not understand how these processes behave in general; for isolated parts we sometimes know precisely what happens. But because of the size and complexity of these processes, and the huge amount of input data that could determine the outcome, the causal relations between input and output are simply too complex to comprehend and to express in a succinct form. (I call it an analogy because we do know that a deep learning network is deterministic, but for natural processes we do not know that.) We often have some knowledge and understanding of how certain simple processes behave in their isolated form, but when it comes to any precise prediction many processes are too complex. We see the same in deep learning algorithms: large amounts of input data and several layers with incomprehensible weights.

In light of this analogy it is perhaps better to see a black box algorithm as something that is open for investigation, just like any natural process is. What are the rules of a deep learning algorithm? How can we extract knowledge from a trained neural network? Of course the structure of these algorithms makes it particularly hard to extract knowledge, but it is not impossible. At the very least we should not dismiss the problem altogether by calling a deep learning algorithm a black box and stopping the investigation.

Some progress has been made in this area; some deep learning algorithms do emit some light on their internal processes. We can try to generalize an algorithm and look at feature importance, we can use techniques such as LIME, and we can work backwards from the output to the input by back-propagation to learn about the feature selection inside the algorithm. But this is just the beginning.

We currently lack a proper terminology to describe the processes in deep neural networks. Terms like interpretability and explainability that have been introduced in the area of deep learning are simply not well defined and too vague to describe what is going on. We need a proper science of neural networks that is able to rationally reconstruct the knowledge that is hidden inside the weights and other parameters of the deep learning network.

So let’s change the definition of the term black box. Instead of absence of knowledge, basically a form of nothingness, we should see a black box in a more positive sense like nature before we understood (some of) her laws: open to be discovered. In the meantime, what do we do when we lack knowledge of a deep learning process? For me the answer lies in the analogy presented above; we should view the outcome as the outcome of a natural process. What that means is something for another blog.

How to download and read the Solvency 2 legislation

In our first Natural Language Processing project we will read the Solvency II legislation from the website of the European Union and extract the text within the articles by using regular expressions.

For this notebook we have chosen the text of the Delegated Acts of Solvency II. This part of the Solvency II legislation is directly in force (because it is a Regulation), and the wording of the Delegated Acts is more detailed than that of the Solvency II Directive, and very precise and internally consistent. This makes it suitable for NLP. From the text we are able to extract features and text data on Solvency II for our future projects.

The code of this notebook can be found here.

Step 1: data retrieval

We use several packages to read and process the pdfs. For reading we use the fitz package (PyMuPDF). Furthermore we need the re package (regular expressions) for cleaning the text data.

import os
import re
import requests
import fitz

We want to read the Delegated Acts in all available languages. The languages of the European Union are Bulgarian (BG), Spanish (ES), Czech (CS), Danish (DA), German (DE), Estonian (ET), Greek (EL), English (EN), French (FR), Croatian (HR), Italian (IT), Latvian (LV), Lithuanian (LT), Hungarian (HU), Maltese (MT), Dutch (NL), Polish (PL), Portuguese (PT), Romanian (RO), Slovak (SK), Slovenian (SL), Finnish (FI) and Swedish (SV).

languages = ['BG','ES','CS','DA','DE','ET','EL',
             'EN','FR','HR','IT','LV','LT','HU',
             'MT','NL','PL','PT','RO','SK','SL',
             'FI','SV']

The urls of the Delegated Acts of Solvency 2 are constructed for these languages by the following list comprehension.

urls = ['https://eur-lex.europa.eu/legal-content/' + lang +
        '/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN' 
        for lang in  languages]

The following for loop retrieves the pdfs of the Delegated Acts from the website of the European Union and stores them in da_path.

da_path = 'data/solvency ii/'
for index in range(len(urls)):
    filename = 'Solvency II Delegated Acts - ' + languages[index] + '.pdf'
    if not(os.path.isfile(da_path + filename)):
        r = requests.get(urls[index])
        f = open(da_path + filename,'wb+')
        f.write(r.content) 
        f.close()
    else:
        print("--> already read.")

Step 2: data cleaning

If you look at the pdfs you see that each page has a header with the page number and information about the legislation and the language. These headers must be deleted to access the articles in the text.

DA_dict = dict({
                'BG': 'Официален вестник на Европейския съюз',
                'CS': 'Úřední věstník Evropské unie',
                'DA': 'Den Europæiske Unions Tidende',
                'DE': 'Amtsblatt der Europäischen Union',
                'EL': 'Επίσημη Εφημερίδα της Ευρωπαϊκής Ένωσης',
                'EN': 'Official Journal of the European Union',
                'ES': 'Diario Oficial de la Unión Europea',
                'ET': 'Euroopa Liidu Teataja',           
                'FI': 'Euroopan unionin virallinen lehti',
                'FR': "Journal officiel de l'Union européenne",
                'HR': 'Službeni list Europske unije',         
                'HU': 'Az Európai Unió Hivatalos Lapja',      
                'IT': "Gazzetta ufficiale dell'Unione europea",
                'LT': 'Europos Sąjungos oficialusis leidinys',
                'LV': 'Eiropas Savienības Oficiālais Vēstnesis',
                'MT': 'Il-Ġurnal Uffiċjali tal-Unjoni Ewropea',
                'NL': 'Publicatieblad van de Europese Unie',  
                'PL': 'Dziennik Urzędowy Unii Europejskiej',  
                'PT': 'Jornal Oficial da União Europeia',     
                'RO': 'Jurnalul Oficial al Uniunii Europene', 
                'SK': 'Úradný vestník Európskej únie',        
                'SL': 'Uradni list Evropske unije',            
                'SV': 'Europeiska unionens officiella tidning'})

The following code reads the pdfs, deletes the headers from all pages and saves the clean text to a .txt file.

DA = dict()
files = [f for f in os.listdir(da_path) if os.path.isfile(os.path.join(da_path, f))]    
for language in languages:
    if not("Delegated_Acts_" + language + ".txt" in files):
        # reading pages from pdf file
        da_pdf = fitz.open(da_path + 'Solvency II Delegated Acts - ' + language + '.pdf')
        da_pages = [page.getText(output = "text") for page in da_pdf]
        da_pdf.close()
        # deleting page headers
        header = "17.1.2015\\s+L\\s+\\d+/\\d+\\s+" + DA_dict[language].replace(' ','\\s+') + "\\s+" + language + "\\s+"
        da_pages = [re.sub(header, '', page) for page in da_pages]
        DA[language] = ''.join(da_pages)
        # some preliminary cleaning -> could be more 
        DA[language] = DA[language].replace('\xad ', '')
        # saving txt file
        da_txt = open(da_path + "Delegated_Acts_" + language + ".txt", "wb")
        da_txt.write(DA[language].encode('utf-8'))
        da_txt.close()
    else:
        # loading txt file
        da_txt = open(da_path + "Delegated_Acts_" + language + ".txt", "rb")
        DA[language] = da_txt.read().decode('utf-8')
        da_txt.close()

Step 3: retrieve the text within articles

Retrieving the text within articles is not straightforward. In English we have ‘Article 1 some text’, i.e. the word Article is put before the number. But some European languages put the word after the number, and there are two languages, HU and LV, that put a dot between the number and the word. To be able to read the text within the articles we need to know this ordering (and of course we need the word for article in every language).

art_dict = dict({
                'BG': ['Член',      'pre'],
                'CS': ['Článek',    'pre'],
                'DA': ['Artikel',   'pre'],
                'DE': ['Artikel',   'pre'],
                'EL': ['Άρθρο',     'pre'],
                'EN': ['Article',   'pre'],
                'ES': ['Artículo',  'pre'],
                'ET': ['Artikkel',  'pre'],
                'FI': ['artikla',   'post'],
                'FR': ['Article',   'pre'],
                'HR': ['Članak',    'pre'],
                'HU': ['cikk',      'postdot'],
                'IT': ['Articolo',  'pre'],
                'LT': ['straipsnis','post'],
                'LV': ['pants',     'postdot'],
                'MT': ['Artikolu',  'pre'],
                'NL': ['Artikel',   'pre'],
                'PL': ['Artykuł',   'pre'],
                'PT': ['Artigo',    'pre'],
                'RO': ['Articolul', 'pre'],
                'SK': ['Článok',    'pre'],
                'SL': ['Člen',      'pre'],
                'SV': ['Artikel',   'pre']})

Next we can define a regex to select the text within an article.

def retrieve_article(language, article_num):

    method = art_dict[language][1]
    
    if method == 'pre':
        string = art_dict[language][0] + ' ' + str(article_num) + '(.*?)' + art_dict[language][0] + ' ' + str(article_num + 1)
    elif method == 'post':
        string = str(article_num) + ' ' + art_dict[language][0] + '(.*?)' + str(article_num + 1) + ' ' + art_dict[language][0]
    elif method == 'postdot':
        string = str(article_num) + '. ' + art_dict[language][0] + '(.*?)' + str(article_num + 1) + '. ' + art_dict[language][0]

    r = re.compile(string, re.DOTALL)
            
    result = ' '.join(r.search(DA[language])[1].split())
            
    return result

Now we have a function that can retrieve the text of all the articles in the Delegated Acts for each European language.

Now we are able to read the text of the articles from the Delegated Acts. In the following we give three examples (article 292, which specifies the summary of the Solvency and Financial Condition Report).

retrieve_article('EN', 292)
"Summary 1. The solvency and financial condition report shall include a clear and concise summary. The summary of the report
shall be understandable to policy holders and beneficiaries. 2. The
summary of the report shall highlight any material changes to the 
insurance or reinsurance undertaking's business and performance, 
system of governance, risk profile, valuation for solvency purposes 
and capital management over the reporting period."
retrieve_article('DE', 292)
'Zusammenfassung 1. Der Bericht über Solvabilität und Finanzlage 
enthält eine klare, knappe Zusammenfassung. Die Zusammenfassung des
Berichts ist für Versicherungsnehmer und Anspruchsberechtigte
verständlich. 2. In der Zusammenfassung werden etwaige wesentliche
Änderungen in Bezug auf Geschäftstätigkeit und Leistung des
Versicherungs- oder Rückversicherungsunternehmens, sein 
Governance-System, sein Risikoprofil, die Bewertung für 
Solvabilitätszwecke und das Kapitalmanagement im Berichtszeitraum 
herausgestellt.'
retrieve_article('EL', 292)
'Περίληψη 1. Η έκθεση φερεγγυότητας και χρηματοοικονομικής
κατάστασης περιλαμβάνει σαφή και σύντομη περίληψη. Η περίληψη της
έκθεσης πρέπει να είναι κατανοητή από τους αντισυμβαλλομένους και
τους δικαιούχους. 2. Η περίληψη της έκθεσης επισημαίνει τυχόν
ουσιώδεις αλλαγές όσον αφορά τη δραστηριότητα και τις επιδόσεις της
ασφαλιστικής και αντασφαλιστικής επιχείρησης, το σύστημα
διακυβέρνησης, το προφίλ κινδύνου, την εκτίμηση της αξίας για τους
σκοπούς φερεγγυότητας και τη διαχείριση κεφαλαίου κατά την περίοδο
αναφοράς.'

Dutch insurance market visualization

In this example we will analyze the Dutch insurance market with two machine learning algorithms: t-SNE, a technique for dimensionality reduction developed by Laurens van der Maaten, combined with KMeans, an algorithm to find clusters in the data.

We use publicly available register data of all Dutch insurance undertakings that we web scraped from the DNB website (public-register). This register contains the specific license and the lines of business (LoB’s) in which an insurance undertaking is allowed to operate.

Insurance undertakings in the European Union have different types of licenses, depending for example on whether they sell life or non-life products, where the undertaking is based and where it is allowed to operate. They sell different kinds of insurance products, i.e. they operate in different LoB’s (for example motor vehicle insurance, general life insurance or health insurance). There are 6 LoB’s for life insurance and 19 LoB’s for non-life insurance.

There are groups of insurance undertakings that, looking at their lines of business, are similar. For example some undertakings are small and specialized and sell products only within a very limited number of lines of business. Other larger general insurance undertakings sell insurance products from all lines of business.

How can we find these clusters (groups) of insurance undertakings that are similar with respect to their set of lines of business? We will create a vector per insurance undertaking with its allowed lines of business. Then we will use the t-SNE algorithm to reduce these vectors to two dimensions, such that we can plot them in a 2d plane. Undertakings with similar LoB sets are then plotted close to each other. Finally, we detect the clusters with the KMeans algorithm.

I won’t give all the Python code of the notebook because it is somewhat cumbersome to obtain the license data from the DNB register, but if you are interested, a part can be found here.

import pandas as pd
import numpy as np
import ast
import matplotlib.pyplot as pyplot
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

Step 1: reading the data

Earlier, we stored the license and lines of business data in a csv file.

filename = 'data/licenses_insurers.csv'
df = pd.read_csv(filename, delimiter = ',', encoding = 'utf-8')

Step 2: data preparation

First we need some data preparation. The csv contains string data that has to be interpreted. We extract the data on the licenses and the lines of business. If we look at the available licenses we get:

List of available licenses:
2:27 lid 1 SII-schadeverzekeraar
2:27(1) SII insurer
2:45(1) Carrying on the business of a life insurer (non-EEA)
2:45(1) Carrying on the business of a non-life insurer (non-EEA)
2:45(1) non-EEA-based life insurer providing services to NL
2:45(1) non-EEA-based non-life insurer providing services to NL
2:48(1) life insurer with low magnitude of risk
2:48(1) non-life insurer with low magnitude of risk
2:48(1) pre-paid funeral services insurer with low magnitude of risk
Section 1:104(3) Business being wound up

Apparently not all items have been translated to English.

Insurers with low magnitude of risk are too small for Solvency II regulation and for these insurers there is a (simplified) regime in place. Some entities are active from outside the European Economic Area (EEA). The codes refer to the articles in the Dutch Financial Supervision Act.

Now we can get the list of lines of business:

List of available lines of business:
L01. Life insurance - general
L02. Life insurance related to marriage or birth
L03. Life insurance linked to common funds
L05. Holdings in savings pools
L06. Capitalisation activities
L07. Collective pension funds management
S01. Accident insurance
S02. Health insurance
S03. Motor vehicle insurance
S04. Railway rolling stock insurance
S05. Aircraft hull insurance
S06. Marine hull insurance
S07. Goods-in-transit insurance
S08. Fire and natural forces insurance
S09. Other property damage insurance
S10a. Motor vehicle liability insurance
S10b. Road transport liability insurance
S11. Aircraft liability insurance
S12. Marine liability insurance (sea, lake & river and canal vessels)
S13. General liability insurance
S14. Credit insurance
S15. Suretyship
S16. Pecuniary loss insurance
S17. Legal assistance insurance
S18. Assistance

This list of lines of business is applied across the European Union and was already in place before the Solvency II regulation.

From this we can obtain a data frame df_entities_lobs with, for each insurance undertaking, a dummy encoded vector of its allowed lines of business (25 possible LoB’s). That will be the input for the t-SNE algorithm.
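As a sketch of this preparation step (the column names 'name' and 'lobs' are assumptions here; the actual csv layout may differ), the string representation of the LoB list per undertaking can be parsed with ast.literal_eval and dummy encoded as follows:

# parse the string "['L01. ...', 'S01. ...']" into a Python list per undertaking
df['lobs'] = df['lobs'].apply(ast.literal_eval)

# one row per (undertaking, line of business), then a 0/1 encoding per LoB
exploded = df.explode('lobs')
df_entities_lobs = pd.crosstab(exploded['name'], exploded['lobs'])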

Step 3: data analysis

We will use the t-SNE algorithm from the package sklearn.manifold. The input for the algorithm is the data frame with dummy encoded lines of business per entity prepared previously.

X = df_entities_lobs
Y = TSNE(n_components = 2, 
         perplexity = 18, 
         verbose = 1, 
         random_state = 1).fit_transform(X)

Next we use the KMeans algorithm to determine the clusters in the Dutch insurance market. It appears that about eight clusters are identifiable.

kmeans = KMeans(n_clusters = 8, random_state = 0, n_init  = 10).fit(Y)
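The choice of eight clusters is a judgment call. A quick way to check it (not part of the original notebook) is to plot the within-cluster sum of squares (inertia) for a range of cluster counts and look for the 'elbow' where the decrease levels off:

# inertia for different numbers of clusters; the 'elbow' suggests a reasonable choice
inertias = [KMeans(n_clusters = k, random_state = 0, n_init = 10).fit(Y).inertia_
            for k in range(2, 16)]
pyplot.plot(range(2, 16), inertias, marker = 'o')
pyplot.xlabel('number of clusters')
pyplot.ylabel('inertia')
pyplot.show()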

Step 4: data visualization

The last step is to visualize the results of the t-SNE algorithm. For this we first produce the labels of the clusters, with the average number of lines of business and the two most dominant lines of business per cluster. With that information we can describe the basic properties of each cluster.
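A minimal sketch of such a plot, using the 2d coordinates Y and the fitted kmeans object from the previous steps (the descriptive cluster labels with dominant lines of business are left out here):

# scatter plot of the t-SNE coordinates, colored by KMeans cluster
pyplot.scatter(Y[:, 0], Y[:, 1], c = kmeans.labels_, cmap = 'tab10', s = 15)
for cluster, (x, y) in enumerate(kmeans.cluster_centers_):
    pyplot.annotate(str(cluster), (x, y), fontsize = 12, fontweight = 'bold')
pyplot.title('Lines of business of Dutch insurance undertakings (t-SNE + KMeans)')
pyplot.show()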

Figure: results of the t-SNE algorithm applied to the lines of business of Dutch insurance undertakings.

Clusters 1 and 5 are the two clusters with life insurance undertakings. Cluster 5 consists of general life insurance undertakings with a broad range of products across different life insurance lines of business. Cluster 1 consists of specialized life insurance undertakings with only one line of business. These undertakings are often relatively small.

The other clusters consist of non-life insurance undertakings. Cluster 2 is an easily identifiable cluster of health insurance undertakings with exactly two lines of business (accident and health insurance). Some health insurance undertakings also have other lines of business; they form a cluster of more general health insurance undertakings (cluster 6, near the uniform cluster 2).

Then we have cluster 3 with general non-life insurance undertakings operating in a large number of lines of business (13 on average). Cluster 7 consists of medium-sized non-life insurance undertakings with fewer lines of business on average. Clusters 4 and 0 are specialized and often small non-life insurance undertakings with 1 or 2 lines of business on average (cluster 0: specialized property damage and fire insurance undertakings; cluster 4: specialized general liability and legal assistance insurance undertakings).

We can find the corresponding undertakings for each cluster, for example the specialized small life insurance undertakings (cluster 1):

list(df_entities_lobs.index[kmeans.labels_ == 1])
['DELA Natura- en levensverzekeringen N.V.',
 'Isle of Man Assurance Limited',
 'Monuta Verzekeringen N.V.',
 'N.V. Noordhollandsche van 1816, Levensverzekeringsmaatschappij',
 'Nordben Life and Pension Insurance Co. Limited',
 "Onderling Fonds 'Sliedrecht' B.A.",
 'Tiels Onderling Fonds tot Uitkering bij Overlijden Gustaaf Adolf U.A.',
 'Yarden Uitvaartverzekeringen N.V.',
 'Zurich Life Insurance Company Limited']

How to import the Solvency 2 RFR

Published by:

UPDATE: The code mentioned in this blog has been moved to https://github.com/wjwillemse/solvency2-data

If you want to make insurance calculations you often need a recent interest rate term structure. In this example we show how to import the Solvency 2 Risk Free Rate (RFR) from the EIOPA website into a convenient Pandas Data Frame, ready to be used for further calculations.

The code can be found on https://github.com/wjwillemse/insurlib.

First we import insurlib.solvency2, the Python package that contains functions to generate the names of the files, download the zip-file from the EIOPA website, extract it to an Excel file (both stored on disk) and read the Excel file into a proper Pandas Data Frame.

In [1]: import pandas as pd
        from datetime import datetime
        from insurlib import solvency2

We now have all the functions we need.

The function that does all this is solvency2.rfr.read; it returns a Python dictionary with all the information about the RFR of a certain reference date.

If you do not pass an input datetime, i.e. you call solvency2.rfr.read(), then the function will use today() and you will receive the most recently published RFR.

In [2]: d = solvency2.rfr.read(datetime(2018,1,1))

What information is stored in the dictionary?

In [3]: d.keys()
Out[3]: dict_keys(['input_date', 
           'reference_date', 
           'url', 
           'path_zipfile', 
           'name_zipfile', 
           'path_excelfile', 
           'name_excelfile', 
           'metadata', 
           'RFR_spot_no_VA', 
           'RFR_spot_with_VA'])

Let’s take a look at the individual elements of the dictionary.

The date with which the function was called is stored in the dictionary as input_date.

In [4]: d['input_date']
Out[4]: datetime.datetime(2018, 1, 1, 0, 0)

You can call the function with any date and the function will generate a proper reference date from it. The reference date is the most recent end of the month prior to the input date. So if for example the input is datetime(2018, 1, 1), then the reference date is '20171231'. The reference date is a string because it is used in the names of the files to be downloaded from the EIOPA website.

In [5]: d['reference_date']
Out[5]: '20171231'
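To illustrate how such a reference date can be derived from an input date (this is just an illustration, not necessarily how solvency2.rfr computes it):

from datetime import datetime, timedelta

def previous_month_end(input_date):
    # last day of the month preceding the input date, formatted as in the EIOPA file names
    return (input_date.replace(day = 1) - timedelta(days = 1)).strftime('%Y%m%d')

previous_month_end(datetime(2018, 1, 1))   # '20171231'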

Furthermore the url, location and filenames are stored in the dictionary.

In [6]: print(d['url'])
        print(d['name_zipfile'])
        print(d['name_excelfile'])
Out[6]: https://eiopa.europa.eu/Publications/Standards/
        EIOPA_RFR_20171231.zip
        EIOPA_RFR_20171231_Term_Structures.xlsx

Now, let’s take a look at the available RFR’s.

In [7]: d['metadata'].columns
Out[7]: Index(['Euro', 'Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 'Czech Republic', 'Denmark', 'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Hungary', 'Iceland', 'Ireland', 'Italy', 'Latvia', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Malta', 'Netherlands', 'Norway', 'Poland', 'Portugal', 'Romania', 'Russia', 'Slovakia', 'Slovenia', 'Spain', 'Sweden', 'Switzerland', 'United Kingdom', 'Australia', 'Brazil', 'Canada', 'Chile', 'China', 'Colombia', 'Hong Kong', 'India', 'Japan', 'Malaysia', 'Mexico', 'New Zealand', 'Singapore', 'South Africa', 'South Korea', 'Taiwan', 'Thailand', 'Turkey', 'United States'],    dtype='object')

To get all the metadata of the French RFR we select the column 'France' from the metadata in the dictionary.

In [8]: d['metadata'].loc[:,'France']
Out[8]: Info              FR_31_12_2017_SWP_LLP_20_EXT_40_UFR_4.2
Coupon_freq                                             1
LLP                                                    20
Convergence                                            40
UFR                                                   4.2
alpha                                            0.126759
CRA                                                    10
VA                                                      4
reference date                                   20171231
Name: France, dtype: object

That is: the coupon frequency is one year, the Last Liquid Point is 20 years, the convergence period is 40 years, the Ultimate Forward Rate is 4.2% (this rate changed to 4.05% at the beginning of 2018), the alpha parameter of the Smith-Wilson algorithm is 0.126759, the Credit Risk Adjustment is 10 basis points and the Volatility Adjustment is 4 basis points (the VA is reported in the metadata even though we selected the curve without the VA). The French curve is identical to the Euro curve.

To get one single item from the metadata we can use the following line (note that this is the UFR at the end of 2017).

In [9]: d['metadata'].loc["UFR",'Germany']
Out[9]: 4.2

To get the euro RFR without Volatility Adjustment (the first ten durations) we use

In [10]: d['RFR_spot_no_VA']['Euro'].head(10)
Out[10]: Duration
1    -0.00358
2     -0.0025
3    -0.00088
4     0.00069
5     0.00209
6     0.00347
7     0.00469
8     0.00585
9     0.00695
10    0.00802
Name: Euro, dtype: object
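Note that the values are returned with dtype object. For calculations it can be convenient to convert them to floats first, for example (a small sketch, assuming the Duration index holds the maturities in years):

euro = d['RFR_spot_no_VA']['Euro'].astype(float)

# discount factors implied by the spot rates
discount_factors = 1 / (1 + euro) ** euro.index.to_series().astype(float)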

Now suppose that we want to store the RFR of six consecutive months into one Data Frame. This is how we can do that.

First we define ref_dates with the reference dates we want to acquire.

In [11]: ref_dates = pd.date_range(start='2018-01-01', periods = 6, freq = 'MS')
         ref_dates
Out[11]: DatetimeIndex(['2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01', '2018-06-01'], dtype='datetime64[ns]', freq='MS')

Then we use a Python list comprehension to obtain the RFR’s of the reference dates and we show the Data Frame with the first ten durations.

In [12]: rfr = [solvency2.rfr.read(ref_date)['RFR_spot_no_VA']['Euro'] for ref_date in ref_dates]
         df_euro = pd.DataFrame(data = rfr, index = ref_dates).T
         print(df_euro.head(10))

Out[12]:           2018-01-01  2018-02-01  2018-03-01  2018-04-01  2018-05-01  \
Duration
1           -0.00358    -0.00363    -0.00352    -0.00362    -0.00358
2           -0.00250    -0.00225    -0.00220    -0.00258    -0.00244
3           -0.00088    -0.00020    -0.00022    -0.00083    -0.00065
4            0.00069     0.00190     0.00178     0.00104     0.00120
5            0.00209     0.00380     0.00361     0.00285     0.00286
6            0.00347     0.00537     0.00521     0.00418     0.00441
7            0.00469     0.00670     0.00666     0.00556     0.00577
8            0.00585     0.00791     0.00793     0.00672     0.00698
9            0.00695     0.00899     0.00906     0.00783     0.00809
10           0.00802     0.00987     0.01007     0.00884     0.00911

          2018-06-01
Duration
1           -0.00331
2           -0.00236
3           -0.00098
4            0.00057
5            0.00213
6            0.00356
7            0.00490
8            0.00613
9            0.00725
10           0.00824

A list comprehension can also be used for the metadata. The following code obtains the metadata of the UK RFR for the same reference dates.

In [13]: rfr = [solvency2.rfr.read(ref_date)['metadata']['United Kingdom'] for ref_date in ref_dates]
         print(pd.DataFrame(data = rfr, index = ref_dates))

This returns a Data Frame with, per reference date, the parameters of the UK curve (LLP, convergence period, UFR, alpha, CRA and VA).