In this blog we will use the public Solvency II data of all Dutch insurance undertakings and present it in one large data vector per undertaking. In doing so, we are able to use some easy but powerful machine learning algorithms to analyze the data. The notebook can be found here.

Solvency II data of individual insurance undertakings is published yearly by the Statistics department of DNB. The data represents the financial and solvency situation of each insurance undertaking per the end of each year. Currently, we have two years of data: ultimo 2016 and ultimo 2017. The publication of DNB is in the form of an Excel file with a number of worksheets containing the aggregated data and the individual data. Here we will use the individual data.

We will read the data in a Pandas Data Frame en use Numpy for data manipulations. Furthermore we need the datetime object

```
import pandas as pd
import numpy as np
from datetime import datetime
```

You can find the data with the following url

https://statistiek.dnb.nl/en/downloads/index.aspx#/?statistics_type=toezichtdata&theme=verzekeraars

Download the file and make sure that the Excel file is accessible.

```
data_path = "../../20_local_data/"
xls = pd.ExcelFile(data_path + "Data individual insurers
(year).xlsx")
```

The Excel file contains several worksheets with data. We want to combine all the data together in one Data Frame. To do that we need some data preparation and data cleaning for each worksheet.

In the following function a worksheet is put into a Data Frame and the columns names are set to lower case. Then an index of the data frame is set to the insurance undertaking name and the reporting period. Then we perform some cleaning (the original worksheets contain some process information). In addition, some worksheets in the file contain 2-dimensional data, that need to be pivoted such that we obtain one large vector with all the data per insurance undertaking in one row.

```
def get_sheet(num):
# read entire Excel sheet
df = xls.parse(num)
# columns names to lower case
df.columns = [c.lower() for c in df.columns]
# set index to name and period
df.set_index(['name', 'period'], inplace = True)
# data cleaning (the excel sheet contains some
additional data that we don't need)
drop_list = [i for i in df.columns
if 'unnamed' in i or 'selectielijst' in i]
df.drop(drop_list, axis = 1, inplace = True)
# pivot data frame
if "row_name" in df.columns:
df.drop("row_name", axis = 1, inplace = True)
df = df.pivot(columns = 'row_header')
if df.columns.nlevels > 1:
df.columns = [str(df.columns[i]) for i in
range(len(df.columns))]
return df
```

## Creating one large vector per insurer

With the function above we can read a part of the Excel file and store it in a Pandas data frame. The following worksheets are contained in the Excel file published by DNB.

- Worksheet 14: balance sheet
- Worksheet 15: premiums – life
- Worksheet 16: premiums – non-life
- Worksheet 17: technical provisions – 1
- Worksheet 18: technical provisions – 2
- Worksheet 19: transition and adjustments
- Worksheet 20: own funds
- Worksheet 21: solvency capital requirements – 1
- Worksheet 22: solvency capital requirements – 2
- Worksheet 23: minimum capital requirements
- Worksheet 24: additional information life
- Worksheet 25: additional information non-life
- Worksheet 26: additional information reinsurance

Let’s read the first worksheet with data and then append the other sheets to it. We shall not read the last three worksheets, because these contain the country specific reports.

```
df = get_sheet(14)
for l in range(15, 24):
df_temp = get_sheet(l)
df = df.join(df_temp, rsuffix = "_"+ str(l))
```

Let’s get the shape of the Data Frame.

```
print("Number of rows : " + str(len(df.index)))
print("Number of columns: " + str(len(df.columns)))
Out: Number of rows : 286
Number of columns: 1273
```

So we have 286 rows of insurance undertakings data for ultimo 2016 and ultimo 2017. And we have 1273 columns with financial and solvency data about each insurance undertaking. Let’s take the data from the year 2017, and select all columns that have floating numbers in them.

```
df = df.xs(datetime(2017,12,31),
axis = 0,
level = 1,
drop_level = True)
df = df[[df.columns[c] for c in range(len(df.columns))
if df.dtypes[c]=='float64']]
df.fillna(0, inplace = True)
print("Number of rows : " + str(len(df.index)))
print("Number of columns: " + str(len(df.columns)))
Out: Number of rows : 139
Number of columns: 1272
```

Apparently there is one columns with a non-floating values. For ultimo 2017 we have the data of 139 insurance undertaking.

Now we can perform all kinds of numerical analysis. Let’s calculate the total amount of assets of all Dutch insurance undertakings.

```
df['assets|total assets , solvency ii value'].sum()
Out: 486702222727.1401
```

That’s almost 487 billion euro at the end of 2017.

## Finding similar insurers

Now let’s apply some algorithms to this data set. Suppose we want to know what insurance undertakings are similar with respect to their financial and solvency structure. To do that we can calculate the distances between all the data points of each insurance undertakings. An insurance undertaking with a low distance to another insurance undertaking might be similar to that undertaking.

If we divide each row by the total assets we do as if all insurance undertakings have equal size, and then the distances indicate similarity in financial and solvency structure (and not similarity in size).

```
X = df.div(df['assets|total assets , solvency ii value'],
axis = 0)
```

The scikit-learn package provides numerous algorithms to do calculations with distances. Below we apply the NearestNeighbors algorithm to find the neighbors of each insurance undertaking. Then we get the distances and the indices of the data set and store them.

```
from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors(n_neighbors = 2,
algorithm = 'brute').fit(X.values)
distances, indices = nbrs.kneighbors(X)
```

What are the nearest neighbors of the first ten insurance undertakings in the list?

```
for i in indices[0:10]:
print(X.index[i[0]] + " --> " + X.index[i[1]])
Out:
ABN AMRO Captive N.V. --> Rabo Herverzekeringsmaatschappij N.V.
ABN AMRO Levensverzekering N.V. --> Delta Lloyd Levensverzekering N.V.
ABN AMRO Schadeverzekering N.V. --> Ansvar Verzekeringsmaatschappij N.V.
AEGON Levensverzekering N.V. --> ASR Levensverzekering N.V.
AEGON Schadeverzekering N.V. --> Nationale-Nederlanden Schadeverzekering Maatschappij N.V.
AEGON Spaarkas N.V. --> Robein Leven N.V.
ASR Aanvullende Ziektekostenverzekeringen N.V. --> ONVZ Aanvullende Verzekering N.V.
ASR Basis Ziektekostenverzekeringen N.V. --> VGZ Zorgverzekeraar N.V.
ASR Levensverzekering N.V. --> Achmea Pensioen- en Levensverzekeringen N.V.
ASR Schadeverzekering N.V. --> Veherex Schade N.V.
```

And with the shortest distance between two insurance undertakings we can find the two insurance undertakings that have the highest similarity in their structure.

```
min_list = np.where(distances[:,1] == distances[:,1].min())
list(X.index[min_list])
Out: ['IZA Zorgverzekeraar N.V.', 'Univé Zorg, N.V.']
```

If you want to understand the financial performance it is of course handy to know which insurance undertakings are similar. A more general approach when comparing insurance undertakings is to cluster them into a small number of peer groups.

## Clustering the insurers

Can we cluster the insurance undertakings based on the 1272-dimensional data? To do this we apply the t-sne algorithm (that we used before).

First we import all the packages required.

```
import matplotlib.pyplot as pyplot
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
```

Then we run the t-sne algorithm.

```
Y = TSNE(n_components = 2,
perplexity = 5,
verbose = 0,
random_state = 1).fit_transform(X.values)
```

And we plot the results

```
pyplot.figure(figsize = (7, 7))
pyplot.scatter(x = Y[:, 0],
y = Y[:, 1],
s = 7)
pyplot.show()
```

Depending on how you zoom in you see different clusters in this picture. In the above left you see the health insurance undertakings (with more clusters within that set: those offering basic health insurance and other offering additional health insurances, or both). On the right are (mostly) life insurance undertakings, and on the left (middle to below) are non-life insurance undertakings. And both clusters can be divided into several more sub clusters. These clusters can be used in further analysis. For example, you could use these as peer groups of similar insurance undertakings.

## Clustering the features

Given that we have a 1272-dimensional vector of each insurance undertaking we might wish somehow to cluster the features in the data set. That is, we want to know which columns belong to each other and what columns are different.

An initial form of clustering were the different worksheets in the original Excel file. The data was clustered around the balance sheet, premiums, technical provisions, etc. But can we also find clusters within the total vector without any prior knowledge of the different worksheets?

A simple and effective way is to transpose the data matrix and feed it into the t-sne algorithm. That is, instead of assuming that each feature provides additional information about an insurance undertaking, we assume that each insurance undertaking provides additional information about a feature.

Let’s do this for only the balance sheet. In a balance sheet it is not immediately straightforward how the left side is related to the right side of the balance sheet, i.e. which assets are related to which liabilities. If you cluster all the data of the balance sheet then related items are clustered (irrespective of whether they are assets or liabilities).

```
df = get_sheet(14)
```

Instead of the scaled values we now take whether or not a data point was reported or not, and then transpose the matrix.

```
X = (df != 0).T
```

Then we apply the t-sne algorithm. In this case with a lower perplexity.

```
Y = TSNE(n_components = 2,
perplexity = 1.0,
verbose = 0,
random_state = 0,
learning_rate = 20,
n_iter = 10000).fit_transform(X.values)
```

And we plot the result with 15 identified clusters.

```
pyplot.figure(figsize = (7, 7))
pyplot.scatter(x = Y[:, 0],
y = Y[:, 1],
s = 5)
kmeans = KMeans(n_clusters = 15, random_state = 0, n_init = 10).fit(Y)
for i in range(len(kmeans.cluster_centers_)):
pyplot.scatter(x = kmeans.cluster_centers_[i,0],
y = kmeans.cluster_centers_[i,1],
s = 1,
c = 'yellow')
pyplot.annotate(str(i),
xy = (kmeans.cluster_centers_[i, 0], kmeans.cluster_centers_[i, 1]),
size = 13)
pyplot.show()
```

Then we get this.

We see are large number of different clusters.

```
for i in df.T.loc[kmeans.labels_ == 6].index:
print(i)
Out:
assets|assets held for index-linked and unit-linked contracts , solvency ii value
assets|loans and mortgages|loans and mortgages to individuals , solvency ii value
assets|reinsurance recoverables from:|life and health similar to life, excl health,index-linked,unit-linked|life excluding health,index-linked,unit-linked , solvency ii value
liabilities|technical provisions – index-linked and unit-linked , solvency ii value
liabilities|technical provisions – index-linked and unit-linked|best estimate , solvency ii value
liabilities|technical provisions – index-linked and unit-linked|risk margin , solvency ii value
```

So the assets held for index-linked and unit-linked contracts are in the same cluster as the technical provisions for index-linked and unit-linked items (and some other related items are found).

However, the relations found are not always perfect. To improve the clustering we should cluster the data that is related in their changes over time. But because we have just two years available (and so just one yearly difference) we presumably do not have enough data to do that.