In this example we will analyze the Dutch insurance market with two machine learning algorithms: t-SNE, a technique for dimensionality reduction developed by Laurens van der Maaten, and KMeans, an algorithm for finding clusters in data.
We use publicly available register data of all Dutch insurance undertakings that we web scraped from the DNB website (public-register). This register contains the specific license and the lines of business (LoB’s) in which an insurance undertaking is allowed to operate.
Insurance undertakings in the European Union hold different types of licenses, depending for example on whether they sell life or non-life insurance, where the undertaking is based and where it is allowed to operate. They sell different kinds of insurance products, i.e. they operate in different LoB’s (for example motor vehicle insurance, general life insurance or health insurance). There are 6 LoB’s for life and 19 LoB’s for non-life.
There are groups of insurance undertakings that, looking at their lines of business, are similar. For example some undertakings are small and specialized and sell products only within a very limited number of lines of business. Other larger general insurance undertakings sell insurance products from all lines of business.
How can we find these clusters (groups) of insurance undertakings that are similar with respect to their set of lines of business? We will create one vector per insurance undertaking that encodes its allowed lines of business. Then we will use the t-SNE algorithm to reduce this vector to a 2d vector so that we can plot it in a 2d plane. Undertakings with similar LoB sets are then plotted near each other. Finally we detect the clusters with the KMeans algorithm.
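To make "similar with respect to their set of lines of business" concrete, here is a minimal sketch with three made-up undertakings and only five made-up LoB slots (the real vectors have 25 entries). It assumes the default Euclidean metric of scikit-learn's TSNE; for 0/1 vectors the squared Euclidean distance is simply the number of LoB’s in which two undertakings differ.

```python
import numpy as np

# Hypothetical 0/1 vectors: 1 means the undertaking is licensed for that LoB.
small_life = np.array([1, 0, 0, 0, 0])
broad_life = np.array([1, 1, 0, 0, 0])
non_life = np.array([0, 0, 1, 1, 1])


def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return int(np.sum((u - v) ** 2))


def differing_lobs(u, v):
    """Number of LoB slots in which the two vectors disagree."""
    return int(np.sum(u != v))


# The two life insurers are close to each other, the non-life insurer is far away.
print(squared_distance(small_life, broad_life), differing_lobs(small_life, broad_life))
print(squared_distance(small_life, non_life), differing_lobs(small_life, non_life))
```

t-SNE tries to keep such close pairs close in the 2d plane, which is why undertakings with nearly identical LoB sets end up in the same region of the plot.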
I won’t give all the Python code of the notebook because it is somewhat cumbersome to obtain the license data from the DNB register, but if you are interested, a part can be found here.
import pandas as pd
import numpy as np
import ast
import matplotlib.pyplot as pyplot
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
Step 1: reading the data
Earlier, we stored the license and lines of business data in a csv file.
filename = 'data/licenses_insurers.csv'
df = pd.read_csv(filename, delimiter = ',', encoding = 'utf-8')
Step 2: data preparation
First we need some data preparation. The .csv contains string data that has to be interpreted. We start by extracting the data on the licenses and the lines of business. The available licenses are:
List of available licenses:
2:27 lid 1 SII-schadeverzekeraar
2:27(1) SII insurer
2:45(1) Carrying on the business of a life insurer (non-EEA)
2:45(1) Carrying on the business of a non-life insurer (non-EEA)
2:45(1) non-EEA-based life insurer providing services to NL
2:45(1) non-EEA-based non-life insurer providing services to NL
2:48(1) life insurer with low magnitude of risk
2:48(1) non-life insurer with low magnitude of risk
2:48(1) pre-paid funeral services insurer with low magnitude of risk
Section 1:104(3) Business being wound up
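Apparently not all items have been translated to English: the first entry is still in Dutch ("lid 1" means paragraph 1 and "schadeverzekeraar" means non-life insurer).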
Insurers with low magnitude of risk are too small for Solvency II regulation and for these insurers there is a (simplified) regime in place. Some entities are active from outside the European Economic Area (EEA). The codes refer to the articles in the Dutch Financial Supervision Act.
Now we can get the list of lines of business:
List of available lines of business:
L01. Life insurance - general
L02. Life insurance related to marriage or birth
L03. Life insurance linked to common funds
L05. Holdings in savings pools
L06. Capitalisation activities
L07. Collective pension funds management
S01. Accident insurance
S02. Health insurance
S03. Motor vehicle insurance
S04. Railway rolling stock insurance
S05. Aircraft hull insurance
S06. Marine hull insurance
S07. Goods-in-transit insurance
S08. Fire and natural forces insurance
S09. Other property damage insurance
S10a. Motor vehicle liability insurance
S10b. Road transport liability insurance
S11. Aircraft liability insurance
S12. Marine liability insurance (sea, lake & river and canal vessels)
S13. General liability insurance
S14. Credit insurance
S15. Suretyship
S16. Pecuniary loss insurance
S17. Legal assistance insurance
S18. Assistance
This list of lines of business is applied across the European Union and was already in place before the Solvency II regulation.
We can obtain a data frame df_entities_lobs with all insurance undertakings with their allowed lines of business (25 possible LoB’s). That will be the input for the t-SNE algorithm.
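The construction of such a dummy-encoded data frame can be sketched as follows. This is not the notebook's actual code: the column names `name` and `lobs`, the three example insurers and the assumption that the LoB’s are stored as the string form of a Python list are all hypothetical, since the layout of the .csv file is not shown here.

```python
import ast

import pandas as pd

# Hypothetical stand-in for the scraped register: one row per undertaking,
# with the lines of business stored as the string form of a Python list.
raw = pd.DataFrame({
    "name": ["Insurer A", "Insurer B", "Insurer C"],
    "lobs": ["['L01', 'L03']", "['S01', 'S02']", "['S02']"],
})

# Parse each string into a real list of LoB codes, indexed by undertaking name.
lob_lists = raw.set_index("name")["lobs"].apply(ast.literal_eval)

# One row per (undertaking, LoB) pair, then one 0/1 column per LoB,
# collapsed back to a single row per undertaking.
example_lobs = (
    pd.get_dummies(lob_lists.explode())
    .astype(int)
    .groupby(level=0)
    .max()
)

print(example_lobs)
```

With the real register the result would have one column for each of the 25 LoB’s and one row per undertaking.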
Step 3: data analysis
We will use the t-SNE algorithm from the package sklearn.manifold. The input for the algorithm is the data frame with dummy encoded lines of business per entity prepared previously.
X = df_entities_lobs
Y = TSNE(n_components = 2,
         perplexity = 18,
         verbose = 1,
         random_state = 1).fit_transform(X)
Next we use the k-means algorithm to determine the clusters in the Dutch insurance market. It appears that there are about eight clusters that are identifiable.
kmeans = KMeans(n_clusters = 8, random_state = 0, n_init = 10).fit(Y)
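The number of clusters here was chosen by eye. One common way to sanity-check such a choice is to compare silhouette scores for several values of k. The sketch below is an illustration under assumptions, not the method used in this notebook: it runs on synthetic 2d points (`Y_demo`, three made-up blobs) rather than on the real t-SNE output.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the 2d t-SNE output: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
Y_demo = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Fit KMeans for several k and record the mean silhouette score of each fit.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(Y_demo)
    scores[k] = silhouette_score(Y_demo, labels)

# A higher silhouette score means tighter, better separated clusters.
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On the real data one would pass Y instead of `Y_demo` and check whether the scores around k = 8 support the visual impression.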
Step 4: data visualization
The last step is to visualize the results of the t-SNE algorithm. For this we first produce the labels of the clusters with the average number of lines of business and the two most dominant lines of business. With that information we can describe the basic properties of the cluster.
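The cluster labels described above can be derived along the following lines. This is a minimal sketch with made-up data: `lobs_demo` and `labels_demo` are hypothetical stand-ins for df_entities_lobs and kmeans.labels_, and the helper `describe_clusters` is not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical dummy-encoded LoB data and cluster assignments.
lobs_demo = pd.DataFrame(
    [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]],
    index=["A", "B", "C", "D", "E"],
    columns=["L01", "L03", "S01", "S02"],
)
labels_demo = np.array([0, 0, 1, 1, 1])


def describe_clusters(lobs, labels, top_n=2):
    """Average number of LoBs and the most common LoBs per cluster."""
    summary = {}
    for cluster in np.unique(labels):
        members = lobs[labels == cluster]
        summary[int(cluster)] = {
            # Mean of the row sums: average number of LoBs per undertaking.
            "avg_lobs": float(members.sum(axis=1).mean()),
            # Column means are the share of members holding each LoB.
            "top_lobs": members.mean().nlargest(top_n).index.tolist(),
        }
    return summary


summary = describe_clusters(lobs_demo, labels_demo)
for cluster, info in summary.items():
    print(f"cluster {cluster}: {info['avg_lobs']:.1f} LoBs on average, "
          f"mostly {', '.join(info['top_lobs'])}")
```

Text built this way can then be placed next to each cluster in the scatter plot of the t-SNE coordinates.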

Clusters 1 and 5 are the two clusters with life insurance undertakings. Cluster 5 consists of general life insurance undertakings with a broad range of products in different life insurance lines of business. Cluster 1 consists of specialized life insurance undertakings with only one line of business. These undertakings are often relatively small.
The other clusters consist of non-life insurance undertakings. Cluster 2 is an easily identifiable cluster of health insurance undertakings with exactly two lines of business (accident and health insurance). Some health insurance undertakings also have a few other lines of business; they form the more general health insurance cluster 6 (near the uniform cluster 2).
Then we have cluster 3 with general non-life insurance undertakings operating a large number of lines of business (13 on average). Cluster 7 consists of medium-sized non-life insurance undertakings with fewer lines of business on average. Finally, clusters 4 and 0 consist of specialized and often small non-life insurance undertakings with 1 or 2 lines of business on average (cluster 0: specialized property damage and fire insurance undertakings, and cluster 4: specialized general liability and legal assistance insurance undertakings).
We can find the corresponding undertakings for each cluster. For example, the specialized small life insurance undertakings (cluster 1):
list(df_entities_lobs.index[kmeans.labels_ == 1])
['DELA Natura- en levensverzekeringen N.V.',
'Isle of Man Assurance Limited',
'Monuta Verzekeringen N.V.',
'N.V. Noordhollandsche van 1816, Levensverzekeringsmaatschappij',
'Nordben Life and Pension Insurance Co. Limited',
"Onderling Fonds 'Sliedrecht' B.A.",
'Tiels Onderling Fonds tot Uitkering bij Overlijden Gustaaf Adolf U.A.',
'Yarden Uitvaartverzekeringen N.V.',
'Zurich Life Insurance Company Limited']