Pattern discovery in Solvency 2 data (1)

This blog describes some results with algorithms for pattern discovery in Solvency 2 quantitative reports. The main idea is to automatically uncover patterns that are present in these reports. We want to find patterns that represent widely or commonly occurring situations, and that possibly represent business rules and relations prescribed by the underlying legislation. Once we have found these patterns, we can also identify which data satisfy them and which data do not.

How can we find these patterns? Initially, I started with association rules. This is a rule-based machine learning approach that leads to transparent and explainable results. However, for this approach the quantitative data has to be encoded as a set of features (for example by replacing every quantitative value with an appropriate nominal value), and with high-dimensional data this quickly becomes computationally expensive.
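To make the encoding step concrete, here is a minimal sketch of how quantitative values could be replaced with nominal values before association rule mining. The column names, bin edges and labels are hypothetical and chosen only for illustration; they are not taken from the Solvency 2 data or from insurlib.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({"total_assets": [0.0, 120.0, 5400.0, 98000.0],
                   "own_funds": [0.0, 30.0, 900.0, 15000.0]})


def encode_nominal(column: pd.Series) -> pd.Series:
    """Replace each quantitative value with a coarse nominal label."""
    # Assumed bin edges; in practice these would be chosen per data point.
    bins = [-float("inf"), 0, 1_000, 10_000, float("inf")]
    labels = ["zero_or_less", "small", "medium", "large"]
    return pd.cut(column, bins=bins, labels=labels).astype(str)


encoded = df.apply(encode_nominal)

# One-hot encode: every (column, label) pair becomes a separate item.
items = pd.get_dummies(encoded)
print(items.shape)
```

Two columns already lead to seven items here, which gives an impression of how quickly the number of features grows when a report contains hundreds of data points.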

After a few iterations I decided to write a pattern discovery algorithm designed specifically for analyzing the quantitative data points of reports. The goal was to speed up the process while keeping the association rule mining approach and its performance measures. Below I give some examples of which patterns can be found and how. The code is part of the insurlib package on GitHub. I used the public Solvency 2 data of Dutch insurers to find patterns in the quantitative data of these insurers (but applications to other data sets are possible).

First import pandas and the insurlib package. The patterns part of the package consists of functions to generate patterns in numerical columns of dataframes.

import pandas as pd
from insurlib import patterns

Now read the public Solvency 2 data as described in How to analyze public Solvency 2 data of Dutch insurers (by reading the Excel file and defining the get_sheet function). To recall, the Excel file consists of the following worksheets:

  • Worksheet 14: balance sheet
  • Worksheet 15: premiums – life
  • Worksheet 16: premiums – non-life
  • Worksheet 17: technical provisions – life
  • Worksheet 18: technical provisions – non-life
  • Worksheet 19: transition and adjustments
  • Worksheet 20: own funds
  • Worksheet 21: solvency capital requirements – 1
  • Worksheet 22: solvency capital requirements – 2
  • Worksheet 23: minimum capital requirements
  • Worksheet 24: additional information life
  • Worksheet 25: additional information non-life
  • Worksheet 26: additional information reinsurance

Example 1: comparing two dataframes

Suppose we want to compare the balance sheet worksheet with the non-life technical provisions worksheet and find the relations between the contents of these worksheets.

df1 = get_sheet(14)
df2 = get_sheet(18)
df2.columns = [str(col) for col in df2.columns]

The last line converts the multi-level column labels to a single level (so that we can compare the dataframe more easily with other dataframes).
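As a small illustration of what this conversion does, the sketch below applies it to a toy dataframe with two-level column labels. The labels are made up for the example; only the conversion itself mirrors the step above.

```python
import pandas as pd

# Toy dataframe with two-level column labels (hypothetical names).
toy = pd.DataFrame(
    [[1.0, 2.0]],
    columns=pd.MultiIndex.from_tuples([("total", "gross"), ("total", "net")]),
)

# Each column label is a tuple; str() turns it into one flat string.
toy.columns = [str(col) for col in toy.columns]
print(list(toy.columns))  # ["('total', 'gross')", "('total', 'net')"]
```

This is why column names from the technical provisions sheet appear in the rules as strings that look like tuples.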

You can generate patterns with the generate function. Patterns that are found have the ‘association rule’ structure P -> Q. If you input two dataframes (P_dataframe and Q_dataframe), then all columns of P_dataframe are compared to all columns of Q_dataframe. The pattern we are looking for is ‘=’, so patterns with corresponding values are found. You can also use other patterns, such as ‘<’, ‘<=’, ‘>’, ‘>=’ and ‘!=’. A dict of parameters is also used as input, in this case with the minimum confidence and the minimum support.

rules = patterns.generate(P_dataframe = df1,
                          Q_dataframe = df2,
                          pattern = "=",
                          parameters = {"min_confidence": 0.75, 
                                        "min_support": 10})
rules = list(rules)
print("Number of rules: " + str(len(rules)))
Number of rules: 2

The output is a generator that we can convert to a list of rules. Two rules are found in this case. Let’s look at the first rule.

    'assets|reinsurance recoverables from:|non-life and health 
    similar to non-life , solvency ii value', 
    "('total non-life obligation', 'Technical provisions -
    total|Recoverable from reinsurance contract/SPV and Finite
    Re after the adjustment for expected losses due to counterparty
    - total')"], 

The first pattern states that the value of the non-life reinsurance recoverables on the asset side of the balance sheet equals the value of the recoverables from reinsurance contracts for non-life obligations in the technical provisions sheet. The rule has a confidence of almost 97% and a support of 127. This means that the pattern occurs 127 times in the data and that it holds in almost 97% of all cases with nonzero data points. In four cases the pattern does not hold, i.e. the reinsurance recoverables do not equal the recoverables in the technical provisions (these are presumably data errors).
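With a support of 127 and four exceptions, the confidence works out as 127 / (127 + 4), which is roughly 0.969. A minimal sketch of how these measures could be computed for an ‘=’ pattern between two columns is given here. It is not the insurlib implementation: the function name, the tolerance parameter and the choice to ignore rows where both values are zero or missing are assumptions made for illustration.

```python
import pandas as pd


def equality_pattern_stats(p: pd.Series, q: pd.Series, tolerance: float = 0.0):
    """Return (support, exceptions, confidence) for the pattern p = q."""
    # Assumption: rows where both values are zero or missing are ignored.
    p = p.fillna(0)
    q = q.fillna(0)
    relevant = (p != 0) | (q != 0)
    holds = relevant & ((p - q).abs() <= tolerance)
    support = int(holds.sum())
    exceptions = int((relevant & ~holds).sum())
    total = support + exceptions
    confidence = support / total if total else 0.0
    return support, exceptions, confidence


# Toy data: the pattern holds in three informative rows and fails in one.
p = pd.Series([10.0, 0.0, 7.5, 3.0, 4.0])
q = pd.Series([10.0, 0.0, 7.5, 3.0, 9.0])
print(equality_pattern_stats(p, q))  # (3, 1, 0.75)
```

A rule would then be reported only if its confidence and support reach the minimum values passed in the parameters.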

The second rule that was found reads:

    'liabilities|technical provisions – non-life , solvency ii value',
    "('total non-life obligation', 'Technical provisions -
    total|Technical provisions - total')"], 

This rule also has a high confidence. It says that the value of the technical provisions for non-life in the balance sheet equals the value of the total technical provisions in the technical provisions sheet. This is a plain consistency rule between the sheets. The six exceptions are presumably data errors.

At the time of publication, neither rule was part of the predefined, automatic validation rules of the Solvency II reports in the XBRL taxonomy (otherwise the confidence would have been 100%). By analyzing the reports in this manner we were able to uncover them automatically.

Example 2: patterns of sums

Financial reports often contain sums. We can analyze the column names to detect potential sums (a hierarchy can often be identified in the column names), but we can also search for patterns of sums in the data itself. The following code does that. We input the balance sheet dataframe and let the algorithm search for ‘sum’ patterns. The parameter sum_elements sets the maximum number of elements in the sum (in this case three).

rules = patterns.generate(dataframe = df1,
                          pattern = "sum",
                          parameters = {"sum_elements": 3})
rules = list(rules)
print("Number of rules: " + str(len(rules)))
Number of rules: 7

Let’s take a look at the first rule:

    ['assets|investments (other than assets held for index-linked and 
     unit-linked contracts)|equities|equities - listed , solvency ii 
     'assets|investments (other than assets held for index-linked and 
     unit-linked contracts)|equities|equities - unlisted , solvency ii
    'assets|investments (other than assets held for index-linked and 
     unit-linked contracts)|equities , solvency ii value'], 

The rule states that the sum of the listed and unlisted equities equals the total equities (so equities are either listed or unlisted). This rule has a confidence of 100%, so there is presumably a validation rule for it within the reports. Six rules with this structure were found in this way. The search is, however, somewhat computationally expensive.
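To see where the computational cost comes from, consider a brute-force search for sum patterns, sketched here under stated assumptions. This is not how insurlib necessarily works: the function name, the max_terms and tolerance parameters and the toy column names are hypothetical. The sketch tries every combination of columns as terms and every remaining column as the total, and keeps only patterns that hold in all rows with a nonzero total.

```python
from itertools import combinations

import pandas as pd


def find_sum_patterns(df, max_terms=2, tolerance=1e-9):
    """Yield (terms, total, support) when the term columns add up to the total."""
    numeric = df.select_dtypes("number")
    for n_terms in range(2, max_terms + 1):
        for terms in combinations(numeric.columns, n_terms):
            summed = numeric[list(terms)].sum(axis=1)
            for total in numeric.columns:
                if total in terms:
                    continue
                # Assumption: only rows with a nonzero total are informative.
                relevant = numeric[total] != 0
                holds = relevant & ((summed - numeric[total]).abs() <= tolerance)
                # Keep only patterns without exceptions (confidence of 100%).
                if relevant.any() and holds.sum() == relevant.sum():
                    yield terms, total, int(holds.sum())


toy = pd.DataFrame({
    "listed": [1.0, 2.0, 0.0],
    "unlisted": [3.0, 0.0, 0.0],
    "equities": [4.0, 2.0, 0.0],
    "other": [9.0, 9.0, 9.0],
})
for terms, total, support in find_sum_patterns(toy):
    print(" + ".join(terms), "=", total, "| support:", support)
```

On the toy data this prints a single pattern, listed + unlisted = equities with a support of 2. The number of candidate combinations grows rapidly with the number of columns and the number of terms, which is why a search like this becomes expensive on a full balance sheet.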

Example 3: patterns with a given value

The last example searches for patterns with specific values. In this case we want to know in how many cases the investments are higher than zero. We input the dataframe as in example 2, add the parameter columns and set it to the name of the column we want to investigate (you can in fact input a list of columns).

# Adjacent string literals are joined, so the column name stays one string
P = ['assets|investments (other than assets held for index-linked and '
     'unit-linked contracts) , solvency ii value']
rules = patterns.generate(dataframe = df1,
                          pattern = ">",
                          columns = P,
                          value = 0,
                          parameters = {'min_confidence': 0.75,
                                        'min_support': 1})
rules = list(rules)

    'assets|investments (other than assets held for index-linked and 
    unit-linked contracts) , solvency ii value', 

The value of the investments is higher than zero with a confidence of 96%. In eleven cases the value is not higher than zero. The confidence is high because insurers normally invest the premiums collected for insurance policies in a wide range of investment assets. If no list of columns is given, patterns in all numerical columns of the dataframe are returned.
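The idea of comparing a column with a fixed value can be sketched in plain pandas as follows. This is an illustration, not the insurlib code: the function name and the decision to skip missing values are assumptions, and the numbers are toy data rather than figures from the public Solvency 2 data.

```python
import operator

import pandas as pd

# Map the pattern symbols used in this blog to comparison functions.
OPERATORS = {
    "=": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


def value_pattern_stats(column: pd.Series, pattern: str, value: float):
    """Return (support, exceptions, confidence) for 'column <pattern> value'."""
    data = column.dropna()  # Assumption: missing values are skipped.
    holds = OPERATORS[pattern](data, value)
    support = int(holds.sum())
    exceptions = int((~holds).sum())
    confidence = support / len(data) if len(data) else 0.0
    return support, exceptions, confidence


investments = pd.Series([120.0, 35.5, 0.0, 980.0, None])
print(value_pattern_stats(investments, ">", 0))  # (3, 1, 0.75)
```

The rows counted as exceptions are the ones to inspect: they are either data errors or insurers in a specific situation.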

The aim of these examples is to give a general idea of pattern discovery in Solvency 2 quantitative data. Numerous patterns can be found in this way by using the complete data set. By using the measures confidence and support we can also find patterns that do not hold perfectly but still provide information about the data, without resorting to statistical methods. Data errors and specific situations that lead to exceptions in the data are not expressions of pure randomness and should therefore not be analyzed with statistical methods. With these patterns we are able to reconstruct basic relations in the data.

Of course, many improvements are possible in order to find more complex patterns (and that is why there is a (1) in the title of this blog). Presumably all existing validation rules can be found in this manner, and many more besides. Hopefully I will be able to implement these improvements and present them in a new blog.
