Text modeling with S2 SFCRs

European insurance undertakings are required to publish each year a Solvency and Financial Condition Report (SFCR). These SFCRs are often made available via the insurance undertaking’s website. In this blog I will show some first results of a text modeling exercise using these SFCRs.

Text modeling was done with Latent Dirichlet Allocation (LDA), using Mallet's implementation via the gensim package (found here: https://radimrehurek.com/gensim/index.html). A description can be found here: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/. LDA is an unsupervised learning algorithm that generates latent (hidden) distributions over topics for each document or sentence, and a distribution over words for each topic.

To get the data I scraped as many SFCRs (in all European languages) as I could find on the Internet. As a result of this I have a data set of 4.36 GB with around 2,500 SFCR documents in PDF-format (until proven otherwise, I probably have the largest library of SFCR documents in Europe). Among these are 395 SFCRs in the English language, consisting in total of 287,579 sentences and 8.1 million words.

In an SFCR, an insurance undertaking publicly discloses information about a number of topics prescribed by the Solvency II legislation, such as its business and performance, system of governance, risk profile, valuation and capital management. Every SFCR therefore contains the same topics.

Given a set of documents, the LDA algorithm is able to find the dominant keywords that represent each topic. Strictly speaking, LDA models each document as a mixture of topics; here I make the simplifying assumption that each document is about one topic. We want to use the LDA algorithm to identify the different topics within the SFCRs so that, for example, we can extract all sentences about the solvency requirements. To do so, I will run the LDA algorithm on individual sentences from the SFCRs (thereby assuming that each sentence is about one topic).

I followed the usual steps: some data preparation to read the PDF files properly, then selecting the top 9,000 words and keeping only sentences with more than 10 words (it is known that the LDA algorithm does not work very well with rarely used words and with very short documents/sentences). I did not build bigram and trigram models because this did not really change the outcome. The data was then lemmatized such that only nouns, adjectives, verbs and adverbs were kept. The spacy package provides functions to tag the data and select the allowed postags.
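The filtering steps above can be sketched in plain Python. This is a minimal sketch, assuming the sentences have already been tokenized into word lists; the function and parameter names are illustrative, not the actual code behind this exercise:

```python
from collections import Counter

def prepare_sentences(sentences, top_n=9000, min_len=10):
    # Keep only sentences longer than min_len words: LDA does not work
    # well with very short documents/sentences.
    long_sentences = [s for s in sentences if len(s) > min_len]
    # Restrict the vocabulary to the top_n most frequent words: LDA also
    # struggles with rarely used words.
    counts = Counter(word for s in long_sentences for word in s)
    vocab = {word for word, _ in counts.most_common(top_n)}
    return [[word for word in s if word in vocab] for s in long_sentences]
```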

The main inputs for the LDA algorithm are a dictionary and a corpus. The dictionary contains all lemmatized words used in the documents, with a unique id for each word. The corpus is a mapping of word id to word frequency for each sentence. After we have generated these, we can run the LDA algorithm with the number of topics as one of the parameters.
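A minimal sketch of what the dictionary and corpus look like (the function names here are illustrative; in practice gensim provides this functionality):

```python
from collections import Counter

def build_dictionary(sentences):
    # Map each unique word to an integer id (what gensim's
    # corpora.Dictionary does, with extra filtering options).
    word2id = {}
    for sentence in sentences:
        for word in sentence:
            word2id.setdefault(word, len(word2id))
    return word2id

def to_bow(sentence, word2id):
    # Bag-of-words representation of one sentence: a sorted list of
    # (word id, frequency) pairs, like gensim's doc2bow.
    counts = Counter(word2id[w] for w in sentence if w in word2id)
    return sorted(counts.items())
```

With gensim itself this corresponds to `dictionary = corpora.Dictionary(sentences)` and `corpus = [dictionary.doc2bow(s) for s in sentences]`, after which the model can be trained with the number of topics as a parameter.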

The quality of the topic modeling is measured by the coherence score.

[Figure: the coherence score per number of topics]

A low number of topics that performs well is nine (coherence score 0.65); the highest coherence score is attained at 22 topics (0.67), which is fairly high in general. From this we conclude that nine topics is a good starting point.

What does the LDA algorithm produce? For each topic it generates a list of keywords with weights that represent that topic. The weights indicate how strong the relation between the keyword and the topic is: the higher the weight, the more representative the word is for that specific topic. Below, the first ten keywords of each topic are listed with their weights. The algorithm does not label a topic with one or two words, so for each topic I chose a description that more or less covers it (with the main subjects of the Solvency II legislation in mind).

Topic 0 ‘Governance’: 0.057*"management" + 0.051*"board" + 0.049*"function" + 0.046*"internal" + 0.038*"committee" + 0.035*"audit" + 0.034*"control" + 0.032*"system" + 0.030*"compliance" + 0.025*"director"

Topic 1 ‘Valuation’: 0.067*"asset" + 0.054*"investment" + 0.036*"liability" + 0.030*"valuation" + 0.024*"cash" + 0.022*"balance" + 0.020*"tax" + 0.019*"cost" + 0.017*"account" + 0.016*"difference"

Topic 2 ‘Reporting and performance’: 0.083*"report" + 0.077*"solvency" + 0.077*"financial" + 0.038*"condition" + 0.032*"information" + 0.026*"performance" + 0.026*"group" + 0.025*"material" + 0.021*"december" + 0.018*"company"

Topic 3 ‘Solvency’: 0.092*"capital" + 0.059*"requirement" + 0.049*"solvency" + 0.039*"year" + 0.032*"scr" + 0.030*"fund" + 0.027*"model" + 0.024*"standard" + 0.021*"result" + 0.018*"base"

Topic 4 ‘Claims and assumptions’: 0.023*"claim" + 0.021*"term" + 0.019*"business" + 0.016*"assumption" + 0.016*"market" + 0.015*"future" + 0.014*"base" + 0.014*"product" + 0.013*"make" + 0.012*"increase"

Topic 5 ‘Undertaking’s strategy’: 0.039*"policy" + 0.031*"process" + 0.031*"business" + 0.030*"company" + 0.025*"ensure" + 0.022*"management" + 0.017*"plan" + 0.015*"manage" + 0.015*"strategy" + 0.015*"orsa"

Topic 6 ‘Risk management’: 0.325*"risk" + 0.030*"market" + 0.027*"rate" + 0.024*"change" + 0.022*"operational" + 0.021*"underwriting" + 0.019*"credit" + 0.019*"exposure" + 0.013*"interest" + 0.013*"liquidity"

Topic 7 ‘Insurance and technical provisions’: 0.049*"insurance" + 0.045*"reinsurance" + 0.043*"provision" + 0.039*"life" + 0.034*"technical" + 0.029*"total" + 0.025*"premium" + 0.023*"fund" + 0.020*"gross" + 0.019*"estimate"

Topic 8 ‘Undertaking’: 0.065*"company" + 0.063*"group" + 0.029*"insurance" + 0.029*"method" + 0.023*"limit" + 0.022*"include" + 0.017*"service" + 0.016*"limited" + 0.015*"specific" + 0.013*"mutual"

To determine the topic of a sentence, we calculate for each topic the sum of the weights of the words in the sentence. The main topic of the sentence is then expected to be the topic with the highest sum.
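This scoring can be sketched as follows; the topic-word weights used in the test are toy values, not the fitted model (with gensim one would instead query the trained model for a sentence's topic distribution):

```python
def dominant_topic(sentence, topic_word_weights):
    # Sum, per topic, the weights of the words appearing in the sentence;
    # words unknown to a topic contribute nothing.
    scores = {
        topic: sum(weights.get(word, 0.0) for word in sentence)
        for topic, weights in topic_word_weights.items()
    }
    # The main topic is the one with the highest total weight.
    return max(scores, key=scores.get)
```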

If we run the following sentence (found in one of the SFCRs) through the model

"For the purposes of solvency, the Insurance Group’s insurance obligations
are divided into the following business segments: 1. Insurance with profit
participation 2. Unit-linked and index-linked insurance 3. Other life
insurance 4. Health insurance 5. Medical expence insurance for non-life
insurance 6. Income protection insurance for non-life insurance Pension &
Försäkring (Sweden) Pension & Försäkring offers insurance solutions on the
Swedish market within risk and unit-linked insurance and traditional life

then we get the following results per topic:

[(0, 0.08960573476702509), 
(1, 0.0692951015531661),
(2, 0.0692951015531661),
(3, 0.06332138590203108),
(4, 0.08363201911589009),
(5, 0.0692951015531661),
(6, 0.08004778972520908),
(7, 0.3369175627240143),
(8, 0.13859020310633216)]

Topic seven (‘Insurance and technical provisions’) clearly has the highest score (0.34), followed by topic eight (‘Undertaking’). This suggests that these sentences are about the insurance and technical provisions of the undertaking (which we can verify).

Likewise, for the sentence

"Chief Risk Officer and Risk Function 
The Board has appointed a Chief Risk Officer (CRO) who reports directly to
the Board and has responsibility for managing the risk function and
monitoring the effectiveness of the risk management system."

we get the following results:

[(0, 0.2926447574334898), 
(1, 0.08294209702660407),
(2, 0.07824726134585289),
(3, 0.07824726134585289),
(4, 0.07824726134585289),
(5, 0.08450704225352113),
(6, 0.14866979655712048),
(7, 0.07824726134585289),
(8, 0.07824726134585289)]

Topic zero (‘Governance’) and topic six (‘Risk management’) have the highest scores, suggesting that this sentence is about the governance of the insurance undertaking and, to a lesser extent, risk management.

The nine topics that were identified reflect fairly different elements in the SFCR, but we also see that some topics consist of several subtopics that could be identified separately. For example, the topic that I described as ‘Valuation’ covers assets and investments but it might be more appropriate to distinguish investment strategies from valuation. The topic ‘Solvency’ covers own funds as well as solvency requirements. If we increase the number of topics then some of the above topics will be split into more topics and the topic determination will be more accurate.

Once we have built the LDA model, we can use it for several applications. First, of course, we can determine the topics of previously unseen documents and sentences. We can also analyze topic distributions across different SFCRs, and we can retrieve similar sentences for any given sentence (based on the distance between the topic probability scores of the given sentence and those of other sentences).
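The similar-sentence idea can be sketched with the Hellinger distance, a common choice for comparing probability distributions (gensim also ships such distance functions in its matutils module); the function names here are illustrative:

```python
from math import sqrt

def hellinger(p, q):
    # Hellinger distance between two topic distributions of equal length:
    # 0 for identical distributions, 1 for distributions with disjoint support.
    return sqrt(0.5 * sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q)))

def most_similar(target, candidates):
    # Index of the candidate topic distribution closest to the target.
    return min(range(len(candidates)), key=lambda i: hellinger(target, candidates[i]))
```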

In this blog I described first steps in text modeling of Solvency and Financial Condition Reports of insurance undertakings. The coherence scores were fairly high and the identified topics represented genuine topics from the Solvency II legislation, especially with a sufficient number of topics. Some examples showed that the LDA model is able to identify the topic of specific sentences. However, this does not yet work perfectly: an important element of SFCR documents is the numerical information, often stored in table form in the PDF, which is difficult to analyze with the LDA algorithm.
