Category Archives: Natural language processing

Multilingual termbases with metadata from reporting templates

Published by:

Domain-specific termbases are of great importance to many domain-specific NLP-tasks. They enable identification and annotation of terms in documents in situations where often not enough text is available to use statistical approaches. And more importantly, they form a step towards extracting structured facts from unstructured text data.

This blog shows how to construct and use multilingual termbases to annotate text from supervisory documents in different European languages with references to relevant parts of (quantitative) supervisory templates. By linking qualitative data (text) to quantitative data (numbers) we connect initially unstructured text data to data that are often to a high degree structured and well-defined in data point models and taxonomies. I will do this by constructing termbases that contain terminology data combined with linguistic data and metadata from the supervisory templates from different financial sectors.

For terminology data I will start with the IATE-database. Most terminology that is used in European quantitative reporting templates is based on and derived from European legislation. Having multilingualism as one of its founding principles is, the EU publishes terminology in the IATE-database in all official European languages to provide consistent translation of terms in European legislation. The IATE-database is published in the form of a file in TBX-format (TermBase eXchange). But termbases can also be in form of SKOS (Simple Knowledge Organization System, built upon the RDF-format). Both formats are data models that contain descriptions and properties of concepts and are to some extent interchangeable (see for example here).

For metadata on reporting templates I will use relevant XBRL Taxonomies in RDF-format (see here). Normally, XBRL Taxonomy are developed specifically for a single sector and therefore covers to some extent the financial terminology used within that sector. XBRL Taxonomies contain metadata of all data point in the reporting templates. From a XBRL Taxonomy a Data Point Model can be derived (that is: the taxonomy contains all definitions) and is often published together with the taxonomy which is only computer readable.

For linguistic data I will use the Python NLP package of Stanford Stanza, which provide pretrained NLP-models for all official European languages (in order of becoming an official EU language: Dutch, French, German, Italian (1958), Danish, English (1973), Greek (1981), Portuguese, Spanish (1986), Finnish, Swedish (1995), Czech, Estonian, Hungarian, Latvian, Lithuanian, Maltese, Polish, Slovak, Slovenian (2004), Bulgarian, Irish, Romanian (2007) and Croatian (2013)).

So we add semantic and linguistic structure to a terminology database. The resulting data structure is sometimes called an ontology, a taxonomy or a vocabulary, but these terms have no clear distinctive definitions. And moreover, the XBRL-people use the term taxonomy to refer to a structure that contains concepts with properties for definitions, labels, calculations and (table) presentations. To some extent it contains structured metadata of data points (i.e. the semantics of the data). Because of that you can say that it corresponds to an ontology within a Linked Data context. On the other hand a taxonomy within a Linked Data context (and everywhere else if I might add) is basically a description of concepts with sub-class relationships (concepts with hierarchical information). In the remainder of this blog I will use the term termbase for the resulting structure with semantic, linguistic and terminological data combined.

Constructing a termbase from IATE and XBRL

In a previous blog I have described how to set up a terminology database (termbase) specifically for insurance-related terms. Now I will add links from the concepts and terms in the IATE-database to the data point model of insurance reporting templates (thereby adding semantics to the termbase), and secondly I will add linguistic information at term-level like lemmas and part-of-speech tags to allow for easy usage in NLP-tasks. The TBX-format in which the IATE-database is published allows for storing references as well as linguistic data on term-level, so we can construct the termbase as a standalone file in TBX-format (another solution would be to add the terminology and linguistic information to the XBRL Taxonomy and use that as a basis).

The IATE-database currently contains almost 930.000 concepts and many of them have verbal expressions in multiple languages resulting in over 8.000.000 terms. A single (English) expression of a concept in the IATE-database looks like this.

<conceptEntry id="3539858">
  <descrip type="subjectField">insurance</descrip>
  <langSec xml:lang="en">
      <term>basic own funds</term>
      <termNote type="termType">fullForm</termNote>
      <descrip type="reliabilityCode">9</descrip>

Adding labels from the XBRL Taxonomy

For the termbase, we add every element in the XBRL Taxonomy that has a label (tables, concepts, elements, dimensions and members) to the termbase and we add an external cross reference to the template and the location in that template where the element is used (the row or column within the template). The TBX-format allows for fields called externalCrossReference which refer to a resource that is external to the terminology database. Then you get concept entries like this:

<conceptEntry id="">
  <descrip type="xbrlTypes">element</descrip>
  <xref type="externalCrossReference">S.,R0020</xref>
  <langSec xml:lang="en">
      <term>Basic own funds</term>
      <termNote type="termType">fullForm</termNote>
      <termNote type="termType">shortForm</termNote>

This means that the termbase concept with the id “…/s.” has English labels “Basic own funds” (full form) and “R0020” (short form). These are the labels of row 0020 of template S., i.e. in the template definition you will see Basic own funds on row 0020.

The template and rc-codes where elements refer to were extracted using SPARQL-queries on the XBRl Taxonomy in RDF-format.

Adding references between IATE- and XBRL-concepts

Now that we have added terms for labelled elements from the XBRL Taxonomy, the next step is to add cross references between the IATE-concepts and the XBRL-concepts. In the TBX-format the crossReference is a pointer to another related location, such as another entry or another term, in this case to a XBRL-concept. Below the references to the XBRL-concepts are added.

<conceptEntry id="3539858">
  <descrip type="subjectField">insurance</descrip>
  <langSec xml:lang="en">
      <term>basic own funds</term>
      <termNote type="termType">fullForm</termNote>
      <descrip type="reliabilityCode">9</descrip>
  <ref type="crossReference"></descrip>
  <ref type="crossReference"></descrip>
  <ref type="crossReference"></descrip>
  <ref type="crossReference"></descrip>

The IATE-concept with id 3539858 points to the domain item el#x7 (a domain in XBRL is a set of related items, in this case own funds items), and furthermore to (table) elements s., s. and s. These all refer to a single row or a single column within a template and the last one is given as an example above. It is the row in the table S. with label ‘Basic own funds’. IATE-concepts and XBRL-concepts are considered equal when their lowercase lemmas are the same (except for abbreviations).

For XBRL Taxonomy of Solvency 2 we find in this way 740 unique terms in the IATE-database that are identical to a XBRL-concept (in English), and 1.500 terms that occur in the labels of XBRL-concepts but are not identical, for example ‘net best estimate’ is a part of the label ‘Net Best Estimate of Premium Provisions’. For now I focused on identifying identical terms. How to process long labels with more than one term is a remaining challenge (this describes probably a useful approach).

Adding part-of-speech tags and lemmas

To make the termbase applicable for NLP-tasks we need to add additional linguistic information, such as part-of-speech patterns and lemmas. This improves the annotation process later on. Premium as an adjective has another meaning than premium as a common noun. By matching the PoS-pattern we get better matches (i.e. less false positives). And also the lemmas of a term will find terms irrespective of whether they are in grammatical singular or plural form. This is the concept to which PoS-patterns and lemma are added:

<conceptEntry id="3539858">
  <descrip type="subjectField">insurance</descrip>
  <langSec xml:lang="en">
      <term>basic own funds</term>
      <termNote type="termType">fullForm</termNote>
      <descrip type="reliabilityCode">9</descrip>
      <termNote type="termLemma">basic own fund</termNote>
      <termNote type="partOfSpeech">adj, adj, noun</termNote>
  <langSec xml:lang="it">
      <term>fondi propri di base</term>
      <termNote type="termType">fullForm</termNote>
      <descrip type="reliabilityCode">9</descrip>
      <termNote type="termLemma">fondo proprio di base</termNote>
      <termNote type="partOfSpeech">noun, det, adp, noun</termNote>

Annotating text in different languages

Because the termbase contains multilingual terms with references to templates and locations we are now able to annotate terms in documents in different European languages. Below you find some examples of what you can do with the termbase. I took an identical text from the Solvency 2 Delegated Acts in English, Finnish and Italian (the first part of article 52), converted the text to NAF and added annotations with the termbase (Nafigator has a function that processes a complete termbase and adds annotations to the NAF-file). This results in the following (using a visualizer from the spaCy package):

In English

Article 52 Mortality riskTerm S.,R0100 stress The mortality riskTerm S.,R0100 stress referred to in Article 77b(1)(f) of Directive 2009/138/EC shall be the more adverse of the 1. following two scenarios in terms of its impact on basic own fundsTerm S.,R0020: (a) an instantaneous permanent increase of 15 % in the mortality rates used for the calculation of the best estimateTerm S.,R0540; (b) an instantaneous increase of 0.15 percentage points in the mortality rates (expressed as percentages) which are used in the calculation of technical provisionsTerm S.,R0010 to reflect the mortality experience in the following 12 months.

In Finnish

52 artikla KuolevuusriskiinTerm S.,R0100 liittyvä stressi Direktiivin 2009/138/EY 77 b artiklan 1 kohdan f alakohdassa tarkoitetun, kuolevuusriskiinTerm S.,R0100 liittyvän stressin on 1. oltava seuraavista kahdesta skenaariosta se, jonka epäsuotuisa vaikutus omaan perusvarallisuuteenTerm S.,R0020 on suurempi: a) välitön, pysyvä 15 %:n nousu parhaan estimaatinTerm S.,R0540 laskennassa käytetyssä kuolevuudessa; b) välitön 0,15 prosenttiyksikön nousu prosentteina ilmaistussa kuolevuudessa, jota käytetään vakuutusteknisen vastuuvelanTerm S.,R0010 laskennassa ilmentämään havaittua kuolevuutta seuraavien 12 kuukauden aikana. Sovellettaessa 1 kohtaa kuolevuuden nousua sovelletaan ainoastaan vakuutuksiin, joissa kuolevuuden nousu johtaa

In Italian

Articolo 52 Stress legato al rischio di mortalitàTerm S.,R0100 Lo stress legato al rischio di mortalitàTerm S.,R0100 di cui all’articolo 77 ter, paragrafo 1, lettera f), della direttiva 2009/138/CE è 1. il più sfavorevole dei due seguenti scenari in termini di impatto sui fondi propri di baseTerm S.,R0020: (a) un incremento permanente istantaneo del 15 % dei tassi di mortalità utilizzati per il calcolo della migliore stima; (b) un incremento istantaneo di 0,15 punti percentuali dei tassi di mortalità (espressi in percentuale) utilizzati nel calcolo delle riserve tecnicheTerm S.,R0010 per tener conto dei dati tratti dall’esperienza relativi alla mortalità nei 12 mesi successivi. Ai fini del paragrafo 1, l’incremento dei tassi di mortalità si applica soltanto alle polizze di assicurazione per le 2. quali tale incremento comporta un aumento delle riserve tecnicheTerm S.,R0010 tenendo conto di tutto quanto segue:

In the Italian text one reference is missing: the IATE-database does not yet contain an Italian translation for the English term best estimate. This happens because the IATE-database is far from complete. Not all terms are available in all languages, and the IATE-database probably does not contain all terminology from the reporting templates. And although the IATE-database is constantly updated, it might be necessary for certain use cases to add additional translations and concepts to the database.

This works for every XBRL Taxonomy, although every taxonomy has its own peculiarities (it’s a “standard” as one of my IT-colleagues likes to say). Following the procedure described above, I made the following termbases for insurance undertakings, credit institutions and Dutch pension funds based on the taxonomy versions mentioned:

  • Termbase of EIOPA Solvency 2, taxonomy version 2.6.0
  • Termbase of EBA CRD IV, taxonomy version
  • Termbase of DNB FTK, taxonomy version 2.3.0

These can be found on

Natural Language Processing in RDF graphs

Published by:

This blog shows how to store text data in a RDF graph and retrieve and analyze information from that graph. Resource Description Format (RDF) graphs are very suitable structures for storing Natural Language Processing (NLP) data. They enable combining NLP data with other data sets in RDF (such as legal entities data from the Global LEI Foundation and the EIOPA register of European insurance undertakings, terminology data, for example Solvency 2 terminology and data from XBRL reports); and they allow adding text semantics in the form of linguistic annotations, which enables NLP analyses simply by executing database queries.

Here is what I did. To get a proper amount of text data I web-scraped the entire website of De Nederlandsche Bank (text in webpages and in PDF documents, including speeches, press releases, research publications, sector information, dnbulletins, and all blogs by Maarten Gelderman and Olaf Sleijpen, consisting of over 4.000 documents). Text extraction from the web pages was done with the Python package newspaper3k (a great tip from my NLP colleagues from the Authority for Consumers and Markets). Text data was then converted to the NLP Annotation Format (NAF), for which I defined a RDF representation (implemented in the Nafigator package) to upload the data in a RDF triple-store. For the triple-store I used Ontotext’s GraphDB, one of the best RDF database currently available. Then, information can be retrieved from the graph database with SPARQL queries for all kinds of NLP analyses.

Using a triple-store for NLP data leads to an efficient retrieval process of text data, especially if you compare that to a process where you search through different annotation files. Triple-stores for RDF (and the new RDF-star) have become efficient and powerful solutions with the equal capabilities as property graphs but with advantages of RDF and ontologies.

I will describe two parts of this process that are not straightforward in detail: the RDF representation of NAF, and retrieving data from the graph database.

The NLP Annotation Format in RDF

The NLP Annotation Format is an easy format for storing text annotations (see here for links to the description). All documents that were scraped from the website were processed with the Python package Nafigator, that is able to convert PDF document and HTML-files to XML-files satisfying the NLP Annotation Format. Standard annotation layers with the raw text, word forms, terms, named entities and dependencies were added using the Stanford Stanza NLP processor.

In this representation every annotation (word forms, terms, named entities, etc.) of every document must have an Uniform Resource Identifier (URI). To do this, I used a prefix doc_xxx for each document in the document set. This prefix can, for example, be set by

@prefix doc_001: <> .

Which in this case is an identifier based on the domain of this blog. For web-scraped documents you might also use the original URL of the document. Furthermore, for the RDF representation of NAF a provisional RDF Schema with prefix naf-base was made with the basic properties en classes of NAF.

The basic structure is set out below. All examples provided below are derived from the file example.pdf in the Nafigator package (the first sentences of the first page starts with: ‘The Nafigator package … ‘).

Document and header

Every document has a header and pages.

doc_001:doc a naf-base:document ;
    naf-base:hasHeader doc_001:nafHeader ;
    naf-base:hasPages ( doc_001:page1 ) .

Here naf-base:document is a RDF Class and naf-base:hasHeader and naf-base:hasPages are RDF Properties. The three lines above state that doc_001:doc is a document with header doc_001:nafHeader and a single page doc_001:page1.

In the header all metadata of the document is stored, including all linguistics processors and models that were used in processing the document. Below you see the metadata of the NAF text layer and the document metadata.

doc_001:nafHeader a naf-base:header ;
    naf-base:hasLinguisticProcessors [ 
        naf-base:hasLayer naf-base:text ;
        naf-base:lp [ 
            naf-base:hasBeginTimestamp "2022-04-10T13:45:43UTC" ;
            naf-base:hasEndTimestamp "2022-04-10T13:45:44UTC" ;
            naf-base:hasHostname "desktop-computer" ;
            naf-base:hasModel "stanza_resources\\en\\tokenize\\" ;
            naf-base:hasName "text" ;
            naf-base:hasVersion "stanza_version-1.2.2" 
    ] ;
    naf-base:hasPublic [ 
        dc:format "application/pdf" ;
        dc:uri "data/example.pdf" 
    ] .

Sentences, paragraphs and pages

Here is an example of a sentence object with properties.

doc_001:sent1 a naf-base:sentence ;
    naf-base:isPartOf doc_001:para1, doc_001:page1 ;
    naf-base:hasSpan ( doc_001:wf1 doc_001:wf2 ...  doc_001:wf29 ) .

These three lines describe the properties of the RDF subject doc_001:sent1. The doc_001:sent1 identifies the RDF subject for the first sentence of the first document; the first line says that the subject doc_001:sent1 is a (rdf:type) sentence. The second line says that this sentence is a part of the first paragraph and the first page of the document. The span of the sentence contains a ordered list of word forms of the sentence: doc_001:wf1, doc_001:wf2 and so on.

Paragraphs and pages to which the sentences refer are defined in a similar way.

Word forms and terms

Of each word form the properties text, length and offset are defined. The word form is a part of a term, sentence, paragraph and page, and that is also defined for every word form. Take for example the the word form doc_001:wf2 defined as:

doc_001:wf2 a naf-base:wordform ;
    naf-base:hasText "Nafigator"^^rdf:XMLLiteral ;
    naf-base:hasLength "9"^^xsd:integer ;
    naf-base:hasOffset "4"^^xsd:integer ;
    naf-base:isPartOf doc_001:page1, 

In the next layer the terms of the word forms are defined, with their linguistic properties (lemma, grammatical number, part-of-speech and if applicable other properties such as verb voice and verb form). The term that refers to the word form above is

doc_001:term2 a naf-base:term ;
    naf-base:hasLemma "Nafigator"^^rdf:XMLLiteral ;
    naf-base:hasNumber olia:Singular ;
    naf-base:hasPos olia:ProperNoun ;
    naf-base:hasSpan ( doc_001:wf2 ) .

For the linguistic properties the OLiA ontology is used, which stand for Ontologies of Linguistic Annotations, an OWL taxonomy of data categories for linguistic annotations. The ontology contains precise definitions and interrelation between the linguistic categories. In this case the grammatical number (olia:Singular) and the part-of-speech tag (olia:ProperNoun) is included in the properties of this term. Depending of the term other properties are defined, for example verb forms. The span of the term refers back to the word forms (if you create a NAF ontology then you would define this as a transitive relationship, but for now, by including both relations we speed up the retrieval process).

Named entities

Next are the named entities that are stored in another NAF layer and here as separate subjects in the triple-store. An entity refers back to a term and has a certain type (organization, person, product, law, date and so on). The text of the entity is already stored in the term object so there is not need to include it here. External references could be added here, for example references to legal entities from Global LEI Foundation. Here is the example referring to the triples above.

doc_001:entity1 a naf-base:entity ;
    naf-base:hasType naf-entity:product ;
    naf-base:hasSpan ( doc_001:term2 ) .


Powerful NLP models exist that are able to derive relationships between words in within sentences. The dependencies are defined on the level of terms and stored in the dependency layer of NAF. In this RDF representation the dependencies are simply added to the terms.

doc_001:term3 a naf-base:term ;
    naf-rfunc:compound doc_001:term2 ;
    naf-rfunc:det doc_001:term1 .

The second and third line say that term3 (‘package’) forms a compound term with term2 (‘Nafigator’) and has its determinant in term1 (‘The’).

There are more annotation layers in NAF, but these are the most basic ones and if you have these, then many powerful NLP analyses already can be done.

Information retrieval from the RDF graph database

The conversion of text to RDF described above was applied to all webpages and documents of the website of DNB, 4.065 documents in total with 401.832 sentences containing 9.789.818 words. This text data led to over 221 million RDF triples in the tripe-store. I used a local database that was queried via a SPARQL endpoint. These numbers mentioned here can easily be extracted with SPARQL queries, for example to count the number of sentences we can use the SPARQL query:

SELECT (COUNT(?s) AS ?count) WHERE { ?s a naf-base:sentence . }

With this query all RDF subjects (the variable ?s) that are a sentence are counted and the result is stored in the variable ‘count’. The same can be done with other RDF subjects like word forms and documents.

The RDF representation described above allows you to store the content and annotations of a set of documents with their metadata in one single graph. You can then retrieve information from that graph from different perspectives and for different purposes.

Information retrieval

Suppose we want to find all references on the website with relations between ‘DNB’ and the verb ‘supervise’ by looking for sentences where ‘DNB’ is the nominal subject and ‘supervise’ is the lemma of the verb in the sentence. This is done with the following query

SELECT ?text
    ?term naf-base:hasLemma "supervise" .
	?term naf-rfunc:nsubj [naf-base:hasLemma "DNB" ] .
    ?term naf-base:hasSpan [ rdf:first ?wf ] .
    ?wf naf-base:isPartOf [ a naf-base:sentence ; naf-base:hasText ?text ].

It’s almost readable 🙂 The first line in the WHERE statement retrieves words that have ‘supervise’ as a lemma (this includes past, present and future tense and different verb forms). The second line narrows the selection down to where the nominal subject of the verb is ‘DNB’ (the lemma of the subject to be precise). The last two lines select the text of the sentences that includes the words that were found.

Execution of this query is done in a few milliseconds (on a desktop computer with a local database, nothing fancy) and results in 22 sentences, such as “DNB supervises adequate management of sustainability risks by financial institutions.”, “DNB supervises the cash payment system by providing information and guidance on the rules and procedures, data collection and examining compliance with the rules.”, and so on.

Term extraction

Terms are often multi-words and can be retrieved by part-of-speech tags and dependencies. Suppose we want to retrieve all two-words terms of the form adjective, common noun. Part-of-speech tags are defined in the terms layer. In the graph also the relation between the terms is defined, in this case by an adjectival modifier (amod) relation (the common noun is modified by an adjective). Then we can define a query that looks for exactly that: two words, an adjective and a common noun, where the mutual relationship is of an adjectival modifier. This is expressed in the first three lines in the WHERE-clause below. The last two lines retrieve the text of the words.

SELECT DISTINCT ?w1 ?w2 (count(*) as ?c)
    ?term1 naf-base:hasPos olia:CommonNoun .
    ?term2 naf-base:hasPos olia:Adjective .
    ?term1 naf-rfunc:amod ?term2 .
    ?term1 naf-base:hasSpan [ rdf:first/naf-base:hasText ?w1 ] .
    ?term2 naf-base:hasSpan [ rdf:first/naf-base:hasText ?w2 ] .
} GROUP BY ?w1 ?w2

Note that in the query a count of the number of occurrences of the term in the output and sort the output according to this count has been added.

Most often the term ‘monetary policy’ was found (2.348 times), followed by ‘financial institutions’ (1.734 times) and ‘financiële instellingen’ (Dutch translation of financial institution, 1.519 times), and so on. In total more than 127.000 of these patterns were found on the website (this is a more complicated query and took around 10 seconds). In this way all kinds of term patterns can be found, which can be collected in a termbase (terminology database).

Opinion extraction

I will give here a very simple example of opinion extraction based on part-of-speech tags. Suppose you want to extract sentences that contain the authors (or someone else’s) subjective opinion. You can look a the grammatical subject and the verb in a sentence, but you can also look at whether a sentence contains something like ‘too high’ or ‘too volatile’ (which often indicates a subjective content). In that case we have the word ‘too’ (an adverb) followed by an adjective, with mutual relation of adverbial modifier (advmod). In the Dutch language this has exactly the same form. The following query extracts these sentences.

SELECT ?text
    ?term1 naf-base:hasPos olia:Adjective .
    ?term2 naf-base:hasSpan [ rdf:first/naf-base:hasText "too" ] .
    ?term1 naf-rfunc:advmod ?term2 .
    ?term1 naf-base:hasSpan [ rdf:first ?wf1 ] .
    ?sent1 naf-base:hasSpan [ rdf:rest*/rdf:first ?wf1 ] .
    ?sent1 a naf-base:sentence .
    ?sent1 naf-base:hasText ?text .

With the last three lines the text of the sentence that includes the term is found (the output of the query). With the documents of the website of DNB, the output contains sentences like: “It is also clear that CO2 emissions are still too cheap and must be priced higher to sufficiently curtail emissions” and “Firms end up being too large” (in total 343 sentences in 0.3 seconds).

The examples shown here are just for illustrative purposes and do not always lead to accurate results, but they show that information extraction can be done fairly easy (if you know SPARQL) and reasonably quick. Once the data is stored into a graph database, named entities can be matched with other internal or external data sources and lemmas of terms can be matched with concept-based terminology databases. Then you have a graph where the text is not only available on a simple string-level but also, and more importantly, on a conceptual level.

The Solvency termbase for NLP

Published by:

This blog describes a way to construct a terminology database for the insurance supervision knowledge domain. The goal of this termbase is provide a reliable basis to extract insurance supervision terminology within different NLP analyses.

The terminology of solvency and insurance supervision forms an expert domain of terminology based on economics, mathematics, accounting and finance terminologies. Like probably many other knowledge domains the terminology used is very technical and specific. Some terms are only used within this domain with only a limited number of occurrences (which often hinders the use of statistical methods for finding terms). And many other words have general meanings outside the domain that do not coincide with the specific meanings within the domain. Translation of terms from this specific domain often requires extensive knowledge about the meaning and use of these terms.

What is a termbase?

A termbase is a database containing terminology and related information. It consists of concepts with their verbal designations (terms, i.e. single words or composed of multi-word strings) of a specific knowledge domain, often in different languages. It contains the full form of concepts, but also abbreviations, synonyms and variants and additional information of concepts, such as definitions and external references. To indicate the accuracy or completeness often a reliability code is added to individual terms of a concept. A proper termbase is an important terminology tool to achieve standardization of information and consistent use of (translations) of concepts in documents. And because of that, they are often used by professional translators.

The European Union translates legal documents in all member state languages and uses for this one common publicly available termbase: the IATE (Interactive Terminology for Europe) terminology database. The IATE termbase is used in the EU institutions and agencies since 2004 for the collection, dissemination and management of EU-specific terminology. This helps to avoid divergences in the application of European Law within Europe (there exists a vast amount of literature on the effects on language ambiguity in European legislation). The translations of European legislation are therefore of the highest quality with strong consistency between different directives, regulations, delegated and implementing acts and so on. This termbase is very useful for information extraction from documents and for linking terminology concepts between different documents. They can be extended with abbreviations, synonyms and common variants of terms.

Termbases is very useful for information extraction from documents and for linking terminology concepts between different documents. They can be extended with abbreviations, synonyms and common variants of terms.

The Solvency termbase for NLP

To create a first Solvency termbase for NLP purposes, I extracted terms from Solvency 2 Delegated Acts in a number of languages, looked up these terms in the IATE database and copied the corresponding concepts. It often happens that for one language the same term refers to different concepts (for example, the term ‘balance’ means something different in chemistry and in accounting). But if for one legal document the terms from different languages refer to the same concept, then we probably have the right concept (that was used in the translation of the legal document). So, the more references from the same legal document, the more reliable the term-concept relation is. And if we have the proper term-concept relationship, we automatically have all reliable translations of that concept.

Term extraction was done with part-of-speech patterns (such as adj-noun and adj-noun-noun patterns). To do this, for every language the Delegated Acts was converted to the NLP Annotation Format (NAF). The functionality for conversion to NAF and for extracting terms based on pos patterns is part of the nafigator package. As an NLP engine for nafigator, I used the Stanford Stanza package that contains tokenizers and part-of-speech models for every European language. The termbase itself was made with the terminator repository (currently under construction).

For terms in Dutch, I also added to the termbase additional part-of-speech tags, lemma’s and morphological properties from the Lassy Klein-corpus from the Instituut voor de Nederlandse taal (Dutch Language Institute). This data set consists of approximately 1 million words with manually verified syntactic annotations. I expanded this data set with solvency related words. Linguistical properties of terms of other languages can be added it a reliable data set is available.

Below, you see one concept from the resulting termbase (the concept of which ‘solvency capital requirement’ is the English term) in TermBase eXchange format (TBX). This is an international standard (ISO 30042:2019) for the representation of structured concept-oriented terminological data, based on xml.

<conceptEntry id="249">
 <descrip type="subjectField">insurance</descrip>
 <langSec xml:lang="nl">
   <termNote type="partOfSpeech">noun</termNote>
   <note>source: ../naf-data/data/legislation/Solvency II Delegated Acts - NL.txt (#hits=331)</note>
   <termNote type="termType">fullForm</termNote>
   <descrip type="reliabilityCode">9</descrip>
   <termNote type="lemma">solvabiliteits_kapitaalvereiste</termNote>
   <termNote type="grammaticalNumber">singular</termNote>
    <termNote type="component">solvabiliteits-</termNote>
    <termNote type="component">kapitaal-</termNote>
    <termNote type="component">vereiste</termNote>
 <langSec xml:lang="en">
   <termNote type="termType">abbreviation</termNote>
   <descrip type="reliabilityCode">9</descrip>
   <term>solvency capital requirement</term>
   <termNote type="termType">fullForm</termNote>
   <descrip type="reliabilityCode">9</descrip>
   <termNote type="partOfSpeech">noun, noun, noun</termNote>
   <note>source: ../naf-data/data/legislation/Solvency II Delegated Acts - EN.txt (#hits=266)</note>
 <langSec xml:lang="fr">
   <term>capital de solvabilité requis</term>
   <termNote type="termType">fullForm</termNote>
   <descrip type="reliabilityCode">9</descrip>
   <termNote type="partOfSpeech">noun, adp, noun, adj</termNote>
   <note>source: ../naf-data/data/legislation/Solvency II Delegated Acts - FR.txt (#hits=198)</note>
   <termNote type="termType">abbreviation</termNote>
   <descrip type="reliabilityCode">9</descrip>

You see that the concept contains a link to the IATE database entry with the definition of the concept (the link in this blog actually works so you can try it out). Then a number of language sections contain terms of this concept for different languages. The English section contains the term SCR as an English abbreviation of this concept (the French section contains the abbreviation CSR for the same concept). For every term the part-of-speech tags were added (which are not part of the IATE database) and, for Dutch only, with the lemma and grammatical number of the term and its word components. These additional linguistical attributes allow easier use within NLP analyses. Furthermore, as a note the number of all occurrences in the original legal document are included.

The concept entry contains related terms in all European languages. In Greek the SCR is κεφαλαιακή απαίτηση φερεγγυότητας, in Irish it is ‘ceanglas maidir le caipiteal sócmhainneachta’ (although the Solvency 2 Delegated Acts is not available in the Irish language), in Portuguese it is ‘requisito de capital de solvência’, in Estonian ‘solventsuskapitalinõue’, and so on. These are reliable translations as they are used in legal documents of that language.

The termbase contains all terms from the Solvency 2 Delegated Acts that can be found in the IATE database. In addition, terms that were not found in that database are added with the termNote “NewTerm”, to indicate that this term has yet to be reviewed by a knowledge domain expert. This would also be the way to add synonyms and variants of terms.

The Solvency termbase basically allows to scan for a Solvency 2 concept in a document in any of the 23 European languages (given that it is in the IATE database). This is of course an initial approach to construct a termbase to test whether it is feasible and practical. The terminology that insurance undertakings use in their solvency reports is very likely to differ from the one used in legal documents. I will be testing this with a number of documents to identify Solvency 2 terminology to get an idea of how many synonyms and variants are missing.

Besides this Solvency termbase, it is in the same way possible to construct a Climate termbase based on the European Climate Law (a European regulation from 2021). This law contains a large number of climate-related terminology and is available in all European languages. A Climate termbase gives the possibility to extract climate-related information from all kinds of documents. Furthermore, we have the Sustainable Finance Disclosure Regulation (a European regulation also from 2021) for environmental, social, and governance (ESG) terminology, which could provide a starting point for an ESG termbase. And of course I eagerly await the European Regulation on Artificial Intelligence.

Two new Python packages

Published by:

Here is a short update about two new Python packages I have been working on. The first is about structuring Natural Language Processing projects (a subject I have been working on a lot recently) and the second is about rule mining in arbitrary datasets (in my case supervisory quantitative reports).

nafigator package

This packages converts the content of (native) pdf documents, MS Word documents and html files into files that satisfy the NLP Annotation Format (NAF). It can use a default spaCy or Stanza pipeline or a custom made pipeline. The package creates a file in xml-format with the first basic NAF layers to which additional layers with annotations can be added. It is also possible to convert to RDF format (turtle-syntax and rdf-xml-syntax). This allows the use of this content in graph databases.

This is my approach for storing (intermediate) NLP results in a structured way. Conversion to NAF is done in only one computationally intensive step, and then the NAF files contain all necessary NLP data and can be processed very fast for further analyses. This allows you to use the results efficiently in downstream NLP projects. The NAF files also enable you to keep track of all annotation processes to make sure NLP results remain reproducible.

The NLP Annotation Format from VU University Amsterdam is a practical solution to store NLP annotations and relatively easy to implement. The main purpose of NAF is to provide a standardized format that, with a layered extensible structure, is flexible enough to be used in different NLP projects. The NAF standard has had a number of subsequent versions, and is still under development.

The idea of the format is that basically all NLP processing steps add annotations to a text. You start with the raw text (stored in the raw layer). This raw text is tokenized into pages, paragraphs, sentences and word forms and the result is stored in a text layer. Annotations to each word form (like lemmatized form, part-of-speech tag and morphological features) are stored in the terms layers. Each additional layer builds upon previous layers and adds more complex annotations to the text.

See for more information about the NLP Annotation Format and Nafigator on github.

ruleminer package

Earlier, I have already made some progress with mining datasets for rules and patterns (see for two earlier blogs on this subject here and here). New insights led to a complete new set-up of code that I have published as a Python package under the name ruleminer. The new package improves the data-patterns package in a number of ways:

  • The speed of the rule mining process is improved significantly for if-then rules (at least six times faster). All candidate rules are now internally evaluated as Numpy expressions.
  • Additional evaluation metrics have been added to allow new ways to assess how interesting newly mined rules are. Besides support and confidence, we now also have the metrics lift, added value, casual confidence and conviction. New metrics can be added relatively easy.
  • A rule pruning procedure has been added to delete, to some extent, rules that are semantically identical (for example, if A=B and B=A are mined then one of them is pruned from the list).
  • A proper Pyparsing Grammar for rule expressions is used.

Look for more information on ruleminer here on github.

Word2vec models for SFCRs

Published by:

Word2vec is a well-known algorithm for natural language processing that often leads to surprisingly good results, if trained properly. It consists of a shallow neural network that maps words to a n-dimensional number space, i.e. it produces vector representations of words (so-called word embeddings). Word2vec does this in a way that words used in the same context are embedded near to each other (their respective vectors are close to each other). In this blog I will show you some of the results of word2vec models trained with Wikipedia and insurance-related documents.

One of the nice properties of a word2vec model is that it allows us to do calculations with words. The distance between two word vectors provides a measure for linguistic or semantic similarity of the corresponding words. So if we calculate the nearest neighbors of the word vector then we find similar words of that word. It is also possible to calculate vector differences between two word vectors. For example, it appears that for word2vec model trained with a large data set, the vector difference between man and woman is roughly equal to the difference between king and queen, or in vector notation kingman + woman = queen. If you find this utterly strange then you are not alone. Besides some intuitive and informal explanations, it is not yet completely clear why word2vec models in general yield these results.

Word2vec models need to be trained with a large corpus of text data in order to achieve word embeddings that allow these kind of calculations. There are some large pre-trained word vectors available, such as the GloVe Twitter word vectors, trained with 2 billion tweets, and the word2vec based on google news (trained with 100 billion words). However, most of them are in the English language and are often trained on words that are generally used, and not domain specific.

So let’s see if we can train word2vec models specifically for non-English European languages and trained with specific insurance vocabulary. A way to do this is to train a word2vec model with Wikipedia pages of a specific language and additionally train the model with sentences we found in public documents of insurance undertakings (SFCRs) and in the insurance legislation. In doing so the word2vec model should be able to capture the specific language domain of insurance.

The Wikipedia word2vec model

Data dumps of all Wikimedia wikis, in the form of a XML-files, are provided here. I obtained the latest Wikipedia pages and articles of all official European languages (bg, es, cs, da, de, et, el, en, fr, hr, it, lv, lt, hu, mt, nl, pl pt, ro, sk, sl, fi, sv). These are compressed files and their size range from 8.6 MB (Maltese) to 16.9 GB (English). The Dutch file is around 1.5 GB. These files are bz2-compressed; the uncompressed Dutch file is about 5 times the compressed size and contains more than 2.5 million Wikipedia pages. This is too large to store into memory (at least on my computer), so you need to use Python generator-functions to process the files without the need to store them completely into memory.

The downloaded XML-files are parsed and page titles and texts are then processed with the nltk-package (stop words are deleted and sentences are tokenized and preprocessed). No n-grams were applied. For the word2vec model I used the implementation in the gensim-package.

Let’s look at some results of the resulting Wikipedia word2vec models. If we get the top ten nearest word vectors of the Dutch word for elephant then we get:

In []: model.wv.most_similar('olifant', topn = 10)
Out[]: [('olifanten', 0.704888105392456),
        ('neushoorn', 0.6430075168609619),
        ('tijger', 0.6399451494216919),
        ('luipaard', 0.6376790404319763),
        ('nijlpaard', 0.6358680725097656),
        ('kameel', 0.5886276960372925),
        ('neushoorns', 0.5880545377731323),
        ('ezel', 0.5879943370819092),
        ('giraf', 0.5807977914810181),
        ('struisvogel', 0.5724758505821228)]

These are all general Dutch names for (wild) animals. So, the Dutch word2vec model appears to map animal names in the same area of the vector space. The word2vec models of other languages appear to do the same, for example norsut (Finnish for elephant) has the following top ten similar words: krokotiilit, sarvikuonot, käärmeet, virtahevot, apinat, hylkeet, hyeenat, kilpikonnat, jänikset and merileijonat. Again, these are all names for animals (with a slight preference for Nordic sea animals).

In the Danish word2vec model, the top 10 most similar words for mads (in Danish a first name derived from matthew) are:

In []: model.wv.most_similar('mads', topn = 10)
Out[]: [('mikkel', 0.6680521965026855),
        ('nicolaj', 0.6564826965332031),
        ('kasper', 0.6114416122436523),
        ('mathias', 0.6102851033210754),
        ('rasmus', 0.6025335788726807),
        ('theis', 0.6013824343681335),
        ('rikke', 0.5957099199295044),
        ('janni', 0.5956574082374573),
        ('refslund', 0.5891965627670288),
        ('kristoffer', 0.5842193365097046)]

Almost all are first names except for Refslund, a former Danish chef whose first name was Mads. The Danish word2vec model appears to map first names in the same domain in the vector space, resulting is a high similarity between first names.

Re-training the Wikipedia Word2vec with SFCRs

The second step is to train the word2vec models with the insurance related text documents. Although the Wikipedia pages for many languages contain some pages on insurance and insurance undertakings, it is difficult to derive the specific language of this domain from these pages. For example the Dutch word for risk margin does not occur in the Dutch Wikipedia pages, and the same holds for many other technical terms. In addition to the Wikipedia pages, we should therefore train the model with insurance specific documents. For this I used the public Solvency and Financial Condition Reports (SFCRs) of Dutch insurance undertakings and the Dutch text of the Solvency II Delegated Acts (here is how to download and read it).

The SFCR sentences are processed in the same manner as the Wikipedia pages, although here I applied bi- and trigrams to be able to distinguish insurance terms rather than separate words (for example technical provisions is a bigram and treated as one word, technical_provisions).

Now the model is able to derive similar words to the Dutch word for risk margin.

In []: model.wv.most_similar('risicomarge')
Out[]: [('beste_schatting', 0.43119704723358154),
        ('technische_voorziening', 0.42812830209732056),
        ('technische_voorzieningen', 0.4108312726020813),
        ('inproduct', 0.409644216299057),
        ('heffingskorting', 0.4008549451828003),
        ('voorziening', 0.3887258470058441),
        ('best_estimate', 0.3886040449142456),
        ('contant_maken', 0.37772029638290405),
        ('optelling', 0.3554660379886627),
        ('brutowinst', 0.3554105758666992)]

This already looks nice. Closest to risk margin is the Dutch term beste_schatting (English: best estimate) and technische_voorziening(en) (English: technical provision, singular and plural). The relation to heffingskorting is strange here. Perhaps the word risk margin is not solely being used in insurance.

Let’s do another one. The acronym skv is the same as scr (solvency capital requirement) in English.

In []: model.wv.most_similar('skv')
Out[]: [('mkv', 0.6492390036582947),
        ('mcr_ratio', 0.4787723124027252),
        ('kapitaalseis', 0.46219778060913086),
        ('mcr', 0.440476655960083),
        ('bscr', 0.4224048852920532),
        ('scr_ratio', 0.41769397258758545),
        ('ðhail', 0.41652536392211914),
        ('solvency_capital', 0.4136047661304474),
        ('mcr_scr', 0.40923237800598145),
        ('solvabiliteits', 0.406883180141449)]

The SFCR documents were sufficient to derive an association between skv and mkv (English equivalent of mcr), and the English acronyms scr and mcr (apparently the Dutch documents sometimes use scr and mcr in the same context). Other similar words are kapitaalseis (English: capital requirement) and bscr. Because they learn from context, the word2vec models are able to learn words that are synonyms and sometimes antonyms (for example we say ‘today is a cold day’ and ‘today is a hot day’, where hot and cold are used in the same manner).

For an example of a vector calculation look at the following result.

In []: model.wv.most_similar(positive = ['dnb', 'duitsland'], 
                             negative = ['nederland'], topn = 5)
Out[]: [('bundesbank', 0.4988047778606415),
        ('bundestag', 0.4865422248840332),
        ('simplesearch', 0.452720582485199),
        ('deutsche', 0.437085896730423),
        ('bondsdag', 0.43249475955963135)]

This function finds the top five similar words of the vector DNBNederland + Duitsland. This expression basically asks for the German equivalent of De Nederlandsche Bank (DNB). The model generates the correct answer: the German analogy of DNB as a central bank is the Bundesbank. I think this is somehow incorporated in the Wikipedia pages, because the German equivalent of DNB as a insurance supervisor is not the Bundesbank but Bafin, and this was not picked up by the model. It is not perfect (the other words in the list are less related and for other countries this does not work as well). We need more documents to find more stable associations. But this to me is already pretty surprising.

There has been some research where the word vectors of word2vec models of two languages were mapped onto each other with a linear transformation (see for example Exploiting Similarities among Languages for Machine Translation, Mikolov, et al). In doing so, it was possible to obtain a model for machine translation. So perhaps it is possible for some European languages with a sufficiently large corpus of SFCRs to generate one large model that is to some extent language independent. To derive the translation matrices we could use the different translations of European legislative texts because in their nature these texts provide one of the most reliable translations available.

But that’s it for me for now. Word2vec is a versatile and powerful algorithm that can be used in numerous natural language applications. It is relatively easy to generate these models in other languages than the English language and it is possible to train these models that can deal with the specifics of insurance terminology, as I showed in this blog.

Text modeling with S2 SFCRs

Published by:

European insurance undertakings are required to publish each year a Solvency and Financial Condition Report (SFCR). These SFCRs are often made available via the insurance undertaking’s website. In this blog I will show some first results of a text modeling exercise using these SFCRs.

Text modeling was done with Latent Dirichlet Allocation (LDA) with the Mallet’s implementation, via the gensim-package (found here: A description you can find here: LDA is an unsupervised learning algorithm that generates latent (hidden) distributions over topics for each document or sentence and a distribution over words for each topic.

To get the data I scraped as many SFCRs (in all European languages) as I could find on the Internet. As a result of this I have a data set of 4.36 GB with around 2,500 SFCR documents in PDF-format (until proven otherwise, I probably have the largest library of SFCR documents in Europe). Among these are 395 SFCRs in the English language, consisting in total of 287,579 sentences and 8.1 million words.

In a SFCR an insurance undertaking publicly discloses information about a number of topics prescribed by the Solvency II legislation, such as its business and performance, system of governance, risk profile, valuation and capital management. Every SFCR therefore contains the same topics.

The LDA algorithm is able to find dominant keywords that represents each topic, given a set of documents. It is assumed that each document is about one topic. We want to use the LDA algorithm to identify the different topics within the SFCRs, such that, for example, we can extracts all sentences about the solvency requirements. To do so, I will run the LDA algorithm with sentences from the SFCRs (and thereby assuming that each sentence is about one topic).

I followed the usual steps; some data preparation to read the pdf files properly. Then I selected the top 9.000 words and I selected the sentences with more than 10 words (it is known that the LDA algorithm does not work very good with words that are not often used and very short documents/sentences). I did not built bigram and trigram models because this did not really change the outcome. Then the data was lemmatized such that only nouns, adjectives, verbs and adverbs were selected. The spacy-package provides functions to tag the data and select the allowed postags.

The main inputs for the LDA algorithm is a dictionary and a corpus. The dictionary contains all lemmatized words used in the documents with a unique id for each word. The corpus is a mapping of word id to word frequency in each sentence. After we generated these, we can run the LDA algorithm with the number of topics as one of the parameters.

The quality of the topic modeling is measured by the coherence score.

The coherence score per number of topics

Therefore, a low number of topics that performs well would be nine topics (0.65) and the highest coherence score is attained at 22 topics (0.67), which is pretty high in general. From this we conclude that nine topics would be a good start.

What does the LDA algorithm produce? It generates for each topic a list of keywords with weights that represent that topic. The weights indicate how strong the relation between the keyword and the topic is: the higher the weight the more representative the word is for that specific topic. Below the first ten keywords are listed with their weights. The algorithm does not classify the topic with one or two words, so per topic I determined a description that more or less covers the topic (with the main subjects of the Solvency II legislation in mind).

Topic 0 ‘Governance’: 0.057*”management” + 0.051*”board” + 0.049*”function” + 0.046*”internal” + 0.038*”committee” + 0.035*”audit” + 0.034*”control” + 0.032*”system” + 0.030*”compliance” + 0.025*”director”

Topic 1 ‘Valuation’: 0.067*”asset” + 0.054*”investment” + 0.036*”liability” + 0.030*”valuation” + 0.024*”cash” + 0.022*”balance” + 0.020*”tax” + 0.019*”cost” + 0.017*”account” + 0.016*”difference”

Topic 2 ‘Reporting and performance’: 0.083*”report” + 0.077*”solvency” + 0.077*”financial” + 0.038*”condition” + 0.032*”information” + 0.026*”performance” + 0.026*”group” + 0.025*”material” + 0.021*”december” + 0.018*”company”

Topic 3 ‘Solvency’: 0.092*”capital” + 0.059*”requirement” + 0.049*”solvency” + 0.039*”year” + 0.032*”scr” + 0.030*”fund” + 0.027*”model” + 0.024*”standard” + 0.021*”result” + 0.018*”base”

Topic 4 ‘Claims and assumptions’: 0.023*”claim” + 0.021*”term” + 0.019*”business” + 0.016*”assumption” + 0.016*”market” + 0.015*”future” + 0.014*”base” + 0.014*”product” + 0.013*”make” + 0.012*”increase”

Topic 5 ‘Undertaking’s strategy’: 0.039*”policy” + 0.031*”process” + 0.031*”business” + 0.030*”company” + 0.025*”ensure” + 0.022*”management” + 0.017*”plan” + 0.015*”manage” + 0.015*”strategy” + 0.015*”orsa”

Topic 6 ‘Risk management’: 0.325*”risk” + 0.030*”market” + 0.027*”rate” + 0.024*”change” + 0.022*”operational” + 0.021*”underwriting” + 0.019*”credit” + 0.019*”exposure” + 0.013*”interest” + 0.013*”liquidity”

Topic 7 ‘Insurance and technical provisions’: 0.049*”insurance” + 0.045*”reinsurance” + 0.043*”provision” + 0.039*”life” + 0.034*”technical” + 0.029*”total” + 0.025*”premium” + 0.023*”fund” + 0.020*”gross” + 0.019*”estimate”

Topic 8 ‘Undertaking’: 0.065*”company” + 0.063*”group” + 0.029*”insurance” + 0.029*”method” + 0.023*”limit” + 0.022*”include” + 0.017*”service” + 0.016*”limited” + 0.015*”specific” + 0.013*”mutual”

To determine the topic of a sentences we calculate for each topic the weight of the words in the sentences. The main topic of the sentence is then expected to be the topic with the highest sum.

If we run the following sentence (found in one of the SFCRs) through the model

"For the purposes of solvency, the Insurance Group’s insurance obligations
are divided into the following business segments: 1. Insurance with profit
participation 2. Unit-linked and index-linked insurance 3. Other life
insurance 4. Health insurance 5. Medical expence insurance for non-life
insurance 6. Income protection insurance for non-life insurance Pension &
Försäkring (Sweden) Pension & Försäkring offers insurance solutions on the
Swedish market within risk and unit-linked insurance and traditional life

then we get the following results per topic:

[(0, 0.08960573476702509), 
(1, 0.0692951015531661),
(2, 0.0692951015531661),
(3, 0.06332138590203108),
(4, 0.08363201911589009),
(5, 0.0692951015531661),
(6, 0.08004778972520908),
(7, 0.3369175627240143),
(8, 0.13859020310633216)]

Topic seven (‘Insurance and technical provisions’) has clearly the highest score 0.34 , followed by topic eight (‘Undertaking’). This suggests that these sentences are about the insurances and technical provisions of the undertaking (that we can verify).

Likewise, for the sentence

"Chief Risk Officer and Risk Function 
The Board has appointed a Chief Risk Officer (CRO) who reports directly to
the Board and has responsibility for managing the risk function and
monitoring the effectiveness of the risk management system."

we get the following results:

[(0, 0.2926447574334898), 
(1, 0.08294209702660407),
(2, 0.07824726134585289),
(3, 0.07824726134585289),
(4, 0.07824726134585289),
(5, 0.08450704225352113),
(6, 0.14866979655712048),
(7, 0.07824726134585289),
(8, 0.07824726134585289)]

Therefore, topic zero (‘Governance’) and topic six (‘Risk management’) have the highest score and this suggests that this sentence is about the governance of the insurance undertaking and to a lesser extent risk management.

The nine topics that were identified reflect fairly different elements in the SFCR, but we also see that some topics consist of several subtopics that could be identified separately. For example, the topic that I described as ‘Valuation’ covers assets and investments but it might be more appropriate to distinguish investment strategies from valuation. The topic ‘Solvency’ covers own funds as well as solvency requirements. If we increase the number of topics then some of the above topics will be split into more topics and the topic determination will be more accurate.

Once we have made the LDA model we can use the results for several applications. First, of course, we can use the model to determine the topics of previously unseen documents and sentences. We can also analyze topic distributions across different SFCRs, we can get similar sentences for any given sentence (based on the distance of the probability scores of the given sentence to other sentences).

In this blog I described first steps in text modeling of Solvency and Financial Condition Reports of insurance undertakings. The coherence scores were fairly high and the identified topics represented genuine topics from the Solvency II legislation, especially with a sufficient number of topics. Some examples showed that the LDA model is able to identify the topic of specific sentences. However, this does not yet work perfectly; an important element of SFCR documents are the numerical information often stored in table form in the PDF. These are difficult to analyze with the LDA algorithm.

How to download and read the Solvency 2 legislation

Published by:

In our first Natural Language Processing project we will read the Solvency II legislation from the website of the European Union and extract the text within the articles by using regular expressions.

For this notebook, we have chosen the text of the Delegated Acts of Solvency II. This part of the Solvency II regulation is directly into force (because it is a Regulation) and the wording of the Delegated Acts is more detailed than the Solvency II Directive and very precise and internally consistent. This makes it suitable for NLP. From the text we are able to extract features and text data on Solvency II for our future projects.

The code of this notebook can be found in here

Step 1: data Retrieval

We use several packages to read and process the pdfs. For reading we use the fitz-package. Furthermore we need the re-package (regular expressions) for cleaning the text data.

import os
import re
import requests
import fitz

We want to read the Delegated Acts in all available languages. The languages of the European Union are Bulgarian (BG), Spanish (ES), Czech (CS), Danish (DA), German (DE), Estonian (ET), Greek (EL), English (EN), French (FR), Croatian (HR), Italian (IT), Latvian (LV), Lithuanian (LT), Hungarian (HU), Maltese (MT), Dutch (NL), Polish (PL), Portuguese (PT), Romanian (RO), Slovak (SK), Solvenian (SL), Finnish (FI), Swedish (SV).

languages = ['BG','ES','CS','DA','DE','ET','EL',

The urls of the Delegated Acts of Solvency 2 are constructed for these languages by the following list comprehension.

urls = ['' + lang +
        for lang in  languages]

The following for loop retrieves the pdfs of the Delegated Acts from the website of the European Union and stores them in da_path.

da_path = 'data/solvency ii/'
for index in range(len(urls)):
    filename = 'Solvency II Delegated Acts - ' + languages[index] + '.pdf'
    if not(os.path.isfile(da_path + filename)):
        r = requests.get(urls[index])
        f = open(da_path + filename,'wb+')
        print("--> already read.")

Step 2: data cleaning

If you look at the pdfs then you see that each page has a header with page number and information about the legislation and the language. These headers must be deleted to access the articles in the text.

DA_dict = dict({
                'BG': 'Официален вестник на Европейския съюз',
                'CS': 'Úřední věstník Evropské unie',
                'DA': 'Den Europæiske Unions Tidende',
                'DE': 'Amtsblatt der Europäischen Union',
                'EL': 'Επίσημη Εφημερίδα της Ευρωπαϊκής Ένωσης',
                'EN': 'Official Journal of the European Union',
                'ES': 'Diario Oficial de la Unión Europea',
                'ET': 'Euroopa Liidu Teataja',           
                'FI': 'Euroopan unionin virallinen lehti',
                'FR': "Journal officiel de l'Union européenne",
                'HR': 'Službeni list Europske unije',         
                'HU': 'Az Európai Unió Hivatalos Lapja',      
                'IT': "Gazzetta ufficiale dell'Unione europea",
                'LT': 'Europos Sąjungos oficialusis leidinys',
                'LV': 'Eiropas Savienības Oficiālais Vēstnesis',
                'MT': 'Il-Ġurnal Uffiċjali tal-Unjoni Ewropea',
                'NL': 'Publicatieblad van de Europese Unie',  
                'PL': 'Dziennik Urzędowy Unii Europejskiej',  
                'PT': 'Jornal Oficial da União Europeia',     
                'RO': 'Jurnalul Oficial al Uniunii Europene', 
                'SK': 'Úradný vestník Európskej únie',        
                'SL': 'Uradni list Evropske unije',            
                'SV': 'Europeiska unionens officiella tidning'})

The following code reads the pdfs, deletes the headers from all pages and saves the clean text to a .txt file.

DA = dict()
files = [f for f in os.listdir(da_path) if os.path.isfile(os.path.join(da_path, f))]    
for language in languages:
    if not("Delegated_Acts_" + language + ".txt" in files):
        # reading pages from pdf file
        da_pdf = + 'Solvency II Delegated Acts - ' + language + '.pdf')
        da_pages = [page.getText(output = "text") for page in da_pdf]
        # deleting page headers
        header = "17.1.2015\\s+L\\s+\\d+/\\d+\\s+" + DA_dict[language].replace(' ','\\s+') + "\\s+" + language + "\\s+"
        da_pages = [re.sub(header, '', page) for page in da_pages]
        DA[language] = ''.join(da_pages)
        # some preliminary cleaning -> could be more 
        DA[language] = DA[language].replace('\xad ', '')
        # saving txt file
        da_txt = open(da_path + "Delegated_Acts_" + language + ".txt", "wb")
        # loading txt file
        da_txt = open(da_path + "Delegated_Acts_" + language + ".txt", "rb")
        DA[language] ='utf-8')

Step 3: retrieve the text within articles

Retrieving the text within articles is not straightforward. In English we have ‘Article 1 some text’, i.e. de word Article is put before the number. But some European languages put the word after the number and there are two languages, HU and LV, that put a dot between the number and the article. To be able to read the text within the articles we need to know this ordering (and we need of course the word for article in every language).

art_dict = dict({
                'BG': ['Член',      'pre'],
                'CS': ['Článek',    'pre'],
                'DA': ['Artikel',   'pre'],
                'DE': ['Artikel',   'pre'],
                'EL': ['Άρθρο',     'pre'],
                'EN': ['Article',   'pre'],
                'ES': ['Artículo',  'pre'],
                'ET': ['Artikkel',  'pre'],
                'FI': ['artikla',   'post'],
                'FR': ['Article',   'pre'],
                'HR': ['Članak',    'pre'],
                'HU': ['cikk',      'postdot'],
                'IT': ['Articolo',  'pre'],
                'LT': ['straipsnis','post'],
                'LV': ['pants',     'postdot'],
                'MT': ['Artikolu',  'pre'],
                'NL': ['Artikel',   'pre'],
                'PL': ['Artykuł',   'pre'],
                'PT': ['Artigo',    'pre'],
                'RO': ['Articolul', 'pre'],
                'SK': ['Článok',    'pre'],
                'SL': ['Člen',      'pre'],
                'SV': ['Artikel',   'pre']})

Next we can define a regex to select the text within an article.

def retrieve_article(language, article_num):

    method = art_dict[language][1]
    if method == 'pre':
        string = art_dict[language][0] + ' ' + str(article_num) + '(.*?)' + art_dict[language][0] + ' ' + str(article_num + 1)
    elif method == 'post':
        string = str(article_num) + ' ' + art_dict[language][0] + '(.*?)' + str(article_num + 1) + ' ' + art_dict[language][0]
    elif method == 'postdot':
        string = str(article_num) + '. ' + art_dict[language][0] + '(.*?)' + str(article_num + 1) + '. ' + art_dict[language][0]

    r = re.compile(string, re.DOTALL)
    result = ' '.join([language])[1].split())
    return result

Now we have a function that can retrieve the text of all the articles in the Delegated Acts for each European language.

Now we are able to read the text of the articles from the Delegated Acts. In the following we give three examples (article 292 with states the summary of the Solvency and Financial Conditions Report).

retrieve_article('EN', 292)
"Summary 1. The solvency and financial condition report shall include a clear and concise summary. The summary of the report
shall be understandable to policy holders and beneficiaries. 2. The
summary of the report shall highlight any material changes to the 
insurance or reinsurance undertaking's business and performance, 
system of governance, risk profile, valuation for solvency purposes 
and capital management over the reporting period."
retrieve_article('DE', 292)
'Zusammenfassung 1. Der Bericht über Solvabilität und Finanzlage 
enthält eine klare, knappe Zusammenfassung. Die Zusammenfassung des
Berichts ist für Versicherungsnehmer und Anspruchsberechtigte
verständlich. 2. In der Zusammenfassung werden etwaige wesentliche
Änderungen in Bezug auf Geschäftstätigkeit und Leistung des
Versicherungs- oder Rückversicherungsunternehmens, sein 
Governance-System, sein Risikoprofil, die Bewertung für 
Solvabilitätszwecke und das Kapitalmanagement im Berichtszeitraum 
retrieve_article('EL', 292)
'Περίληψη 1. Η έκθεση φερεγγυότητας και χρηματοοικονομικής
κατάστασης περιλαμβάνει σαφή και σύντομη περίληψη. Η περίληψη της
έκθεσης πρέπει να είναι κατανοητή από τους αντισυμβαλλομένους και
τους δικαιούχους. 2. Η περίληψη της έκθεσης επισημαίνει τυχόν
ουσιώδεις αλλαγές όσον αφορά τη δραστηριότητα και τις επιδόσεις της
ασφαλιστικής και αντασφαλιστικής επιχείρησης, το σύστημα
διακυβέρνησης, το προφίλ κινδύνου, την εκτίμηση της αξίας για τους
σκοπούς φερεγγυότητας και τη διαχείριση κεφαλαίου κατά την περίοδο