Computational Classics: Finding errors in annotated ancient Greek texts with association rules mining

This blog describes some experiments with ruleminer for finding morphological patterns in annotated data of ancient Greek texts. Ruleminer is a Python package for association rules mining, a rules-based machine learning method for discovering patterns in large data sets. The regex and dataframe approach in ruleminer (set out in this article) is used to enable a controlled search in the data set. Previously, I have used ruleminer mainly for quantitative data, but it might be worth investigating whether it is applicable to annotated (NLP) text data.

Finding annotation errors

The idea is to extract morphological patterns from annotated data and use these patterns to detect annotation errors made by the NLP parser that produced the annotations. Morphological patterns are recurrent relations between word forms and features, such as part of speech, tense, mood, case, number and gender. These recurrent relations can be expressed as association rules, and mining algorithms can be used to find them. By looking at those situations where a pattern is not satisfied, it could be possible to identify errors made by the NLP parser. This is useful because many NLP pipelines use these annotations for subsequent analyses.

For many languages NLP parsers are available to annotate documents and determine lemmas and the morphological features of word forms within these documents. The performance of these models is often measured as the percentage of correct annotations against predefined treebanks, i.e. text corpora with annotations verified by linguists. Normally these models use deep learning algorithms and no model is yet able to achieve fully correct annotations; for many models these percentages lie around 95%. The Perseus model in Stanford Stanza for ancient Greek texts provides annotations with the following scores: universal part-of-speech tags (92.4%), treebank-specific part-of-speech tags (85.0%), treebank-specific morphological features (91.0%), and lemmas (88.3%).

Preprocessing steps

As a basis for ancient Greek data I took a number of dialogues of Plato (Apology, Crito, Euthydemos, Euthyphron, Gorgias, Laws, Phaedon, Phaedrus, Republic, Symposium and Timaios). The text documents were annotated with the Perseus model (the model was originally not trained with this data) and the result was converted to the NLP Interchange Format (NIF 2.0) in RDF with OLiA annotations (all done with the nafigator package using the Stanford Stanza pipeline).

The data set consists of 312,955 annotated word forms. In order to apply ruleminer, all words were extracted from the RDF-graph and stored in a Pandas DataFrame. Each word is a row in the DataFrame with the original text of the word (nif:anchorOf), the lemma of the word derived from the Perseus model (nif:lemma) and, for each feature (57 in total), whether or not the word is annotated with this feature (columns start with olia:, for example olia:Adverb).

The only changes to the original text were the deletion of diacritic signs acute and grave (like ά and ὰ) because sometimes the place of these signs changes when suffixes are added or deleted, which makes it harder to find patterns. All other diacritic signs were unchanged.
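The accent removal described above can be sketched with Python's standard `unicodedata` module. This is a minimal illustration under the assumption that only the combining acute (U+0301) and grave (U+0300) marks are dropped; the function name `strip_acute_and_grave` is made up for this example and is not taken from the notebook.

```python
import unicodedata

# Combining marks to drop: grave (U+0300) and acute (U+0301).
ACUTE_AND_GRAVE = {"\u0300", "\u0301"}


def strip_acute_and_grave(text: str) -> str:
    """Remove acute and grave accents and keep all other diacritics."""
    # NFD splits precomposed letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in ACUTE_AND_GRAVE)
    # NFC recomposes what is left (breathings, circumflex, iota subscript).
    return unicodedata.normalize("NFC", kept)


print(strip_acute_and_grave("γάρ"))   # the acute accent is removed
print(strip_acute_and_grave("ἡμῖν"))  # breathing and circumflex are kept
```

Normalizing to NFD first means the same code handles both the monotonic (tonos) and polytonic (oxia) encodings of an accented vowel.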

This notebook contains all examples mentioned below (and many more examples for a range of nominal forms).

Deriving morphological patterns in ancient Greek

In what follows I will show some examples of rules that can be found in the annotated text. To check whether word forms have the annotated features and whether they have multiple meanings (with different features) I used the Perseus Greek Word Study Tool.

Particle patterns

Particles in ancient Greek are short word forms that are never inflected, i.e. the form or ending does not change. This makes them ideal for finding morphological patterns. First we look at the morphological features. Let’s run the following expression to identify the OLiA annotations of the word γαρ:

if (({"nif:anchorOf"}=="γαρ")) then ({"olia:.*"}==1)

The olia:.* in this expression means all columns that start with olia:, i.e. OLiA annotations except the lemma and the original text. Ruleminer will match all these columns and evaluate the metrics of the resulting candidate rule. If the candidate rule satisfies predefined constraints (here a minimum confidence of 90% was used) it will be added to the resulting rules:

| rule_definition | abs support | abs exceptions | confidence |
|---|---|---|---|
| if ({"nif:anchorOf"}=="γαρ") then ({"olia:Adverb"}==1) | 2165 | 42 | 0.98097 |

Results of the expression: if (({"nif:anchorOf"}=="γαρ")) then ({"olia:.*"}==1)
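The metrics in the table can be recomputed by hand: confidence is the support divided by the sum of support and exceptions, here 2165 / (2165 + 42) ≈ 0.98097. The sketch below illustrates that calculation in plain Python on a few invented rows. It is not ruleminer's implementation; the function names `rule_metrics` and `mine` and the toy data are assumptions made for this illustration.

```python
# Toy rows standing in for the real data set (values are invented).
rows = [
    {"nif:anchorOf": "γαρ", "olia:Adverb": 1, "olia:Noun": 0},
    {"nif:anchorOf": "γαρ", "olia:Adverb": 1, "olia:Noun": 0},
    {"nif:anchorOf": "γαρ", "olia:Adverb": 0, "olia:Noun": 1},
    {"nif:anchorOf": "και", "olia:Adverb": 0, "olia:Noun": 0},
]


def rule_metrics(rows, word, feature):
    """Metrics for the rule: if anchorOf == word then feature == 1."""
    matching = [r for r in rows if r["nif:anchorOf"] == word]
    support = sum(1 for r in matching if r[feature] == 1)
    exceptions = len(matching) - support
    confidence = support / len(matching) if matching else 0.0
    return support, exceptions, confidence


def mine(rows, word, min_confidence=0.9):
    """Try every olia: column and keep the rules that reach the threshold."""
    features = sorted({k for r in rows for k in r if k.startswith("olia:")})
    rules = {}
    for feature in features:
        support, exceptions, confidence = rule_metrics(rows, word, feature)
        if confidence >= min_confidence:
            rules[feature] = (support, exceptions, confidence)
    return rules


# With only three toy rows for γαρ a lower threshold is needed to keep a rule.
print(mine(rows, "γαρ", min_confidence=0.6))
```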

This produces one rule which states that the word γαρ is identified correctly as an adverb (the Perseus model maps particles to adverbs or conjunctions). This rule has a confidence of just over 98% (in 2,165 cases the word γαρ is annotated as an adverb). There are 42 exceptions, meaning that the word was not annotated as an adverb. These exceptions might point to situations where a word has different features (and meanings) depending on the context. In this case it is strange because γαρ has only one other meaning: the noun γάρος, which does not occur in Plato’s work. I therefore expect that these are all annotation errors by the Perseus model. We can check this by looking at the associations between word forms and their lemmas with the following rule.

if (({"nif:anchorOf"}=="γαρ")) then ({"nif:lemma"}==".*")

Here are the results:

| rule_definition | abs support | abs exceptions | confidence |
|---|---|---|---|
| if ({"nif:anchorOf"}=="γαρ") then ({"nif:lemma"}=="γαρ") | 2187 | 20 | 0.990938 |
| if ({"nif:anchorOf"}=="γαρ") then ({"nif:lemma"}=="γιγνομαι") | 9 | 2198 | 0.004078 |
| if ({"nif:anchorOf"}=="γαρ") then ({"nif:lemma"}=="ἐγώ") | 6 | 2201 | 0.002719 |
| if ({"nif:anchorOf"}=="γαρ") then ({"nif:lemma"}=="γιρ") | 4 | 2203 | 0.001812 |
| if ({"nif:anchorOf"}=="γαρ") then ({"nif:lemma"}=="γιπτω") | 1 | 2206 | 0.000453 |

Results of the expression: if (({"nif:anchorOf"}=="γαρ")) then ({"nif:lemma"}==".*")

Here the first rule in the table is the only one that is correct. The others have very low confidence and are obvious errors by the Perseus model: γιρ and γιπτω are nonexistent words, and ἐγώ and γιγνομαι are not the lemmas of γαρ. So these are incorrect annotations, and errors by the NLP parser.
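The same check can be written as a simple frequency count of lemmas per word form. The sketch below uses `collections.Counter` on invented (word form, lemma) pairs; the helper names and the `max_share` threshold are assumptions for illustration only, and a low share is merely a hint, since a rare lemma can also be a legitimate second meaning.

```python
from collections import Counter

# (word form, lemma) pairs; toy values standing in for the annotated data.
annotations = [
    ("γαρ", "γαρ"),
    ("γαρ", "γαρ"),
    ("γαρ", "γαρ"),
    ("γαρ", "γιρ"),
    ("μη", "μη"),
]


def lemma_shares(annotations, word):
    """Return (lemma, count, share) for one word form, most frequent first."""
    counts = Counter(lemma for form, lemma in annotations if form == word)
    total = sum(counts.values())
    return [(lemma, n, n / total) for lemma, n in counts.most_common()]


def suspicious_lemmas(annotations, word, max_share=0.05):
    """Lemmas whose share for this word form falls below max_share."""
    return [
        lemma
        for lemma, _, share in lemma_shares(annotations, word)
        if share < max_share
    ]


print(lemma_shares(annotations, "γαρ"))
print(suspicious_lemmas(annotations, "γαρ", max_share=0.3))
```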

Next we consider the particles οὐ, οὐκ, οὐχ (negating particles) and μη (a particle indicating privation). In this case for each word two annotations are found, the olia:Adverb and the olia:Negation.

| rule_definition | abs support | abs exceptions | confidence |
|---|---|---|---|
| if ({"nif:anchorOf"}=="μη") then ({"olia:Adverb"}==1) | 1753 | 59 | 0.967439 |
| if ({"nif:anchorOf"}=="μη") then ({"olia:Negation"}==1) | 1694 | 118 | 0.934879 |
| if ({"nif:anchorOf"}=="οὐ") then ({"olia:Adverb"}==1) | 1333 | 0 | 1.0000 |
| if ({"nif:anchorOf"}=="οὐ") then ({"olia:Negation"}==1) | 1332 | 1 | 0.9993 |
| if ({"nif:anchorOf"}=="οὐκ") then ({"olia:Adverb"}==1) | 1248 | 0 | 1.0000 |
| if ({"nif:anchorOf"}=="οὐκ") then ({"olia:Negation"}==1) | 1248 | 0 | 1.0000 |
| if ({"nif:anchorOf"}=="οὐχ") then ({"olia:Adverb"}==1) | 322 | 0 | 1.0000 |
| if ({"nif:anchorOf"}=="οὐχ") then ({"olia:Negation"}==1) | 322 | 0 | 1.0000 |

Results of the expression with negating particles

Most of the time the word μη is annotated as an adverb and as a negation. There are, however, a number of exceptions. Looking into this a bit further shows that the word is sometimes annotated as a subordinating conjunction, and sometimes the lemma is mistakenly set to μεμω or εἰμι, resulting in incorrect verb-related annotations. Here are the lemmas in case the word is not annotated as an adverb:

| rule_definition | abs support | abs exceptions | confidence |
|---|---|---|---|
| if ({"nif:anchorOf"}=="μη") then (({"olia:Adverb"}!=1)&({"nif:lemma"}=="μη")) | 42 | 1770 | 0.0232 |
| if ({"nif:anchorOf"}=="μη") then (({"olia:Adverb"}!=1)&({"nif:lemma"}=="μεμω")) | 16 | 1796 | 0.0088 |
| if ({"nif:anchorOf"}=="μη") then (({"olia:Adverb"}!=1)&({"nif:lemma"}=="εἰμι")) | 1 | 1811 | 0.0006 |

Pronoun patterns

The word forms of ancient Greek pronouns depend on case and grammatical number. In most cases a personal pronoun does not have other meanings depending on the context, so this should lead to strong patterns. To run ruleminer with a list of expressions we can use the following code.

# personal pronouns
# first person, second person
pronouns = [
    'ἐγω', 'ἐμοῦ', 'ἐμοι', 'ἐμε', 
    'μου', 'μοι', 'με',
    'συ', 'σοῦ', 'σοι', 'σε', 
    'σου',
    'ἡμεῖς', 'ἡμῶν', 'ἡμῖν', 'ἡμᾶς',
    'ὑμεῖς', 'ὑμῶν', 'ὑμῖν', 'ὑμᾶς'
]
expressions = [
    'if (({"nif:anchorOf"}=="'+pn+'")) then ({"olia:Pronoun"}==1)'
    for pn in pronouns
]

Sorted by highest support, this gives the following result:

| rule_definition | abs support | abs exceptions | confidence |
|---|---|---|---|
| if ({"nif:anchorOf"}=="ἡμῖν") then ({"olia:Pronoun"}==1) | 671 | 0 | 1.0000 |
| if ({"nif:anchorOf"}=="μοι") then ({"olia:Pronoun"}==1) | 499 | 4 | 0.9920 |
| if ({"nif:anchorOf"}=="ἐγω") then ({"olia:Pronoun"}==1) | 383 | 36 | 0.9141 |
| if ({"nif:anchorOf"}=="σοι") then ({"olia:Pronoun"}==1) | 344 | 11 | 0.9690 |
| if ({"nif:anchorOf"}=="ἡμῶν") then ({"olia:Pronoun"}==1) | 230 | 0 | 1.0000 |
| if ({"nif:anchorOf"}=="συ") then ({"olia:Pronoun"}==1) | 229 | 27 | 0.8945 |
| if ({"nif:anchorOf"}=="ἡμᾶς") then ({"olia:Pronoun"}==1) | 208 | 0 | 1.0000 |

Results of pronoun patterns

Again this points to many errors in the Perseus model. Word forms like ἐγω, μοι and συ cannot be taken as anything other than a pronoun. However, the word σοι could also be a possessive adjective depending on the context.

Verb patterns

We now have seen some easy examples with straightforward rules. For verbs we need more complex rules, but this is still feasible with ruleminer.

In ancient Greek, if a verb is thematic, in present tense, indicative mood, plural and third person, then the form of that verb (if it is not contracted) is stem+ουσι(ν). To formulate a rule for this we want to keep the stem of the verb that was found in the antecedent (the if-part) and use it later on in the consequent of the rule (the then-part). This can be done by defining a regex group (by using parentheses) in the following way:

if (({"nif:anchorOf"}=="(\w+[^εαο])ουσιν?")) then (({"nif:lemma"}=="\1ω"))

The if-part of the rule is true if the column nif:anchorOf matches the regex (\w+[^εαο])ουσιν?. The first part of this regex (between parentheses) consists of one or more characters, the last of which is not ε, α or ο. This is the stem of the word and it is stored as a regex group (to be used in the consequent of the rule). The second part is ουσιν?, which is regex for either ουσι or ουσιν. The then-part of the rule is true if nif:lemma equals the stem of the word (in regex this is \1) plus ω.
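To see the capture group at work outside ruleminer, here is a small sketch with Python's `re` module. It assumes the pattern has to match the entire word form (hence `fullmatch`), which may differ from how ruleminer applies the regex internally; the helper names `expected_lemma` and `check` are made up for this example.

```python
import re

# Group 1 captures the stem; the final character of the stem may not be ε, α or ο.
PATTERN = re.compile(r"(\w+[^εαο])ουσιν?")


def expected_lemma(word_form):
    """Return stem + ω when the word form matches the pattern, else None."""
    match = PATTERN.fullmatch(word_form)
    if match is None:
        return None
    return match.group(1) + "ω"


def check(word_form, annotated_lemma):
    """True or False when the rule applies, None when the if-part does not match."""
    expected = expected_lemma(word_form)
    if expected is None:
        return None
    return expected == annotated_lemma


print(expected_lemma("μελλουσιν"))   # stem μελλ plus ω
print(expected_lemma("φιλεουσι"))    # stem would end in ε, so no match
print(check("τυγχανουσι", "τυγχανω"))
```

Note that `\w` only matches precomposed letters, so word forms should be in NFC form before matching.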

Here are the first five lines of the results (85 rules in total):

| rule_definition | abs support | abs exceptions | confidence |
|---|---|---|---|
| if ({"nif:anchorOf"}=="ἐχουσιν") then ({"nif:lemma"}=="ἐχω") | 31 | 0 | 1.0000 |
| if ({"nif:anchorOf"}=="ἐχουσι") then ({"nif:lemma"}=="ἐχω") | 20 | 1 | 0.9524 |
| if ({"nif:anchorOf"}=="μελλουσιν") then ({"nif:lemma"}=="μελλω") | 16 | 0 | 1.0000 |
| if ({"nif:anchorOf"}=="μελλουσι") then ({"nif:lemma"}=="μελλω") | 10 | 0 | 1.0000 |
| if ({"nif:anchorOf"}=="τυγχανουσιν") then ({"nif:lemma"}=="τυγχανω") | 7 | 0 | 1.0000 |

First five lines of the expression: if (({"nif:anchorOf"}=="(\w+[^εαο])ουσιν?")) then (({"nif:lemma"}=="\1ω"))

Aggregate text analysis

Let’s end these examples with an aggregate analysis of the data set of all word forms with lemmas and morphological features. To find out if there is a prevalence in the text with respect to certain morphological features let’s run the following simple rules:

if (({"olia:ProperNoun"}==1)) then ({"olia:Neuter"}==1)
if (({"olia:ProperNoun"}==1)) then ({"olia:Feminine"}==1)
if (({"olia:ProperNoun"}==1)) then ({"olia:Masculine"}==1)

These rules identify the grammatical gender of all the proper nouns in the text (word forms that start with a capital letter and name people, places, things, and ideas). Here are the results:

| rule_definition | abs support | confidence |
|---|---|---|
| if ({"olia:ProperNoun"}==1) then ({"olia:Masculine"}==1) | 3559 | 0.7928 |
| if ({"olia:ProperNoun"}==1) then ({"olia:Feminine"}==1) | 468 | 0.1043 |
| if ({"olia:ProperNoun"}==1) then ({"olia:Neuter"}==1) | 34 | 0.0076 |

Almost 80% of the proper nouns in the text have masculine gender, and just over 10% have feminine gender. Remember that this is derived from Plato’s dialogues, so no surprise there. Most protagonists in the dialogues, if not all, are male and related word forms are therefore masculine. I specifically looked at the feminine proper nouns with more than five occurrences: they are geographical locations like Δῆλος (most frequent, 48 times), Συράκουσαι (7 times), Αίγυπτος (6 times). It also appeared that a number of male protagonists were incorrectly given annotations with feminine gender (Σιμμιας, 10 times and Μελητος, 6 times). Furthermore, some word forms were mistakenly taken as proper nouns, and some proper nouns did not have an annotation for gender (that is why the percentages do not sum to 100%).
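The three confidences reported above add up to 0.7928 + 0.1043 + 0.0076 = 0.9047, so roughly 9.5% of the proper nouns carry no gender annotation. The sketch below shows how such shares, including the remainder without gender, could be computed on a handful of invented rows; the function name `gender_shares` and the toy data are assumptions for illustration.

```python
GENDERS = ("olia:Masculine", "olia:Feminine", "olia:Neuter")

# Invented rows; a missing key means the feature was not annotated.
rows = [
    {"nif:anchorOf": "Σωκρατης", "olia:ProperNoun": 1, "olia:Masculine": 1},
    {"nif:anchorOf": "Σιμμιας", "olia:ProperNoun": 1, "olia:Masculine": 1},
    {"nif:anchorOf": "Αιγυπτος", "olia:ProperNoun": 1, "olia:Feminine": 1},
    {"nif:anchorOf": "Μελητος", "olia:ProperNoun": 1},
    {"nif:anchorOf": "λογος", "olia:ProperNoun": 0, "olia:Masculine": 1},
]


def gender_shares(rows):
    """Share of proper nouns per gender, plus the share without any gender."""
    proper = [r for r in rows if r.get("olia:ProperNoun") == 1]
    if not proper:
        return {}
    total = len(proper)
    shares = {g: sum(r.get(g, 0) for r in proper) / total for g in GENDERS}
    without = sum(1 for r in proper if not any(r.get(g, 0) for g in GENDERS))
    shares["no gender"] = without / total
    return shares


print(gender_shares(rows))
```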

Conclusion

As you can see, this all works quite well. If a word form has one meaning, then it is fairly easy to create reliable patterns and find erroneous annotations from an NLP parser. The main problem that cannot be solved in this approach (by looking at word forms only) is that in ancient Greek a word form can have more than one meaning, and therefore different morphological features, depending on the specific context of the word form. For example, the meaning of a word form also depends on the (features of) preceding and following words in the sentence. To take that into account a different approach for mining is necessary.

I wonder whether it is feasible to automatically correct the output of the NLP parser in case of high-confidence or humanly verified morphological patterns. That would increase the accuracy of the annotations. Furthermore, if the association rules are used for prediction, then it might even be possible to construct a complete rules-based annotation model, thereby replacing the deep learning model with a transparent rules-based approach.

So it must first be possible to create more complex rules that take into account the context of the word forms. This could be achieved by querying the RDF-graph directly to mine for reliable triple associations and with that find erroneous triples and missing triples in the graph. To be continued.

Natural Language Processing in RDF graphs (2)

This is a follow-up to my blog on natural language processing in RDF graphs. Since then I found a number of improvements and incorporated them in the Python packages.

NLP Interchange Format

As there are over fifty different NLP annotation formats available, it didn’t seem a good idea to create yet another annotation format. So instead of using a self-made provisional ontology as I did earlier, it is now possible to convert to and use the NLP Interchange Format (NIF) with the Python package nifigator. Included in this package is functionality for a pipeline for PDF documents.

This ontology is different from NAF but has the advantage that it is a mature ontology for which the W3C community has provided guidelines and best practices (see for example the Guidelines for Linked Data corpus creation using NIF). There are some Python packages doing similar things, but none of them is able to convert the content of PDF, docx and html files to NIF.

The annotations in NAF are stored in the different layers. The data within each layer is stored in RDF triples in the following way:

| NAF layer | NIF class |
|---|---|
| raw layer | nif:Context |
| text layer | nif:Page, nif:Paragraph, nif:Sentence, nif:Word |
| terms layer | nif:Word |
| entities layer | nif:Phrase |
| deps layer | nif:Word |
| header | nif:Context |

Mapping from NAF layers to NIF classes

Ontolex-Lemon

Secondly, the Python package termate now allows termbases in TBX to be converted to RDF with the Ontolex-Lemon ontology. This is based on another W3C document, Guidelines for Linguistic Linked Data Generation: Multilingual Terminologies (TBX) (although I have implemented this for TBX version 3 instead of version 2, on which the guideline is based).

An example can be found here.

Multilingual termbases with metadata from reporting templates

Domain-specific termbases are of great importance to many domain-specific NLP-tasks. They enable identification and annotation of terms in documents in situations where often not enough text is available to use statistical approaches. And more importantly, they form a step towards extracting structured facts from unstructured text data.

This blog shows how to construct and use multilingual termbases to annotate text from supervisory documents in different European languages with references to relevant parts of (quantitative) supervisory templates. By linking qualitative data (text) to quantitative data (numbers) we connect initially unstructured text data to data that are often to a high degree structured and well-defined in data point models and taxonomies. I will do this by constructing termbases that contain terminology data combined with linguistic data and metadata from the supervisory templates from different financial sectors.

For terminology data I will start with the IATE-database. Most terminology that is used in European quantitative reporting templates is based on and derived from European legislation. Having multilingualism as one of its founding principles, the EU publishes terminology in the IATE-database in all official European languages to provide consistent translation of terms in European legislation. The IATE-database is published in the form of a file in TBX-format (TermBase eXchange). But termbases can also be in the form of SKOS (Simple Knowledge Organization System, built upon the RDF-format). Both formats are data models that contain descriptions and properties of concepts and are to some extent interchangeable (see for example here).

For metadata on reporting templates I will use relevant XBRL Taxonomies in RDF-format (see here). Normally, an XBRL Taxonomy is developed specifically for a single sector and therefore covers to some extent the financial terminology used within that sector. XBRL Taxonomies contain metadata of all data points in the reporting templates. A Data Point Model can be derived from an XBRL Taxonomy (that is: the taxonomy contains all definitions), and it is often published together with the taxonomy, which itself is only computer readable.

For linguistic data I will use the Python NLP package of Stanford Stanza, which provides pretrained NLP-models for all official European languages (in order of becoming an official EU language: Dutch, French, German, Italian (1958), Danish, English (1973), Greek (1981), Portuguese, Spanish (1986), Finnish, Swedish (1995), Czech, Estonian, Hungarian, Latvian, Lithuanian, Maltese, Polish, Slovak, Slovenian (2004), Bulgarian, Irish, Romanian (2007) and Croatian (2013)).

So we add semantic and linguistic structure to a terminology database. The resulting data structure is sometimes called an ontology, a taxonomy or a vocabulary, but these terms have no clear distinctive definitions. And moreover, the XBRL-people use the term taxonomy to refer to a structure that contains concepts with properties for definitions, labels, calculations and (table) presentations. To some extent it contains structured metadata of data points (i.e. the semantics of the data). Because of that you can say that it corresponds to an ontology within a Linked Data context. On the other hand a taxonomy within a Linked Data context (and everywhere else if I might add) is basically a description of concepts with sub-class relationships (concepts with hierarchical information). In the remainder of this blog I will use the term termbase for the resulting structure with semantic, linguistic and terminological data combined.

Constructing a termbase from IATE and XBRL

In a previous blog I have described how to set up a terminology database (termbase) specifically for insurance-related terms. Now I will add links from the concepts and terms in the IATE-database to the data point model of insurance reporting templates (thereby adding semantics to the termbase), and secondly I will add linguistic information at term-level like lemmas and part-of-speech tags to allow for easy usage in NLP-tasks. The TBX-format in which the IATE-database is published allows for storing references as well as linguistic data on term-level, so we can construct the termbase as a standalone file in TBX-format (another solution would be to add the terminology and linguistic information to the XBRL Taxonomy and use that as a basis).

The IATE-database currently contains almost 930,000 concepts and many of them have verbal expressions in multiple languages, resulting in over 8,000,000 terms. A single (English) expression of a concept in the IATE-database looks like this.

<conceptEntry id="3539858">
  <descrip type="subjectField">insurance</descrip>
  <langSec xml:lang="en">
    <termSec>
      <term>basic own funds</term>
      <termNote type="termType">fullForm</termNote>
      <descrip type="reliabilityCode">9</descrip>
    </termSec>
  </langSec>
</conceptEntry>

Adding labels from the XBRL Taxonomy

For the termbase, we add every element in the XBRL Taxonomy that has a label (tables, concepts, elements, dimensions and members) to the termbase and we add an external cross reference to the template and the location in that template where the element is used (the row or column within the template). The TBX-format allows for fields called externalCrossReference which refer to a resource that is external to the terminology database. Then you get concept entries like this:

<conceptEntry id="http://eiopa.europa.eu/xbrl/s2md/fws/solvency/solvency2/2021-07-15/tab/s.22.01.01.01#s2md_c4071">
  <descrip type="xbrlTypes">element</descrip>
  <xref type="externalCrossReference">S.22.01.01.01,R0020</xref>
  <langSec xml:lang="en">
    <termSec>
      <term>Basic own funds</term>
      <termNote type="termType">fullForm</termNote>
    </termSec>
    <termSec>
      <term>R0020</term>
      <termNote type="termType">shortForm</termNote>
    </termSec>
  </langSec>
</conceptEntry>

This means that the termbase concept with the id “http://eiopa.europa.eu/xbrl/…/s.22.01.01.01#s2md_c4071” has English labels “Basic own funds” (full form) and “R0020” (short form). These are the labels of row 0020 of template S.22.01.01.01, i.e. in the template definition you will see Basic own funds on row 0020.
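Reading such an entry back can be sketched with Python's standard `xml.etree.ElementTree`. The example parses the concept entry shown above as a standalone snippet; it assumes the elements carry no XML namespace, whereas a complete TBX file may declare a default namespace that would have to be included in the element names. The function name `read_entry` is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

ENTRY = """<conceptEntry id="http://eiopa.europa.eu/xbrl/s2md/fws/solvency/solvency2/2021-07-15/tab/s.22.01.01.01#s2md_c4071">
  <descrip type="xbrlTypes">element</descrip>
  <xref type="externalCrossReference">S.22.01.01.01,R0020</xref>
  <langSec xml:lang="en">
    <termSec>
      <term>Basic own funds</term>
      <termNote type="termType">fullForm</termNote>
    </termSec>
    <termSec>
      <term>R0020</term>
      <termNote type="termType">shortForm</termNote>
    </termSec>
  </langSec>
</conceptEntry>"""


def read_entry(xml_text):
    """Extract id, template, location and labels from one concept entry."""
    entry = ET.fromstring(xml_text)
    xref = entry.find("xref[@type='externalCrossReference']")
    # The cross reference holds "template,location", e.g. "S.22.01.01.01,R0020".
    template, location = xref.text.split(",", 1)
    labels = {}
    for lang_sec in entry.findall("langSec"):
        lang = lang_sec.get(XML_LANG)
        for term_sec in lang_sec.findall("termSec"):
            term_type = term_sec.find("termNote[@type='termType']").text
            labels[(lang, term_type)] = term_sec.find("term").text
    return {
        "id": entry.get("id"),
        "template": template,
        "location": location,
        "labels": labels,
    }


print(read_entry(ENTRY))
```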

The templates and rc-codes to which the elements refer were extracted using SPARQL-queries on the XBRL Taxonomy in RDF-format.

Adding references between IATE- and XBRL-concepts

Now that we have added terms for labelled elements from the XBRL Taxonomy, the next step is to add cross references between the IATE-concepts and the XBRL-concepts. In the TBX-format the crossReference is a pointer to another related location, such as another entry or another term, in this case to an XBRL-concept. Below the references to the XBRL-concepts are added.

<conceptEntry id="3539858">
  <descrip type="subjectField">insurance</descrip>
  <langSec xml:lang="en">
    <termSec>
      <term>basic own funds</term>
      <termNote type="termType">fullForm</termNote>
      <descrip type="reliabilityCode">9</descrip>
    </termSec>
  </langSec>
  <ref type="crossReference">http://eiopa.europa.eu/xbrl/s2c/dict/dom/el#x7</ref>
  <ref type="crossReference">http://eiopa.europa.eu/xbrl/s2md/fws/solvency/solvency2/2021-07-15/tab/s.22.01.22.01#s2md_c4121</ref>
  <ref type="crossReference">http://eiopa.europa.eu/xbrl/s2md/fws/solvency/solvency2/2021-07-15/tab/s.22.01.04.01#s2md_c4094</ref>
  <ref type="crossReference">http://eiopa.europa.eu/xbrl/s2md/fws/solvency/solvency2/2021-07-15/tab/s.22.01.01.01#s2md_c4071</ref>
  ...

The IATE-concept with id 3539858 points to the domain item el#x7 (a domain in XBRL is a set of related items, in this case own funds items), and furthermore to (table) elements s.22.01.22.01#s2md_c4121, s.22.01.04.01#s2md_c4094 and s.22.01.01.01#s2md_c4071. These all refer to a single row or a single column within a template and the last one is given as an example above. It is the row in the table S.22.01.01.01 with label ‘Basic own funds’. IATE-concepts and XBRL-concepts are considered equal when their lowercase lemmas are the same (except for abbreviations).

For the XBRL Taxonomy of Solvency 2 we find in this way 740 unique terms in the IATE-database that are identical to an XBRL-concept (in English), and 1,500 terms that occur in the labels of XBRL-concepts but are not identical; for example, ‘net best estimate’ is a part of the label ‘Net Best Estimate of Premium Provisions’. For now I focused on identifying identical terms. How to process long labels with more than one term is a remaining challenge (this describes probably a useful approach).
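The matching on lowercase lemmas can be sketched as a dictionary lookup, with a separate bucket for terms that only occur inside a longer label. This is an illustration under simplifying assumptions: the special handling of abbreviations is left out, the identifiers `iate-example` and `xbrl-example` are hypothetical, and the function names are made up for this example.

```python
def build_index(xbrl_labels):
    """Map the lowercase lemma of each label to the XBRL concept ids using it."""
    index = {}
    for concept_id, lemma in xbrl_labels:
        index.setdefault(lemma.lower(), []).append(concept_id)
    return index


def contains_phrase(label, phrase):
    """True when phrase occurs in label on whole-word boundaries."""
    return f" {phrase} " in f" {label} "


def link_terms(iate_terms, xbrl_labels):
    """Split IATE terms into identical matches and matches inside longer labels."""
    index = build_index(xbrl_labels)
    identical, contained = {}, {}
    for iate_id, lemma in iate_terms:
        key = lemma.lower()
        if key in index:
            identical[iate_id] = index[key]
            continue
        hits = [
            concept_id
            for label, concept_ids in index.items()
            if contains_phrase(label, key)
            for concept_id in concept_ids
        ]
        if hits:
            contained[iate_id] = hits
    return identical, contained


xbrl_labels = [
    ("s.22.01.01.01#s2md_c4071", "Basic own fund"),
    ("xbrl-example", "Net Best Estimate of Premium Provision"),  # hypothetical id
]
iate_terms = [
    ("3539858", "basic own fund"),
    ("iate-example", "net best estimate"),  # hypothetical id
]
print(link_terms(iate_terms, xbrl_labels))
```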

Adding part-of-speech tags and lemmas

To make the termbase applicable for NLP-tasks we need to add linguistic information, such as part-of-speech patterns and lemmas. This improves the annotation process later on. Premium as an adjective has another meaning than premium as a common noun. By matching the PoS-pattern we get better matches (i.e. fewer false positives). In addition, matching on the lemmas of a term finds the term irrespective of whether it occurs in singular or plural form. This is the concept to which PoS-patterns and lemmas are added:

<conceptEntry id="3539858">
  <descrip type="subjectField">insurance</descrip>
  <langSec xml:lang="en">
    <termSec>
      <term>basic own funds</term>
      <termNote type="termType">fullForm</termNote>
      <descrip type="reliabilityCode">9</descrip>
      <termNote type="termLemma">basic own fund</termNote>
      <termNote type="partOfSpeech">adj, adj, noun</termNote>
    </termSec>
  </langSec>
  <langSec xml:lang="it">
    <termSec>
      <term>fondi propri di base</term>
      <termNote type="termType">fullForm</termNote>
      <descrip type="reliabilityCode">9</descrip>
      <termNote type="termLemma">fondo proprio di base</termNote>
      <termNote type="partOfSpeech">noun, det, adp, noun</termNote>
    </termSec>
  </langSec>
  ...
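How the termLemma and partOfSpeech fields could be used during annotation is sketched below: a sliding window over the tokens of a sentence, where both the lemmas and the PoS tags must agree with the termbase entry. The token dictionaries and the function name `find_term` are assumptions for this illustration; this is not the matching code of Nafigator.

```python
def find_term(tokens, term_lemmas, term_pos):
    """Return the start indices where lemmas and PoS tags both match the term."""
    size = len(term_lemmas)
    hits = []
    for start in range(len(tokens) - size + 1):
        window = tokens[start:start + size]
        lemmas = [token["lemma"].lower() for token in window]
        pos = [token["pos"].lower() for token in window]
        if lemmas == term_lemmas and pos == term_pos:
            hits.append(start)
    return hits


# Term data as stored in the termbase entry above.
term_lemmas = "basic own fund".split()
term_pos = [tag.strip() for tag in "adj, adj, noun".split(",")]

# Hand-made tokens for "The basic own funds increased" (tags are assumed).
tokens = [
    {"text": "The", "lemma": "the", "pos": "DET"},
    {"text": "basic", "lemma": "basic", "pos": "ADJ"},
    {"text": "own", "lemma": "own", "pos": "ADJ"},
    {"text": "funds", "lemma": "fund", "pos": "NOUN"},
    {"text": "increased", "lemma": "increase", "pos": "VERB"},
]
print(find_term(tokens, term_lemmas, term_pos))

# "premium" tagged as adjective does not match the noun entry "premium".
adjective_use = [
    {"text": "premium", "lemma": "premium", "pos": "ADJ"},
    {"text": "product", "lemma": "product", "pos": "NOUN"},
]
print(find_term(adjective_use, ["premium"], ["noun"]))
```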

Annotating text in different languages

Because the termbase contains multilingual terms with references to templates and locations we are now able to annotate terms in documents in different European languages. Below you find some examples of what you can do with the termbase. I took an identical text from the Solvency 2 Delegated Acts in several languages (the first part of article 52; the examples below are in English, Finnish, German, French, Swedish and Italian), converted the text to NAF and added annotations with the termbase (Nafigator has a function that processes a complete termbase and adds annotations to the NAF-file). This results in the following (using a visualizer from the spaCy package):

In English

Article 52 Mortality risk [Term S.26.03.01.01,R0100] stress The mortality risk [Term S.26.03.01.01,R0100] stress referred to in Article 77b(1)(f) of Directive 2009/138/EC shall be the more adverse of the 1. following two scenarios in terms of its impact on basic own funds [Term S.22.01.01.01,R0020]: (a) an instantaneous permanent increase of 15 % in the mortality rates used for the calculation of the best estimate [Term S.02.01.01.01,R0540]; (b) an instantaneous increase of 0.15 percentage points in the mortality rates (expressed as percentages) which are used in the calculation of technical provisions [Term S.22.01.01.01,R0010] to reflect the mortality experience in the following 12 months.

In Finnish

52 artikla Kuolevuusriskiin [Term S.26.03.01.01,R0100] liittyvä stressi Direktiivin 2009/138/EY 77 b artiklan 1 kohdan f alakohdassa tarkoitetun, kuolevuusriskiin [Term S.26.03.01.01,R0100] liittyvän stressin on 1. oltava seuraavista kahdesta skenaariosta se, jonka epäsuotuisa vaikutus omaan perusvarallisuuteen [Term S.22.01.01.01,R0020] on suurempi: a) välitön, pysyvä 15 %:n nousu parhaan estimaatin [Term S.02.01.01.01,R0540] laskennassa käytetyssä kuolevuudessa; b) välitön 0,15 prosenttiyksikön nousu prosentteina ilmaistussa kuolevuudessa, jota käytetään vakuutusteknisen vastuuvelan [Term S.22.01.01.01,R0010] laskennassa ilmentämään havaittua kuolevuutta seuraavien 12 kuukauden aikana. Sovellettaessa 1 kohtaa kuolevuuden nousua sovelletaan ainoastaan vakuutuksiin, joissa kuolevuuden nousu johtaa

In German

Artikel 52 Sterblichkeitsrisikostress [Term S.26.03.01.01,R0100] Der in Artikel 77b Absatz 1 Buchstabe f der Richtlinie 2009/138/EG genannte Sterblichkeitsrisikostress [Term S.26.03.01.01,R0100] ist das im 1. Hinblick auf die Auswirkungen auf die Basiseigenmittel [Term S.22.01.01.01,R0020] ungünstigere der beiden folgenden Szenarien: (a) plötzlicher dauerhafter Anstieg der bei der Berechnung des besten Schätzwerts zugrunde gelegten Sterblichkeitsraten um 15 %; (b) plötzlicher Anstieg der bei der Berechnung der versicherungstechnischen Rückstellungen [Term S.22.01.01.01,R0010] zugrunde gelegten Sterblichkeitsraten (ausgedrückt als Prozentsätze) um 0,15 Prozentpunkte, um die Sterblichkeit in den folgenden zwölf Monaten widerzuspiegeln.

In French

Article 52 Choc de risque de mortalité [Term S.26.03.01.01,R0100] Le choc de risque de mortalité [Term S.26.03.01.01,R0100] visé à l’article 77 ter, paragraphe 1, point f), de la directive 2009/138/CE 1. correspond au plus défavorable des deux scénarios suivants en termes d’impact sur les fonds propres de base [Term S.22.01.01.01,R0020]: (a) une hausse permanente soudaine de 15 % des taux de mortalité utilisés pour le calcul de la meilleure estimation [Term S.02.01.01.01,R0540]; (b) une hausse soudaine de 0,15 point de pourcentage des taux de mortalité (exprimés en pourcentage) qui sont utilisés dans le calcul des provisions techniques [Term S.22.01.01.01,R0010] pour refléter l’évolution de la mortalité au cours des 12 mois à venir.

In Swedish

Artikel 52 Dödsfallsriskstress [Term S.26.03.01.01,R0100] Den dödsfallsriskstress [Term S.26.03.01.01,R0100] som avses i artikel 77b.1 f i direktiv 2009/138/EG ska vara det som är mest negativt av 1. följande två scenarier i fråga om dess påverkan på kapitalbasen: (a) En omedelbar permanent ökning på 15 % av dödligheten som används för beräkning av bästa skattningen [Term S.02.01.01.01,R0540]. (b) En omedelbar ökning på 0,15 % av dödstalen (uttryckta i procent) som används i beräkningen av försäkringstekniska avsättningar [Term S.22.01.01.01,R0010] för att återspegla dödligheten under de följande tolv månaderna.

In Italian

Articolo 52 Stress legato al rischio di mortalità [Term S.26.03.01.01,R0100] Lo stress legato al rischio di mortalità [Term S.26.03.01.01,R0100] di cui all’articolo 77 ter, paragrafo 1, lettera f), della direttiva 2009/138/CE è 1. il più sfavorevole dei due seguenti scenari in termini di impatto sui fondi propri di base [Term S.22.01.01.01,R0020]: (a) un incremento permanente istantaneo del 15 % dei tassi di mortalità utilizzati per il calcolo della migliore stima; (b) un incremento istantaneo di 0,15 punti percentuali dei tassi di mortalità (espressi in percentuale) utilizzati nel calcolo delle riserve tecniche [Term S.22.01.01.01,R0010] per tener conto dei dati tratti dall’esperienza relativi alla mortalità nei 12 mesi successivi. Ai fini del paragrafo 1, l’incremento dei tassi di mortalità si applica soltanto alle polizze di assicurazione per le 2. quali tale incremento comporta un aumento delle riserve tecniche [Term S.22.01.01.01,R0010] tenendo conto di tutto quanto segue:

In the Italian text one reference is missing: the IATE-database does not yet contain an Italian translation for the English term best estimate. This happens because the IATE-database is far from complete. Not all terms are available in all languages, and the IATE-database probably does not contain all terminology from the reporting templates. And although the IATE-database is constantly updated, it might be necessary for certain use cases to add additional translations and concepts to the database.

This works for every XBRL Taxonomy, although every taxonomy has its own peculiarities (it’s a “standard” as one of my IT-colleagues likes to say). Following the procedure described above, I made the following termbases for insurance undertakings, credit institutions and Dutch pension funds based on the taxonomy versions mentioned:

  • Termbase of EIOPA Solvency 2, taxonomy version 2.6.0
  • Termbase of EBA CRD IV, taxonomy version 3.2.1.0
  • Termbase of DNB FTK, taxonomy version 2.3.0

These can be found on data.world.

Explainable outlier detection with decision trees and ruleminer

This is a note on an extension of the ruleminer package to convert the content of decision trees into rules to provide an approach to unsupervised and explainable outlier detection.

Here is a way to use decision trees for unsupervised outlier detection. For an arbitrary data set (in the form of a dataframe) for each column a decision tree is trained to predict that column by using the other columns as features. For target columns with integers a classifier is used and for columns with floating numbers a regressor is used. This provides an unsupervised approach resulting in decision trees that predict every column from the other columns.
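The per-column training described above could look roughly like the sketch below. It is not the code from the ruleminer package: the function name fit_tree_per_column and the tiny data set are made up for illustration, and only standard scikit-learn and pandas calls are used.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor


def fit_tree_per_column(df: pd.DataFrame, max_depth: int = 2) -> dict:
    """Train one decision tree per column, using the other columns as features."""
    trees = {}
    for target in df.columns:
        features = df.drop(columns=[target])
        # integer targets get a classifier, floating-point targets a regressor
        if pd.api.types.is_integer_dtype(df[target]):
            model = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        else:
            model = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        trees[target] = model.fit(features, df[target])
    return trees


# tiny made-up data set, only for illustration
data = pd.DataFrame({
    "kind": [0, 0, 1, 1],
    "size": [1.0, 1.2, 3.1, 3.3],
    "weight": [10.0, 11.0, 30.0, 32.0],
})
trees = fit_tree_per_column(data)
```

Each entry in trees predicts one column from all the others, so no labelled target is needed up front.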

The paths in the decision trees basically contain data patterns and rules that are embedded in the data set. These paths can therefore be treated as association rules and applied within the framework of association rules mining. Outliers are those values that could not be predicted accurately with the decision trees, and these are exceptions to the rules. The resulting rules are in a human readable format so this provides transparent and explainable rules representing patterns in the data set.

Training and using decision trees can of course easily be done with scikit-learn. What I have added to the ruleminer package is source code to extract decision paths from arbitrary scikit-learn decision trees (classifiers and regressors) and convert them into rules in a ruleminer object. The extracted rules can then be treated like any other set of rules and can be applied to other data sets to calculate rule metrics and find outliers.

Example with the iris data

Here is an example. I ran an AdaBoostClassifier on the iris data set from the scikit-learn package and fitted an ensemble of 25 trees with depth 2 (this provides if-then rules where the antecedent of the rule contains a maximum of two conditions):

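from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

base, estimator = DecisionTreeClassifier, AdaBoostClassifier

# ensemble of 25 decision trees, each with a maximum depth of 2
# (in scikit-learn >= 1.2 the parameter base_estimator is named estimator)
regressor = estimator(
    base_estimator = base(
        random_state=0, 
        max_depth=2),
    n_estimators=25,
    random_state=0)
regressor = regressor.fit(X, Y)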

Here X contains the features of the iris data set and Y is the target. We now have an ensemble (or forest) of decision trees. The first decision tree in the ensemble looks like this:

The first line in each white (non-leaf) node contains a condition of the rules. To extract the rules from this tree I have provided a utility function in the ruleminer package that can be used in the following way:

# derive expression from tree
ruleminer.tree_to_expressions(regressor[0], features, target)

This results in the following set of rules (in the syntax of ruleminer rules):

{'if (({"petal width cm"} <= 0.8)) then ({"target"} == 0)',
 'if (({"petal width cm"} > 0.8) & ({"petal width cm"} <= 1.75)) then ({"target"} == 1)',
 'if (({"petal width cm"} > 0.8) & ({"petal width cm"} > 1.75)) then ({"target"} == 2)'}
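To give an idea of how decision paths can be turned into rule strings like these, here is a minimal sketch that walks a fitted scikit-learn classifier. It is my own illustration and not the implementation inside ruleminer: the function tree_to_rules is hypothetical, and a single DecisionTreeClassifier is used instead of the AdaBoost ensemble.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier


def tree_to_rules(tree, feature_names, target_name):
    """Turn every root-to-leaf path of a fitted classifier into a rule string."""
    t = tree.tree_
    rules = []

    def walk(node, conditions):
        if t.children_left[node] == -1:  # -1 marks a leaf in scikit-learn
            label = tree.classes_[t.value[node][0].argmax()]
            antecedent = " & ".join(conditions)
            rules.append(f'if ({antecedent}) then ({{"{target_name}"}} == {label})')
            return
        name = feature_names[t.feature[node]]
        threshold = round(float(t.threshold[node]), 4)
        # left branch: condition holds, right branch: condition does not hold
        walk(t.children_left[node], conditions + [f'({{"{name}"}} <= {threshold})'])
        walk(t.children_right[node], conditions + [f'({{"{name}"}} > {threshold})'])

    walk(0, [])
    return rules


iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
names = [n.replace(" (cm)", " cm") for n in iris.feature_names]
rules = tree_to_rules(clf, names, "target")
for rule in rules:
    print(rule)
```

Every leaf yields exactly one rule, and the conditions collected along the path form the antecedent.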

For each leaf in the tree the decision path is converted to a rule, where each node on the path contributes a condition to the rule. To get the best decision tree in an ensemble (according to, for example, the highest absolute support) we generate a miner per decision tree in the ensemble:

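# fit an ensemble on the dataframe and extract the expressions per decision tree
ensemble_exprs = ruleminer.fit_ensemble_and_extract_expressions(
    dataframe,
    target = "target", 
    max_depth = 2)

# one miner per decision tree in the ensemble
miners = [
    ruleminer.RuleMiner(
        templates=[{'expression': expr} for expr in exprs], 
        data=dataframe) 
    for exprs in ensemble_exprs
]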

From this we extract the miner with the highest absolute support:

max(miners, key=lambda miner: miner.rules['abs support'].sum())

resulting in the following output of the rules:

idx | id | group | definition | status | abs support | abs exceptions | confidence | encodings
0 | 0 | 0 | if({"petal width cm"}<=0.8)then({"target"}==0) | | 50 | 0 | 1.000000 | {}
1 | 1 | 0 | if(({"petal width cm"}>0.8)&({"petal width cm"}>1.75))then({"target"}==2) | | 45 | 1 | 0.978261 | {}
2 | 2 | 0 | if(({"petal width cm"}>0.8)&({"petal width cm"}<=1.75))then({"target"}==1) | | 49 | 5 | 0.907407 | {}

With a maximum depth of two, we see that three rules are derived that are confirmed by 144 samples in the data set, and six samples were found that do not satisfy the rules (outliers).
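The totals and confidence values can be checked with a few lines of arithmetic. The sketch assumes that confidence is computed as support divided by support plus exceptions, which is consistent with the numbers in the table.

```python
# (abs support, abs exceptions) per rule, copied from the table above
rule_counts = {
    "target == 0": (50, 0),
    "target == 2": (45, 1),
    "target == 1": (49, 5),
}

# assumed definition: confidence = support / (support + exceptions)
confidence = {name: s / (s + e) for name, (s, e) in rule_counts.items()}
total_support = sum(s for s, _ in rule_counts.values())
total_exceptions = sum(e for _, e in rule_counts.values())

print(total_support, total_exceptions)  # 144 6
```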

Example with insurance undertakings data

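We can apply the same approach to the example data set I have used earlier:

import pandas as pd
import ruleminer
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame(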
    columns=[
        "Name",
        "Type",
        "Assets",
        "TP-life",
        "TP-nonlife",
        "Own funds",
        "Excess",
    ],
    data=[
        ["Insurer1", "life insurer", 1000.0, 800.0, 0.0, 200.0, 200.0],
        ["Insurer2", "non-life insurer", 4000.0, 0.0, 3200.0, 800.0, 800.0],
        ["Insurer3", "non-life insurer", 800.0, 0.0, 700.0, 100.0, 100.0],
        ["Insurer4", "life insurer", 2500.0, 1800.0, 0.0, 700.0, 700.0],
        ["Insurer5", "non-life insurer", 2100.0, 0.0, 2200.0, 200.0, 200.0],
        ["Insurer6", "life insurer", 9001.0, 8701.0, 0.0, 300.0, 200.0],
        ["Insurer7", "life insurer", 9002.0, 8802.0, 0.0, 200.0, 200.0],
        ["Insurer8", "life insurer", 9003.0, 8903.0, 0.0, 100.0, 200.0],
        ["Insurer9", "non-life insurer", 9000.0, 8850.0, 0.0, 150.0, 200.0],
        ["Insurer10", "non-life insurer", 9000.0, 0, 8750.0, 250.0, 199.99],
    ],
)
df.index.name="id"
df[['Type']] = OrdinalEncoder(dtype=int).fit_transform(df[['Type']])
df[['Name']] = OrdinalEncoder(dtype=int).fit_transform(df[['Name']])

The last two lines convert the string data to integer data so it can be used in the decision trees. We then fit an ensemble of trees with a maximum depth of one (to generate the simplest rules):

expressions = ruleminer.fit_dataframe_to_ensemble(df, max_depth = 1)

This results in 41 expressions that we can evaluate with ruleminer, selecting the rules that have a confidence of at least 75% and a minimum absolute support of two:

templates = [{'expression': solution} for solution in expressions]
params = {
    "filter": {'confidence': 0.75, 'abs support': 2},
    "metrics": ['confidence', 'abs support', 'abs exceptions']
}
r = ruleminer.RuleMiner(templates=templates, data=df, params=params)

This results in the following rules with metrics. The data error that we added in advance (insurer 9 is a non-life undertaking reporting life technical provisions) is indeed found:

idx | id | group | definition | status | abs support | abs exceptions | confidence | encodings
0 | 0 | 0 | if({"TP-life"}>400.0)then({"TP-nonlife"}==0.0) | | 6 | 0 | 1.000000 | {}
1 | 1 | 0 | if({"TP-life"}<=400.0)then({"Type"}==1) | | 4 | 0 | 1.000000 | {}
2 | 2 | 0 | if({"TP-life"}>400.0)then({"Type"}==0) | | 5 | 1 | 0.833333 | {}
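To see which sample is the exception to the third rule, the rule can be evaluated by hand on the example data. This is a plain Python sketch and not the ruleminer evaluation. It assumes that the ordinal encoding maps "life insurer" to 0 and "non-life insurer" to 1 (alphabetical order).

```python
# (name, type, TP-life) taken from the example data set; type 0 = life, 1 = non-life
insurers = [
    ("Insurer1", 0, 800.0), ("Insurer2", 1, 0.0), ("Insurer3", 1, 0.0),
    ("Insurer4", 0, 1800.0), ("Insurer5", 1, 0.0), ("Insurer6", 0, 8701.0),
    ("Insurer7", 0, 8802.0), ("Insurer8", 0, 8903.0), ("Insurer9", 1, 8850.0),
    ("Insurer10", 1, 0.0),
]

# rule: if TP-life > 400.0 then Type == 0
covered = [row for row in insurers if row[2] > 400.0]
support = [name for name, kind, _ in covered if kind == 0]
exceptions = [name for name, kind, _ in covered if kind != 0]
confidence = len(support) / len(covered)

print(exceptions, round(confidence, 6))  # ['Insurer9'] 0.833333
```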

There is a relationship between the constraints set on the decision tree and the structure and metrics of the resulting rules. The example above showed the most simple rules with maximum depth of one, i.e. only one condition in the if-part of the rule. It is also possible to set the decision tree parameter min_samples_leaf to guarantee a minimum number of samples in a leaf. In the association rules terminology this corresponds to selecting rules with a certain maximum absolute support or exception count. Setting the minimum samples per leaf to one results in rules with a maximum number of exceptions of one and this yields in our case the same results as maximum depth of one:

expressions = ruleminer.fit_dataframe_to_ensemble(df, min_samples_leaf = 1)

The parameter min_weight_fraction_leaf allows for a weighted minimum fraction of the input samples required to be at a leaf node. This might be applicable in cases where you have weights (or levels of importance) of the samples in the data set.

The rules that can be found with ensembles of decision trees are all of the form “if A and B and … then C”, where A, B and C are conditions. Rules containing numerical operations and rules with complex conditions in the consequent of the rule cannot be found in this way. Furthermore, if the maximum depth of the decision tree is too large then the resulting rules, although as such human readable, become less explainable. They might however point to unknown exceptions in the data that could not be captured with the supervised approach (predefined expressions with regexes). Taking into account these drawbacks, this approach has potential to be used in larger data sets.

The notebook of the examples above can be found here.

Natural Language Processing in RDF graphs

This blog shows how to store text data in an RDF graph and retrieve and analyze information from that graph. Resource Description Framework (RDF) graphs are very suitable structures for storing Natural Language Processing (NLP) data. They enable combining NLP data with other data sets in RDF (such as legal entities data from the Global LEI Foundation and the EIOPA register of European insurance undertakings, terminology data, for example Solvency 2 terminology and data from XBRL reports); and they allow adding text semantics in the form of linguistic annotations, which enables NLP analyses simply by executing database queries.

Here is what I did. To get a proper amount of text data I web-scraped the entire website of De Nederlandsche Bank (text in webpages and in PDF documents, including speeches, press releases, research publications, sector information, dnbulletins, and all blogs by Maarten Gelderman and Olaf Sleijpen, consisting of over 4,000 documents). Text extraction from the web pages was done with the Python package newspaper3k (a great tip from my NLP colleagues from the Authority for Consumers and Markets). Text data was then converted to the NLP Annotation Format (NAF), for which I defined an RDF representation (implemented in the Nafigator package) to upload the data to an RDF triple-store. For the triple-store I used Ontotext’s GraphDB, one of the best RDF databases currently available. Information can then be retrieved from the graph database with SPARQL queries for all kinds of NLP analyses.

Using a triple-store for NLP data leads to an efficient retrieval process of text data, especially if you compare that to a process where you search through different annotation files. Triple-stores for RDF (and the new RDF-star) have become efficient and powerful solutions with the same capabilities as property graphs but with the advantages of RDF and ontologies.

I will describe two parts of this process that are not straightforward in detail: the RDF representation of NAF, and retrieving data from the graph database.

The NLP Annotation Format in RDF

The NLP Annotation Format is an easy format for storing text annotations (see here for links to the description). All documents that were scraped from the website were processed with the Python package Nafigator, which is able to convert PDF documents and HTML files to XML files satisfying the NLP Annotation Format. Standard annotation layers with the raw text, word forms, terms, named entities and dependencies were added using the Stanford Stanza NLP processor.

In this representation every annotation (word forms, terms, named entities, etc.) of every document must have a Uniform Resource Identifier (URI). To do this, I used a prefix doc_xxx for each document in the document set. This prefix can, for example, be set by

@prefix doc_001: <http://rdf.mangosaurus.eu/doc_001/> .

In this case the identifier is based on the domain of this blog. For web-scraped documents you might also use the original URL of the document. Furthermore, for the RDF representation of NAF a provisional RDF Schema with prefix naf-base was made with the basic properties and classes of NAF.

The basic structure is set out below. All examples provided below are derived from the file example.pdf in the Nafigator package (the first sentence of the first page starts with: ‘The Nafigator package … ‘).

Document and header

Every document has a header and pages.

doc_001:doc a naf-base:document ;
    naf-base:hasHeader doc_001:nafHeader ;
    naf-base:hasPages ( doc_001:page1 ) .

Here naf-base:document is an RDF Class and naf-base:hasHeader and naf-base:hasPages are RDF Properties. The three lines above state that doc_001:doc is a document with header doc_001:nafHeader and a single page doc_001:page1.

In the header all metadata of the document is stored, including all linguistic processors and models that were used in processing the document. Below you see the metadata of the NAF text layer and the document metadata.

doc_001:nafHeader a naf-base:header ;
    naf-base:hasLinguisticProcessors [ 
        naf-base:hasLayer naf-base:text ;
        naf-base:lp [ 
            naf-base:hasBeginTimestamp "2022-04-10T13:45:43UTC" ;
            naf-base:hasEndTimestamp "2022-04-10T13:45:44UTC" ;
            naf-base:hasHostname "desktop-computer" ;
            naf-base:hasModel "stanza_resources\\en\\tokenize\\ewt.pt" ;
            naf-base:hasName "text" ;
            naf-base:hasVersion "stanza_version-1.2.2" 
        ] 
        ...
    ] ;
    naf-base:hasPublic [ 
        dc:format "application/pdf" ;
        dc:uri "data/example.pdf" 
    ] .

Sentences, paragraphs and pages

Here is an example of a sentence object with properties.

doc_001:sent1 a naf-base:sentence ;
    naf-base:isPartOf doc_001:para1, doc_001:page1 ;
    naf-base:hasSpan ( doc_001:wf1 doc_001:wf2 ...  doc_001:wf29 ) .

These three lines describe the properties of the RDF subject doc_001:sent1. The doc_001:sent1 identifies the RDF subject for the first sentence of the first document; the first line says that the subject doc_001:sent1 is a (rdf:type) sentence. The second line says that this sentence is a part of the first paragraph and the first page of the document. The span of the sentence contains an ordered list of word forms of the sentence: doc_001:wf1, doc_001:wf2 and so on.

Paragraphs and pages to which the sentences refer are defined in a similar way.

Word forms and terms

For each word form the properties text, length and offset are defined. The word form is part of a term, sentence, paragraph and page, and that is also defined for every word form. Take for example the word form doc_001:wf2, defined as:

doc_001:wf2 a naf-base:wordform ;
    naf-base:hasText "Nafigator"^^rdf:XMLLiteral ;
    naf-base:hasLength "9"^^xsd:integer ;
    naf-base:hasOffset "4"^^xsd:integer ;
    naf-base:isPartOf doc_001:page1, 
        doc_001:para1, 
        doc_001:sent1.

In the next layer the terms of the word forms are defined, with their linguistic properties (lemma, grammatical number, part-of-speech and if applicable other properties such as verb voice and verb form). The term that refers to the word form above is

doc_001:term2 a naf-base:term ;
    naf-base:hasLemma "Nafigator"^^rdf:XMLLiteral ;
    naf-base:hasNumber olia:Singular ;
    naf-base:hasPos olia:ProperNoun ;
    naf-base:hasSpan ( doc_001:wf2 ) .

For the linguistic properties the OLiA ontology is used, which stands for Ontologies of Linguistic Annotations, an OWL taxonomy of data categories for linguistic annotations. The ontology contains precise definitions of and interrelations between the linguistic categories. In this case the grammatical number (olia:Singular) and the part-of-speech tag (olia:ProperNoun) are included in the properties of this term. Depending on the term other properties are defined, for example verb forms. The span of the term refers back to the word forms (if you create a NAF ontology then you would define this as a transitive relationship, but for now, by including both relations we speed up the retrieval process).

Named entities

Next are the named entities, which are stored in another NAF layer and here as separate subjects in the triple-store. An entity refers back to a term and has a certain type (organization, person, product, law, date and so on). The text of the entity is already stored in the term object so there is no need to include it here. External references could be added here, for example references to legal entities from the Global LEI Foundation. Here is the example referring to the triples above.

doc_001:entity1 a naf-base:entity ;
    naf-base:hasType naf-entity:product ;
    naf-base:hasSpan ( doc_001:term2 ) .

Dependencies

Powerful NLP models exist that are able to derive relationships between words within sentences. The dependencies are defined on the level of terms and stored in the dependency layer of NAF. In this RDF representation the dependencies are simply added to the terms.

doc_001:term3 a naf-base:term ;
    naf-rfunc:compound doc_001:term2 ;
    naf-rfunc:det doc_001:term1 .

The second and third lines say that term3 (‘package’) forms a compound term with term2 (‘Nafigator’) and has its determiner in term1 (‘The’).

There are more annotation layers in NAF, but these are the most basic ones and if you have these, then many powerful NLP analyses already can be done.

Information retrieval from the RDF graph database

The conversion of text to RDF described above was applied to all webpages and documents of the website of DNB, 4,065 documents in total with 401,832 sentences containing 9,789,818 words. This text data led to over 221 million RDF triples in the triple-store. I used a local database that was queried via a SPARQL endpoint. The numbers mentioned here can easily be extracted with SPARQL queries; for example, to count the number of sentences we can use the SPARQL query:

SELECT (COUNT(?s) AS ?count) WHERE { ?s a naf-base:sentence . }

With this query all RDF subjects (the variable ?s) that are a sentence are counted and the result is stored in the variable ‘count’. The same can be done with other RDF subjects like word forms and documents.

The RDF representation described above allows you to store the content and annotations of a set of documents with their metadata in one single graph. You can then retrieve information from that graph from different perspectives and for different purposes.

Information retrieval

Suppose we want to find all references on the website with relations between ‘DNB’ and the verb ‘supervise’ by looking for sentences where ‘DNB’ is the nominal subject and ‘supervise’ is the lemma of the verb in the sentence. This is done with the following query

SELECT ?text
WHERE {
    ?term naf-base:hasLemma "supervise" .
    ?term naf-rfunc:nsubj [naf-base:hasLemma "DNB" ] .
    ?term naf-base:hasSpan [ rdf:first ?wf ] .
    ?wf naf-base:isPartOf [ a naf-base:sentence ; naf-base:hasText ?text ].
}

It’s almost readable 🙂 The first line in the WHERE statement retrieves words that have ‘supervise’ as a lemma (this includes past, present and future tense and different verb forms). The second line narrows the selection down to where the nominal subject of the verb is ‘DNB’ (the lemma of the subject to be precise). The last two lines select the text of the sentences that includes the words that were found.

Execution of this query is done in a few milliseconds (on a desktop computer with a local database, nothing fancy) and results in 22 sentences, such as “DNB supervises adequate management of sustainability risks by financial institutions.”, “DNB supervises the cash payment system by providing information and guidance on the rules and procedures, data collection and examining compliance with the rules.”, and so on.

Term extraction

Terms are often multi-word expressions and can be retrieved by part-of-speech tags and dependencies. Suppose we want to retrieve all two-word terms of the form adjective, common noun. Part-of-speech tags are defined in the terms layer. The graph also defines the relation between the terms, in this case an adjectival modifier (amod) relation (the common noun is modified by an adjective). We can then define a query that looks for exactly that: two words, an adjective and a common noun, where the mutual relationship is that of an adjectival modifier. This is expressed in the first three lines in the WHERE-clause below. The last two lines retrieve the text of the words.

SELECT DISTINCT ?w1 ?w2 (count(*) as ?c)
WHERE {
    ?term1 naf-base:hasPos olia:CommonNoun .
    ?term2 naf-base:hasPos olia:Adjective .
    ?term1 naf-rfunc:amod ?term2 .
    ?term1 naf-base:hasSpan [ rdf:first/naf-base:hasText ?w1 ] .
    ?term2 naf-base:hasSpan [ rdf:first/naf-base:hasText ?w2 ] .
} GROUP BY ?w1 ?w2
ORDER BY DESC(?c)

Note that the query also counts the number of occurrences of each term and sorts the output by this count.

Most often the term ‘monetary policy’ was found (2,348 times), followed by ‘financial institutions’ (1,734 times) and ‘financiële instellingen’ (the Dutch translation of financial institutions, 1,519 times), and so on. In total more than 127,000 of these patterns were found on the website (this is a more complicated query and took around 10 seconds). In this way all kinds of term patterns can be found, which can be collected in a termbase (terminology database).

Opinion extraction

I will give here a very simple example of opinion extraction based on part-of-speech tags. Suppose you want to extract sentences that contain the author’s (or someone else’s) subjective opinion. You can look at the grammatical subject and the verb in a sentence, but you can also look at whether a sentence contains something like ‘too high’ or ‘too volatile’ (which often indicates subjective content). In that case we have the word ‘too’ (an adverb) followed by an adjective, with the mutual relation of adverbial modifier (advmod). In the Dutch language this has exactly the same form. The following query extracts these sentences.

SELECT ?text
WHERE {
    ?term1 naf-base:hasPos olia:Adjective .
    ?term2 naf-base:hasSpan [ rdf:first/naf-base:hasText "too" ] .
    ?term1 naf-rfunc:advmod ?term2 .
    ?term1 naf-base:hasSpan [ rdf:first ?wf1 ] .
    ?sent1 naf-base:hasSpan [ rdf:rest*/rdf:first ?wf1 ] .
    ?sent1 a naf-base:sentence .
    ?sent1 naf-base:hasText ?text .
}

With the last three lines the text of the sentence that includes the term is found (the output of the query). With the documents of the website of DNB, the output contains sentences like: “It is also clear that CO2 emissions are still too cheap and must be priced higher to sufficiently curtail emissions” and “Firms end up being too large” (in total 343 sentences in 0.3 seconds).

The examples shown here are just for illustrative purposes and do not always lead to accurate results, but they show that information extraction can be done fairly easily (if you know SPARQL) and reasonably quickly. Once the data is stored in a graph database, named entities can be matched with other internal or external data sources and lemmas of terms can be matched with concept-based terminology databases. Then you have a graph where the text is not only available on a simple string-level but also, and more importantly, on a conceptual level.

UPDATE: I have written a follow-up on this blog here.

The Solvency termbase for NLP

This blog describes a way to construct a terminology database for the insurance supervision knowledge domain. The goal of this termbase is to provide a reliable basis to extract insurance supervision terminology within different NLP analyses.

The terminology of solvency and insurance supervision forms an expert domain of terminology based on economics, mathematics, accounting and finance terminologies. Like probably many other knowledge domains the terminology used is very technical and specific. Some terms are only used within this domain with only a limited number of occurrences (which often hinders the use of statistical methods for finding terms). And many other words have general meanings outside the domain that do not coincide with the specific meanings within the domain. Translation of terms from this specific domain often requires extensive knowledge about the meaning and use of these terms.

What is a termbase?

A termbase is a database containing terminology and related information. It consists of concepts with their verbal designations (terms, i.e. single words or multi-word strings) of a specific knowledge domain, often in different languages. It contains the full form of concepts, but also abbreviations, synonyms and variants, and additional information about concepts, such as definitions and external references. To indicate accuracy or completeness, a reliability code is often added to individual terms of a concept. A proper termbase is an important terminology tool to achieve standardization of information and consistent use of (translations of) concepts in documents. Because of that, termbases are often used by professional translators.

The European Union translates legal documents into all member state languages and uses one common publicly available termbase for this: the IATE (Interactive Terminology for Europe) terminology database. The IATE termbase has been used in the EU institutions and agencies since 2004 for the collection, dissemination and management of EU-specific terminology. This helps to avoid divergences in the application of European Law within Europe (there exists a vast amount of literature on the effects of language ambiguity in European legislation). The translations of European legislation are therefore of the highest quality with strong consistency between different directives, regulations, delegated and implementing acts and so on.

Termbases are very useful for information extraction from documents and for linking terminology concepts between different documents. They can be extended with abbreviations, synonyms and common variants of terms.

The Solvency termbase for NLP

To create a first Solvency termbase for NLP purposes, I extracted terms from Solvency 2 Delegated Acts in a number of languages, looked up these terms in the IATE database and copied the corresponding concepts. It often happens that for one language the same term refers to different concepts (for example, the term ‘balance’ means something different in chemistry and in accounting). But if for one legal document the terms from different languages refer to the same concept, then we probably have the right concept (that was used in the translation of the legal document). So, the more references from the same legal document, the more reliable the term-concept relation is. And if we have the proper term-concept relationship, we automatically have all reliable translations of that concept.
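The idea of letting different languages "vote" for a concept can be sketched as follows. This is an illustration and not the code of the terminator repository: the lookup results are hypothetical, and apart from 2246604 (the IATE entry for solvency capital requirement) the concept ids are made up.

```python
from collections import Counter

# hypothetical lookup results: per language, the IATE concept ids found for the
# term extracted from the same legal document (ids other than 2246604 are made up)
candidates = {
    "en": {2246604, 1111111},
    "nl": {2246604},
    "fr": {2246604, 2222222},
}

# count in how many languages each concept id was found
votes = Counter(cid for ids in candidates.values() for cid in ids)
concept, n_languages = votes.most_common(1)[0]

print(concept, n_languages)  # 2246604 3
```

The concept that is referenced from the most languages is the most plausible one, and the number of languages can serve as a simple reliability indicator.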

Term extraction was done with part-of-speech patterns (such as adj-noun and adj-noun-noun patterns). To do this, for every language the Delegated Acts were converted to the NLP Annotation Format (NAF). The functionality for conversion to NAF and for extracting terms based on part-of-speech patterns is part of the nafigator package. As an NLP engine for nafigator, I used the Stanford Stanza package that contains tokenizers and part-of-speech models for every European language. The termbase itself was made with the terminator repository (currently under construction).
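Extraction with part-of-speech patterns boils down to sliding a window over the tagged words. The sketch below shows the principle in plain Python. It is not the nafigator implementation: the function extract_terms is hypothetical and the example sentence is tagged by hand.

```python
def extract_terms(tagged, patterns):
    """Return all word sequences whose POS tags match one of the patterns."""
    terms = []
    for pattern in patterns:
        n = len(pattern)
        for i in range(len(tagged) - n + 1):
            window = tagged[i:i + n]
            if tuple(pos for _, pos in window) == pattern:
                terms.append(" ".join(word for word, _ in window))
    return terms


# hand-tagged example sentence (tags are illustrative, not Stanza output)
tagged = [
    ("the", "det"), ("solvency", "noun"), ("capital", "noun"),
    ("requirement", "noun"), ("covers", "verb"), ("unexpected", "adj"),
    ("losses", "noun"),
]
patterns = [("adj", "noun"), ("noun", "noun", "noun")]
terms = extract_terms(tagged, patterns)

print(terms)  # ['unexpected losses', 'solvency capital requirement']
```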

For terms in Dutch, I also added to the termbase additional part-of-speech tags, lemmas and morphological properties from the Lassy Klein-corpus from the Instituut voor de Nederlandse taal (Dutch Language Institute). This data set consists of approximately 1 million words with manually verified syntactic annotations. I expanded this data set with solvency related words. Linguistic properties of terms of other languages can be added if a reliable data set is available.

Below, you see one concept from the resulting termbase (the concept of which ‘solvency capital requirement’ is the English term) in TermBase eXchange format (TBX). This is an international standard (ISO 30042:2019) for the representation of structured concept-oriented terminological data, based on xml.

<conceptEntry id="249">
 <descrip type="subjectField">insurance</descrip>
 <xref>IATE_2246604</xref>
 <ref>https://iate.europa.eu/entry/result/2246604/en</ref>
 <langSec xml:lang="nl">
  <termSec>
   <term>solvabiliteitskapitaalvereiste</term>
   <termNote type="partOfSpeech">noun</termNote>
   <note>source: ../naf-data/data/legislation/Solvency II Delegated Acts - NL.txt (#hits=331)</note>
   <termNote type="termType">fullForm</termNote>
   <descrip type="reliabilityCode">9</descrip>
   <termNote type="lemma">solvabiliteits_kapitaalvereiste</termNote>
   <termNote type="grammaticalNumber">singular</termNote>
   <termNoteGrp>
    <termNote type="component">solvabiliteits-</termNote>
    <termNote type="component">kapitaal-</termNote>
    <termNote type="component">vereiste</termNote>
   </termNoteGrp>
  </termSec>
 </langSec>
 <langSec xml:lang="en">
  <termSec>
   <term>SCR</term>
   <termNote type="termType">abbreviation</termNote>
   <descrip type="reliabilityCode">9</descrip>
  </termSec>
  <termSec>
   <term>solvency capital requirement</term>
   <termNote type="termType">fullForm</termNote>
   <descrip type="reliabilityCode">9</descrip>
   <termNote type="partOfSpeech">noun, noun, noun</termNote>
   <note>source: ../naf-data/data/legislation/Solvency II Delegated Acts - EN.txt (#hits=266)</note>
  </termSec>
 </langSec>
 <langSec xml:lang="fr">
  <termSec>
   <term>capital de solvabilité requis</term>
   <termNote type="termType">fullForm</termNote>
   <descrip type="reliabilityCode">9</descrip>
   <termNote type="partOfSpeech">noun, adp, noun, adj</termNote>
   <note>source: ../naf-data/data/legislation/Solvency II Delegated Acts - FR.txt (#hits=198)</note>
  </termSec>
  <termSec>
   <term>CSR</term>
   <termNote type="termType">abbreviation</termNote>
   <descrip type="reliabilityCode">9</descrip>
  </termSec>
 </langSec>
</conceptEntry>

You see that the concept contains a link to the IATE database entry with the definition of the concept (the link in this blog actually works so you can try it out). Then a number of language sections contain terms of this concept for different languages. The English section contains the term SCR as an English abbreviation of this concept (the French section contains the abbreviation CSR for the same concept). For every term the part-of-speech tags were added (which are not part of the IATE database) and, for Dutch only, the lemma and grammatical number of the term and its word components. These additional linguistic attributes allow easier use within NLP analyses. Furthermore, the number of occurrences in the original legal document is included as a note.
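Because TBX is plain XML, such an entry can be read with the Python standard library. Here is a small sketch that collects the terms per language from a shortened copy of the entry above. It only illustrates the structure and is not part of the termbase tooling.

```python
import xml.etree.ElementTree as ET

# shortened copy of the concept entry shown above
TBX = """
<conceptEntry id="249">
 <langSec xml:lang="en">
  <termSec>
   <term>SCR</term>
   <termNote type="termType">abbreviation</termNote>
  </termSec>
  <termSec>
   <term>solvency capital requirement</term>
   <termNote type="termType">fullForm</termNote>
  </termSec>
 </langSec>
 <langSec xml:lang="fr">
  <termSec>
   <term>capital de solvabilité requis</term>
   <termNote type="termType">fullForm</termNote>
  </termSec>
 </langSec>
</conceptEntry>
"""

# ElementTree exposes xml:lang under the predefined XML namespace
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

entry = ET.fromstring(TBX)
terms = {}
for lang_sec in entry.findall("langSec"):
    lang = lang_sec.get(XML_LANG)
    for term_sec in lang_sec.findall("termSec"):
        term = term_sec.findtext("term")
        term_type = term_sec.findtext("termNote[@type='termType']")
        terms.setdefault(lang, []).append((term, term_type))

print(terms)
```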

The concept entry contains related terms in all European languages. In Greek the SCR is κεφαλαιακή απαίτηση φερεγγυότητας, in Irish it is ‘ceanglas maidir le caipiteal sócmhainneachta’ (although the Solvency 2 Delegated Acts are not available in the Irish language), in Portuguese it is ‘requisito de capital de solvência’, in Estonian ‘solventsuskapitalinõue’, and so on. These are reliable translations as they are used in legal documents of that language.

The termbase contains all terms from the Solvency 2 Delegated Acts that can be found in the IATE database. In addition, terms that were not found in that database are added with the termNote “NewTerm”, to indicate that this term has yet to be reviewed by a knowledge domain expert. This would also be the way to add synonyms and variants of terms.

The Solvency termbase basically allows you to scan for a Solvency 2 concept in a document in any of the 23 European languages (given that it is in the IATE database). This is of course an initial approach to construct a termbase to test whether it is feasible and practical. The terminology that insurance undertakings use in their solvency reports is very likely to differ from the one used in legal documents. I will be testing this with a number of documents to identify Solvency 2 terminology to get an idea of how many synonyms and variants are missing.

Besides this Solvency termbase, it is possible to construct a Climate termbase in the same way, based on the European Climate Law (a European regulation from 2021). This law contains a large amount of climate-related terminology and is available in all European languages. A Climate termbase makes it possible to extract climate-related information from all kinds of documents. Furthermore, we have the Sustainable Finance Disclosure Regulation (a European regulation also from 2021) for environmental, social, and governance (ESG) terminology, which could provide a starting point for an ESG termbase. And of course I eagerly await the European Regulation on Artificial Intelligence.

Two new Python packages

Here is a short update about two new Python packages I have been working on. The first is about structuring Natural Language Processing projects (a subject I have been working on a lot recently) and the second is about rule mining in arbitrary datasets (in my case supervisory quantitative reports).

nafigator package

This package converts the content of (native) pdf documents, MS Word documents and html files into files that satisfy the NLP Annotation Format (NAF). It can use a default spaCy or Stanza pipeline or a custom-made pipeline. The package creates a file in xml-format with the first basic NAF layers, to which additional layers with annotations can be added. It is also possible to convert to RDF format (turtle-syntax and rdf-xml-syntax). This allows the use of this content in graph databases.

This is my approach for storing (intermediate) NLP results in a structured way. Conversion to NAF is done in only one computationally intensive step, and then the NAF files contain all necessary NLP data and can be processed very fast for further analyses. This allows you to use the results efficiently in downstream NLP projects. The NAF files also enable you to keep track of all annotation processes to make sure NLP results remain reproducible.

The NLP Annotation Format from VU University Amsterdam is a practical solution to store NLP annotations and relatively easy to implement. The main purpose of NAF is to provide a standardized format that, with a layered extensible structure, is flexible enough to be used in different NLP projects. The NAF standard has had a number of subsequent versions, and is still under development.

The idea of the format is that basically all NLP processing steps add annotations to a text. You start with the raw text (stored in the raw layer). This raw text is tokenized into pages, paragraphs, sentences and word forms and the result is stored in a text layer. Annotations to each word form (like lemmatized form, part-of-speech tag and morphological features) are stored in the terms layers. Each additional layer builds upon previous layers and adds more complex annotations to the text.
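
To make the layering concrete, here is a minimal sketch that builds a heavily simplified NAF-like document with the Python standard library. It is not nafigator's API and it leaves out most of the format (header, identifiers for pages, paragraphs and sentences, further layers); the function name `build_naf` and the input tuples are assumptions for this example.

```python
import xml.etree.ElementTree as ET


def build_naf(raw_text, tokens):
    """Build a simplified NAF-like tree; tokens is a list of (word, lemma, pos)."""
    naf = ET.Element("NAF")
    ET.SubElement(naf, "raw").text = raw_text   # raw layer: the unprocessed text
    text_layer = ET.SubElement(naf, "text")     # text layer: one <wf> per word form
    terms_layer = ET.SubElement(naf, "terms")   # terms layer: annotations per word form
    offset = 0
    for i, (word, lemma, pos) in enumerate(tokens, start=1):
        offset = raw_text.index(word, offset)   # position of the word in the raw text
        wf = ET.SubElement(text_layer, "wf", id=f"w{i}",
                           offset=str(offset), length=str(len(word)))
        wf.text = word
        term = ET.SubElement(terms_layer, "term", id=f"t{i}", lemma=lemma, pos=pos)
        span = ET.SubElement(term, "span")
        ET.SubElement(span, "target", id=f"w{i}")  # the term points back to its word form
        offset += len(word)
    return naf


naf = build_naf(
    "Insurers report data",
    [("Insurers", "insurer", "NOUN"), ("report", "report", "VERB"), ("data", "data", "NOUN")],
)
print(ET.tostring(naf, encoding="unicode"))
```

Each term refers to a word form by identifier instead of repeating it, which is how a later layer builds upon an earlier one.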

More information about the NLP Annotation Format and nafigator can be found on GitHub.

ruleminer package

Earlier, I made some progress with mining datasets for rules and patterns (two earlier blogs on this subject can be found here and here). New insights led to a completely new set-up of the code, which I have published as a Python package under the name ruleminer. The new package improves on the data-patterns package in a number of ways:

  • The speed of the rule mining process is improved significantly for if-then rules (at least six times faster). All candidate rules are now internally evaluated as Numpy expressions.
  • Additional evaluation metrics have been added to allow new ways to assess how interesting newly mined rules are. Besides support and confidence, we now also have the metrics lift, added value, causal confidence and conviction. New metrics can be added relatively easily.
  • A rule pruning procedure has been added to delete, to some extent, rules that are semantically identical (for example, if A=B and B=A are mined then one of them is pruned from the list).
  • A proper Pyparsing Grammar for rule expressions is used.
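
To give an idea of what some of these metrics measure, the sketch below evaluates a single if-then rule on a toy data set, using the commonly used textbook definitions of support, confidence, lift, added value and conviction. It does not use ruleminer, whose own definitions and implementation may differ in detail; the function name `rule_metrics` and the data are made up for this example.

```python
def rule_metrics(rows, antecedent, consequent):
    """Evaluate 'if antecedent then consequent' on a list of dict rows.

    Assumes that at least one row satisfies the antecedent and at least
    one row satisfies the consequent.
    """
    n = len(rows)
    n_a = sum(1 for r in rows if antecedent(r))
    n_b = sum(1 for r in rows if consequent(r))
    n_ab = sum(1 for r in rows if antecedent(r) and consequent(r))
    confidence = n_ab / n_a   # P(B | A)
    p_b = n_b / n             # P(B)
    return {
        "support": n_ab / n,                    # P(A and B)
        "confidence": confidence,
        "lift": confidence / p_b,               # > 1: A makes B more likely
        "added value": confidence - p_b,
        "conviction": (1 - p_b) / (1 - confidence) if confidence < 1 else float("inf"),
    }


# Toy data: 5 life undertakings (4 reporting in EUR) and 5 non-life (3 in EUR).
rows = (
    [{"type": "life", "currency": "EUR"}] * 4
    + [{"type": "life", "currency": "USD"}] * 1
    + [{"type": "non-life", "currency": "EUR"}] * 3
    + [{"type": "non-life", "currency": "USD"}] * 2
)

metrics = rule_metrics(
    rows,
    antecedent=lambda r: r["type"] == "life",
    consequent=lambda r: r["currency"] == "EUR",
)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```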

More information on ruleminer can be found on GitHub.

EIOPA’s Solvency 2 taxonomy in RDF

To use the metadata from XBRL taxonomies, like labels, hierarchies, template structures and formulas, often licensed software is needed to process the taxonomy and convert the XML content to readable formats. In an earlier blog I have shown that it is useful to convert XBRL instance data to a linked data set in RDF and then query that data to retrieve the desired information. In this blog I will show how to do this with taxonomies: by using a number of small SPARQL queries the complete Data Point Model (DPM) of (European) XBRL taxonomies can be retrieved.

The main purpose of retrieving metadata in this manner is to be able to use taxonomy metadata in data science environments. Examples are machine learning models that use taxonomy metadata like hierarchies, and NLP tasks that use concept and element labels from a taxonomy, such as Named Entity Recognition to link quantitative reports to (unstructured, or: not yet structured) text data.

The lightweight solution that I show here is completely based on open source code, in the form of my xbrl2rdf package. This package converts XBRL instance files and all related taxonomy files (schemas and linkbases) to RDF and RDF-star, and relies only on lxml and rdflib.

The examples below use the Solvency 2 taxonomy, but other taxonomies work as well. A gist with the notebook containing the code below can be found here.

Importing the data

With the xbrl2rdf-package I have converted the EIOPA instance example file for quarterly reports for solo undertakings (QRS) to RDF. All taxonomy concepts that are used in that instance are included in the RDF data set. This result can be read into memory with rdflib.

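# RDF graph loading
from rdflib import Graph as RDFGraph  # rdflib's Graph class, under the alias used below

path = "../data/rdf/qrs_240_instance.ttl"

# parse the Turtle file into an in-memory graph
g = RDFGraph()
g.parse(path, format='turtle')

print("rdflib Graph loaded successfully with {} triples".format(len(g)))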

This returns

rdflib Graph loaded successfully with 1203744 triples

So we have an RDF graph with 1.2 million triples that contains all data (facts related to concepts in the taxonomy, including all labels, validation rules, template structures, etc.). The original RDF data file is around 64 MB (combining instance and taxonomy triples). Reading and processing this file into an in-memory RDF graph takes some time, but then the graph can easily be queried.

Extracting template URIs

Let’s start with a simple query. Table or template URIs are subjects in the triple “subject xl:type table:table”. To get a list with all templates of an instance (in this case the first five) we run

q = """
  SELECT ?a
  WHERE {
    ?a xl:type table:table .
  }"""
tables = [str(row[0]) for row in g.query(q)]
tables.sort()
tables[0:5]

This returns a list of the URIs of the templates contained in the instance file.

['http://eiopa.europa.eu/xbrl/s2md/fws/solvency/solvency2/2019-07-15/tab/S.01.01.02.01#s2md_tS.01.01.02.01',
 'http://eiopa.europa.eu/xbrl/s2md/fws/solvency/solvency2/2019-07-15/tab/S.01.02.01.01#s2md_tS.01.02.01.01',
 'http://eiopa.europa.eu/xbrl/s2md/fws/solvency/solvency2/2019-07-15/tab/S.02.01.02.01#s2md_tS.02.01.02.01',
 'http://eiopa.europa.eu/xbrl/s2md/fws/solvency/solvency2/2019-07-15/tab/S.05.01.02.01#s2md_tS.05.01.02.01',
 'http://eiopa.europa.eu/xbrl/s2md/fws/solvency/solvency2/2019-07-15/tab/S.05.01.02.02#s2md_tS.05.01.02.02']

Extracting the explicit domains

Next, we extract the explicit domains and related data in the taxonomy. A domain is specific XBRL terminology and means a set of elements sharing a specified semantic nature. An explicit domain has its elements enumerated in the taxonomy and can be found with the subject in the triple ‘subject rdf:type model:explicitDomainType’.

q = """
  SELECT DISTINCT ?t ?x1 ?x2 ?x3 ?x4
  WHERE {
    ?t rdf:type model:explicitDomainType .
    ?t xbrli:periodType ?x1 .
    ?t model:creationDate ?x2 .
    ?t xbrli:nillable ?x3 .
    ?t xbrli:abstract ?x4 .
  }"""

The first five domains (of 41 in total) are

index  Domain name  Domain label            period type  creation date  nillable  abstract
0      LB           Lines of businesses     instant      2014-07-07     true      true
1      MC           Main categories         instant      2014-07-07     true      true
2      TI           Time intervals          instant      2014-07-07     true      true
3      AO           Article 112 and 167     instant      2014-07-07     true      true
4      CG           Collaterals/Guarantees  instant      2014-07-07     true      true

Domain names and labels with attributes

So, the label of the domain LB is Lines of businesses; it has been there since the early versions of the taxonomy. If a domain is modified then this is also included as a triple in the data set.

Extracting domain members

Elements of an explicit domain are called domain members. A domain member (or simply a member) is an enumerated element of an explicit domain. All members of a domain share a certain common nature. To get the members of a domain, we define a function that finds all domain-member relations of a given domain and retrieves the label of each member. In SPARQL this is:

def members(domain):
    q = """
      SELECT DISTINCT ?t ?label
      WHERE {
        ?l arcrole7:domain-member [ xl:from <"""+str(domain)+"""> ;
                                    xl:to ?t ] .
        ?t rdf:type nonnum:domainItemType .
        ?x arcrole3:concept-label [ xl:from ?t ;
                                    xl:to [rdf:value ?label ] ] .
        }"""
    return g.query(q)

All members of all domains can be retrieved by running this function for all domains defined earlier. Based on the output we create a Pandas DataFrame with the results.

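import pandas as pd
from urllib.parse import urldefrag

columns = ['Domain',
           'Member',
           'Member label']
frames = []
for d in df_domains.iloc[:, 0]:
    # keep only the fragment of each URI (the part after '#')
    data = [[urldefrag(d)[1]]+[urldefrag(row[0])[1]]+list(row[1:]) for row in members(d)]
    frames.append(pd.DataFrame(data=data,
                               columns=columns))
# DataFrame.append was removed in pandas 2.0, so the frames are concatenated instead
df_members = pd.concat(frames)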

In total there are 4,879 members of all domains (in this taxonomy).

index  Domain  Member  Member label
0      LB      x0      Total/NA
1      LB      x1      Accident and sickness
2      LB      x2      Motor
3      LB      x3      Fire and other damage to property
4      LB      x4      Aviation, marine and transport

The first five members of the domain LB (Lines of Businesses)

This allows us, for example, to retrieve all facts in a report that are related to the term ‘motor’ because a reported fact contains references to the domain members to which the fact relates.
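
A minimal sketch of that idea, on plain Python data instead of the RDF graph: the member rows are taken from the table above, while the facts, their structure and the function name `facts_for_term` are hypothetical and only serve to show the lookup.

```python
# (domain, member, label) rows as in the table of domain members
member_rows = [
    ("LB", "x0", "Total/NA"),
    ("LB", "x1", "Accident and sickness"),
    ("LB", "x2", "Motor"),
    ("LB", "x3", "Fire and other damage to property"),
    ("LB", "x4", "Aviation, marine and transport"),
]

# Hypothetical reported facts: each fact lists the domain members it relates to.
facts = [
    {"metric": "premiums written", "value": 120.0, "members": {("LB", "x2")}},
    {"metric": "premiums written", "value": 80.0, "members": {("LB", "x3")}},
    {"metric": "claims incurred", "value": 45.0, "members": {("LB", "x2")}},
]


def facts_for_term(term, member_rows, facts):
    """Return the facts that refer to a domain member whose label contains term."""
    wanted = {(domain, member) for domain, member, label in member_rows
              if term.lower() in label.lower()}
    return [fact for fact in facts if fact["members"] & wanted]


motor_facts = facts_for_term("motor", member_rows, facts)
print(motor_facts)
```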

Extracting the template structure

Template structures are stored in the taxonomy as a tree of linked elements along the axes x, y and, if applicable, z. The elements have a label and a row-column code (this holds at least for EIOPA and DNB taxonomies), and have a certain depth, i.e. they can be subcategories of other elements. For example, in the Solvency 2 balance sheet template, the element ‘Equities’ has the subcategories ‘Equities – listed’ and ‘Equities – unlisted’. I have not included the code here, but with a few lines you can extract the complete template structures as they are stored in the taxonomy. For example, for the balance sheet (S.02.01.02.01) we get:

index  axis  depth  rc-code  label
1      x     1      C0010    Solvency II value
3      y     1               Assets
4      y     1               Liabilities
5      y     2      R0010    Goodwill
6      y     2      R0020    Deferred acquisition costs
7      y     2      R0030    Intangible assets
8      y     2      R0040    Deferred tax assets
9      y     2      R0050    Pension benefit surplus
10     y     2      R0060    Property, plant & equipment held for own use
11     y     2      R0070    Investments (other than assets held for index-…
12     y     3      R0080    Property (other than for own use)
13     y     3      R0090    Holdings in related undertakings, including pa…
14     y     3      R0100    Equities
15     y     4      R0110    Equities – listed
16     y     4      R0120    Equities – unlisted
17     y     3      R0130    Bonds
18     y     4      R0140    Government Bonds
19     y     4      R0150    Corporate Bonds
20     y     4      R0160    Structured notes
21     y     4      R0170    Collateralised securities
22     y     3      R0180    Collective Investments Undertakings
23     y     3      R0190    Derivatives
24     y     3      R0200    Deposits other than cash equivalents

The first 25 lines of the balance sheet template
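
The depth column is enough to recover which element is a subcategory of which, as long as the rows are listed in tree order as in the table. The sketch below does this for a few rows of the balance sheet; it is an illustration on the flattened table only (the taxonomy itself stores explicit links between the elements), and the function name `parents_by_depth` is made up for this example.

```python
# (depth, rc-code, label) for a few rows of the balance sheet template
rows = [
    (3, "R0100", "Equities"),
    (4, "R0110", "Equities – listed"),
    (4, "R0120", "Equities – unlisted"),
    (3, "R0130", "Bonds"),
    (4, "R0140", "Government Bonds"),
    (4, "R0150", "Corporate Bonds"),
]


def parents_by_depth(rows):
    """Map each rc-code to the rc-code of its parent (None if no parent is in rows)."""
    parents = {}
    stack = []  # (depth, rc-code) of the currently open ancestors
    for depth, rc_code, _label in rows:
        # Close all elements that are at the same depth or deeper.
        while stack and stack[-1][0] >= depth:
            stack.pop()
        parents[rc_code] = stack[-1][1] if stack else None
        stack.append((depth, rc_code))
    return parents


parents = parents_by_depth(rows)
print(parents)
```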

With relatively small SPARQL queries, it is possible to retrieve metadata from XBRL taxonomies. This works well because we have converted the original taxonomy (in XML) to the linked data format RDF, a format that is especially well suited for representing and querying XBRL data.

The examples above show that it is possible to retrieve the complete Solvency 2 Data Point Model (and more) from the taxonomy in RDF and make it available in a Python environment. This allows incorporation of metadata in machine learning models and in NLP applications. I hope that this approach will allow more data scientists to use existing metadata from XBRL taxonomies.

Converting XBRL to RDF-star

Lately I have been working on the conversion of XBRL instances and related taxonomy schemas and linkbases to RDF and RDF-star. In these semantic data formats, you can link data in XBRL data with other data sources and you can query the data in a fairly easy manner. RDF-star is an extension of RDF that in some situations allows a more compact description of linked data, and by that it narrows the gap between RDF and property graphs. How this works, I will show in this blog using the XBRL taxonomy definitions as an example.

In a previous blog I showed that XBRL instance facts can be converted to RDF and visualized as a network. The same can be done with the related taxonomy elements. An XBRL taxonomy consists of concepts and relations between concepts that define calculations, presentations, labels and definitions. The concepts are laid down (mostly) in XML schemas and the relations in linkbases using XML schemas and XLinks. By converting the XBRL taxonomy to RDF, the XBRL fact data is linked to its corresponding metadata in the taxonomy.

XBRL to RDF

Some work has been done on the conversion of XBRL to RDF, most notably by Dave Raggett. His project xbrlimport, written in C++ and available on SourceForge, converts XBRL data to RDF triples. His approach is clean and straightforward and reuses the original namespaces of the XBRL data (with some obvious elements translated to predicates with RDF namespaces).

I used Raggett’s xbrlimport as a starting point, translated it to Python, added XBRL items that were introduced after publication of the code and improved a number of things. The code is now for example able to convert all EIOPA’s Solvency 2 taxonomy elements with all metadata available to RDF format. This code is available under the same license as xbrlimport (GNU General Public License) as a Python Package on pypi.org. You can take an XBRL instance with corresponding taxonomy (in the form of a zip-file) and convert the contents to RDF and RDF-star. This code will look up any references (URIs) in the XBRL instance to the taxonomy in the zip-file and convert the relevant files to RDF.

Let’s look at some examples of the Solvency 2 taxonomy converted to RDF. The RDF triples of an arbitrary XBRL concept from the Solvency 2 taxonomy look like this (in turtle format):

s2md_met:mi362 
    rdf:type xbrli:monetaryItemType ;
    xbrli:periodType """instant"""^^rdf:XMLLiteral ;
    model:creationDate """2014-07-07"""^^xsd:dateTime ;
    xbrli:substitutionGroup xbrli:item ;
    xbrli:nillable "true"^^xsd:boolean ;
    .

This example describes the triples of concept s2md_met:mi362 (a Solvency 2 metric). With these triples we have exactly the same data as in the related XML file but now in the form of triples. Namespaces are derived from the XML file (except rdf:type) and datatypes are transformed to RDF datatypes with proper RDF syntax.

This can be done with all concepts to which the facts of an XBRL instance refer. If you have the facts in RDF format, then these facts are automatically linked with the concepts in the taxonomy because the URIs of the concepts are the same. This creates a network of facts with all related metadata of the facts.

An XBRL taxonomy also contains links that relate concepts to each other for several purposes (to provide labels, definitions, presentations and calculations). An example of a link is the following.

_:link2 arcrole:concept-label [
    xl:type xl:link ;
    xl:role role2:link ;
    xl:from s2md_met:mi362 ;
    xl:to s2md_met:label_s2md_mi362 ;
    ] .

The link relates concept mi362 to label mi362 by creating a new subject _:link2 with predicate arcrole:concept-label and an object that contains all data about the link (including the xl:from and xl:to and the attributes of the link). This way of introducing a new subject to specify a link between two concepts is called reification. It is a bit artificial, because you would rather link the concept directly with the label, such as

s2md_met:mi362 arcrole:concept-label s2md_met:label_s2md_mi362

However, in RDF you are then unable to attach the attributes of the link (like the order and the role) to the predicate. This is one of the disadvantages of the current RDF format. There appears to be no easy way to do this in RDF other than by using this artificial reification approach (some other solutions exist, like the singleton property approach, but all of them have disadvantages).

The new RDF-star format

Recently, the RDF-star working group published their first Draft Community Report. In this report they introduced new RDF-star and SPARQL-star specifications. These new specifications, although not yet a W3C standard, enable a more compact specification of linked datasets, with simpler graphs and fewer nodes.

Let’s look at what this means for the XBRL linkbases with the following example. Suppose we have the following link definition.

_:link1 arcrole:breakdown-tree [
    xl:from _:s2md_a1 ;
    xl:to _:s2md_a1.root ;
    xl:type xl:link ;
    xl:role tab:S.01.01.02.01 ;
    xl:order "0"^^xsd:decimal ;
    ] .

The subject in this case is _:link1 with predicate arcrole:breakdown-tree, so this link describes a part of a table template. It points to a subject with all the information of the link, i.e. from, to, type, role and order from the xl namespace. Note that there is no triple with _:s2md_a1 (xl:from) as a subject and _:s2md_a1.root (xl:to) as an object. So if you want to know the relations of the concept _:s2md_a1 you need to look at the link triples and look for entries where xl:from equals the concept.

With the new RDF-star specifications you can just add the triple and then add properties to the triple as a whole, so the example would read

_:s2md_a1 arcrole:breakdown-tree _:s2md_a1.root .

<<_:s2md_a1 arcrole:breakdown-tree _:s2md_a1.root>> 
    xl:role tab:S.01.01.02.01 ;
    xl:order "0"^^xsd:decimal ;
    .

This is basically all we need to define. If you now want to know the relations of the subject _:s2md_a1, then you just look for triples with this subject. In the visual presentation of the RDF dataset you will see a direct link between the two concepts. This new RDF format also implies simplifications of the SPARQL queries.
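
The rewrite from a reified link to an RDF-star triple is mechanical. The sketch below shows it on the link from the example, working purely on strings; it is not how xbrl2rdf implements the conversion, and the dictionary layout and the function name `to_rdf_star` are assumptions for this illustration.

```python
def to_rdf_star(links):
    """Turn reified links into direct triples plus annotations on the quoted triple."""
    lines = []
    for link in links:
        triple = f"{link['from']} {link['arcrole']} {link['to']}"
        lines.append(triple + " .")
        # xl:from and xl:to are now expressed by the triple itself;
        # xl:type is left out, as in the RDF-star example above.
        annotations = [(p, o) for p, o in link["attributes"].items() if p != "xl:type"]
        if annotations:
            body = " ;\n".join(f"    {p} {o}" for p, o in annotations)
            lines.append(f"<<{triple}>>\n{body} .")
    return "\n".join(lines)


link1 = {
    "arcrole": "arcrole:breakdown-tree",
    "from": "_:s2md_a1",
    "to": "_:s2md_a1.root",
    "attributes": {
        "xl:type": "xl:link",
        "xl:role": "tab:S.01.01.02.01",
        "xl:order": '"0"^^xsd:decimal',
    },
}

rdf_star = to_rdf_star([link1])
print(rdf_star)
```

The intermediate subject _:link1 disappears, which is where the reduction in nodes comes from.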

This blog has become a bit technical, but I hope you see that the RDF-star specification allows a much-needed simplification of RDF triples. I showed that the conversion of XBRL taxonomies to RDF-star leads to a smaller number of triples and also to less complex triples. The resulting taxonomy triples lead to less complex graphs and can be used to derive the XBRL labels, template structures, validation rules and definitions, just by using SPARQL queries.

Europe’s insurance register linked to the GLEIF RDF dataset

Number 7 of my New Year’s Resolutions list reads “only use and provide linked data”. So, to start the year well, I decided to do a little experiment to combine insurance undertakings register data with publicly available legal entity data. In concrete terms, I wanted to provide the European insurance register published by EIOPA (containing all licensed insurance undertakings in Europe) as linked data with the Legal Entity data from the Global Legal Entity Identifier Foundation (GLEIF). In this blog I will describe what I have done to do so.

The GLEIF data

The GLEIF data consists of information on all legal entities in the world that have a Legal Entity Identifier (LEI). A LEI is required by any legal entity that is involved in financial transactions or operates within the financial system. If an organization needs a LEI, it requests one from a local registration agent. For the Netherlands these include the Authority for the Financial Markets (AFM), the Chamber of Commerce (KvK) and the tax authority. GLEIF receives data from these agents in each country and makes the collected LEI data available in a number of forms (API, CSV, etc.).

The really cool thing is that in 2019, together with data.world, GLEIF developed an RDFS/OWL Ontology for Legal Entities, and in 2020 began to publish the LEI data regularly as a linked RDF dataset on data.world (see https://data.world/gleif; you need a (free) account to obtain the data). At the time of this writing, the size of the level 1 data (specifying who is who) is around 10.2 GB with almost 92 million triples (subject-predicate-object), containing information about entity name, legal form, headquarters and legal address, geographical location, etc. Related data, such as who owns whom, is also published in this form.

The EIOPA insurance register

The European Supervisory Authority EIOPA publishes the Register of Insurance undertakings based on information provided by the National Competent Authorities (NCAs). The NCA in each member state is responsible for authorization and registration of the activities of insurance undertakings. EIOPA collects the data in the national registers and publishes a European insurance register, which includes more than 3,200 domestic insurance undertakings. The register contains entity data like international and commercial name, name of NCA, addresses, cross border status, registration dates, etc. Every insurance undertaking requires a LEI and the LEI is included in the register; this enables us to link the data easily to the GLEIF data.

The EIOPA insurance register is available as CSV and Excel file, without formal naming and clear definitions of column names. Linking the register data with other sources is a tedious job, because it must be done by hand. Take for example the LEI data in the register, which is referred to with the column name ‘LEI’; this is perfectly understandable for humans, but for computers this is just a string of characters. Now that the GLEIF has published its ontologies there is a proper way to refer to a LEI, and that is with the Uniform Resource Identifier (URI) https://www.gleif.org/ontology/L1/LEI, or in a short form gleif-L1:LEI.

The idea is to publish the data in the European insurance register in the same manner: as linked data in RDF format using, where applicable, the GLEIF ontology for legal entities, and creating an EIOPA ontology for the data that is unique to the insurance register. This allows users of the data to incorporate the insurance register data into the GLEIF RDF dataset, thereby extending the data available on the legal entities with the data from the insurance register.

Creating triples from the EIOPA register

To convert the EIOPA insurance register to linked data in RDF format, I did the following:

  • extract from the GLEIF RDF level 1 dataset on data.world all insurance undertakings and related data, based on the LEI in the EIOPA register;
  • create a provisional ontology with URIs based on https://www.eiopa.europe.eu/ontology/base (this should ideally be done by EIOPA, as they own the domain name in the URI);
  • transform, with this ontology, the data in the EIOPA register to triples (omitting all data from the EIOPA register that is already included in the GLEIF RDF dataset, like names and addresses);
  • publish the triples for the insurance register in RDF Turtle format on data.world.

Because I used the GLEIF ontology where applicable, the triples I created are automatically linked to the relevant data in the GLEIF dataset. Combining the insurance register dataset with the GLEIF RDF dataset results in a set where you have all the GLEIF level 1 data and all data in the EIOPA insurance register combined for all European insurance undertakings.
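
The transformation step can be sketched as follows. The CSV column names, the `eiopa-data:` prefix for the register identifier and the choice to store the values as plain literals are assumptions for this illustration (the real register has more columns and different column names); the predicates and the example values are the ones that appear in the query results in this blog.

```python
import csv
import io

# Hypothetical extract of the register with made-up column names.
REGISTER_CSV = """LEI,NCA,Identifier,CrossBorderStatus
72450067SU8C745IAV11,De Nederlandsche Bank,W1686,DomesticUndertaking
"""

# Assumed mapping from column names to predicates of the provisional ontology.
PREDICATES = {
    "NCA": "eiopa-base:hasNCA",
    "Identifier": "eiopa-base:hasInsuranceUndertakingID",
    "CrossBorderStatus": "eiopa-base:hasCrossBorderStatus",
}


def register_to_triples(csv_text):
    """Convert register rows to (subject, predicate, object) tuples."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # The entity URI is built from the LEI, so it coincides with the GLEIF subject.
        entity = "gleif-L1-data:L-" + row["LEI"]
        iuri = "eiopa-data:IURI-" + row["NCA"].replace(" ", "-") + "-" + row["Identifier"]
        triples.append((entity, "rdf:type", "eiopa-base:InsuranceUndertaking"))
        triples.append((entity, "eiopa-base:hasRegisterIdentifier", iuri))
        triples.append((iuri, "gleif-base:identifies", entity))
        for column, predicate in PREDICATES.items():
            triples.append((iuri, predicate, '"' + row[column] + '"'))
    return triples


triples = register_to_triples(REGISTER_CSV)
for s, p, o in triples:
    print(s, p, o, ".")
```

Because the subject is derived from the LEI in the same way as in the GLEIF dataset, no separate matching step is needed to connect the two sources.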

Querying the data

Let’s look at what we have in this combined dataset. Querying the data in RDF is done with the SPARQL language. Here is an example to return the data on Achmea Schadeverzekeringen.

SELECT DISTINCT ?p ?o
WHERE
{ ?s gleif-L1:hasLegalName "Achmea Schadeverzekeringen N.V." . 
  ?s ?p ?o .}

The query looks for triples where the predicate is gleif-L1:hasLegalName and the object is Achmea Schadeverzekeringen N.V. and returns all data of the subject that satisfies this constraint. This returns (where I omitted the prefix of the objects):

gleif-L1-data:L-72450067SU8C745IAV11
    rdf#type                             LegalEntity
    gleif-base:hasLegalJurisdiction      NL  
    gleif-base:hasEntityStatus           EntityStatusActive  
    gleif-L1:hasLegalName                Achmea Schadeverzekeringen     
                                         N.V.
    gleif-L1:hasLegalForm                ELF-B5PM
    gleif-L1:hasHeadquartersAddress      L-72450067SU8C745IAV11-LAL  
    gleif-L1:hasLegalAddress             L-72450067SU8C745IAV11-LAL  
    gleif-base:hasRegistrationIdentifier BID-RA000463-08053410
    rdf#type                             InsuranceUndertaking
    eiopa-base:hasRegisterIdentifier     IURI-De-Nederlandsche-Bank-
                                         W1686

We see that the rdf#type of this entity is LegalEntity (from the GLEIF data) and the jurisdiction is NL (this has a prefix that refers to the ISO 3166-1 country codes). The legal form refers to another subject called ELF-B5PM. The headquarters and legal address both refer to the same subject that contains the address data of this entity. Then there is a business identifier to the registration data. The last two lines are added by me: a triple to specify that this subject is not only a LegalEntity but also an InsuranceUndertaking (defined in the ontology), and a triple for the Insurance Undertaking Register Identifier (IURI) of this subject (also defined in the ontology).

Let’s look more closely at the references in this list. First the legal form of Achmea (i.e. the predicate and objects of legal form code ELF-B5PM). Included in the GLEIF data is the following (again omitting the prefix of the object):

rdf#type                         EntityLegalForm  
rdf#type                         EntityLegalFormIdentifier  
gleif-base:identifies            ELF-B5PM  
gleif-base:tag                   B5PM  
gleif-base:hasCoverageArea       NL  
gleif-base:hasNameTransliterated naamloze vennootschap  
gleif-base:hasNameLocal          naamloze vennootschap  
gleif-base:hasAbbreviationLocal  NV, N.V., n.v., nv

With the GLEIF data we have this data on all legal entity forms of insurance undertakings in Europe. The local abbreviations are particularly handy as they help us to link an entity’s name extracted from documents or other data sources with its corresponding LEI.
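
A small sketch of how the abbreviations could be used for such matching: the legal form is split off from the end of a name, after which two spellings of the same entity can be compared. Only the Dutch legal form from the example is included, and the function names `split_legal_form` and `same_entity` are made up for this illustration.

```python
import re

# Local abbreviations of legal form ELF-B5PM, as listed in the GLEIF data above.
ABBREVIATIONS = {"ELF-B5PM": ["NV", "N.V.", "n.v.", "nv"]}


def split_legal_form(name, abbreviations=ABBREVIATIONS):
    """Return (name without trailing legal form, legal form code or None)."""
    name = name.strip()
    for code, variants in abbreviations.items():
        for variant in variants:
            # The abbreviation must be the last word of the name.
            pattern = r"\s+" + re.escape(variant) + r"$"
            if re.search(pattern, name):
                return re.sub(pattern, "", name), code
    return name, None


def same_entity(name_a, name_b):
    """Compare two names, ignoring case and the spelling of the legal form."""
    base_a, form_a = split_legal_form(name_a)
    base_b, form_b = split_legal_form(name_b)
    return base_a.casefold() == base_b.casefold() and form_a == form_b


print(split_legal_form("Achmea Schadeverzekeringen N.V."))
print(same_entity("Achmea Schadeverzekeringen N.V.", "achmea schadeverzekeringen nv"))
```

The normalized name together with the legal form code can then be looked up in the GLEIF data to find the corresponding LEI.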

If we look more closely at the EIOPA Register Identifier IURI-De-Nederlandsche-Bank-W1686 then we find the register data of this Achmea entity:

owl:a                                      InsuranceUndertakingRegisterIdentifier
gleif-base:identifies                      L-72450067SU8C745IAV11
eiopa-base:hasNCA                          De Nederlandsche Bank  
eiopa-base:hasInsuranceUndertakingID       W1686  
eiopa-base:hasEUCountryWhereEntityOperates NL  
eiopa-base:hasCrossBorderStatus            DomesticUndertaking  
eiopa-base:hasRegistrationStartDate        23/12/1991 01:00:00  
eiopa-base:hasRegistrationEndDate          None  
eiopa-base:hasOperationStartDate           23/12/1991 01:00:00  
eiopa-base:hasOperationEndDate             None

The predicate gleif-base:identifies refers back to the subject where gleif-L1:hasLegalName equals the Achmea entity. The other predicates are based on the provisional ontology I made that contains the definitions of the attributes of the EIOPA insurance register. Here we see for example that W1686 is the identifier of this entity in DNB’s insurance register.

Let me give a tiny example of the advantage of using linked data. The GLEIF data contains the geographical location of all legal entities. With the combined dataset it is easy to obtain the location for the insurance undertakings in, for example, the Netherlands. This query returns entity names with latitude and longitude of the legal address of the entity.

SELECT DISTINCT ?name ?lat ?long
WHERE {?sub rdf:type eiopa-base:InsuranceUndertaking ;
            gleif-base:hasLegalJurisdiction CountryCodes:NL ;
            gleif-L1:hasLegalName ?name ;
            gleif-L1:hasLegalAddress/gleif-base:hasCity ?city .
       ?geo gleif-base:hasCity ?city ;
            geo:lat ?lat ; 
            geo:long ?long .}

This result can be plotted on a map, see the link below. If you click on one of the dots then the name of the insurance undertaking will appear.

All queries above and the code to make the map are included in the notebook EIOPA Register RDF datase – SPARQL queries.

The provisional ontology I created is not yet semantically correct and should be improved, for example by incorporating data on NCAs and providing formal definitions. Other data sources could also be added, for example the level 2 dataset to identify insurance groups, and the ISIN to LEI relations that are published daily by GLEIF.

By introducing the RDFS/OWL ontologies, the Global LEI Foundation has set an example of how to publish (financial) entity data in a useful manner. The GLEIF RDF dataset significantly reduces the time needed to link the data with other data sources. I hope other organizations that publish financial entity data as part of their mandate will follow that example.