Category Archives: Coding

Two new Python packages

Here is a short update about two new Python packages I have been working on. The first is about structuring Natural Language Processing projects (a subject I have been working on a lot recently) and the second is about rule mining in arbitrary datasets (in my case supervisory quantitative reports).

nafigator package

This package converts the content of (native) pdf documents, MS Word documents and html files into files that satisfy the NLP Annotation Format (NAF). It can use a default spaCy or Stanza pipeline, or a custom-made pipeline. The package creates a file in xml format with the first basic NAF layers, to which additional layers with annotations can be added. It is also possible to convert to RDF format (turtle syntax and rdf-xml syntax), which allows the content to be used in graph databases.
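To give an idea of the intended workflow, here is a minimal sketch of converting a pdf document to NAF. The function and argument names (generate_naf, engine, the file names) are written down from memory and should be treated as assumptions; the documentation on github has the actual interface.

# Minimal sketch of converting a pdf document to NAF with nafigator.
# NOTE: the function name generate_naf and its arguments are assumptions;
# see the nafigator documentation on github for the actual interface.
import nafigator

doc = nafigator.parse2naf.generate_naf(
    input="example.pdf",      # hypothetical input file
    engine="stanza",          # or "spacy"
    language="en",
    naf_version="v3.1",
)
doc.write("example.naf.xml")  # store the basic NAF layers as xml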

This is my approach for storing (intermediate) NLP results in a structured way. Conversion to NAF is done in only one computationally intensive step, and then the NAF files contain all necessary NLP data and can be processed very fast for further analyses. This allows you to use the results efficiently in downstream NLP projects. The NAF files also enable you to keep track of all annotation processes to make sure NLP results remain reproducible.

The NLP Annotation Format from VU University Amsterdam is a practical solution to store NLP annotations and relatively easy to implement. The main purpose of NAF is to provide a standardized format that, with a layered extensible structure, is flexible enough to be used in different NLP projects. The NAF standard has had a number of subsequent versions, and is still under development.

The idea of the format is that basically all NLP processing steps add annotations to a text. You start with the raw text (stored in the raw layer). This raw text is tokenized into pages, paragraphs, sentences and word forms, and the result is stored in a text layer. Annotations to each word form (like the lemmatized form, part-of-speech tag and morphological features) are stored in the terms layer. Each additional layer builds upon previous layers and adds more complex annotations to the text.
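As a rough illustration of this layered structure, the small sketch below lists the layers present in a NAF file. It only assumes that a NAF file is plain xml with one child element per layer under the root; the file name is hypothetical.

# List the annotation layers present in a NAF file.
# Assumes only that a NAF file is xml with one child element per layer.
from lxml import etree

tree = etree.parse("example.naf.xml")  # hypothetical file produced earlier
root = tree.getroot()

for layer in root:
    # e.g. nafHeader, raw, text, terms, ...
    print(layer.tag, "with", len(layer), "elements")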

More information about the NLP Annotation Format and nafigator can be found on github.

ruleminer package

Earlier I made some progress with mining datasets for rules and patterns (see two earlier blogs on this subject, here and here). New insights led to a completely new set-up of the code, which I have published as a Python package under the name ruleminer. The new package improves the data-patterns package in a number of ways:

  • The speed of the rule mining process has been improved significantly for if-then rules (at least six times faster); all candidate rules are now evaluated internally as NumPy expressions.
  • Additional evaluation metrics have been added to allow new ways to assess how interesting newly mined rules are. Besides support and confidence, we now also have the metrics lift, added value, casual confidence and conviction (illustrated in the sketch after this list). New metrics can be added relatively easily.
  • A rule pruning procedure has been added to delete, to some extent, rules that are semantically identical (for example, if both A=B and B=A are mined, one of them is pruned from the list).
  • A proper pyparsing grammar is now used for rule expressions.
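To make these metrics more concrete, here is a small illustration that computes them for a single if-then rule from boolean evaluations of its if-part and then-part. It uses the standard association-rule definitions and is not taken from the ruleminer source; casual confidence is left out for brevity, and the example data is made up.

# Illustrative computation of common rule metrics for one if-then rule,
# using standard association-rule definitions (not the ruleminer source).
import numpy as np

# boolean evaluation of the antecedent (if-part) and consequent (then-part)
# per row of a hypothetical dataset
antecedent = np.array([True, True, True, False, False, True])
consequent = np.array([True, True, False, True, False, True])

p_a = antecedent.mean()                     # P(A)
p_b = consequent.mean()                     # P(B)
support = (antecedent & consequent).mean()  # P(A and B)
confidence = support / p_a                  # P(B | A)
lift = confidence / p_b
added_value = confidence - p_b
conviction = (1 - p_b) / (1 - confidence) if confidence < 1 else np.inf

print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f} "
      f"added value={added_value:.2f} conviction={conviction:.2f}")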

More information about ruleminer can be found here on github.
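As a rough indication of how mining rules with the package could look, here is a minimal sketch. The RuleMiner interface and the template syntax shown below are my own recollection and should be treated as assumptions; the documentation on github has the actual usage.

# Rough sketch of mining rules from a DataFrame with ruleminer.
# NOTE: the RuleMiner interface and the template syntax are assumptions;
# see the ruleminer documentation on github for the actual usage.
import pandas as pd
import ruleminer

df = pd.DataFrame(
    {
        "Type": ["life", "life", "non-life", "non-life"],
        "TP-life": [100, 200, 0, 0],
        "TP-nonlife": [0, 0, 50, 80],
    }
)

# candidate rule template; column names are written between {" and "}
templates = [{"expression": 'if ({"Type"} == "life") then ({"TP-life"} > 0)'}]

miner = ruleminer.RuleMiner(templates=templates, data=df)
print(miner.rules)  # mined rules with their evaluation metrics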

Pilot Data Quality Rules

Data Quality is receiving more and more attention within the financial sector, and deservedly so. That’s why DNB will start a pilot with the insurance sector in September, enabling entities to run the required open source code locally and to evaluate Solvency 2 quantitative reports with our Data Quality Rules.

In the coming weeks we will:

With these tools you are able to assess the data quality of your Solvency 2 quantitative reports before submitting them to DNB. You can do that within your own data science environment.

We worked hard to make this as easy as possible; the only things you need are Anaconda / Jupyter Notebooks (Python) and Git to clone our repositories from GitHub (all free and open source software), and of course the data you want to check. We also provide code to evaluate the XBRL instance files.

We are planning workshops to explain how to use the code and validation rules and to go through the process step by step.

If you want to join or know more, please let me know (w.j.willemse at dnb.nl).

Code moved to GitHub/DNB

It has been a bit quiet on my side here, but I have been busy moving the source code from the notebooks on this website to the GitHub account of DNB. In doing so, we improved the code in many areas. Code development can now be done efficiently using tools for Continuous Integration and Deployment, which enables everyone to work with us to explore new ideas and to test, improve and expand the code.

Let me briefly introduce our repositories (all based on code that I wrote for the blogs on this website).

Our insurance repositories

solvency2-data

A package for retrieving Solvency 2 data. Our idea is to provide one package for all Solvency 2 data retrieval.

Currently it can download the risk-free interest rate term structures from the EIOPA website, so you don’t need to search for them yourself, and it implements the Smith-Wilson algorithm so you can make your own curves with different parameters.
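For readers unfamiliar with Smith-Wilson, here is a compact, illustrative implementation of the extrapolation from observed zero rates, based on the published formulas; it is a generic sketch with hypothetical parameter values and input, not the code in the repository.

# Illustrative Smith-Wilson extrapolation from observed zero rates.
# Generic sketch based on the published formulas, not the repository code.
import numpy as np

def smith_wilson(t_obs, r_obs, t_out, ufr=0.036, alpha=0.128):
    """Return extrapolated zero rates at maturities t_out (example parameters)."""
    omega = np.log(1 + ufr)
    t_obs = np.asarray(t_obs, dtype=float)
    t_out = np.asarray(t_out, dtype=float)

    def wilson(t, u):
        # Wilson kernel W(t_i, u_j) as a len(t) x len(u) matrix
        T, U = np.meshgrid(t, u, indexing="ij")
        m, M = np.minimum(T, U), np.maximum(T, U)
        return np.exp(-omega * (T + U)) * (alpha * m - np.exp(-alpha * M) * np.sinh(alpha * m))

    p_obs = (1 + np.asarray(r_obs)) ** -t_obs   # observed zero-coupon prices
    mu = np.exp(-omega * t_obs)                 # ultimate-forward-rate discount factors
    zeta = np.linalg.solve(wilson(t_obs, t_obs), p_obs - mu)

    p_out = np.exp(-omega * t_out) + wilson(t_out, t_obs) @ zeta
    return p_out ** (-1 / t_out) - 1

# example: a hypothetical liquid curve up to 20 years, extrapolated to 60 years
t_obs = np.arange(1, 21)
r_obs = 0.01 + 0.0005 * t_obs
print(smith_wilson(t_obs, r_obs, t_out=[30, 40, 50, 60]))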

This code is deployed as a package to the Python Package Index (a.k.a. the cheese shop), so you can install it with pip.

data-patterns

A package aimed at improving the data quality of your reports. With this code you can generate and evaluate patterns; we plan to publish validation and plausibility patterns in addition to the existing ones in the taxonomies, so you can evaluate your reports against them.

This package is also deployed to the Python Package Index.
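To sketch how generating and evaluating patterns could look, here is a minimal example. The PatternMiner interface and the pattern definition below are my own recollection and should be treated as assumptions; the example data is made up, and the documentation on github has the actual usage.

# Rough sketch of generating and evaluating patterns with data-patterns.
# NOTE: the PatternMiner interface shown here is an assumption;
# see the data-patterns documentation on github for the actual usage.
import pandas as pd
import data_patterns

df = pd.DataFrame(
    {
        "Name": ["Insurer A", "Insurer B", "Insurer C", "Insurer D"],
        "Assets": [100, 200, 150, 300],
        "Own funds": [20, 50, 0, 20],
        "Excess": [20, 50, 0, 20],
    }
)

miner = data_patterns.PatternMiner(df)

# find candidate patterns, e.g. columns whose values are equal across reports
patterns = miner.find({"name": "equal values",
                       "pattern": "=",
                       "parameters": {"min_confidence": 0.75, "min_support": 2}})
print(patterns)

# evaluate the same dataframe (or a new report) against the mined patterns
results = miner.analyze(df)
print(results)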

solvency2-data-science

A project with data science applications and tutorials on using the packages above. Currently, we have a data science tutorial using the public Solvency 2 data and a tutorial for the data-patterns package.

solvency2-nlp

Our experimental Natural Language Processing projects with Solvency 2 documents. I have already published some NLP results here (reading the Solvency 2 legislation documents, and Word2Vec and topic modelling with SFCR documents), and we are planning to provide these and other applications in this repository.
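As a generic illustration of the kind of experiment meant here (not the repository code), a Word2Vec model can be trained on tokenized report sentences with gensim; the example sentences below are made up.

# Generic illustration: training Word2Vec on tokenized report sentences with gensim.
# This is not the repository code; the example sentences are made up.
from gensim.models import Word2Vec

sentences = [
    ["solvency", "capital", "requirement"],
    ["technical", "provisions", "life", "insurance"],
    ["own", "funds", "and", "capital", "requirement"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

# words used in similar contexts end up with similar vectors
print(model.wv.most_similar("capital", topn=3))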

All repositories were made from cookiecutter templates, which is a very easy way to set up your projects.

Take a look at the repositories. If you have suggestions for further improvements or ideas for new features, do not hesitate to raise an issue on the GitHub-site. In the documentation of each repository you can find more information on how to contribute.