Category Archives: Coding

Natural Language Processing in RDF graphs (2)


This is a follow-up to my blog on natural language processing in RDF graphs. Since then I have found a number of improvements and incorporated them into the Python packages.

NLP Interchange Format

As there are over fifty different NLP annotation formats available, it didn't seem a good idea to create yet another one. So instead of the self-made provisional ontology I used earlier, it is now possible to convert to and use the NLP Interchange Format (NIF) within the nafigator package. This ontology is different from NAF, but it has the advantage of being a mature ontology for which the W3C community has provided guidelines and best practices (see for example the Guidelines for Linked Data corpus creation using NIF). There are some Python packages doing similar things, but none of them are able to convert the content of PDF, docx and html files to NIF.

The annotations in NAF are stored in different layers. The data within each layer is stored as RDF triples in the following way:

raw layer       →  nif:Context
text layer      →  nif:Page, nif:Paragraph, nif:Sentence, nif:Word
terms layer     →  nif:Word
entities layer  →  nif:Phrase
deps layer      →  nif:Word
header          →  nif:Context

Mapping from NAF layers to NIF classes

You can see how this works out in an example here.
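To give an idea of the triples involved, here is a minimal rdflib sketch of a nif:Context (raw layer) and a nif:Sentence (text layer). The document URI, sentence and offsets are invented for illustration; this is not nafigator's literal output.

from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, XSD

NIF = Namespace("http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#")
DOC = Namespace("http://example.org/doc.pdf#")   # hypothetical document URI

g = Graph()
g.bind("nif", NIF)

text = "The insurer reports annually."

# raw layer -> nif:Context holding the full document text
context = DOC["offset_0_29"]
g.add((context, RDF.type, NIF.Context))
g.add((context, NIF.isString, Literal(text)))

# text layer -> a nif:Sentence anchored by character offsets in the context
sentence = DOC["offset_0_29_sent_1"]
g.add((sentence, RDF.type, NIF.Sentence))
g.add((sentence, NIF.referenceContext, context))
g.add((sentence, NIF.anchorOf, Literal(text)))
g.add((sentence, NIF.beginIndex, Literal(0, datatype=XSD.nonNegativeInteger)))
g.add((sentence, NIF.endIndex, Literal(29, datatype=XSD.nonNegativeInteger)))

print(g.serialize(format="turtle"))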

Ontolex-Lemon

Secondly, the Python package termate now allows termbases in TBX format to be converted to RDF with the Ontolex-Lemon ontology. This is based on another W3C document, Guidelines for Linguistic Linked Data Generation: Multilingual Terminologies (TBX), although I have implemented this for TBX version 3 instead of version 2, on which the guideline is based.

An example can be found here.
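As a rough sketch of the underlying pattern (the term and IRIs are invented, and this is not termate's exact output): a termbase concept with one English lexical entry that evokes it could be built with rdflib like this.

from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF

ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")
TB = Namespace("http://example.org/termbase#")   # hypothetical termbase namespace

g = Graph()
g.bind("ontolex", ONTOLEX)

# one concept from the termbase and an English term that evokes it
concept = TB["concept_1"]
g.add((concept, RDF.type, ONTOLEX.LexicalConcept))

entry = TB["technical_provisions-en"]
form = TB["technical_provisions-en-form"]
g.add((entry, RDF.type, ONTOLEX.LexicalEntry))
g.add((entry, ONTOLEX.canonicalForm, form))
g.add((form, RDF.type, ONTOLEX.Form))
g.add((form, ONTOLEX.writtenRep, Literal("technical provisions", lang="en")))
g.add((entry, ONTOLEX.evokes, concept))

print(g.serialize(format="turtle"))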

Explainable outlier detection with decision trees and ruleminer


This is a note on an extension of the ruleminer package that converts the content of decision trees into rules, providing an approach to unsupervised and explainable outlier detection.

Here is a way to use decision trees for unsupervised outlier detection. For an arbitrary data set (in the form of a dataframe), a decision tree is trained for each column, predicting that column while using the other columns as features. For target columns with integers a classifier is used, and for columns with floating point numbers a regressor is used. This provides an unsupervised approach resulting in decision trees that predict every column from the other columns.
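A minimal sketch of this per-column set-up, assuming a dataframe with numeric columns only (the helper function below is mine and not part of the ruleminer API):

# Sketch of the per-column set-up described above
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

def fit_tree_per_column(df: pd.DataFrame, max_depth: int = 2) -> dict:
    trees = {}
    for target in df.columns:
        X = df.drop(columns=[target])  # all other columns are the features
        y = df[target]
        if pd.api.types.is_integer_dtype(y):
            model = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        else:
            model = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        trees[target] = model.fit(X, y)
    return trees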

The paths in the decision trees basically contain data patterns and rules that are embedded in the data set. These paths can therefore be treated as association rules and applied within the framework of association rule mining. Outliers are the values that could not be predicted accurately with the decision trees; these are the exceptions to the rules. The resulting rules are in a human-readable format, so this provides transparent and explainable rules representing patterns in the data set.

Training and using decision trees can of course easily be done with scikit-learn. What I have added to the ruleminer package is source code to extract decision paths from arbitrary scikit-learn decision trees (classifiers and regressors) and convert them into rules in a ruleminer object. The extracted rules can then be treated like any other set of rules and can be applied to other data sets to calculate rule metrics and find outliers.

Example with the iris data

Here is an example. I ran an AdaBoostClassifier on the iris data set from the scikit-learn package and fitted an ensemble of 25 trees with depth 2 (this will provide if-then rules where the antecedent of the rule contains a maximum of two conditions):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

# iris features (X) and target (Y); strip the parentheses from the feature
# names so they match the rule definitions shown below
X, Y = load_iris(return_X_y=True, as_frame=True)
X.columns = [c.replace("(", "").replace(")", "") for c in X.columns]

base, estimator = DecisionTreeClassifier, AdaBoostClassifier

regressor = estimator(
    base_estimator=base(   # 'estimator=' in newer scikit-learn versions
        random_state=0,
        max_depth=2),
    n_estimators=25,
    random_state=0)
regressor = regressor.fit(X, Y)

Here X contains the features of the iris data set and Y the target. We now have an ensemble (or forest) of decision trees. The first decision tree in the ensemble looks like this:

The first line in the white (non-leaf) nodes contains the conditions of the rules. To extract the rules from this tree, I have provided a utility function in the ruleminer package that can be used in the following way:

import ruleminer

# derive rule expressions from the first tree in the ensemble
ruleminer.tree_to_expressions(regressor[0], features, target)

This results in the following set of rules (in the syntax of ruleminer rules):

{'if (({"petal width cm"} <= 0.8)) then ({"target"} == 0)',
 'if (({"petal width cm"} > 0.8) & ({"petal width cm"} <= 1.75)) then ({"target"} == 1)',
 'if (({"petal width cm"} > 0.8) & ({"petal width cm"} > 1.75)) then ({"target"} == 2)'}

For each leaf in the tree the decision path is converted to a rule; each node on the path contributes one condition to the rule. To get the best decision tree in an ensemble (according to, for example, the highest absolute support), we generate a miner per decision tree in the ensemble:

# dataframe: the iris features plus a "target" column
ensemble_exprs = ruleminer.fit_ensemble_and_extract_expressions(
    dataframe,
    target="target",
    max_depth=2)

# one miner per decision tree in the ensemble
miners = [
    ruleminer.RuleMiner(
        templates=[{'expression': expr} for expr in exprs],
        data=dataframe)
    for exprs in ensemble_exprs
]

From this we extract the miner with the highest absolute support:

max(miners, key=lambda miner: miner.rules['abs support'].sum())

resulting in the following rule output:

idx  id  group  definition                                                                   status  abs support  abs exceptions  confidence  encodings
  0   0      0  if({"petal width cm"}<=0.8)then({"target"}==0)                                                50               0    1.000000         {}
  1   1      0  if(({"petal width cm"}>0.8)&({"petal width cm"}>1.75))then({"target"}==2)                     45               1    0.978261         {}
  2   2      0  if(({"petal width cm"}>0.8)&({"petal width cm"}<=1.75))then({"target"}==1)                    49               5    0.907407         {}

With a maximum depth of two, three rules are derived that are confirmed by 144 samples in the data set, while six samples were found that do not satisfy the rules (outliers).
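Under the hood, converting a fitted scikit-learn tree into rules amounts to walking its node arrays and collecting the split conditions along each path to a leaf. Below is a rough, simplified sketch of that idea; the function is mine, and ruleminer's tree_to_expressions additionally handles regressors and produces its own rule syntax.

import numpy as np

def tree_to_rules(fitted_tree, feature_names):
    """Collect one if-then rule per leaf of a fitted classification tree."""
    t = fitted_tree.tree_
    rules = []

    def recurse(node, conditions):
        if t.children_left[node] == -1:          # leaf: emit the collected path
            predicted = int(np.argmax(t.value[node][0]))
            antecedent = " & ".join(conditions) if conditions else "True"
            rules.append(f"if ({antecedent}) then (target == {predicted})")
        else:                                    # split node: add its condition
            name = feature_names[t.feature[node]]
            threshold = t.threshold[node]
            recurse(t.children_left[node], conditions + [f"({name} <= {threshold:.2f})"])
            recurse(t.children_right[node], conditions + [f"({name} > {threshold:.2f})"])

    recurse(0, [])
    return rules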

Example with insurance undertakings data

If we apply this to the example data set I have used earlier:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame(
    columns=[
        "Name",
        "Type",
        "Assets",
        "TP-life",
        "TP-nonlife",
        "Own funds",
        "Excess",
    ],
    data=[
        ["Insurer1", "life insurer", 1000.0, 800.0, 0.0, 200.0, 200.0],
        ["Insurer2", "non-life insurer", 4000.0, 0.0, 3200.0, 800.0, 800.0],
        ["Insurer3", "non-life insurer", 800.0, 0.0, 700.0, 100.0, 100.0],
        ["Insurer4", "life insurer", 2500.0, 1800.0, 0.0, 700.0, 700.0],
        ["Insurer5", "non-life insurer", 2100.0, 0.0, 2200.0, 200.0, 200.0],
        ["Insurer6", "life insurer", 9001.0, 8701.0, 0.0, 300.0, 200.0],
        ["Insurer7", "life insurer", 9002.0, 8802.0, 0.0, 200.0, 200.0],
        ["Insurer8", "life insurer", 9003.0, 8903.0, 0.0, 100.0, 200.0],
        ["Insurer9", "non-life insurer", 9000.0, 8850.0, 0.0, 150.0, 200.0],
        ["Insurer10", "non-life insurer", 9000.0, 0, 8750.0, 250.0, 199.99],
    ],
)
df.index.name = "id"
df[['Type']] = OrdinalEncoder(dtype=int).fit_transform(df[['Type']])
df[['Name']] = OrdinalEncoder(dtype=int).fit_transform(df[['Name']])

The last two lines convert the string columns to integers so they can be used in the decision trees. We then fit an ensemble of trees with a maximum depth of one (to generate the simplest rules):

expressions = ruleminer.fit_dataframe_to_ensemble(df, max_depth = 1)

This results in 41 expressions that we can evaluate with ruleminer, selecting the rules that have a confidence of at least 75% and an absolute support of at least two:

templates = [{'expression': solution} for solution in expressions]
params = {
    "filter": {'confidence': 0.75, 'abs support': 2},
    "metrics": ['confidence', 'abs support', 'abs exceptions']
}
r = ruleminer.RuleMiner(templates=templates, data=df, params=params)

This results in the following rules with metrics. The data error that we added in advance (insurer 9 is a non-life undertaking reporting life technical provisions) is indeed found:

idx  id  group  definition                                       status  abs support  abs exceptions  confidence  encodings
  0   0      0  if({"TP-life"}>400.0)then({"TP-nonlife"}==0.0)                      6               0    1.000000         {}
  1   1      0  if({"TP-life"}<=400.0)then({"Type"}==1)                             4               0    1.000000         {}
  2   2      0  if({"TP-life"}>400.0)then({"Type"}==0)                              5               1    0.833333         {}

There is a relationship between the constraints set on the decision tree and the structure and metrics of the resulting rules. The example above showed the simplest rules with a maximum depth of one, i.e. only one condition in the if-part of the rule. It is also possible to set the decision tree parameter min_samples_leaf to guarantee a minimum number of samples in a leaf. In association rules terminology this corresponds to selecting rules with a certain maximum absolute support or exception count. Setting the minimum samples per leaf to one results in rules with a maximum number of exceptions of one, and this yields in our case the same results as a maximum depth of one:

expressions = ruleminer.fit_dataframe_to_ensemble(df, min_samples_leaf = 1)

The parameter min_weight_fraction_leaf allows you to require a weighted minimum fraction of the input samples to be at a leaf node. This might be applicable in cases where the samples in the data set carry weights (or levels of importance).
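As an illustration with plain scikit-learn, weighing the insurers from the example above by balance sheet total (the weighting choice is made up for this illustration, and whether fit_dataframe_to_ensemble passes this parameter through is an assumption I leave aside here):

# Plain scikit-learn illustration: weigh the insurers by balance sheet total
# and require each leaf to hold at least 10% of the total weight
from sklearn.tree import DecisionTreeClassifier

weights = df["Assets"] / df["Assets"].sum()
tree = DecisionTreeClassifier(min_weight_fraction_leaf=0.10, random_state=0)
tree.fit(df.drop(columns=["Type"]), df["Type"], sample_weight=weights)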

The rules that can be found with ensembles of decision trees are all of the form "if A and B and … then C", where A, B and C are conditions. Rules containing numerical operations and rules with complex conditions in the consequent cannot be found in this way. Furthermore, if the maximum depth of the decision tree is too large, the resulting rules, although still human readable, become less explainable. They might however point to unknown exceptions in the data that could not be captured with the supervised approach (predefined expressions with regexes). Taking these drawbacks into account, this approach has the potential to be used on larger data sets.

The notebook of the examples above can be found here.

Two new Python packages


Here is a short update about two new Python packages I have been working on. The first is about structuring Natural Language Processing projects (a subject I have been working on a lot recently) and the second is about rule mining in arbitrary datasets (in my case supervisory quantitative reports).

nafigator package

This package converts the content of (native) PDF documents, MS Word documents and HTML files into files that satisfy the NLP Annotation Format (NAF). It can use a default spaCy or Stanza pipeline or a custom-made pipeline. The package creates a file in XML format with the first basic NAF layers, to which additional layers with annotations can be added. It is also possible to convert to RDF (Turtle syntax and RDF/XML syntax). This allows the use of this content in graph databases.

This is my approach for storing (intermediate) NLP results in a structured way. Conversion to NAF is done in only one computationally intensive step, and then the NAF files contain all necessary NLP data and can be processed very fast for further analyses. This allows you to use the results efficiently in downstream NLP projects. The NAF files also enable you to keep track of all annotation processes to make sure NLP results remain reproducible.

The NLP Annotation Format from VU University Amsterdam is a practical solution to store NLP annotations and relatively easy to implement. The main purpose of NAF is to provide a standardized format that, with a layered extensible structure, is flexible enough to be used in different NLP projects. The NAF standard has had a number of subsequent versions, and is still under development.

The idea of the format is that basically all NLP processing steps add annotations to a text. You start with the raw text (stored in the raw layer). This raw text is tokenized into pages, paragraphs, sentences and word forms, and the result is stored in a text layer. Annotations to each word form (like the lemmatized form, part-of-speech tag and morphological features) are stored in the terms layer. Each additional layer builds upon previous layers and adds more complex annotations to the text.
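As a toy illustration of this layered idea (heavily simplified, and not actual nafigator output), a NAF document with a raw, text and terms layer for a single word form could be built like this:

import xml.etree.ElementTree as ET

naf = ET.Element("NAF", version="3.0")

# raw layer: the original text
raw = ET.SubElement(naf, "raw")
raw.text = "The insurer reports annually."

# text layer: word forms with character offsets
text_layer = ET.SubElement(naf, "text")
wf = ET.SubElement(text_layer, "wf", id="w1", sent="1", para="1", offset="0", length="3")
wf.text = "The"

# terms layer: annotations per word form, referring back to the text layer
terms = ET.SubElement(naf, "terms")
term = ET.SubElement(terms, "term", id="t1", lemma="the", pos="DET")
span = ET.SubElement(term, "span")
ET.SubElement(span, "target", id="w1")

print(ET.tostring(naf, encoding="unicode"))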

More information about the NLP Annotation Format and nafigator can be found on GitHub.

ruleminer package

Earlier I made some progress with mining datasets for rules and patterns (see two earlier blogs on this subject here and here). New insights led to a completely new set-up of the code, which I have published as a Python package under the name ruleminer. The new package improves on the data-patterns package in a number of ways:

  • The speed of the rule mining process has improved significantly for if-then rules (at least six times faster). All candidate rules are now internally evaluated as NumPy expressions (a sketch of this idea follows after this list).
  • Additional evaluation metrics have been added to allow new ways to assess how interesting newly mined rules are. Besides support and confidence, we now also have the metrics lift, added value, causal confidence and conviction. New metrics can be added relatively easily.
  • A rule pruning procedure has been added to delete, to some extent, rules that are semantically identical (for example, if A=B and B=A are mined then one of them is pruned from the list).
  • A proper Pyparsing Grammar for rule expressions is used.
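To illustrate the first point: evaluating an if-then rule on a dataframe essentially comes down to a couple of vectorized boolean masks per rule. Here is a minimal sketch of that idea with a made-up two-column dataframe; this is not ruleminer's actual internal code.

import numpy as np
import pandas as pd

df = pd.DataFrame({"TP-life": [800.0, 0.0, 1800.0, 8850.0],
                   "Type":    [0, 1, 0, 1]})

# rule: if ({"TP-life"} > 400.0) then ({"Type"} == 0)
antecedent = df["TP-life"].values > 400.0
consequent = df["Type"].values == 0

abs_support = int(np.sum(antecedent & consequent))      # rows confirming the rule
abs_exceptions = int(np.sum(antecedent & ~consequent))  # rows violating the rule
confidence = abs_support / np.sum(antecedent)

print(abs_support, abs_exceptions, confidence)          # 2 1 0.666...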

More information on ruleminer can be found here on GitHub.

Pilot Data Quality Rules


Data Quality is receiving more and more attention within the financial sector, and deservedly so. That's why DNB will start a pilot in September with the insurance sector, enabling entities to run the required open source code locally and to evaluate their Solvency 2 quantitative reports with our Data Quality Rules.

In the coming weeks we will:

With these tools you are able to assess the data quality of your Solvency 2 quantitative reports before submitting them to DNB. You can do that within your own data science environment.

We worked hard to make this as easy as possible; the only things you need are Anaconda / Jupyter Notebooks (Python) and Git to clone our repositories from GitHub (all free and open source software), and of course the data you want to check. We also provide code to evaluate the XBRL instance files.

We are planning workshops to explain how to use the code and validation rules and to go through the process step by step.

If you want to join or know more, please let me know (w.j.willemse at dnb.nl).

Code moved to GitHub/DNB


It has been a bit quiet here on my side, but I have been busy moving the source code from the notebooks here to the GitHub account of DNB, and in doing so we improved the code in many areas. Code development can now be done efficiently by using tools for Continuous Integration and Deployment. This enables everyone to work with us to explore new ideas and to test, improve and expand the code.

Let me briefly introduce our repositories (these are all based on code that I wrote for the blogs on this website).

Our insurance repositories

solvency2-data

A package for retrieving Solvency 2 data. Our idea is to provide you with one package for all Solvency 2 data retrieval.

Currently it is able to download the Risk Free Interest Rate Term Structures from the EIOPA website, so you don't need to search the EIOPA website yourself. It also implements the Smith-Wilson algorithm, so you can make your own curves with different parameters.

This code is deployed as a package to the Python Package Index (a.k.a. the cheese shop), so you can install it with pip.

data-patterns

A package aimed at improving the data quality of your reports. With this code you can generate and evaluate patterns, and we plan to publish validation and plausibility patterns in addition to the existing ones in the taxonomies, so you can evaluate your reports against them with this package.

This package is also deployed to the Python Package Index.

solvency2-data-science

A project with data science applications and tutorials on using the packages above. Currently, we have a data science tutorial using the public Solvency 2 data and a tutorial for the data-patterns package.

solvency2-nlp

Our experimental Natural Language Processing projects with Solvency 2 documents. I have already published some NLP results here (reading the Solvency 2 legislation documents, and Word2Vec and topic modelling with SFCR documents), and we are planning to provide these and other applications in this repository.

All repositories were made from cookiecutter templates, which is a very easy way to set up your projects.

Take a look at the repositories. If you have suggestions for further improvements or ideas for new features, do not hesitate to raise an issue on the GitHub site. In the documentation of each repository you can find more information on how to contribute.