Here is a short update about two new Python packages I have been working on. The first is about structuring Natural Language Processing projects (a subject I have been working on a lot recently) and the second is about rule mining in arbitrary datasets (in my case supervisory quantitative reports).
This packages converts the content of (native) pdf documents, MS Word documents and html files into files that satisfy the NLP Annotation Format (NAF). It can use a default spaCy or Stanza pipeline or a custom made pipeline. The package creates a file in xml-format with the first basic NAF layers to which additional layers with annotations can be added. It is also possible to convert to RDF format (turtle-syntax and rdf-xml-syntax). This allows the use of this content in graph databases.
This is my approach for storing (intermediate) NLP results in a structured way. Conversion to NAF is done in only one computationally intensive step, and then the NAF files contain all necessary NLP data and can be processed very fast for further analyses. This allows you to use the results efficiently in downstream NLP projects. The NAF files also enable you to keep track of all annotation processes to make sure NLP results remain reproducible.
The NLP Annotation Format from VU University Amsterdam is a practical solution to store NLP annotations and relatively easy to implement. The main purpose of NAF is to provide a standardized format that, with a layered extensible structure, is flexible enough to be used in different NLP projects. The NAF standard has had a number of subsequent versions, and is still under development.
The idea of the format is that basically all NLP processing steps add annotations to a text. You start with the raw text (stored in the raw layer). This raw text is tokenized into pages, paragraphs, sentences and word forms and the result is stored in a text layer. Annotations to each word form (like lemmatized form, part-of-speech tag and morphological features) are stored in the terms layers. Each additional layer builds upon previous layers and adds more complex annotations to the text.
See for more information about the NLP Annotation Format and Nafigator on github.
Earlier, I have already made some progress with mining datasets for rules and patterns (see for two earlier blogs on this subject here and here). New insights led to a complete new set-up of code that I have published as a Python package under the name ruleminer. The new package improves the data-patterns package in a number of ways:
- The speed of the rule mining process is improved significantly for if-then rules (at least six times faster). All candidate rules are now internally evaluated as Numpy expressions.
- Additional evaluation metrics have been added to allow new ways to assess how interesting newly mined rules are. Besides support and confidence, we now also have the metrics lift, added value, casual confidence and conviction. New metrics can be added relatively easy.
- A rule pruning procedure has been added to delete, to some extent, rules that are semantically identical (for example, if A=B and B=A are mined then one of them is pruned from the list).
- A proper Pyparsing Grammar for rule expressions is used.
Look for more information on ruleminer here on github.