Nifigator is a Python package that provides a set of tools for natural language processing (NLP) analysis. Its main purpose is to store NLP data as RDF (Resource Description Framework) data that conforms to the NLP Interchange Format (NIF), which is a standard for representing NLP annotations in a machine-readable way. By storing NLP data in RDF/NIF format, it becomes possible to share and exchange the data with other applications or systems that can read RDF data.
Nifigator is built on top of RDFLib, a Python library for working with RDF data. With Nifigator, you can create linguistic annotations for text in RDFLib graphs, including part-of-speech tags and named entities, among others. Nifigator is also designed to follow the principles of Linked Data: it enables the creation of Linguistic Linked Open Data (LLOD) and helps make NLP data more accessible and queryable on the web (or on private webs).
You can find a short overview of the package functionality here. It shows how to create contexts (holding the content of documents) and collections of contexts, how to add linguistic annotations to them, and how to derive a graph from a collection that can be used with RDFLib.
To apply NIF in a practical NLP environment with PDF documents, a pipeline is provided to parse PDF documents and convert the text to NIF with the following elements:
- extraction of text and page offsets with pdfminer.six, including the deletion of (some) control characters and the correction of hyphenated words;
- text tokenization with the syntok package;
- NLP processing with Stanza (with conversion of Universal Dependencies to the linguistic data categories of OLiA); and
- conversion of data and annotations to NIF elements.
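The first two steps can be illustrated with a simplified stand-in that uses only the Python standard library. This sketch does not call pdfminer.six, syntok or Stanza, and the function names `dehyphenate` and `tokenize_with_offsets` are hypothetical; it only shows why the text is cleaned before tokenization and why each token keeps its character offsets, which the NIF conversion needs.

```python
import re


def dehyphenate(text: str) -> str:
    """Join words that were split by a hyphen at the end of a line."""
    # Naive rule: a word character, a hyphen, a newline, a word character.
    # It will also join genuine compounds that happen to break at a line end.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def tokenize_with_offsets(text: str) -> list[tuple[str, int, int]]:
    """Return (token, begin, end) triples for words and punctuation marks."""
    return [
        (match.group(), match.start(), match.end())
        for match in re.finditer(r"\w+|[^\w\s]", text)
    ]


raw = "Linguistic annota-\ntions are stored as RDF."
clean = dehyphenate(raw)
tokens = tokenize_with_offsets(clean)

for token, begin, end in tokens:
    # Each offset pair points back into the cleaned text.
    assert clean[begin:end] == token
    print(f"{begin:>3} {end:>3} {token}")
```

Offsets are computed on the cleaned text, so the hyphenation fix has to run before tokenization; otherwise the begin and end indices would no longer match the string stored in the NIF context.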
Furthermore, a simple setup for a SPARQL endpoint is added to make the data accessible and queryable, and a number of example queries on NIF data are included. There are also examples that show how to use the Lemon vocabulary, which represents lexical resources such as dictionaries and thesauri, together with NIF data.