This is a follow-up to my earlier blog post on natural language processing in RDF graphs. Since then I have found a number of improvements and incorporated them into the Python packages.
NLP Interchange Format
As there are over fifty different NLP annotation formats available, it did not seem a good idea to create yet another one. So instead of the self-made provisional ontology I used earlier, the Python package nifigator now makes it possible to convert documents to, and work with, the NLP Interchange Format (NIF). The package also includes a processing pipeline for PDF documents.
This ontology differs from NAF but has the advantage of being a mature ontology for which the W3C community has provided guidelines and best practices (see for example the Guidelines for Linked Data corpus creation using NIF). There are some Python packages that do similar things, but none of them can convert the content of PDF, docx and html documents to NIF.
In NAF the annotations are stored in different layers. The data within each layer is represented in RDF triples as follows:
| NAF layer | NIF classes |
| --- | --- |
| raw layer | nif:Context |
| text layer | nif:Page, nif:Paragraph, nif:Sentence, nif:Word |
| terms layer | nif:Word |
| entities layer | nif:Phrase |
| deps layer | nif:Word |
| header | nif:Context |
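To make this mapping concrete, here is a small sketch of what the resulting triples look like. It uses plain rdflib rather than the nifigator API (whose calls I will not reproduce here), and the URIs and sample string are made up for illustration: the raw layer becomes a nif:Context holding the full string, and a sentence from the text layer becomes a nif:Sentence anchored to that context by character offsets.

```python
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF, XSD

# NIF Core namespace
NIF = Namespace("http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#")

g = Graph()
g.bind("nif", NIF)

text = "The cat sat on the mat."

# Raw layer: a nif:Context that holds the full document string
context = URIRef("https://example.org/doc_1#offset_0_23")
g.add((context, RDF.type, NIF.Context))
g.add((context, NIF.isString, Literal(text, datatype=XSD.string)))
g.add((context, NIF.beginIndex, Literal(0, datatype=XSD.nonNegativeInteger)))
g.add((context, NIF.endIndex, Literal(len(text), datatype=XSD.nonNegativeInteger)))

# Text layer: a nif:Sentence anchored in the context via offsets
sentence = URIRef("https://example.org/doc_1#offset_0_23_sentence")
g.add((sentence, RDF.type, NIF.Sentence))
g.add((sentence, NIF.referenceContext, context))
g.add((sentence, NIF.anchorOf, Literal(text, datatype=XSD.string)))
g.add((sentence, NIF.beginIndex, Literal(0, datatype=XSD.nonNegativeInteger)))
g.add((sentence, NIF.endIndex, Literal(len(text), datatype=XSD.nonNegativeInteger)))

print(g.serialize(format="turtle"))
```

The other rows in the table work in the same way: each annotation becomes a resource typed with the corresponding NIF class and anchored to the context by its offsets.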
Ontolex-Lemon
Secondly, the Python package termate now allows termbases in the TBX format to be converted to RDF using the Ontolex-Lemon ontology. This is based on another W3C document, Guidelines for Linguistic Linked Data Generation: Multilingual Terminologies (TBX) (although I have implemented this for TBX version 3 instead of version 2, on which the guidelines are based).
An example can be found here.
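For readers unfamiliar with Ontolex-Lemon, the sketch below shows roughly how a single term can be represented as an ontolex:LexicalEntry with a canonical form. It again uses plain rdflib and is not the actual termate output; the URIs and the term are made up for illustration.

```python
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF

# Ontolex-Lemon core namespace
ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")

g = Graph()
g.bind("ontolex", ONTOLEX)

# A made-up lexical entry for the English term "solvency ratio"
entry = URIRef("https://example.org/termbase/solvency-ratio-en")
form = URIRef("https://example.org/termbase/solvency-ratio-en#canonicalForm")

g.add((entry, RDF.type, ONTOLEX.LexicalEntry))
g.add((entry, ONTOLEX.canonicalForm, form))

g.add((form, RDF.type, ONTOLEX.Form))
g.add((form, ONTOLEX.writtenRep, Literal("solvency ratio", lang="en")))

print(g.serialize(format="turtle"))
```

A full termbase conversion would add more than this, for example linking entries in different languages to a shared concept, but the lexical entry with its written representation is the basic building block.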