In our first Natural Language Processing project we will read the Solvency II legislation from the website of the European Union and extract the text within the articles by using regular expressions.
For this notebook, we have chosen the text of the Delegated Acts of Solvency II. This part of the Solvency II regulation is directly in force (because it is a Regulation), and the wording of the Delegated Acts is more detailed than that of the Solvency II Directive, as well as very precise and internally consistent. This makes it suitable for NLP. From the text we can extract features and text data on Solvency II for our future projects.
The code of this notebook can be found here.
Step 1: Data retrieval
We use several packages to retrieve and process the PDFs: requests to download them, fitz to read them, and re (regular expressions) to clean the text data.
import os
import re
import requests
import fitz
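Note that the fitz module is provided by the PyMuPDF package (fitz is its historical import name); if it is missing, it can be installed with pip install PyMuPDF.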
We want to read the Delegated Acts in all available languages. The official languages of the European Union are Bulgarian (BG), Spanish (ES), Czech (CS), Danish (DA), German (DE), Estonian (ET), Greek (EL), English (EN), French (FR), Croatian (HR), Italian (IT), Latvian (LV), Lithuanian (LT), Hungarian (HU), Maltese (MT), Dutch (NL), Polish (PL), Portuguese (PT), Romanian (RO), Slovak (SK), Slovenian (SL), Finnish (FI) and Swedish (SV).
languages = ['BG','ES','CS','DA','DE','ET','EL',
             'EN','FR','HR','IT','LV','LT','HU',
             'MT','NL','PL','PT','RO','SK','SL',
             'FI','SV']
The urls of the Delegated Acts of Solvency II for these languages are constructed with the following list comprehension.
urls = ['https://eur-lex.europa.eu/legal-content/' + lang +
        '/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN'
        for lang in languages]
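To check the result, we can print the constructed url for English:

print(urls[languages.index('EN')])

This outputs https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN.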
The following for loop retrieves the PDFs of the Delegated Acts from the website of the European Union and stores them in da_path.
da_path = 'data/solvency ii/'
for index in range(len(urls)):
    filename = 'Solvency II Delegated Acts - ' + languages[index] + '.pdf'
    if not os.path.isfile(da_path + filename):
        r = requests.get(urls[index])
        with open(da_path + filename, 'wb') as f:
            f.write(r.content)
    else:
        print("--> already downloaded.")
Step 2: Data cleaning
If you look at the PDFs, you see that each page has a header with the page number and information about the legislation and the language. These headers must be removed before we can access the text of the articles. For this we need the title of the Official Journal in each language.
DA_dict = dict({
'BG': 'Официален вестник на Европейския съюз',
'CS': 'Úřední věstník Evropské unie',
'DA': 'Den Europæiske Unions Tidende',
'DE': 'Amtsblatt der Europäischen Union',
'EL': 'Επίσημη Εφημερίδα της Ευρωπαϊκής Ένωσης',
'EN': 'Official Journal of the European Union',
'ES': 'Diario Oficial de la Unión Europea',
'ET': 'Euroopa Liidu Teataja',
'FI': 'Euroopan unionin virallinen lehti',
'FR': "Journal officiel de l'Union européenne",
'HR': 'Službeni list Europske unije',
'HU': 'Az Európai Unió Hivatalos Lapja',
'IT': "Gazzetta ufficiale dell'Unione europea",
'LT': 'Europos Sąjungos oficialusis leidinys',
'LV': 'Eiropas Savienības Oficiālais Vēstnesis',
'MT': 'Il-Ġurnal Uffiċjali tal-Unjoni Ewropea',
'NL': 'Publicatieblad van de Europese Unie',
'PL': 'Dziennik Urzędowy Unii Europejskiej',
'PT': 'Jornal Oficial da União Europeia',
'RO': 'Jurnalul Oficial al Uniunii Europene',
'SK': 'Úradný vestník Európskej únie',
'SL': 'Uradni list Evropske unije',
'SV': 'Europeiska unionens officiella tidning'})
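To illustrate how these titles are used, here is a minimal sketch of the header pattern for English; the sample string below is made up for illustration, but it has the same shape as the page headers in the pdf:

sample = '17.1.2015 L 12/1 Official Journal of the European Union EN some article text'
header = '17.1.2015\\s+L\\s+\\d+/\\d+\\s+' + DA_dict['EN'].replace(' ', '\\s+') + '\\s+EN\\s+'
print(re.sub(header, '', sample))  # prints: some article text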
The following code reads the PDFs, deletes the headers from all pages and saves the cleaned text to a .txt file.
DA = dict()
files = [f for f in os.listdir(da_path) if os.path.isfile(os.path.join(da_path, f))]
for language in languages:
    if "Delegated_Acts_" + language + ".txt" not in files:
        # reading pages from pdf file
        da_pdf = fitz.open(da_path + 'Solvency II Delegated Acts - ' + language + '.pdf')
        da_pages = [page.getText(output="text") for page in da_pdf]
        da_pdf.close()
        # deleting page headers
        header = "17.1.2015\\s+L\\s+\\d+/\\d+\\s+" + DA_dict[language].replace(' ','\\s+') + "\\s+" + language + "\\s+"
        da_pages = [re.sub(header, '', page) for page in da_pages]
        DA[language] = ''.join(da_pages)
        # some preliminary cleaning -> could be more
        DA[language] = DA[language].replace('\xad ', '')
        # saving txt file
        da_txt = open(da_path + "Delegated_Acts_" + language + ".txt", "wb")
        da_txt.write(DA[language].encode('utf-8'))
        da_txt.close()
    else:
        # loading txt file
        da_txt = open(da_path + "Delegated_Acts_" + language + ".txt", "rb")
        DA[language] = da_txt.read().decode('utf-8')
        da_txt.close()
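As a quick sanity check, we can print the length of each cleaned text; the numbers should be roughly comparable across languages:

for language in languages:
    print(language, len(DA[language]))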
Step 3: Retrieving the text within the articles
Retrieving the text within the articles is not straightforward. In English we have 'Article 1 some text', i.e. the word 'Article' is put before the number. But some European languages put the word after the number (in Finnish, article 1 reads '1 artikla'), and two languages, HU and LV, also put a dot between the number and the word (in Hungarian, '1. cikk'). To be able to read the text within the articles we need to know this ordering (and, of course, the word for article in every language).
art_dict = dict({
'BG': ['Член', 'pre'],
'CS': ['Článek', 'pre'],
'DA': ['Artikel', 'pre'],
'DE': ['Artikel', 'pre'],
'EL': ['Άρθρο', 'pre'],
'EN': ['Article', 'pre'],
'ES': ['Artículo', 'pre'],
'ET': ['Artikkel', 'pre'],
'FI': ['artikla', 'post'],
'FR': ['Article', 'pre'],
'HR': ['Članak', 'pre'],
'HU': ['cikk', 'postdot'],
'IT': ['Articolo', 'pre'],
'LT': ['straipsnis','post'],
'LV': ['pants', 'postdot'],
'MT': ['Artikolu', 'pre'],
'NL': ['Artikel', 'pre'],
'PL': ['Artykuł', 'pre'],
'PT': ['Artigo', 'pre'],
'RO': ['Articolul', 'pre'],
'SK': ['Článok', 'pre'],
'SL': ['Člen', 'pre'],
'SV': ['Artikel', 'pre']})
Next we define a function that builds a regex to select the text within an article: it matches everything between the heading of the article and the heading of the next one.
def retrieve_article(language, article_num):
    article_word = art_dict[language][0]
    method = art_dict[language][1]
    if method == 'pre':
        string = article_word + ' ' + str(article_num) + '(.*?)' + \
                 article_word + ' ' + str(article_num + 1)
    elif method == 'post':
        string = str(article_num) + ' ' + article_word + '(.*?)' + \
                 str(article_num + 1) + ' ' + article_word
    elif method == 'postdot':
        # escape the dot; an unescaped dot would match any character
        string = str(article_num) + '\\. ' + article_word + '(.*?)' + \
                 str(article_num + 1) + '\\. ' + article_word
    r = re.compile(string, re.DOTALL)
    result = ' '.join(r.search(DA[language])[1].split())
    return result
We now have a function that can retrieve the text of any article of the Delegated Acts in each European language. Below we give three examples (article 292, which contains the summary of the Solvency and Financial Condition Report).
retrieve_article('EN', 292)
"Summary 1. The solvency and financial condition report shall include a clear and concise summary. The summary of the report
shall be understandable to policy holders and beneficiaries. 2. The
summary of the report shall highlight any material changes to the
insurance or reinsurance undertaking's business and performance,
system of governance, risk profile, valuation for solvency purposes
and capital management over the reporting period."
retrieve_article('DE', 292)
'Zusammenfassung 1. Der Bericht über Solvabilität und Finanzlage
enthält eine klare, knappe Zusammenfassung. Die Zusammenfassung des
Berichts ist für Versicherungsnehmer und Anspruchsberechtigte
verständlich. 2. In der Zusammenfassung werden etwaige wesentliche
Änderungen in Bezug auf Geschäftstätigkeit und Leistung des
Versicherungs- oder Rückversicherungsunternehmens, sein
Governance-System, sein Risikoprofil, die Bewertung für
Solvabilitätszwecke und das Kapitalmanagement im Berichtszeitraum
herausgestellt.'
retrieve_article('EL', 292)
'Περίληψη 1. Η έκθεση φερεγγυότητας και χρηματοοικονομικής
κατάστασης περιλαμβάνει σαφή και σύντομη περίληψη. Η περίληψη της
έκθεσης πρέπει να είναι κατανοητή από τους αντισυμβαλλομένους και
τους δικαιούχους. 2. Η περίληψη της έκθεσης επισημαίνει τυχόν
ουσιώδεις αλλαγές όσον αφορά τη δραστηριότητα και τις επιδόσεις της
ασφαλιστικής και αντασφαλιστικής επιχείρησης, το σύστημα
διακυβέρνησης, το προφίλ κινδύνου, την εκτίμηση της αξίας για τους
σκοπούς φερεγγυότητας και τη διαχείριση κεφαλαίου κατά την περίοδο
αναφοράς.'
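Since the function works for every language, we can also loop over all of them at once; a minimal sketch that prints the first 50 characters of article 292 in each language:

for language in languages:
    print(language + ': ' + retrieve_article(language, 292)[:50])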