Word2vec is a well-known algorithm for natural language processing that, if trained properly, often leads to surprisingly good results. It consists of a shallow neural network that maps words to an n-dimensional vector space, i.e. it produces vector representations of words (so-called word embeddings). Word2vec does this in such a way that words used in the same context are embedded near each other (their respective vectors are close to each other). In this blog I will show you some of the results of word2vec models trained with Wikipedia and insurance-related documents.
One of the nice properties of a word2vec model is that it allows us to do calculations with words. The distance between two word vectors provides a measure of the linguistic or semantic similarity of the corresponding words. So if we calculate the nearest neighbors of a word vector, we find words similar to that word. It is also possible to calculate differences between two word vectors. For example, it appears that for a word2vec model trained on a large data set, the vector difference between man and woman is roughly equal to the difference between king and queen, or in vector notation king – man + woman = queen. If you find this utterly strange then you are not alone. Besides some intuitive and informal explanations, it is not yet completely clear why word2vec models in general yield these results.
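To make this arithmetic concrete, here is a tiny sketch with hand-picked toy vectors (real embeddings are learned from data and have far more dimensions): the vector king – man + woman lands closest, by cosine similarity, to queen.

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors; 1.0 means identical direction.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hand-picked toy "embeddings", chosen so the analogy holds; real word2vec
# vectors are learned from data and typically have 100-300 dimensions.
vectors = {
    "king":  np.array([0.8, 0.7, 0.1]),
    "queen": np.array([0.8, 0.1, 0.7]),
    "man":   np.array([0.2, 0.8, 0.1]),
    "woman": np.array([0.2, 0.1, 0.8]),
    "apple": np.array([0.1, 0.1, 0.1]),
}

# king - man + woman: which remaining word vector is nearest?
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max((w for w in vectors if w not in ("king", "man", "woman")),
           key=lambda w: cosine_similarity(target, vectors[w]))
print(best)  # queen
```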
Word2vec models need to be trained with a large corpus of text data in order to achieve word embeddings that allow these kinds of calculations. There are some large sets of pre-trained word vectors available, such as the GloVe Twitter word vectors, trained with 2 billion tweets, and the word2vec model based on Google News (trained with 100 billion words). However, most of them are in the English language and are often trained on words in general use, rather than on domain-specific vocabulary.
So let’s see if we can train word2vec models specifically for non-English European languages, using specific insurance vocabulary. A way to do this is to train a word2vec model with the Wikipedia pages of a given language and additionally train the model with sentences found in public documents of insurance undertakings (SFCRs) and in insurance legislation. In doing so, the word2vec model should be able to capture the specific language domain of insurance.
The Wikipedia word2vec model
Data dumps of all Wikimedia wikis, in the form of XML files, are provided here. I obtained the latest Wikipedia pages and articles of all official European languages (bg, es, cs, da, de, et, el, en, fr, hr, it, lv, lt, hu, mt, nl, pl, pt, ro, sk, sl, fi, sv). These bz2-compressed files range in size from 8.6 MB (Maltese) to 16.9 GB (English); the Dutch file is around 1.5 GB. Uncompressed, the Dutch file is about 5 times that size and contains more than 2.5 million Wikipedia pages. That is too large to hold in memory (at least on my computer), so I used Python generator functions to process the files without storing them completely in memory.
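A minimal sketch of that generator approach could look like this (the tag matching is simplified for illustration; a production parser would use something like lxml.etree.iterparse):

```python
import bz2

def iter_pages(lines):
    """Yield the raw XML of each <page> element from an iterable of lines,
    one page at a time, so the full dump never has to sit in memory."""
    buffer, inside = [], False
    for line in lines:
        if "<page>" in line:
            inside = True
        if inside:
            buffer.append(line)
        if "</page>" in line:
            inside = False
            yield "".join(buffer)
            buffer = []

# Usage on a dump file (example filename following the Wikimedia
# naming convention):
# with bz2.open("nlwiki-latest-pages-articles.xml.bz2", "rt",
#               encoding="utf-8") as f:
#     for page_xml in iter_pages(f):
#         ...  # extract titles and texts from page_xml
```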
The downloaded XML files are parsed, and the page titles and texts are then processed with the nltk package (stop words are removed and sentences are tokenized and preprocessed). No n-grams were applied at this stage. For the word2vec model I used the implementation in the gensim package.
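The preprocessing step can be sketched as follows. This is a simplified stand-in for the actual nltk pipeline (which uses nltk's language-specific sentence and word tokenizers and stop word lists); the tiny stop word set and the training parameters in the comment are illustrative only.

```python
import re

# Illustrative mini stop word list; the real pipeline uses
# nltk.corpus.stopwords.words('dutch').
DUTCH_STOPWORDS = {"de", "het", "een", "en", "van", "in", "is", "dat"}

def preprocess(text, stopwords=DUTCH_STOPWORDS):
    """Split text into sentences and lowercase word tokens, dropping stop
    words: the list-of-token-lists shape that gensim's Word2Vec expects."""
    sentences = []
    for raw in re.split(r"[.!?]+", text):
        tokens = [t for t in re.findall(r"[a-zà-ÿ]+", raw.lower())
                  if t not in stopwords]
        if tokens:
            sentences.append(tokens)
    return sentences

print(preprocess("De olifant is een groot dier. Het eet veel planten!"))

# Training is then a single call (gensim >= 4 API; parameter values
# are illustrative, not the ones used for the models in this post):
# from gensim.models import Word2Vec
# model = Word2Vec(sentences, vector_size=100, window=5,
#                  min_count=5, workers=4)
```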
Let’s look at some results of these Wikipedia word2vec models. If we retrieve the ten nearest word vectors of the Dutch word for elephant, we get:
In : model.wv.most_similar('olifant', topn = 10)
Out: [('olifanten', 0.704888105392456),
      ('neushoorn', 0.6430075168609619),
      ('tijger', 0.6399451494216919),
      ('luipaard', 0.6376790404319763),
      ('nijlpaard', 0.6358680725097656),
      ('kameel', 0.5886276960372925),
      ('neushoorns', 0.5880545377731323),
      ('ezel', 0.5879943370819092),
      ('giraf', 0.5807977914810181),
      ('struisvogel', 0.5724758505821228)]
These are all general Dutch names for (wild) animals. So the Dutch word2vec model appears to map animal names to the same area of the vector space. The word2vec models of the other languages appear to do the same; for example, norsut (Finnish for elephants) has the following top ten similar words: krokotiilit, sarvikuonot, käärmeet, virtahevot, apinat, hylkeet, hyeenat, kilpikonnat, jänikset and merileijonat. Again, these are all names of animals (with a slight preference for Nordic sea animals).
In the Danish word2vec model, the top ten most similar words for mads (a Danish first name derived from Matthew) are:
In : model.wv.most_similar('mads', topn = 10)
Out: [('mikkel', 0.6680521965026855),
      ('nicolaj', 0.6564826965332031),
      ('kasper', 0.6114416122436523),
      ('mathias', 0.6102851033210754),
      ('rasmus', 0.6025335788726807),
      ('theis', 0.6013824343681335),
      ('rikke', 0.5957099199295044),
      ('janni', 0.5956574082374573),
      ('refslund', 0.5891965627670288),
      ('kristoffer', 0.5842193365097046)]
Almost all of these are first names, except for Refslund, a former Danish chef whose first name was Mads. The Danish word2vec model appears to map first names to the same region of the vector space, resulting in a high similarity between first names.
Re-training the Wikipedia Word2vec with SFCRs
The second step is to train the word2vec models with the insurance-related text documents. Although the Wikipedia pages of many languages contain some pages on insurance and insurance undertakings, it is difficult to derive the specific language of this domain from these pages alone. For example, the Dutch word for risk margin does not occur in the Dutch Wikipedia pages, and the same holds for many other technical terms. In addition to the Wikipedia pages, we should therefore train the model with insurance-specific documents. For this I used the public Solvency and Financial Condition Reports (SFCRs) of Dutch insurance undertakings and the Dutch text of the Solvency II Delegated Acts (here is how to download and read it).
The SFCR sentences are processed in the same manner as the Wikipedia pages, except that here I applied bi- and trigrams in order to capture insurance terms rather than separate words (for example, technical provisions is a bigram and is treated as a single token, technical_provisions).
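The idea behind this bigram step can be illustrated in a few lines of plain Python. This is a minimal sketch using a raw count threshold; the actual pipeline uses gensim's Phrases/Phraser classes, which score word pairs with a proper co-occurrence statistic.

```python
from collections import Counter

def merge_frequent_bigrams(sentences, min_count=2):
    """Minimal illustration of bigram detection: any adjacent word pair
    seen at least min_count times is merged into one '_'-joined token."""
    pair_counts = Counter(
        (a, b) for s in sentences for a, b in zip(s, s[1:])
    )
    phrases = {pair for pair, c in pair_counts.items() if c >= min_count}
    merged = []
    for s in sentences:
        out, i = [], 0
        while i < len(s):
            if i + 1 < len(s) and (s[i], s[i + 1]) in phrases:
                out.append(s[i] + "_" + s[i + 1])  # merge the frequent pair
                i += 2
            else:
                out.append(s[i])
                i += 1
        merged.append(out)
    return merged

# Toy corpus: "technische voorzieningen" co-occurs repeatedly, the other
# pairs only once, so only that pair is promoted to a bigram token.
sentences = [
    ["solvabiliteit", "technische", "voorzieningen", "stijgen"],
    ["technische", "voorzieningen", "risicomarge"],
    ["premies", "technische", "voorzieningen", "dalen"],
]
print(merge_frequent_bigrams(sentences))
```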
Now the model is able to find words similar to risicomarge, the Dutch word for risk margin.
In : model.wv.most_similar('risicomarge')
Out: [('beste_schatting', 0.43119704723358154),
      ('technische_voorziening', 0.42812830209732056),
      ('technische_voorzieningen', 0.4108312726020813),
      ('inproduct', 0.409644216299057),
      ('heffingskorting', 0.4008549451828003),
      ('voorziening', 0.3887258470058441),
      ('best_estimate', 0.3886040449142456),
      ('contant_maken', 0.37772029638290405),
      ('optelling', 0.3554660379886627),
      ('brutowinst', 0.3554105758666992)]
This already looks nice. Closest to risicomarge are the Dutch terms beste_schatting (English: best estimate) and technische_voorziening(en) (English: technical provision, singular and plural). The relation to heffingskorting (English: tax credit) is strange here; perhaps the term risk margin is not used solely in insurance.
Let’s do another one. The Dutch acronym skv is the equivalent of the English scr (solvency capital requirement).
In : model.wv.most_similar('skv')
Out: [('mkv', 0.6492390036582947),
      ('mcr_ratio', 0.4787723124027252),
      ('kapitaalseis', 0.46219778060913086),
      ('mcr', 0.440476655960083),
      ('bscr', 0.4224048852920532),
      ('scr_ratio', 0.41769397258758545),
      ('ðhail', 0.41652536392211914),
      ('solvency_capital', 0.4136047661304474),
      ('mcr_scr', 0.40923237800598145),
      ('solvabiliteits', 0.406883180141449)]
The SFCR documents were sufficient to derive an association between skv and mkv (the Dutch equivalent of the English mcr), and with the English acronyms scr and mcr themselves (apparently the Dutch documents sometimes use scr and mcr in the same context). Other similar words are kapitaalseis (English: capital requirement) and bscr. Because they learn from context, word2vec models are able to pick up synonyms and sometimes also antonyms (for example, we say ‘today is a cold day’ and ‘today is a hot day’, where hot and cold occur in the same context).
For an example of a vector calculation look at the following result.
In : model.wv.most_similar(positive = ['dnb', 'duitsland'], negative = ['nederland'], topn = 5)
Out: [('bundesbank', 0.4988047778606415),
      ('bundestag', 0.4865422248840332),
      ('simplesearch', 0.452720582485199),
      ('deutsche', 0.437085896730423),
      ('bondsdag', 0.43249475955963135)]
This function finds the top five words similar to the vector DNB – Nederland + Duitsland. This expression essentially asks for the German equivalent of De Nederlandsche Bank (DNB). The model generates the correct answer: the German analogue of DNB as a central bank is the Bundesbank. I think this is somehow incorporated in the Wikipedia pages, because the German equivalent of DNB as an insurance supervisor is not the Bundesbank but BaFin, and this was not picked up by the model. The result is not perfect (the other words in the list are less related, and for other countries this does not work as well); we would need more documents to find more stable associations. But to me this is already pretty surprising.
There has been some research in which the word vectors of word2vec models of two languages were mapped onto each other with a linear transformation (see for example Exploiting Similarities among Languages for Machine Translation, Mikolov et al.). In doing so, it was possible to obtain a model for machine translation. So perhaps it is possible, for some European languages with a sufficiently large corpus of SFCRs, to generate one large model that is to some extent language independent. To derive the translation matrices we could use the different translations of European legislative texts, because by their nature these texts provide some of the most reliable translations available.
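A sketch of that idea: collect the embeddings of known translation pairs into matrices X (source language) and Z (target language), one row per pair, and learn the mapping W that minimizes ||XW – Z|| by ordinary least squares. The toy vectors below are synthetic stand-ins for real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2                                    # toy dimension; real models use 100+
X = rng.normal(size=(20, d))             # source-language vectors (known pairs)
W_true = np.array([[0.0, -1.0],
                   [1.0,  0.0]])         # hidden "true" mapping (a rotation)
Z = X @ W_true                           # corresponding target-language vectors

# Learn the translation matrix by least squares: minimize ||X W - Z||.
W, *_ = np.linalg.lstsq(X, Z, rcond=None)

# Translate a new source word: map its vector into the target space,
# then take the nearest target-language vector as the proposed translation.
new_word_vec = rng.normal(size=d)
mapped = new_word_vec @ W
nearest = int(np.argmin(np.linalg.norm(Z - mapped, axis=1)))
print(np.allclose(W, W_true))  # True: the mapping is recovered
```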
But that’s it for me for now. Word2vec is a versatile and powerful algorithm that can be used in numerous natural language applications. It is relatively easy to generate these models in languages other than English, and, as I showed in this blog, it is possible to train models that can deal with the specifics of insurance terminology.