The digitization of information and communication is happening at an unprecedented rate today.
While decoding numerical records have long been a cake-walk for technologies, developing algorithms that understand and correlate natural language has not been easy. It has now become critical for technology to recognize the context of the basic unit of verbal communication - a word - and in turn enable devices to mimic human understanding of natural language the way it is.
Because, the bulk of today’s communication is inevitably not in numbers. Let us take a few examples below.
Healthcare: Take for example, one of the biggest challenges faced by health-tech today - how to integrate HIMS (Hospital information Management System) and EHR (Electronic Health Records)? How to feed this integration into the CDS (Clinical Decision Support) of hospitals? And finally, how to automate the process of generating accurate results from CDS - both diagnostic and prescriptive? Or, another pressing problem faced by 21st century healthcare management - how can feature selection of disease symptoms be used for epidemic surveillance (Bird flu or H1N1 for example)?
Taxonomy: Taxonomies are pivotal to knowledge management and organization, and serve as the foundation for superior representations of knowledge in various systems, such as formal ontologies. Since developing taxonomies by humans is cumbersome and expensive, automation of taxonomy induction to build taxonomies at scale requires recognizing words and word patterns in context.
Financial News: An industry that is highly sensitive to news announcements and press releases, modern technology is being trained to delve into understanding the sentiment of financial news even as we speak, to detect and depict market bearings.
What are Word Embeddings?
Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers.
What happens when words are mapped to real number vectors?
Words that emanate from the same or similar context can be associated with each other. Let us look at a few simple examples:
Creating Word Embeddings is the basic step to working with textual data because computers and other devices don’t understand text - they largely work with numbers when it comes to detection and recognition.
There are many techniques to create Word Embeddings. Some of the popular ones are:
- Binary Encoding.
- TF Encoding.
- TF-IDF Encoding.
- Latent Semantic Analysis Encoding.
- Word2Vec Encoding.
In this article, we will primarily focus on the Word2Vec technique.
Key Challenges in Word Embeddings
While developing NLP application, one of the basic fundamental problem is that machines can’t understand the text data directly, so, we need to encode them into numerical format using the 5 techniques mentioned above - each of these come with some problems of their own.
- If you apply one-hot encoding to large set of sentences, then the vector dimension will be equal to entire vocabulary of the corpus, which would require high computation power and matrix operations. Dealing with gigantic vectors is a core challenge.
- Most of the above methods don’t reveal the context similarity between words. For instance, if your corpus is talking about words like ‘Sachin Tendulkar’ and ‘Masterblaster’, then such methods would treat both these words differently.
- It is difficult to find accurate word similarities with WordNet, because WordNet contains more subjective information and also new words may not be present.
Word2Vec - A Revolution
Word2vec is developed using two-layer neural networks. It takes a large amount of text data or text corpus as input and generates a set of vectors from the given text. Word2vec is good at finding out word similarity, as well as preserving the semantic relationship between words that couldn’t be handled by techniques such as one-hot encoding or by WordNet.
So, in general, our straightforward goal is that we need to convert every word into vector format, such that they are good at predicting the words that appear in their context. Also, by giving the word, we need to be able to predict the word with the highest probability to suit in a given context.
Word2Vec is an advanced technique recently developed by Google. There are 2 algorithms available to train this technique.
Training Word2Vec On Own Corpus
We will be using Gensim library of Python to train these word vectors. We are considering all the books of Harry Potter series as the dataset for this purpose. You can find dataset here
Gensim provides an easy-to-use API for the same task. It does not require you to know the mathematical implementation of Word2Vec model, instead it exposes only the required parameters with some default values that the user can tune in for his specific use-case. Gensim requires you to transform the data in a particular format before feeding it to the train function.
The snippet below shows the transformed data.
Some of the parameters that are worth experimenting with are window, size, min_count.
The snippet below shows the core code for training of the model:
We have worked with default values for all the parameters. The table below shows a few machine statistics while our model was getting trained.
For future reference, we have saved our model and dumped it in the binary format on the disk. Our model size is 25MB.
Deriving Meaning from the Vectors
Vector value encode the information for semantically similar words in the given corpus. If you plot the same on X-Y plane then it is evident from the vicinity they lie in. Let’s see what we have in our hand.
We queried words that are similar to “Snape’, the cluster of words we found were ‘dumbledore', 'slughorn', 'quirrell', 'moody', 'lupin', 'karkaroff', 'voldemort', 'sirius', 'umbridge', ‘flitwick’
We queried words that are similar to “Azkaban”, the cluster of words we found were 'chamber', 'goblet', 'prisoner', 'hallows', 'secrets', 'deathly', 'order', 'philosophers', 'phoenix', 'prince'
As we can see from the above results that words in the vicinity of Azkaban are all the harry potter part names. Whereas, words in the vicinity of Snape are all people names.
Please, download our Jupyter Notebook for full code.
Stay tuned to our blog for more updates as our team works on cutting edge analytics. To know more about we do and how we can help you, drop a line at firstname.lastname@example.org.