According to Wikipedia, word embedding is the collective name for a set of language modelling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. These days, it is one of the most popular choices for vectorizing text for NLP tasks such as sentiment analysis and machine translation.
Generally, it is recommended to train your own model to generate word vectors that are specific to your application domain. If you don't have enough data or resources to train your own embeddings, you can use pre-trained vectors that are readily available from institutions like Google and Stanford. For example, Stanford provides GloVe vectors (840B tokens, 2.2M vocab) of size ~3 GB.
What problems did we face?
To build any of our NLP models, we had to load the entire set of pre-trained word embeddings into memory for fast lookup. Since our application's vocabulary was only 10% of the size of the loaded embeddings, more than 50% of the memory was wasted in exchange for faster lookup. We needed a better way to store and retrieve word embeddings, one that kept the memory footprint minimal while keeping lookups fast.
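To make the problem concrete, here is a minimal sketch of the load-everything approach. It assumes the plain-text vector format in which each line holds a word followed by its space-separated values; the three-word sample and the `app_vocab` set are hypothetical stand-ins for the real vector file and our application's vocabulary.

```python
from typing import Dict, Iterable, List


def load_embeddings(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse 'word v1 v2 ... vn' lines into an in-memory dict."""
    table: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        table[parts[0]] = [float(x) for x in parts[1:]]
    return table


# Hypothetical three-word sample standing in for the full vector file.
sample = [
    "cat 0.1 0.2 0.3",
    "dog 0.4 0.5 0.6",
    "zebra 0.7 0.8 0.9",
]
embeddings = load_embeddings(sample)

# Hypothetical application vocabulary: only part of the table is ever used.
app_vocab = {"cat", "dog"}
used = sum(1 for word in embeddings if word in app_vocab)
unused_fraction = 1 - used / len(embeddings)
```

Every vector stays resident in the dict whether or not the application ever asks for it, which is where the wasted memory comes from.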
What approaches did we try?
We explored the following approaches to store and retrieve the word embeddings efficiently.
Initially, we tried indexing the embeddings into Elasticsearch, as we were already using it. But for word embeddings we needed neither fuzzy search nor filtering/aggregation. Also, the index occupied roughly 3 times the space on disk once the Elasticsearch index replication factor was taken into account.
Next, we tried a trie, which is an efficient data structure for information retrieval. Its representation of words compressed the size efficiently, but there was no real benefit because we were still loading the entire resulting trie into memory.
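For readers unfamiliar with the structure, a trie keyed by characters might look roughly like the sketch below. This is an illustration rather than the code we ran; the class names and the sample words are made up for the example.

```python
from typing import Dict, List, Optional


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.vector: Optional[List[float]] = None  # set only where a word ends


class EmbeddingTrie:
    def __init__(self) -> None:
        self.root = TrieNode()
        self.node_count = 1  # counts the root node

    def insert(self, word: str, vector: List[float]) -> None:
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
                self.node_count += 1
            node = node.children[ch]
        node.vector = vector

    def lookup(self, word: str) -> Optional[List[float]]:
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return None  # no stored word starts with this prefix
        return node.vector


trie = EmbeddingTrie()
trie.insert("car", [0.1, 0.2])
trie.insert("cart", [0.3, 0.4])
trie.insert("care", [0.5, 0.6])
```

The three words share the prefix "car", so their keys need fewer nodes than storing each word separately would. The vectors themselves are not shared, though, and the whole structure still lives in memory.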
Finally, we experimented with LevelDB, and it looked like a perfect candidate for our use case for the following reasons:
- Optimised for key-value storage and lookup.
- Simple, direct methods that are easy to use.
- Lightweight database.
- Data is automatically compressed.
- Can be shipped with applications inside a Docker image.
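The key-value pattern behind this choice can be sketched as follows. This is a simplified illustration under a few assumptions, not our production code: vectors are packed as little-endian 32-bit floats, and `InMemoryStore` is a hypothetical stand-in that mimics the byte-oriented `put`/`get` calls of a LevelDB handle so the example runs without any extra packages. With a Python binding such as plyvel installed (an assumption; other bindings exist), a real database handle could be passed in instead.

```python
import struct
from typing import Dict, List, Optional


def encode_vector(vector: List[float]) -> bytes:
    """Pack a vector as little-endian 32-bit floats (an assumed layout)."""
    return struct.pack(f"<{len(vector)}f", *vector)


def decode_vector(blob: bytes) -> List[float]:
    """Unpack bytes written by encode_vector (4 bytes per value)."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


class InMemoryStore:
    """Hypothetical stand-in with LevelDB-style put/get on bytes."""

    def __init__(self) -> None:
        self._data: Dict[bytes, bytes] = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: bytes) -> Optional[bytes]:
        return self._data.get(key)


def save_embeddings(db, embeddings: Dict[str, List[float]]) -> None:
    for word, vector in embeddings.items():
        db.put(word.encode("utf-8"), encode_vector(vector))


def lookup(db, word: str) -> Optional[List[float]]:
    blob = db.get(word.encode("utf-8"))
    return None if blob is None else decode_vector(blob)


# With plyvel installed, the stand-in could be swapped for a real database:
#   import plyvel
#   db = plyvel.DB("embeddings.ldb", create_if_missing=True)
db = InMemoryStore()
save_embeddings(db, {"cat": [0.5, -1.25, 2.0], "dog": [1.0, 0.0, -0.75]})
cat_vector = lookup(db, "cat")
missing = lookup(db, "zebra")
```

Because each lookup fetches only the bytes for one word, the application no longer has to hold the vectors for words it never sees.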
It is quite possible that there are better approaches, but as of now LevelDB is working fine for us. In the next set of blog posts, we will discuss how to train your own embeddings and build a real-life application on top of them.