Contextual embeddings assign each word a representation based on its context, thereby capturing how a word's usage varies across contexts and encoding knowledge that transfers across languages.

In [ ]:
import numpy as np
import pandas as pd


# One-Hot Encoding

In [ ]:
vocabulary = ['I','like','to','play','football','rome','paris','mango','apple']
one_hot_matrix={}

for i in range(len(vocabulary)):
    l = [0] * len(vocabulary)
    l[i] = 1
    one_hot_matrix[vocabulary[i]] = l

In [ ]:
one_hot_matrix

Out[ ]:
{'I': [1, 0, 0, 0, 0, 0, 0, 0, 0],
'apple': [0, 0, 0, 0, 0, 0, 0, 0, 1],
'football': [0, 0, 0, 0, 1, 0, 0, 0, 0],
'like': [0, 1, 0, 0, 0, 0, 0, 0, 0],
'mango': [0, 0, 0, 0, 0, 0, 0, 1, 0],
'paris': [0, 0, 0, 0, 0, 0, 1, 0, 0],
'play': [0, 0, 0, 1, 0, 0, 0, 0, 0],
'rome': [0, 0, 0, 0, 0, 1, 0, 0, 0],
'to': [0, 0, 1, 0, 0, 0, 0, 0, 0]}
• One-hot vectors are high-dimensional and sparse, while word embeddings are low-dimensional and dense (they are usually between 50–600 dimensional). When you use one-hot vectors as a feature in a classifier, your feature vector grows with the vocabulary size; word embeddings are more computationally efficient.
• Also, no meaningful similarity metric can be computed: any two distinct one-hot vectors are orthogonal, so every pair of words looks equally dissimilar.
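The orthogonality point can be checked directly: the cosine similarity between any two distinct one-hot vectors is zero. A minimal sketch, reusing the vocabulary above:

```python
import numpy as np

vocabulary = ['I', 'like', 'to', 'play', 'football', 'rome', 'paris', 'mango', 'apple']
# Rows of the identity matrix are exactly the one-hot vectors
one_hot = {w: np.eye(len(vocabulary))[i] for i, w in enumerate(vocabulary)}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Distinct one-hot vectors are orthogonal, so similarity is always 0,
# even for semantically related words like 'rome' and 'paris'
print(cosine(one_hot['rome'], one_hot['paris']))   # 0.0
print(cosine(one_hot['mango'], one_hot['apple']))  # 0.0
```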

# Word2Vec - SkipGram, CBOW

The word2vec algorithms use a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. In learning, these models take into account the context in which each word occurs in the corpus.

In Skip-Gram: each target word starts as a [1 x V] one-hot input, where V is the vocabulary size. It is multiplied by W1: [V x E] and then W2: [E x V], so the final [1 x V] output vector is softmaxed, giving the probability of each context word given the target word.
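The Skip-Gram forward pass above can be sketched in NumPy. The weights here are random (untrained) and the sizes V=9, E=4 are toy values assumed only for illustration:

```python
import numpy as np

V, E = 9, 4                          # toy vocabulary and embedding sizes
rng = np.random.default_rng(0)
W1 = rng.normal(size=(V, E))         # [V x E] input weight matrix
W2 = rng.normal(size=(E, V))         # [E x V] output weight matrix

x = np.zeros(V)
x[3] = 1                             # [1 x V] one-hot target word
h = x @ W1                           # [1 x E] embedding; equals row 3 of W1
scores = h @ W2                      # [1 x V] scores over the vocabulary
probs = np.exp(scores) / np.exp(scores).sum()   # softmax: context-word probabilities

print(probs.shape)                   # (9,)
```

Note that multiplying a one-hot vector by W1 simply selects one row, which is why W1 acts as a lookup table.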

In CBOW: just as in Skip-Gram, two weight matrices are used, although here the target word is predicted through the aggregation of the context words.
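A matching sketch of the CBOW direction, with the same assumed toy sizes; the context embeddings are averaged before the output layer:

```python
import numpy as np

V, E = 9, 4                           # toy vocabulary and embedding sizes
rng = np.random.default_rng(1)
W1 = rng.normal(size=(V, E))          # [V x E] input weight matrix
W2 = rng.normal(size=(E, V))          # [E x V] output weight matrix

context_ids = [1, 3]                  # indices of the context words (assumed)
h = W1[context_ids].mean(axis=0)      # [1 x E] averaged context embedding
scores = h @ W2                       # [1 x V] scores over the vocabulary
probs = np.exp(scores) / np.exp(scores).sum()   # softmax: target-word probabilities
predicted = int(np.argmax(probs))     # index of the predicted target word
```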

W1 and W2 are also known as word-vector lookup tables: row i of W1 is the embedding of word i.

According to [1], Skip-Gram works well with small datasets and represents infrequent words better, while CBOW trains faster and represents frequent words better.

[1] Mikolov et al., "Efficient Estimation of Word Representations in Vector Space"