
Contextual embeddings assign each word a representation based on its context, thereby capturing the different uses of a word across varied contexts and encoding knowledge that can transfer across languages.

In [ ]:
import numpy as np
import pandas as pd

One-Hot Encoding

In [ ]:
vocabulary = ['I','like','to','play','football','rome','paris','mango','apple']
one_hot_matrix={}

# build a one-hot vector for each word: 1 at the word's index, 0 everywhere else
for i in range(len(vocabulary)):
    l = [0]*len(vocabulary)
    l[i] = 1
    one_hot_matrix[vocabulary[i]] = l
In [ ]:
one_hot_matrix
Out[ ]:
{'I': [1, 0, 0, 0, 0, 0, 0, 0, 0],
 'apple': [0, 0, 0, 0, 0, 0, 0, 0, 1],
 'football': [0, 0, 0, 0, 1, 0, 0, 0, 0],
 'like': [0, 1, 0, 0, 0, 0, 0, 0, 0],
 'mango': [0, 0, 0, 0, 0, 0, 0, 1, 0],
 'paris': [0, 0, 0, 0, 0, 0, 1, 0, 0],
 'play': [0, 0, 0, 1, 0, 0, 0, 0, 0],
 'rome': [0, 0, 0, 0, 0, 1, 0, 0, 0],
 'to': [0, 0, 1, 0, 0, 0, 0, 0, 0]}
  • One-hot vectors are high-dimensional and sparse, while word embeddings are low-dimensional and dense (they are usually between 50–600 dimensional). When you use one-hot vectors as a feature in a classifier, your feature vector grows with the vocabulary size; word embeddings are more computationally efficient.
  • Also, no meaningful similarity can be computed between one-hot vectors: every pair of distinct words is orthogonal, as the sketch below shows.
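
As a quick check of the second point, here is a minimal sketch (reusing the one_hot_matrix built above) showing that the cosine similarity between any two distinct one-hot vectors is always zero, even for related words such as 'rome' and 'paris':

In [ ]:
import numpy as np

def cosine_similarity(u, v):
    # cosine similarity = dot product divided by the product of the norms
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

rome  = np.array(one_hot_matrix['rome'])
paris = np.array(one_hot_matrix['paris'])
mango = np.array(one_hot_matrix['mango'])

# distinct one-hot vectors are orthogonal, so every pair scores 0.0
print(cosine_similarity(rome, paris))   # 0.0
print(cosine_similarity(rome, mango))   # 0.0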

Word2Vec - SkipGram, CBOW

The word2vec algorithms use a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. During training, the models take into account the context in which each word occurs.


In Skip-gram: each target word enters as a [1 x V] one-hot input, where V is the vocabulary size. It is projected through W1: [V x E] and then W2: [E x V], and the resulting [1 x V] vector is passed through a softmax, giving the probability of each vocabulary word being a context word of the target word.
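
A minimal numpy sketch of this Skip-gram forward pass (the sizes and random weights below are illustrative, not trained values):

In [ ]:
import numpy as np

V, E = 9, 4                          # vocabulary size and embedding size (illustrative)
W1 = np.random.rand(V, E)            # input -> hidden weights, shape [V x E]
W2 = np.random.rand(E, V)            # hidden -> output weights, shape [E x V]

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

target = np.zeros(V)                 # [1 x V] one-hot vector for the target word
target[3] = 1                        # e.g. 'play'

hidden = target @ W1                 # [1 x E] hidden (embedding) layer
scores = hidden @ W2                 # [1 x V] scores over the vocabulary
context_probs = softmax(scores)      # probability of each word being a context word
print(context_probs.shape, context_probs.sum())   # (9,) and ~1.0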

In CBOW: as in Skip-gram, two projection weight matrices are used, but here the target word is predicted from the aggregation (average) of the context words' one-hot inputs.
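
And a matching sketch of the CBOW direction, where the context vectors are averaged before the same two projections (again with illustrative sizes and untrained weights):

In [ ]:
import numpy as np

V, E = 9, 4                          # vocabulary size and embedding size (illustrative)
W1 = np.random.rand(V, E)            # input -> hidden weights, shape [V x E]
W2 = np.random.rand(E, V)            # hidden -> output weights, shape [E x V]

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

context_ids = [1, 2, 4]              # e.g. 'like', 'to', 'football' around the target 'play'
context = np.zeros((len(context_ids), V))
context[np.arange(len(context_ids)), context_ids] = 1   # one-hot row per context word

hidden = context.mean(axis=0) @ W1   # average the context vectors, then project to [1 x E]
target_probs = softmax(hidden @ W2)  # [1 x V] probability of each word being the target
print(target_probs.argmax())         # index of the most probable target word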

W1 and W2 are also known as the word-vector lookup tables: multiplying a one-hot vector by W1 simply selects one row, so after training the rows of W1 serve as the word embeddings.
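
In other words, once training is done, looking up a word's embedding is just indexing a row of W1 (a sketch with illustrative, untrained weights):

In [ ]:
import numpy as np

V, E = 9, 4
W1 = np.random.rand(V, E)            # after training, each row is a word embedding

word_index = 3                       # e.g. index of 'play' in the vocabulary above
word_vector = W1[word_index]         # [1 x E] embedding, no matrix multiplication needed
print(word_vector)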

According to [1], Skip-gram works well with small datasets and represents rare words better, whereas CBOW trains faster and represents frequent words better.

[1] Mikolov et al., Efficient Estimation of Word Representations in Vector Space (2013)
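
To try both variants in practice, here is a small, hedged sketch using the gensim library (assuming gensim 4.x is installed; the toy corpus and hyperparameters below are illustrative only):

In [ ]:
from gensim.models import Word2Vec

# a tiny illustrative corpus: a list of tokenised sentences
corpus = [
    ['I', 'like', 'to', 'play', 'football'],
    ['I', 'like', 'mango', 'and', 'apple'],
    ['rome', 'and', 'paris', 'are', 'cities'],
]

# sg=1 selects Skip-gram, sg=0 (the default) selects CBOW
skipgram_model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=50)
cbow_model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=0, epochs=50)

print(skipgram_model.wv['football'].shape)            # (50,) dense embedding
print(cbow_model.wv.most_similar('football', topn=3)) # nearest neighbours in the toy corpus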
