Topic Modeling:
Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents.
LDA
Latent Dirichlet Allocation (LDA) is an example of a topic model and is used to classify the text in a document under a particular topic. It builds a topic-per-document model and a words-per-topic model, both modeled as Dirichlet distributions.
# Eg:
# doc1 = ['Tree has fruits']
# doc2 = ['abc party was elected']
Latent: This refers to everything that we don’t know a priori and that is hidden in the data. Here, the themes or topics that a document consists of are unknown, but they are believed to be present because the text is generated based on those topics.
Dirichlet: It is a ‘distribution of distributions’.
Allocation: This means that once we have Dirichlet, we will allocate topics to the documents and words of the document to topics.
Each document can be described by a distribution of topics, and each topic can be described by a distribution of words. The probability of a word w appearing in a document d can therefore be written as

P(w|d) = Σ_{t ∈ T} P(w|t, d) · P(t|d)

summed for all t in T, where T is the total number of topics. Also, let’s assume that there are W words in our vocabulary across all the documents. If we assume conditional independence, we can say that P(w|t, d) = P(w|t), and hence P(w|d) is equal to

P(w|d) = Σ_{t ∈ T} P(w|t) · P(t|d)

that is, the dot product of θ_td = P(t|d) (the document–topic distribution) and φ_wt = P(w|t) (the topic–word distribution) over the topics t.
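As a tiny numeric illustration (the numbers below are made up, not taken from any dataset), P(w|d) is the dot product of document d’s topic distribution θ_d and the column of φ for word w:
import numpy as np

# theta[d, t] = P(topic t | document d), phi[t, w] = P(word w | topic t)
theta = np.array([[0.7, 0.3],
                  [0.1, 0.9]])
phi = np.array([[0.5, 0.4, 0.1],
                [0.2, 0.2, 0.6]])

d, w = 0, 2                          # document 0, word id 2
p_w_given_d = theta[d] @ phi[:, w]   # sum over topics of P(w|t) * P(t|d)
print(p_w_given_d)                   # 0.7*0.1 + 0.3*0.6 ≈ 0.25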
Gibbs Sampling
Gibbs sampling is an algorithm for successively sampling from the conditional distributions of the variables, whose distribution over states converges to the true distribution in the long run. This is a somewhat abstract concept and needs a good understanding of Markov chain Monte Carlo (MCMC) methods and Bayes’ theorem.
Therefore, what we are doing is trying to maximize the likelihood of our data given these two distributions: the document–topic matrix θ and the topic–word matrix φ. The goal of the approach is to learn these matrices. We do not have to worry about most of the mathematics, as that is handled by gensim.
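For intuition only, here is a minimal collapsed Gibbs sampler for LDA written in plain NumPy. This is a toy sketch, not how gensim implements inference; the function name gibbs_lda, the representation of docs as lists of word ids, and the hyperparameters alpha and beta are all assumptions made for this example.
import numpy as np

def gibbs_lda(docs, num_topics, vocab_size, iterations=50, alpha=0.1, beta=0.01):
    """Toy collapsed Gibbs sampler: docs is a list of documents, each a list of word ids."""
    rng = np.random.default_rng(0)
    # z[d][i] = topic currently assigned to the i-th word of document d
    z = [rng.integers(num_topics, size=len(doc)) for doc in docs]
    ndt = np.zeros((len(docs), num_topics))   # document-topic counts
    ntw = np.zeros((num_topics, vocab_size))  # topic-word counts
    nt = np.zeros(num_topics)                 # total words per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            ndt[d, t] += 1; ntw[t, w] += 1; nt[t] += 1
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # remove the current assignment, assume all other assignments are correct
                ndt[d, t] -= 1; ntw[t, w] -= 1; nt[t] -= 1
                # conditional distribution of the topic for this word
                p = (ndt[d] + alpha) * (ntw[:, w] + beta) / (nt + beta * vocab_size)
                t = rng.choice(num_topics, p=p / p.sum())
                z[d][i] = t
                ndt[d, t] += 1; ntw[t, w] += 1; nt[t] += 1
    theta = (ndt + alpha) / (ndt + alpha).sum(axis=1, keepdims=True)  # P(t|d)
    phi = (ntw + beta) / (ntw + beta).sum(axis=1, keepdims=True)      # P(w|t)
    return theta, phi

# usage on two tiny documents over a 5-word vocabulary
theta, phi = gibbs_lda([[0, 1, 2], [3, 4, 4, 1]], num_topics=2, vocab_size=5)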
Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora.
!pip install pyLDAvis==2.1.2
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train', shuffle = True)
newsgroups_test = fetch_20newsgroups(subset='test', shuffle = True)
len(newsgroups_train.data)
print(newsgroups_train.target_names)
# TF-IDF is used later for proportionately allocating topics to the documents
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(2018)
import nltk
nltk.download('wordnet')
def lemmatize_stemming(text):
    # lemmatize (treating the token as a verb) and then stem it
    stemmer = SnowballStemmer(language='english')
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    # tokenize, drop stopwords and very short tokens, then lemmatize and stem
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
from tqdm import tqdm
articles = newsgroups_train.data
processed_docs=[]
for i in tqdm(range(len(articles))):
    processed_docs.append(preprocess(articles[i]))
print(processed_docs[:2])
dictionary = gensim.corpora.Dictionary(processed_docs)
# preview the first few (token id, token) pairs in the dictionary
count = 0
for k, v in dictionary.items():
    print(k, v)
    count += 1
    if count > 10:
        break
# keep at most 100000 tokens; drop tokens appearing in fewer than 15 documents or in more than 10% of documents
dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n=100000)
# Use Gensim Bag of Words
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus[1]  # list of (token id, count) pairs for the second document
document_num = 20
bow_doc_x = bow_corpus[document_num]
for i in range(len(bow_doc_x)):
    print("Word {} (\"{}\") appears {} time(s).".format(bow_doc_x[i][0],
                                                        dictionary[bow_doc_x[i][0]],
                                                        bow_doc_x[i][1]))
# takes ~2 min to train; increasing the number of passes can improve the model at the cost of training time
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=8, id2word=dictionary, passes=10, workers=2)
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))
from gensim import corpora, models
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]
from pprint import pprint
for doc in corpus_tfidf:
    pprint(doc)
    break
# instead of the raw bag of words, we pass the TF-IDF-weighted corpus as input to the model
# takes ~1 min 25 s to train
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=8, id2word=dictionary, passes=10, workers=4)
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Words: {}'.format(idx, topic))
unseen_document = ''  # paste the text of a new, unseen document here
bow_vector = dictionary.doc2bow(preprocess(unseen_document))
for index, score in sorted(lda_model_tfidf[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model_tfidf.print_topic(index, 5)))
import pickle
import pyLDAvis
import pyLDAvis.gensim
import os
# Visualize the topics
pyLDAvis.enable_notebook()
num_topics=8
LDAvis_data_filepath = os.path.join('./ldavis_prepared_'+str(num_topics))
# this is a bit time consuming - make the if statement True
# if you want to execute the visualization prep yourself
if 1 == 1:
    LDAvis_prepared = pyLDAvis.gensim.prepare(lda_model, bow_corpus, dictionary)
    with open(LDAvis_data_filepath, 'wb') as f:
        pickle.dump(LDAvis_prepared, f)
# load the pre-prepared pyLDAvis data from disk
with open(LDAvis_data_filepath, 'rb') as f:
    LDAvis_prepared = pickle.load(f)
pyLDAvis.save_html(LDAvis_prepared, './ldavis_prepared_'+ str(num_topics) +'.html')
# interactive view of the topics: circle sizes reflect topic prevalence, with the top terms for each topic
LDAvis_prepared
Assumptions made by LDA:
Each document is just a collection of words, or a “bag of words”. Thus, the order of the words and their grammatical roles (subject, object, verb, …) are not considered in the model.
Words like am/is/are/of/a/the/but/… don’t carry any information about the “topics” and can therefore be eliminated from the documents as a preprocessing step. In fact, we can eliminate words that occur in 80% to 90% of the documents without losing much information. For example, if our corpus contains only medical documents, words like human, body, and health might be present in most of the documents and can hence be removed, as they don’t add any specific information that would make a document stand out.
We know beforehand how many topics we want: the number of topics k is pre-decided (one way to choose k is sketched below).
During sampling, all topic assignments except that of the current word in question are assumed to be correct, and the assignment of the current word is then updated using our model of how documents are generated.
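Since k must be chosen up front, one common way to pick it is to train models for a few candidate values of k and compare their topic coherence, for example with gensim's CoherenceModel. The sketch below is a hedged example, not part of the original notebook: it reuses bow_corpus, dictionary, and processed_docs from above, and the candidate k values are arbitrary.
from gensim.models import CoherenceModel, LdaMulticore

# train a small model for each candidate k and report its c_v coherence (higher is generally better)
for k in (4, 8, 12):
    candidate = LdaMulticore(bow_corpus, num_topics=k, id2word=dictionary, passes=10, workers=2)
    cm = CoherenceModel(model=candidate, texts=processed_docs, dictionary=dictionary, coherence='c_v')
    print(k, cm.get_coherence())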
# some dataset links for topic modeling in social media dataset
# https://medium.com/aryma-labs/learn-how-we-used-topic-modeling-to-help-a-client-find-hostile-topics-among-various-social-media-de6a6003709f
# https://lazarinastoy.com/topic-modelling-limitations-short-text/
# https://www.kaggle.com/search?q=social+media
# https://www.kaggle.com/eliasdabbas/extract-entities-from-social-media-posts/data