Topic Modeling:

Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents.

LDA

Latent Dirichlet Allocation (LDA) is an example of a topic model and is used to assign the text in a document to particular topics. It builds two models, both drawn from Dirichlet distributions:

  • a topics-per-document model -> each document is a mixture of topics, and every word in it is assigned to one of those topics
  • a words-per-topic model -> each topic is a distribution over words, concentrated on a group of related words
In [ ]:
# Eg: 
# doc1 = ['Tree has fruits']
# doc2 = ['abc party was elected']

Latent: This refers to everything that we don’t know a priori and that is hidden in the data. Here, the themes or topics that a document consists of are unknown, but they are believed to be present because the text is assumed to be generated from those topics.

Dirichlet: It is a ‘distribution of distributions’: each sample drawn from a Dirichlet is itself a probability distribution, for example one document’s mix of topics.

Allocation: This means that once we have the Dirichlet distributions, we allocate topics to the documents and words of the documents to topics.
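
To make the three terms concrete, here is a minimal sketch of the generative story LDA assumes, using only the standard library. The vocabulary, the Dirichlet parameters, and the helper names `sample_dirichlet` and `generate_document` are all made up for illustration; they are not part of gensim.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]

def generate_document(vocab, topic_word, doc_alpha, length, rng):
    """Generate one document: draw a topic mixture, then a topic and a word per slot."""
    theta = sample_dirichlet(doc_alpha, rng)  # latent topic mixture of this document
    words = []
    for _ in range(length):
        # Allocate a topic to this word slot according to the document's mixture
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # Draw the actual word from that topic's word distribution
        words.append(rng.choices(vocab, weights=topic_word[topic])[0])
    return theta, words

rng = random.Random(0)
vocab = ["tree", "fruit", "leaf", "party", "vote", "elect"]  # assumed toy vocabulary
# One word distribution per topic, each itself a sample from a Dirichlet
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(2)]
theta, words = generate_document(vocab, topic_word, doc_alpha=[0.5, 0.5], length=8, rng=rng)
print(theta)
print(words)
```

Training LDA runs this story in reverse: we only see the words, and we try to recover the hidden topic mixtures and word distributions.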

  1. ϴtd = P(t|d), the probability distribution of topics in documents
  2. Фwt = P(w|t), the probability distribution of words in topics

The probability of a word given a document, P(w|d), is then:

P(w|d) = Σ_t P(w|t,d) * P(t|d)

summed over all t in T, where T is the set of topics. Also, let’s assume that there are W words in our vocabulary across all the documents. If we assume conditional independence, we can say that P(w|t,d) = P(w|t), and hence P(w|d) is equal to:

P(w|d) = Σ_t P(w|t) * P(t|d) = Σ_t Фwt * ϴtd

that is, the dot product of ϴtd and Фwt over the topics t.
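
As a quick numeric check of this sum, here is a small sketch with two topics and three vocabulary words. The probabilities are assumed toy values, and `word_given_doc` is a hypothetical helper name.

```python
# Toy values, assumed for illustration: 2 topics, 3 vocabulary words
theta_d = [0.7, 0.3]          # P(t|d) for one document d
phi = [
    [0.5, 0.4, 0.1],          # P(w|t=0)
    [0.1, 0.2, 0.7],          # P(w|t=1)
]

def word_given_doc(theta_d, phi):
    """P(w|d) = sum over topics t of P(w|t) * P(t|d), computed for every word w."""
    num_words = len(phi[0])
    return [sum(theta_d[t] * phi[t][w] for t in range(len(theta_d)))
            for w in range(num_words)]

p_w_d = word_given_doc(theta_d, phi)
print(p_w_d)       # roughly [0.38, 0.34, 0.28], up to floating-point rounding
print(sum(p_w_d))  # a valid distribution sums to 1
```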

Gibbs Sampling

Gibbs sampling is an algorithm that samples each variable in turn from its conditional distribution given the current values of all the others; in the long run, the distribution of the sampled states converges to the true joint distribution. This is a somewhat abstract concept and needs a good understanding of Markov chain Monte Carlo (MCMC) and Bayes’ theorem.
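
For intuition, here is a minimal sketch of collapsed Gibbs sampling for LDA on a tiny corpus of word ids. It is an illustration under assumed hyperparameters (`alpha`, `beta`) and an assumed toy corpus, not the code that gensim runs; `gibbs_lda` is a hypothetical name.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    n_dt = [[0] * num_topics for _ in docs]               # topic counts per document
    n_tw = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    n_t = [0] * num_topics                                # total tokens per topic
    z = []                                                # topic of every token
    for d, doc in enumerate(docs):
        z.append([rng.randrange(num_topics) for _ in doc])  # random initial topics
        for w, t in zip(doc, z[d]):
            n_dt[d][t] += 1
            n_tw[t][w] += 1
            n_t[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token; all other assignments stay fixed
                t = z[d][i]
                n_dt[d][t] -= 1
                n_tw[t][w] -= 1
                n_t[t] -= 1
                # Conditional probability of each topic, up to a constant factor
                weights = [(n_dt[d][k] + alpha) * (n_tw[k][w] + beta)
                           / (n_t[k] + vocab_size * beta)
                           for k in range(num_topics)]
                t = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = t
                n_dt[d][t] += 1
                n_tw[t][w] += 1
                n_t[t] += 1
    # Smoothed estimates of P(t|d) and P(w|t) from the final counts
    theta = [[(c + alpha) / (len(doc) + num_topics * alpha) for c in n_dt[d]]
             for d, doc in enumerate(docs)]
    phi = [[(c + beta) / (n_t[k] + vocab_size * beta) for c in n_tw[k]]
           for k in range(num_topics)]
    return theta, phi

# Assumed toy corpus: word ids 0-2 and 3-5 tend to occur together
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 1], [5, 3, 4, 4]]
theta, phi = gibbs_lda(docs, num_topics=2, vocab_size=6)
for row in theta:
    print([round(p, 2) for p in row])
```

Each sweep resamples one word's topic at a time while holding every other assignment fixed, which is the conditional sampling described above.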

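In other words, we are trying to find the two distributions, ϴ and Ф, that maximize the likelihood of our data.

We do not have to worry about most of the mathematics, as that is handled by gensim. Note that gensim’s LDA models estimate these distributions with online variational Bayes rather than Gibbs sampling, but the goal is the same.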

The goal of the approach is to learn two matrices: the document-topic matrix ϴ and the topic-word matrix Ф.

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora.

In [ ]:
!pip install pyLDAvis==2.1.2
Collecting pyLDAvis==2.1.2
  Downloading pyLDAvis-2.1.2.tar.gz (1.6 MB)
     |████████████████████████████████| 1.6 MB 13.3 MB/s 
Requirement already satisfied: wheel>=0.23.0 in /usr/local/lib/python3.7/dist-packages (from pyLDAvis==2.1.2) (0.37.0)
Requirement already satisfied: numpy>=1.9.2 in /usr/local/lib/python3.7/dist-packages (from pyLDAvis==2.1.2) (1.19.5)
Requirement already satisfied: scipy>=0.18.0 in /usr/local/lib/python3.7/dist-packages (from pyLDAvis==2.1.2) (1.4.1)
Requirement already satisfied: pandas>=0.17.0 in /usr/local/lib/python3.7/dist-packages (from pyLDAvis==2.1.2) (1.1.5)
Requirement already satisfied: joblib>=0.8.4 in /usr/local/lib/python3.7/dist-packages (from pyLDAvis==2.1.2) (1.1.0)
Requirement already satisfied: jinja2>=2.7.2 in /usr/local/lib/python3.7/dist-packages (from pyLDAvis==2.1.2) (2.11.3)
Requirement already satisfied: numexpr in /usr/local/lib/python3.7/dist-packages (from pyLDAvis==2.1.2) (2.7.3)
Requirement already satisfied: pytest in /usr/local/lib/python3.7/dist-packages (from pyLDAvis==2.1.2) (3.6.4)
Requirement already satisfied: future in /usr/local/lib/python3.7/dist-packages (from pyLDAvis==2.1.2) (0.16.0)
Collecting funcy
  Downloading funcy-1.16-py2.py3-none-any.whl (32 kB)
Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.7/dist-packages (from jinja2>=2.7.2->pyLDAvis==2.1.2) (2.0.1)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.17.0->pyLDAvis==2.1.2) (2018.9)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.17.0->pyLDAvis==2.1.2) (2.8.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas>=0.17.0->pyLDAvis==2.1.2) (1.15.0)
Requirement already satisfied: py>=1.5.0 in /usr/local/lib/python3.7/dist-packages (from pytest->pyLDAvis==2.1.2) (1.11.0)
Requirement already satisfied: attrs>=17.4.0 in /usr/local/lib/python3.7/dist-packages (from pytest->pyLDAvis==2.1.2) (21.2.0)
Requirement already satisfied: atomicwrites>=1.0 in /usr/local/lib/python3.7/dist-packages (from pytest->pyLDAvis==2.1.2) (1.4.0)
Requirement already satisfied: more-itertools>=4.0.0 in /usr/local/lib/python3.7/dist-packages (from pytest->pyLDAvis==2.1.2) (8.11.0)
Requirement already satisfied: pluggy<0.8,>=0.5 in /usr/local/lib/python3.7/dist-packages (from pytest->pyLDAvis==2.1.2) (0.7.1)
Requirement already satisfied: setuptools in /usr/local/lib/python3.7/dist-packages (from pytest->pyLDAvis==2.1.2) (57.4.0)
Building wheels for collected packages: pyLDAvis
  Building wheel for pyLDAvis (setup.py) ... done
  Created wheel for pyLDAvis: filename=pyLDAvis-2.1.2-py2.py3-none-any.whl size=97738 sha256=22e8603d34d3388bc7626c6e9e780d314c4088188743f93b449475cbdb27d073
  Stored in directory: /root/.cache/pip/wheels/3b/fb/41/e32e5312da9f440d34c4eff0d2207b46dc9332a7b931ef1e89
Successfully built pyLDAvis
Installing collected packages: funcy, pyLDAvis
Successfully installed funcy-1.16 pyLDAvis-2.1.2
In [ ]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train', shuffle = True)
newsgroups_test = fetch_20newsgroups(subset='test', shuffle = True)
In [ ]:
len(newsgroups_train.data)
Out[ ]:
11314
In [ ]:
print(newsgroups_train.target_names)
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
In [ ]:
# tf-idf weights each term by how distinctive it is for a document; these weights can be fed to the model in place of raw counts
In [ ]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import numpy as np
import nltk

np.random.seed(2018)  # seed NumPy's global random generator
nltk.download('wordnet')  # WordNet data is needed by WordNetLemmatizer
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
Out[ ]:
True
In [ ]:
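# Build the stemmer and lemmatizer once instead of once per token
stemmer = SnowballStemmer(language='english')
lemmatizer = WordNetLemmatizer()

def lemmatize_stemming(text):
    # Lemmatize the token as a verb, then reduce it to its Snowball stem
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

def preprocess(text):
    # Tokenize and lowercase, then drop stopwords and tokens of 3 characters or fewer
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result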
In [ ]:
from tqdm import tqdm
In [ ]:
articles = newsgroups_train.data
processed_docs=[]
for i in tqdm(range(len(articles))):
  processed_docs.append(preprocess(articles[i]))
100%|██████████| 11314/11314 [00:48<00:00, 235.37it/s]
In [ ]:
print(processed_docs[:2])
[['lerxst', 'thing', 'subject', 'nntp', 'post', 'host', 'organ', 'univers', 'maryland', 'colleg', 'park', 'line', 'wonder', 'enlighten', 'door', 'sport', 'look', 'late', 'earli', 'call', 'bricklin', 'door', 'small', 'addit', 'bumper', 'separ', 'rest', 'bodi', 'know', 'tellm', 'model', 'engin', 'spec', 'year', 'product', 'histori', 'info', 'funki', 'look', 'mail', 'thank', 'bring', 'neighborhood', 'lerxst'], ['guykuo', 'carson', 'washington', 'subject', 'clock', 'poll', 'final', 'summari', 'final', 'clock', 'report', 'keyword', 'acceler', 'clock', 'upgrad', 'articl', 'shelley', 'qvfo', 'innc', 'organ', 'univers', 'washington', 'line', 'nntp', 'post', 'host', 'carson', 'washington', 'fair', 'number', 'brave', 'soul', 'upgrad', 'clock', 'oscil', 'share', 'experi', 'poll', 'send', 'brief', 'messag', 'detail', 'experi', 'procedur', 'speed', 'attain', 'rat', 'speed', 'card', 'adapt', 'heat', 'sink', 'hour', 'usag', 'floppi', 'disk', 'function', 'floppi', 'especi', 'request', 'summar', 'day', 'network', 'knowledg', 'base', 'clock', 'upgrad', 'haven', 'answer', 'poll', 'thank', 'guykuo', 'washington']]
In [ ]:
dictionary = gensim.corpora.Dictionary(processed_docs)
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break
0 addit
1 bodi
2 bricklin
3 bring
4 bumper
5 call
6 colleg
7 door
8 earli
9 engin
10 enlighten
In [ ]:
# Drop tokens that appear in fewer than 15 documents or in more than 10% of all documents,
# then keep at most the 100000 most frequent of the remaining tokens
dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n=100000)
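
To see what these three thresholds do, here is a small standard-library sketch of the same kind of filtering on a toy corpus. `filter_extremes_sketch` is a hypothetical helper that mirrors the parameter meanings as we understand them (document counts for `no_below`, a fraction of documents for `no_above`); it is not gensim's implementation.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below, no_above, keep_n):
    """Return the tokens that survive document-frequency filtering."""
    num_docs = len(docs)
    # Number of documents each token appears in (not its total count)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    kept = [tok for tok, df in doc_freq.items()
            if no_below <= df <= no_above * num_docs]
    kept.sort(key=lambda tok: doc_freq[tok], reverse=True)  # most frequent first
    return set(kept[:keep_n])

docs = [["space", "nasa", "orbit"],
        ["space", "game"],
        ["space", "game", "team"],
        ["space", "team"]]
kept = filter_extremes_sketch(docs, no_below=2, no_above=0.5, keep_n=10)
print(kept)
```

Here "space" is dropped because it appears in every document, "nasa" and "orbit" are dropped because they appear in only one, and "game" and "team" are kept.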

Using Bag of Words

In [ ]:
# Use Gensim Bag of Words
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus[1]
Out[ ]:
[(24, 1),
 (25, 1),
 (26, 1),
 (27, 1),
 (28, 1),
 (29, 1),
 (30, 1),
 (31, 1),
 (32, 2),
 (33, 5),
 (34, 1),
 (35, 1),
 (36, 1),
 (37, 1),
 (38, 2),
 (39, 1),
 (40, 2),
 (41, 2),
 (42, 1),
 (43, 1),
 (44, 1),
 (45, 1),
 (46, 1),
 (47, 1),
 (48, 1),
 (49, 1),
 (50, 1),
 (51, 3),
 (52, 1),
 (53, 1),
 (54, 1),
 (55, 1),
 (56, 1),
 (57, 1),
 (58, 1),
 (59, 1),
 (60, 1),
 (61, 2),
 (62, 1),
 (63, 1),
 (64, 3),
 (65, 1),
 (66, 4)]
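
Each pair above is (token id, count in the document). A minimal standard-library sketch of the same idea follows; the helper names and the order in which ids are assigned are assumptions for illustration, so the ids will not match gensim's.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc_to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, ignoring tokens outside the vocabulary."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

docs = [["clock", "upgrad", "clock"], ["poll", "clock"]]
token2id = build_vocab(docs)
bow = doc_to_bow(["clock", "clock", "poll", "unknown"], token2id)
print(bow)  # [(0, 2), (2, 1)]
```

Word order is discarded and unknown tokens are skipped, which is exactly the “bag of words” view of a document.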
In [ ]:
document_num = 20
bow_doc_x = bow_corpus[document_num]

for i in range(len(bow_doc_x)):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_x[i][0], 
                                                     dictionary[bow_doc_x[i][0]], 
                                                     bow_doc_x[i][1]))
Word 18 ("rest") appears 1 time.
Word 166 ("clear") appears 1 time.
Word 336 ("refer") appears 1 time.
Word 350 ("true") appears 1 time.
Word 391 ("technolog") appears 1 time.
Word 437 ("christian") appears 1 time.
Word 453 ("exampl") appears 1 time.
Word 476 ("jew") appears 1 time.
Word 480 ("lead") appears 1 time.
Word 482 ("littl") appears 3 time.
Word 520 ("wors") appears 2 time.
Word 721 ("keith") appears 3 time.
Word 732 ("punish") appears 1 time.
Word 803 ("california") appears 1 time.
Word 859 ("institut") appears 1 time.
Word 917 ("similar") appears 1 time.
Word 990 ("allan") appears 1 time.
Word 991 ("anti") appears 1 time.
Word 992 ("arriv") appears 1 time.
Word 993 ("austria") appears 1 time.
Word 994 ("caltech") appears 2 time.
Word 995 ("distinguish") appears 1 time.
Word 996 ("german") appears 1 time.
Word 997 ("germani") appears 3 time.
Word 998 ("hitler") appears 1 time.
Word 999 ("livesey") appears 2 time.
Word 1000 ("motto") appears 2 time.
Word 1001 ("order") appears 1 time.
Word 1002 ("pasadena") appears 1 time.
Word 1003 ("pompous") appears 1 time.
Word 1004 ("popul") appears 1 time.
Word 1005 ("rank") appears 1 time.
Word 1006 ("schneider") appears 1 time.
Word 1007 ("semit") appears 1 time.
Word 1008 ("social") appears 1 time.
Word 1009 ("solntz") appears 1 time.
In [ ]:
# Takes about 2 minutes. More passes usually help the model converge, at the cost of longer training
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=8, id2word=dictionary, passes=10, workers=2)
In [ ]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))
Topic: 0 
Words: 0.008*"program" + 0.008*"avail" + 0.007*"data" + 0.007*"window" + 0.006*"server" + 0.006*"file" + 0.006*"user" + 0.006*"list" + 0.006*"chip" + 0.005*"softwar"
Topic: 1 
Words: 0.012*"armenian" + 0.011*"nasa" + 0.011*"space" + 0.007*"bike" + 0.007*"turkish" + 0.005*"ohio" + 0.005*"orbit" + 0.004*"center" + 0.004*"cleveland" + 0.004*"turk"
Topic: 2 
Words: 0.006*"israel" + 0.005*"isra" + 0.005*"jew" + 0.005*"game" + 0.004*"team" + 0.004*"arab" + 0.003*"jewish" + 0.003*"jesus" + 0.003*"play" + 0.003*"live"
Topic: 3 
Words: 0.011*"govern" + 0.005*"american" + 0.004*"public" + 0.004*"presid" + 0.004*"clinton" + 0.004*"weapon" + 0.004*"protect" + 0.004*"crime" + 0.004*"countri" + 0.003*"gun"
Topic: 4 
Words: 0.008*"wire" + 0.006*"engin" + 0.005*"space" + 0.004*"power" + 0.004*"build" + 0.004*"grind" + 0.004*"launch" + 0.004*"insur" + 0.004*"light" + 0.004*"water"
Topic: 5 
Words: 0.022*"window" + 0.021*"file" + 0.011*"program" + 0.010*"card" + 0.009*"driver" + 0.009*"imag" + 0.008*"color" + 0.007*"graphic" + 0.007*"version" + 0.007*"entri"
Topic: 6 
Words: 0.014*"drive" + 0.008*"game" + 0.006*"team" + 0.006*"scsi" + 0.006*"sale" + 0.005*"play" + 0.005*"price" + 0.005*"control" + 0.005*"hockey" + 0.004*"hard"
Topic: 7 
Words: 0.013*"christian" + 0.008*"exist" + 0.006*"moral" + 0.006*"jesus" + 0.006*"religion" + 0.006*"bibl" + 0.005*"word" + 0.005*"atheist" + 0.005*"claim" + 0.004*"life"

Using Tf-idf

In [ ]:
from gensim import corpora, models
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]
from pprint import pprint
for doc in corpus_tfidf:
    pprint(doc)
    break

# instead of using bag of words, we send tf-idf as an input to the model
[(0, 0.1710625762591397),
 (1, 0.17368779842655097),
 (2, 0.155419489915073),
 (3, 0.2957445074316908),
 (4, 0.11836602497977548),
 (5, 0.15695023912113057),
 (6, 0.40285625492091776),
 (7, 0.174776788688058),
 (8, 0.12706156155902396),
 (9, 0.272611632455378),
 (10, 0.17017466234223827),
 (11, 0.1442824163480051),
 (12, 0.1846058258449959),
 (13, 0.24150470522322076),
 (14, 0.16783948753534583),
 (15, 0.29020607000815246),
 (16, 0.19136864594649777),
 (17, 0.15384172147015857),
 (18, 0.16526234485610986),
 (19, 0.1878756152537469),
 (20, 0.14980136010861542),
 (21, 0.21700814536427218),
 (22, 0.1996081806413184),
 (23, 0.1419842765585958)]
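
The weights above come from the tf-idf transform. Here is a small sketch of one common variant (raw count times log2(N / document frequency), then scaling each document to unit length), which is our understanding of the TfidfModel defaults; treat that as an assumption. `tfidf_sketch` and the toy corpus are made up for illustration.

```python
import math
from collections import Counter

def tfidf_sketch(bow_corpus):
    """Weight (token_id, count) pairs by log2(N / df) and scale each document to unit length."""
    num_docs = len(bow_corpus)
    doc_freq = Counter(tok for doc in bow_corpus for tok, _ in doc)
    weighted = []
    for doc in bow_corpus:
        vec = [(tok, count * math.log2(num_docs / doc_freq[tok])) for tok, count in doc]
        vec = [(tok, w) for tok, w in vec if w > 0]  # tokens found in every document get weight 0
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(tok, w / norm) for tok, w in vec] if norm else [])
    return weighted

# Assumed toy corpus of (token_id, count) pairs
bow = [[(0, 2), (1, 1)], [(0, 1), (2, 3)], [(1, 1), (2, 1), (3, 1)]]
weighted = tfidf_sketch(bow)
for doc in weighted:
    print(doc)
```

In the last toy document, token 3 appears in no other document, so it receives the largest weight even though all three counts are equal.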
In [ ]:
# 1m 25s
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=8, id2word=dictionary, passes=10, workers=4)
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))
Topic: 0 Word: 0.007*"christian" + 0.005*"jesus" + 0.004*"bibl" + 0.004*"moral" + 0.003*"religion" + 0.003*"atheist" + 0.003*"church" + 0.003*"keith" + 0.003*"exist" + 0.003*"christ"
Topic: 1 Word: 0.005*"intercon" + 0.005*"team" + 0.005*"amanda" + 0.004*"jupit" + 0.004*"captain" + 0.004*"comet" + 0.004*"devil" + 0.004*"georgia" + 0.003*"psuvm" + 0.003*"utexa"
Topic: 2 Word: 0.006*"alaska" + 0.005*"kaldi" + 0.005*"engr" + 0.005*"hamburg" + 0.005*"higgin" + 0.004*"sdsu" + 0.004*"bontchev" + 0.004*"clarkson" + 0.004*"fnal" + 0.004*"informatik"
Topic: 3 Word: 0.004*"encrypt" + 0.003*"clipper" + 0.003*"chip" + 0.003*"govern" + 0.003*"pitt" + 0.002*"netcom" + 0.002*"bank" + 0.002*"clinton" + 0.002*"secur" + 0.002*"gordon"
Topic: 4 Word: 0.006*"window" + 0.004*"file" + 0.004*"card" + 0.003*"drive" + 0.003*"driver" + 0.003*"program" + 0.003*"graphic" + 0.003*"softwar" + 0.002*"disk" + 0.002*"video"
Topic: 5 Word: 0.009*"stratus" + 0.007*"udel" + 0.007*"ohio" + 0.007*"jaeger" + 0.006*"salmon" + 0.006*"adob" + 0.005*"gregg" + 0.005*"muenchen" + 0.005*"magnus" + 0.005*"catbyt"
Topic: 6 Word: 0.004*"team" + 0.003*"bike" + 0.003*"game" + 0.002*"player" + 0.002*"space" + 0.002*"nasa" + 0.002*"basebal" + 0.002*"hockey" + 0.002*"insur" + 0.002*"play"
Topic: 7 Word: 0.009*"israel" + 0.008*"isra" + 0.007*"armenian" + 0.006*"arab" + 0.005*"turkish" + 0.004*"jew" + 0.004*"dseg" + 0.004*"columbia" + 0.003*"cunixb" + 0.003*"palestinian"

Testing

In [ ]:
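# Put the text to classify here; an empty string yields an empty bag of words
unseen_document = ''
bow_vector = dictionary.doc2bow(preprocess(unseen_document))
# Score the document and print the topics with the same model (the bag-of-words LDA)
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))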
Score: 0.5772048234939575	 Topic: 0.023*"death" + 0.022*"probe" + 0.018*"rise" + 0.017*"continu" + 0.017*"polic"
Score: 0.26278024911880493	 Topic: 0.051*"govt" + 0.028*"plan" + 0.023*"fund" + 0.019*"urg" + 0.018*"council"
Score: 0.020003829151391983	 Topic: 0.061*"polic" + 0.043*"kill" + 0.025*"attack" + 0.017*"investig" + 0.017*"iraq"
Score: 0.020002495497465134	 Topic: 0.021*"talk" + 0.018*"dead" + 0.014*"north" + 0.014*"question" + 0.013*"leav"
Score: 0.020001934841275215	 Topic: 0.020*"council" + 0.015*"plan" + 0.015*"mayor" + 0.015*"support" + 0.012*"hop"
Score: 0.02000165358185768	 Topic: 0.028*"report" + 0.019*"strike" + 0.017*"warn" + 0.014*"drug" + 0.012*"time"
Score: 0.020001377910375595	 Topic: 0.045*"charg" + 0.040*"court" + 0.036*"face" + 0.022*"murder" + 0.017*"case"
Score: 0.020001336932182312	 Topic: 0.027*"water" + 0.019*"miss" + 0.017*"industri" + 0.017*"public" + 0.011*"spark"
Score: 0.020001189783215523	 Topic: 0.019*"lead" + 0.015*"prison" + 0.014*"final" + 0.014*"fight" + 0.013*"tour"
Score: 0.020001154392957687	 Topic: 0.017*"health" + 0.016*"hospit" + 0.015*"call" + 0.014*"servic" + 0.013*"hous"

Visualization

In [ ]:
import pickle
import pyLDAvis
import pyLDAvis.gensim  # module name in pyLDAvis 2.1.2, the version installed above
# Visualize the topics
pyLDAvis.enable_notebook()
num_topics = 8
LDAvis_data_filepath = './ldavis_prepared_' + str(num_topics)
# Preparing the visualization is a bit time consuming; set this to False
# to reuse the data saved on disk by an earlier run
run_prepare = True
if run_prepare:
    LDAvis_prepared = pyLDAvis.gensim.prepare(lda_model, bow_corpus, dictionary)
    with open(LDAvis_data_filepath, 'wb') as f:
        pickle.dump(LDAvis_prepared, f)
# load the pre-prepared pyLDAvis data from disk
with open(LDAvis_data_filepath, 'rb') as f:
    LDAvis_prepared = pickle.load(f)
pyLDAvis.save_html(LDAvis_prepared, LDAvis_data_filepath + '.html')
# topics in terms of sizes
LDAvis_prepared
/usr/local/lib/python3.7/dist-packages/past/types/oldstr.py:5: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
  from collections import Iterable
Out[ ]:
[Interactive pyLDAvis output: an Intertopic Distance Map (via multidimensional scaling) on axes PC1 and PC2 showing the marginal topic distribution for topics 1 to 8, and a bar chart of the Top-30 Most Salient Terms comparing overall term frequency with estimated term frequency within the selected topic. A slider adjusts the relevance metric λ between 0.0 and 1.0.]

  1. saliency(term w) = frequency(w) * [sum_t p(t | w) * log(p(t | w)/p(t))] for topics t; see Chuang et al. (2012)
  2. relevance(term w | topic t) = λ * p(w | t) + (1 - λ) * p(w | t)/p(w); see Sievert & Shirley (2014)

Assumptions made by LDA:

  • Each document is just a collection of words, or a “bag of words”. Thus, the order of the words and their grammatical roles (subject, object, verb, …) are not considered in the model.
  • Words like am/is/are/of/a/the/but/… don’t carry any information about the “topics” and can therefore be eliminated from the documents as a preprocessing step. In fact, we can eliminate words that occur in at least 80% to 90% of the documents with little loss of information. For example, if our corpus contains only medical documents, words like human, body, health, etc. might be present in most of the documents and hence can be removed, as they don’t add any specific information that would make a document stand out.
  • We know beforehand how many topics we want: ‘k’ is decided in advance.
  • When updating the topic assignment of the current word, all other topic assignments are assumed to be correct; the current word’s assignment is then updated using our model of how documents are generated.

In [ ]:
# some links and datasets for topic modeling on social media data
# https://medium.com/aryma-labs/learn-how-we-used-topic-modeling-to-help-a-client-find-hostile-topics-among-various-social-media-de6a6003709f
# https://lazarinastoy.com/topic-modelling-limitations-short-text/
# https://www.kaggle.com/search?q=social+media
# https://www.kaggle.com/eliasdabbas/extract-entities-from-social-media-posts/data