from IPython.display import HTML, display
# Wrap long output lines in the notebook so printed text does not overflow
def set_css():
    display(HTML('''
    <style>
        pre {
            white-space: pre-wrap;
        }
    </style>
    '''))
get_ipython().events.register('pre_run_cell', set_css)
In this part we will cover normalization techniques such as stemming and lemmatization, as provided by popular Python NLP libraries, for English and some non-English languages.
import spacy
nlp = spacy.load('en') # load the default English model (spaCy 2.x shortcut) with its pipeline components - tagger, parser, ner
doc = nlp("I am reading a book")
token = doc[2] # Reading
nlp.vocab.morphology.tag_map[token.tag_]
token.lemma_
doc = nlp("I read a book")
token = doc[1] # Read
nlp.vocab.morphology.tag_map[token.tag_]
token.lemma_
To understand the various morphological features (e.g. 'VerbForm'), see https://universaldependencies.org/u/feat/index.html
More examples at: https://spacy.io/usage/linguistic-features#morphology
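As a quick illustration (a minimal sketch reusing the spaCy 2.x tag_map lookup shown above; in spaCy 3.x you would use token.morph instead), we can print the morphological features for every token in a doc:
# Print the tag, lemma and tag-map features of each token (spaCy 2.x API)
doc = nlp("She was reading two books")
for token in doc:
    features = nlp.vocab.morphology.tag_map.get(token.tag_, {})
    print(token.text, token.tag_, token.lemma_, features)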
Word normalization is the task of putting words/tokens into a standard format, choosing a single normal form for words with multiple forms like USA and US, or uh-huh and uhhuh. This standardization can be valuable despite the spelling information that is lost in the normalization process. Libraries being used: nltk, spacy
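A minimal sketch of this idea, using a small hand-built lookup table (the entries here are purely illustrative) that maps variant spellings to a single canonical form:
# Hypothetical normalization map - variants on the left, chosen normal form on the right
norm_map = {"usa": "us", "u.s.a.": "us", "uh-huh": "uhhuh"}
def normalize(token):
    token = token.lower()
    return norm_map.get(token, token)
print([normalize(t) for t in ["USA", "uh-huh", "book"]])  # ['us', 'uhhuh', 'book']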
Eg:
Lowercasing ALL your text data - an easy but essential step for normalization
s1 = "Cat"
s2 = "cat"
s3 = "caT"
print(s1.lower())
print(s2.lower())
print(s3.lower())
# You can iterate over the corpus and apply the .lower() method to each token
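For example, a tiny sketch with made-up tokens:
# Lowercase every token in a small illustrative corpus
tokens = ["Cat", "DOG", "fIsh"]
print([t.lower() for t in tokens])  # ['cat', 'dog', 'fish']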
sent = "There are fairly good number of registrations for the conference. More people should be registering in the upcoming days"
import nltk
from nltk.tokenize import word_tokenize
# Punkt tokenizer models - pre-trained data NLTK uses to split text into sentences/tokens
nltk.download('punkt')
# Tokenize the sentence
sent = word_tokenize(sent)
print(sent)
# Remove punctuation by keeping only purely alphabetic tokens
def remove_punct(token):
    return [word for word in token if word.isalpha()]
sent = remove_punct(sent)
print(sent)
The naive version of morphological analysis is called stemming.
from nltk.stem import PorterStemmer
ps = PorterStemmer()
# Using the .stem() function for each word in the sentence
ps_stem_sent = [ps.stem(words_sent) for words_sent in sent]
print(ps_stem_sent)
Stemming a word or sentence may result in tokens that are not actual words.
This is due to over-stemming (unrelated words are cut down to the same stem) and under-stemming (related words are not reduced to a common stem).
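A small illustration of both effects with the Porter stemmer (these are standard textbook examples):
from nltk.stem import PorterStemmer
ps = PorterStemmer()
# Over-stemming: three different words collapse to the same non-word stem 'univers'
print(ps.stem("universal"), ps.stem("university"), ps.stem("universe"))
# Under-stemming: related words keep different stems
print(ps.stem("data"), ps.stem("datum"))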
When compared to the Porter Stemmer, the Snowball Stemmer can handle non-English words too. Since it supports several other languages, it is often referred to as a multi-lingual stemmer.
import nltk
from nltk.stem.snowball import SnowballStemmer
#the stemmer requires a language parameter
snow_stemmer = SnowballStemmer(language='english')
# Using the .stem() function for each word in the sentence
ss_stem_sent = [snow_stemmer.stem(words_sent) for words_sent in sent]
print(ss_stem_sent)
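Because Snowball supports other languages, you can list them and build a stemmer for, say, French (a small sketch; the exact stems returned depend on the Snowball rules for that language):
from nltk.stem.snowball import SnowballStemmer
# Languages supported by the Snowball stemmer
print(SnowballStemmer.languages)
# Stemming a few French words
french_stemmer = SnowballStemmer(language='french')
print([french_stemmer.stem(w) for w in ["manger", "mangeait", "mangeront"]])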
WordNet is a lexical database (an English dictionary) that ships as part of the Natural Language Toolkit (NLTK) for Python, an extensive library built to make Natural Language Processing (NLP) easy.
WordNet has been used for a number of purposes in information systems, including word-sense disambiguation, information retrieval, automatic text classification, automatic text summarization, machine translation and even automatic crossword puzzle generation.
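As a quick taste of what WordNet provides (a minimal sketch using NLTK's wordnet corpus reader):
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet
# Look up the senses (synsets) of a word and print the first definition
synsets = wordnet.synsets("book")
print(len(synsets))
print(synsets[0].name(), "-", synsets[0].definition())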
import nltk
nltk.download('wordnet')
nltk.download('punkt')
# The perceptron part-of-speech tagger implements part-of-speech tagging using the averaged, structured perceptron algorithm
nltk.download('averaged_perceptron_tagger')
from nltk.stem import WordNetLemmatizer
# Create a lemmatizer object
lemmatizer = WordNetLemmatizer()
# Lemmatize Single Word
# Use lem_object.lemmatize()
print(lemmatizer.lemmatize("bats"))
print(lemmatizer.lemmatize("are"))
print(lemmatizer.lemmatize("feet"))
sentence = "The striped bats are hanging on their feet"
# Tokenize: Split the sentence into words
word_list = nltk.word_tokenize(sentence)
print(word_list)
# Lemmatize list of words and join
print("+==============================+")
lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
print(lemmatized_output)
Notice how 'hanging' wasn't changed to 'hang' and 'are' wasn't changed to 'be'. To improve this, we can pass the POS tag along with the word.
# The lemma differs depending on the POS tag passed (verb, noun, adjective)
print(lemmatizer.lemmatize("stripes", 'v'))
print(lemmatizer.lemmatize("stripes", 'n'))
print(lemmatizer.lemmatize("striped", 'a'))
print(nltk.pos_tag(nltk.word_tokenize(sentence)))
You can use https://stackoverflow.com/questions/15388831/what-are-all-possible-pos-tags-of-nltk to find out which POS tag corresponds to which part of speech.
# Simple helper that maps an NLTK POS tag to the corresponding WordNet POS tag
from nltk.corpus import wordnet
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1].upper()
    tag_dict = {"JJ": wordnet.ADJ,
                "NNS": wordnet.NOUN,
                "VBP": wordnet.VERB,
                "VBG": wordnet.VERB,
                }
    # Default to NOUN for any tag not listed above
    return tag_dict.get(tag, wordnet.NOUN)
# Lemmatize a sentence with the appropriate POS tag
sentence = "The striped bats are hanging on their feet"
print([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(sentence)])
Notice how using POS tags improved the normalization.
import spacy
# Initialize spacy 'en' model, keeping only tagger component needed for lemmatization
nlp = spacy.load('en', disable=['parser', 'ner'])
sentence = "The striped bats are hanging on their feet"
# Parse the sentence using the loaded 'en' model object `nlp`
doc = nlp(sentence)
# Extract the lemma for each token using "token.lemma_"
" ".join([token.lemma_ for token in doc])
spaCy replaces the lemma of any pronoun with -PRON- (this is the behaviour of spaCy 2.x).
The spaCy library is one of the most popular NLP libraries along with NLTK. The basic difference between the two is that NLTK contains a wide variety of algorithms for solving a given problem, whereas spaCy ships a single, carefully chosen algorithm for each problem.
Non-English languages do not always have implementations in popular libraries like nltk and spacy. For Indic languages, some available libraries are demonstrated below.
!pip install stanfordnlp
import stanfordnlp
stanfordnlp.download('hi')
nlp = stanfordnlp.Pipeline(lang="hi")
The key argument to stanfordnlp.Pipeline used here is lang, which selects the language model to load (here 'hi' for Hindi).
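A sketch of a more restricted pipeline (assuming stanfordnlp's processors keyword, which limits the pipeline to the listed processors; adjust for your version):
# Build a Hindi pipeline with only the processors needed for lemmatization
nlp_lemma_only = stanfordnlp.Pipeline(lang="hi", processors="tokenize,pos,lemma")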
# Install pytorch 1.4.0 to avoid an error when running the pipeline below
!pip install torch==1.4.0
doc = nlp("मैंने पिछले महीने भारत की यात्रा की थी। मैं अभी भारत यात्रा कर रहा हूँ|")
for word in doc.sentences[0].words:
    # Access attributes using word.text or word.lemma
    print("{} --> {}".format(word.text, word.lemma))
Notice: depending on the part of speech, the lemma produced for a word changes.
for word in doc.sentences[1].words:
    print("{} --> {}".format(word.text, word.lemma))
Notice how the lemmas change again for the second (present-tense) sentence.
!pip install sanstem
from sanstem import SanskritStemmer
# Create a SanskritStemmer object
stemmer = SanskritStemmer()
inflected_noun = 'गजेन'
stemmed_noun = stemmer.noun_stem(inflected_noun)
print(stemmed_noun)
# output : गज्
inflected_verb = 'गच्छामि'
stemmed_verb = stemmer.verb_stem(inflected_verb)
print(stemmed_verb)
# output : गच्छ्