Frequency Analysis using NLTK

NLTK, the Natural Language Toolkit for Python, provides a number of ways to analyze the frequency of words in a text.
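
The snippets below all assume moby_dick is a list of tokens and that FreqDist has been imported. A minimal setup sketch, assuming you want the copy of Moby Dick that ships with NLTK's Gutenberg corpus:

import nltk
from nltk import FreqDist

nltk.download('gutenberg') # one-time download of the corpus data
moby_dick = nltk.corpus.gutenberg.words('melville-moby_dick.txt') # tokenized text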

How many words in the text?

fdist = FreqDist(moby_dick)
fdist.N() # 260,819
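
Note that N() counts every token, duplicates and punctuation included. For the size of the vocabulary (distinct tokens) instead:

len(fdist) # distinct tokens; fdist.B() returns the same number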

All the words in the text?

fdist = FreqDist(moby_dick)
fdist.keys() # All distinct tokens; not frequency-ordered in NLTK 3
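
In NLTK 3, most_common() is the call that returns tokens ranked by descending count. For example, with the Gutenberg tokenization above:

fdist.most_common(3)
# [(',', 18713), ('the', 13721), ('.', 6862)]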

How many occurrences of a word?

fdist = FreqDist(moby_dick)
fdist['whale'] # 906

Frequency of a word?

fdist = FreqDist(moby_dick)
fdist.freq('whale') # 0.0035
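
freq() is just the word's count divided by the total number of tokens, so the result can be checked by hand:

fdist['whale'] / fdist.N() # 906 / 260819, roughly 0.0035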

Most frequent word?

fdist = FreqDist(moby_dick)
fdist.max() # ',' -> punctuation counts as a token
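
If punctuation should not count, one way to find the most frequent actual word is to walk the ranked list and take the first alphabetic entry:

top_word = next(w for w, count in fdist.most_common() if w.isalpha())
# 'the'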

Plot the frequencies?

fdist = FreqDist(moby_dick)
fdist.plot(50) # Plot the 50 most frequent tokens and their counts (requires matplotlib)
fdist.plot(50, cumulative=True) # Same, with cumulative frequency

Words of a certain length?

long_words = [w for w in moby_dick if len(w) == 10] # tokens exactly 10 characters long
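
This keeps every occurrence, so common words appear many times. To list each qualifying word once, deduplicate with set() first:

distinct_long = sorted(w for w in set(moby_dick) if len(w) > 15) # each word once, alphabetized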

Common long words?

fdist = FreqDist(moby_dick)
words = [w for w in fdist.keys() if len(w) > 10 and fdist[w] > 10]

Long words and their frequencies?

fdist = FreqDist(moby_dick)
words = [(w, fdist[w]) for w in fdist.keys() if len(w) > 10 and fdist[w] > 10]
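
To read the most frequent of these first, sort the pairs by their count:

words.sort(key=lambda pair: pair[1], reverse=True) # highest counts first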

Unique words?

words_only = [w for w in moby_dick if w.isalpha()] # drop punctuation and numbers
unique = {w.lower() for w in words_only} # case-fold, then deduplicate
word_count = len(unique)
