## Frequency Analysis using NLTK

NLTK, the Natural Language Toolkit for Python, provides a number of ways to analyze text. The snippets below build a `FreqDist` (frequency distribution) over the tokens of Moby Dick.

## How many words in the text?

```python
fdist = FreqDist(moby_dick)
fdist.N()  # 260,819 total tokens
```

## All the words in the text?

```python
fdist = FreqDist(moby_dick)
fdist.keys()  # the distinct tokens; in NLTK 3+ NOT ordered by frequency
              # use fdist.most_common() for descending-frequency order
```

## How many occurrences of a word?

```python
fdist = FreqDist(moby_dick)
fdist['whale']  # 906
```

## Frequency of a word?

```python
fdist = FreqDist(moby_dick)
fdist.freq('whale')  # 0.0035 (906 / 260,819)
```

## Most frequent word?

```python
fdist = FreqDist(moby_dick)
fdist.max()  # ',' -- punctuation counts as a token
```
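Beyond `max()`, `FreqDist.most_common(n)` returns the top-n token/count pairs in descending order. A small sketch on a toy token list rather than the full novel:

```python
from nltk import FreqDist

tokens = ["the", "whale", ",", "the", ",", ","]  # toy example
fdist = FreqDist(tokens)
print(fdist.most_common(2))  # [(',', 3), ('the', 2)]
print(fdist.max())           # ','
```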

## Plot the frequencies?

```python
fdist = FreqDist(moby_dick)
fdist.plot(50)                   # frequency of the 50 most common tokens
fdist.plot(50, cumulative=True)  # cumulative frequency of the same tokens
```

## Words of a certain length?

```python
long_words = [w for w in moby_dick if len(w) == 10]
```

## Common long words?

```python
fdist = FreqDist(moby_dick)
words = [w for w in fdist.keys() if len(w) > 10 and fdist[w] > 10]
```

## Long words and their frequencies?

```python
fdist = FreqDist(moby_dick)
words = [(w, fdist[w]) for w in fdist.keys() if len(w) > 10 and fdist[w] > 10]
```

## Unique words?

```python
words_only = [w for w in moby_dick if w.isalpha()]  # drop punctuation and numbers
unique = set(w.lower() for w in words_only)         # case-insensitive vocabulary
word_count = len(unique)
```
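The same normalization (alphabetic, lowercased tokens) can feed back into a `FreqDist`, so punctuation no longer dominates the counts as it did with `fdist.max()`. A sketch on a toy token list:

```python
from nltk import FreqDist

tokens = ["The", "whale", ",", "the", "."]  # toy example
words_only = [w.lower() for w in tokens if w.isalpha()]
fdist = FreqDist(words_only)
print(fdist.most_common(1))  # [('the', 2)] -- a word, not ','
print(len(fdist))            # 2 unique normalized words
```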