NLTK, the Natural Language Toolkit for Python, provides a number of ways to analyze text.
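The recipes below assume `moby_dick` is an already-tokenized text (in the NLTK book this is `text1`, available via `from nltk.book import text1` after `nltk.download('book')`) and that `FreqDist` is imported from `nltk`. A minimal, self-contained sketch on a toy token list, standing in for the real text:

```python
from nltk import FreqDist

# Toy stand-in for the tokenized moby_dick text used in the recipes below
tokens = ["the", "whale", ",", "the", "sea", ",", "the", "whale"]

fdist = FreqDist(tokens)
print(fdist.N())       # total token count: 8
print(fdist["whale"])  # occurrences of 'whale': 2
```

The same calls work unchanged on the full text once it is loaded.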
How many words in the text?
fdist = FreqDist(moby_dick)
fdist.N() # 260,819
All the words in the text?
fdist = FreqDist(moby_dick)
fdist.keys() # Vocabulary only (NOT frequency-ordered in NLTK 3)
fdist.most_common() # (word, count) pairs, most frequent first
How many occurrences of a word?
fdist = FreqDist(moby_dick)
fdist['whale'] # 906
Frequency of a word?
fdist = FreqDist(moby_dick)
fdist.freq('whale') # 906 / 260,819 ≈ 0.0035
Most frequent word?
fdist = FreqDist(moby_dick)
fdist.max() # ',' -> punctuation is considered a word
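Since punctuation tokens dominate the counts, the usual workaround is to filter non-alphabetic tokens (and lowercase) before building the distribution. A hedged sketch on a toy token list:

```python
from nltk import FreqDist

# Toy token list: punctuation would otherwise win fdist.max()
tokens = ["the", "whale", ",", "the", ",", ","]

# Keep only alphabetic tokens, normalized to lowercase
fdist = FreqDist(w.lower() for w in tokens if w.isalpha())
print(fdist.max())  # 'the' -- punctuation excluded
```

On the real text this reports the most frequent actual word rather than a comma.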
Plot the frequencies?
fdist = FreqDist(moby_dick)
fdist.plot(50) # Plots the 50 most frequent words vs. frequency
fdist.plot(50, cumulative=True) # Cumulative frequency of the top 50
Words of a certain length?
long_words = [w for w in moby_dick if len(w) == 10]
Common long words?
fdist = FreqDist(moby_dick)
words = [w for w in fdist.keys() if len(w) > 10 and fdist[w] > 10]
Long words and their frequencies?
fdist = FreqDist(moby_dick)
words = [(w, fdist[w]) for w in fdist.keys() if len(w) > 10 and fdist[w] > 10]
Unique words?
words_only = [w for w in moby_dick if w.isalpha()]
unique = {w.lower() for w in words_only}
word_count = len(unique)
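The unique-word recipe is plain Python (no FreqDist needed), so it can be checked on a toy sample:

```python
# Toy sample: case variants and punctuation should collapse/drop
tokens = ["The", "whale", ",", "the", "Whale", "!"]

words_only = [w for w in tokens if w.isalpha()]   # drop punctuation
unique = {w.lower() for w in words_only}          # fold case
print(len(unique))  # 2 -> {'the', 'whale'}
```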