502 pages, Paperback
First published January 1, 2009
POLONIUS: What do you read, my lord?
HAMLET: Words, words, words.

Hamlet was evidently interested in textual analysis, and if the Python Natural Language Toolkit (NLTK) had been available in Elsinore I'm sure he'd have bought this book too. I'd heard good things about it, and it doesn't disappoint: the authors have done a terrific job of combining a lot of freeware tools and resources into a neat package.

Python 2.6.6
>>> import nltk
>>> nltk.download()
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
>>> text6.generate()
Building ngram index...
SCENE 1 : Well , I see . Running away , And his nostrils
raped and his bottom burned off , And his pen -- SIR ROBIN
: We are just not used to handsome knights . Nay . Nay .
Come on . Anybody armed must go too . OFFICER # 1 : No .
Not only by surprise . Not only by surprise . Not the
Knights Who Say ' Ni '. KNIGHTS OF NI : Ni ! ARTHUR :
You know much that is . Yeah , a swallow ' s got a point .
SOLDIER #
>>>

So what else can it do? Geeks may want to skip to the example below, but here's a brief summary. The toolkit contains three kinds of materials. First, there's a well-selected set of texts, packaged up so that they can easily be used. Some of them are listed above; there are a couple of dozen more that you can quickly locate.

from nltk.corpus import wordnet as wn

animal_synset = wn.synset('animal.n.01')
human_synset = wn.synset('homo.n.02')

def is_animal_word(word):
    # Collect every synset on every hypernym path of every sense of the
    # word, skipping any path that passes through the human synset.
    hypernyms = [hyp
                 for synset in wn.synsets(word)
                 for path in synset.hypernym_paths()
                 for hyp in path
                 if human_synset not in path]
    return animal_synset in hypernyms

I then wrote a script which called my function to return all the animal words in the first n words of a piece of text:

def print_animal_words_v1(text, n):
    # Lowercase and de-duplicate the first n tokens before the lookup.
    words = set([w.lower() for w in text[:n]])
    animal_words = sorted(set([w for w in words
                               if is_animal_word(w)]))
    print("Animal words in first %d words" % n)
    print(animal_words)

They've packaged up a bunch of textual resources for easy access, so I could immediately test it on the first 50,000 words of Emma:

>>> emma = gutenberg.words('austen-emma.txt')
>>> print_animal_words_v1(emma, 50000)
Animal words in first 50000 words
['baby', 'bear', 'bears', 'blue', 'chat', 'chicken',
'cow', 'cows', 'creature', 'creatures', 'does',
'entire', 'female', 'fish', 'fly', 'games', 'goose',
'head', 'horse', 'horses', 'imagines', 'kite',
'kitty', 'martin', 'martins', 'monarch', 'mounts',
'oysters', 'pen', 'pet', 'pollards', 'shark',
'sharks', 'stock', 'tumbler', 'young']

A quick look at this reveals some suspicious candidates: for example, 'does' is most likely never used as the plural of 'doe', so shouldn't be counted as an animal word.

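To make the 'does' problem concrete, here is a minimal sketch that doesn't need NLTK at all. Everything in it is an assumption for illustration: the tagged tokens are written by hand in Penn Treebank style (they are not real tagger output), and `ANIMAL_WORDS` and `noun_candidates` are hypothetical names standing in for a WordNet lookup and a noun filter. The idea is simply that keeping only words tagged as nouns drops the verb 'does' while leaving genuine nouns alone.

```python
# Hand-written (word, tag) pairs in Penn Treebank style; illustrative only,
# not real tagger output.
tagged = [
    ("She", "PRP"), ("does", "VBZ"), ("not", "RB"), ("like", "VB"),
    ("the", "DT"), ("horse", "NN"), ("or", "CC"), ("the", "DT"),
    ("cows", "NNS"), (".", "."),
]

# Hypothetical stand-in for a dictionary lookup: words that have
# at least one animal sense ('does' as the plural of 'doe').
ANIMAL_WORDS = set(["does", "horse", "cows"])


def noun_candidates(tagged_words):
    """Return the lowercased words whose tag starts with 'N' (nouns)."""
    return set([w.lower() for (w, tag) in tagged_words
                if tag.startswith("N")])


# Without tag filtering, every word with an animal sense is reported.
all_words = set([w.lower() for (w, tag) in tagged])
unfiltered = sorted(w for w in all_words if w in ANIMAL_WORDS)

# With tag filtering, only words actually used as nouns are looked up.
filtered = sorted(w for w in noun_candidates(tagged) if w in ANIMAL_WORDS)

print(unfiltered)  # ['cows', 'does', 'horse']
print(filtered)    # ['cows', 'horse']
```

A real tagger makes mistakes of its own, so a filter like this reduces false positives rather than eliminating them.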
def print_animal_words_v2(text, n):
    print("Tagging first %d words" % n)
    # Part-of-speech tag the tokens so that only nouns are considered.
    tagged_words = nltk.pos_tag(text[:n])
    print("Tagging done")
    words = set([w.lower() for (w, tag) in tagged_words
                 if tag.startswith('N')])
    animal_words = sorted(set([w for w in words
                               if is_animal_word(w)]))
    print("Animal words in first %d words" % n)
    print(animal_words)

Now I get a shorter list, which in particular omits the suspicious 'does':

>>> print_animal_words_v2(emma, 50000)
Tagging first 50000 words
Tagging done
Animal words in first 50000 words
['baby', 'bears', 'blue', 'chicken', 'cow',
'creature', 'creatures', 'female', 'games', 'goose',
'head', 'horse', 'horses', 'kitty', 'martin',
'martins', 'monarch', 'oysters', 'pet', 'pollards',
'shark', 'sharks', 'stock', 'tumbler', 'young']

Well, that should be enough to give you the flavor of the thing. If you don't want to buy the book, it's available free online. Have fun!