Au Naturale – an Introduction to NLTK


This blog post is an introduction on how to make a key phrase extractor in Python, using the Natural Language Toolkit (NLTK).

But how will a search engine know what it is about? How will this document be indexed correctly? A human can read it and tell that it is about programming, but no search engine company has the money to pay thousands of people to classify the entire Internet for them. Instead they must reasonably predict what a human may decide to be the key points of a document. And they must automate this.

Remember how proper sentences need to be structured with a subject and a predicate? A subject could be a noun, or a adjective followed by a noun, or a pronoun… A predicate may be or include a verb… We can take a similar approach by defining our key phrases in terms of what types of words (or parts-of-speech) they are, and the pattern in which they occur.

But how do we know what words are nouns or verbs in an automated fashion?

Throughout this post I will use an excerpt from Zen and the Art of Motorcycle Maintenance as an example:

The Buddha, the Godhead, resides quite as comfortably in the circuits of a digital computer or the gears of a cycle transmission as he does at the top of a mountain or in the petals of a flower. To think otherwise is to demean the Buddha…which is to demean oneself.

Before proceeding, make a (mental) note of the key phrases here. What is the document about?

Tokenizing

In a program, text is represented as a string of characters. How can we go about moving one level of abstraction up, to the level of words, or tokens? To tokenize a sentence you may be tempted to use Python’s .split() method, but this means you will need to code additional rules to remove hyphens, newlines and punctuation when appropriate.

Thankfully the Natural Language Toolkit (NLTK) for Python provides a regular expression tokenizer. There is an example of it (including how it fares against Pythons regular expression tokenization method) in Chapter 3 of the NLTK book. It also allows you to have comments:

# Word Tokenization Regex adapted from NLTK book
# (?x) sets flag to allow comments in regexps
sentence_re = r'''(?x)
  # abbreviations, e.g. U.S.A. (with optional last period)
  ([A-Z])(\.[A-Z])+\.?
  # words with optional internal hyphens
  | \w+(-\w+)*
  # currency and percentages, e.g. $12.40, 82%
  | \$?\d+(\.\d+)?%?
  # ellipsis
  | \.\.\.
  # these are separate tokens
  | [][.,;"'?():-_`]
'''

Once we have constructed our regex for defining what sort of format our words should be in, we call it like so:

import nltk
#doc is a string containing our document
toks = nltk.regexp_tokenize(doc, sentence_re)

>>> toks
['The', 'Buddha', ',', 'the', 'Godhead', ',', 'resides', ...

Tagging

The next step is tagging. This uses statistical data to apply a Part-of-speech tag to each token, e.g. ADJ, NN (Noun), and so on. Since it is statistical, we need to either train our model or use a pre-trained model. NLTK comes with a pretty good one for general use, but if you are looking at a certain kind of document you may want to train your own tagger, since it may greatly affect the accuracy (think about very vocabulary-dense fields such as biology).

Note that to train your own tagger you will need a pre-tagged corpus (NLTK comes with some) or use a bootstrapped method (which can take a long time). Check out Streamhacker and Chapter 5 of the NLTK book for a good discussion on training your own (and how to test it empirically).

For the sake of this introduction, we will use the default one. The result is a list of token-tag pairs:

>>> postoks = nltk.tag.pos_tag(toks)
>>> postoks
[('The', 'DT'), ('Buddha', 'NNP'), (',', ','), ('the', 'DT'), ...

Chunking

Now we can use the part-of-speech tags to lift out noun phrases (NP) based on patterns of tags.

Note: All diagrams have been stolen from the NLTK book (which is available under the Creative Commons Attribution Noncommercial No Derivative Works 3.0 US License).

This is called chunking. We can define the form of our chunks using a regular expression, and build a chunker from that:

# This grammar is described in the paper by S. N. Kim,
# T. Baldwin, and M.-Y. Kan.
# Evaluating n-gram based evaluation metrics for automatic
# keyphrase extraction.
# Technical report, University of Melbourne, Melbourne 2010.
grammar = r"""
    NBAR:
        # Nouns and Adjectives, terminated with Nouns
        {<NN.*|JJ>*<NN.*>}

    NP:
        {<NBAR>}
        # Above, connected with in/of/etc...
        {<NBAR><IN><NBAR>}
"""

chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(postoks)

It is also possible to describe a Context Free Grammar (CFG) to do this, and help deal with ambiguity – information can be found in Chapter 8 of the NLTK book. Chunk regexes can be much more complicated if needed, and support chinking, which allows you to specify patterns in terms what you don’t want – see Chapter 7 of the NLTK book.

The output of chunking is a tree, where the noun phrase nodes are located just one level before the leaves, which are the words that constitute the noun phrase:

To access the leaves, we can use this code:

def leaves(tree):
    """Finds NP (nounphrase) leaf nodes of a chunk tree."""
    for subtree in tree.subtrees(filter = lambda t: t.node=='NP'):
        yield subtree.leaves()

Walking the tree and Normalisation

We can now walk the tree to get the terms, applying normalisation if we want to:

def get_terms(tree):
    for leaf in leaves(tree):
        term = [ normalise(word) for word, tag in leaf
            if acceptable_word(word) ]
        yield term

Normalisation may consist of lower-casing words, removing stop-words which appear in many documents (i.e. if, the, a…), stemming (i.e. cars $\rightarrow$ car), and lemmatizing (i.e. drove, drives, rode $\rightarrow$ drive). We normalise so that at later stages we can compare similar key phrases to be the same; 'the man drove the truck' should be comparable to 'The man drives the truck'. This will allow us to better rank our key phrases :)

Functions for normalising and checking for stop-words are described below:

lemmatizer = nltk.WordNetLemmatizer()
stemmer = nltk.stem.porter.PorterStemmer()

def normalise(word):
    """Normalises words to lowercase and stems and lemmatizes it."""
    word = word.lower()
    word = stemmer.stem_word(word)
    word = lemmatizer.lemmatize(word)
    return word

def acceptable_word(word):
    """Checks conditions for acceptable word: length, stopword."""
    from nltk.corpus import stopwords
    stopwords = stopwords.words('english')

    accepted = bool(2 <= len(word) <= 40
        and word.lower() not in stopwords)
    return accepted

And the result is:

>>> terms = get_terms(tree)
>>> for term in terms:
...    for word in term:
...        print word,
...    print
buddha
godhead
circuit
digit comput
gear
cycl transmiss
mountain
petal
flower
buddha
demean oneself

Are these similar to the key phrases you chose? There are lots of areas above that can be tweaked. Let me know what you come up with :) (the code can be found in this gist).

In future posts I will talk about how to rank key phrases. I will also discuss how to scale this to process many documents at once using MapReduce.

In the mean time check out the demos on Streamhacker, solve the problems in the NLTK book, or read the NLTK Cookbook :)

  • Excellent work! I’m a big fan of your blog after stumbling onto your github page.

    • Thanks! Your blog looks cool too :)

      Out of curiousity, what are you doing with NLTK?

      • Thank you! Ended up following you on twitter also haha, we have fairly similar interests based on some of your blog posts (especially the google one!). For NLTK, i’ve been working on a startup & side project for a long time now.

        Planning on releasing it in two seeks. It’s a news aggregator. Mainly using SOLR & Lucene, but NLTK plays a huge role also. I was working on keyword extraction, so your blog served as a very good starting point (vs other tutorials that just give basic commands). I’ll give you credits on the project once its out and link you after its released!

        • I saw you mention the startup on your blog. Glad that post helped you out :) and I can’t complain about getting some credit ;) thanks!

  • Vardaan Bhat

    Which part of the code is used? Sorry, I’m a novice.

    • The full code is in the gist I linked: https://gist.github.com/alexbowe/879414 (sorry, maybe it wasn’t clear enough in the post).

      It will tokenize and chunk the text inside the variable called “text”. Feel free to write code to modify it however (e.g. open a file and store the contents in the variable called “text” instead).

      • Vardaan Bhat

        Thanks!

  • Vardaan Bhat

    When I run the code and try to save the values as a list, and try to run functions on them, the list shows up as a NoneType. What should I do?
    I need urgent help for a project.
    Thanks,
    Vardaan

    • Seems like you fixed this. If not, put your code in a GitHub Gist and send me the link?

      • Vardaan Bhat

        Well, it works. Is there a way I could PM you about my project, or should I just use the comments?At least I think it works, until now for the purpose.

        • I’m replying to your other comment now, but you can also email me: bowe dot alexander at gmail dot com.

          • Vardaan Bhat

            Ok. It just didn’t say, another user is typing, so I was confused. Thanks

  • Vardaan Bhat

    How do I find the one most important keyword in the text?

    • Depends on what you mean by “most important”. But an easy way would be to take the terms with the highest frequency (also called “tf” for “term frequency”). You may notice a problem with this: there are common words like “the” and “at”. You can remove these “stop words” as we do above, but some words are not considered stop words but are still very common among documents. If you are using simple term frequency across many documents, and want to change the weighting to stop really common words from being classified as “important”, you can incorporate the “inverse document frequency”. Check the wikipedia page: http://en.wikipedia.org/wiki/Tf%E2%80%93idf

      There are many other ways to do it, but this is perhaps the easiest thing to get you started.

      • Vardaan Bhat

        Ok. Can I email you with some more questions about my project and what you think are some other improvements I can make?

  • Vardaan Bhat

    Is the chunker special and built from the research paper?

    • The chunker is built from the research paper, yes. It might not be particularly special anymore (it was some time ago), but for the documents I was testing it on it worked pretty well at the time.