Apache OpenNLP – machine learning based toolkit

The Apache OpenNLP library is an open source, machine-learning-based toolkit for processing natural language text.

It includes a sentence detector, a tokenizer, a name finder, a part-of-speech (POS) tagger, a chunker, and a parser. Its APIs are designed to be easily integrated into Java programs.

The goal of the OpenNLP project is to create a mature toolkit for these tasks. An additional goal is to provide a large number of pre-built models for a variety of languages, as well as the annotated text resources that those models are derived from.

Features include:

  • Tokenization. OpenNLP offers multiple tokenizer implementations:
    • Whitespace Tokenizer – a whitespace tokenizer; sequences of non-whitespace characters are identified as tokens.
    • Simple Tokenizer – a character class tokenizer; sequences of characters from the same character class are tokens.
    • Learnable Tokenizer – a maximum entropy tokenizer that detects token boundaries based on a probability model.
  • Sentence segmentation.
  • Part-of-speech tagging – marks tokens with their corresponding word type based on the token itself and the context of the token.
  • Named entity extraction – the Name Finder can detect named entities and numbers in text.
  • Chunking – divides a text into syntactically correlated groups of words, such as noun groups and verb groups, without specifying their internal structure or their role in the main sentence.
  • Parsing – offers two parser implementations: the chunking parser and the treeinsert parser. OpenNLP includes a command line tool that can train, on various corpora, the models available from the model download page.
  • Coreference resolution – links multiple mentions of an entity in a document together. The OpenNLP implementation is currently limited to noun phrase mentions; other mention types cannot be resolved.
  • Maximum entropy.
  • Perceptron based machine learning.
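
The first two tokenizer strategies in the list above can be illustrated without the library. The following is a minimal plain-Java sketch, not OpenNLP's own code: the class name TokenizerSketch and its methods are hypothetical, and the four character classes (letter, digit, whitespace, other) are an assumption made for illustration. As a simplification, a run of punctuation such as "..." is kept as a single token here.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; these names are hypothetical and not part of OpenNLP.
class TokenizerSketch {

    // Whitespace strategy: every maximal run of non-whitespace characters is a token.
    static List<String> whitespaceTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1; // index where the current token began, or -1 if none is open
        for (int i = 0; i < text.length(); i++) {
            if (Character.isWhitespace(text.charAt(i))) {
                if (start >= 0) {
                    tokens.add(text.substring(start, i));
                    start = -1;
                }
            } else if (start < 0) {
                start = i;
            }
        }
        if (start >= 0) {
            tokens.add(text.substring(start));
        }
        return tokens;
    }

    // Assumed character classes for the character class strategy.
    enum CharClass { LETTER, DIGIT, WHITESPACE, OTHER }

    static CharClass classify(char c) {
        if (Character.isWhitespace(c)) return CharClass.WHITESPACE;
        if (Character.isLetter(c)) return CharClass.LETTER;
        if (Character.isDigit(c)) return CharClass.DIGIT;
        return CharClass.OTHER;
    }

    // Character class strategy: a token ends wherever the character class changes.
    // Whitespace runs separate tokens but are not emitted themselves.
    static List<String> charClassTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        CharClass prev = null;
        for (int i = 0; i < text.length(); i++) {
            CharClass cur = classify(text.charAt(i));
            if (prev != null && cur != prev) {
                if (prev != CharClass.WHITESPACE) {
                    tokens.add(text.substring(start, i));
                }
                start = i;
            }
            prev = cur;
        }
        if (prev != null && prev != CharClass.WHITESPACE) {
            tokens.add(text.substring(start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sentence = "Mr. Smith paid $25.50 today.";
        System.out.println(whitespaceTokenize(sentence));
        System.out.println(charClassTokenize(sentence));
    }
}
```

For the sample sentence, the whitespace variant keeps "$25.50" as one token, while the character class variant splits it into "$", "25", ".", and "50". A learnable tokenizer would instead decide each candidate boundary with a trained probability model, which is why it requires a model file and the other two do not.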

Website: opennlp.apache.org
Support: Documentation, GitHub
Developer: The Apache Software Foundation
License: Apache License Version 2.0

Apache OpenNLP is written in Java.
