The Apache OpenNLP library is an open source, machine learning based toolkit for processing natural language text.
It includes a sentence detector, a tokenizer, a name finder, a part-of-speech (POS) tagger, a chunker, and a parser. Its APIs can be easily integrated into a Java program.
The goal of the OpenNLP project is to create a mature toolkit for these tasks. An additional goal is to provide a large number of pre-built models for a variety of languages, as well as the annotated text resources that those models are derived from.
- Tokenization. OpenNLP offers multiple tokenizer implementations:
  - Whitespace Tokenizer – sequences of non-whitespace characters are identified as tokens.
  - Simple Tokenizer – a character class tokenizer; sequences of the same character class are tokens.
  - Learnable Tokenizer – a maximum entropy tokenizer that detects token boundaries based on a probability model.
- Sentence segmentation.
- Part-of-speech tagging – marks tokens with their corresponding word type based on the token itself and the context of the token.
- Named entity extraction – the Name Finder can detect named entities and numbers in text.
- Chunking – divides a text into syntactically correlated groups of words, such as noun groups and verb groups, but does not specify their internal structure or their role in the main sentence.
- Parsing – offers two different parser implementations, the chunking parser and the treeinsert parser. OpenNLP also has a command line tool for training, on various corpora, the models available from the model download page.
- Coreference resolution – links multiple mentions of an entity in a document together. The OpenNLP implementation is currently limited to noun phrase mentions; other mention types cannot be resolved.
- Maximum entropy.
- Perceptron-based machine learning.
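
The difference between the two rule-based tokenizers listed above is easier to see on a concrete sentence. The following is a minimal sketch in plain Java that follows the descriptions given here; it does not use OpenNLP's own classes. The class name `TokenizerSketch`, its method names, and the four character classes (letter, digit, whitespace, other) are assumptions made for illustration, and the library's actual rules may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizerSketch {

    // Hypothetical character classes, chosen for this illustration only.
    enum CharClass { LETTER, DIGIT, WHITESPACE, OTHER }

    static CharClass classify(char c) {
        if (Character.isWhitespace(c)) return CharClass.WHITESPACE;
        if (Character.isLetter(c)) return CharClass.LETTER;
        if (Character.isDigit(c)) return CharClass.DIGIT;
        return CharClass.OTHER;
    }

    // Whitespace tokenization: every run of non-whitespace characters is one token.
    public static List<String> whitespaceTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1; // index where the current token began, or -1 if none
        for (int i = 0; i < text.length(); i++) {
            if (Character.isWhitespace(text.charAt(i))) {
                if (start >= 0) {
                    tokens.add(text.substring(start, i));
                    start = -1;
                }
            } else if (start < 0) {
                start = i;
            }
        }
        if (start >= 0) tokens.add(text.substring(start));
        return tokens;
    }

    // Character class tokenization: a token ends whenever the character class changes.
    public static List<String> characterClassTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1;
        CharClass current = CharClass.WHITESPACE;
        for (int i = 0; i < text.length(); i++) {
            CharClass cls = classify(text.charAt(i));
            if (cls != current) {
                if (start >= 0) tokens.add(text.substring(start, i));
                // Whitespace never starts a token; any other class does.
                start = (cls == CharClass.WHITESPACE) ? -1 : i;
                current = cls;
            }
        }
        if (start >= 0) tokens.add(text.substring(start));
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Mr. Smith paid $12.50 today.";
        System.out.println(whitespaceTokenize(text));
        // prints [Mr., Smith, paid, $12.50, today.]
        System.out.println(characterClassTokenize(text));
        // prints [Mr, ., Smith, paid, $, 12, ., 50, today, .]
    }
}
```

Neither rule handles every case well: the whitespace rule leaves punctuation attached to words, while the character class rule splits "$12.50" into four pieces. Cases like these are where a learnable tokenizer, which decides token boundaries from a trained probability model, becomes useful.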