The Natural Language Processing for JVM languages (NLP4J) project provides:
- NLP tools readily available for research in various disciplines.
- Frameworks for fast development of efficient and robust NLP components.
- APIs for manipulating computational structures in NLP (e.g., dependency graphs).
Key Features
- Tokenization – takes raw text and splits it into tokens based on their morphological aspects. It also groups tokens into sentences. The tokenizer is based on the LDC tokenizer used for creating English Treebanks, although it uses more robust heuristics:
  - Emoticons are recognized as one unit (e.g., `:-)`, `^_^`).
  - Hyperlinks are recognized as one unit (e.g., `emory.edu`, `[email protected]`, `index.html`).
  - Numbers containing punctuation are recognized as one unit (e.g., `0.1`, `2/3`).
  - Repeated punctuation marks are grouped together (e.g., `---`, `...`).
  - Abbreviations are recognized as one unit (e.g., `Prof.`, `Ph.D`).
  - File extensions are not tokenized (e.g., `clearnlp.zip`, `tokenizer.doc`).
  - Units are tokenized (e.g., `1 kg`, `2 cm`).
  - Usernames including periods are recognized as one unit (e.g., `jinho.choi`).
- Morphological analyzer – generates root forms (lemmas) of word tokens. It is a rule-based analyzer inspired by the WordNet morphy, although it uses a larger dictionary gathered from various sources and more advanced heuristics. It also normalizes numbers, redundant punctuation, hyperlinks, etc.
- Part-of-Speech Tagging – uses the generalized model from dynamic model selection and utilizes ambiguity classes trained on a large corpus.
- Named Entity Recognition – uses both sparse and dense features extracted from named entity gazetteers, word clusters, and word embeddings.
- Dependency Parsing – uses a transition-based, non-projective parsing algorithm that runs in linear time for both projective and non-projective parsing.
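
A few of the tokenization heuristics listed above can be approximated with ordered regular-expression alternatives. The following is a minimal sketch in Java, not NLP4J's actual tokenizer or API: the class name `SketchTokenizer`, its `tokenize` method, and the patterns are all assumptions made for illustration. It covers only emoticons, dotted names (hyperlinks, file names, usernames), numbers with inner punctuation, and repeated punctuation. It does not handle abbreviations, units, or sentence grouping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not the NLP4J tokenizer.
class SketchTokenizer {
    // Alternatives are tried left to right, so the more specific patterns come first.
    private static final Pattern TOKEN = Pattern.compile(
          "(?:[:;]-?[)(]|\\^_\\^)"     // emoticons such as :-) and ^_^
        + "|\\d+(?:[./]\\d+)+"         // numbers with inner punctuation: 0.1, 2/3
        + "|\\w+(?:\\.\\w+)+"          // dotted names: emory.edu, index.html, jinho.choi
        + "|\\w+"                      // plain words
        + "|([^\\w\\s])\\1*"           // one punctuation mark plus its repeats: ---, ...
    );

    // Returns the tokens found in the text, in order; whitespace is skipped.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Visit emory.edu :-) ^_^ for 0.1 or 2/3 ... wow---";
        for (String token : tokenize(text)) {
            System.out.println(token);
        }
    }
}
```

Because the alternatives are ordered, `0.1` is claimed by the number pattern before the generic word pattern can split it, and a trailing sentence period (as in `clearnlp.zip.`) stays a separate token because the dotted-name pattern requires word characters after each period. A production tokenizer would need dictionaries and many more rules than this sketch shows.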
Website: emorynlp.github.io/nlp4j
Support: Forum, GitHub Code Repository
Developer: Emory NLP research group
License: Apache License, Version 2.0
NLP4J is written in Java.
Related Software
| Java Natural Language Processing Tools | |
|---|---|
| CoreNLP | Annotation-based NLP pipeline that provides core natural language analysis |
| OpenNLP | Machine learning based toolkit |
| DL4J | Deploy and train deep learning models |
| Lucene | High-performance, full-featured information retrieval software library |
| UIMA | Open source implementation of the UIMA specification |
| Tika | Content analysis toolkit |
| MALLET | Statistical natural language processing, document classification and more |
| GATE | Full-lifecycle solution for a broad range of NLP tasks |
| ReVerb | Automatically identifies and extracts binary relationships from sentences |
| NLP4J | NLP framework for JVM languages |
| CogComp-NLP | State-of-the-art Natural Language Processing (NLP) tools |

