Natural Language Processing

NLP4J – NLP framework for JVM languages

The Natural Language Processing for JVM languages (NLP4J) project provides:

  • NLP tools readily available for research in various disciplines.
  • Frameworks for fast development of efficient and robust NLP components.
  • API for manipulating computational structures in NLP (e.g., dependency graph).
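To make the "dependency graph" mentioned above concrete, here is a minimal sketch of one way such a structure could be represented in Java. The class and method names (`DepGraphSketch`, `setHead`, `dependentsOf`, `isProjective`) are hypothetical and chosen for illustration only; they are not NLP4J's actual API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical minimal dependency graph; NOT the NLP4J API. */
class DepGraphSketch {
    private final String[] forms;   // forms[0] is an artificial root node
    private final int[] heads;      // heads[i] = index of token i's head, -1 if unset
    private final String[] labels;  // labels[i] = dependency label of token i

    DepGraphSketch(String... words) {
        forms = new String[words.length + 1];
        forms[0] = "<root>";
        System.arraycopy(words, 0, forms, 1, words.length);
        heads = new int[forms.length];
        labels = new String[forms.length];
        Arrays.fill(heads, -1);
    }

    /** Attaches the token at index dependent to the token at index head. */
    void setHead(int dependent, int head, String label) {
        heads[dependent] = head;
        labels[dependent] = label;
    }

    /** Returns the indices of all tokens whose head is the given token. */
    List<Integer> dependentsOf(int head) {
        List<Integer> result = new ArrayList<>();
        for (int i = 1; i < heads.length; i++) {
            if (heads[i] == head) result.add(i);
        }
        return result;
    }

    /** A tree is projective if no two arcs cross when drawn above the sentence. */
    boolean isProjective() {
        for (int i = 1; i < heads.length; i++) {
            if (heads[i] < 0) continue;
            int a = Math.min(i, heads[i]), b = Math.max(i, heads[i]);
            for (int j = i + 1; j < heads.length; j++) {
                if (heads[j] < 0) continue;
                int c = Math.min(j, heads[j]), d = Math.max(j, heads[j]);
                // Two arcs cross when exactly one endpoint lies strictly inside the other arc.
                if ((a < c && c < b && b < d) || (c < a && a < d && d < b)) return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        DepGraphSketch g = new DepGraphSketch("She", "ate", "pizza");
        g.setHead(1, 2, "nsubj");
        g.setHead(2, 0, "root");
        g.setHead(3, 2, "dobj");
        System.out.println(g.dependentsOf(2) + " projective=" + g.isProjective());
    }
}
```

Storing heads in a flat array keeps lookups cheap; the quadratic crossing check is only meant to illustrate the projective/non-projective distinction, not to be efficient.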

Key Features

  • Tokenization – splits raw text into tokens according to their morphological properties and groups the tokens into sentences. The tokenizer is based on the LDC tokenizer used to create the English Treebanks, but it uses more robust heuristics:
    • Emoticons are recognized as one unit (e.g., :-), ^_^).
    • Hyperlinks are recognized as one unit (e.g., emory.edu, [email protected], index.html).
    • Numbers containing punctuation are recognized as one unit (e.g., 0.1, 2/3).
    • Repeated punctuation marks are grouped together (e.g., ---, ...).
    • Abbreviations are recognized as one unit (e.g., Prof., Ph.D).
    • File extensions are not tokenized (e.g., clearnlp.zip, tokenizer.doc).
    • Units are tokenized separately from numbers (e.g., 1 kg, 2 cm).
    • Usernames including periods are recognized as one unit (e.g., jinho.choi).
  • Morphological analyzer – generates root forms (lemmas) of word tokens. It is a rule-based analyzer inspired by the WordNet morphy, although it uses a larger dictionary gathered from various sources and more advanced heuristics. It also normalizes numbers, redundant punctuation, hyperlinks, etc.
  • Part-of-Speech Tagging – uses the generalized model from dynamic model selection and utilizes ambiguity classes trained on a large corpus.
  • Named Entity Recognition – uses both sparse and dense features extracted from named entity gazetteers, word clusters, and word embeddings.
  • Dependency Parsing – uses a transition-based, non-projective parsing algorithm that runs at linear-time speed for both projective and non-projective parsing.
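Several of the tokenization heuristics listed above can be illustrated with a small regular-expression tokenizer. This is only a sketch under simplifying assumptions: the class name `TokenizerSketch`, the patterns, and the tiny abbreviation list are all hypothetical and are not taken from NLP4J, whose tokenizer is considerably more elaborate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative regex tokenizer; NOT the NLP4J implementation. */
class TokenizerSketch {
    // Alternatives are tried left to right, so more specific patterns come first.
    private static final Pattern TOKEN = Pattern.compile(
          "(?:[:;=]-?[)(DPp]|\\^_\\^)"                // emoticons such as :-) and ^_^
        + "|(?:Prof|Dr|Mr|Mrs|Ms)\\."                 // a tiny, assumed abbreviation list
        + "|\\d+(?:[./]\\d+)+"                        // numbers with punctuation: 0.1, 2/3
        + "|[\\p{L}\\p{N}]+(?:\\.[\\p{L}\\p{N}]+)+"   // dotted names: emory.edu, index.html
        + "|(?<p>\\p{Punct})\\k<p>+"                  // repeated punctuation: ---, ...
        + "|\\d+"                                     // plain digits, split from a following unit
        + "|\\p{L}+"                                  // plain words
        + "|\\S");                                    // any other single non-space character

    /** Returns the tokens of the given text in order of appearance. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Prof. Choi opened index.html."));
        System.out.println(tokenize("Visit emory.edu :-) It costs 0.1 or 2/3 of 2cm --- wow..."));
    }
}
```

Because the dotted-name pattern requires a letter or digit after each period, a sentence-final period is left as its own token while names such as index.html stay intact.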

Website: emorynlp.github.io/nlp4j
Support: Forum, GitHub Code Repository
Developer: Emory NLP research group
License: Apache License, Version 2.0

NLP4J is written in Java.


Related Software

Java Natural Language Processing Tools
CoreNLP – Annotation-based NLP pipeline that provides core natural language analysis
OpenNLP – Machine learning based toolkit
DL4J – Deploy and train deep learning models
Lucene – High-performance, full-featured information retrieval software library
UIMA – Open source implementation of the UIMA specification
Tika – Content analysis toolkit
MALLET – Statistical natural language processing, document classification and more
GATE – Full-lifecycle solution for a broad range of NLP tasks
ReVerb – Automatically identifies and extracts binary relationships from sentences
NLP4J – NLP framework for JVM languages
CogComp-NLP – State-of-the-art Natural Language Processing (NLP) tools


