
OpenRefine – desktop program for data cleanup and transformation

OpenRefine is a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data.

It resembles a spreadsheet application and can handle spreadsheet file formats such as CSV, but it behaves more like a database. Unlike a spreadsheet, OpenRefine doesn’t store formulas and display their results; it shows only the value inside each cell. It doesn’t support cell colors or text formatting.

OpenRefine lets users clean, correct, codify, and extend data. Without ever needing to type inside a single cell, users can automatically fix typos, convert things to the right format, and add structured categories from trusted sources.

This is free and open source software.

Key Features

  • Faceting – drill through large datasets using facets and apply operations on filtered views of your dataset.
  • Clustering – use a variety of comparison methods to find text entries that are similar but not identical, then review the results and merge the cells that should match.
  • Transformation of data – convert values to other formats, normalizing and denormalizing.
  • Reconciliation – match your dataset against an external data source.
  • Infinite undo/redo – go back to any previous state of your dataset and replay your operation history.
  • Privacy – data is cleaned locally. It doesn’t require internet access to run its basic functions.
  • Export – CSV, Excel, Google spreadsheet, HTML table, and TSV.
  • Import – CSV, Google spreadsheet, JSON, RDF triples, TSV, and XML.
  • Cross-platform support – runs under Linux, macOS, and Windows.
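
To make the clustering feature above more concrete, here is a minimal Python sketch of a key-collision approach, in which each value is reduced to a normalized "fingerprint" and values that share a fingerprint are grouped as merge candidates. This is an illustration only, not OpenRefine's own code (OpenRefine is written in Java and offers several clustering methods); the function names `fingerprint` and `cluster` and the sample values are assumptions made up for this example.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Reduce a string to a normalized key so near-duplicates collide."""
    # Trim surrounding whitespace and lowercase.
    text = value.strip().lower()
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, de-duplicate tokens, sort them, and rejoin.
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values by fingerprint; keep groups with 2+ distinct spellings."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical messy column values.
names = ["Acme Inc.", "acme inc", "Inc. Acme ", "Café Rouge", "Cafe Rouge", "Widget Co"]
for group in cluster(names):
    print(group)
```

Each printed group is a set of spellings that a person would then review and merge into one value, which mirrors the review-then-merge workflow the feature list describes. Key collision only catches differences in case, accents, punctuation, and word order; genuine typos need a distance-based comparison instead.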

Website: openrefine.org
Support: GitHub Code Repository
Developer: Community
License: BSD 3-Clause “New” or “Revised” License


OpenRefine is written in Java. Learn Java with our recommended free books and free tutorials.


Related Software

Python Data Validation
Pydantic – Data validation using Python type hints
pandera – Framework for precision data testing
jsonschema – Implementation of JSON Schema for Python
Cerberus – Lightweight and extensible data validation library
schema – Library for validating Python data structures
GX – Validating, documenting, and profiling data
marshmallow – ORM/ODM/framework-agnostic library
Voluptuous – Python data validation library
Schematics – Combine types into structures, validate, and transform the shapes of data
Colander – Serialization / deserialization / validation library
Valideer – Lightweight data validation and adaptation Python library
OpenRefine – Desktop program for data cleanup and transformation
Soda Core – Data quality and data contract verification engine
OpenMetadata – Unified metadata platform
Elementary OSS – dbt-native data observability command-line tool

Read our verdict in the software roundup.

