Stanford CoreNLP is an extensible annotation-based NLP pipeline that provides core natural language analysis. This open source toolkit is quite widely used, both in the research NLP community and also among commercial and government users of open source NLP technology.
Arbitrary Command Output Colourer: a regular expression based colour formatter for programs that display output on the command-line. It works as a wrapper around the target program, executing it and capturing the stdout stream
Align is a general-purpose text filter tool that helps vertically align columns in string-separated tables of input text.
Ansible-cmdb takes the output of Ansible's fact gathering and converts it into a static HTML overview page containing system configuration information.
Ansifilter handles text files containing ANSI terminal escape codes.
a Microsoft® Word reader for Linux and RISC OS
apachegrep is a Perl program (which does not require any non-standard Perl modules) to help webmasters (or anyone, really) go through their Apache common/combined logs and pull out various bits of information.
Apropos2 is a replacement for the GNU apropos command that winnows down its responses when given more than one search term. The apropos in man repeatedly prints everything that matches each search term. (The apparent similarity between apropos and whatis led to them being the same script.) Apropos2 is equivalent to "apropos word1 | grep -i word2 | ... | grep -i wordn", but with better error messages.
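The pipeline equivalence described above can be sketched in Python as a filter that keeps only the lines containing every search term, ignoring case. This is a minimal sketch, not Apropos2's code: the function name `match_all_terms` and the sample entries are hypothetical, and it assumes plain substring matching, whereas `grep -i` treats each word as a regular expression.

```python
def match_all_terms(lines, terms):
    """Keep lines that contain every term, ignoring case (like chained `grep -i`)."""
    lowered = [t.lower() for t in terms]
    return [line for line in lines if all(t in line.lower() for t in lowered)]


if __name__ == "__main__":
    # Hypothetical whatis-style entries, for illustration only.
    sample = [
        "grep (1) - print lines matching a pattern",
        "lpr (1) - print files",
        "cat (1) - concatenate files and print on the standard output",
    ]
    for line in match_all_terms(sample, ["print", "files"]):
        print(line)
```

Each extra term can only narrow the result, which is the "winnowing" behaviour the entry describes.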
ASCII art printer
a Perl script that rewrites its input, 2.5 times larger, as ASCII art
ascii2pdf translates simple text documents to PDF format. It has options for changing font, font size, and landscape vs. portrait mode.
AsmView is a small file viewer. It runs in a terminal and accepts input from the command line, a pipe, or interactive prompts. It scrolls in all directions and meets the SFF guidelines.
Aspell is a free and Open Source spell checker designed to eventually replace Ispell. It can either be used as a library or as an independent spell checker. Its main feature is that it does a much better job of coming up with possible suggestions than just about any other spell checker out there for the English language, including Ispell and Microsoft Word.
a fully customizable Python library and command-line tool for converting plain text into XML
AutoConvert consists of three parts: A converter from Chinese HZ encoding to GB encoding, an auto-converter from HZ/GB/BIG5 encoding to GB/BIG5 encoding, and a working procmail example to auto-convert incoming mail.
Autodocbook is a simple Perl script that runs through C code looking for specially formatted comment blocks and turns them into DocBook SGML files, which can then be used to create man pages, info pages and HTML documentation.
interprets a special-purpose programming language that makes it possible to handle simple data-reformatting jobs with just a few lines of code
Base64 is a command line tool that implements an RFC 3548-compliant base 64 encoder and decoder. When encoding it can wrap encoded lines to a specified column, and when decoding can optionally ignore non-alphabet characters.
bbe is a sed-like editor for binary files. bbe performs basic byte operations on blocks of an input stream. bbe is a command-line tool developed in a GNU/Linux environment. Features include:
Non-interactive command-line tool, reads input stream in arbitrary blocks, not as lines as sed, and input blocks can be defined as offset and length, just length, or using start and stop strings.
Bfr is a general-purpose command-line pipe buffer. It buffers data from stdin and sends it to stdout, adjusting to best fit the pace stdout can handle. It can solve problems on either end of a pipe.
BibTeX2HTML is a set of LaTeX and Perl scripts that automatically generate web pages from a BibTeX database.
converts a Netscape "bookmarks.html" file into a series of pages of links which are more easily browsable
a tool used for synchronizing different bookmark files and types. booksync preserves current bookmark structures and sorts new ones correctly into existing directories, or creates new ones if necessary
Boustrophedon Text Reader
displays text files in Boustrophedon--a writing style created by the ancient Greeks that alternates direction every line
can draw all kinds of boxes around its input text, ranging from a C comment box to complex ASCII art
catdoc extracts the content of Microsoft Word (.doc) files as readable ASCII text and prints it to stdout. Optionally catdoc can output TeX escape sequences instead of certain characters.
translates TeX Device Independent (DVI) files into readable plain text. The program aims to be a superior replacement for the non-free dvi2tty program
ccostring is a text utility to compare lines from different files.
Cedilla is a simple text printer that uses Unicode internally. Cedilla attempts to at least partially solve this problem by making heroic efforts to find or create a suitable glyph.
a lexical scanner, a parser generator, a parser, a tree builder and an XML generator all in one package. Chaperon can parse structured text using a grammar and then generate an XML representation of the parsed text, so it is easy to use Chaperon as a converter for text files
chcase is a Perl script that will rename files to either all upper or all lower case letters.
a text convert tool from Oracle to CSV file
ChkTeX is a LaTeX semantic checker. It is _not_ a replacement for the built-in checker in LaTeX; however, it catches some typographic errors LaTeX overlooks. In other words, it is Lint for LaTeX. Filters are also provided for checking the LaTeX parts of CWEB documents.
ChmSee is a Compiled HTML Help (CHM) file viewer written in GTK.
cledit is a change log editor that uses the default editor. It converts text change logs to colorized HTML and checks spelling using aspell.
Clipboard Modifier is a flexible system to modify the text in a clipboard in a variety of ways. It can copy a spreadsheet and change the clipboard so that it can be pasted into a wiki, with vertical bars (|) instead of tabs. It can modify multi-line clipboard text so that it can be pasted into Java or Python as strings. A URL in the clipboard pointing to Amazon can be modified so that it has your Associate ID in it. It can pipe the clipboard to a shell command and retrieve the output from it. A clipboard can be forced to text, removing things like formatting. A complicated URL can be converted into its Python equivalent, using urlencode.
converts a program source code to syntax highlighted HTML. It may be called as a CGI script. It can also handle include commands in HTML files
Colortail is a 'tail' program that can color highlight the output.
Cook is a tool for constructing files, and maintaining referential integrity between files. It is given a set of files to create, and recipes of how to create and maintain them. In any non-trivial program there will be prerequisites to performing the actions necessary to creating any file, such as include files. Cook provides a mechanism to define these.
cpp2latex converts C++ into LaTeX either for including into existing LaTeX documents or as standalone documents.
CSpotRun is a free reader for documents in the popular Pilot DOC format.
cz2cz is software for converting text files between various character encodings (ISO-8859-2, Win-1250, UTF-8, ...). Its main feature is autodetection of the charset used in a text file. It is available only in Czech (and is useful for Czech users only).
analyses texts for word probabilities, and then generates random sentences based on that. Sometimes these sentences are nonsense; but sometimes they cut right through to the heart of the matter, and reveal hidden meanings
The dbacl project consists of a set of lightweight UNIX/POSIX utilities which can be used, either directly or in shell scripts, to classify text documents automatically, according to Bayesian statistical principles. dbacl is also the name of the core utility.
a data-driven, template-based text filter implemented in Perl, ideal for generating scripts, content and reports. It provides powerful formatting capabilities for delimited text/input streams, especially when combined with other shell programs
generates a valid HTML page to display the output of the well-known diff(1) utility. Using Cascading Style Sheets, the user can fully personalize the appearance of the web page (you might find the default styles too colorful). diff2html is written in Python and is licensed under the GNU GPL
Diogenes is a free, non-commercial, open-source tool for searching the TLG and PHI databases, written in Perl.
converts Microsoft Word files to XML
a document converter that can convert between Rich Text Format (rtf), HyperText Markup Language (html) and plain text (txt). Supports: converting to/from rtf/text/html; colour; font attributes; and most European languages
doclifter translates documents written in troff macros to DocBook. Lifting documents from presentation level to semantic level is hard, and a really good job requires human polishing. This tool aims to do everything that can be mechanized, and to preserve any troff-level information that might have structural implications in SGML/XML comments.
Docvert is Web service software that takes multiple word processor files (typically .doc) and converts them to Oasis OpenDocument v1.0 format, and then optionally to any XML/HTML format. The results are returned in a .zip file.
converts a commented XML DTD to LaTeX source for printing
dtd2xs translates a Document Type Definition (DTD) into a XML Schema (REC-xmlschema-1-20010502). The translator can map meaningful DTD entities onto XML Schema constructs
Duff is a Unix command-line utility for quickly finding duplicates in a given set of files. Duff is written in C and should compile on most modern Unices.
a DVI to PDF translator. Its features include TeX specials that approximate the functionality of the PostScript pdfmarks used by Adobe Acrobat Distiller, the ability to include PDF files and JPEG files as embedded images, support for both Type1 and PK fonts, support for arbitrary linear graphics transformations, a color stack accessible via specials, partial font embedding and stream compression for reduced output file size, native, portable graphics via TPIC specials, and balanced page and destination trees for improved reader access on very large document files
dwdiff is a front-end for the diff program that operates at the word level instead of the line level. It is different from wdiff in that it allows the user to specify what should be considered whitespace, and in that it takes an optional list of characters that should be considered delimiters.
lets you place hyperlinks and shell/tcl/TeX/etc code inside plain text files
Elex generates a scanner (lexer) from a specification oriented around regular expressions.
eolfix is a command line utility for querying and correcting end-of-line (EOL) characters in ASCII text files. It can convert line endings between DOS, Unix, and Mac formats and handles "mixed" and binary formats. It converts only as needed and features a report-only mode.
epsmerge is a program for merging EPS (Encapsulated Postscript) files.
epssplit is a Perl program for splitting an EPS (encapsulated postscript) file into several smaller EPS files.
a simple plain-text format which allows conversion to and from HTML. Instead of editing HTML directly, it provides an easy-to-edit, easy-to-read and intuitive way to write HTML
euc2html is a simple application that reads in EUC encoded double-byte characters and translates them to HTML 4.0 Unicode encoded entities.
EVP dirdiff recursively compares two directory trees using message digest (hash), e.g. MD5.
converts the given set of C files into HTML files with all the user defined function calls converted to hyper links so that the user can click the link to view that function definition
converts IE favorite files to a Netscape bookmark file
fccu-docprop is a command line utility that tries to print properties of MS OLE files. MS OLE files are mainly MS Office DOC and XLS files. This software uses the libgsf library to get that metadata. It can be used for forensic purposes.
fileblasphemy digs tags out of a filename and allows you to use those tags within the execution of a program.
crlf converts files from/to DOS and UNIX text file formats, tolower converts filename(s) case to lower/upper case, untab converts TABs in files to spaces, and time_t returns values for time handling
fk_html is a simple Perl script to convert HTML mail to plain text. It converts your mail while you're downloading it, running as a fake POP3 server that redirects your mail client connections.
fsplit and fmerge
fsplit and fmerge are utilities to split a large binary file into smaller pieces and merge them together on another machine.
gClipColl provides a drag-and-drop repository for text snippets. Any text dragged to it is stored in a list for dragging to another application.
a tool to extract information from files. The default settings (and the shorthand options) are useful to extract information such as the title or meta tags from HTML files but it could also be used for other kinds of documents
Generic Colouriser acts as a filter, i.e. taking standard input, colourising it and writing to standard output.
a command-line parser generator. Creates a C or C++ file containing command line parsing routines for your program based on a simple configuration file
a CHM file viewer for Gnome2. It uses PyCHM, a set of Python wrappers around the C library libchm
GNU Talk Filters
The GNU Talk Filters are filter programs that convert ordinary English text into text that mimics a stereotyped or otherwise humorous dialect. These filters have been in the public domain for many years, but now for the first time they are provided as a single integrated package. The filters include austro, b1ff, brooklyn, chef, cockney, drawl, dubya, fudd, funetak, jethro, jive, kraut, pansy, pirate, postmodern, redneck, valspeak, and warez. Each program reads from standard input and writes to standard output. This version of the package also provides the filters as a C library, so they can be easily embedded in other programs.
Gnutran is a simple, Emacs-based front-end to a number of machine translation engines available on the web.
an optical character recognition software. It converts PGM files into ASC files
gozer is a command-line text rendering utility for creating images from arbitrary text in antialiased TrueType fonts using optional font styles, word wrapping and layout control.
searches one or more input files for lines containing a match to a specified pattern. By default, grep prints the matching lines
a plain text to HTML converter. It successfully converts subtle text markup for lists, bold, italics, tables and headings to the corresponding HTML tags without having to write unreadable source text files
a tool for automatically creating high-quality HTML markup from Project Gutenberg etexts. In combination with freely-available HTML-to-Postscript conversion tools, GutenMark can convert Project Gutenberg etexts into publication-quality Postscript, for print-on-demand applications
hd2u is a filter used to convert plain texts from DOS (CR/LF) format to UNIX (LF) format and vice versa.
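The conversion this kind of filter performs can be sketched in a few lines of Python. This is an illustrative sketch only, not hd2u's implementation: the function names are hypothetical, and it assumes the usual conventions of CR LF line endings for DOS and a bare LF for UNIX.

```python
def dos_to_unix(data: bytes) -> bytes:
    """Replace every CR LF pair with a single LF."""
    return data.replace(b"\r\n", b"\n")


def unix_to_dos(data: bytes) -> bytes:
    """Replace every line ending with CR LF."""
    # Normalise first so existing CR LF pairs are not turned into CR CR LF.
    return dos_to_unix(data).replace(b"\n", b"\r\n")


if __name__ == "__main__":
    print(dos_to_unix(b"one\r\ntwo\r\n"))
    print(unix_to_dos(b"one\ntwo\n"))
```

Working on bytes rather than decoded text keeps the sketch independent of the file's character encoding.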
help2info is a bash script that generates a simple info page from the output of the --help argument of the specified program.
a Perl script that converts the --help and --version output from a program into a simple manual page
Tools to manipulate hierarchical text outlines (i.e. text trees), including a generator and a spiffy pager.
Highlight is a universal source code converter for Linux and Windows, which transforms code to HTML, XHTML, RTF, LaTeX or TeX files with syntax highlighting.
highlights strings using ANSI terminal escape codes
HistView takes an ASCII changelog as input and outputs a formatted HTML page, optionally containing links to download releases.
Html Code Convert
Html Code Convert helps speed up the conversion of HTML code into different formats including JavaScript, JavaServer Pages, Microsoft ASP, PHP, Perl, and the UNIX shell. It is particularly useful in CGI scripting.
HTML to LaTeX
HTML to LaTeX converts a web site to a LaTeX document which can be used to generate postscript, pdf, and other formats.
HTML2DB is a tool to assist with the task of converting well-behaved HTML into DocBook SGML.
a converter from HTML to XSL-FO. The HTML code can be written with StarOffice or other WYSIWYM editors and need not be 100% valid HTML
a small Perl script designed to convert a properly formatted HTML file into a properly formatted LaTeX file
a simple console based utility for converting HTML text streams (or any ASCII based text stream for that matter) into a series of perl print statements for inclusion in a Perl script
html2text converts HTML documents into plain text.
html2text reads HTML documents from standard input or a (local or remote) URI, and formats them into a stream of plain text characters that is written to standard output or into an output-file. The program is able to preserve the original positions of table fields, allows you to set the screen width, and accepts also syntactically incorrect input. The rendering is largely customizable through an RC file.
html_parse is a tool for stripping HTML tags from a document. It is also capable of adding the resulting plain text to a database driven by MySQL.
converts HTML files to PDF or PostScript, generates a table-of-contents for books and generates indexed HTML files
htmlrecode recodes the HTML file using a new character set, while losing no characters at all.
replaces key tags read from a template file with the data read from a data file and generates an output file
features are links to other info files, the ability to read compressed files and prettier layout
creates indexed pdf documents from text files. Designed to aid creating an electronic distribution method for legacy system reports, since many mainframe type print spools are plain text
software for indexing and searching text documents. It supports full text and field based search, relevance ranked results, Boolean queries, and heterogeneous databases. Isearch can parse many kinds of documents "out of the box," including HTML, mail folders, list digests, SGML-style tagged data, and USMARC
Isotty is an easy translator for terminal character encoding. It can be used to correctly display legacy encoded programs on UTF-8 terminals or vice-versa. It features locale setting for translated programs and multi-byte sanity checks for incorrectly configured terminals. Several encodings are supported, including EUC, Shift-JIS, KOI-8r, UTF-8, and most ISO 8859 encodings.
a TeX macro package for processing the output from Jade/OpenJade in TeX (-t) mode
jbofihe is a parser for checking the grammatical correctness of Lojban text. It also provides approximate translations of Lojban into English. (Lojban is a constructed human language with the interesting property that its grammar can be cast into a form that Bison can parse).
a perl module to do japanese character conversion
thumbnails for jpg images; XawTV snaps and Mavica cameras
jq is a lightweight and flexible command-line JSON processor. jq is like sed for JSON data - you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text.
a KDE tail program that monitors multiple files and/or command output in one window
KTextDecode is a Cyrillic text conversion utility for KDE 1.1.x.
Latex2slides is a simple graphical program that produces a set of HTML/JPEG slides from a TeX or LaTeX source. Alternatively, the source can be a multipage PostScript, DVI or PDF file, and the image format for the slides can be set to PNG.
brings together LaTeX and an SQL database. By using LaTeXDB you can use SQL queries in your LaTeX document and loop over the result sets, creating tables, serial letters and other stuff
Lazyread auto-scrolls files or command output to the screen. Change scroll modes, scroll-speed, colors, pause, search, etc. Render text, HTML, PDF, gzip, tar, zip, ar, bzip2, MS-Word, nroff, binary, directories, .deb, .so, .rpm, piped output and more.
renames uppercase filenames to lowercase
a pager. A pager is a program that displays text files
lft lists files by file type (directory, regular file, symbolic link, etc...)
LigaTeX removes unnecessary ligatures from TeX files. The program currently only works with texts written in German.
a command line program that will parse syslog (and syslog-like) logfiles into a more palatable format. It will take anything resembling a standard syslog file (this includes syslog-ng, and probably most of the other variants out there), and crunch it into one of the following formats for your viewing
logtools is a collection of tools to merge, sort, split, and mangle CLF-format Web logs. Some tools for generic log file manipulation are included.
a simple Python tool that searches for text strings in OpenOffice.org (and StarOffice 6.0 or later) files. It works under Linux, Windows and Macintosh
Lout is a document formatting system that reads a high-level description of a document similar in style to LaTeX and produces a PostScript file which can be printed on most laser printers and graphic display devices. Plain text output is also available, and PDF output is limited but working (e.g. no graphics).
a console based text retrieval package that is for indexing text/HTML documents
Lucidor is a program for reading and handling e-books. It supports e-books in the EPUB file format and catalogs in the OPDS format.
a very simple LyX to HTML converter. As the name suggests, it takes a ".lyx" document as input and generates an HTML file following a few simple rules. "lyx2html" can be very useful for generating documentation. This is a beta release
creates an HTML Frequently-Asked Questions page from a text file in which each category, question, and answer are on a single line in the file
converts man (manual) pages to html via CGI or on the command line. man2web also allows for keyword (apropos) searching and generation of section indexes
mll2html is a GNU program which reformats a mailing-list text file into an HTML mailing-list file.
Modifile is an application to modify the contents of text files, using Perl substitution expressions. Optionally, it can run interactively, with the user confirming substitutions to be made.
mozilla2ps is a quick and gross hack to convert html to postscript pages in an unattended manner.
NFO Viewer is a simple viewer for NFO files, which are "ASCII" art in the CP437 codepage. The advantages of using NFO Viewer instead of a text editor are preset font and encoding settings, automatic window size, and clickable hyperlinks.
ngp is a grep tool that lets you look for a pattern in your source code directory and display results in ncurses.
odt2txt extracts the text out of OpenDocument Texts. It is small, fast and supports multiple output encodings.
OpenBerg Reader is an open-standards-based, multi-platform eBook reader. It is comparable to Adobe Acrobat in its purpose, and is based on Mozilla technologies.
otl is a text processor for generating custom markup from plain text supplemented with lightweight markup. Much of both the input and output formats can be customized.
a tool for converting text files to html content, licensed under GPL
out2html converts program output such as that produced by "git log --color" to colorized HTML.
Panconvert is a markup document converter. It allows selecting files, or ad-hoc conversion of entered/pasted markup. Pandoc needs to be installed, and handles MarkDown/CommonMark, LaTeX, OPML, ODT and EPUB formats.
A paragraph reformatter, vaguely similar to fmt, but better. Par is a filter which copies its input to its output, changing all white characters (except newlines) to spaces, and reformatting each paragraph.
takes the output of diff and displays it in a parallel (side-by-side) format, emulating the /PARALLEL option on the VMS version of diff
PDF Split and Merge
PDF Split and Merge (pdfsam) is an easy-to-use tool that provides functions to split and merge PDF files or subsections of them.
pdf2html takes one PDF and generates a series of HTML pages and PNG images. Each HTML page contains an image of one page of the PDF document.
a very flexible and powerful PERL5 program.
It's the simplest way to create a PDF index from your PDF archive
an easy to use KDE tool for creating PDF documents out of a bunch of image files. It heavily uses ImageMagick's convert tool, tiff2ps and ps2pdf
(commercial) pdfSplit is an application that lets you separate and/or re-arrange the pages of an Adobe Acrobat file.
pdftk is a simple, command line tool for doing everyday things with PDF documents. Use it to merge PDF documents, split PDF pages into a new document, decrypt input as necessary (password required), encrypt output as desired, fill PDF forms with FDF data and/or flatten forms, apply a background watermark, report on PDF metrics, update PDF metadata, attach files to PDF pages or the PDF document, unpack PDF attachments, burst a PDF document into single pages, decompress and re-compress page streams, and repair corrupted PDF files (where possible).
an open-source PDF-to-HTML converter
a tool for indexing big trees of text files, such as your local HTML documentation or home directory
phtx is a command line tool that extracts data from tables in HTML-encoded files.
a Poor Man's Pre-Processor, implemented in Python. It can be used to preprocess text such as complicated HTML and forms
printconvert is a command-line tool to convert between DOS-style newlines (CR-LF) and Unix-style newlines (LF).
psbind examines the margins in a PostScript document and rearranges the pages to fit them onto paper efficiently.
Psiconv is a PSION 5 Word conversion utility released under the GPL.
the goal of the Pspell library is to provide a generic interface to Spell checker libraries installed on the system
With pyRenamer you can change the name of several files at the same time easily. You can rename files using patterns or search and replace or common substitutions. You can manually rename selected files. You can rename image and music files using their metadata.
Pyrite Publisher is a set of powerful tools for building e-texts in the de facto standard Doc format used on the Palm Computing platform. It currently includes tools for converting HTML and ASCII text to Doc databases.
QDMerge is a modular and extensible engine for merging data files with various templates to create documents. Useful for small to medium sized web sites, but not limited to X/HTML output.
randtype is a small utility to read either standard input or text files and display the output, character-by-character or line-by-line, at random intervals.
Rbpar is a program and an accompanying library suite designed for formatting text paragraphs. In this sense, it greatly resembles the venerable Unix programs fmt and par. The difference is that rbpar sports a more modern design: it is written completely in Ruby and offers an internal API for several paragraph formatting tasks.
has the purpose of converting files between various character sets and usages. When exact transliterations are not possible, as it is often the case, the program may get rid of the offending characters or fall back on approximations
Regular Expression Development and Execution Tool: allows the user to construct regular expressions and test them against input data by executing any of a variety of search programs, editors, and programming languages that make use of regular expressions
regex-markup performs regular expression-based text markup according to user-defined rules. This can be used to color syslog files as well as the output of programs such as ping, traceroute, gcc etc.
regextract applies a regexp to a file and prints all matches.
Region Oriented Ascii Processor
scans a text file, extracts regions that match specified patterns from it, and processes them with specified executables sequentially
a suite of programs and tools for building wide area full text information retrieval systems over the Internet. The search mechanisms are capable of sorting documents by relevance to keyword search criteria. Boolean operations (and, or, not, and grouping operators) on multiple keywords are fully supported and the programs are capable of phonetic keyword search. The programs also find application in enterprise wide area information retrieval systems.
provides a much easier way than sed of replacing one or more strings with others in one or more text or binary files or from standard input
RIV2ASCII Conversion is a simple tool that finds the meaning of letter codes spotted on freight trains and writes them to the screen.
rlwrap is a 'readline wrapper', i.e. a small utility that uses the GNU readline library to allow the editing of keyboard input for any other command.
a simple text-formatting language. It's similar in function to TeX, HTML, nroff/groff, Postscript
rpl is a UN*X text replacement utility. It will replace strings with new strings in multiple text files.
a set of RTF (Rich Text Format) translation tools
a tool to convert RTF documents (from Microsoft Word, Word Perfect, Frame Maker...) into documents for the WWW
Russian Anywhere is a utility to convert Cyrillic files between different codepages.
safecat implements Dan Bernstein's maildir algorithm, copying standard input safely to a specified directory. With safecat, the user is offered two assurances. First, if safecat returns successfully, then all data is guaranteed to be saved in the destination directory. Second, if a file exists in the destination directory, placed there by safecat, then the file is guaranteed to be complete.
Sar2html converts sar binary data to a graphical HTML format. It has a command line tool, Web interface, and data collection script.
SAREP is a command line search and replace tool written in Perl. It supports regular expressions, multiple file search-and-replace, wildcards, writing out to a new file (rather than overwriting the modified file), and the code is well commented so you can make changes very easily.
Seetxt is a lightweight text file and man page viewer for X windows.
an acronym for Simple Extensible LaTeX To HTML Converter. It is a program which reads a LaTeX source file and converts all known (i.e built-in or user created) commands to the appropriate HTML tags
the successor of the now-obsolete SGMLtools project; it consists of an easy-to-use front-end, a large number of processing backends, and some custom stylesheets
a tool for searching text files and filtering text streams using structural
a free, open-source producer of dynamic signatures for livening up your e-mail and news postings. It will allow you to sign your messages with a different sig every time
programs to give a quantitative measure of how similar two files are. similarity_by_diff measures the number of difference lines reported by diff(1), while similarity_by_zlib tries compressing the two files separately and together
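The idea of comparing compressed sizes can be illustrated with a short Python sketch. The entry does not say how similarity_by_zlib turns the sizes into a score, so the formula below (the normalized compression distance) is an assumption swapped in for illustration, and the function names are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data at maximum compression."""
    return len(zlib.compress(data, 9))


def compression_distance(a: bytes, b: bytes) -> float:
    """Normalized compression distance: near 0 for similar inputs, near 1 for unrelated ones."""
    size_a = compressed_size(a)
    size_b = compressed_size(b)
    size_ab = compressed_size(a + b)  # both inputs compressed together
    # Assumed formula (not taken from the tool): NCD = (C(ab) - min) / max
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)


if __name__ == "__main__":
    text = b"the quick brown fox jumps over the lazy dog\n" * 50
    print(compression_distance(text, text))
    print(compression_distance(text, bytes(range(256)) * 8))
```

When two inputs share content, compressing them together costs little more than compressing one alone, so the distance shrinks.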
takes words and such as input and it makes large variable-width letters consisting of asterisks
splitpea is a command-line tool written in Python that can split a file into multiple fixed-size pieces and join those pieces to form the original file.
a collection of various small utilities, written for text generation and text manipulation. They include: count - Generate sequenced strings, csvconv - Converter for CSV files, rot13 - Rot13 encoder/decoder, memory - File based hashtables and linkget - Extract links from HTML documents
super sed is an enhanced version of sed.
t2t is a Perl script that converts standard ASCII text to HTML 4.0 tables. Any text with the delimiter embedded in it is converted to a table. The user can specify any regular Perl expression as a delimiter; the default delimiter is the tab.
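As a rough illustration of the delimiter-to-table idea, here is a minimal Python sketch (t2t itself is a Perl script, and this is not its code). The function name `text_to_html_table` is hypothetical, and skipping lines that lack the delimiter is a simplifying assumption of the sketch.

```python
import html
import re


def text_to_html_table(text: str, delimiter: str = r"\t") -> str:
    """Convert delimiter-separated lines into an HTML table."""
    pattern = re.compile(delimiter)  # the delimiter is a regular expression
    rows = []
    for line in text.splitlines():
        if not pattern.search(line):
            continue  # this sketch simply skips lines without the delimiter
        cells = "".join(
            f"<td>{html.escape(cell)}</td>" for cell in pattern.split(line)
        )
        rows.append(f"  <tr>{cells}</tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"
```

Calling `text_to_html_table("x , y", r"\s*,\s*")` would treat a comma with optional surrounding whitespace as the delimiter instead of the default tab.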
Tabfmt is a command line utility to format tabular data. The program reads lines from one or more files or from standard input, breaks the lines into fields given a set of input field delimiters, and prints a table with constant-width columns to standard output or a specified file. Minimum and maximum field widths, left and right padding, as well as the characters used for filling, padding and delimiting the fields can be specified.
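The core of this kind of formatting can be sketched in a few lines of Python. This is only an illustration of splitting on delimiters and padding to constant-width columns, not Tabfmt's implementation; the function `format_table` and its parameters are hypothetical, and minimum and maximum field widths are left out.

```python
import re


def format_table(lines, delimiters=",", pad=1, fill=" "):
    """Split each line into fields and pad them to constant-width columns."""
    # Any single character from `delimiters` separates two fields.
    splitter = re.compile("[" + re.escape(delimiters) + "]")
    rows = [splitter.split(line.rstrip("\n")) for line in lines]
    if not rows:
        return []

    # Each column is as wide as its longest field.
    widths = [0] * max(len(row) for row in rows)
    for row in rows:
        for i, field in enumerate(row):
            widths[i] = max(widths[i], len(field))

    out = []
    for row in rows:
        cells = [field.ljust(widths[i], fill) for i, field in enumerate(row)]
        # Join columns with `pad` fill characters and drop trailing fill.
        out.append((fill * pad).join(cells).rstrip(fill))
    return out
```

For example, `format_table(["a,bbb,c", "dd,e,f"])` lines up the three columns so that each field starts at the same offset on both lines.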
Table is a small set of programs that treats HTML tables like database tables.
Utilities for manipulating tagged files
a system that transforms XML to LaTeX, HTML, XHTML+MathML and DocBook
TEItools is a coupled set of scripts, written in Tcl, which does various SGML transformations. Currently they include the following converters: from TEI Lite to HTML, RTF, TeX, DVI, PS, PDF; from HTML to TEI Lite, Linuxdoc, TeX, DVI, PS, PDF; from Linuxdoc to HTML, TEI Lite, DocBook, TeX, DVI, PS, PDF; and from DocBook to TEI Lite.
tesh is a simple shell which allows you to create and manage text documents by applying tags to them.
The Tilburg Memory Based Learner, TiMBL, is an open source tool for NLP research, and for many other domains where classification tasks are learned from examples.
timestamp is a text filtering pipe that marks each line with a timestamp. The time is set when the first character of the line is received, and the util is capable of coping with CR repeats fairly well (won't over-write or update the timestamp).
a graphical front end to the diff program. It provides a side-by-side view of the differences between two files, along with several innovative features such as diff bookmarks and a graphical map of differences for quick navigation
a simple file splitter/joiner written in Tcl/Tk. It can currently only be used to join previously split files, such as those downloaded from newsgroups with .001, .002, etc. extensions, but unlike many splitters, the user can opt to skip missing files
tlgu is a utility for converting an input file in Thesaurus Linguae Graecae (TLG) or Packard Humanities Institute (PHI) representation (beta code text and citation information) into Unicode (UTF-8). A companion Hellenic Polytonic HOWTO is also included on the tlgu site.
tlve is a command-line tool to parse different tlv (tag-length-value) structures and for printing them in different text-based formats.
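For readers unfamiliar with tag-length-value encoding, the sketch below parses one simple variant in Python. It assumes a one-byte tag followed by a one-byte length, which is only one of many TLV layouts and is not taken from tlve; the function name `parse_tlv` is hypothetical.

```python
def parse_tlv(data: bytes):
    """Parse a flat sequence of (tag, length, value) records.

    Assumed layout: 1-byte tag, 1-byte length, then `length` value bytes.
    """
    records = []
    pos = 0
    while pos < len(data):
        if pos + 2 > len(data):
            raise ValueError("truncated TLV header")
        tag, length = data[pos], data[pos + 1]
        pos += 2
        value = data[pos:pos + length]
        if len(value) != length:
            raise ValueError("truncated TLV value")
        records.append((tag, value))
        pos += length
    return records
```

Real formats often use multi-byte tags or lengths and allow nested records, so a practical parser would need the layout rules of the specific format.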
todo2html generates pretty HTML from a standard text TODO file. The formatting is configurable with style sheets, and there is an easy-to-read built-in style or a style-free option.
finds all the files in a directory tree and prints their names, using an optimised disk access strategy. It is similar to `find -print'. The added feature is that Treescan optimises the I/O in various ways. It is sometimes much faster than the naive strategy used by `find'
Trowser is a browser for large line-oriented text files (such as debug traces). It's meant as an alternative to "less". Compared to less, trowser adds color highlighting, a persistent search history, graphical bookmarking, separate search result windows, and flexible skipping of input from pipes to STDIN. Trowser has a graphical interface, but is designed to allow browsing via the keyboard at least to the same extent as less. Key bindings and the cursor positioning concept are derived from vim.
search (and replace) for text blocks in multiple files & (sub)directories
Txr implements a sophisticated query language that matches data across one or more text files or Unix pipes. Queries match entire files or sections of files.
reads a text document from stdin, removes all non-alphabetic characters and generates an array (list) of words. It then converts German umlauts, because Graphviz can only handle clean ASCII as node descriptions, and outputs a dot file for a directed or an undirected graph
a Perl program that converts plain text to HTML. It uses the HTML::TextToHTML perl module to do so
converts flat ASCII text to man page format. It is a shell script using GNU awk that should run on any Unix-like system
a powerful Perl 5 script to convert text files to PDF format
(commercial) a very flexible and powerful Perl program that converts files from text to PDF format. It extends txt2pdf, with features such as the ability to add form feeds, to skip the first form feed, to not print the file name in the first line, to set the top and left margins, and to set all the text to bold, italic, or bold italic
a Regular Expression "wizard", written entirely with bash2 builtins, that converts human sentences to RegExs. With its simple interface, you just answer questions and build your own RegEx for a large variety of programs, like awk, ed, emacs, grep, perl, php, python, sed, tcl and vim
a generic text converter. From simple text files, it generates HTML, SGML, man, Magic Point (mgp), MoinMoin and Adobe PageMaker documents
txtbdf2ps is a perl script that can generate compact, DSC-compliant Postscript out of a plain text file and a BDF font.
unac is a C library and command that removes accents from a string.
Unicode Data Browser
UnicodeDataBrowser is a browser for the UnicodeData.txt file, which contains much useful information but is not easily read by humans. It creates a scrollable table in which columns represent properties. The table may be sorted on any column. Abbreviations are expanded and characters cross-referenced in decomposition and casing fields are named. Regular expression search restricted to a selected column is available. The set of characters for which information is displayed may be restricted to those characters matching a regular expression on a specified property.
Unicode Description Utilities
The Unicode Description Utilities is a set of four programs for finding out what is in a Unicode file.
UnRTF is a command-line converter from RTF (Rich Text) to HTML, LaTeX, PostScript, plain text, and text with VT100 codes. When converting to HTML, it supports tables, fonts, embedded images, hyperlinks, paragraph alignment, and more.
Unsort unsorts a textfile. In other words: it randomizes the order of the lines in a file.
utf2any translates a file encoded in UTF-7 or UTF-8 (Unicode) into any 7- or 8-bit text format.
a very small and simple tool that attempts to show every line with a comment (both C++ style double slash, and original C 'slash-star-star-slash' matched quotes) inside one or more files
An HTML to ASCII or UTF-8 converter specifically programmed to output text suitable for reading.
a small Python program which removes much of the HTML cruft produced by Microsoft Word 2002 (Word version 10), making the resulting files much easier to hand-edit
WordFlashReader is a Rapid Serial Visual Presentation (RSVP) program useful for anyone who has an electronic text or book they wish to read. It flashes each word of the text sequentially and pauses for punctuation. It opens *.txt, *.html, and *.pdf files.
a small perl5 script that allows preprocessing of html files
a library which allows access to Microsoft Word files. It can load and parse the Word 2000, 97, 95 and 6 file formats. These are the file formats known internally as Word 9, 8, 7 and 6. wv (formerly known as mswordview) compiles and works under most operating systems, particularly Linux, Solaris, AIX and OSF1
a Microsoft Word 97 password validator and almost decrypter
the continuation of Caolan McNamara's wv - the MSWord library. Efforts are underway to make this library more correct, robust, and turn it into a Word97 exporter.
these tools are used to convert XML and HTML to and from a line-oriented format more amenable to processing by classic Unix pipeline processing tools, like grep, sed, awk, cut, shell scripts, and so forth
xroottext renders stdin onto the root window with line wrap and scrolling.
xtranslate is a tool to convert the text in the X11 selection buffer from ASCII to arbitrary UTF-8 characters. This is particularly useful when you've accidentally typed some text while the keyboard was in the wrong language mode.
Yacc to LaTeX: takes any yacc source file, and derives an Extended Backus-Naur Form (EBNF) description from it. This EBNF is written out as LaTeX source
builds html pages out of "templates". It is especially useful for building web sites where the look and feel of all pages should be the same