Netherlands (NL)

PICCL is a set of workflows for corpus building through OCR, post-correction, modernization of historical language, and Natural Language Processing. It combines Tesseract optical character recognition with TICCL and Frog functionality in a single pipeline. Tesseract is open-source software for optical character recognition (OCR).

TICCL (Text-Induced Corpus Clean-up) is a system that detects and corrects typographical errors (misprints) and OCR errors in texts. It is designed to search a corpus for all existing variants of (potentially) all words occurring in it. The corpus can be a single text or several texts, in one or more directories, located on one or more machines. TICCL creates word frequency lists that record, for each word type, how often it occurs in the corpus. The frequency of a normalized word form is the sum of the frequencies of the actual word forms found in the corpus. Errors arise when books or other texts are scanned from paper and the resulting images are converted into digital text files. For instance, the letter combination 'in' can be read as 'm', so that the word 'regeering' is incorrectly reproduced as 'regeermg'. TICCL can be used to detect such errors and to suggest a correct form.

Frog enriches textual documents with various linguistic annotations.
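
The word frequency lists and the 'in'/'m' confusion described for TICCL can be illustrated with a small Python sketch. This is a toy illustration only, not TICCL's actual implementation or API: the function names (`word_frequencies`, `suggest_corrections`, `normalized_frequencies`), the confusion table, the sample corpus, and the lexicon are all assumptions made up for this example. Only the 'regeering'/'regeermg' pair comes from the description above.

```python
import re
from collections import Counter

# Hypothetical confusion table: (what OCR produced, what was printed).
# Only the 'm' <- 'in' pair is taken from the description; a real system
# would need many more pairs.
CONFUSIONS = [("m", "in")]


def word_frequencies(text):
    """Count how often each word type occurs (lower-cased, letters only)."""
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


def suggest_corrections(word, lexicon, confusions=CONFUSIONS):
    """Return lexicon words obtained by undoing one confusion in `word`."""
    suggestions = set()
    for wrong, right in confusions:
        start = word.find(wrong)
        while start != -1:
            candidate = word[:start] + right + word[start + len(wrong):]
            if candidate in lexicon:
                suggestions.add(candidate)
            start = word.find(wrong, start + 1)
    return sorted(suggestions)


def normalized_frequencies(freqs, lexicon):
    """Add the count of each unambiguously correctable variant to its
    normalized form; leave all other words unchanged."""
    merged = Counter()
    for word, count in freqs.items():
        fixes = [] if word in lexicon else suggest_corrections(word, lexicon)
        target = fixes[0] if len(fixes) == 1 else word
        merged[target] += count
    return merged


# Made-up sample corpus and lexicon (one word per entry).
corpus = "De regeering en de regeermg. De regeering besluit."
lexicon = {"de", "en", "regeering", "besluit"}

freqs = word_frequencies(corpus)
print(freqs["regeering"], freqs["regeermg"])                 # 2 1
print(suggest_corrections("regeermg", lexicon))              # ['regeering']
print(normalized_frequencies(freqs, lexicon)["regeering"])   # 3
```

In this sketch the frequency of the normalized form 'regeering' (3) is the sum of the frequencies of the observed forms 'regeering' (2) and 'regeermg' (1), mirroring how the description characterizes the normalized frequencies.
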
written language
tool
optical character recognition; orthographic normalisation; sentence splitting; tokenisation; dependency parsing; shallow parsing; lemmatisation; morphological analysis; named entity recognition; part of speech tagging
Enriching Data
Linguistics; Philosophy; Literary Studies; Religion Studies; History
general linguistics; orthography; morphology; syntax

yes  Dutch             nld  no
yes  Swedish           swe  yes  20   21
yes  Russian           rus  yes  20   21
yes  Spanish           spa  yes  20   21
yes  Portuguese        por  yes  20   21
yes  English           eng  yes  20   21
yes  German            deu  yes  20   21
yes  French            fra  yes  20   21
yes  Italian           ita  yes  20   21
yes  Finnish           fin  yes  20   21
yes  Modern Greek      ell  yes  15   21
yes  Classical Greek   grc  yes  -10  15
yes  Icelandic         isl  yes  -10  15
yes  German (Fraktur)  deu  yes  -10  15
yes  Latin             lat  no
yes  Romanian          ron  yes  -10  15

Online available
https://github.com/LanguageMachines/piccl
not specified  not specified  POSIX  not specified  unknown

nextflow    not specified  https://github.com/nextflow-io/nextflow         localDesktop
ticcltools  not specified  https://github.com/LanguageMachines/ticcltools  localDesktop
foliautils  not specified  https://github.com/LanguageMachines/foliautils  localDesktop
tesseract   not specified  https://github.com/tesseract-ocr/tesseract      localDesktop

command line interface local desktop graphical user interface web application web interface web service

text  PDF picture                  application/pdf
text  PDF with embedded text       application/pdf
text  TIFF picture                 image/tiff
text  Lexicon (one word per line)  text/plain
text  Post-OCR text document       text/plain
text  FoLiA                        text/folia+xml
text  DjVU                         image/vnd.djvu

text  utf8  FoLiA  text/folia+xml
Discourse/Sentence Boundaries; Orthography/Token; Morphosyntax/Inflection; Morphosyntax/Lemma; Morphosyntax/POS; Morphosyntax/Word form; POSTags/DCOI Tagset

GNU GPL 3.0  public  https://spdx.org/licenses/GPL-3.0
0 EUR
Martin Reynaert
