janodijk
2018-08-01+02:00
clarin.eu:cr1:p_1342181139640
CLARIN Netherlands
Resource
https://webservices-lst.science.ru.nl/piccl/
LandingPage
https://webservices-lst.science.ru.nl/piccl/info/
PICCL
PICCL: Philosophical Integrator of Computational and Corpus Libraries
v0.6.4
2015
https://webservices-lst.science.ru.nl/piccl/
none yet
published
2018-07-12
CLARIN-NL
CLARIN in the Netherlands
184.021.003
NWO
http://www.clarin.nl
Jan Odijk
National Coordinator
Utrecht, the Netherlands
j.odijk@uu.nl
UiL-OTS
Utrecht University
2009
2015
CLARIAH-CORE
Common Lab Research Infrastructure for the Arts and the Humanities
184.033.101
NWO
http://www.clariah.nl
Jan Odijk
National Coordinator
Utrecht, the Netherlands
j.odijk@uu.nl
UiL-OTS
Utrecht University
2015
2018
Netherlands
NL
PICCL is a set of workflows for corpus building through OCR, post-correction, modernization of historic language and Natural Language Processing. It combines Tesseract Optical Character Recognition, TICCL functionality and Frog functionality in a single pipeline.
Tesseract is open-source software for optical character recognition.
TICCL (Text Induced Corpus Clean-up) is a system that detects and corrects typographical errors (misprints) and optical character recognition (OCR) errors in texts. It searches a corpus for all existing variants of (potentially) all words occurring in that corpus. The corpus can be one text or several, in one or more directories, located on one or more machines. TICCL creates word frequency lists that record, for each word type, how often it occurs in the corpus. The frequency of a normalized word form is the sum of the frequencies of the actual word forms found in the corpus. OCR errors arise when a machine scans books or other texts from paper and turns the resulting images into digital text files. For instance, the letter combination 'in' can be read as 'm', so that the word 'regeering' is incorrectly reproduced as 'regeermg'. TICCL can be used to detect these errors and to suggest a correct form.
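The idea of summing variant frequencies under a normalized form can be illustrated with a small Python sketch. This is not TICCL's actual algorithm or API: the function names, the single confusion pair ('m' read for 'in', taken from the example above) and the toy lexicon are all assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical confusion pairs as (misread, intended).
# Only the 'm' for 'in' pair comes from the description above.
CONFUSIONS = [("m", "in")]


def candidate_corrections(word, confusions=CONFUSIONS):
    """Yield variants of `word` with one confusion undone at one position."""
    for wrong, right in confusions:
        start = word.find(wrong)
        while start != -1:
            yield word[:start] + right + word[start + len(wrong):]
            start = word.find(wrong, start + 1)


def suggest(word, lexicon):
    """Return `word` if it is in the lexicon, else the first candidate
    correction found in the lexicon, else None."""
    if word in lexicon:
        return word
    for candidate in candidate_corrections(word):
        if candidate in lexicon:
            return candidate
    return None


def normalized_frequencies(tokens, lexicon):
    """Build a frequency list in which each normalized form receives the
    summed counts of its observed variants."""
    raw = Counter(tokens)  # frequency of each actual word form
    normalized = Counter()
    for word, count in raw.items():
        target = suggest(word, lexicon) or word  # keep unknown words as-is
        normalized[target] += count
    return normalized


if __name__ == "__main__":
    tokens = ["de", "regeering", "regeermg", "regeering"]
    lexicon = {"de", "regeering"}  # toy lexicon, one word per entry
    for form, freq in sorted(normalized_frequencies(tokens, lexicon).items()):
        print(form, freq)
```

A lexicon lookup is used here only because it keeps the sketch short; how TICCL itself selects and ranks correction candidates is not described in this record.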
Frog enriches textual documents with various linguistic annotations.
written language tooloptical character recognitionorthographic normalisationsentence splittingtokenisationdependency parsingshallow parsinglemmatisationmorphological analysisnamed entity recognitionpart of speech taggingEnriching DataLinguisticsPhilosophyLiterary StudiesReligion StudiesHistorygeneral linguisticsorthographymorphologysyntaxyesDutchnldnoyesSwedishsweyes2021yesRussianrusyes2021yesSpanishspayes2021yesPortugueseporyes2021yesEnglishengyes2021yesGermandeuyes2021yesFrenchfrayes2021yesItalianitayes2021yesFinnishfinyes2021yesModern Greekellyes1521yesClassical Greekgrcyes-1015yesIcelandicislyes-1015yesGerman (Fraktur)deuyes-1015yesLatinlatnoyesRomanianronyes-1015Online availablehttps://github.com/LanguageMachines/picclnot specifiednot specifiedPOSIXnot specifiedunknownnextflownot specifiedhttps://github.com/nextflow-io/nextflowlocalDesktopticcltoolsnot specifiedhttps://github.com/LanguageMachines/ticcltoolslocalDesktopfoliautilsnot specifiedhttps://github.com/LanguageMachines/foliautilslocalDesktoptesseractnot specifiedhttps://github.com/tesseract-ocr/tesseractlocalDesktopcommand line interfacelocal desktopgraphical user interfaceweb applicationweb interfaceweb servicetextPDF pictureapplication/pdftextPDF with embedded textapplication/pdftextTIFF pictureimage/tifftextLexicon (one word per line)text/plaintextPost-OCR text documenttext/plaintextFoLiAtext/folia+xmltextDjVUimage/vnd.djvuGNU GPL3.0publichttps://spdx.org/licenses/GPL-3.00EUR
Martin Reynaert
Tilburg, the Netherlands
reynaert@uvt.nl
Department of Cognitive Science and Artificial Intelligence
Tilburg University, Tilburg
https://www.tilburguniversity.edu/about/schools/humanities/departments/dca/
Information Page
technical
https://webservices-lst.science.ru.nl/piccl/info/
eng
readme
technical
https://github.com/LanguageMachines/PICCL/blob/master/README.md
eng
releaseNotes
user
https://github.com/LanguageMachines/PICCL/releases
eng
issueTracker
technical
https://github.com/LanguageMachines/PICCL/issues
eng
contIntegration
technical
https://travis-ci.org/LanguageMachines/PICCL
eng
in proceedings
scientific background
yes
Martin Reynaert, Maarten van Gompel, Ko van der Sloot and Antal van den Bosch. 2015. PICCL: Philosophical Integrator of Computational and Corpus Libraries. Proceedings of CLARIN Annual Conference 2015, pp. 75-79. Wrocław, Poland. http://www.nederlab.nl/cms/wp-content/uploads/2015/10/Reynaert_PICCL-Philosophical-Integrator-of-Computational-and-Corpus-Libraries.pdf
CLARIN-NL
NWO
http://www.clarin.nl
CLARIAH-CORE
NWO
https://www.clariah.nl
Nederlab
NWO
http://www.nederlab.nl
Martin Reynaert
project lead
Martin Reynaert
Tilburg, the Netherlands
reynaert@uvt.nl
Department of Cognitive Science and Artificial Intelligence
Tilburg University, Tilburg
https://www.tilburguniversity.edu/about/schools/humanities/departments/dca/
software developer
Maarten van Gompel
Nijmegen, the Netherlands
proycon@anaproy.nl
Center for Language and Speech Technology
Radboud University Nijmegen
https://www.ru.nl/clst/
software developer
Ko van der Sloot
Nijmegen, the Netherlands
Center for Language and Speech Technology
Radboud University Nijmegen
https://www.ru.nl/clst/
nextflow
unknown
Yes. Before tool use, please register at https://webservices-lst.science.ru.nl/register.
PICCL
optical character recognition
orthographic normalisation
sentence splitting
tokenisation
dependency parsing
shallow parsing
lemmatisation
morphological analysis
named entity recognition
part of speech tagging
projectnewinputself.linkToResourceinputpdftext_url