janodijk 2018-08-01+02:00 clarin.eu:cr1:p_1342181139640 CLARIN Netherlands
Resource https://webservices-lst.science.ru.nl/piccl/ LandingPage https://webservices-lst.science.ru.nl/piccl/info/ PICCL PICCL: Philosophical Integrator of Computational and Corpus Libraries v0.6.4 2015 https://webservices-lst.science.ru.nl/piccl/none yet published 2018-07-12 CLARIN-NL CLARIN in the Netherlands 184.021.003 NWO http://www.clarin.nl Jan Odijk National Coordinator
Utrecht, the Netherlands
j.odijk@uu.nl UiL-OTS Utrecht University
2009 2015
CLARIAH-CORE Common Lab Research Infrastructure for the Arts and the Humanities 184.033.101 NWO http://www.clariah.nl Jan Odijk National Coordinator
Utrecht, the Netherlands
j.odijk@uu.nl UiL-OTS Utrecht University
2015 2018
Netherlands NL PICCL is a set of workflows for corpus building through OCR, post-correction, modernization of historical language, and Natural Language Processing. It combines Tesseract optical character recognition, TICCL functionality, and Frog functionality in a single pipeline. Tesseract is open-source software for optical character recognition. TICCL (Text Induced Corpus Clean-up) is a system designed to search a corpus for all existing variants of (potentially) all words occurring in the corpus. This corpus can be one text or several, in one or more directories, located on one or more machines. TICCL creates word frequency lists, listing for each word type how often the word occurs in the corpus. The frequency of each normalized word form is the sum of the frequencies of the actual word forms found in the corpus. TICCL is intended to detect and correct typographical errors (misprints) and optical character recognition (OCR) errors in texts. When books or other texts are scanned from paper and the scans, i.e. images, are converted into digital text files, errors occur. For instance, the letter combination 'in' can be read as 'm', so that the word 'regeering' is incorrectly reproduced as 'regeermg'. TICCL can be used to detect these errors and to suggest a correct form. Frog enriches textual documents with various linguistic annotations.
written language tool optical character recognition orthographic normalisation sentence splitting tokenisation dependency parsing shallow parsing lemmatisation morphological analysis named entity recognition part of speech tagging Enriching Data Linguistics Philosophy Literary Studies Religion Studies History general linguistics orthography morphology syntax yes Dutch nld no yes Swedish swe yes 20 21 yes Russian rus yes 20 21 yes Spanish spa yes 20 21 yes Portuguese por yes 20 21 yes English eng yes 20 21 yes German deu yes 20 21 yes French fra yes 20 21 yes Italian ita yes 20 21 yes Finnish fin yes 20 21 yes Modern Greek ell yes 15 21 yes Classical Greek grc yes -10 15 yes Icelandic isl yes -10 15 yes German (Fraktur) deu yes -10 15 yes Latin lat no yes Romanian ron yes -10 15 Online available https://github.com/LanguageMachines/piccl not specified not specified POSIX not specified unknown nextflow not specified https://github.com/nextflow-io/nextflow localDesktop ticcltools not specified https://github.com/LanguageMachines/ticcltools localDesktop foliautils not specified https://github.com/LanguageMachines/foliautils localDesktop tesseract not specified https://github.com/tesseract-ocr/tesseract localDesktop command line interface local desktop graphical user interface web application web interface web service text PDF picture application/pdf text PDF with embedded text application/pdf text TIFF picture image/tiff text Lexicon (one word per line) text/plain text Post-OCR text document text/plain text FoLiA text/folia+xml text DjVU image/vnd.djvu text utf8 FoLiA text/folia+xml Discourse/Sentence Boundaries Orthography/Token Morphosyntax/Inflection Morphosyntax/Lemma Morphosyntax/POS Morphosyntax/Word form POSTags/DCOI Tagset GNU GPL 3.0 public https://spdx.org/licenses/GPL-3.0 0 EUR Martin Reynaert
Tilburg, the Netherlands
reynaert@uvt.nl Department of Cognitive Science and Artificial Intelligence Tilburg University, Tilburg https://www.tilburguniversity.edu/about/schools/humanities/departments/dca/
Information Page technical https://webservices-lst.science.ru.nl/piccl/info/ eng readme technical https://github.com/LanguageMachines/PICCL/blob/master/README.md eng releaseNotes user https://github.com/LanguageMachines/PICCL/releases eng issueTracker technical https://github.com/LanguageMachines/PICCL/issues eng contIntegration technical https://travis-ci.org/LanguageMachines/PICCL eng in proceedings scientific background yes Martin Reynaert, Maarten van Gompel, Ko van der Sloot and Antal van den Bosch. 2015. PICCL: Philosophical Integrator of Computational and Corpus Libraries. Proceedings of CLARIN Annual Conference 2015, pp. 75-79. Wrocław, Poland. http://www.nederlab.nl/cms/wp-content/uploads/2015/10/Reynaert_PICCL-Philosophical-Integrator-of-Computational-and-Corpus-Libraries.pdf CLARIN-NL <funder>NWO</funder> <url>http://www.clarin.nl</url> <Contact> <Person/> <Email/> <Organisation xml:lang="eng"/> </Contact> <Duration/> </Project> <Project> <name>CLARIAH-CORE</name> <title/> <funder>NWO</funder> <url>https://www.clariah.nl</url> <Contact> <Person/> <Email/> <Organisation xml:lang="eng"/> </Contact> <Duration/> </Project> <Project> <name>Nederlab</name> <title/> <funder>NWO</funder> <url>http://www.nederlab.nl</url> <Contact> <Person/> <Email/> <Organisation xml:lang="eng"/> </Contact> <Duration/> </Project> <Creator> <Contact> <Person>Martin Reynaert</Person> <Organisation xml:lang="eng"/> </Contact> </Creator> <Creator> <Role> project lead </Role> <Contact> <Person> Martin Reynaert </Person> <Address>Tilburg, the Netherlands</Address> <Email> reynaert@uvt.nl </Email> <Department>Department of Cognitive Science and Artificial Intelligence</Department> <Organisation>Tilburg University, Tilburg</Organisation> <Url>https://www.tilburguniversity.edu/about/schools/humanities/departments/dca/</Url> </Contact> </Creator> <Creator> <Role> software developer </Role> <Contact> <Person> Maarten van Gompel </Person> <Address>Nijmegen, the Netherlands</Address> <Email> 
proycon@anaproy.nl </Email> <Department>Center for Language and Speech Technology</Department> <Organisation> Radboud University Nijmegen </Organisation> <Url> https://www.ru.nl/clst/ </Url> </Contact> </Creator> <Creator> <Role> software developer </Role> <Contact> <Person> Ko van der Sloot </Person> <Address>Nijmegen, the Netherlands</Address> <Department>Center for Language and Speech Technology</Department> <Organisation> Radboud University Nijmegen </Organisation> <Url> https://www.ru.nl/clst/ </Url> </Contact> </Creator> </SoftwareDevelopment> <TechnicalInfo> <ImplementationLanguage> <implementationLanguage>nextflow</implementationLanguage> <version>unknown</version> </ImplementationLanguage> </TechnicalInfo> <LRS> <Authentication>Yes. Before tool use, please register at https://webservices-lst.science.ru.nl/register.</Authentication> <Description><Description>PICCL</Description></Description> <ToolTasks> <toolTask>optical character recognition</toolTask> <toolTask>orthographic normalisation</toolTask> <toolTask>sentence splitting</toolTask> <toolTask>tokenisation</toolTask> <toolTask>dependency parsing</toolTask> <toolTask>shallow parsing</toolTask> <toolTask>lemmatisation</toolTask> <toolTask>morphological analysis</toolTask> <toolTask>named entity recognition</toolTask> <toolTask>part of speech tagging</toolTask> </ToolTasks> <ActualParameters><!--0-1 --> <ActualParameter><!--1 - unbounded --> <ActualParameterName>project</ActualParameterName> <ActualParameterValue>new</ActualParameterValue> </ActualParameter> <ActualParameter><!--1 - unbounded --> <ActualParameterName>input</ActualParameterName> <ActualParameterValue>self.linkToResource</ActualParameterValue> </ActualParameter> </ActualParameters> <LRSMapping> <LRSParameterName>input</LRSParameterName> <ActualParameterName>pdftext_url</ActualParameterName> </LRSMapping> </LRS> </ClarinSoftwareDescription> </Components> </CMD>