PICCL: Philosophical Integrator of Computational and Corpus Libraries

janodijk 2018-08-01+02:00 clarin.eu:cr1:p_1342181139640 CLARIN Netherlands

Resource https://webservices-lst.science.ru.nl/piccl/

LandingPage https://webservices-lst.science.ru.nl/piccl/info/

PICCL PICCL: Philosophical Integrator of Computational and Corpus Libraries v0.6.4 2015 https://webservices-lst.science.ru.nl/piccl/ none yet published 2018-07-12

CLARIN-NL CLARIN in the Netherlands 184.021.003 NWO http://www.clarin.nl Jan Odijk National Coordinator

Utrecht, the Netherlands

j.odijk@uu.nl UiL-OTS Utrecht University 2009 2015

CLARIAH-CORE Common Lab Research Infrastructure for the Arts and the Humanities 184.033.101 NWO http://www.clariah.nl Jan Odijk National Coordinator

Utrecht, the Netherlands

j.odijk@uu.nl UiL-OTS Utrecht University 2015 2018

Netherlands NL

PICCL is a set of workflows for corpus building through OCR, post-correction, modernization of historical language, and Natural Language Processing. It combines Tesseract optical character recognition, TICCL functionality, and Frog functionality in a single pipeline.

Tesseract is open-source software for optical character recognition.

TICCL (Text-Induced Corpus Clean-up) is a system designed to search a corpus for all existing variants of (potentially) all words occurring in it. The corpus can be one text or several, in one or more directories, located on one or more machines. TICCL creates word frequency lists that record, for each word type, how often it occurs in the corpus. The frequency of a normalized word form is the sum of the frequencies of the actual word forms found in the corpus. TICCL is intended to detect and correct typographical errors (misprints) and OCR (optical character recognition) errors in texts. Errors occur when a machine scans books or other texts from paper and then turns the scans, i.e. images, into digital text files. For instance, the letter combination 'in' can be read as 'm', so that the word 'regeering' is incorrectly reproduced as 'regeermg'. TICCL can be used to detect these errors and to suggest a correct form.

Frog enriches textual documents with various linguistic annotations.

written language, tool

optical character recognition, orthographic normalisation, sentence splitting, tokenisation, dependency parsing, shallow parsing, lemmatisation, morphological analysis, named entity recognition, part of speech tagging

Enriching Data

Linguistics, Philosophy, Literary Studies, Religion Studies, History

general linguistics, orthography, morphology, syntax

yes Dutch

nld

no yes Swedish

swe

yes 20 21 yes Russian

rus

yes 20 21 yes Spanish

spa

yes 20 21 yes Portuguese

por

yes 20 21 yes English

eng

yes 20 21 yes German

deu

yes 20 21 yes French

fra

yes 20 21 yes Italian

ita

yes 20 21 yes Finnish

fin

yes 20 21 yes Modern Greek

ell

yes 15 21 yes Classical Greek

grc

yes -10 15 yes Icelandic

isl

yes -10 15 yes German (Fraktur)

deu

yes -10 15 yes Latin

lat

no yes Romanian

ron

yes -10 15

Online available https://github.com/LanguageMachines/piccl not specified not specified POSIX not specified unknown

nextflow not specified https://github.com/nextflow-io/nextflow localDesktop

ticcltools not specified https://github.com/LanguageMachines/ticcltools localDesktop

foliautils not specified https://github.com/LanguageMachines/foliautils localDesktop

tesseract not specified https://github.com/tesseract-ocr/tesseract localDesktop

command line interface local desktop graphical user interface web application web interface web service

text PDF picture application/pdf

text PDF with embedded text application/pdf

text TIFF picture image/tiff

text Lexicon (one word per line) text/plain

text Post-OCR text document text/plain

text FoLiA text/folia+xml

text DjVU image/vnd.djvu

text utf8 FoLiA text/folia+xml

Discourse/Sentence Boundaries Orthography/Token Morphosyntax/Inflection Morphosyntax/Lemma Morphosyntax/POS Morphosyntax/Word form POSTags/DCOI Tagset

GNU GPL 3.0 public https://spdx.org/licenses/GPL-3.0 0

EUR

Martin Reynaert

Tilburg, the Netherlands

reynaert@uvt.nl Department of Cognitive Science and Artificial Intelligence Tilburg University, Tilburg https://www.tilburguniversity.edu/about/schools/humanities/departments/dca/

Information Page technical https://webservices-lst.science.ru.nl/piccl/info/

eng

readme technical https://github.com/LanguageMachines/PICCL/blob/master/README.md

eng

releaseNotes user https://github.com/LanguageMachines/PICCL/releases

eng

issueTracker technical https://github.com/LanguageMachines/PICCL/issues

eng

contIntegration technical https://travis-ci.org/LanguageMachines/PICCL

eng

in proceedings scientific background yes

Martin Reynaert, Maarten van Gompel, Ko van der Sloot and Antal van den Bosch. 2015. PICCL: Philosophical Integrator of Computational and Corpus Libraries. Proceedings of CLARIN Annual Conference 2015, pp. 75-79. Wrocław, Poland. http://www.nederlab.nl/cms/wp-content/uploads/2015/10/Reynaert_PICCL-Philosophical-Integrator-of-Computational-and-Corpus-Libraries.pdf

CLARIN-NL
<funder>NWO</funder>
<url>http://www.clarin.nl</url>
<Contact>
  <Person/>
  <Email/>
  <Organisation xml:lang="eng"/>
</Contact>
<Duration/>
</Project>
<Project>
  <name>CLARIAH-CORE</name>
  <title/>
  <funder>NWO</funder>
  <url>https://www.clariah.nl</url>
  <Contact>
    <Person/>
    <Email/>
    <Organisation xml:lang="eng"/>
  </Contact>
  <Duration/>
</Project>
<Project>
  <name>Nederlab</name>
  <title/>
  <funder>NWO</funder>
  <url>http://www.nederlab.nl</url>
  <Contact>
    <Person/>
    <Email/>
    <Organisation xml:lang="eng"/>
  </Contact>
  <Duration/>
</Project>
<Creator>
  <Contact>
    <Person>Martin Reynaert</Person>
    <Organisation xml:lang="eng"/>
  </Contact>
</Creator>
<Creator>
  <Role>project lead</Role>
  <Contact>
    <Person>Martin Reynaert</Person>
    <Address>Tilburg, the Netherlands</Address>
    <Email>reynaert@uvt.nl</Email>
    <Department>Department of Cognitive Science and Artificial Intelligence</Department>
    <Organisation>Tilburg University, Tilburg</Organisation>
    <Url>https://www.tilburguniversity.edu/about/schools/humanities/departments/dca/</Url>
  </Contact>
</Creator>
<Creator>
  <Role>software developer</Role>
  <Contact>
    <Person>Maarten van Gompel</Person>
    <Address>Nijmegen, the Netherlands</Address>
    <Email>proycon@anaproy.nl</Email>
    <Department>Center for Language and Speech Technology</Department>
    <Organisation>Radboud University Nijmegen</Organisation>
    <Url>https://www.ru.nl/clst/</Url>
  </Contact>
</Creator>
<Creator>
  <Role>software developer</Role>
  <Contact>
    <Person>Ko van der Sloot</Person>
    <Address>Nijmegen, the Netherlands</Address>
    <Department>Center for Language and Speech Technology</Department>
    <Organisation>Radboud University Nijmegen</Organisation>
    <Url>https://www.ru.nl/clst/</Url>
  </Contact>
</Creator>
</SoftwareDevelopment>
<TechnicalInfo>
  <ImplementationLanguage>
    <implementationLanguage>nextflow</implementationLanguage>
    <version>unknown</version>
  </ImplementationLanguage>
</TechnicalInfo>
<LRS>
  <Authentication>Yes. Before tool use, please register at https://webservices-lst.science.ru.nl/register.</Authentication>
  <Description><Description>PICCL</Description></Description>
  <ToolTasks>
    <toolTask>optical character recognition</toolTask>
    <toolTask>orthographic normalisation</toolTask>
    <toolTask>sentence splitting</toolTask>
    <toolTask>tokenisation</toolTask>
    <toolTask>dependency parsing</toolTask>
    <toolTask>shallow parsing</toolTask>
    <toolTask>lemmatisation</toolTask>
    <toolTask>morphological analysis</toolTask>
    <toolTask>named entity recognition</toolTask>
    <toolTask>part of speech tagging</toolTask>
  </ToolTasks>
  <ActualParameters>
    <ActualParameter>
      <ActualParameterName>project</ActualParameterName>
      <ActualParameterValue>new</ActualParameterValue>
    </ActualParameter>
    <ActualParameter>
      <ActualParameterName>input</ActualParameterName>
      <ActualParameterValue>self.linkToResource</ActualParameterValue>
    </ActualParameter>
  </ActualParameters>
  <LRSMapping>
    <LRSParameterName>input</LRSParameterName>
    <ActualParameterName>pdftext_url</ActualParameterName>
  </LRSMapping>
</LRS>
</ClarinSoftwareDescription>
</Components>
</CMD>
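The TICCL description in this record mentions two ideas that a small example may make more concrete: suggesting a correct form for an OCR misreading such as 'regeermg', and summing the frequencies of observed variants under their normalized form. The Python sketch below illustrates both under stated assumptions. It is not TICCL's actual algorithm or API: the names `CONFUSIONS`, `candidates`, `suggest` and `normalized_frequencies`, the tiny lexicon, and the token list are all hypothetical, and only the 'm' for 'in' confusion comes from the description.

```python
from collections import Counter

# Hypothetical confusion table (OCR output -> intended text).
# Only the "m" -> "in" pair is taken from the record's description.
CONFUSIONS = {"m": "in"}


def candidates(word, confusions=CONFUSIONS):
    """Yield each string obtained by undoing one confusion at one position."""
    for wrong, right in confusions.items():
        start = word.find(wrong)
        while start != -1:
            yield word[:start] + right + word[start + len(wrong):]
            start = word.find(wrong, start + 1)


def suggest(word, lexicon):
    """Return the word if it is in the lexicon, else the first known candidate."""
    if word in lexicon:
        return word
    for cand in candidates(word):
        if cand in lexicon:
            return cand
    return None  # no suggestion found


def normalized_frequencies(tokens, lexicon):
    """Sum the counts of all observed forms under their normalized form."""
    merged = Counter()
    for form, count in Counter(tokens).items():
        # Forms without a suggestion keep their own entry.
        merged[suggest(form, lexicon) or form] += count
    return merged


# Assumed toy data: a lexicon with one word per entry and a short token stream.
lexicon = {"regeering", "de", "van"}
tokens = ["de", "regeering", "van", "de", "regeermg", "regeermg"]
freqs = normalized_frequencies(tokens, lexicon)
```

With this toy input, the two occurrences of 'regeermg' are counted together with the one occurrence of 'regeering'. A real system would need a far larger confusion set and a way to rank competing candidates, which this sketch deliberately leaves out.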