janodijk 2018-08-01+02:00 clarin.eu:cr1:p_1342181139640 CLARIN Netherlands
Resource https://github.com/LanguageMachines/ucto Ucto Engine Ucto Tokeniser Engine v0.13 2011-03-27 https://github.com/LanguageMachines/ucto none yet published 2018-05-17 CLARIN-NL CLARIN in the Netherlands 184.021.003 NWO http://www.clarin.nl Jan Odijk National Coordinator
Utrecht, the Netherlands
j.odijk@uu.nl UiL-OTS Utrecht University
2009 2015
CLARIAH-CORE Common Lab Research Infrastructure for the Arts and the Humanities 184.033.101 NWO http://www.clariah.nl Jan Odijk National Coordinator
Utrecht, the Netherlands
j.odijk@uu.nl UiL-OTS Utrecht University
2015 2018
Netherlands NL The Ucto tokenisation engine is a language-independent engine that, given an external configuration file with tokenisation rules for a specific language, yields a tokeniser for that language. The resulting tokeniser processes text files: it separates words from punctuation and splits sentences. This is one of the first tasks in almost any Natural Language Processing application. Ucto also offers other basic preprocessing steps, such as changing case, that prepare text for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules for English, Dutch, French, Italian, and Swedish, and can be extended to other languages by supplying further rule files. It recognises dates, times, units, currencies, and abbreviations, as well as paired quote spans, sentences, and paragraphs. It produces UTF-8 output with NFC normalisation and optionally accepts other encodings as input. Text can optionally be converted to all lowercase or all uppercase. Ucto supports FoLiA XML.
written language tool sentence splitting tokenisation Enriching Data Linguistics general linguistics syntax no no Online available https://github.com/LanguageMachines/ucto not specified not specified POSIX not specified unknown icu not specified http://site.icu-project.org/design/cpp localDesktop libxml2 not specified https://pypi.org/project/libxml2-python/ localDesktop ticcutils not specified localDesktop libfolia not specified https://github.com/LanguageMachines/libfolia localDesktop command line interface local desktop text PDF application/pdf text MS-Word application/msword utf8 text FoLiA text/folia+xml utf8 text text/plain ISO-8859-1 text text/plain ISO 8859-15 text text/plain text utf8 One Sentence per Line text/plain Discourse/Sentence Boundaries Orthography/Token text utf8 One Token per Line text/plain Discourse/Sentence Boundaries Orthography/Token text utf8 FoLiA text/folia+xml Discourse/Sentence Boundaries Orthography/Token GNU GPL 3.0 public https://spdx.org/licenses/GPL-3.0 0 EUR Antal van den Bosch
Nijmegen, the Netherlands
a.vandenbosch@let.ru.nl Center for Language and Speech Technology Radboud University Nijmegen https://www.ru.nl/clst/
Maarten van Gompel, Ko van der Sloot and Antal van den Bosch. 2017. Ucto: Unicode Tokenizer. version 0.9.6. Reference Guide. Technical Report, Jan 23, 2017. technical https://github.com/LanguageMachines/ucto/blob/master/docs/ucto_manual.pdf eng Maarten van Gompel, Ko van der Sloot, Antal van den Bosch (2012). Ucto: Unicode Tokeniser. Reference Guide. ILK Technical Report 12-05. technical http://ilk.uvt.nl/downloads/pub/papers/ilk.1205.pdf eng readme user https://github.com/LanguageMachines/ucto/blob/master/README.md eng releaseNotes user https://github.com/LanguageMachines/ucto/releases eng issueTracker technical https://github.com/LanguageMachines/ucto/issues eng contIntegration technical https://travis-ci.org/LanguageMachines/ucto eng https://raw.githubusercontent.com/LanguageMachines/ucto/master/logo.svg CLARIN-NL <funder>NWO</funder> <url/> <Contact> <Person/> <Email/> <Organisation xml:lang="eng"/> </Contact> <Duration/> </Project> <Project> <name>CLARIAH-CORE</name> <title/> <funder>NWO</funder> <url/> <Contact> <Person/> <Email/> <Organisation xml:lang="eng"/> </Contact> <Duration/> </Project> <Creator> <Contact> <Person>Antal van den Bosch</Person> <Email/> <Organisation xml:lang="eng"/> </Contact> </Creator> <Creator> <Role> project lead </Role> <Contact> <Person> Antal van den Bosch </Person> <Address>Nijmegen, the Netherlands</Address> <Email> a.vandenbosch@let.ru.nl </Email> <Department>Center for Language and Speech Technology</Department> <Organisation> Radboud University Nijmegen </Organisation> <Url> https://www.ru.nl/clst/ </Url> </Contact> </Creator> <Creator> <Role> software developer </Role> <Contact> <Person> Maarten van Gompel </Person> <Address>Nijmegen, the Netherlands</Address> <Email> proycon@anaproy.nl </Email> <Department>Center for Language and Speech Technology</Department> <Organisation> Radboud University Nijmegen </Organisation> <Url> https://www.ru.nl/clst/ </Url> </Contact> </Creator> <Creator> <Role> software developer </Role> 
<Contact> <Person> Ko van der Sloot </Person> <Address>Nijmegen, the Netherlands</Address> <Department>Center for Language and Speech Technology</Department> <Organisation> Radboud University Nijmegen </Organisation> <Url> https://www.ru.nl/clst/ </Url> </Contact> </Creator> </SoftwareDevelopment> <TechnicalInfo> <ImplementationLanguage> <implementationLanguage>C++</implementationLanguage> <version>unknown</version> </ImplementationLanguage> </TechnicalInfo> </ClarinSoftwareDescription> </Components> </CMD>