janodijk
2018-08-01+02:00
clarin.eu:cr1:p_1342181139640
CLARIN Netherlands
Resource
https://webservices-lst.science.ru.nl/ucto/
Ucto
Ucto Tokeniser
v0.13
2011-03-27
https://webservices-lst.science.ru.nl/ucto/
none yet
published
2018-05-17
CLARIN-NL
CLARIN in the Netherlands
184.021.003
NWO
http://www.clarin.nl
Jan Odijk
National Coordinator
Utrecht, the Netherlands
j.odijk@uu.nl
UiL-OTS
Utrecht University
2009
2015
CLARIAH-CORE
Common Lab Research Infrastructure for the Arts and the Humanities
184.033.101
NWO
http://www.clariah.nl
Jan Odijk
National Coordinator
Utrecht, the Netherlands
j.odijk@uu.nl
UiL-OTS
Utrecht University
2015
2018
Netherlands
NL
Ucto tokenizes text files: it separates words from punctuation and splits sentences. This is one of the first tasks in almost any Natural Language Processing application. Ucto also offers other basic preprocessing steps, such as changing case, all of which you can use to prepare your text for further processing such as indexing, part-of-speech tagging, or machine translation. The tokeniser engine is language independent: by supplying language-specific tokenisation rules in an external configuration file, a tokeniser can be created for a specific language. Ucto comes with tokenization rules for English, Dutch, French, Italian, and Swedish, and it is easily extensible to other languages. It recognizes dates, times, units, currencies, and abbreviations, as well as paired quote spans, sentences, and paragraphs. It produces UTF-8 encoded, NFC-normalized output and optionally accepts other encodings as input. Text can optionally be converted to all lowercase or all uppercase. Ucto supports FoLiA XML.
written language tool
sentence splitting
tokenisation
Enriching Data
Linguistics
general linguistics
syntax
yes
Dutch
nld
yes
2021
yes
Swedish
swe
yes
2021
yes
Russian
rus
yes
2021
yes
Spanish
spa
yes
2021
yes
Portuguese
por
yes
2021
yes
English
eng
yes
2021
yes
German
deu
yes
2021
yes
French
fra
yes
2021
yes
Italian
ita
yes
2021
Online available
https://github.com/LanguageMachines/ucto
not specified
not specified
POSIX
not specified
unknown
icu
not specified
http://site.icu-project.org/design/cpp
localDesktop
libxml2
not specified
https://pypi.org/project/libxml2-python/
localDesktop
ticcutils
not specified
localDesktop
libfolia
not specified
https://github.com/LanguageMachines/libfolia
localDesktop
command line interface
local desktop
graphical user interface
web application
web interface
web service
text
PDF
application/pdf
text
MS-Word
application/msword
utf8
text
FoLiA
text/folia+xml
utf8
text
text/plain
ISO-8859-1
text
text/plain
ISO 8859-15
text
text/plain
GNU GPL
3.0
public
https://spdx.org/licenses/GPL-3.0
0
EUR
Antal van den Bosch
Nijmegen, the Netherlands
a.vandenbosch@let.ru.nl
Center for Language and Speech Technology
Radboud University Nijmegen
https://www.ru.nl/clst/
Maarten van Gompel, Ko van der Sloot and Antal van den Bosch. 2017. Ucto: Unicode Tokenizer. version 0.9.6. Reference Guide. Technical Report, Jan 23, 2017.
technical
https://raw.githubusercontent.com/proycon/ucto/master/docs/ucto_manual.pdf
eng
Maarten van Gompel, Ko van der Sloot, Antal van den Bosch (2012). Ucto: Unicode Tokeniser. Reference Guide. ILK Technical Report 12-05.
technical
http://ilk.uvt.nl/downloads/pub/papers/ilk.1205.pdf
eng
readme
user
https://github.com/LanguageMachines/ucto/blob/master/README.md
eng
releaseNotes
user
https://github.com/LanguageMachines/ucto/releases
eng
issueTracker
technical
https://github.com/LanguageMachines/ucto/issues
eng
contIntegration
technical
https://travis-ci.org/LanguageMachines/ucto
eng
https://raw.githubusercontent.com/LanguageMachines/ucto/master/logo.svg
CLARIN-NL
NWO
CLARIAH-CORE
NWO
Antal van den Bosch
project lead
Antal van den Bosch
Nijmegen, the Netherlands
a.vandenbosch@let.ru.nl
Center for Language and Speech Technology
Radboud University Nijmegen
https://www.ru.nl/clst/
software developer
Maarten van Gompel
Nijmegen, the Netherlands
proycon@anaproy.nl
Center for Language and Speech Technology
Radboud University Nijmegen
https://www.ru.nl/clst/
software developer
Ko van der Sloot
Nijmegen, the Netherlands
Center for Language and Speech Technology
Radboud University Nijmegen
https://www.ru.nl/clst/
C++
unknown
Yes. Before tool use, please register at https://webservices-lst.science.ru.nl/register.
Ucto
sentence splitting
tokenisation
text
PDF
application/pdf
text
MS-Word
application/msword
utf8
text
text/plain
ISO-8859-1
text
text/plain
ISO 8859-15
text
text/plain
project
new
input
self.linkToResource
lang
self.linkToResourceLanguage
input
untokinput_url
lang
untokinput_language
Yes. Before tool use, please register at https://webservices-lst.science.ru.nl/register.
Ucto
sentence splitting
tokenisation
utf8
text
FoLiA
text/folia+xml
project
new
input
self.linkToResource
lang
self.linkToResourceLanguage
input
foliainput_url
lang
foliainput_language