JanOdijk; 2017-10-24+13:37; clarin.eu:cr1:p_1342181139640; CLARIN Netherlands
Resource; http://hdl.handle.net/21.11114/COLL-0000-000B-C287-1
Corpus Studio Web; 2015; http://corpus-studio-web.cttnww-meertens.surf-hosted.nl/crpstudio/
Meertens Institute; http://portal.clarin.nl/node/4239; published; 2015-12-24
CLARIN-NL; CLARIN in the Netherlands; 184.021.003; NWO; http://www.clarin.nl
Jan Odijk; National Coordinator; Utrecht, the Netherlands; j.odijk@uu.nl; UiL-OTS; Utrecht University
2009; 2015; Netherlands; NL
Summary
CorpusStudio is a web application that facilitates in-depth quantitative syntactic research for linguists.
Background
CorpusStudio is a web application that facilitates in-depth quantitative syntactic research for linguists. It does so by supporting researchers in writing queries that operate on syntactically parsed text corpora in a number of major XML formats. Queries that belong together are kept in XML documents called 'Corpus Research Projects' (CRPs). Each CRP contains the queries, the order in which they are to be executed, meta-information about the queries and the project as a whole, and a specification of the input used for the project. Because a CRP records everything needed to rerun a study, it helps improve the replicability of corpus research.
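To make the idea of a single project file more concrete, the following is a minimal sketch in Python of how such a document could bundle queries, their execution order, meta-information and an input specification. The element and attribute names (`project`, `meta`, `input`, `query`, `order`) and the function `load_project` are hypothetical and chosen for illustration only; they are not the actual CRP schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical project file. None of these element or attribute names are
# taken from the real CRP format; they only mirror the four ingredients
# described above: queries, execution order, meta-information, input.
PROJECT_XML = """
<project name="example-project">
  <meta author="A. Researcher" goal="Find clauses with a subject"/>
  <input format="folia" path="corpora/example"/>
  <queries>
    <query order="2" name="narrow-to-subjects">.//node[@cat='NP-SBJ']</query>
    <query order="1" name="find-clauses">.//node[@cat='IP']</query>
  </queries>
</project>
"""


def load_project(xml_text):
    """Read the project and return its parts, with queries in execution order."""
    root = ET.fromstring(xml_text)
    # The order is stored explicitly, so the position in the file is irrelevant.
    queries = sorted(root.iter("query"), key=lambda q: int(q.get("order")))
    return {
        "name": root.get("name"),
        "meta": dict(root.find("meta").attrib),
        "input": dict(root.find("input").attrib),
        "queries": [(q.get("name"), q.text.strip()) for q in queries],
    }


if __name__ == "__main__":
    project = load_project(PROJECT_XML)
    for name, text in project["queries"]:
        print(name, text)
```

Storing the execution order inside the file, rather than relying on the order in which a user happens to run the queries, is what lets a second researcher rerun the same project on the same input.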
Access
Any CLARIN-NL user can access the CorpusStudio web application and make use of the 'standard' corpora. New users must provide a login name and password, after which they can make use of the application.
Adaptable
The CorpusStudio code is open source. Users can take the code, adapt it and use it for their own purposes. Users can also take the code from GitHub as is and set up their own server to run the application on their own text corpora. User documentation and an API are available (see below). The current version of CorpusStudio supports XML text corpora in the FoLiA and Psdx formats. Extensions to other XML formats are possible.
CrpxProcessor provides the basic functionality and is available on GitHub at https://github.com/ErwinKomen/CrpxProcessor. CrppServer takes care of /crpp and uses CrpxProcessor; it is available at https://github.com/ErwinKomen/CrppServer. CrpStudio takes care of /crpstudio and also uses CrpxProcessor; it is available at https://github.com/ErwinKomen/CrpStudio.
Main features
Keep all important aspects of a research project in one file
Define one or more search queries in a hierarchy
Uses the W3C-developed XQuery and XPath languages
Integrated CorpusStudio-specific XQuery functions
User-definable functions and variables
Create corpus result databases with user-definable features accompanying each hit
Divide the output into calculable categories
Divide the results into meta-data-dependent groups
Parallel processing yields a speed-up by a factor of 20-100 compared to the Windows version
Compatibility with the Windows programs "Cesax" and "CorpusStudio"
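Several of the features above (a query over parsed text, user-definable features accompanying each hit, categories, and metadata-dependent groups) can be illustrated with a small sketch. CorpusStudio itself uses XQuery; the example below swaps in Python's standard `xml.etree.ElementTree` module, which supports only a limited XPath subset. The toy corpus format (`sentence`, `node`, `cat`, `period`) and the function `find_hits` are assumptions made for illustration; they are neither FoLiA nor Psdx, and they are not CorpusStudio functions.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Toy parsed corpus in a made-up format (not FoLiA, not Psdx).
TOY_CORPUS = """
<corpus>
  <sentence id="s1" period="early">
    <node cat="IP">
      <node cat="NP-SBJ"><word>she</word></node>
      <node cat="VB"><word>sings</word></node>
    </node>
  </sentence>
  <sentence id="s2" period="late">
    <node cat="IP">
      <node cat="VB"><word>sings</word></node>
      <node cat="NP-SBJ"><word>the</word><word>bird</word></node>
    </node>
  </sentence>
  <sentence id="s3" period="late">
    <node cat="IP">
      <node cat="NP-SBJ"><word>he</word></node>
      <node cat="VB"><word>runs</word></node>
    </node>
  </sentence>
</corpus>
"""


def find_hits(xml_text):
    """Return one record per clause that has a subject, with features attached."""
    root = ET.fromstring(xml_text)
    hits = []
    for sentence in root.iter("sentence"):
        for clause in sentence.findall(".//node[@cat='IP']"):
            subject = clause.find("node[@cat='NP-SBJ']")
            if subject is None:
                continue  # clause without an overt subject: not a hit
            first_child = clause[0].get("cat")
            hits.append({
                "sentence": sentence.get("id"),
                # Group: taken from sentence-level metadata.
                "group": sentence.get("period"),
                # Category: computed from the position of the subject.
                "category": "subject-first" if first_child == "NP-SBJ" else "subject-later",
                # Feature: a user-defined measurement stored with the hit.
                "subject_length": len(subject.findall("word")),
            })
    return hits


if __name__ == "__main__":
    hits = find_hits(TOY_CORPUS)
    counts = Counter((hit["group"], hit["category"]) for hit in hits)
    for key, count in sorted(counts.items()):
        print(key, count)
```

Keeping the features with each hit, rather than only a total count, means the results can be regrouped or analysed statistically afterwards without rerunning the query.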
Limitations and future developments
Current limitations of the program include: working with result databases, a restricted login system, no document view, grouping restricted to system-defined groups, and no query or project wizard. Although the CLARIN-NL project ended in December 2015, every effort will be made to add a number of essential features.
annotation tool; written language tool; corpus searching; querying; Browsing and Searching; Data analysis
Linguistics; syntax; morpho-syntax; computational linguistics
no; no; Online available
https://github.com/ErwinKomen/CrpxProcessor
https://github.com/ErwinKomen/CrppServer
https://github.com/ErwinKomen/CrpStudio
graphical user interface; web application; text; Xquery query; unknown; public; 0 EUR
Erwin Komen; E.Komen@Let.ru.nl; KU Leuven
Komen, Erwin R. 2015. "Corpus Studio: the web application". User documentation. version 1.7. Meertens Instituut, Amsterdam. [user; http://hdl.handle.net/21.11114/COLL-0000-000B-C289-F; eng]
Komen, Erwin R. 2016. "An API for the CorpusStudio web application". version 1.3. Meertens Instituut, Amsterdam. [technical; http://hdl.handle.net/21.11114/COLL-0000-000B-C288-0; eng]
[in book; scientific background; yes] Komen, E. R. 2017. Beyond Counting Syntactic Hits. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 259–268. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.21. License: CC-BY 4.0
[in book; scientific background; yes] Komen, Erwin R. 2011. Coreferenced corpora for information structure research. In Outposts of Historical Corpus Linguistics: From the Helsinki Corpus to a Proliferation of Resources. (Studies in Variation, Contacts and Change in English 10) Jukka Tyrkkö, Terttu Nevalainen, Matti Rissanen & Matti Kilpiö (eds). Helsinki, Finland: Research Unit for Variation, Contacts, and Change in English.
[PhD thesis; scientific background; yes] Komen, Erwin R. 2013. Finding focus: a study of the historical development of focus in English. Utrecht: LOT.
[in proceedings; scientific background; yes] Komen, Erwin R. 2013. Corpus databases with feature pre-calculation. In Proceedings of the twelfth workshop on treebanks and linguistic theories (TLT12). Sandra Kübler, Petya Osenova & Martin Volk (eds), 85-96. Sofia, Bulgaria: The institute of information and communication technologies, Bulgarian academy of sciences.
Corpus Studio Web; CLARIN-NL; http://portal.clarin.nl/node/4239; Erwin Komen; E.Komen@Let.ru.nl; JavaScript; unknown; R; unknown