
Shared task on automatic identification of verbal multiword expressions - edition 1.1

Organized as part of the LAW-MWE-CxG 2018 workshop co-located with COLING 2018 (Santa Fe, USA), August 25-26, 2018

Last updated: August 16, 2018

Description

The second edition of the PARSEME shared task on automatic identification of verbal multiword expressions (VMWEs) aims at identifying verbal MWEs in running texts. Verbal MWEs include, among others, idioms (to let the cat out of the bag), light verb constructions (to make a decision), verb-particle constructions (to give up), multi-verb constructions (to make do) and inherently reflexive verbs (se suicider 'to suicide' in French). Their identification is a well-known challenge for NLP applications, due to their complex characteristics including discontinuity, non-compositionality, heterogeneity and syntactic variability.

The shared task is highly multilingual: PARSEME members have developed annotation guidelines based on annotation experiments in about 20 languages from several language families. These guidelines take both universal and language-specific phenomena into account. We hope that this will boost the development of language-independent and cross-lingual VMWE identification systems.

Participation policies

The evaluation phase of the shared task is now over, but you can find useful information about the shared task on this page.

Participation is open and free worldwide. We ask potential participant teams to register using the expression of interest form. Task updates and questions will be posted to our public mailing list. More details on the annotated corpora can be found on a dedicated PARSEME page. See also the annotation guidelines used in the manual annotation of the training/development and test sets, as well as the description of the evaluation measures and the evaluation script.
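The linked page defines the official evaluation measures in detail. As a rough illustration only (this is not the official evaluation script, and the spans below are invented), MWE-based precision, recall and F1 compare the set of predicted VMWEs against the gold annotations, each VMWE being identified by its set of token positions:

```python
def prf(gold, pred):
    """MWE-based precision/recall/F1: each VMWE is a frozenset of token indices."""
    tp = len(gold & pred)  # exact-match true positives
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical example: the system finds one of two gold VMWEs exactly,
# and truncates the other (so it counts as a miss under exact matching).
gold = {frozenset({3, 4}), frozenset({7, 8, 9})}
pred = {frozenset({3, 4}), frozenset({7, 8})}
p, r, f = prf(gold, pred)
```

The official script additionally reports token-based scores, which give partial credit to such near-misses.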

It should be noted that a large international community has gathered (via the PARSEME network) around the effort of putting forward universal guidelines and performing corpus annotations. Our policy was to allow the national teams which provided annotated corpora to also submit VMWE identification systems to the shared task. While this policy is non-standard and introduces a bias into system evaluation, we follow it nonetheless.

Submission of results and system description paper

Shared task participants are invited to make two kinds of submissions to the SHARED TASK TRACK of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG 2018):

  1. System results (by May 8, extended from the initial May 4 deadline) obtained on the blind data (released on 30 April). The results for all languages should be submitted in a single .zip archive containing one folder per language, named according to the ISO 639-1 code (e.g. FR/ for French, SL/ for Slovene, etc.). Each output file must be named test.system.cupt and conform to the cupt format. Each file should be checked before submission with the validation script as follows:
    • ./validate_cupt.py --input test.system.cupt
    If a system participates in both the open and the closed track, two independent submissions are required. The number of submissions per system is limited to 2 per track, i.e. a team can have at most 4 submissions (with at most one result per language in each submission). Please use an anonymous nickname for your system, i.e. one that reveals neither the authors nor their affiliation.
  2. A system description paper (by May 25). These papers must follow the LAW-MWE-CxG workshop submission instructions and will go through double-blind peer reviewing by other participants and selected LAW-MWE-CxG 2018 Program Committee members. Their acceptance depends on the quality of the paper rather than on the results obtained in the shared task. Authors of the accepted papers will present their work as posters/demos in a dedicated session of the workshop. The submission of a system description paper is not mandatory.
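The archive layout described in item 1 can be produced automatically. The sketch below assembles a submission zip, assuming (hypothetically) that per-language outputs already sit under a local predictions/ directory:

```python
import zipfile
from pathlib import Path

def package_submission(pred_dir, languages, out_zip="submission.zip"):
    """Pack one folder per language (uppercase ISO 639-1 code), each holding
    a file named exactly test.system.cupt, into a single .zip archive."""
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for lang in languages:
            src = Path(pred_dir) / lang / "test.system.cupt"
            zf.write(src, arcname=f"{lang}/test.system.cupt")

# Hypothetical usage, for a system addressing French and Slovene:
# package_submission("predictions", ["FR", "SL"])
```

Each language file should still be passed through validate_cupt.py before packaging.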

The submission of system results and of a system description paper should be made via the dedicated START space:

https://www.softconf.com/coling2018/ws-LAW-MWE-CxG-2018

Provided data

The PARSEME corpus edition 1.1 (used in this shared task) is available via the CLARIN/LINDAT infrastructure.

The shared task covers 20 languages: Arabic (AR), Bulgarian (BG), German (DE), Greek (EL), English (EN), Spanish (ES), Basque (EU), Farsi (FA), French (FR), Hindi (HI), Hebrew (HE), Croatian (HR), Hungarian (HU), Italian (IT), Lithuanian (LT), Polish (PL), Brazilian Portuguese (PT), Romanian (RO), Slovenian (SL), Turkish (TR).

For each language, we provide corpora (in the .cupt format) in which VMWEs are annotated according to the universal PARSEME guidelines.

For most languages, morphosyntactic data (parts of speech, lemmas, morphological features and/or syntactic dependencies) are also provided. Depending on the language, the information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe).
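The .cupt format extends the 10-column CoNLL-U format with an eleventh PARSEME:MWE column, holding codes such as 1:LVC.full (first token of VMWE number 1, with its category) or a bare 1 (continuation token), with * meaning no annotation. A minimal reading sketch (the example sentence is invented) could look like this:

```python
from collections import defaultdict

def read_vmwes(lines):
    """Collect the VMWEs of one .cupt sentence from its last column."""
    vmwes = defaultdict(lambda: {"cat": None, "tokens": []})
    for line in lines:
        if not line or line.startswith("#"):  # skip metadata/comment lines
            continue
        cols = line.split("\t")
        tok_id, mwe = cols[0], cols[-1]
        if mwe in ("*", "_"):  # no annotation / underspecified
            continue
        for code in mwe.split(";"):  # a token may belong to several VMWEs
            idx, _, cat = code.partition(":")
            if cat:  # category appears on the VMWE's first token only
                vmwes[idx]["cat"] = cat
            vmwes[idx]["tokens"].append(tok_id)
    return dict(vmwes)

# Hypothetical light-verb construction "make (a) decision", discontinuous:
sent = [
    "1\tmake\tmake\tVERB\t_\t_\t0\troot\t_\t_\t1:LVC.full",
    "2\ta\ta\tDET\t_\t_\t3\tdet\t_\t_\t*",
    "3\tdecision\tdecision\tNOUN\t_\t_\t1\tobj\t_\t_\t1",
]
```

This mirrors how the validation and evaluation scripts interpret the column, but it is only a sketch of the idea, not a replacement for them.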

Our public GitLab repository contains the training, development and test corpora, together with related tools such as the validation and evaluation scripts.

The table below summarizes the sizes of the training/development/test corpora per language:

Columns: Lang-split | Sentences | Tokens | Avg. length | VMWE | VID | IRV | LVC.full | LVC.cause | VPC.full | VPC.semi | IAV | MVC | LS.ICV
AR-train 237023103097.43219127217940095700330
AR-dev 3871625241.95001704190640000
AR-test 3801796247.25003104100590000
AR-Total313726524484.542191320171769108033
BG-train 1781339917322.45364100527291421135007400
BG-dev 19544202021.56701732402143500800
BG-test 18323922021.4670822542745200800
BG-Total2159948041322.26704126032231909222009000
DE-train 673413058819.32820977220218281264113000
DE-dev 11842214618.75031814834222117000
DE-test 107820559195001834042221023000
DE-Total899617329319.238231341308294321695153000
EL-train 442712245827.61404395093844190080
EL-dev 25626643125.95008103763480010
EL-test 12613587328.4501169030811110020
EL-Total825022476227.2240564501622893800110
EN-train 34715320115.3331600787151191600
EN-test 39657100217.950179016636146264440
EN-Total743612420316.7832139024443297456040
ES-train 27719652134.8173916747922336003604740
ES-dev 6982622037.550065114841700871330
ES-test 20465962329.150095121852810641060
ES-Total551518236433273932771439281105117130
EU-train 825411716514.128235970207415200000
EU-dev 15002160414.450010403821400000
EU-test 14041903813.55007304101700000
EU-Total1115815780714.138237740286618300000
FA-train 27844515316.224511712433000000
FA-dev 474892318.850100501000000
FA-test 359749220.850100501000000
FA-Total3617615681734531713435000000
FR-train 1722543238925.1455017461247147068000190
FR-dev 22365625425.16292071542521500010
FR-test 16063948924.54982121081601400040
FR-Total2106752813225567721651509188297000240
HE-train 1210623747219.612365190545113590000
HE-dev 33856584319.4501258014861340000
HE-test 32096569820.4502182021149600000
HE-Total1870036901319.7223995909042231530000
HI-train 8561785020.8534230321140001760
HI-test 8281758021.2500380320120001300
HI-Total168435430211034610641260003060
HR-train 22955348623.31450113468303450052100
HR-dev 8341962123.550034139143261015700
HR-test 7081642923.250133118131310018800
HR-Total38378953623.324511807255771021086600
HU-train 480312001324.962058408923634131735000
HU-dev 6011556425.87791008510539135000
HU-test 7552075927.47761001662848686000
HU-Total615915633625.37760104011434015156956000
IT-train 1355536088326.6325410989425441476604142320
IT-dev 9173261335.5500197106100191724469
IT-test 12563729329.650320196104252304158
IT-Total1572843078927.342571496114474819110624993437
LT-train 48959011018.431210601951100000
LT-test 62091184021950020202841400000
LT-Total1110420851218.781230804792500000
PL-train 1305822046516.84122373178515311800025300
PL-dev 17632603014.75155724515333002700
PL-test 13002782321.45157324914915002900
PL-Total16121274318175152503227918332280030900
PT-train 2201750677323443088268927758400000
PT-dev 3117685812255313083337300000
PT-test 27706264822.655311891337700000
PT-Total2790463800222.85536113086334499400000
RO-train 4270478196818.347131269304825014600000
RO-dev 706511865816.7589169373291800000
RO-test 693411499716.5589173363341900000
RO-Total56703101562317.958911611378431318300000
SL-train 95672018532123785001162176400050000
SL-dev 19503814619.550012122430120011300
SL-test 19944052320.350010624535130010100
SL-Total1351128052220.733787271631241650071400
TR-train 16715334880206125317202952000010
TR-dev 13202719620.65102850225000000
TR-test 5771438824.95062330272000010
TR-Total1861237646420.27141369003449000020
Total 280838607233121.6793261875716198281902285852711563049112737

The training, development and test data are available in our public GitLab repository. Follow the Repository link to access folders for individual languages.

All VMWE annotations (except Arabic) are available under Creative Commons licenses (see README.md files for details).

The Arabic corpus does not have an open license. Participants are required to fill in an agreement and obtain the corpus through the LDC. Since it is a late addition, Arabic is considered optional this year: we will publish generic and per-category rankings for teams who address Arabic, but it will not be included in the macro-average rankings across languages.

A note for shared task participants: We cannot ensure that the test data of the current edition of the shared task do not overlap with the data published in edition 1.0. Therefore, we kindly ask participants not to use the .parsemetsv files from edition 1.0 for any language during the training or testing phase.

Tracks

System results can be submitted in two tracks: in the closed track, systems may use only the provided training and development data; in the open track, they may additionally use external resources (e.g. lexicons, raw corpora, word embeddings or treebanks).

Teams submitting systems in the open track will be requested to describe and provide references to all resources used at submission time. Teams are encouraged to favor freely available resources for better reproducibility of their results.

Important dates

All deadlines are at 23:59 UTC-12 (anywhere in the world).

Organizing team

Silvio Ricardo Cordeiro, Carlos Ramisch, Agata Savary, Veronika Vincze

Contact

For any inquiries regarding the shared task please send an email to parseme-st-core@nlp.ipipan.waw.pl