Shared task on automatic identification of verbal multiword expressions - edition 1.1

Organized as part of the LAW-MWE-CxG 2018 workshop co-located with COLING 2018 (Santa Fe, USA), August 25-26, 2018

Last updated: May 11, 2018

Description

The second edition of the PARSEME shared task on automatic identification of verbal multiword expressions (VMWEs) aims at identifying verbal MWEs in running texts. Verbal MWEs include, among others, idioms (to let the cat out of the bag), light verb constructions (to make a decision), verb-particle constructions (to give up), multi-verb constructions (to make do) and inherently reflexive verbs (se suicider 'to commit suicide' in French). Their identification is a well-known challenge for NLP applications, due to their complex characteristics, including discontinuity, non-compositionality, heterogeneity and syntactic variability.

The shared task is highly multilingual: PARSEME members have elaborated annotation guidelines based on annotation experiments in about 20 languages from several language families. These guidelines take both universal and language-specific phenomena into account. We hope that this will boost the development of language-independent and cross-lingual VMWE identification systems.

Participation policies

Participation is open and free worldwide. We ask potential participant teams to register using the expression of interest form. Task updates and questions will be posted to our public mailing list. More details on the annotated corpora can be found on a dedicated PARSEME page. See also the annotation guidelines used in the manual annotation of the training/development and test sets, as well as the description of the evaluation measures and the evaluation script.
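As a rough illustration of how such measures work, here is a minimal sketch of token-based precision, recall and F1 over sets of (sentence, token) pairs belonging to VMWEs. This is a simplified stand-in, not the official evaluation script, which is documented on the pages linked above and also reports per-expression (MWE-based) scores:

```python
# Simplified token-based scoring: an illustration only, NOT the official
# PARSEME evaluation script (which also computes per-expression scores).
def prf(gold, pred):
    """Precision, recall and F1 over sets of (sentence_id, token_id) pairs."""
    tp = len(gold & pred)                      # correctly identified tokens
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical gold and predicted VMWE tokens in two sentences:
gold = {(1, 2), (1, 4), (2, 7)}
pred = {(1, 2), (1, 4), (2, 8)}
p, r, f = prf(gold, pred)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.67 0.67 0.67
```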

A large international community has gathered (via the PARSEME network) around the effort of putting forward universal guidelines and performing corpus annotations. Our policy is to allow the national teams that provided annotated corpora to also submit VMWE identification systems to the shared task. While this policy is non-standard and introduces a bias into system evaluation, we follow it for several reasons.

Submission of results and system description paper

Shared task participants are invited to submit input of two kinds to the SHARED TASK TRACK of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG 2018):

  1. System results (by May 8, extended from May 4) obtained on the blind data (released on April 30). The results for all languages should be submitted in a single .zip archive containing one folder per language, named according to the ISO 639-1 code (e.g. FR/ for French, SL/ for Slovenian, etc.). Each output file must be named test.system.cupt and conform to the cupt format. The format of each file should be checked before submission with the validation script, as follows:
    • ./validate_cupt.py --input test.system.cupt
    If one system participates in both the open and the closed track, two independent submissions are required. The number of submissions per system is limited to 2 per track, i.e. a team can have at most 4 submissions (with at most one result per language in each submission). Please use an anonymous nickname for your system, i.e. one that reveals neither the authors nor their affiliation.
  2. A system description paper (by May 25). These papers must follow the LAW-MWE-CxG workshop submission instructions and will go through double-blind peer reviewing by other participants and selected LAW-MWE-CxG 2018 Program Committee members. Their acceptance depends on the quality of the paper rather than on the results obtained in the shared task. Authors of the accepted papers will present their work as posters/demos in a dedicated session of the workshop. The submission of a system description paper is not mandatory.
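The archive layout described in item 1 can be sketched as follows. The folder names, output paths and the zipfile-based packaging are illustrative assumptions, not part of the official instructions:

```python
# Illustrative packaging of system results: one uppercase ISO 639-1 folder
# per language, each holding exactly one file named test.system.cupt.
# This sketches the required layout only; it is not an official tool.
import os, tempfile, zipfile

def package_results(result_dir, archive_path):
    """Zip language folders, refusing entries that break the naming scheme."""
    with zipfile.ZipFile(archive_path, "w") as zf:
        for lang in sorted(os.listdir(result_dir)):
            if not (len(lang) == 2 and lang.isupper()):
                raise ValueError(f"not a language-code folder: {lang}")
            path = os.path.join(result_dir, lang, "test.system.cupt")
            if not os.path.isfile(path):
                raise ValueError(f"missing test.system.cupt under {lang}/")
            zf.write(path, arcname=f"{lang}/test.system.cupt")

# Usage with two hypothetical languages:
results = tempfile.mkdtemp()
for lang in ("FR", "SL"):
    os.makedirs(os.path.join(results, lang))
    open(os.path.join(results, lang, "test.system.cupt"), "w").close()
archive = os.path.join(tempfile.mkdtemp(), "system.zip")
package_results(results, archive)
print(zipfile.ZipFile(archive).namelist())  # ['FR/test.system.cupt', 'SL/test.system.cupt']
```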

The submission of system results and of a system description paper should be made via the dedicated START space:

https://www.softconf.com/coling2018/ws-LAW-MWE-CxG-2018

Provided data

The shared task covers 20 languages: Arabic (AR), Bulgarian (BG), German (DE), Greek (EL), English (EN), Spanish (ES), Basque (EU), Farsi (FA), French (FR), Hebrew (HE), Hindi (HI), Croatian (HR), Hungarian (HU), Italian (IT), Lithuanian (LT), Polish (PL), Brazilian Portuguese (PT), Romanian (RO), Slovenian (SL), Turkish (TR).

For each language, we provide corpora (in the .cupt format) in which VMWEs are annotated according to the universal guidelines.
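For illustration, a minimal reader for the VMWE annotations of a .cupt sentence might look like this. The sentence below is a toy example; in the cupt format, a PARSEME:MWE column follows the ten CoNLL-U columns, with `*` for unannotated tokens, `1:LVC.full` opening expression 1 with its category, and a bare `1` continuing it:

```python
# Minimal sketch of reading VMWE annotations from .cupt token lines
# (assumes the documented layout: CoNLL-U columns plus a final
# PARSEME:MWE column; "*" means no annotation, "1:LVC.full" opens
# expression 1, a bare "1" continues it).
def vmwes_in_sentence(cupt_lines):
    """Group token forms by VMWE id; return {id: (category, [forms])}."""
    mwes = {}
    for line in cupt_lines:
        if not line or line.startswith("#"):
            continue                      # skip comments and blank lines
        cols = line.split("\t")
        form, annot = cols[1], cols[-1]
        if annot in ("*", "_"):
            continue                      # token belongs to no VMWE
        for part in annot.split(";"):     # a token may belong to several VMWEs
            if ":" in part:               # opens a new expression, e.g. "1:VID"
                mwe_id, cat = part.split(":")
                mwes[int(mwe_id)] = (cat, [form])
            else:                         # continues an expression, e.g. "1"
                mwes[int(part)][1].append(form)
    return mwes

sentence = [
    "1\tHe\the\tPRON\t_\t_\t2\tnsubj\t_\t_\t*",
    "2\tmade\tmake\tVERB\t_\t_\t0\troot\t_\t_\t1:LVC.full",
    "3\ta\ta\tDET\t_\t_\t4\tdet\t_\t_\t*",
    "4\tdecision\tdecision\tNOUN\t_\t_\t2\tobj\t_\t_\t1",
]
print(vmwes_in_sentence(sentence))  # {1: ('LVC.full', ['made', 'decision'])}
```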

For most languages, morphosyntactic data (parts of speech, lemmas, morphological features and/or syntactic dependencies) are also provided. Depending on the language, this information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe).

The corpora, together with the validation and evaluation scripts, are available in our public GitLab repository.

The table below summarizes the sizes of the training/development corpora per language:

[Table: sizes of the training/development corpora per language. Columns: Language, Sentences, Tokens, total VMWEs, and counts per VMWE category (VID, IRV, LVC.full, LVC.cause, VPC.full, VPC.semi, IAV, MVC, LS.ICV). Rows: the dev and/or train split of each of the 20 languages, followed by overall totals.]

The training, development and test data are available in our public GitLab repository. Follow the Repository link to access folders for individual languages.

All VMWE annotations (except Arabic) are available under Creative Commons licenses (see README.md files for details).

The Arabic corpus does not have an open license. Participants are required to fill in an agreement and obtain the corpus through the LDC. Given that this is a late addition, Arabic will be considered optional this year. This means that we will publish generic and per-category rankings for teams who address Arabic, but it will not be included in the macro-average rankings across languages.

A note for shared task participants: We cannot ensure that the test data of the current edition of the shared task do not overlap with the data published in edition 1.0. Therefore, we kindly ask participants not to use the .parsemetsv files from edition 1.0 for any language during the training or testing phase.

Tracks

System results can be submitted in two tracks:

  • closed track: systems using only the provided training and development data to learn VMWE identification models;
  • open track: systems using additional resources (lexicons, raw corpora, word embeddings, etc.) beyond the provided data.

Teams submitting systems in the open track will be requested to describe and provide references to all resources used at submission time. Teams are encouraged to favor freely available resources for better reproducibility of their results.

Important dates

  • April 30, 2018: release of the blind test data
  • May 8, 2018: submission of system results (extended from May 4)
  • May 25, 2018: submission of system description papers

All deadlines are at 23:59 UTC-12 (anywhere in the world).

Organizing team

Silvio Ricardo Cordeiro, Carlos Ramisch, Agata Savary, Veronika Vincze

Contact

For any inquiries regarding the shared task please send an email to parseme-st-core@nlp.ipipan.waw.pl