
Shared task on automatic identification of verbal multiword expressions - edition 1.1

Organized as part of the LAW-MWE-CxG 2018 workshop co-located with COLING 2018 (Santa Fe, USA), August 25-26, 2018

Last updated: May 11, 2018


The second edition of the PARSEME shared task on automatic identification of verbal multiword expressions (VMWEs) aims at identifying verbal MWEs in running texts. Verbal MWEs include, among others, idioms (to let the cat out of the bag), light verb constructions (to make a decision), verb-particle constructions (to give up), multi-verb constructions (to make do) and inherently reflexive verbs (se suicider 'to commit suicide' in French). Their identification is a well-known challenge for NLP applications due to their complex characteristics, including discontinuity, non-compositionality, heterogeneity and syntactic variability.

The shared task is highly multilingual: PARSEME members have elaborated annotation guidelines based on annotation experiments in about 20 languages from several language families. These guidelines take both universal and language-specific phenomena into account. We hope that this will boost the development of language-independent and cross-lingual VMWE identification systems.

Participation policies

Participation is open and free worldwide. We ask potential participant teams to register using the expression of interest form. Task updates and questions will be posted to our public mailing list. More details on the annotated corpora can be found on a dedicated PARSEME page. See also the annotation guidelines used in the manual annotation of the training/development and test sets, as well as the description of the evaluation measures and the evaluation script.
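The official evaluation script computes both MWE-based and token-based scores; the sketch below illustrates only the MWE-based variant, under the assumption that each sentence's annotations are represented as sets of token-index sets. The function name and data representation are illustrative, not the script's actual interface.

```python
def mwe_prf(gold, pred):
    """MWE-based precision/recall/F1: a predicted VMWE counts as correct
    only if its token set exactly matches a gold one.  `gold` and `pred`
    are sets of frozensets of token indices."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: the system finds one of two gold VMWEs exactly,
# plus one spurious prediction.
gold = {frozenset({3, 4}), frozenset({7, 9, 10})}
pred = {frozenset({3, 4}), frozenset({6, 7})}
p, r, f = mwe_prf(gold, pred)  # 0.5, 0.5, 0.5
```

Note that under this exact-match criterion a prediction overlapping a gold VMWE (such as {6, 7} above) earns no credit; the token-based measures reward such partial matches.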

It should be noted that a large international community has gathered (via the PARSEME network) around the effort of putting forward universal guidelines and performing corpus annotations. Our policy was to allow the national teams that provided annotated corpora to also submit VMWE identification systems for the shared task. While this policy is non-standard and introduces a bias into system evaluation, we follow it for several reasons:

Submission of results and system description paper

Shared task participants are invited to submit two kinds of input to the SHARED TASK TRACK of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG 2018):

  1. System results (by May 8, extended from May 4) obtained on the blind data (released on April 30). The results for all languages should be submitted in a single .zip archive containing one folder per language, named according to the ISO 639-1 code (e.g. FR/ for French, SL/ for Slovene, etc.). Each output file must be named test.system.cupt and conform to the cupt format. The format of each file should be checked before submission with the validation script as follows:
    • ./ --input test.system.cupt
    If one system participates in both the open and the closed track, two independent submissions are required. The number of submissions per system is limited to 2 per track, i.e. a team can have at most 4 submissions (with at most one result per language in each submission). Please use an anonymous nickname for your system, i.e. one that reveals neither the authors nor their affiliation.
  2. A system description paper (by May 25). These papers must follow the LAW-MWE-CxG workshop submission instructions and will go through double-blind peer reviewing by other participants and selected LAW-MWE-CxG 2018 Program Committee members. Their acceptance depends on the quality of the paper rather than on the results obtained in the shared task. Authors of the accepted papers will present their work as posters/demos in a dedicated session of the workshop. The submission of a system description paper is not mandatory.
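The expected archive layout described in point 1 (one ISO 639-1 folder per language, each holding a single test.system.cupt file) can be sketched as follows; the helper name and file paths are hypothetical, chosen only for illustration.

```python
import os
import zipfile

def build_submission(results, archive="submission.zip"):
    """Pack system outputs into the expected submission layout:
    one folder per language code, each containing test.system.cupt.
    `results` maps a language code (e.g. "FR") to the path of the
    system's output file for that language."""
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for lang, path in sorted(results.items()):
            zf.write(path, arcname=f"{lang}/test.system.cupt")

# Toy usage: write two dummy output files, then pack them.
os.makedirs("out", exist_ok=True)
for name in ("fr.cupt", "sl.cupt"):
    with open(os.path.join("out", name), "w") as f:
        f.write("# toy system output\n")
build_submission({"FR": "out/fr.cupt", "SL": "out/sl.cupt"})
```

The `arcname` argument is what guarantees the in-archive folder structure regardless of where the output files live on disk.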

The submission of system results and of a system description paper should be made via the dedicated START space:

Provided data

The shared task covers 20 languages: Arabic (AR), Bulgarian (BG), German (DE), Greek (EL), English (EN), Spanish (ES), Basque (EU), Farsi (FA), French (FR), Hebrew (HE), Hindi (HI), Croatian (HR), Hungarian (HU), Italian (IT), Lithuanian (LT), Polish (PL), Brazilian Portuguese (PT), Romanian (RO), Slovenian (SL), Turkish (TR).

For each language, we provide corpora (in the .cupt format) in which VMWEs are annotated according to universal guidelines:

For most languages, morphosyntactic data (parts of speech, lemmas, morphological features and/or syntactic dependencies) are also provided. Depending on the language, this information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe).
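The cupt format extends CoNLL-U with an eleventh PARSEME:MWE column, so the morphosyntactic data above and the VMWE annotations travel in the same file. The minimal parser below is a sketch of that layout (the function name and the example token line are illustrative; "1:LVC.full" marks the first token of VMWE number 1, a continuation token carries just "1", and "*" marks a token outside any VMWE):

```python
def parse_cupt_token(line):
    """Split one token line of a .cupt file: the ten CoNLL-U columns
    (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC)
    plus the PARSEME:MWE column holding the VMWE annotation."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 11:
        raise ValueError(f"expected 11 tab-separated columns, got {len(cols)}")
    return {"id": cols[0], "form": cols[1], "lemma": cols[2],
            "upos": cols[3], "mwe": cols[10]}

# Illustrative token line: the verb of a light-verb construction.
tok = parse_cupt_token(
    "2\tmade\tmake\tVERB\tVBD\t_\t0\troot\t_\t_\t1:LVC.full")
```

Lines starting with "#" (sentence-level comments) and blank sentence separators would need to be skipped before applying such a per-token parser.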

Our public GitLab repository contains:

The table below summarizes the sizes of the training/development corpora per language:

The training, development and test data are available in our public GitLab repository. Follow the Repository link to access folders for individual languages.

All VMWE annotations (except Arabic) are available under Creative Commons licenses (see files for details).

The Arabic corpus does not have an open license. Participants are required to sign an agreement and obtain the corpus through the LDC. Given that this is a late addition, Arabic will be considered optional this year. This means that we will publish generic and per-category rankings for teams that address Arabic, but it will not be included in the macro-average rankings across languages.

A note for shared task participants: We cannot ensure that the test data of the current edition of the shared task do not overlap with the data published in edition 1.0. Therefore, we kindly ask participants not to use the .parsemetsv files from edition 1.0 for any language during the training or testing phase.


Tracks

System results can be submitted in two tracks:

  • Closed track: systems using only the provided training and development data.
  • Open track: systems also using external resources (e.g. lexicons, raw corpora, pre-trained models or tools).

Teams submitting systems in the open track will be requested to describe and provide references to all resources used at submission time. Teams are encouraged to favor freely available resources for better reproducibility of their results.

Important dates

All deadlines are at 23:59 UTC-12 (anywhere in the world).

Organizing team

Silvio Ricardo Cordeiro, Carlos Ramisch, Agata Savary, Veronika Vincze


For any inquiries regarding the shared task, please send an email to