
Shared task on automatic identification of verbal multiword expressions

Organized as part of the MWE 2017 workshop co-located with EACL 2017 (Valencia, Spain), April 4, 2017

Last updated: February 1, 2017


The PARSEME shared task on automatic identification of verbal multiword expressions (VMWEs) aims at identifying verbal MWEs in running text. Verbal MWEs include idioms (let the cat out of the bag), light verb constructions (make a decision), verb-particle constructions (give up), and inherently reflexive verbs (se suicider 'to commit suicide' in French). Their identification is a well-known challenge for NLP applications, due to their complex characteristics: discontinuity, non-compositionality, heterogeneity and syntactic variability.

The shared task is highly multilingual: we cover 18 languages from several language families. PARSEME members have elaborated annotation guidelines based on annotation experiments in these languages. They take both universal and language-specific phenomena into account. We hope that this will boost the development of language-independent and cross-lingual VMWE identification systems.


The evaluation phase of the shared task is now over, but you can find useful information about the shared task on this page.

Participation was open and free worldwide.

Task updates and questions can still be posted to our public mailing list.

For more details on the annotation of the corpora, visit the dedicated PARSEME page and check the annotation guidelines used in the manual annotation of the training and test sets.

A large international community has gathered (via the PARSEME network) around the effort of putting forward universal guidelines and performing corpus annotations. Our policy was to allow the national teams which provided annotated corpora to also submit VMWE identification systems to the shared task. While this policy is non-standard and introduces a bias into system evaluation, we follow it for several reasons:

Provided data

The shared task covers 18 languages: Bulgarian (BG), Czech (CS), German (DE), Greek (EL), Spanish (ES), Farsi (FA), French (FR), Hebrew (HE), Hungarian (HU), Italian (IT), Lithuanian (LT), Maltese (MT), Polish (PL), Brazilian Portuguese (PT), Romanian (RO), Slovene (SL), Swedish (SV) and Turkish (TR). For all these languages, we provided two corpora to the participants:

The corpora are provided in the parsemetsv format, inspired by the CoNLL-U format.

For most languages (all except BG, HE and LT), paired files in the CoNLL-U format - not necessarily using UD tagsets - containing parts of speech, lemmas, morphological features and/or syntactic dependencies are also provided. Depending on the language, the information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe).
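To make the parsemetsv layout concrete, here is a minimal reading sketch. It assumes the 4-column layout used in this shared task edition (token rank, surface form, a no-space flag, and an MWE column in which "1:LVC" opens MWE number 1 with category LVC, a bare "1" marks a continuation token, and "_" means no MWE); consult the format documentation in the repository for the authoritative definition.

```python
def parse_parsemetsv(lines):
    """Extract VMWEs from one sentence in (assumed) parsemetsv layout.

    Returns {mwe_id: (category, [token ranks])}.
    """
    mwes = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        rank, form, nsp, mwe = line.rstrip("\n").split("\t")
        if mwe == "_":
            continue
        for part in mwe.split(";"):
            if ":" in part:                     # first token: MWE id and category
                mwe_id, categ = part.split(":")
                mwes[int(mwe_id)] = (categ, [int(rank)])
            else:                               # continuation token: bare MWE id
                mwes[int(part)][1].append(int(rank))
    return mwes

sentence = [
    "1\tHe\tnsp\t_",
    "2\tmade\tnsp\t1:LVC",
    "3\ta\tnsp\t_",
    "4\tdecision\tnsp\t1",
]
print(parse_parsemetsv(sentence))  # {1: ('LVC', [2, 4])}
```

The discontinuity of the LVC ("made ... decision") is captured simply as a non-contiguous list of token ranks.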

The table below summarizes the sizes of the training corpora per language:


The table below summarizes the sizes of the test corpora per language:


The training and test data are available in our public GitLab repository. Follow the Repository link to access the folders for individual languages. You can also download an archive containing all the data directly via this shortcut link. The test data are provided in the following files:

Both parsemetsv and conllu files could be used in the closed track.

All VMWE annotations are available under Creative Commons licenses (see files for details).

A note for shared task participants: Small-size trial data were previously released via the same repository for most languages. We could not fully ensure that no part of these data was included in the test data released on 20 January. Therefore, we asked participants not to use the trial.parsemetsv files for any language (except ES and SV) when training the final versions of their systems.


System results could be submitted in two tracks:

Teams submitting systems in the open track were requested to describe and provide references to all resources used at submission time. Teams were encouraged to favor freely available resources for better reproducibility of their results.

Evaluation metrics

Participants were to provide the output produced by their systems on the test corpus. This output was compared with the gold standard (ground truth). Evaluation metrics are precision, recall and F1, both strict (per VMWE) and fuzzy (per token, i.e. taking partial matches into account). The evaluation script is available in our public data repository. It can be used as follows:

The token-based F1 takes into account the fact that:

Therefore, we measure the best F1 score over all possible matchings between the sets of MWE token ranks in the gold and system sentences. We do this by considering all possible ways of pairing the MWEs in the two sets.

VMWE categories (e.g., LVC, ID, IReflV, VPC) are ignored by the evaluation metrics. Categories are only provided in the training data to guide system design. Systems focusing on selected VMWE categories only were also encouraged to participate - see the FAQ.

Tokenization issues

Tokenization is closely related to MWE identification, and it has been shown that performing both tasks jointly may enhance the quality of their results.

Note, however, that the data provided by us consist of pre-tokenized sentences, which implies that we do not expect typical systems to perform tokenization prior to VMWE identification, and that we do not allow the tokenization to be modified with respect to the ground truth. This is necessary since the evaluation measures are token-based. This approach may disadvantage systems which expect untokenized raw text as input and apply their own tokenization methods, whether jointly with VMWE identification or not.

We are aware of this bias, but we nevertheless encouraged such systems to participate in the shared task. We believe that re-tokenization methods can be defined to adapt a system's output to the tokenization imposed by us.
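As a hypothetical illustration of such a re-tokenization method (not one prescribed by the shared task), predictions made over a system's own tokenization can be projected onto the gold tokenization via character offsets in the raw text: a gold token inherits an MWE tag if it overlaps a system token carrying that tag.

```python
def project_tags(raw, sys_tokens, sys_tags, gold_tokens):
    """Project per-token MWE tags from a system's own tokenization onto
    the gold tokenization, via character offsets in the raw text.
    Illustrative sketch; a real aligner must also handle normalization."""
    def spans(tokens):
        out, pos = [], 0
        for t in tokens:
            start = raw.index(t, pos)   # locate each token left to right
            out.append((start, start + len(t)))
            pos = start + len(t)
        return out

    sys_spans = spans(sys_tokens)
    projected = []
    for gs, ge in spans(gold_tokens):
        tag = "_"
        for (ss, se), t in zip(sys_spans, sys_tags):
            if t != "_" and ss < ge and gs < se:   # character overlap
                tag = t
                break
        projected.append(tag)
    return projected

raw = "It's given up"
gold_toks = ["It", "'s", "given", "up"]     # gold (pre-tokenized) sentence
sys_toks = ["It's", "given", "up"]          # system's own tokenization
sys_tags = ["_", "1:VPC", "1"]              # system's per-token MWE tags
print(project_tags(raw, sys_toks, sys_tags, gold_toks))
# ['_', '_', '1:VPC', '1']
```

Here the system's coarser tokenization of "It's" is split in the gold data, yet the verb-particle annotation survives the projection intact.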

Publication and workshop

Shared task participants are invited to submit contributions of two kinds to the SHARED TASK TRACK of the EACL 2017 workshop on Multiword Expressions (MWE 2017) via the dedicated START space:


NEW! The results page now contains system evaluation results per language (disregarding VMWE categories).

Important dates

Organizing team

Marie Candito, Fabienne Cap, Silvio Cordeiro, Antoine Doucet, Voula Giouli, Behrang QasemiZadeh, Carlos Ramisch, Federico Sangati, Agata Savary, Ivelina Stoyanova, Veronika Vincze

Frequently asked questions

  1. My system can identify only one category of VMWEs (e.g. verb particle constructions). Can I still participate in the shared task?
    Organizing different tracks for different VMWE categories would be too complex. Therefore, we plan to publish the systems' results globally, i.e. without distinguishing particular VMWE categories. If a system can only recognize one category, its results in this global picture will probably not be very high. Despite that, we do encourage such systems to participate, since we are more interested in cross-language discussion than in a real competition. Our evaluation script (with an appropriate choice of parameters) does allow restricting the evaluation to a particular VMWE category. It is thus possible for a system's authors to perform such an evaluation on their own and describe the results in a system description paper submitted to the MWE 2017 workshop.
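Such a per-category evaluation amounts to filtering both gold and predicted annotations down to one category before scoring. A minimal sketch of that filtering step, assuming VMWEs are held in a mapping from MWE id to (category, token ranks):

```python
def restrict_to_category(mwes, category):
    """Keep only the VMWEs of one category before scoring; only the
    surviving entries are then passed to the evaluation metrics."""
    return {i: (c, toks) for i, (c, toks) in mwes.items() if c == category}

annots = {1: ("VPC", [3, 5]), 2: ("LVC", [8, 10]), 3: ("VPC", [12, 13])}
print(restrict_to_category(annots, "VPC"))
# {1: ('VPC', [3, 5]), 3: ('VPC', [12, 13])}
```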