
Shared task on automatic identification of verbal multiword expressions

Organized as part of the MWE 2017 workshop co-located with EACL 2017 (Valencia, Spain), April 4, 2017

Last updated: February 1, 2017

Description

The PARSEME shared task on automatic identification of verbal multiword expressions (VMWEs) aims at identifying VMWEs in running text. Verbal MWEs include idioms (let the cat out of the bag), light verb constructions (make a decision), verb-particle constructions (give up), and inherently reflexive verbs (se suicider 'to commit suicide' in French). Their identification is a well-known challenge for NLP applications due to their complex characteristics: discontinuity, non-compositionality, heterogeneity and syntactic variability.

The shared task is highly multilingual: we cover 18 languages from several language families. PARSEME members have developed annotation guidelines based on annotation experiments in these languages, taking both universal and language-specific phenomena into account. We hope that this will boost the development of language-independent and cross-lingual VMWE identification systems.

Participation

The evaluation phase of the shared task is now over, but you can find useful information about the shared task on this page.

Participation was open and free worldwide.

Task updates and questions can still be posted to our public mailing list.

For more details on the annotation of the corpora, visit the dedicated PARSEME page and check the annotation guidelines used in the manual annotation of the training and test sets.

Note that a large international community has gathered (via the PARSEME network) around the effort of putting forward universal guidelines and performing corpus annotations. Our policy was to allow the national teams that provided annotated corpora to also submit VMWE identification systems for the shared task. While this policy is non-standard and introduces a bias into system evaluation, we followed it for several reasons:

Provided data

NEW! The shared task corpus (version 1.0) has been published at the LINDAT/CLARIN infrastructure.

The shared task covers 18 languages: Bulgarian (BG), Czech (CS), German (DE), Greek (EL), Spanish (ES), Farsi (FA), French (FR), Hebrew (HE), Hungarian (HU), Italian (IT), Lithuanian (LT), Maltese (MT), Polish (PL), Brazilian Portuguese (PT), Romanian (RO), Slovene (SL), Swedish (SV) and Turkish (TR). For all these languages, we provided two corpora to the participants:

The corpora are provided in the parsemetsv format, inspired by the CoNLL-U format.

For most languages (all except BG, HE and LT), paired files in the CoNLL-U format - not necessarily using UD tagsets - containing parts of speech, lemmas, morphological features and/or syntactic dependencies are also provided. Depending on the language, this information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe).
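For readers who want to process the data programmatically, here is a minimal, unofficial Python sketch of a .parsemetsv reader. It assumes the four-column layout used in this edition (token rank, surface form, a no-space flag, and a VMWE code such as 2:LVC on the first token of VMWE number 2 and 2 on its remaining tokens, with _ for unannotated tokens); please refer to the guidelines and the provided tools for the authoritative definition.

    # Minimal, unofficial sketch of a .parsemetsv reader (assumed 4-column
    # layout: rank, surface form, no-space flag, VMWE code such as "2:LVC",
    # "2" or "_"; empty lines separate sentences).
    from collections import defaultdict

    def read_parsemetsv(path):
        """Yield (tokens, vmwes) per sentence, where vmwes maps a VMWE id
        to [category, list of token ranks]."""
        tokens, vmwes = [], defaultdict(lambda: [None, []])
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:                      # sentence boundary
                    if tokens:
                        yield tokens, dict(vmwes)
                    tokens, vmwes = [], defaultdict(lambda: [None, []])
                    continue
                rank, form, nsp, mwe = line.split("\t")[:4]
                if "-" in rank:                   # skip fused-token range lines, if any
                    continue
                tokens.append(form)
                if mwe != "_":
                    for part in mwe.split(";"):   # a token may belong to several VMWEs
                        if ":" in part:           # first token of a VMWE: "id:CATEGORY"
                            mwe_id, cat = part.split(":")
                            vmwes[mwe_id][0] = cat
                        else:                     # continuation token: "id"
                            mwe_id = part
                        vmwes[mwe_id][1].append(int(rank))
        if tokens:
            yield tokens, dict(vmwes)

The paired .conllu files, where available, can be read with any CoNLL-U reader and should be alignable to the .parsemetsv files by sentence order and token rank.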

The table below summarizes the sizes of the training corpora per language:

Language   Sentences   Tokens    VMWE    ID      IReflV   LVC     OTH    VPC
BG         6913        157647    1933    417     1079     435     2      0
CS         43955       740530    12852   1419    8851     2580    2      0
DE         6261        120840    2447    1005    111      178     10     1143
EL         5244        142322    1518    515     0        955     16     32
ES         2502        102090    748     196     336      214     2      0
FA         2736        46530     2707    0       0        0       2707   0
FR         17880       450221    4462    1786    1313     1362    1      0
HE         4673        99790     1282    86      0        253     535    408
HU         3569        87777     2999    0       0        584     0      2415
IT         15728       387325    1954    913     580      395     4      62
LT         12153       209636    402     229     0        173     0      0
MT         5965        141096    772     261     0        434     77     0
PL         11578       191239    3149    317     1548     1284    0      0
PT         19640       359345    3447    820     515      2110    2      0
RO         45469       778674    4040    524     2496     1019    1      0
SL         8881        183285    1787    283     945      186     2      371
SV         200         3376      56      9       3        13      0      31
TR         16715       334880    6169    2911    0        2624    634    0
Total      230062      4536603   52724   11691   17777    14799   3995   4462

The table below summarizes the sizes of the test corpora per language:

Language   Sentences   Tokens    VMWE    ID      IReflV   LVC     OTH    VPC
BG         1947        42481     473     100     297      76      0      0
CS         5476        92663     1684    192     1149     343     0      0
DE         1239        24016     500     214     20       40      0      226
EL         3567        83943     500     127     0        336     21     16
ES         2132        57717     500     166     220      106     8      0
FA         490         8677      500     0       0        0       500    0
FR         1667        35784     500     119     105      271     5      0
HE         2327        47571     500     30      0        127     158    185
HU         742         20398     500     0       0        146     0      354
IT         1272        40523     500     250     150      87      2      11
LT         2710        46599     100     58      0        42      0      0
MT         4635        111189    500     185     0        259     56     0
PL         2028        29695     500     66      265      169     0      0
PT         2600        54675     500     90      81       329     0      0
RO         6031        100753    500     75      290      135     0      0
SL         2530        52579     500     92      253      45      2      108
SV         1600        26141     236     51      14       14      2      155
TR         1321        27197     501     249     0        199     53     0
Total      44314       902601    9494    2064    2844     2724    807    1055

The training and test data are available in our public GitLab repository. Follow the Repository link to access folders for individual languages. You can also download an archive containing all the data directly using this shortcut link. The test data are provided in the following files:

Both parsemetsv and conllu files could be used in the closed track.

All VMWE annotations are available under Creative Commons licenses (see README.md files for details).

A note for shared task participants: Small trial data sets were previously released via the same repository for most languages. We could not fully ensure that no part of these data was included in the test data released on 20 January. Therefore, we asked participants not to use the trial.parsemetsv files for any language (except ES and SV) while training the final versions of their systems.

Tracks

System results could be submitted in two tracks:

Teams submitting systems in the open track were requested to describe and provide references to all resources used at submission time. Teams were encouraged to favor freely available resources for better reproducibility of their results.

Evaluation metrics

Participants were to provide the output produced by their systems on the test corpus. This output was compared with the gold standard (ground truth). Evaluation metrics are precision, recall and F1, both strict (per VMWE) and fuzzy (per token, i.e. taking partial matches into account). The evaluation script is available in our public data repository. It can be used as follows:

The token-based F1 takes into account the fact that:

Therefore, we measure the best F1 score over all possible matchings between the sets of MWE token ranks in the gold and system sentences. We do this by considering all possible ways of pairing the MWEs in the two sets.
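As a rough illustration of the two views (this is not the official evaluation script), the sketch below scores one sentence in which every VMWE is represented as the set of its token ranks; the token-based variant tries all one-to-one pairings between gold and predicted VMWEs and keeps the one with the largest token overlap, which yields the best achievable token-based F1.

    # Informal sketch of strict (per-VMWE) and fuzzy (per-token) scoring for
    # one sentence; not the official evaluation script.
    from itertools import permutations

    def strict_counts(gold, pred):
        """A predicted VMWE is correct only if its token set equals a gold one."""
        correct = sum(1 for p in pred if p in gold)
        return correct, len(pred), len(gold)          # correct, predicted, gold

    def fuzzy_token_counts(gold, pred):
        """Best one-to-one pairing of gold and predicted VMWEs by token overlap."""
        gold_tokens = sum(len(g) for g in gold)
        pred_tokens = sum(len(p) for p in pred)
        small, large = (gold, pred) if len(gold) <= len(pred) else (pred, gold)
        best = 0
        for perm in permutations(large, len(small)):  # all possible matchings
            best = max(best, sum(len(a & b) for a, b in zip(small, perm)))
        return best, pred_tokens, gold_tokens

    def prf(correct, predicted, gold):
        p = correct / predicted if predicted else 0.0
        r = correct / gold if gold else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    gold = [frozenset({2, 3}), frozenset({5, 6, 7})]
    pred = [frozenset({2, 3}), frozenset({5, 6})]
    print(prf(*strict_counts(gold, pred)))            # (0.5, 0.5, 0.5)
    print(prf(*fuzzy_token_counts(gold, pred)))       # (1.0, 0.8, ~0.89)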

VMWE categories (e.g., LVC, ID, IReflV, VPC) are ignored by the evaluation metrics. Categories are only provided in the training data to guide system design. Systems focusing on selected VMWE categories only were also encouraged to participate - see the FAQ.

Tokenization issues

Tokenization is closely related to MWE identification, and it has been shown that performing both tasks jointly may enhance the quality of their results.

Note, however, that the data we provide consist of pre-tokenized sentences. This implies that we expect typical systems to perform tokenization prior to VMWE identification, and that we do not allow the tokenization to be modified with respect to the ground truth. This is necessary because the evaluation measures are token-based. This approach may disadvantage systems which expect raw, untokenized text as input and apply their own tokenization methods, whether jointly with VMWE identification or not.

We are aware of this bias, and we nevertheless encouraged such systems to participate in the shared task. We believe that re-tokenization methods can be defined to adapt a system's output to the tokenization imposed by us.
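As one possible (hypothetical) re-tokenization strategy, a system's own tokens can be aligned to the reference tokens through character offsets in the underlying text, and predicted VMWE labels can then be transferred to every reference token that overlaps a labelled system token. The sketch below illustrates the idea; the function names and the overlap rule are our own choices, not part of the shared task tooling.

    # Hypothetical re-tokenization helper: project labels predicted on a
    # system's own tokens onto the reference tokenization via character spans.
    def char_spans(tokens, text):
        """Return (start, end) character offsets of each token in text."""
        spans, pos = [], 0
        for tok in tokens:
            start = text.index(tok, pos)
            spans.append((start, start + len(tok)))
            pos = start + len(tok)
        return spans

    def project(system_tokens, reference_tokens, text, flagged):
        """Return indices of reference tokens overlapping any flagged system token."""
        sys_spans = char_spans(system_tokens, text)
        ref_spans = char_spans(reference_tokens, text)
        flagged_spans = [sys_spans[i] for i in flagged]
        return [j for j, (rs, re_) in enumerate(ref_spans)
                if any(rs < e and s < re_ for s, e in flagged_spans)]

    text = "He gave it up."
    reference = ["He", "gave", "it", "up", "."]
    system = ["He", "gave", "it", "up."]              # system merged "up" and "."
    print(project(system, reference, text, flagged=[1, 3]))
    # -> [1, 3, 4]: the merged "up." also covers the reference "."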

Publication and workshop

Shared task participants are invited to submit contributions of two kinds to the SHARED TASK TRACK of the EACL 2017 workshop on Multiword Expressions (MWE 2017) via the dedicated START space:

Results

NEW! The results page now contains system evaluation results per language (disregarding VMWE categories).

Important dates

Organizing team

Marie Candito, Fabienne Cap, Silvio Cordeiro, Antoine Doucet, Voula Giouli, Behrang QasemiZadeh, Carlos Ramisch, Federico Sangati, Agata Savary, Ivelina Stoyanova, Veronika Vincze

Frequently asked questions

  1. My system can identify only one category of VMWEs (e.g., verb-particle constructions). Can I still participate in the shared task?
    Organizing different tracks for different VMWE categories would be too complex. Therefore, we plan to publish the systems' results globally, i.e. without distinguishing particular VMWE categories. If a system can only recognize one category, its results in this global picture will probably not be very high. In spite of that, we do encourage such systems to participate, since we are more interested in cross-language discussions than in a real competition. Our evaluation script (with a proper choice of parameters) does allow restricting the evaluation to a particular VMWE category. It is thus possible for a system's authors to perform the evaluation on their own and describe their results in a system description paper submitted to the MWE 2017 workshop.
    Organizing different tracks for different VMWE categories would be too complex. Therefore, we plan to publish the systems' results globally, i.e. without the distinction into particular VMWE categories. If a system can only recognize one category, its results with respect to this global picture will probably not be very high. In spite of that we do encourage such systems to participate, since we are interested more in cross-language discussions than in a real competition. Our evaluation script (with a proper choice of parameters) does allow restraining the evaluation to a particular VMWE category. It is, thus, possible for a system's authors to perform the evaluation on their own and describe their results in a system description paper to be submitted to the MWE 2017 workshop.