Shared task on automatic identification of verbal multiword expressions
Organized as part of the MWE 2017 workshop co-located with EACL 2017 (Valencia, Spain), April 4, 2017
Last updated: February 1, 2017
The PARSEME shared task on automatic identification of verbal multiword expressions (VMWEs) aims at identifying verbal MWEs in running texts. Verbal MWEs include idioms (let the cat out of the bag), light verb constructions (make a decision), verb-particle constructions (give up), and inherently reflexive verbs (se suicider 'to suicide' in French). Their identification is a well-known challenge for NLP applications, due to their complex characteristics: discontinuity, non-compositionality, heterogeneity and syntactic variability.
The shared task is highly multilingual: we cover 18 languages from several language families. PARSEME members have developed annotation guidelines based on annotation experiments in these languages. The guidelines take both universal and language-specific phenomena into account. We hope that this will boost the development of language-independent and cross-lingual VMWE identification systems.
The evaluation phase of the shared task is now over, but you can find useful information about the shared task on this page.
Participation was open and free worldwide.
Task updates and questions can still be posted to our public mailing list.
A large international community has gathered (via the PARSEME network) around the effort of putting forward universal guidelines and performing corpus annotations. Our policy was to allow the national teams that provided annotated corpora to also submit VMWE identification systems to the shared task. While this policy is non-standard and introduces a bias into system evaluation, we follow it for several reasons:
- For many languages there are only very few NLP teams, so adopting an exclusive approach (either you annotate or you present a system but not both) would actually exclude the whole language from participation.
- We are interested more in cross-language discussions than in a real competition.
- We believe we can trust the teams to respect some best practices, including the following:
- The test data are never used for training, even if system authors have access to them in advance.
- If any resources were used to annotate the corpus, the same resources should not be used by the system (in the open track).
- If system authors notice other sources of bias between their annotating activity and system evaluation, they should describe them in the submitted papers (if any).
The shared task covers 18 languages: Bulgarian (BG), Czech (CS), German (DE), Greek (EL), Spanish (ES), Farsi (FA), French (FR), Hebrew (HE), Hungarian (HU), Italian (IT), Lithuanian (LT), Maltese (MT), Polish (PL), Brazilian Portuguese (PT), Romanian (RO), Slovene (SL), Swedish (SV) and Turkish (TR). For all these languages, we provided two corpora to the participants:
- Manually built training corpora in which VMWEs are annotated according to the universal guidelines.
- Raw (unannotated) test corpora (released on 20 January) to be used as input to the systems. The VMWE annotations for these corpora, performed according to the same guidelines, were kept secret during the evaluation phase and are now available on the GitLab repository.
For most languages (all except BG, HE and LT), paired files in the CoNLL-U format - not necessarily using UD tagsets - containing parts of speech, lemmas, morphological features and/or syntactic dependencies are also provided. Depending on the language, this information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe).
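As a concrete illustration, a minimal reader for these companion files might look as follows. This is only a sketch following the CoNLL-U column layout (ten tab-separated fields, the first four being ID, FORM, LEMMA and POS); the sample sentence is invented for illustration.

```python
# Minimal sketch of reading a CoNLL-U file's ID, FORM, LEMMA and POS
# columns. The sample sentence below is invented, not from the corpora.

sample = """\
1\tgave\tgive\tVERB\t_\t_\t0\troot\t_\t_
2\tup\tup\tADP\t_\t_\t1\tcompound:prt\t_\t_
"""

def read_conllu(text):
    """Yield sentences as lists of (id, form, lemma, pos) tuples."""
    sentence = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#"):          # sentence-level comments
            continue
        if not line:                      # blank line ends a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:  # skip range/empty nodes
            continue
        sentence.append((int(cols[0]), cols[1], cols[2], cols[3]))
    if sentence:
        yield sentence

for sent in read_conllu(sample):
    print(sent)
```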
The table below summarizes the sizes of the training corpora per language:
The table below summarizes the sizes of the test corpora per language:
The training and test data are available in our public GitLab repository. Follow the Repository link to access the folders for individual languages. You can also download an archive containing all the data directly via this shortcut link. The test data are provided in the following files:
- the test.blind.parsemetsv file contains the tokenized sentences in which systems had to identify VMWEs automatically,
- the test.conllu file, in a CoNLL-U-compatible format, with morphosyntactic and/or syntactic information (not available for BG, HE and LT),
- the test.parsemetsv file contains the reference gold annotations against which system outputs were compared for evaluation; this file was made available on February 1, 2017, after the evaluation phase was over.
Both parsemetsv and conllu files could be used in the closed track.
All VMWE annotations are available under Creative Commons licenses (see README.md files for details).
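To make the parsemetsv files more concrete, here is a hedged sketch of collecting the VMWE annotations from one sentence. It assumes a four-column layout per token (token rank, surface form, nsp flag, MWE column), where a code such as "1:LVC" marks the first token of MWE number 1 with category LVC, a bare "1" marks its further tokens, "_" means no annotation, and ";" separates codes of overlapping MWEs; consult the format documentation in the repository for the authoritative description.

```python
# Hedged sketch, assuming the 4-column parsemetsv layout described above;
# see the repository's format documentation for the authoritative spec.

def mwes_in_sentence(lines):
    """Return {mwe_id: (category, [token ranks])} for one sentence."""
    mwes = {}
    for line in lines:
        rank, form, nsp, mwe = line.split("\t")
        if mwe == "_":
            continue
        for code in mwe.split(";"):
            if ":" in code:                 # first token: id + category
                mwe_id, cat = code.split(":")
                mwes[int(mwe_id)] = (cat, [int(rank)])
            else:                           # continuation token
                mwes[int(code)][1].append(int(rank))
    return mwes

sentence = [
    "1\tHe\t_\t_",
    "2\tmade\t_\t1:LVC",
    "3\ta\t_\t_",
    "4\tdecision\t_\t1",
]
print(mwes_in_sentence(sentence))   # {1: ('LVC', [2, 4])}
```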
A note for shared task participants: Small-size trial data were previously released via the same repository for most languages. We could not fully ensure that no part of these data was included in the test data released on 20 January. Therefore, we asked participants not to use the trial.parsemetsv files for any language (except ES and SV) while training the final versions of their systems.
System results could be submitted in two tracks:
- Closed track: Systems using only the provided training data - VMWE annotations + CoNLL-U files (if any) - to learn VMWE identification models and/or rules.
- Open track: Systems using or not the provided training data, plus any additional resources deemed useful (MWE lexicons, symbolic grammars, wordnets, raw corpora, word embeddings, language models trained on external data, etc.). This track includes notably purely symbolic and rule-based systems.
Teams submitting systems in the open track were requested to describe and provide references to all resources used at submission time. Teams were encouraged to favor freely available resources for better reproducibility of their results.
Participants were to provide the output produced by their systems on the test corpus. This output was compared with the gold standard (ground truth). Evaluation metrics are precision, recall and F1, both strict (per VMWE) and fuzzy (per token, i.e. taking partial matches into account). The evaluation script is available in our public data repository. It can be used as follows:
- ./evaluate.py gold.parsemetsv system.parsemetsv
The token-based F1 takes into account the fact that:
- discontinuities are allowed (take something into account)
- overlaps are allowed (take a walk and then a long shower)
- embeddings are allowed, both at the syntactic level (take the fact that I didn't give up into account) and at the level of lexicalized components (let the cat out of the bag)
- multiword tokens lead to one-token MWEs (ES suicidarse)
Therefore, we compute the best F1 score over all possible one-to-one matchings between the MWEs (each represented as a set of token ranks) in the gold and system versions of a sentence, considering every way of pairing MWEs from the two sets.
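This best-matching idea can be sketched as follows. Note that this is an illustrative toy implementation, not the official evaluate.py: it brute-forces all one-to-one pairings between gold and predicted MWEs and keeps the pairing that maximizes the number of shared tokens.

```python
# Illustrative sketch (not the official evaluate.py) of token-based
# fuzzy precision/recall/F1 over sets of token ranks.

from itertools import permutations

def fuzzy_f1(gold, pred):
    """gold, pred: lists of sets of token ranks; return (P, R, F1)."""
    if not gold or not pred:
        return (0.0, 0.0, 0.0)
    smaller, larger = sorted((gold, pred), key=len)
    best = 0
    # Try every one-to-one assignment of the smaller list into the larger.
    for perm in permutations(larger, len(smaller)):
        best = max(best, sum(len(a & b) for a, b in zip(smaller, perm)))
    precision = best / sum(len(m) for m in pred)
    recall = best / sum(len(m) for m in gold)
    f1 = 2 * precision * recall / (precision + recall) if best else 0.0
    return (precision, recall, f1)

gold = [{2, 5}]       # e.g. lexicalized tokens of "take ... into account"
pred = [{2, 5, 6}]    # system included one extra token: partial credit
print(fuzzy_f1(gold, pred))   # P = 2/3, R = 1.0, F1 = 0.8
```

The brute force over permutations is exponential in the number of MWEs per sentence, which is acceptable here only because sentences contain few MWEs; a production implementation would use a proper assignment algorithm.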
VMWE categories (e.g., LVC, ID, IReflV, VPC) are ignored by the evaluation metrics. Categories are only provided in the training data to guide system design. Systems focusing on selected VMWE categories only were also encouraged to participate - see the FAQ.
Tokenization is closely related to MWE identification, and it has been shown that performing both tasks jointly may enhance the quality of their results.
Note, however, that the data provided by us consist of pre-tokenized sentences, which implies that systems are not expected to perform tokenization themselves, and that the tokenization must not be modified with respect to the ground truth. This is necessary since the evaluation measures are token-based. This approach may disadvantage systems which expect untokenized raw text as input and apply their own tokenization methods, whether jointly with VMWE identification or not.
We are aware of this bias, and we did encourage such systems to participate in the shared task. We believe that re-tokenization methods can be defined to adapt a system's output to the tokenization imposed by us.
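One possible re-tokenization strategy (our own illustration, not part of the shared task tooling) is to align the system's tokens with the reference tokens via character offsets, so that token-level labels can be transferred onto the imposed tokenization:

```python
# Sketch of character-offset alignment between two tokenizations of the
# same text; an assumption-level illustration, not official tooling.

def char_spans(tokens, text):
    """Return (start, end) character offsets of each token in text."""
    spans, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)
        spans.append((start, start + len(tok)))
        pos = start + len(tok)
    return spans

def project(system_tokens, reference_tokens, text):
    """For each reference token, list the overlapping system tokens."""
    sys_spans = char_spans(system_tokens, text)
    ref_spans = char_spans(reference_tokens, text)
    mapping = []
    for r_start, r_end in ref_spans:
        mapping.append([i for i, (s_start, s_end) in enumerate(sys_spans)
                        if s_start < r_end and s_end > r_start])
    return mapping

text = "He gave it up."
# The system fused "up" and "." into one token; both reference tokens
# "up" and "." map back to system token 3.
print(project(["He", "gave", "it", "up."],
              ["He", "gave", "it", "up", "."], text))
```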
Publication and workshop
Participants were asked to submit the following:
- System results (by January 27) obtained on the blind test data (released on 20 January). The results for all languages should be submitted in a single your-system-name.zip archive containing one folder per language, named according to the ISO 639-1 code (e.g. FR for French, MT for Maltese, etc.). Each output file must be named test.system.parsemetsv and conform to the parsemetsv format. The format of each file should be checked before submission with the validation script as follows:
- ./checkParsemeTsvFormat.py test.system.parsemetsv
- A system description paper (by February 5). These papers must follow the workshop submission instructions and will go through double-blind peer reviewing by other participants and selected MWE 2017 program committee members. Their acceptance depends on the quality of the paper rather than on the results obtained in the shared task. Authors of the accepted papers will present their work as posters/demos in a dedicated session of the workshop. The submission of a system description paper is not mandatory.
NEW! The results page now contains system evaluation results per language (disregarding VMWE categories).
- Oct 14, 2016: first Call for Participation
- Nov 18, 2016: second Call for Participation (previous deadline: Dec 13)
- Dec 22, 2016: trial data and evaluation script released
- Jan 6, 2017: training data released
- Jan 10, 2017: final Call for Participation
- Jan 20, 2017: blind test data released (previous deadline: Jan 30)
- Jan 31, 2017: announcement of results
- Feb 5, 2017: submission of shared task system description papers
- Feb 12, 2017: notification of acceptance
- Feb 20, 2017: camera-ready system description papers due
- April 4, 2017: shared task workshop colocated with MWE 2017
Organizing team: Marie Candito, Fabienne Cap, Silvio Cordeiro, Antoine Doucet, Voula Giouli, Behrang QasemiZadeh, Carlos Ramisch, Federico Sangati, Agata Savary, Ivelina Stoyanova, Veronika Vincze
FAQ
- My system can identify only one category of VMWEs (e.g. verb-particle constructions). Can I still participate in the shared task?
Organizing different tracks for different VMWE categories would be too complex. Therefore, we publish the systems' results globally, i.e. without distinguishing particular VMWE categories. If a system can only recognize one category, its results in this global picture will probably not be very high. Despite that, we do encourage such systems to participate, since we are more interested in cross-language discussions than in a real competition. Our evaluation script (with a proper choice of parameters) does allow restricting the evaluation to a particular VMWE category. It is thus possible for a system's authors to perform such an evaluation on their own and describe the results in a system description paper submitted to the MWE 2017 workshop.