Evaluation metrics for the PARSEME shared task (editions 1.1 and 1.2)
Last updated: Oct 06, 2020
- NEW! [Sep 15, 2020] We have changed the definition of seen VMWEs: the seen/unseen status is now determined with respect to train+dev instead of train only (see the reasons for this change).
Participants are to provide the output produced by their systems on the (blind) test corpus. This output is compared with the gold standard (ground truth). Evaluation metrics are precision (P), recall (R) and F1 (F) of two types:
- general metrics
- metrics dedicated to specialized phenomena
General metrics
The general metrics are defined along two dimensions, as in edition 1.0 of the PARSEME shared task (a sketch of the two scoring modes follows below):
- strict score (per-VMWE) vs. fuzzy score (per-token, i.e. taking partial matches into account); for more details, see section 6 of the description paper of the PARSEME shared task edition 1.0
- score for identification (disregarding VMWE categories) vs. per-category scores
These scores are calculated both per language and for all participating languages (as a macro-average).
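For illustration, here is a minimal Python sketch of how the strict and fuzzy scores behave on a partial match. It is not the official evaluate.py implementation: VMWEs are represented simply as sets of token indices, and the per-token measure is approximated.

# Minimal sketch of strict (per-VMWE) vs. fuzzy (per-token) scoring.
# Illustration only, not the official evaluate.py implementation.

def prf(correct, predicted, gold):
    """Precision, recall and F1 from raw counts."""
    p = correct / predicted if predicted else 0.0
    r = correct / gold if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def strict_prf(gold_vmwes, pred_vmwes):
    # A prediction counts only if it matches a gold VMWE exactly.
    correct = len(set(gold_vmwes) & set(pred_vmwes))
    return prf(correct, len(pred_vmwes), len(gold_vmwes))

def fuzzy_prf(gold_vmwes, pred_vmwes):
    # Partial matches count: overlap is measured token by token
    # (a simplified approximation of the per-token measure defined in
    # section 6 of the shared task 1.0 description paper).
    correct = sum(max((len(g & p) for g in gold_vmwes), default=0)
                  for p in pred_vmwes)
    return prf(correct,
               sum(len(p) for p in pred_vmwes),
               sum(len(g) for g in gold_vmwes))

gold = [frozenset({3, 4, 7})]   # e.g. lexicalized tokens of "set ... up ... meeting"
pred = [frozenset({3, 4})]      # the system found only part of the VMWE
print(strict_prf(gold, pred))   # (0.0, 0.0, 0.0) - no exact match
print(fuzzy_prf(gold, pred))    # (1.0, 0.666..., 0.8) - partial credit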
Metrics dedicated to specialized phenomena
For a better account of the challenges that VMWE identifiers have to face, we introduce metrics specialized in the following phenomena:
- Continuity - we provide P/R/F scores separately for those VMWEs whose lexicalized components are adjacent (set up a meeting) and non-adjacent (set me up) in the test corpus
- Length - we provide P/R/F scores separately for multi-token VMWEs (ES: me abstengo; DE: macht es auf) and for single-token ones (ES: abstenerse; DE: aufmachen)
- Novelty - we provide P/R/F scores separately for seen and unseen VMWEs. A VMWE from the test corpus is considered seen if a VMWE with the same (multi-)set of lemmas is annotated at least once in the training or the development corpus. For instance, given the occurrence of has a new look in the training or the development corpus, the following VMWEs from the test corpus would be considered:
- seen: has a new look, had an appealing look, has a look of innocence, the look that he had
- unseen: has a look at this report, gave a look to the book, walk that he had, etc.
- Variability - we additionally provide P/R/F scores for variants, i.e. those seen VMWEs which are not identical to VMWE occurrences from the training corpus. VMWE occurrences are considered identical if the strings between their first and last lexicalized components, including non-lexicalized elements in between, are identical. All the seen examples above are considered variants except has a new look (see the sketch after this list).
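The following minimal Python sketch shows how the seen/unseen and variant distinctions can be operationalised with a lemma multiset and a surface-span comparison. It is an illustration only: the authoritative logic is in evaluate.py, and the train_dev data below is hypothetical.

# Sketch of the seen/unseen and variant distinctions; illustration only.
# A VMWE is "seen" if the multiset of lemmas of its lexicalized components
# was annotated at least once in the reference corpus passed via --train;
# a seen VMWE is a "variant" if its surface span (from first to last
# lexicalized component, non-lexicalized material included) matches no
# occurrence in that reference corpus.
from collections import Counter

def lemma_key(vmwe_lemmas):
    """Order-insensitive multiset of lemmas, usable as a set element."""
    return frozenset(Counter(vmwe_lemmas).items())

# Hypothetical reference occurrences: (lemmas of lexicalized components,
# surface string from first to last lexicalized component).
train_dev = [(["have", "look"], "has a new look")]

seen_keys = {lemma_key(lemmas) for lemmas, _ in train_dev}
seen_spans = {span for _, span in train_dev}

def classify(test_lemmas, test_span):
    if lemma_key(test_lemmas) not in seen_keys:
        return "unseen"
    return "seen (identical)" if test_span in seen_spans else "seen (variant)"

print(classify(["have", "look"], "has a new look"))          # seen (identical)
print(classify(["have", "look"], "had an appealing look"))   # seen (variant)
print(classify(["give", "look"], "gave a look to the book")) # unseen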
These metrics are not calculated per language but for all participating languages (as a macro-average). They are per-VMWE only and do not distinguish VMWE categories. Per-token scores are not provided due to their complex interplay with the above phenomena. Per-language and per-category scores are not provided due to data scarcity.
Macro-average scores
In addition to per-language scores for general metrics, we provide both general and phenomenon-dedicated scores for all participating languages. They are calculated as macro-averages in the following way (a sketch follows the list below):
- F-scores obtained by a system for each participating language are (arithmetically) averaged.
- If a system provides no results for a given language, its F-score for this language is considered as equal to 0.
- If a language has no VMWE corresponding to a given phenomenon (e.g. no single-token VMWEs), we do not include this language in the ranking.
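As an illustration of these averaging rules, the following simplified Python sketch (not the official average_of_evaluations.py; the language scores are hypothetical) applies them to a single phenomenon:

# Sketch of the macro-averaging rules; illustration only.
# - F-scores are arithmetically averaged across languages.
# - A missing system result counts as F = 0 for that language.
# - A language without the phenomenon (None below) is excluded.

def macro_average(f_scores_by_language):
    """Maps language -> F-score, 0.0 for missing system output,
    or None if the phenomenon is absent in that language."""
    relevant = [f for f in f_scores_by_language.values() if f is not None]
    return sum(relevant) / len(relevant) if relevant else 0.0

# Hypothetical single-token-VMWE F-scores; suppose BG had no such VMWEs.
scores = {"DE": 0.62, "ES": 0.48, "BG": None}
print(macro_average(scores))   # 0.55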
Rankings
We plan to publish rankings of the participating systems according to all F-scores defined above. These rankings should, however, be interpreted with care and should not be considered primary outcomes of the shared task. We are more interested in promoting cross-language discussion than in a real competition.
Evaluation scripts
The evaluation script "evaluate.py" and the libraries required to run it are available in our public data repository. It takes as input two variants of the same test file - the gold standard and the prediction - in the .cupt format. If you also indicate the training corpus, the script can additionally calculate the novelty- and variability-dedicated metrics described above.
Only columns 1, 2 and 11 are relevant for the general metrics, while columns 1, 2, 3 (lemma) and 11 are relevant for the dedicated metrics (a small sketch of this column layout follows the example commands below). All other columns are ignored by the script. The script can be used as follows (--train is optional):
./evaluate.py --gold gold.cupt --pred system.cupt --train train-dev.cupt
Notice that, as per the new definition of unseen VMWEs, the file given to the option --train is not only the training file, but a concatenation of the training and development corpora. This file can be easily obtained with a command such as cat train.cupt dev.cupt > train-dev.cupt for each language. Results according to the previous definition, used in the official results of shared task 1.1, can be obtained by passing only train.cupt to the --train option.
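To make the column references above concrete, here is a small Python sketch assuming the standard .cupt layout (CoNLL-U columns 1-10 plus the PARSEME:MWE column); the helper relevant_columns and the example line are hypothetical and not part of evaluate.py.

# Sketch: extract the columns used by the evaluation from a .cupt token line.
# Assumes the standard .cupt layout; illustration only.

def relevant_columns(cupt_line):
    cols = cupt_line.rstrip("\n").split("\t")
    return {
        "id": cols[0],     # column 1: token ID
        "form": cols[1],   # column 2: surface form
        "lemma": cols[2],  # column 3: lemma (used for seen/unseen and variants)
        "mwe": cols[10],   # column 11: VMWE annotation, e.g. "1:LVC.full"
    }

line = "3\thas\thave\tVERB\t_\t_\t0\troot\t_\t_\t1:LVC.full"
print(relevant_columns(line))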
For details of the evaluation, run the script with the --debug option, preferably in conjunction with less:
./evaluate.py --gold gold.cupt --pred system.cupt --train train-dev.cupt --debug | less -RS
In case of errors, the evaluation script will indicate the line of the input file in which the error occurred. Please note that the error may actually be located in the previous or the next sentence, because the script reads sentences one by one before processing them.
In addition to the individual evaluation script for a single language, we also provide a script to calculate macro-averages called average_of_evaluations.py. This script takes as input several files generated by the evaluation script (you can redirect the output) and generates averages for all metrics. For instance, you can run the following commands to calculate the macro-averaged scores between Bulgarian (BG) and German (DE):
cat BG/train.cupt BG/dev.cupt > BG/train-dev.cupt
cat DE/train.cupt DE/dev.cupt > DE/train-dev.cupt
./evaluate.py --gold BG/dev.cupt --pred BG/system.cupt --train BG/train-dev.cupt > BG/eval.txt
./evaluate.py --gold DE/dev.cupt --pred DE/system.cupt --train DE/train-dev.cupt > DE/eval.txt
./average_of_evaluations.py BG/eval.txt DE/eval.txt
The rankings using macro-averaged scores over all languages will be calculated using this script.