
Evaluation metrics for the PARSEME shared task (edition 1.1)

Participants are to provide the output produced by their systems on the (blind) test corpus. This output is compared with the gold standard (ground truth). Evaluation metrics are precision (P), recall (R) and F1 (F) of two types:

General metrics

The general metrics are defined along two dimensions, as in edition 1.0 of the PARSEME shared task:

- per-VMWE scores, in which a predicted VMWE counts as a true positive only if it matches a gold VMWE exactly;
- per-token scores, which grant partial credit to predictions that overlap a gold VMWE without matching it exactly.

These scores are calculated both per language and for all participating languages (as a macro-average).
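The per-VMWE dimension can be sketched as follows. This is a minimal, hypothetical illustration, not the official implementation in evaluate.py; representing each VMWE as the set of its token indices, and the exact-match criterion, are assumptions made for the sketch:

```python
# Hypothetical sketch of the per-VMWE metric: a predicted VMWE counts as a
# true positive only if its token set exactly matches a gold VMWE.
def prf(gold, pred):
    """gold, pred: sets of frozensets of token indices, one per VMWE.

    Returns (precision, recall, F1)."""
    tp = len(gold & pred)                      # exact-match true positives
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = {frozenset({3, 4}), frozenset({7, 8, 10})}
pred = {frozenset({3, 4}), frozenset({7, 8})}   # second prediction is partial
print(prf(gold, pred))  # → (0.5, 0.5, 0.5)
```

Under the per-token dimension, the partial prediction {7, 8} would instead earn credit for its two overlapping tokens.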

Metrics dedicated to specialized phenomena

For a better account of the challenges that VMWE identifiers face, we introduce metrics specialized in particular phenomena, such as the novelty of a VMWE (seen vs. unseen with respect to the training corpus) and its variability (seen VMWEs occurring in identical vs. variant form).

These metrics are not calculated per language but for all participating languages (as a macro-average). They are per-VMWE only and do not distinguish VMWE categories. Per-token scores are not provided due to their complex interplay with the above phenomena. Per-language and per-category scores are not provided due to data scarcity.

Macro-average scores

In addition to per-language scores for the general metrics, we provide both general and phenomenon-dedicated scores for all participating languages. These are calculated as macro-averages: each score is first computed per language, and the per-language scores are then averaged with equal weight given to every language.
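The averaging step can be sketched as follows. This is a hypothetical illustration, assuming that precision, recall, and F1 are each averaged independently over languages; the official procedure is defined by average_of_evaluations.py:

```python
# Hypothetical sketch of macro-averaging: average each component score
# over languages, weighting every language equally.
def macro_average(scores):
    """scores: list of (P, R, F1) tuples, one per language."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))

# Two made-up languages: (P, R, F1) per language.
avg = macro_average([(0.8, 0.6, 0.7), (0.4, 0.4, 0.4)])
print(tuple(round(x, 3) for x in avg))  # → (0.6, 0.5, 0.55)
```

Because every language contributes equally, a low score on a small-corpus language pulls the macro-average down as much as one on a large-corpus language would.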

Rankings

We plan to publish rankings of the participating systems according to all F-scores defined above. These rankings should, however, be interpreted with care and should not be considered primary outcomes of the shared task. We are more interested in promoting cross-language discussion than in running a genuine competition.

Evaluation scripts

The evaluation script is available in our public data repository. It takes as input two variants of the same test file - the gold standard and the prediction - in the .cupt format. If you also provide the training corpus, the script can then calculate the novelty- and variability-dedicated metrics described above.

Only columns 1 (ID), 2 (FORM) and 11 (PARSEME:MWE) are relevant for the general metrics, while columns 1, 2, 3 (LEMMA) and 11 are relevant for the dedicated metrics. All other columns are ignored by the script. The script can be used as follows (--train is optional):

./evaluate.py --gold gold.cupt --pred system.cupt --train train.cupt
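To illustrate what the script reads from column 11, here is a hypothetical sketch of parsing a PARSEME:MWE value; it is not taken from evaluate.py, and assumes the usual .cupt conventions ("*" for no annotation, "1:VID" opening VMWE 1 with category VID, a bare "1" continuing it, ";" separating overlapping annotations):

```python
# Hypothetical sketch: decode one PARSEME:MWE cell (column 11 of .cupt).
def parse_mwe_column(value):
    """Return a list of (vmwe_number, category_or_None) pairs for one token."""
    if value == "*":            # token belongs to no VMWE
        return []
    pairs = []
    for part in value.split(";"):   # a token may belong to several VMWEs
        if ":" in part:             # first token of a VMWE: number + category
            num, cat = part.split(":")
            pairs.append((int(num), cat))
        else:                       # continuation token: number only
            pairs.append((int(part), None))
    return pairs

print(parse_mwe_column("1:VID;2"))  # → [(1, 'VID'), (2, None)]
```

Grouping these pairs by VMWE number across a sentence yields the token sets that the per-VMWE metric compares between gold and prediction.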

For details of the evaluation, run the script with the --debug option, preferably in conjunction with less:

./evaluate.py --gold gold.cupt --pred system.cupt --train train.cupt --debug | less -RS

In case of errors, the evaluation script will indicate the line of the input file in which the error occurred. Please note that the error may actually be located in the previous or next sentence, because the script reads whole sentences one by one before processing them.

In addition to the per-language evaluation script, we also provide a script to calculate macro-averages, called average_of_evaluations.py. This script takes as input several files generated by the evaluation script (you can redirect its output to files) and produces averages for all metrics. For instance, you can run the following commands to calculate the macro-averaged scores over Bulgarian (BG) and German (DE):

./evaluate.py --gold BG/dev.cupt --pred BG/system.cupt --train BG/train.cupt > BG/eval.txt
./evaluate.py --gold DE/dev.cupt --pred DE/system.cupt --train DE/train.cupt > DE/eval.txt
./average_of_evaluations.py BG/eval.txt DE/eval.txt

The rankings using macro-averaged scores over all languages will be calculated using this script.