Shared task on automatic identification of verbal multiword expressions - edition 1.1
Organized as part of the LAW-MWE-CxG 2018 workshop co-located with COLING 2018 (Santa Fe, USA), August 25-26, 2018
Last updated: August 16, 2018
- NEW! The PARSEME corpus edition 1.1 (used in this shared task) is now available via the CLARIN/LINDAT infrastructure.
- NEW! System results are now available. Congratulations to all participants! (May 11)
- NEW! The gold test data for all 20 languages is now available (May 11)
- Good news: DEADLINE EXTENDED UNTIL MAY 08 for the submission of system results! (May 3)
- The blind test data for all 20 languages is now available (April 30)
- We have released a new version of the evaluation script and a new script to calculate macro-averages across languages. (April 23)
- An extra corpus for Arabic is now available through LDC, see the instructions (April 22)
- The full training/development data for all 19 languages (including Basque and Hebrew) is now available (April 12)
- The training/development data for 17 languages is now available (April 5).
- Trial data, the definition of the .cupt format, the description of evaluation measures and the evaluation script are now available.
Description
The second edition of the PARSEME shared task on automatic identification of verbal multiword expressions (VMWEs) aims at identifying verbal MWEs in running texts. Verbal MWEs include, among others, idioms (to let the cat out of the bag), light verb constructions (to make a decision), verb-particle constructions (to give up), multi-verb constructions (to make do) and inherently reflexive verbs (se suicider 'to commit suicide' in French). Their identification is a well-known challenge for NLP applications, due to their complex characteristics, including discontinuity, non-compositionality, heterogeneity and syntactic variability.
The shared task is highly multilingual: PARSEME members have elaborated annotation guidelines based on annotation experiments in about 20 languages from several language families. These guidelines take both universal and language-specific phenomena into account. We hope that this will boost the development of language-independent and cross-lingual VMWE identification systems.
Participation policies
The evaluation phase of the shared task is now over, but you can find useful information about the shared task on this page.
Participation is open and free worldwide. We ask potential participant teams to register using the expression of interest form. Task updates and questions will be posted to our public mailing list. More details on the annotated corpora can be found on a dedicated PARSEME page. See also the annotation guidelines used in the manual annotation of the training/development and test sets, as well as the description of the evaluation measures and the evaluation script.
Note that a large international community has gathered (via the PARSEME network) around the effort of putting forward universal guidelines and performing corpus annotations. Our policy was to allow the national teams which provided annotated corpora to also submit VMWE identification systems to the shared task. While this policy is non-standard and introduces a bias into system evaluation, we follow it for several reasons:
- For many languages there are only very few NLP teams, so adopting an exclusive approach (either you annotate or you present a system but not both) would actually exclude the whole language from participation.
- We are more interested in cross-language discussion than in actual competition.
- We trust the teams to respect best practices, including the following:
- The test data are never used for training/development, even if system authors have access to them in advance.
- If any resources were used to annotate the corpus, the same resources should not be used by the system (in the open track).
- If system authors notice other sources of bias between their annotating activity and system evaluation, they should describe them in the submitted papers (if any).
Submission of results and system description paper
Shared task participants are invited to submit input of two kinds to the SHARED TASK TRACK of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG 2018):
- System results (by May 8, extended from May 4) obtained on the blind test data (released on April 30). The results for all languages should be submitted in a single .zip archive containing one folder per language, named with the ISO 639-1 code (e.g. FR/ for French, SL/ for Slovene). Each output file must be named test.system.cupt and conform to the .cupt format. Before submission, the format of each file should be checked with the validation script as follows:
- ./validate_cupt.py --input test.system.cupt
- A system description paper (by May 25). These papers must follow the LAW-MWE-CxG workshop submission instructions and will go through double-blind peer reviewing by other participants and selected LAW-MWE-CxG 2018 Program Committee members. Their acceptance depends on the quality of the paper rather than on the results obtained in the shared task. Authors of the accepted papers will present their work as posters/demos in a dedicated session of the workshop. The submission of a system description paper is not mandatory.
The submission of system results and of a system description paper should be made via the dedicated START space:
https://www.softconf.com/coling2018/ws-LAW-MWE-CxG-2018
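Submitted test.system.cupt files must follow the .cupt format, which extends CoNLL-U with an eleventh PARSEME:MWE column. The official validate_cupt.py performs the full check; purely as an illustration of the file's structure, a minimal sanity check (a hypothetical check_cupt helper, not the official script) might look like this:

```python
# Hypothetical minimal structural check for a .cupt file (illustration only;
# use the official validate_cupt.py for real validation).
# A .cupt file extends CoNLL-U: token lines carry 11 tab-separated columns
# (the 10 CoNLL-U columns plus PARSEME:MWE); comment lines start with "#"
# and sentences are separated by blank lines.

def check_cupt(path):
    """Return a list of (line_number, message) pairs for malformed token lines."""
    problems = []
    with open(path, encoding="utf-8") as fh:
        for no, line in enumerate(fh, start=1):
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue  # blank sentence separators and comment/metadata lines
            cols = line.split("\t")
            if len(cols) != 11:
                problems.append((no, f"expected 11 columns, got {len(cols)}"))
    return problems
```

This only counts columns; the official script additionally checks the contents of each column and the consistency of MWE annotations.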
Provided data
The PARSEME corpus edition 1.1 (used in this shared task) is available via the CLARIN/LINDAT infrastructure.
The shared task covers 20 languages: Arabic (AR), Bulgarian (BG), German (DE), Greek (EL), English (EN), Spanish (ES), Basque (EU), Farsi (FA), French (FR), Hindi (HI), Hebrew (HE), Croatian (HR), Hungarian (HU), Italian (IT), Lithuanian (LT), Polish (PL), Brazilian Portuguese (PT), Romanian (RO), Slovenian (SL), Turkish (TR).
For each language, we provide corpora (in the .cupt format) in which VMWEs are annotated according to universal guidelines:
- Manually annotated training corpora made available to the participants in advance, in order to allow them to train their systems.
- Manually annotated development corpora also made available in advance so as to tune/optimize the systems' parameters.
- Raw (unannotated) test corpora to be used as input to the systems during the evaluation phase. The VMWE annotations in these corpora were kept secret during the evaluation phase.
For most languages, morphosyntactic data (parts of speech, lemmas, morphological features and/or syntactic dependencies) are also provided. Depending on the language, the information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe).
Our public GitLab repository contains:
- trial data in .cupt format in English
- training/development data in .cupt format in all participating languages
- the evaluation script to calculate per-language evaluation scores, and a script to calculate macro-average scores across languages
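The evaluation measures are documented separately; as a rough sketch only (not the official evaluation script, which also reports token-based and per-category scores), MWE-based precision, recall and F1, together with the macro-average across languages, can be computed like this:

```python
# Illustrative sketch of MWE-based exact-match scoring and the cross-language
# macro-average (a simplification for exposition, not the official script).
# Each VMWE is represented as a frozenset of its token positions.

def prf(gold, pred):
    """Exact-match precision, recall and F1 over sets of VMWEs."""
    tp = len(gold & pred)  # predicted VMWEs matching a gold VMWE exactly
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def macro_average(per_language_scores):
    """Macro-average: unweighted mean of per-language F1 scores."""
    return sum(per_language_scores) / len(per_language_scores)
```

Because the macro-average is unweighted, each language contributes equally regardless of corpus size.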
The table below summarizes the sizes of the training/development/test corpora per language:
| Lang-split | Sentences | Tokens | Avg. length | VMWE | VID | IRV | LVC.full | LVC.cause | VPC.full | VPC.semi | IAV | MVC | LS.ICV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AR-train | 2370 | 231030 | 97.4 | 3219 | 1272 | 17 | 940 | 0 | 957 | 0 | 0 | 33 | 0 |
| AR-dev | 387 | 16252 | 41.9 | 500 | 17 | 0 | 419 | 0 | 64 | 0 | 0 | 0 | 0 |
| AR-test | 380 | 17962 | 47.2 | 500 | 31 | 0 | 410 | 0 | 59 | 0 | 0 | 0 | 0 |
| AR-Total | 3137 | 265244 | 84.5 | 4219 | 1320 | 17 | 1769 | 0 | 1080 | 0 | 0 | 33 | 0 |
| BG-train | 17813 | 399173 | 22.4 | 5364 | 1005 | 2729 | 1421 | 135 | 0 | 0 | 74 | 0 | 0 |
| BG-dev | 1954 | 42020 | 21.5 | 670 | 173 | 240 | 214 | 35 | 0 | 0 | 8 | 0 | 0 |
| BG-test | 1832 | 39220 | 21.4 | 670 | 82 | 254 | 274 | 52 | 0 | 0 | 8 | 0 | 0 |
| BG-Total | 21599 | 480413 | 22.2 | 6704 | 1260 | 3223 | 1909 | 222 | 0 | 0 | 90 | 0 | 0 |
| DE-train | 6734 | 130588 | 19.3 | 2820 | 977 | 220 | 218 | 28 | 1264 | 113 | 0 | 0 | 0 |
| DE-dev | 1184 | 22146 | 18.7 | 503 | 181 | 48 | 34 | 2 | 221 | 17 | 0 | 0 | 0 |
| DE-test | 1078 | 20559 | 19 | 500 | 183 | 40 | 42 | 2 | 210 | 23 | 0 | 0 | 0 |
| DE-Total | 8996 | 173293 | 19.2 | 3823 | 1341 | 308 | 294 | 32 | 1695 | 153 | 0 | 0 | 0 |
| EL-train | 4427 | 122458 | 27.6 | 1404 | 395 | 0 | 938 | 44 | 19 | 0 | 0 | 8 | 0 |
| EL-dev | 2562 | 66431 | 25.9 | 500 | 81 | 0 | 376 | 34 | 8 | 0 | 0 | 1 | 0 |
| EL-test | 1261 | 35873 | 28.4 | 501 | 169 | 0 | 308 | 11 | 11 | 0 | 0 | 2 | 0 |
| EL-Total | 8250 | 224762 | 27.2 | 2405 | 645 | 0 | 1622 | 89 | 38 | 0 | 0 | 11 | 0 |
| EN-train | 3471 | 53201 | 15.3 | 331 | 60 | 0 | 78 | 7 | 151 | 19 | 16 | 0 | 0 |
| EN-test | 3965 | 71002 | 17.9 | 501 | 79 | 0 | 166 | 36 | 146 | 26 | 44 | 4 | 0 |
| EN-Total | 7436 | 124203 | 16.7 | 832 | 139 | 0 | 244 | 43 | 297 | 45 | 60 | 4 | 0 |
| ES-train | 2771 | 96521 | 34.8 | 1739 | 167 | 479 | 223 | 36 | 0 | 0 | 360 | 474 | 0 |
| ES-dev | 698 | 26220 | 37.5 | 500 | 65 | 114 | 84 | 17 | 0 | 0 | 87 | 133 | 0 |
| ES-test | 2046 | 59623 | 29.1 | 500 | 95 | 121 | 85 | 28 | 1 | 0 | 64 | 106 | 0 |
| ES-Total | 5515 | 182364 | 33 | 2739 | 327 | 714 | 392 | 81 | 1 | 0 | 511 | 713 | 0 |
| EU-train | 8254 | 117165 | 14.1 | 2823 | 597 | 0 | 2074 | 152 | 0 | 0 | 0 | 0 | 0 |
| EU-dev | 1500 | 21604 | 14.4 | 500 | 104 | 0 | 382 | 14 | 0 | 0 | 0 | 0 | 0 |
| EU-test | 1404 | 19038 | 13.5 | 500 | 73 | 0 | 410 | 17 | 0 | 0 | 0 | 0 | 0 |
| EU-Total | 11158 | 157807 | 14.1 | 3823 | 774 | 0 | 2866 | 183 | 0 | 0 | 0 | 0 | 0 |
| FA-train | 2784 | 45153 | 16.2 | 2451 | 17 | 1 | 2433 | 0 | 0 | 0 | 0 | 0 | 0 |
| FA-dev | 474 | 8923 | 18.8 | 501 | 0 | 0 | 501 | 0 | 0 | 0 | 0 | 0 | 0 |
| FA-test | 359 | 7492 | 20.8 | 501 | 0 | 0 | 501 | 0 | 0 | 0 | 0 | 0 | 0 |
| FA-Total | 3617 | 61568 | 17 | 3453 | 17 | 1 | 3435 | 0 | 0 | 0 | 0 | 0 | 0 |
| FR-train | 17225 | 432389 | 25.1 | 4550 | 1746 | 1247 | 1470 | 68 | 0 | 0 | 0 | 19 | 0 |
| FR-dev | 2236 | 56254 | 25.1 | 629 | 207 | 154 | 252 | 15 | 0 | 0 | 0 | 1 | 0 |
| FR-test | 1606 | 39489 | 24.5 | 498 | 212 | 108 | 160 | 14 | 0 | 0 | 0 | 4 | 0 |
| FR-Total | 21067 | 528132 | 25 | 5677 | 2165 | 1509 | 1882 | 97 | 0 | 0 | 0 | 24 | 0 |
| HE-train | 12106 | 237472 | 19.6 | 1236 | 519 | 0 | 545 | 113 | 59 | 0 | 0 | 0 | 0 |
| HE-dev | 3385 | 65843 | 19.4 | 501 | 258 | 0 | 148 | 61 | 34 | 0 | 0 | 0 | 0 |
| HE-test | 3209 | 65698 | 20.4 | 502 | 182 | 0 | 211 | 49 | 60 | 0 | 0 | 0 | 0 |
| HE-Total | 18700 | 369013 | 19.7 | 2239 | 959 | 0 | 904 | 223 | 153 | 0 | 0 | 0 | 0 |
| HI-train | 856 | 17850 | 20.8 | 534 | 23 | 0 | 321 | 14 | 0 | 0 | 0 | 176 | 0 |
| HI-test | 828 | 17580 | 21.2 | 500 | 38 | 0 | 320 | 12 | 0 | 0 | 0 | 130 | 0 |
| HI-Total | 1684 | 35430 | 21 | 1034 | 61 | 0 | 641 | 26 | 0 | 0 | 0 | 306 | 0 |
| HR-train | 2295 | 53486 | 23.3 | 1450 | 113 | 468 | 303 | 45 | 0 | 0 | 521 | 0 | 0 |
| HR-dev | 834 | 19621 | 23.5 | 500 | 34 | 139 | 143 | 26 | 1 | 0 | 157 | 0 | 0 |
| HR-test | 708 | 16429 | 23.2 | 501 | 33 | 118 | 131 | 31 | 0 | 0 | 188 | 0 | 0 |
| HR-Total | 3837 | 89536 | 23.3 | 2451 | 180 | 725 | 577 | 102 | 1 | 0 | 866 | 0 | 0 |
| HU-train | 4803 | 120013 | 24.9 | 6205 | 84 | 0 | 892 | 363 | 4131 | 735 | 0 | 0 | 0 |
| HU-dev | 601 | 15564 | 25.8 | 779 | 10 | 0 | 85 | 10 | 539 | 135 | 0 | 0 | 0 |
| HU-test | 755 | 20759 | 27.4 | 776 | 10 | 0 | 166 | 28 | 486 | 86 | 0 | 0 | 0 |
| HU-Total | 6159 | 156336 | 25.3 | 7760 | 104 | 0 | 1143 | 401 | 5156 | 956 | 0 | 0 | 0 |
| IT-train | 13555 | 360883 | 26.6 | 3254 | 1098 | 942 | 544 | 147 | 66 | 0 | 414 | 23 | 20 |
| IT-dev | 917 | 32613 | 35.5 | 500 | 197 | 106 | 100 | 19 | 17 | 2 | 44 | 6 | 9 |
| IT-test | 1256 | 37293 | 29.6 | 503 | 201 | 96 | 104 | 25 | 23 | 0 | 41 | 5 | 8 |
| IT-Total | 15728 | 430789 | 27.3 | 4257 | 1496 | 1144 | 748 | 191 | 106 | 2 | 499 | 34 | 37 |
| LT-train | 4895 | 90110 | 18.4 | 312 | 106 | 0 | 195 | 11 | 0 | 0 | 0 | 0 | 0 |
| LT-test | 6209 | 118402 | 19 | 500 | 202 | 0 | 284 | 14 | 0 | 0 | 0 | 0 | 0 |
| LT-Total | 11104 | 208512 | 18.7 | 812 | 308 | 0 | 479 | 25 | 0 | 0 | 0 | 0 | 0 |
| PL-train | 13058 | 220465 | 16.8 | 4122 | 373 | 1785 | 1531 | 180 | 0 | 0 | 253 | 0 | 0 |
| PL-dev | 1763 | 26030 | 14.7 | 515 | 57 | 245 | 153 | 33 | 0 | 0 | 27 | 0 | 0 |
| PL-test | 1300 | 27823 | 21.4 | 515 | 73 | 249 | 149 | 15 | 0 | 0 | 29 | 0 | 0 |
| PL-Total | 16121 | 274318 | 17 | 5152 | 503 | 2279 | 1833 | 228 | 0 | 0 | 309 | 0 | 0 |
| PT-train | 22017 | 506773 | 23 | 4430 | 882 | 689 | 2775 | 84 | 0 | 0 | 0 | 0 | 0 |
| PT-dev | 3117 | 68581 | 22 | 553 | 130 | 83 | 337 | 3 | 0 | 0 | 0 | 0 | 0 |
| PT-test | 2770 | 62648 | 22.6 | 553 | 118 | 91 | 337 | 7 | 0 | 0 | 0 | 0 | 0 |
| PT-Total | 27904 | 638002 | 22.8 | 5536 | 1130 | 863 | 3449 | 94 | 0 | 0 | 0 | 0 | 0 |
| RO-train | 42704 | 781968 | 18.3 | 4713 | 1269 | 3048 | 250 | 146 | 0 | 0 | 0 | 0 | 0 |
| RO-dev | 7065 | 118658 | 16.7 | 589 | 169 | 373 | 29 | 18 | 0 | 0 | 0 | 0 | 0 |
| RO-test | 6934 | 114997 | 16.5 | 589 | 173 | 363 | 34 | 19 | 0 | 0 | 0 | 0 | 0 |
| RO-Total | 56703 | 1015623 | 17.9 | 5891 | 1611 | 3784 | 313 | 183 | 0 | 0 | 0 | 0 | 0 |
| SL-train | 9567 | 201853 | 21 | 2378 | 500 | 1162 | 176 | 40 | 0 | 0 | 500 | 0 | 0 |
| SL-dev | 1950 | 38146 | 19.5 | 500 | 121 | 224 | 30 | 12 | 0 | 0 | 113 | 0 | 0 |
| SL-test | 1994 | 40523 | 20.3 | 500 | 106 | 245 | 35 | 13 | 0 | 0 | 101 | 0 | 0 |
| SL-Total | 13511 | 280522 | 20.7 | 3378 | 727 | 1631 | 241 | 65 | 0 | 0 | 714 | 0 | 0 |
| TR-train | 16715 | 334880 | 20 | 6125 | 3172 | 0 | 2952 | 0 | 0 | 0 | 0 | 1 | 0 |
| TR-dev | 1320 | 27196 | 20.6 | 510 | 285 | 0 | 225 | 0 | 0 | 0 | 0 | 0 | 0 |
| TR-test | 577 | 14388 | 24.9 | 506 | 233 | 0 | 272 | 0 | 0 | 0 | 0 | 1 | 0 |
| TR-Total | 18612 | 376464 | 20.2 | 7141 | 3690 | 0 | 3449 | 0 | 0 | 0 | 0 | 2 | 0 |
| Total | 280838 | 6072331 | 21.6 | 79326 | 18757 | 16198 | 28190 | 2285 | 8527 | 1156 | 3049 | 1127 | 37 |
The training, development and test data are available in our public GitLab repository. Follow the Repository link to access folders for individual languages.
All VMWE annotations (except Arabic) are available under Creative Commons licenses (see README.md files for details).
The Arabic corpus does not have an open license. Participants are required to fill in an agreement and obtain the corpus through LDC. Since it is a late addition, Arabic is considered optional this year: we will publish generic and per-category rankings for teams that address Arabic, but it will not be included in the macro-average rankings across languages.
A note for shared task participants: We cannot ensure that the test data of the current edition of the shared task do not overlap with the data published in edition 1.0. Therefore, we kindly ask participants not to use the .parsemetsv files from edition 1.0 for any language during the training or testing phase.
Tracks
System results can be submitted in two tracks:
- Closed track: Systems using only the provided training/development data in the .cupt files (VMWE annotations plus morphosyntactic data, if any) to learn VMWE identification models and/or rules.
- Open track: Systems using the provided training/development data or not, plus any additional resources deemed useful (MWE lexicons, symbolic grammars, wordnets, raw corpora, word embeddings, language models trained on external data, etc.). This track notably includes purely symbolic and rule-based systems.
Teams submitting systems in the open track will be requested to describe and provide references to all resources used at submission time. Teams are encouraged to favor freely available resources for better reproducibility of their results.
Important dates
All deadlines are at 23:59 UTC-12 (anywhere in the world).
- March 21, 2018: shared task trial data and evaluation script released
- April 4, 2018: shared task training and development data released
- April 30, 2018: shared task blind test data released
- May 8, 2018 (extended from May 4): submission of system results
- May 11, 2018: announcement of results
- May 25, 2018: submission of system description papers
- June 20, 2018: notification of acceptance
- June 30, 2018: camera-ready papers due
- August 25-26, 2018: shared task workshop co-located with LAW-MWE-CxG-2018