CUPT format specification

The PARSEME corpora and PARSEME shared task systems (starting at edition 1.1) use a format called cupt, standing for CoNLL-U + parseme-TSV. It corresponds to a single-file merge of CoNLL-U and parsemetsv (used in edition 1.0). The PARSEME shared task 1.2 trial corpus contains examples of the cupt format.

In cooperation with Universal Dependencies (UD), we specify below how one can extend UD's CoNLL-U format. Then, we specify the cupt format and in particular the syntax and semantics of the extra PARSEME:MWE column. The cupt format was the starting point for the CoNLL-U Plus format, with which it remains mostly compatible (the differences are listed below).

Extended CoNLL-U format (CoNLL-U Plus Format)

We define a way of extending the CoNLL-U format to include information from other initiatives such as PARSEME. These specifications cover not only UD and PARSEME but any text encoded in CoNLL-U and reused by any initiative, since other initiatives may use the CoNLL-U format to represent non-UD data (e.g. automatically parsed corpora, or treebanks that respect the CoNLL-U column semantics but not the UD tagset). The generalized version of the extended CoNLL-U format is described on UD's page on the CoNLL-U Plus format.

An extended CoNLL-U file follows the same rules as CoNLL-U files (text in UTF-8, LF as line break, tab-separated columns, 1 blank line after each sentence including the last, #-headed comments before each sentence, etc.). It can contain any number of columns from a CoNLL-U file in any order, merged with initiative-specific columns. Each initiative standardizes:

Its acronym, used as prefix of all initiative-specific column names, for instance, PARSEME
The names of columns, for instance, the MWE-dedicated column in PARSEME is called PARSEME:MWE
The syntax and semantics of initiative-specific columns, with the following constraints:
- The underscore '_', when it occurs alone in a field, is reserved for underspecified annotations. It can be used in incomplete annotations or in blind versions of the annotated files.
- The star '*', when it occurs alone in a field, is reserved for empty annotations, which are different from underspecified. This concerns sporadic annotations (where not necessarily all words receive an annotation, as opposed to e.g. part-of-speech tags in UPOS).
- The use of underscore '_' and star '*' is unconstrained when they occur with other characters (e.g. in names of features or values as in spec_char=*).
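The distinction between the two reserved values can be sketched in a few lines of Python. This is only an illustration: the function name `field_status` and the three labels it returns are hypothetical and are not part of the specification.

```python
def field_status(value: str) -> str:
    """Classify the content of one initiative-specific field.

    The labels returned here are illustrative names, not defined by the format.
    """
    if value == "_":
        return "underspecified"  # annotation missing or hidden (blind file)
    if value == "*":
        return "empty"           # annotated, but nothing to report for this word
    # Any other string is a real annotation, even if it contains '_' or '*'.
    return "annotated"


for value in ["_", "*", "spec_char=*"]:
    print(value, "->", field_status(value))
```

Only a field consisting of exactly one reserved character is special, so `spec_char=*` is treated as an ordinary annotation.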

The first line, that is, the first comment of the first sentence, specifies the names and order of the corresponding CoNLL-U columns (note that UPOSTAG was renamed to UPOS and XPOSTAG to XPOS) and of the initiative-specific columns, for example:

# global.columns = ID FORM PARSEME:MWE
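A reader could recover the column layout from this first line roughly as follows. This is a minimal sketch; the helper name `parse_global_columns` is hypothetical, and it assumes column names are separated by whitespace, as in the example above.

```python
from typing import List


def parse_global_columns(first_line: str) -> List[str]:
    """Return the column names declared in a '# global.columns = ...' comment."""
    prefix = "# global.columns ="
    if not first_line.startswith(prefix):
        raise ValueError("the first line must declare global.columns")
    # Column names are separated by whitespace (assumption based on the example).
    return first_line[len(prefix):].split()


columns = parse_global_columns("# global.columns = ID FORM PARSEME:MWE")
print(columns)                       # ['ID', 'FORM', 'PARSEME:MWE']
print(columns.index("PARSEME:MWE"))  # 2
```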

Two mandatory metadata fields are included before each sentence, in the form of comments:

text contains the original text and conforms to the CoNLL-U text comment
CoNLL-U's sent_id is replaced with source_sent_id, containing three parts separated by spaces:
# source_sent_id = prefix-uri file-path-under-root sentence-id
1. prefix-uri: a URI permanently referring to a source containing the same version of the corpus, in CoNLL-U format.
  - For sentences coming from a UD treebank, the URI of the corresponding UD release, for example:
    - http://hdl.handle.net/11234/1-1983 for UD 2.0.
    - http://hdl.handle.net/11234/1-2515 for UD 2.1.
  - For sentences from other resources, a permanent/immutable URL to an official website from which the original resource can be downloaded, for example:
    - http://hdl.handle.net/11372/LRT-2282 for the PARSEME shared task 1.0 corpora.
    - http://hdl.handle.net/11372/LRT-2842 for the PARSEME shared task 1.1 corpora.
    - https://gitlab.com/parseme/sharedtask-data/tree/a762bcde22b08740f006a4ae0272b63fd6ff5074 is also a valid prefix-uri for the PARSEME shared task 1.0 corpora, but we recommend the handle.net version above instead.
    - ~~https://gitlab.com/parseme/sharedtask-data~~ should not be used, as a git repository by itself is not permanent/immutable.
  - For sentences from a local corpus (e.g. stored within the extended CoNLL-U file, or in a local CoNLL-U file) or if there is no source treebank (there are only initiative-specific columns and no UD-related information), the prefix-uri is a single period '.'.
2. file-path-under-root: the relative path inside the release folder pointed to by prefix-uri, that is, zero or more directory names followed by a filename, all separated by '/'.
  - For sentences coming from a UD treebank, the filename in the corresponding release, for example:
    - UD_German/de-ud-train.conllu points to the training file of the German UD 2.1 treebank.
    - UD_Portuguese-GSD/pt_br-ud-dev.conllu points to the dev file of the Portuguese-GSD UD 2.5 treebank.
  - For sentences from other resources, the actual path to the corpus file under a given directory pointed to by prefix-uri.
  - If prefix-uri uniquely identifies exactly one file, the file-path-under-root must be a single period '.'.
3. sentence-id: a unique sentence identifier in the whole corpus.
  - For sentences coming from a UD treebank, the same sent_id as in the corresponding CoNLL-U file, for example:
    - fr-ud-train_10542 is a sentence in the French training corpus in UD 2.1.
    - sv-ud-test-143 is a sentence in the Swedish test corpus in UD 2.1.
  - For sentences from other resources, a new unique identifier not containing any whitespace or slash '/'.
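The three-part structure described above can be checked with a short Python sketch. The names `SourceSentId` and `parse_source_sent_id` are hypothetical, and the sentence identifiers used in the example (`example-sentence-1`, `toy-1`) are invented for illustration.

```python
from typing import NamedTuple


class SourceSentId(NamedTuple):
    prefix_uri: str
    file_path_under_root: str
    sentence_id: str


def parse_source_sent_id(comment: str) -> SourceSentId:
    """Split a '# source_sent_id = ...' comment into its three parts."""
    key, sep, value = comment.lstrip("#").partition("=")
    if not sep or key.strip() != "source_sent_id":
        raise ValueError("not a source_sent_id comment")
    parts = value.split()  # the three parts are separated by spaces
    if len(parts) != 3:
        raise ValueError("source_sent_id must have exactly three parts")
    return SourceSentId(*parts)


# A sentence taken from a UD release (the sentence-id is made up).
ud = parse_source_sent_id(
    "# source_sent_id = http://hdl.handle.net/11234/1-2515 "
    "UD_German/de-ud-train.conllu example-sentence-1"
)
print(ud.file_path_under_root)  # UD_German/de-ud-train.conllu

# A sentence from a local corpus: both prefix-uri and path are a single period.
local = parse_source_sent_id("# source_sent_id = . . toy-1")
print(local.prefix_uri, local.sentence_id)  # . toy-1
```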

The extended CoNLL-U file may contain free comments, preceded by a hash #. If the extended CoNLL-U file has the same metadata and comments before each sentence, and sentences in the same order as in the original CoNLL-U file, then both files can be easily aligned. This is, however, not a requirement, as sentences do not need to appear in the same order and can have other comments in addition to the mandatory text and source_sent_id metadata comments.

There are two differences between cupt and CoNLL-U Plus files. First, the metadata field source_sent_id is mandatory in cupt and has three parts, whereas in CoNLL-U Plus it is optional and has four parts, the first of which specifies the format of the source file. Second, cupt files do not necessarily contain a sent_id field, whereas this field is mandatory in CoNLL-U Plus.

Syntax of the `PARSEME:MWE` column

A cupt file is an instantiation of the extended CoNLL-U file format for the PARSEME Shared Task. It includes all 10 columns of a CoNLL-U file in the same order, plus an 11th column called PARSEME:MWE.
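As an illustration of this layout, the following Python sketch reads token lines into dictionaries keyed by the 11 column names. It is not the official reader: the function name `read_sentences` is hypothetical, the toy sentence and its annotation are invented, and multiword-token or empty-node lines receive no special handling.

```python
from typing import Dict, Iterable, Iterator, List

CUPT_COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS",
                "HEAD", "DEPREL", "DEPS", "MISC", "PARSEME:MWE"]


def read_sentences(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of {column name: value} dictionaries."""
    tokens: List[Dict[str, str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                  # a blank line closes the sentence
            if tokens:
                yield tokens
                tokens = []
        elif line.startswith("#"):    # metadata and free comments are skipped
            continue
        else:
            fields = line.split("\t")
            if len(fields) != len(CUPT_COLUMNS):
                raise ValueError("expected 11 tab-separated columns")
            tokens.append(dict(zip(CUPT_COLUMNS, fields)))
    if tokens:                        # tolerate a missing final blank line
        yield tokens


# Toy sentence; the annotation is invented for the example.
toy = [
    "# global.columns = " + " ".join(CUPT_COLUMNS),
    "# source_sent_id = . . toy-1",
    "# text = They gave up",
    "\t".join(["1", "They", "they", "PRON", "_", "_", "2", "nsubj", "_", "_", "*"]),
    "\t".join(["2", "gave", "give", "VERB", "_", "_", "0", "root", "_", "_", "1:VPC.full"]),
    "\t".join(["3", "up", "up", "ADP", "_", "_", "2", "compound:prt", "_", "_", "1"]),
    "",
]
sentences = list(read_sentences(toy))
print([(t["FORM"], t["PARSEME:MWE"]) for t in sentences[0]])
# [('They', '*'), ('gave', '1:VPC.full'), ('up', '1')]
```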

The PARSEME:MWE column encodes information about verbal multiword expressions (VMWEs) present in a sentence. It is very similar to the fourth column of edition 1.0's parsemetsv format:

It contains a star '*' if the word in the current line is not part of a VMWE, or if the current line describes a multiword token (e.g. 2-3 don't).
It contains an underscore '_' if this information is underspecified (e.g. in the blind test corpus).
It contains a list of semicolon-separated VMWE codes if the current word is part of one or more VMWEs. VMWE codes are only assigned to the lexicalized components of a VMWE (see Lexicalized components and open slots in the annotation guidelines).
- If the current line contains the first lexicalized component of the VMWE in the sentence, the VMWE code consists of a VMWE identifier followed by a colon ':' and a VMWE category label, for example: 1:VID
  - VMWE identifiers are integers starting from 1 for each new sentence, and increased by 1 for each new VMWE.
  - VMWE category labels are strings corresponding to the category of the VMWE (see VMWE categories in the annotation guidelines). The following VMWE category labels are allowed in shared task 1.1:
    - LVC.full -- light-verb constructions, the verb only adds meaning expressed by its morphology, e.g. to give a lecture (edition 1.0: LVC).
    - LVC.cause -- light-verb constructions, the verb adds a causative meaning to the noun, e.g. to grant rights
    - VID -- verbal idioms, e.g. to go bananas (edition 1.0: ID).
    - IRV -- inherently reflexive verbs, e.g. to help oneself to the cookies (edition 1.0: IReflV).
    - VPC.full -- fully non-compositional verb-particle constructions, the particle totally changes the meaning of the verb, e.g. to do in (edition 1.0: VPC).
    - VPC.semi -- semi-compositional verb-particle constructions, the particle adds a partly predictable but non-spatial meaning to the verb, e.g. to eat up.
    - MVC -- multi-verb constructions, e.g. to make do.
    - IAV -- inherently adpositional verbs, e.g. to come across.
    - LS.ICV -- inherently clitic verbs: this language-specific category is used only in Italian.
- If the current line contains a lexicalized component of the VMWE which is not the first one in the sentence, the VMWE code contains only the VMWE identifier, as described above, and no VMWE category label.
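The rules above can be turned into a small decoder that groups the tokens of a sentence by VMWE. This is a sketch under stated assumptions: the function name `collect_vmwes` and the toy rows are hypothetical, and the input is assumed to be a list of (token ID, PARSEME:MWE value) pairs for one sentence.

```python
from typing import Dict, List, Optional, Tuple


def collect_vmwes(
    rows: List[Tuple[str, str]]
) -> Dict[int, Tuple[Optional[str], List[str]]]:
    """Map each VMWE identifier to (category label, IDs of its lexicalized tokens)."""
    categories: Dict[int, str] = {}
    members: Dict[int, List[str]] = {}
    for token_id, field in rows:
        if field in ("*", "_"):            # empty or underspecified: no VMWE code
            continue
        for code in field.split(";"):      # a word may belong to several VMWEs
            ident, _, category = code.partition(":")
            vmwe_id = int(ident)
            members.setdefault(vmwe_id, []).append(token_id)
            if category:                   # only the first component has a label
                categories[vmwe_id] = category
    return {i: (categories.get(i), toks) for i, toks in members.items()}


# Toy rows: token 2 is the first component of two overlapping VMWEs.
rows = [("1", "*"), ("2", "1:LVC.full;2:VID"), ("3", "2"), ("4", "1")]
print(collect_vmwes(rows))
# {1: ('LVC.full', ['2', '4']), 2: ('VID', ['2', '3'])}
```

A VMWE whose label is `None` in the result would indicate that no component carried a category, which the rules above do not allow in a fully annotated file.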

Examples of cupt files can be found in the shared task 1.2 trial corpus.

Tools

To check that a file conforms to the cupt format, use the validation script as follows:

./validate_cupt.py --input your-file.cupt

To facilitate the upgrade of tools dealing with the old CoNLL-U+parsemetsv pair of files, we provide a script that converts a parsemetsv file into a cupt file: parsemetsv2cupt.py. Note that these scripts rely on libraries present in the GitLab repository: one must download the whole repository to be able to use the scripts.

LAW-MWE-CxG 2018 (COLING): Format specification