Overview:
- Convert1to2 - conversion from UD v1 to UD v2
- MarkBugs - check for suspicious/wrong constructions
- SetSpaceAfter - heuristically fill
SpaceAfter=Noand# textif missing in the data - SetSpaceAfterFromText -
SpaceAfter=Noaccording to# text
For using these tools, you need to install Udapi for Python
as described here,
so you have the udapy execution script in your $PATH.
This Udapi block converts UD v1 files into UD v2 files. See http://universaldependencies.org/v2/summary.html for the description of all UD v2 changes.
IMPORTANT: this code does only some of the changes and the output should be checked, see below.
-
Run from command line:
udapy -s ud.Convert1to2 < inputUD1.conllu > outputUD2.conllu
-
If you already have some conversion of your treebank into v2, you can use this script for comparing the differences, e.g.:
vimdiff outputUD2.conllu previous_outputUD2.conllu
-
It is difficult for humans to see the dependency structure in CoNLL-U format. You can use a Udapi pretty-printer
write.TextModeTrees:udapy ud.Convert1to2 write.TextModeTrees attributes=form,lemma,upos,deprel,feats,misc < inputUD1.conllu > outputUD2.txt udapy write.TextModeTrees attributes=form,lemma,upos,deprel,feats,misc < previous_outputUD2.conllu > previous_outputUD2.txt vimdiff outputUD2.txt previous_outputUD2.txt
-
Convert1to2marks unclear cases with a specialToDoattribute in the MISC column (and logging to stderr). You can find these nodes in colored pretty-printed trees with:udapy ud.Convert1to2 write.TextModeTrees attributes=form,lemma,upos,deprel,feats,misc color=1 < inputUD1.conllu | less -R
and then type
/ToDoand pressnto jump to the next occurrence.
- UPOS: CONJ→CCONJ, whatever→AUX if copula
- DEPREL: mwe→fixed, dobj→obj, pass→:pass, name→flat:name, foreign→flat(+Foreign=Yes)
- DEPREL: neg→advmod/det (plus distinguish interjections and negation particles)
- DEPREL: nmod→obl if parent is not nominal
- Reattach cc and punct (commas, semicolons) to the following conjunct
- FEATS: Negative→Polarity, Aspect=Pro→Prosp, VerbForm=Trans→Conv, Definite=Red→Cons
- FEATS: Polarity=Neg if DEPREL was neg and there is no PronType=Neg feature
- FEATS: warn if Tense=Nar or NumType=Gen
- ellipsis/gapped constructions using orphan deprels instead of remnant deprels
- text: make sure the
# textcomment is present and agrees with the tokens' forms
- Addition of enhanced dependencies.
- Changes to tokenization.
- Changes to copular constructions.
- Changes to POS tags beyond renaming CONJ to CCONJ
Unclear cases (either probable v1 annotation errors or cases where fully automatic conversion is not possible)
are marked with a special ToDo attribute in the MISC column and with a longer explanation printed to stderr.
ToDo can have these values:
cc-after-conj: there is no (sibling) conjunct after a node with deprel=cc (but there are some conjunct before)cc-without-conj: node with deprel=cc follows its parent, which has noconjchildrengen: feature NumType=Gen not allowed in UD v2more-remnant-children: a node with more than one remnant childnar: feature Tense=Nar not allowed in UD v2neg: deprel=neg but upos is not any of ADV, PART, DETnmod: deprel=nmod, but parent is ambiguous nominal/predicate, so it is not clear whether thenmodshould be changed tooblpunct-in-coord: it is not clear whether the punctuation is part of coordination (and should be reattached to the following conjunct, which could not be found)remnant: unexpected case where either one of the remnants' grandparent (=head of head) is the root or the nearest common ancestor (governor) of all remnants in the tree is the root
This is work in progress. Bug reports and code contributions are welcome via Github issues and pull requests. Questions about Udapi or this conversion script are welcome via Github issues or email.
Martin Popel (popel@ufal.mff.cuni.cz), based on v2-conversion by Sebastian Schuster.