Charles Explorer logo
🇨🇿

How to embed noncrossing trees in Universal Dependencies treebanks in a low-complexity regular language

Publikace

Tento text není v aktuálním jazyce dostupný. Zobrazuje se verze "en".Abstrakt

A recently proposed balanced-bracket encoding (Yli-Jyrä and Gómez-

Rodríguez 2017) has given us a way to embed all noncrossing dependencygraphs into the string space and to formulate their exact arcfactoredinference problem (Kuhlmann and Johnsson 2015) as the beststring problem in a dynamically constructed and weighted unambiguouscontext-free grammar. The current work improves the encodingand makes it shallower by omitting redundant brackets from it. Thestreamlined encoding gives rise to a bounded-depth subset approximationthat is represented by a small finite-state automaton. When boundedto 7 levels of balanced brackets, the automaton has 762 states andrepresents a strict superset of more than 99.9999% of the noncrossingtrees available in Universal Dependencies 2.4 (Nivre et al. 2019). Inaddition, it strictly contains all 15-vertex noncrossing digraphs. Whenbounded to 4 levels and 90 states, the automaton still captures 99.2%of all noncrossing trees in the reference dataset. The approach is flexibleand extensible towards unrestricted graphs, and it suggests tightfinite-state bounds for dependency parsing, and for the main existingparsing methods.