VLSP 2022 – VTB Challenge: Vietnamese Constituency Parsing
Syntactic parsing is a fundamental problem in natural language processing. Syntactic information plays an important role in many applications, such as machine translation, information extraction, and question answering. Before 2015, the research community witnessed the influence and success of statistical parsing models based on probabilistic context-free grammars, following generative or discriminative approaches. From 2015 onward, deep learning-based parsing models have brought new successes to this problem, but mainly for widely studied languages such as English and Chinese.
With the main goal of promoting research on Vietnamese parsing and creating high-performance parsers for the community, the constituency parsing problem for Vietnamese was included in the shared tasks of the VLSP 2022 conference.
The task is to build a constituency parser for Vietnamese. Linguistically, constituency parsing is parsing based on a phrase structure grammar. In computational linguistics, the input to a constituency parser is a sentence, and the output is a constituency tree. For example, for the sentence "Nam làm bài tập", the output can be the following syntax tree:
                  Sentence (S)
                        |
          +-------------+-------------+
          |                           |
  Noun Phrase (NP)            Verb Phrase (VP)
          |                           |
      Noun (N)             +----------+-----------+
          |                |                      |
         Nam           Verb (V)           Noun Phrase (NP)
                           |                      |
                          làm                 Noun (N)
                                                  |
                                               bài_tập
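For readers who want to handle such trees in code, here is a minimal sketch. It assumes the tree is written as a Penn Treebank-style bracketed string; this notation, and the helper names `parse_brackets` and `leaves`, are illustrative assumptions and not necessarily the format of the provided corpus.

```python
def parse_brackets(text):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "a node must start with '('"
        pos += 1
        label = tokens[pos]  # e.g. S, NP, VP, N, V
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())  # nested constituent
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return read_node()


def leaves(node):
    """Return the words of the tree in left-to-right order."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1]:
        words.extend(leaves(child))
    return words


# The example tree above, written in the assumed bracketed notation.
tree = parse_brackets("(S (NP (N Nam)) (VP (V làm) (NP (N bài_tập))))")
sentence = leaves(tree)
```

Here `tree[0]` is the root label and `sentence` holds the words in order, with the two-syllable word kept as the single token `bài_tập`, as in the tree above.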
Participants can develop their own models or build on existing open-source parsing systems (usually built for other languages). Participants will be provided with a syntactically annotated Vietnamese corpus (Vietnamese Treebank) [1] of about 10,000 sentences from the journalistic domain, covering socio-political topics. Participants can use additional resources, such as raw Vietnamese text corpora for training word embedding models for their parser, or pre-trained word embeddings. The evaluation method is Parseval [2] (tools are provided). The test data consists of two parts: an in-domain test set from the same domain as the training data, and an out-of-domain test set. The out-of-domain test set is expected to consist of legal or biomedical text.
Result submission
Submissions will be evaluated against the ground-truth labels using the Parseval metric [1].
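As a rough illustration of what Parseval measures, the sketch below computes labeled bracket precision, recall, and F1 for one sentence. It is a simplified approximation, not the official scoring tool: it assumes trees are given as bracketed strings, it ignores part-of-speech nodes that directly dominate a word, and the function names `labeled_spans` and `parseval` are hypothetical. The provided tools may treat details such as punctuation or label normalization differently.

```python
from collections import Counter


def labeled_spans(tree_text):
    """Collect (label, start, end) for every phrase-level constituent."""
    tokens = tree_text.replace("(", " ( ").replace(")", " ) ").split()
    spans = Counter()
    stack = []  # entries: [label, start word index, number of bracketed children]
    word_index = 0
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            if stack:
                stack[-1][2] += 1  # the parent gains a bracketed child
            stack.append([tokens[i + 1], word_index, 0])
            i += 2  # skip "(" and the label
        elif tok == ")":
            label, start, n_children = stack.pop()
            if n_children > 0:  # skip POS nodes that only dominate a word
                spans[(label, start, word_index)] += 1
            i += 1
        else:
            word_index += 1  # a word
            i += 1
    return spans


def parseval(gold_text, pred_text):
    """Return labeled precision, recall, and F1 for one sentence."""
    gold = labeled_spans(gold_text)
    pred = labeled_spans(pred_text)
    matched = sum((gold & pred).values())  # constituents found in both trees
    precision = matched / sum(pred.values()) if pred else 0.0
    recall = matched / sum(gold.values()) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


gold = "(S (NP (N Nam)) (VP (V làm) (NP (N bài_tập))))"
# A made-up parser output that attaches the object NP to S instead of VP.
pred = "(S (NP (N Nam)) (VP (V làm)) (NP (N bài_tập)))"
precision, recall, f1 = parseval(gold, pred)
```

In this made-up example, three of the four predicted constituents match the gold tree (the VP span differs), so precision, recall, and F1 are all 0.75. Corpus-level scores are normally obtained by summing the matched, predicted, and gold counts over all sentences before dividing.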
[1] Dan Jurafsky and James H. Martin. Speech and Language Processing (3rd ed. draft), Chapter 13. 2021 (https://web.stanford.edu/~jurafsky/slp3/13.pdf).
Start: Nov. 12, 2023, midnight
Nov. 30, 2024, midnight