Important dates
(Timezone: UTC+7)
Task description
Constituency parsing is a syntactic analysis technique used in natural language processing and computational linguistics to parse sentences and represent their grammatical structure. It involves dividing sentences into smaller, hierarchical constituents or phrases, such as noun phrases (NP), verb phrases (VP), and prepositional phrases (PP). The parsed sentence is represented as a tree structure called a constituency parse tree, where each node represents a constituent and the edges represent the relationships between constituents. The parse tree provides a structural representation of the sentence, capturing both the syntactic relationships between words and the overall grammatical structure. Constituency parsing has various applications in natural language processing, including grammar checking, semantic analysis, machine translation, information extraction, and question answering. It serves as a fundamental step in many downstream tasks that require a deeper understanding of the grammatical structure of sentences.
In VLSP 2022, the first shared task on Vietnamese constituency parsing (VCP) was organized, with several groups contributing advanced models. The highest F1-score reached 85.47%, using a training dataset of 8,160 sentences. In VLSP 2023, we organize the VCP challenge for the second time.
In the context of this shared task, a constituency parser will take a sentence as input and generate a constituency tree that accurately represents the grammatical structure of the sentence.
For example, with the sentence "Nam làm bài tập", the output can be the syntax tree as follows:
                    Sentence (S)
                          |
        +-----------------+-----------------+
        |                                   |
Noun Phrase (NP)                    Verb Phrase (VP)
        |                                   |
    Noun (N)                    +-----------+-----------+
        |                       |                       |
       Nam                  Verb (V)            Noun Phrase (NP)
                                |                       |
                               làm                  Noun (N)
                                                        |
                                                     bài_tập
The VCP 2022 corpus will be revised under a modified annotation scheme. The new test dataset contains two categories of data: the in-domain test dataset, which shares the same domain as the training data, and the out-of-domain test dataset, which includes legal texts and biomedical texts.
Datasets and data format
This shared task will employ all datasets developed for VCP-VLSP 2022 as training data, with a revision of the annotation scheme. The test data will include texts in various domains.
Participants will receive sentences in bracketed-tree format, where each sentence is enclosed within <s></s> tags with an id attribute. The participants are required to submit the results in the bracketed-tree format, maintaining the same order as the provided testing data. Only POS tags and constituency tags are necessary for the syntactic trees. For example:
<s id="100">
(S (NP Tôi)
   (VP đi
       (NP (NNP Nha) (NNP Trang))
       (VP dự
           (NP hội_thảo)))
   (. .))
</s>
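A minimal sketch of how a bracketed tree of this kind could be read into a nested structure is given here. It is not the organizers' tooling: the function names `parse_tree` and `leaves` are hypothetical, and the sketch assumes that labels and words never contain whitespace or parentheses.

```python
import re


def parse_tree(text):
    """Parse one bracketed tree into (label, children).

    Each child is either a nested (label, children) tuple or a word string.
    Assumption: labels and words contain no whitespace or parentheses.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def peek():
        if pos >= len(tokens):
            raise ValueError("unbalanced parentheses: input ended early")
        return tokens[pos]

    def node():
        nonlocal pos
        if peek() != "(":
            raise ValueError(f"expected '(' but found {tokens[pos]!r}")
        pos += 1
        label = peek()
        pos += 1
        children = []
        while peek() != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("unexpected tokens after the closing parenthesis")
    return tree


def leaves(tree):
    """Return the words of a parsed tree in left-to-right order."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words
```

Calling `leaves(parse_tree(...))` on a well-formed tree recovers the word-segmented sentence, which gives a quick check that a predicted tree still covers the input tokens in order. A tree with a missing closing parenthesis raises `ValueError` instead of being silently accepted.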
The testing data consists of a collection of Vietnamese sentences that have been word segmented. Here is an example:
<s id="100"> Tôi đi Nha Trang dự hội_thảo . </s>
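As an illustration of the input and output conventions, the following sketch reads `<s id="...">` records and writes trees back in the same order. The helper names `read_sentences` and `write_submission` are hypothetical. Accepting typographic quotes around the id is an assumption made for robustness, not a documented property of the data.

```python
import re

# Matches one <s id="...">...</s> record. Both straight and typographic
# double quotes are accepted around the id (an assumption for robustness).
S_TAG = re.compile(r'<s\s+id=["“”]([^"“”]+)["“”]\s*>(.*?)</s>', re.DOTALL)


def read_sentences(text):
    """Return a list of (sentence id, list of word-segmented tokens)."""
    return [(m.group(1), m.group(2).split()) for m in S_TAG.finditer(text)]


def write_submission(records):
    """Format (sentence id, bracketed tree string) pairs, keeping their order."""
    return "\n".join(f'<s id="{sid}">\n{tree}\n</s>' for sid, tree in records)
```

Because the submission must keep the order of the provided test data, `write_submission` emits records in exactly the order it receives them, with no sorting by id.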
Evaluation metric
This shared task comprises two cases for evaluating the parsing system:
Case 1: All labels in the system's output are compared with their corresponding labels in the gold dataset.
Case 2: When multiple labels share the same span, only the label within the innermost parentheses is considered.
The submission will be evaluated against ground-truth labels using the PARSEVAL metric [2], in which precision, recall, and F1 are calculated from the number of correct constituents in the hypothesis parse compared with the reference parse.
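The two evaluation cases and the PARSEVAL scores can be sketched as follows. This is an illustrative approximation and not the official scorer. It assumes that every bracket, including POS-level brackets, counts as a constituent identified by (label, start, end), and that scores are micro-averaged over all sentences. The names `labeled_spans` and `parseval` and the `innermost_only` flag are hypothetical.

```python
import re
from collections import Counter


def parse_tree(text):
    """Parse a well-formed bracketed tree into (label, children)."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # skip ")"
        return (label, children)

    return node()


def labeled_spans(tree, innermost_only=False):
    """Count (label, start, end) constituents; end is exclusive.

    innermost_only=True approximates Case 2: when several labels cover the
    same span, only the most deeply nested one is kept.
    """
    spans = []  # filled in pre-order, so ancestors precede descendants

    def walk(node, start):
        label, children = node
        idx = len(spans)
        spans.append(None)  # reserve the pre-order slot for this node
        end = start
        for child in children:
            end = end + 1 if isinstance(child, str) else walk(child, end)
        spans[idx] = (label, start, end)
        return end

    walk(tree, 0)
    if innermost_only:
        by_span = {}
        for label, start, end in spans:
            by_span[(start, end)] = label  # deeper nodes overwrite ancestors
        spans = [(label, s, e) for (s, e), label in by_span.items()]
    return Counter(spans)


def parseval(gold_trees, pred_trees, innermost_only=False):
    """Return micro-averaged (precision, recall, F1) over paired tree strings."""
    matched = n_gold = n_pred = 0
    for gold, pred in zip(gold_trees, pred_trees):
        g = labeled_spans(parse_tree(gold), innermost_only)
        p = labeled_spans(parse_tree(pred), innermost_only)
        matched += sum((g & p).values())  # multiset intersection
        n_gold += sum(g.values())
        n_pred += sum(p.values())
    precision = matched / n_pred if n_pred else 0.0
    recall = matched / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For instance, with the gold tree `(S (NP (N Nam)) (VP (V làm) (NP (N bài_tập))))` and the prediction `(S (NP (N Nam)) (VP (V làm) (N bài_tập)))`, this sketch counts 7 gold and 6 predicted constituents with 6 matches under Case 1, giving precision 1.0 and recall 6/7. With `innermost_only=True`, the missing outer NP over `bài_tập` is no longer counted, and both trees yield the same 5 constituents.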
Organizers
Nguyễn Thị Minh Huyền, VNU University of Science
Vũ Xuân Lương, Vietlex
Hà Mỹ Linh, VNU University of Science
References
[1] Phuong-Thai Nguyen, Xuan-Luong Vu, Thi-Minh-Huyen Nguyen, Van-Hiep Nguyen and Hong-Phuong Le. Building a Large Syntactically-Annotated Corpus of Vietnamese. The 3rd Linguistic Annotation Workshop (LAW), Singapore. Pages 182-185, 2009.
[2] Dan Jurafsky and James H. Martin. Speech and Language Processing (3rd ed. draft), Chapter 13. 2021 (https://web.stanford.edu/~jurafsky/slp3/13.pdf).
Result submission
Start: Nov. 1, 2023, midnight
Nov. 17, 2023, midnight