VLSP 2022 – VTB Challenge - Vietnamese Constituency Parsing

Organized by linhhm

First phase

Public Test
Nov. 3, 2022, midnight UTC

End

Competition Ends
Nov. 15, 2022, midnight UTC

Shared Task Registration Form

Important dates

  • July 27, 2022: Registration opens.
  • October 1, 2022: Registration closes. Training data for development released.
  • October 15, 2022: Official training data released.
  • November 1, 2022: Public test set released.
  • November 3, 2022: Online challenge starts.
  • November 12, 2022: Private test set released.
  • November 14, 2022: End of challenge.
  • November 19, 2022: Deadline for the top 5 teams to submit technical reports. If a top team does not submit its report, the next-ranked teams can submit theirs and take its place (next-ranked teams are advised to write their reports in advance and submit them by this deadline).
  • November 26, 2022: Announcement of final winners, result presentations, and award ceremony (workshop day).

Task Description

Syntactic parsing is a fundamental problem in natural language processing. Syntactic information plays an important role in many applications, such as machine translation, information extraction, and question answering. Before 2015, the research community witnessed the influence and success of statistical parsing models based on probabilistic context-free grammars, following either generative or discriminative approaches. From 2015 onward, deep learning-based parsing models have brought new successes to this problem, but mainly for widely studied languages such as English and Chinese.

With the main goal of promoting research on Vietnamese parsing and creating high-performance parsers for the community, constituency parsing for Vietnamese was included as a shared task of VLSP 2022.

The task is to build a constituency parser for Vietnamese. Linguistically, constituency parsing is parsing based on a phrase structure grammar. In computational linguistics, the input to a constituency parser is a sentence and the output is a constituency tree. For example, for the sentence "Nam làm bài tập", the output can be the following syntax tree:

               Sentence (S)
                     |
         +-----------+-----------+
         |                       |
 Noun Phrase (NP)        Verb Phrase (VP)
         |                       |
     Noun (N)            +-------+-------+
         |               |               |
        Nam          Verb (V)    Noun Phrase (NP)
                         |               |
                        làm          Noun (N)
                                         |
                                      bài_tập

Participants can develop their own models or build on existing open-source parsing systems (which usually target other languages). Participants will be provided with a syntactically annotated Vietnamese corpus (the Vietnamese Treebank) [1] of about 10,000 sentences from the journalistic domain on socio-political topics. Participants may use additional resources, such as raw Vietnamese text corpora for training word embedding models, or pre-trained word embeddings. Evaluation uses the Parseval metric [2], with tools provided by the organizers. The test data consists of two parts: one in the same domain as the training data and one out of domain. The out-of-domain test set is expected to be legal or biomedical text.

Data Format and Training Data

Participants will be provided with the Vietnamese Treebank (VTB) [1], about 10,000 sentences in bracketed-tree format, as follows:

            (S (NP (N Nam)) (VP (V làm) (NP (N bài_tập))))
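
The bracketed format can be read with a small recursive parser. The following is a minimal sketch in Python, not the organizers' tooling; the function names `parse_tree` and `leaves` are chosen here for illustration, and the nested-list representation is an assumption of this sketch.

```python
import re


def parse_tree(text):
    """Parse one bracketed tree into nested lists: [label, child, ...].

    Words (leaves) are kept as plain strings. Raises ValueError on
    malformed input.
    """
    # Tokens are "(", ")" or any run of characters without spaces/brackets.
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return [label] + children

    try:
        tree = node()
    except IndexError:
        raise ValueError("unexpected end of input") from None
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree


def leaves(tree):
    """Return the words of a parsed tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1:] for word in leaves(child)]


if __name__ == "__main__":
    tree = parse_tree("(S (NP (N Nam)) (VP (V làm) (NP (N bài_tập))))")
    print(tree)
    print(leaves(tree))
```

Nested lists keep the sketch short; a real system would more likely use a dedicated tree class or an existing treebank reader.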

Part-of-speech (POS) tagset: follows the POS tagset of the Vietnamese Universal Dependencies treebank [3].

Constituency tagset:

No.  Constituency tag  Description
1    NP                Noun phrase
2    VP                Verb phrase
3    AP                Adjective phrase
4    RP                Adverb phrase
5    PP                Prepositional phrase
6    QP                Quantitative phrase
7    MDP               Modal phrase
8    UCP               Coordinated phrase in which components are not the same type
9    LST               List mark phrase
10   WHNP              Interrogative noun phrase ('ai' [who], 'cái gì' [what], 'con gì' [which])
11   WHAP              Interrogative adjective phrase ('lạnh thế nào' [cold how], 'đẹp ra sao' [beautiful how])
12   WHRP              Interrogative adverb phrase
13   WHPP              Interrogative prepositional phrase ('với ai' [with whom], 'bằng cách nào' [by method which])
14   S                 Statement sentence
15   SQ                Question sentence
16   SBAR              Subordinate clause (modifying noun, verb, and adjective)

Functional tagset:

No.  Functional tag  Description
1    H               Head of phrase
2    SUB             Subject
3    DOB             Direct object
4    IOB             Indirect object
5    TPC             Topic
6    PRD             Predicate
7    LGS             Logical subject
8    EXT             Frequency or range complement
9    VOC             Vocative
10   TMP             Temporal adjunct
11   LOC             Location adjunct
12   DIR             Direction adjunct
13   MNR             Manner adjunct
14   PRP             Purpose adjunct
15   CND             Condition adjunct
16   CNC             Concession adjunct
17   ADV             Adverbial adjunct
18   EXC             Exclamation sentence
19   CMD             Command sentence

Null-element tagset:

No.  Null-element tag  Description
1    *T*               Null element (trace within sentence)
2    *E*               Null element in ellipsis phenomenon
3    *0*               Null element in complementizer
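
Because submissions only need POS and constituency tags, it can be convenient to strip functional tags and null elements from the training trees before training. The sketch below is one possible way to do that in Python. It rests on two assumptions that this page does not confirm: that functional tags are attached to a label with a hyphen (for example `NP-SUB`), and that null elements appear as leaves spelled exactly `*T*`, `*E*`, or `*0*`. The example trees in the usage section are made up for illustration and are not taken from the treebank.

```python
import re

# Assumption: null elements appear as leaves with exactly these spellings.
NULL_ELEMENTS = {"*T*", "*E*", "*0*"}


def parse(text):
    """Parse a well-formed bracketed tree into nested lists [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # skip ")"
        return [label] + children

    return node()


def simplify(tree):
    """Drop null elements and functional tags; return None if nothing is left."""
    if isinstance(tree, str):
        return None if tree in NULL_ELEMENTS else tree
    children = [c for c in map(simplify, tree[1:]) if c is not None]
    if not children:
        return None  # the constituent only dominated null elements
    # Assumption: "NP-SUB" means constituency tag NP plus functional tag SUB.
    label = tree[0].split("-")[0] or tree[0]
    return [label] + children


def to_text(tree):
    """Serialize nested lists back to a one-line bracketed tree."""
    if isinstance(tree, str):
        return tree
    return "(" + " ".join(to_text(part) for part in tree) + ")"


if __name__ == "__main__":
    # Made-up annotated trees, for illustration only.
    for raw in ["(S (NP-SUB (N-H Nam)) (VP-PRD (V-H làm) (NP-DOB (N-H bài_tập))))",
                "(S (NP-SUB *E*) (VP-PRD (V-H làm)))"]:
        print(to_text(simplify(parse(raw))))
```

If the released data marks these tags differently, only `NULL_ELEMENTS` and the label-splitting line would need to change.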

Testing Data

The test data is a list of Vietnamese sentences that have already been segmented into words. For example:

Tôi đi Nha Trang dự hội_thảo .
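
In the example above, words are separated by spaces and the syllables of a multi-syllable word are joined with an underscore ('hội_thảo'). A minimal Python sketch for reading such input follows; it assumes one sentence per line, and the helper names `read_sentences` and `syllables` are chosen here for illustration.

```python
def read_sentences(lines):
    """Yield one list of word tokens per non-empty line.

    Assumes one sentence per line with words separated by whitespace.
    """
    for line in lines:
        tokens = line.split()
        if tokens:
            yield tokens


def syllables(token):
    """Split a word such as 'hội_thảo' into its syllables."""
    parts = [part for part in token.split("_") if part]
    return parts if parts else [token]


if __name__ == "__main__":
    for sentence in read_sentences(["Tôi đi Nha Trang dự hội_thảo ."]):
        print(sentence)
        print([syllables(word) for word in sentence])
```

A parser should keep each underscore-joined word as a single leaf; splitting into syllables is only useful for resources that work at the syllable level.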

Result submission

Participants must submit their results in bracketed-tree format, in the same order as the test data. In the syntactic trees, only POS tags and constituency tags are required.

(S (NP (PRO Tôi))
   (VP (V đi)
       (NP (NNP Nha) (NNP Trang)))
   (VP (V dự)
       (NP (N hội_thảo))))
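
Before uploading, it may help to check that there is one tree per test sentence, that brackets balance, and that the leaves of each tree match the sentence. The following Python sketch is an illustrative pre-submission check, not the organizers' validator. Note that the sample tree above has no leaf for the final '.', and this page does not state whether punctuation must appear in the tree, so the hypothetical `ignore` parameter lets the caller exclude such tokens from the comparison.

```python
import re


def tree_leaves(tree_text):
    """Return the words of a bracketed tree, or None if brackets do not balance."""
    depth = 0
    words = []
    prev = None
    for tok in re.findall(r"\(|\)|[^\s()]+", tree_text):
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
            if depth < 0:
                return None
        elif prev != "(":
            # A symbol right after "(" is a label; any other symbol is a word.
            words.append(tok)
        prev = tok
    return words if depth == 0 else None


def check_submission(sentences, trees, ignore=frozenset()):
    """Return a list of (index, message) problems; empty means none were found."""
    problems = []
    if len(sentences) != len(trees):
        problems.append((-1, "expected %d trees, got %d" % (len(sentences), len(trees))))
    for i, (sentence, tree) in enumerate(zip(sentences, trees)):
        words = tree_leaves(tree)
        if words is None:
            problems.append((i, "unbalanced brackets"))
            continue
        expected = [t for t in sentence.split() if t not in ignore]
        got = [w for w in words if w not in ignore]
        if got != expected:
            problems.append((i, "leaves differ from the test sentence"))
    return problems


if __name__ == "__main__":
    sentence = "Tôi đi Nha Trang dự hội_thảo ."
    tree = ("(S (NP (PRO Tôi)) (VP (V đi) (NP (NNP Nha) (NNP Trang))) "
            "(VP (V dự) (NP (N hội_thảo))))")
    print(check_submission([sentence], [tree], ignore={"."}))
```

The check is deliberately strict about word order, since results must follow the order of the test data.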

Evaluation Metric

Submissions will be evaluated against the ground-truth trees using the Parseval metric [1].

[1] Dan Jurafsky and James H. Martin. Speech and Language Processing (3rd ed. draft), Chapter 13. 2021 (https://web.stanford.edu/~jurafsky/slp3/13.pdf).
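
Parseval compares the labeled constituents of a predicted tree with those of the gold tree and reports precision, recall, and F1. The Python sketch below illustrates the idea only; the official scores come from the tools provided by the organizers, whose conventions may differ. As an assumption of this sketch, POS-level nodes (a tag directly over a single word) are not counted as constituents, and input trees are assumed to be well-formed.

```python
import re
from collections import Counter


def labeled_spans(tree_text):
    """Count (label, start, end) for every phrase-level node of a well-formed tree."""
    spans = Counter()
    stack = []  # entries: [label, index of first word, has a subtree child]
    n_words = 0
    prev = None
    for tok in re.findall(r"\(|\)|[^\s()]+", tree_text):
        if tok == "(":
            if stack:
                stack[-1][2] = True  # the enclosing node has a subtree child
            stack.append([None, n_words, False])
        elif tok == ")":
            label, start, is_phrase = stack.pop()
            if is_phrase:  # skip POS-level nodes (assumption of this sketch)
                spans[(label, start, n_words)] += 1
        elif prev == "(":
            stack[-1][0] = tok  # first symbol after "(" is the node label
        else:
            n_words += 1  # any other symbol is a word
        prev = tok
    return spans


def parseval(gold_trees, pred_trees):
    """Return corpus-level labeled precision, recall, and F1."""
    matched = n_gold = n_pred = 0
    for gold, pred in zip(gold_trees, pred_trees):
        g, p = labeled_spans(gold), labeled_spans(pred)
        matched += sum((g & p).values())  # multiset intersection
        n_gold += sum(g.values())
        n_pred += sum(p.values())
    precision = matched / n_pred if n_pred else 0.0
    recall = matched / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = ["(S (NP (N Nam)) (VP (V làm) (NP (N bài_tập))))"]
    pred = ["(S (NP (N Nam)) (VP (V làm)) (NP (N bài_tập)))"]
    print(parseval(gold, pred))
```

In the usage example, the predicted tree attaches the object outside the verb phrase, so three of its four constituents match the gold tree.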

General rules

  • Right to cancel, modify, or disqualify. The Competition Organizer reserves the right at its sole discretion to terminate, modify, or suspend the competition.

  • By submitting results to this competition, you consent to the public release of your scores at the Competition workshop and in the associated proceedings, at the task organizers' discretion. Scores may include, but are not limited to, automatic and manual quantitative judgments, qualitative judgments, and such other metrics as the task organizers see fit. You accept that the ultimate decision on metric choice and score value rests with the task organizers.

  • By joining the competition, you accept the terms and conditions of the Terms of Participation and Data Use Agreement of the VLSP 2022 - VTB shared task, which have been sent to your email.

  • By joining the competition, you affirm and acknowledge that you agree to comply with applicable laws and regulations, that you will not infringe upon any copyright, intellectual property, or patent of another party in the software you develop in the course of the competition, and that you will not breach any applicable laws and regulations related to export control and data privacy and protection.

  • Prizes are subject to the Competition Organizer’s review and verification of the entrant’s eligibility and compliance with these rules as well as the compliance of the winning submissions with the submission requirements.

  • Participants grant the Competition Organizer the right to use their winning submissions, and the source code and data created for and used to generate those submissions, for any purpose whatsoever and without further approval.

Eligibility

  • Each participant must create an AIHub account to submit their solution for the competition. Only one account per user is allowed.

  • The competition is public, but the Competition Organizer may elect to disallow participation according to its own considerations.

  • The Competition Organizer reserves the right to disqualify any entrant from the competition if, in the Competition Organizer’s sole discretion, it reasonably believes that the entrant has attempted to undermine the legitimate operation of the competition through cheating, deception, or other unfair playing practices.

Team

  • Participants are allowed to form teams. 

  • You may not participate in more than one team. Each team member must be a single individual operating a separate AIHub account. 

Submission

  • Submissions are void if they are in whole or part illegible, incomplete, damaged, altered, counterfeit, obtained through fraudulent means, or late. The Competition Organizer reserves the right, in its sole discretion, to disqualify any entrant who makes a submission that does not adhere to all requirements.

Data

By downloading or otherwise accessing the data provided by the Competition Organizer in any manner, you agree to the following terms:

  • You will not distribute the data except for non-commercial, academic research purposes.

  • You will not distribute, copy, reproduce, disclose, assign, sublicense, embed, host, transfer, sell, trade, or resell any portion of the data provided by the Competition Organizer to any third party for any purpose.

  • The data must not be used for providing surveillance, analyses or research that isolates a group of individuals or any single individual for any unlawful or discriminatory purpose.

  • You accept full responsibility for your use of the data and shall defend and indemnify the Competition Organizer against any and all claims arising from your use of the data.

Public Test

Start: Nov. 3, 2022, midnight

Private Test

Start: Nov. 12, 2022, midnight

Competition Ends

Nov. 15, 2022, midnight
