VLSP 2025 – viTempQA Date-Arith Challenge: Vietnamese Temporal Question Answering
Important dates
- June 23, 2025: Registration open
- July 6, 2025: Training data release
- July 15, 2025: Public test release
- August 20, 2025: System submission deadline
- August 30, 2025: Private test results release
- September 10, 2025: Technical report submission
- September 27, 2025: Notification of acceptance
- October 3, 2025: Camera-ready deadline
- October 29-30, 2025: Conference dates
Task Description
Objective: Build a system to answer temporal questions in Vietnamese across two sub-tasks: Date Arithmetic (date-arith) and Duration Question Answering (durationQA). The system must extract and reason about temporal information to produce accurate answers concerning dates, durations, and temporal relationships.
- Sub-Task 1: Date Arithmetic (date-arith)
Description: The date-arith sub-task focuses on handling questions related to date calculations, such as adding or subtracting time intervals from a given date. This involves understanding and manipulating time expressions to compute answers based on the provided context.
Focus: Parse and manipulate temporal expressions to compute new dates.
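To illustrate the kind of date manipulation this sub-task requires, here is a minimal Python sketch (the function name and month-index convention are our own, not part of the task definition) that shifts a (year, month) pair backwards by a given offset:

```python
def shift_month(year, month, years_back, months_back):
    """Shift a (year, month) pair backwards by the given year/month offsets.

    Months are 1-based (January = 1). The pair is converted to a single
    month count, shifted, and converted back.
    """
    total = year * 12 + (month - 1) - (years_back * 12 + months_back)
    return total // 12, total % 12 + 1
```

For instance, `shift_month(1297, 6, 1, 2)` returns `(1296, 4)`, i.e. April 1296 (Tháng 4, 1296).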
Evaluation
System performance will be evaluated using a range of standard metrics, including Accuracy, Exact Match, Precision, Recall, and F1-score:
Evaluation Metrics
- Accuracy: Used for Sub-Task 1 (Date Arithmetic). It is the percentage of system answers that exactly match the ground-truth answers.
Evaluation is performed separately for each sub-task. The final evaluation report includes individual scores as well as aggregate performance across all tasks.
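The exact-match accuracy described above can be sketched as follows (the function and argument names are illustrative; the official scorer may additionally normalize whitespace or casing before comparison):

```python
def accuracy(predictions, gold_answers):
    """Fraction of predictions that exactly match one of the gold answers.

    predictions  : list of predicted answer strings, one per question
    gold_answers : list of lists of acceptable ground-truth strings
    """
    correct = sum(1 for pred, answers in zip(predictions, gold_answers)
                  if pred in answers)
    return correct / len(gold_answers)
```

A prediction counts as correct only if it matches a ground-truth answer string exactly.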
Example for Sub-Task 1: Date Arithmetic
Input:
{
  "question": "Thời gian 1 năm và 2 tháng trước tháng 6, 1297 là khi nào?",
  "context": "",
  "answer": ["Tháng 4, 1296"]
}
(The question asks: "When is the time 1 year and 2 months before June 1297?"; the gold answer is "April 1296".)
System Prediction:
["Tháng 4, 1296"]
- Accuracy: The prediction matches the ground-truth exactly.
→ Accuracy = 1.0
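The worked example above can be reproduced end to end with a toy solver. The regular expressions below assume only the single question pattern shown ("X năm và Y tháng trước tháng M, YYYY") and are purely illustrative, not a description of the actual dataset's phrasing:

```python
import re

def answer_date_arith(question):
    """Toy solver for questions of the form
    'X năm và Y tháng trước tháng M, YYYY' (hypothetical pattern).

    Extracts the year/month offsets and the base date, then shifts the
    base date backwards by the combined offset.
    """
    years = int(m.group(1)) if (m := re.search(r"(\d+) năm", question)) else 0
    months = int(m.group(1)) if (m := re.search(r"(\d+) tháng", question)) else 0
    base = re.search(r"tháng (\d+), (\d+)", question)
    month, year = int(base.group(1)), int(base.group(2))
    # Convert to an absolute month count, shift, and convert back.
    total = year * 12 + (month - 1) - (years * 12 + months)
    return f"Tháng {total % 12 + 1}, {total // 12}"
```

On the example question this yields "Tháng 4, 1296", matching the ground truth exactly, so Accuracy = 1.0.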
References
- Chu, Zheng, et al. "Timebench: A comprehensive evaluation of temporal reasoning abilities in large language models." arXiv preprint arXiv:2311.17667 (2023).
- Tan, Qingyu, Hwee Tou Ng, and Lidong Bing. "Towards benchmarking and improving the temporal reasoning capability of large language models." arXiv preprint arXiv:2306.08952 (2023).
- Virgo, Felix, Fei Cheng, and Sadao Kurohashi. "Improving event duration question answering by leveraging existing temporal information extraction data." Proceedings of the Thirteenth Language Resources and Evaluation Conference. 2022.