VLSP 2025 – viTempQA Duration Challenge: Vietnamese Temporal Question Answering
Important Dates
- June 23, 2025: Registration open
- July 6, 2025: Training data release
- July 15, 2025: Public test release
- August 20, 2025: System submission deadline
- August 30, 2025: Private test results release
- September 10, 2025: Technical report submission
- September 27, 2025: Notification of acceptance
- October 3, 2025: Camera-ready deadline
- October 29-30, 2025: Conference dates
Task Description
Objective: Build a system to answer temporal questions in Vietnamese across two sub-tasks: Date Arithmetic (date-arith) and Duration Question Answering (durationQA). The system must extract and reason about temporal information to provide accurate answers related to dates, durations, and temporal relationships.
- Sub-Task 1: Date Arithmetic (date-arith)
Description: Compute the date obtained by applying a temporal offset (e.g., "1 year and 2 months before") to a reference date, producing the resulting month and year.
- Sub-Task 2: Duration Question Answering (durationQA)
Description: Answer questions about the duration of events or actions based on a given context. The system must extract duration-related information from text and use real-world knowledge to evaluate answer options, determining how long an event or action lasts.
Focus: Identify explicit or implied durations in the context (e.g., "6 years") and apply real-world reasoning to classify options as correct ("yes") or incorrect ("no") based on factual accuracy.
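As a toy illustration of this kind of real-world duration reasoning (not the official baseline; `UNIT_MINUTES`, `to_minutes`, and `label_options` are hypothetical names), one could normalize each option to minutes and accept those falling inside a plausible range for the event:

```python
import re

# Minutes per Vietnamese time unit (assumed vocabulary, not exhaustive;
# month and year values are approximations).
UNIT_MINUTES = {
    "phút": 1,               # minute
    "giờ": 60,               # hour
    "ngày": 60 * 24,         # day
    "tuần": 60 * 24 * 7,     # week
    "tháng": 60 * 24 * 30,   # month (approx.)
    "năm": 60 * 24 * 365,    # year (approx.)
}

def to_minutes(duration: str) -> float:
    """Parse a duration option like '30 phút' or '2 giờ' into minutes."""
    match = re.match(r"\s*(\d+(?:[.,]\d+)?)\s*(\S+)", duration)
    if not match:
        raise ValueError(f"unparseable duration: {duration!r}")
    value = float(match.group(1).replace(",", "."))
    unit = match.group(2).lower()
    return value * UNIT_MINUTES[unit]

def label_options(options, plausible_min, plausible_max):
    """Label an option 'yes' iff its length falls in the plausible range
    (in minutes) inferred for the event in the context."""
    return ["yes" if plausible_min <= to_minutes(o) <= plausible_max else "no"
            for o in options]
```

For a bicycle-repair context, a plausible range of roughly 10 minutes to 4 hours would label `["30 phút", "1 tháng", "10 phút", "2 giờ"]` as `["yes", "no", "yes", "yes"]`.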
Evaluation
System performance will be evaluated using a range of standard metrics, including Accuracy, Exact Match, Precision, Recall, and F1-score:
Evaluation Metrics
- Accuracy: Used for Sub-Task 1 (Date Arithmetic). The fraction of questions for which the predicted answer exactly matches the ground-truth answer.
- Exact Match: Used for Sub-Task 2 (DurationQA). It evaluates whether the predicted label sequence matches the ground-truth label sequence exactly.
- Precision: Ratio of correctly predicted "yes" answers to total "yes" predictions made by the system.
- Recall: Ratio of correctly predicted "yes" answers to total actual "yes" answers in the ground truth.
- F1-score: Harmonic mean of Precision and Recall, summarizing overall performance.
Evaluation is performed separately for each sub-task. The final evaluation report includes individual scores as well as aggregate performance across all tasks.
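A per-question scorer consistent with the definitions above could look like the following sketch (`duration_qa_scores` is a hypothetical name; the official scorer may differ):

```python
def duration_qa_scores(pred, gold):
    """Exact Match plus Precision/Recall/F1 over 'yes' labels
    for one DurationQA question."""
    assert len(pred) == len(gold)
    exact_match = float(pred == gold)
    pred_yes = sum(p == "yes" for p in pred)          # predicted positives
    gold_yes = sum(g == "yes" for g in gold)          # actual positives
    true_yes = sum(p == g == "yes" for p, g in zip(pred, gold))
    precision = true_yes / pred_yes if pred_yes else 0.0
    recall = true_yes / gold_yes if gold_yes else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return exact_match, precision, recall, f1
```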
Example for Sub-Task 1: Date Arithmetic
Input:
{
  "question": "Thời gian 1 năm và 2 tháng trước tháng 6, 1297 là khi nào?",
  "context": "",
  "answer": ["Tháng 4, 1296"]
}
System Prediction:
["Tháng 4, 1296"]
- Accuracy: The prediction matches the ground-truth exactly.
→ Accuracy = 1.0
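The calendar arithmetic behind this example ("1 year and 2 months before June 1297") can be sketched with simple month indexing; `shift_month` is an illustrative helper, not part of any released tooling:

```python
def shift_month(year: int, month: int, delta_months: int):
    """Shift (year, month) by delta_months (negative = earlier)."""
    index = year * 12 + (month - 1) + delta_months
    return index // 12, index % 12 + 1

# 1 year and 2 months = 14 months before June 1297
year, month = shift_month(1297, 6, -(12 + 2))
# (year, month) == (1296, 4), i.e. "Tháng 4, 1296"
```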
Example for Sub-Task 2: Duration Question Answering
Input:
{
  "context": "Tôi đang sửa chữa chiếc xe đạp bị hỏng.",
  "options": ["30 phút", "1 tháng", "10 phút", "2 giờ"],
  "qid": 54,
  "question": "Mất thời gian bao lâu để sửa chữa chiếc xe đạp?",
  "labels": ["yes", "no", "yes", "yes"]
}
System Prediction:
["yes", "no", "no", "yes"]
Metric Calculation:
- Exact Match: System prediction ≠ ground truth.
→ Exact Match = 0.0
- Precision: 2 correct "yes" predictions out of 2 total "yes" predictions.
→ Precision = 2 / 2 = 1.0
- Recall: 2 correct "yes" predictions out of 3 actual "yes" in ground truth.
→ Recall = 2 / 3 ≈ 0.6667
- F1-score: Harmonic mean of precision and recall.
→ F1 = 2 × (1.0 × 2/3) / (1.0 + 2/3) = 0.8
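The calculation above can be reproduced in a few lines of Python:

```python
gold = ["yes", "no", "yes", "yes"]  # ground-truth labels
pred = ["yes", "no", "no", "yes"]   # system prediction

exact_match = float(pred == gold)                      # 0.0
tp = sum(p == g == "yes" for p, g in zip(pred, gold))  # 2 correct "yes"
precision = tp / pred.count("yes")                     # 2 / 2 = 1.0
recall = tp / gold.count("yes")                        # 2 / 3
f1 = 2 * precision * recall / (precision + recall)     # 0.8
```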