VLSP 2025 – viTempQA Duration Challenge - Vietnamese Temporal QA

Organized by toanlnhus

Competition Phases

  • Public Test: starts July 10, 2025, midnight UTC
  • Private Test: starts Aug. 20, 2025, midnight UTC
  • Competition Ends: Aug. 31, 2025, midnight UTC

VLSP 2025 – viTempQA Duration Challenge: Vietnamese Temporal Question Answering

Shared Task Registration Form

Important Dates
  • June 23, 2025: Registration open
  • July 6, 2025: Training data release
  • July 15, 2025: Public test release
  • August 20, 2025: System submission deadline
  • August 30, 2025: Private test results release
  • September 10, 2025: Technical report submission
  • September 27, 2025: Notification of acceptance
  • October 3, 2025: Camera-ready deadline
  • October 29-30, 2025: Conference dates
 
Task Description

Objective: Build a system to answer temporal questions in Vietnamese across two sub-tasks: Date Arithmetic (date-arith) and Duration Question Answering (durationQA). The system must extract and reason about temporal information to provide accurate answers related to dates, durations, and temporal relationships.

  • Sub-Task 1: Date Arithmetic (date-arith)
    Description: Answer questions that require arithmetic over dates, such as computing the date that lies a given number of years and months before or after a reference date (see the worked example below).
  • Sub-Task 2: Duration Question Answering (durationQA)
    Description: Answer questions about the duration of events or actions based on a given context. The system must extract duration-related information from the text and use real-world knowledge to evaluate each answer option, determining how long an event or action lasts.
    Focus: Identify explicit or implied durations in the context (e.g., "6 years") and apply real-world reasoning to classify each option as correct ("yes") or incorrect ("no") based on factual accuracy (a minimal input/output sketch follows this list).
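
To make the expected prediction shape concrete, here is a minimal Python sketch that reads durationQA examples and emits one "yes"/"no" label per answer option. The JSON Lines layout and the file name duration_public_test.jsonl are assumptions for illustration only, and the constant-"no" baseline is a placeholder, not a recommended system.

import json

def predict(example: dict) -> list:
    # A real system would reason over example["context"] and
    # example["question"]; this hypothetical placeholder labels
    # every option "no" just to show the output contract.
    return ["no" for _ in example["options"]]

# Assumed JSON Lines input, one object per question (file name is illustrative).
with open("duration_public_test.jsonl", encoding="utf-8") as f:
    for line in f:
        ex = json.loads(line)
        print(ex["qid"], predict(ex))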
 
Evaluation

System performance will be evaluated using the following standard metrics: Accuracy, Exact Match, Precision, Recall, and F1-score.

Evaluation Metrics

  • Accuracy: Used for Sub-Task 1 (Date Arithmetic). It is the fraction of questions whose predicted answer exactly matches the ground-truth answer.
  • Exact Match: Used for Sub-Task 2 (DurationQA). It evaluates whether the predicted label sequence exactly matches the ground-truth label sequence.
  • Precision: Ratio of correctly predicted "yes" answers to the total number of "yes" predictions made by the system.
  • Recall: Ratio of correctly predicted "yes" answers to the total number of actual "yes" answers in the ground truth.
  • F1-score: Harmonic mean of Precision and Recall, summarizing overall performance.

Evaluation is performed separately for each sub-task. The final evaluation report includes individual scores as well as aggregate performance across all tasks.
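
As an illustration of how these metrics fit together, here is a minimal scoring sketch, assuming gold and predicted labels are "yes"/"no" sequences as in the examples below. It is not the official scorer, only a plain-Python rendering of the definitions above.

from typing import List, Tuple

def exact_match(gold: List[str], pred: List[str]) -> float:
    # 1.0 only when the whole predicted sequence equals the gold sequence.
    return 1.0 if gold == pred else 0.0

def precision_recall_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    # "yes" is treated as the positive class throughout.
    tp = sum(g == "yes" and p == "yes" for g, p in zip(gold, pred))
    pred_yes = pred.count("yes")
    gold_yes = gold.count("yes")
    precision = tp / pred_yes if pred_yes else 0.0
    recall = tp / gold_yes if gold_yes else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1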

Example for Sub-Task 1: Date Arithmetic

Input:

{
    "question": "Thời gian 1 năm và 2 tháng trước tháng 6, 1297 là khi nào?",
    "context": "",
    "answer": ["Tháng 4, 1296"]
}

(In English: "When is 1 year and 2 months before June 1297?" Answer: "April 1296".)

System Prediction:

["Tháng 4, 1296"]

  • Accuracy: The prediction matches the ground-truth exactly.
    Accuracy = 1.0
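
The arithmetic in this example can be reproduced with a few lines of month-level date math. This is only an illustrative sketch; the helper name shift_month is hypothetical and not part of the task definition.

def shift_month(year: int, month: int, years_back: int, months_back: int):
    # Work in total months since year 0 so borrows across years are automatic.
    total = year * 12 + (month - 1) - (years_back * 12 + months_back)
    return total // 12, total % 12 + 1

print(shift_month(1297, 6, 1, 2))  # -> (1296, 4), i.e. "Tháng 4, 1296"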

 

Example for Sub-Task 2: Duration Question Answering

Input:

{
    "context": "Tôi đang sửa chữa chiếc xe đạp bị hỏng.",
    "options": ["30 phút", "1 tháng", "10 phút", "2 giờ"],
    "qid": 54,
    "question": "Mất thời gian bao lâu để sửa chữa chiếc xe đạp?",
    "labels": ["yes", "no", "yes", "yes"]
}

(In English: the context is "I am repairing the broken bicycle.", the question is "How long does it take to repair the bicycle?", and the options are "30 minutes", "1 month", "10 minutes", "2 hours".)

System Prediction:

["yes", "no", "no", "yes"]

Metric Calculation:
  • Exact Match: System prediction ≠ ground truth.
    Exact Match = 0.0
  • Precision: 2 correct "yes" predictions out of 2 total "yes" predictions.
    Precision = 2 / 2 = 1.0
  • Recall: 2 correct "yes" predictions out of 3 actual "yes" in ground truth.
    Recall = 2 / 3 ≈ 0.6667
  • F1-score: Harmonic mean of precision and recall.
    F1 = 2 × (1.0 × 0.6667) / (1.0 + 0.6667) ≈ 0.8
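
These numbers can be cross-checked with scikit-learn's binary metrics, treating "yes" as the positive label. This is only a sanity check under that assumption; the shared task does not specify scikit-learn as its scorer.

from sklearn.metrics import precision_score, recall_score, f1_score

gold = ["yes", "no", "yes", "yes"]
pred = ["yes", "no", "no", "yes"]
print(precision_score(gold, pred, pos_label="yes"))  # 1.0
print(recall_score(gold, pred, pos_label="yes"))     # 0.666...
print(f1_score(gold, pred, pos_label="yes"))         # 0.8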
