Your LLM Agents are Temporally Blind: The Misalignment Between Tool Use Decisions and Human Time Perception


This repository contains the official data and code implementation for the paper: Your LLM Agents are Temporally Blind: The Misalignment Between Tool Use Decisions and Human Time Perception.

For any questions or feedback, please feel free to email Yize Cheng.

📖 Overview

We identify temporal blindness as a critical limitation in multi-turn LLM agents. Models often fail to account for the passage of real-world time between messages when making tool-call decisions, leading to either over-reliance or under-reliance on prior context.


To evaluate this, we introduce TicToc: a diverse dataset of multi-turn user–agent conversation trajectories involving tool calls. By evaluating 18 open-weight and proprietary models, we underscore the misalignment between agents’ tool-call decisions and human time perception.


We also show that naive, prompt-based alignment techniques have limited effectiveness for most models, but specific post-training alignment can be a viable way to align multi-turn LLM tool use with human temporal perception.

📂 Repository Contents

Data Files

  • TicToc/
    • Contains all trajectories in our dataset after two rounds of quality filtering, saved separately for each scenario. (Does not contain human preference results).
  • merged_fully_labeled_data.json
    • Main Evaluation Data. Contains all samples merged with aggregated human preference collection results.
  • labeling_data_summary_filtered_merged.csv
    • Per-sample human preference results before aggregation.
  • merged_fully_labeled_data_train.json & merged_fully_labeled_data_test.json
    • Subsets of the fully labeled data used for alignment efforts via targeted post-training (DPO with dynamic margin).
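The evaluation files above are plain JSON. A minimal loading helper (a sketch; the per-record field names are not documented here, so inspect a sample before relying on them):

```python
import json

def load_tictoc(path="merged_fully_labeled_data.json"):
    """Load a TicToc evaluation file and return its list of samples.

    Field names inside each record are not specified in this README,
    so print one record first to see the schema.
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```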

Codebase

  • inference/
    • Contains model inference handlers. All handlers inherit from the abstract base class Base_Handler defined in inference/model_handler.py.
    • Extensibility: New models can be easily supported by creating a handler that inherits from this class.
  • templates/
    • Jinja prompt templates for open-source models. These are modified to include timestamp information for each message role, simulating a deployment scenario where system wall-clock time is injected at the start of every user, assistant, and tool message.
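The handler pattern and the timestamp injection can be sketched together as follows. This is illustrative only: the real abstract class is Base_Handler in inference/model_handler.py, and the method names and timestamp format below are assumptions, not the repo's actual API.

```python
# Illustrative sketch; the repo's Base_Handler (inference/model_handler.py)
# may define different method names and timestamp formatting.
from abc import ABC, abstractmethod
from datetime import datetime, timezone

class BaseHandlerSketch(ABC):
    """Minimal stand-in for the repo's abstract handler class."""

    @abstractmethod
    def generate(self, messages: list) -> str:
        """Produce a model response for a list of chat messages."""

    @staticmethod
    def with_timestamps(messages, times):
        # Prepend a wall-clock timestamp to every message, mirroring the
        # template modification described above (exact format assumed).
        return [
            {**m, "content": f"[{t.isoformat()}] {m['content']}"}
            for m, t in zip(messages, times)
        ]

class EchoHandler(BaseHandlerSketch):
    # Toy subclass; a real handler would call a model API or local model here.
    def generate(self, messages):
        return messages[-1]["content"]
```

A new model would be supported by writing one such subclass and registering its key in inference/model_map.py.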

⚙️ Installation

We recommend creating a separate virtual environment (conda or venv) with Python >= 3.10.

pip install -r requirements.txt

🚀 Running Evaluation

To run inference and save model results, use the following command:

python $SCRIPT --model "$MODEL" --data "$DATA" --use_time_stamp --output_dir "$OUTPUT_DIR"

Arguments

| Argument | Description |
| --- | --- |
| $SCRIPT | eval_from_api.py for API-hosted models (e.g., OpenAI) or eval_from_local.py for local models (e.g., Llama). |
| $MODEL | A model key string defined in inference/model_map.py. |
| $DATA | Use merged_fully_labeled_data.json for full evaluation, or merged_fully_labeled_data_test.json for the test split after DPO. |
| --use_time_stamp | Includes timestamps in the model context. Without this flag, the model does not see timestamps in the context window. |
| --use_special_sys_prompt_naive | (Optional) Adds a "general reminder" to the system prompt (see Section 3.4 of the paper). |
| --use_special_sys_prompt_rule | (Optional) Adds few-shot example rules to the system prompt (see Section 3.4 of the paper). |
| $OUTPUT_DIR | The directory where inference results will be saved. |

Computing Metrics

After generating inference results, calculate the normalized alignment and attempt rates:

python get_metric.py --model "$MODEL" --data "$DATA" --use_time_stamp --output_dir "$OUTPUT_DIR"

Note: Ensure $MODEL, $DATA, and $OUTPUT_DIR match the values used during inference, and pass --use_time_stamp only if the inference results were generated in the "with timestamp" setting.
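As a rough illustration of what such rates measure (the paper's exact normalization is defined in get_metric.py and may differ; the pairing of a model decision with a majority human preference below is a hypothetical simplification):

```python
def alignment_and_attempt_rates(samples):
    """Toy sketch of the two rates, not the paper's normalized metrics.

    `samples` is a list of (model_decision, human_preference) booleans,
    each meaning "call the tool now". Alignment is the fraction of
    samples where the model agrees with the human majority; attempt
    rate is the fraction where the model attempted a tool call.
    """
    n = len(samples)
    aligned = sum(m == h for m, h in samples)
    attempts = sum(m for m, _ in samples)
    return aligned / n, attempts / n
```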

🧠 Running DPO Training

We implement post-training alignment using Direct Preference Optimization (DPO) with a Dynamic Margin.

The loss function is defined as:

$$ \mathcal{L} = -\mathbb{E}_{\mathcal{D}} \Big[ \log \sigma \Big( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} - \delta \Big) \Big] $$

Here, the margin $\delta$ effectively shifts the decision boundary of the sigmoid function. Please refer to Appendix D.2 of our paper for details on how we dynamically set the margin based on the human preference collection results.
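The loss above can be sketched per sample in plain Python (a torch-free scalar version for clarity; the logp_* arguments denote summed token log-probabilities of the chosen (w) and rejected (l) responses under the policy and reference models, and the names are ours, not the repo's):

```python
import math

def dpo_margin_loss(logp_w, logp_w_ref, logp_l, logp_l_ref, beta, delta):
    """Per-sample DPO loss with margin delta: -log sigmoid(logits - delta)."""
    logits = beta * (logp_w - logp_w_ref) - beta * (logp_l - logp_l_ref) - delta
    # Numerically stable -log(sigmoid(logits)):
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

A larger margin demands a bigger gap between the chosen and rejected log-probability ratios before the loss gets small, which is what "shifting the sigmoid's decision boundary" means in practice.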

1. Training Data Preparation

Pre-format the input-output pairs before training. This script formats the data according to the specific model's prompt template (including timestamps) and saves it in Hugging Face format:

python dpo_prepare_hf_data.py --data "merged_fully_labeled_data_train.json" --model $MODEL

As before, $MODEL is a model key string defined in inference/model_map.py.

Output: A Hugging Face dataset folder named {data_name}_{model_name}_dpo_dataset.

2. Running Training

We use FSDP with parameter offloading due to resource constraints. You must specify fsdp_transformer_layer_cls_to_wrap in your accelerate config; a sample default_config.yaml is provided. By default, Accelerate reads this file from its cache folder (e.g., ~/.cache/huggingface/accelerate/default_config.yaml) or from the directory set by the HF_HOME environment variable.
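A hypothetical excerpt of the relevant FSDP section (the provided default_config.yaml is authoritative; the layer class name below depends on your model, e.g. LlamaDecoderLayer for Llama-family models):

```yaml
# Hypothetical excerpt of an accelerate config; adjust the layer class
# to match your model architecture.
distributed_type: FSDP
fsdp_config:
  fsdp_offload_params: true
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
```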

Example launch:

accelerate launch dpo_train_hf_margin.py \
  --model_name $MODEL \
  --dataset_path $DATA_FOLDER \
  --beta $BETA \
  --num_epoch $NUM_EPOCH \
  --learning_rate $LR
  • $DATA_FOLDER: The path to the HF-formatted data generated in step 1.
  • $BETA: The beta hyperparameter for the DPO loss.
  • $NUM_EPOCH: The number of training epochs.
  • $LR: The learning rate.

🔗 Citation

If you find our work, code, or dataset useful, please consider citing us:

@misc{cheng2026llmagentstemporallyblind,
      title={Your LLM Agents are Temporally Blind: The Misalignment Between Tool Use Decisions and Human Time Perception}, 
      author={Yize Cheng and Arshia Soltani Moakhar and Chenrui Fan and Parsa Hosseini and Kazem Faghih and Zahra Sodagar and Wenxiao Wang and Soheil Feizi},
      year={2026},
      eprint={2510.23853},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.23853}, 
}
