# TicToc
Code and data for paper "Your LLM Agents are Temporally Blind: The Misalignment Between Tool Use Decisions and Human Time Perception"
This repository contains the official data and code implementation for the paper *Your LLM Agents are Temporally Blind: The Misalignment Between Tool Use Decisions and Human Time Perception*.
For any questions or feedback, please feel free to email Yize Cheng.
## 📖 Overview
We identify temporal blindness as a critical limitation in multi-turn LLM agents. Models often fail to account for the passage of real-world time between messages when making tool-call decisions, leading to either over-reliance or under-reliance on prior context.

To evaluate this, we introduce TicToc: a diverse dataset of multi-turn user–agent conversation trajectories involving tool calls. By evaluating 18 open-weight and proprietary models, we underscore the misalignment between agents’ tool-call decisions and human time perception.

We also show that naive, prompt-based alignment techniques have limited effectiveness for most models, but specific post-training alignment can be a viable way to align multi-turn LLM tool use with human temporal perception.
## 📂 Repository Contents

### Data Files
- `TicToc/` – All trajectories in our dataset after two rounds of quality filtering, saved separately for each scenario (does not contain human preference results).
- `merged_fully_labeled_data.json` – **Main evaluation data.** All samples merged with aggregated human preference collection results.
- `labeling_data_summary_filtered_merged.csv` – Per-sample human preference results before aggregation.
- `merged_fully_labeled_data_train.json` & `merged_fully_labeled_data_test.json` – Subsets of the fully labeled data used for alignment via targeted post-training (DPO with a dynamic margin).
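The JSON files above can be inspected with a few lines of Python. The per-record schema is not documented here, so this minimal sketch only counts samples, assuming the top level is a JSON list or dict:

```python
import json

def count_samples(path):
    """Load a TicToc data file and report how many samples it holds.

    Works whether the top level is a JSON list of samples or a dict
    keyed by sample id.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return len(data)
```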
### Codebase
- `inference/` – Contains model inference handlers. All handlers inherit from the abstract base class `Base_Handler` defined in `inference/model_handler.py`.
  - Extensibility: new models can be easily supported by creating a handler that inherits from this class.
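The extension point can be sketched as follows. The real `Base_Handler` interface lives in `inference/model_handler.py`; the method names and signatures below are illustrative assumptions, not the repo's actual API:

```python
from abc import ABC, abstractmethod

class Base_Handler(ABC):
    """Illustrative stand-in for the abstract base class in
    inference/model_handler.py; the real method names may differ."""

    @abstractmethod
    def generate(self, messages: list[dict]) -> str:
        """Return the model's response for a list of chat messages."""

class MyNewModelHandler(Base_Handler):
    """Hypothetical handler for a newly added model."""

    def __init__(self, model_name: str):
        self.model_name = model_name

    def generate(self, messages: list[dict]) -> str:
        # Call your model's API or local runtime here; echoed for the sketch.
        return f"[{self.model_name}] received {len(messages)} message(s)"
```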
- `templates/` – Jinja prompt templates for open-source models. These are modified to include timestamp information for each message role, simulating a deployment scenario where system wall-clock time is injected at the start of every user, assistant, and tool message.
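The timestamp injection those templates perform can be pictured with a small sketch. The exact timestamp format and placement in the repo's Jinja templates may differ; here each message carries an assumed Unix `time` field:

```python
from datetime import datetime, timezone

def inject_timestamps(messages):
    """Prefix each message's content with a wall-clock timestamp,
    mimicking a deployment where system time is visible to the model."""
    stamped = []
    for msg in messages:
        ts = datetime.fromtimestamp(msg["time"], tz=timezone.utc)
        stamped.append({
            "role": msg["role"],
            "content": f"[{ts:%Y-%m-%d %H:%M:%S}] {msg['content']}",
        })
    return stamped
```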
## ⚙️ Installation
We recommend creating a separate virtual environment (conda or venv) with Python >= 3.10.

```bash
pip install -r requirements.txt
```
## 🚀 Running Evaluation
To run inference and save model results, use the following command:

```bash
python $SCRIPT --model "$MODEL" --data "$DATA" --use_time_stamp --output_dir "$OUTPUT_DIR"
```
### Arguments

| Argument | Description |
| --- | --- |
| `$SCRIPT` | `eval_from_api.py` for API-hosted models (e.g., OpenAI) or `eval_from_local.py` for local models (e.g., Llama). |
| `$MODEL` | A model key string defined in `inference/model_map.py`. |
| `$DATA` | Use `merged_fully_labeled_data.json` for full evaluation, or `merged_fully_labeled_data_test.json` for the test split after DPO. |
| `--use_time_stamp` | Includes timestamps in the model context. Without this flag, the model does not see timestamps in its context window. |
| `--use_special_sys_prompt_naive` | (Optional) Adds a "general reminder" to the system prompt (see Section 3.4 of the paper). |
| `--use_special_sys_prompt_rule` | (Optional) Adds few-shot example rules to the system prompt (see Section 3.4 of the paper). |
| `$OUTPUT_DIR` | The directory where inference results will be saved. |
### Computing Metrics
After generating inference results, compute the normalized alignment and attempt rates:

```bash
python get_metric.py --model "$MODEL" --data "$DATA" --use_time_stamp --output_dir "$OUTPUT_DIR"
```

Note: Ensure `$MODEL`, `$DATA`, and `$OUTPUT_DIR` match the values used during inference, and pass `--use_time_stamp` only when the inference results were also obtained in the "with timestamp" setting.
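As a rough illustration of the two quantities (the precise normalization is defined in the paper and implemented in `get_metric.py`; the field names and aggregation below are assumptions made for the sketch):

```python
def attempt_rate(decisions):
    """Fraction of samples where the model attempted a tool call."""
    return sum(d["model_called_tool"] for d in decisions) / len(decisions)

def alignment_rate(decisions):
    """Fraction of samples where the model's tool-call decision matches
    the aggregated human preference (assumed majority label)."""
    return sum(
        d["model_called_tool"] == d["human_prefers_tool"] for d in decisions
    ) / len(decisions)
```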
## 🧠 Running DPO Training
We implement post-training alignment using Direct Preference Optimization (DPO) with a Dynamic Margin.
The loss function is defined as:
$$ \mathcal{L} = -\mathbb{E}_{\mathcal{D}} \Big[ \log \sigma \Big( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} - \delta \Big) \Big] $$
Here, the margin $\delta$ effectively shifts the decision boundary of the sigmoid function. Please refer to Appendix D.2 of our paper for details on how we set the margin dynamically based on the human preference collection results.
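Numerically, the per-sample loss above can be sketched in plain Python. The log-probabilities are assumed precomputed per response, and `delta` is the per-sample dynamic margin:

```python
import math

def dpo_margin_loss(logp_w, logp_ref_w, logp_l, logp_ref_l, beta, delta):
    """Per-sample DPO loss with an additive margin `delta`.

    logp_* are log-probabilities of the chosen (w) and rejected (l)
    responses under the policy (theta) and reference models.
    """
    logits = beta * (logp_w - logp_ref_w) - beta * (logp_l - logp_ref_l) - delta
    # -log(sigmoid(logits)), computed stably as softplus(-logits)
    if logits > 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

With `delta = 0` this reduces to the standard DPO loss; a positive margin demands a larger preferred-over-dispreferred gap before the loss vanishes.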
### 1. Training Data Preparation
Pre-format the input-output pairs before training. This script formats the data according to the specific model's prompt template (including timestamps) and saves it in Hugging Face format:
```bash
python dpo_prepare_hf_data.py --data "merged_fully_labeled_data_train.json" --model $MODEL
```
As before, $MODEL is a model key string defined in inference/model_map.py.
Output: a Hugging Face dataset folder named `{data_name}_{model_name}_dpo_dataset`.
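The saved dataset presumably follows the preference-pair layout common to HF DPO trainers; a minimal sketch of one record, where the `prompt`/`chosen`/`rejected` field names and the extra `margin` field are assumptions about this repo's format:

```python
def make_dpo_record(prompt, preferred, dispreferred, margin):
    """Build one preference pair; `margin` carries the per-sample
    dynamic margin derived from human preference agreement."""
    return {
        "prompt": prompt,
        "chosen": preferred,
        "rejected": dispreferred,
        "margin": margin,
    }
```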
### 2. Running Training
We utilize FSDP offloading due to resource constraints. You must specify `fsdp_transformer_layer_cls_to_wrap` in your accelerate config. A sample `default_config.yaml` is provided; by default, Accelerate reads this file from its cache folder (e.g., `~/.cache/huggingface/accelerate/default_config.yaml`) or from the directory set by the `HF_HOME` environment variable.
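For reference, an FSDP accelerate config with the wrap class set might look like the following. These are illustrative values, and `LlamaDecoderLayer` is an assumption that depends on which model you train:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
mixed_precision: bf16
num_processes: 4
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
```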
Example launch:

```bash
accelerate launch dpo_train_hf_margin.py \
    --model_name $MODEL \
    --dataset_path $DATA_FOLDER \
    --beta $BETA \
    --num_epoch $NUM_EPOCH \
    --learning_rate $LR
```
- `$DATA_FOLDER`: The path to the HF-formatted data generated in step 1.
- `$BETA`: The beta hyperparameter for the DPO loss.
## 🔗 Citation
If you find our work, code, or dataset useful, please consider citing us:
```bibtex
@misc{cheng2026llmagentstemporallyblind,
      title={Your LLM Agents are Temporally Blind: The Misalignment Between Tool Use Decisions and Human Time Perception},
      author={Yize Cheng and Arshia Soltani Moakhar and Chenrui Fan and Parsa Hosseini and Kazem Faghih and Zahra Sodagar and Wenxiao Wang and Soheil Feizi},
      year={2026},
      eprint={2510.23853},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.23853},
}
```