TrafficLLM
The repository of TrafficLLM, a universal LLM adaptation framework that learns robust traffic representations for open-sourced LLMs in real-world scenarios and enhances generalization across diverse traffic analysis tasks.
TrafficLLM: Enhancing Large Language Models for Network Traffic Analysis with Robust Traffic Representation
<p align="center"> <a href='https://github.com/ZGC-LLM-Safety/TrafficLLM'><img src='https://img.shields.io/badge/Project-Github-purple'></a> <a href='https://arxiv.org/abs/2504.04222'><img src='https://img.shields.io/badge/Paper-Arxiv-orange'></a> <a href='https://drive.google.com/drive/folders/1RZAOPcNKq73-quA8KG_lkAo_EqlwhlQb'><img src='https://img.shields.io/badge/Datasets-Google Drive-red'></a> <a href='https://drive.google.com/drive/folders/1YjEhdordqZRpnw_oKczwUztcT52T0oQ0'><img src='https://img.shields.io/badge/Models-Google Drive-green'></a> <a href='https://mp.weixin.qq.com/s/pt2CfG0i9Fex-sy7-dcoyg' target='_blank'><img src='https://img.shields.io/badge/Chinese-Blog-blue'></a> </p>

Note: this code is based on ChatGLM2 and Llama2. Many thanks to the authors.
News
- [x] [2025.11.06] 🌟🌟 We update the MCP configuration that uses TrafficLLM to build single-agent and multi-agent systems for network traffic analysis.
- [x] [2025.04.05] 🔥🔥 We release the preprint paper on arXiv.
- [x] [2024.11.26] 🌲🌲 We release the generation code that uses TrafficLLM to generate packets with Scapy, producing pcap files readable by Wireshark. Go to tutorials for more details.
- [x] [2024.10.28] 🎉🎉 We have updated the adaptation code for using GLM4 to build TrafficLLM, which has faster tuning and inference speeds than ChatGLM2. Go to Adapt2GLM4 for more details.
Brief Introduction
TrafficLLM is built upon a sophisticated fine-tuning framework using natural language and traffic data, which proposes the following techniques to enhance the utility of large language models in network traffic analysis.
- Traffic-Domain Tokenization. To overcome the modality gap between natural language and heterogeneous traffic data, TrafficLLM introduces traffic-domain tokenization to process the diverse inputs of traffic detection and generation tasks for LLM adaptation. This mechanism effectively extends the LLM's native tokenizer by training a specialized tokenization model on large-scale traffic-domain corpora.
- Dual-Stage Tuning Pipeline. TrafficLLM employs a dual-stage tuning pipeline to achieve robust representation learning across different traffic-domain tasks. The pipeline trains the LLM to understand instructions and learn task-related traffic patterns at different stages, which builds up TrafficLLM's task-understanding and traffic-reasoning abilities for diverse traffic detection and generation tasks.
- Extensible Adaptation with Parameter-Effective Fine-Tuning (EA-PEFT). To adapt the LLM for generalization to new traffic environments, TrafficLLM proposes an extensible adaptation with parameter-effective fine-tuning (EA-PEFT) to update model parameters with low overhead. The technique splits model capabilities across different PEFT models, which helps minimize the adaptation costs in dynamic scenarios raised by traffic pattern changes.
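To make the traffic-domain tokenization idea concrete, here is a minimal, self-contained sketch (not TrafficLLM's actual implementation) of how a traffic-aware tokenizer might split packet-like text into protocol-aware tokens before they are mapped to IDs. The token classes and the regular expression are illustrative assumptions.

```python
import re

# Illustrative token classes for traffic-style fields: IPv4 addresses,
# runs of hex bytes, and numeric fields such as ports or lengths.
# These classes are assumptions for the sketch, not TrafficLLM's
# real token inventory.
TRAFFIC_TOKEN_RE = re.compile(
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3})"   # IPv4 address
    r"|(?P<hex>(?:[0-9a-fA-F]{2}){2,})"  # run of >= 2 hex byte pairs
    r"|(?P<num>\d+)"                     # ports, lengths, counters
    r"|(?P<word>[A-Za-z_]+)"             # protocol keywords (TCP, GET, ...)
)

def traffic_tokenize(text: str) -> list:
    """Split traffic-style text into domain-aware tokens."""
    return [m.group(0) for m in TRAFFIC_TOKEN_RE.finditer(text)]

tokens = traffic_tokenize("TCP 192.168.1.7 443 payload deadbeefcafe")
# The IP address and the hex payload each stay a single token instead
# of being shredded into sub-word pieces by a natural-language tokenizer.
```

A real traffic-domain tokenizer would instead be trained (e.g. with BPE) on large traffic corpora, but the effect is the same: domain fields survive as coherent units.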
TrafficLLM Datasets
We released TrafficLLM's training datasets, which contain over 0.4M traffic samples and 9K human instructions for LLM adaptation across different traffic analysis tasks.
- Instruction Datasets: The instruction datasets are used to help the LLM learn the domain knowledge of traffic detection or generation tasks and understand which task should be conducted in different scenarios.
- Traffic Datasets: The traffic datasets contain the traffic tuning data we extracted from public traffic datasets, which helps the LLM learn traffic patterns in different downstream tasks.
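The exact record schema of the released files is not shown here, but instruction-tuning corpora of this kind are commonly stored as instruction/output pairs, one JSON object per line. The field names below are an assumption for illustration, not the repository's actual format:

```python
import json

# Hypothetical layout of one instruction sample; the field names
# ("instruction", "output") are assumed, not taken from the repo.
sample = {
    "instruction": "Classify whether the following traffic flow is benign or malicious.",
    "output": "The flow matches known malware traffic patterns.",
}

line = json.dumps(sample)      # one JSON object per line (JSONL style)
restored = json.loads(line)    # round-trips losslessly
```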
Instruction Datasets
To build the natural language corpus as the human instructions in TrafficLLM, we collected 9,209 task-specific instructions supervised by experts and AI assistants. The statistics are shown as follows:
| Mainstream Tasks | Downstream Tasks | Abbrev. | #Sample |
| ------------------ | ---------------------------- | ------- | ------- |
| Traffic Detection | Malware Traffic Detection | MTD | 1.0K |
| | Botnet Detection | BND | 1.1K |
| | Malicious DoH Detection | MDD | 0.6K |
| | Web Attack Detection | WAD | 0.6K |
| | APT Attack Detection | AAD | 0.6K |
| | Encrypted VPN Detection | EVD | 1.2K |
| | Tor Behavior Detection | TBD | 0.6K |
| | Encrypted App Classification | EAC | 0.6K |
| | Website Fingerprinting | WF | 0.6K |
| | Concept Drift | CD | 0.6K |
| Traffic Generation | Malware Traffic Generation | MTG | 0.6K |
| | Botnet Traffic Generation | BTG | 0.1K |
| | Encrypted VPN Generation | EVG | 0.4K |
| | Encrypted App Generation | EAG | 0.6K |
Traffic Datasets
To evaluate the performance of TrafficLLM on various network scenarios, we extracted over 0.4M tuning samples from publicly available traffic datasets to measure TrafficLLM's abilities to detect or generate malicious and benign traffic. The statistics are shown as follows:
| Datasets | Tasks | Abbrev. | #Sample |
| ---------------- | ---------------------------- | ------- | ------- |
| USTC TFC 2016 | Malware Traffic Detection | MTD | 50.7K |
| ISCX Botnet 2014 | Botnet Detection | BND | 25.0K |
| DoHBrw 2020 | Malicious DoH Detection | MDD | 47.8K |
| CSIC 2010 | Web Attack Detection | WAD | 34.5K |
| DAPT 2020 | APT Attack Detection | AAD | 10.0K |
| ISCX VPN 2016 | Encrypted VPN Detection | EVD | 64.8K |
| ISCX Tor 2016 | Tor Behavior Detection | TBD | 40.0K |
| CSTNET 2023 | Encrypted App Classification | EAC | 97.6K |
| CW-100 2018 | Website Fingerprinting | WF | 7.4K |
| APP-53 2023 | Concept Drift | CD | 109.8K |
Getting Started
<span id='all_catelogue'/>Table of Contents:
- <a href='#chapter-1'>1. Environment Preparation</a>
- <a href='#chapter-2'>2. Training TrafficLLM</a>
- <a href='#chapter-2.1'>2.1. Preparing Pre-trained Checkpoint</a>
- <a href='#chapter-2.2'>2.2. Preprocessing Dataset</a>
- <a href='#chapter-2.3'>2.3. Training Traffic-Domain Tokenizer (Optional)</a>
    - <a href='#chapter-2.4'>2.4. Natural Language Instruction Tuning</a>
- <a href='#chapter-2.5'>2.5. Task-Specific Traffic Tuning</a>
- <a href='#chapter-2.6'>2.6. Extensible Adaptation with PEFT (EA-PEFT)</a>
- <a href='#chapter-3'>3. Evaluating TrafficLLM</a>
- <a href='#chapter-3.1'>3.1. Preparing Checkpoints and Data</a>
- <a href='#chapter-3.2'>3.2. Running Evaluation</a>
<span id='chapter-1'/>1. Environment Preparation <a href='#all_catelogue'>[Back to Top]</a>
Please clone the repo and install the required environment by running the following commands.
```shell
conda create -n trafficllm python=3.9
conda activate trafficllm
# Clone our TrafficLLM
git clone https://github.com/ZGC-LLM-Safety/TrafficLLM.git
cd TrafficLLM
# Install required libraries
pip install -r requirements.txt
# If training
pip install rouge_chinese nltk jieba datasets
```
<span id='chapter-2'/>
2. Training TrafficLLM <a href='#all_catelogue'>[Back to Top]</a>
TrafficLLM employs three core techniques: traffic-domain tokenization to process instructions and traffic data, a dual-stage tuning pipeline to understand text semantics and learn traffic patterns across different tasks, and EA-PEFT to update model parameters for adaptation to new scenarios.
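As a rough illustration of the EA-PEFT idea (a sketch under assumed names, not the repository's actual API), one can picture the framework as a registry that maps each traffic task to its own small adapter, so that a new task or a drifted traffic pattern only requires registering or retraining one adapter while all others stay untouched:

```python
# Minimal sketch of an EA-PEFT-style adapter registry. Task names follow
# the abbreviations in the tables above; the checkpoint paths are
# hypothetical.
adapters = {
    "MTD": "checkpoints/peft/malware_detection",
    "EVD": "checkpoints/peft/vpn_detection",
}

def update_adapter(registry: dict, task: str, path: str) -> dict:
    """Register or replace the PEFT adapter for one task without
    touching the adapters of the other tasks."""
    new_registry = dict(registry)  # copy so the old registry is preserved
    new_registry[task] = path
    return new_registry

# Malware traffic patterns drift: only the MTD adapter is swapped.
updated = update_adapter(adapters, "MTD", "checkpoints/peft/malware_detection_v2")
```

The frozen base LLM plus per-task adapters is what keeps the update cost low: retraining one adapter touches only a small fraction of the total parameters.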
<span id='chapter-2.1'/>2.1 Preparing Pre-trained Checkpoint <a href='#all_catelogue'>[Back to Top]</a>
TrafficLLM is trained based on existing open-sourced LLMs. Please follow the instructions to prepare the checkpoints.
- ChatGLM2: Prepare the base model ChatGLM, an open-sourced LLM with lightweight deployment requirements. Please download its weights here. We generally utilize the v2 model with 6B parameters.
- Other LLMs: To adapt other LLMs for traffic analysis tasks, you can reuse the training data in the repo and modify their training scripts according to the official instructions. For instance, Llama2 requires registering the new dataset in the configs. See llm for more details.
<span id='chapter-2.2'/>2.2 Preprocessing Dataset <a href='#all_catelogue'>[Back to Top]</a>
To extract training data suitable for LLM learning from the raw traffic datasets, we design specialized extractors to preprocess the traffic datasets for different tasks. The preprocessing code takes the following parameters to configure:
- input: The raw traffic dataset path (the main directory path that contains the raw traffic data)
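A minimal sketch of how such a preprocessing entry point might parse its `input` option is shown below; the script structure and option spelling are assumptions for illustration, not the repository's actual interface:

```python
import argparse

# Hypothetical command-line front end for a dataset extractor; only the
# documented `input` parameter is modeled here.
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Preprocess raw traffic datasets for TrafficLLM tuning")
    parser.add_argument("--input", required=True,
                        help="main directory containing the raw traffic dataset")
    return parser

# Example invocation with an assumed dataset directory name.
args = build_parser().parse_args(["--input", "datasets/ustc-tfc-2016"])
```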