TrafficLLM
The repository of TrafficLLM, a universal LLM adaptation framework that learns robust traffic representations for open-sourced LLMs in real-world scenarios and enhances generalization across diverse traffic analysis tasks.
TrafficLLM: Enhancing Large Language Models for Network Traffic Analysis with Robust Traffic Representation
<p align="center"> <a href='https://github.com/ZGC-LLM-Safety/TrafficLLM'><img src='https://img.shields.io/badge/Project-Github-purple'></a> <a href='https://arxiv.org/abs/2504.04222'><img src='https://img.shields.io/badge/Paper-Arxiv-orange'></a> <a href='https://drive.google.com/drive/folders/1RZAOPcNKq73-quA8KG_lkAo_EqlwhlQb'><img src='https://img.shields.io/badge/Datasets-Google Drive-red'></a> <a href='https://drive.google.com/drive/folders/1YjEhdordqZRpnw_oKczwUztcT52T0oQ0'><img src='https://img.shields.io/badge/Models-Google Drive-green'></a> <a href='https://mp.weixin.qq.com/s/pt2CfG0i9Fex-sy7-dcoyg' target='_blank'><img src='https://img.shields.io/badge/Chinese-Blog-blue'></a> </p>

Note: this code is based on ChatGLM2 and Llama2. Many thanks to the authors.
News
- [x] [2025.11.06] 🌟🌟 We update the MCP configuration that uses TrafficLLM to build single-agent and multi-agent systems for network traffic analysis.
- [x] [2025.04.05] 🔥🔥 We release the preprint paper on arXiv.
- [x] [2024.11.26] 🌲🌲 We release the generation code that uses TrafficLLM to generate packets with Scapy, producing pcap files readable by Wireshark. Go to tutorials for more details.
- [x] [2024.10.28] 🎉🎉 We have updated the adaptation code for using GLM4 to build TrafficLLM, which has faster tuning and inference speeds than ChatGLM2. Go to Adapt2GLM4 for more details.
Brief Introduction
TrafficLLM is built upon a sophisticated fine-tuning framework using natural language and traffic data, which proposes the following techniques to enhance the utility of large language models in network traffic analysis.
- Traffic-Domain Tokenization. To overcome the modality gap between natural language and heterogeneous traffic data, TrafficLLM introduces traffic-domain tokenization to process the diverse inputs of traffic detection and generation tasks for LLM adaptation. This mechanism effectively extends the LLM's native tokenizer by training a specialized tokenization model on large-scale traffic-domain corpora.
- Dual-Stage Tuning Pipeline. TrafficLLM employs a dual-stage tuning pipeline to achieve robust representation learning across different traffic-domain tasks. The pipeline trains the LLM to understand instructions and learn task-related traffic patterns at different stages, which builds up TrafficLLM's task-understanding and traffic-reasoning abilities for diverse traffic detection and generation tasks.
- Extensible Adaptation with Parameter-Effective Fine-Tuning (EA-PEFT). To adapt the LLM for generalization to new traffic environments, TrafficLLM proposes an extensible adaptation with parameter-effective fine-tuning (EA-PEFT) to update model parameters with low overhead. The technique splits model capabilities across different PEFT models, which helps minimize the adaptation costs in dynamic scenarios raised by traffic pattern changes.
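To make the traffic-domain tokenization idea concrete, here is a minimal, self-contained sketch (not TrafficLLM's actual implementation) of how a traffic-aware tokenizer might split packet-like text into protocol-aware tokens before they are mapped to IDs. The token classes and the regular expression are illustrative assumptions.

```python
import re

# Illustrative token classes for traffic-style fields: IPv4 addresses,
# runs of hex bytes, and numeric fields such as ports or lengths.
# These classes are assumptions for the sketch, not TrafficLLM's
# real token inventory.
TRAFFIC_TOKEN_RE = re.compile(
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3})"   # IPv4 address
    r"|(?P<hex>(?:[0-9a-fA-F]{2}){2,})"  # run of >= 2 hex byte pairs
    r"|(?P<num>\d+)"                     # ports, lengths, counters
    r"|(?P<word>[A-Za-z_]+)"             # protocol keywords (TCP, GET, ...)
)

def traffic_tokenize(text: str) -> list:
    """Split traffic-style text into domain-aware tokens."""
    return [m.group(0) for m in TRAFFIC_TOKEN_RE.finditer(text)]

tokens = traffic_tokenize("TCP 192.168.1.7 443 payload deadbeefcafe")
# The IP address and the hex payload each stay a single token instead
# of being shredded into sub-word pieces by a natural-language tokenizer.
```

A real traffic-domain tokenizer would instead be trained (e.g. with BPE) on large traffic corpora, but the effect is the same: domain fields survive as coherent units.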
TrafficLLM Datasets
We released TrafficLLM's training datasets, which contain over 0.4M traffic samples and 9K human instructions for LLM adaptation across different traffic analysis tasks.
- Instruction Datasets: The instruction datasets are used to help the LLM learn the domain knowledge of traffic detection or generation tasks and understand which task should be conducted in different scenarios.
- Traffic Datasets: The traffic datasets contain the traffic tuning data we extracted from public traffic datasets, which helps the LLM learn traffic patterns in different downstream tasks.
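The exact record schema of the released files is not shown here, but instruction-tuning corpora of this kind are commonly stored as instruction/output pairs, one JSON object per line. The field names below are an assumption for illustration, not the repository's actual format:

```python
import json

# Hypothetical layout of one instruction sample; the field names
# ("instruction", "output") are assumed, not taken from the repo.
sample = {
    "instruction": "Classify whether the following traffic flow is benign or malicious.",
    "output": "The flow matches known malware traffic patterns.",
}

line = json.dumps(sample)      # one JSON object per line (JSONL style)
restored = json.loads(line)    # round-trips losslessly
```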
Instruction Datasets
To build the natural language corpus as the human instructions in TrafficLLM, we collected 9,209 task-specific instructions supervised by experts and AI assistants. The statistics are shown as follows:
| Mainstream Tasks | Downstream Tasks | Abbrev. | #Sample |
| ------------------ | ---------------------------- | ------- | ------- |
| Traffic Detection | Malware Traffic Detection | MTD | 1.0K |
| | Botnet Detection | BND | 1.1K |
| | Malicious DoH Detection | MDD | 0.6K |
| | Web Attack Detection | WAD | 0.6K |
| | APT Attack Detection | AAD | 0.6K |
| | Encrypted VPN Detection | EVD | 1.2K |
| | Tor Behavior Detection | TBD | 0.6K |
| | Encrypted App Classification | EAC | 0.6K |
| | Website Fingerprinting | WF | 0.6K |
| | Concept Drift | CD | 0.6K |
| Traffic Generation | Malware Traffic Generation | MTG | 0.6K |
| | Botnet Traffic Generation | BTG | 0.1K |
| | Encrypted VPN Generation | EVG | 0.4K |
| | Encrypted App Generation | EAG | 0.6K |
Traffic Datasets
To evaluate the performance of TrafficLLM on various network scenarios, we extracted over 0.4M tuning samples from publicly available traffic datasets to measure TrafficLLM's abilities to detect or generate malicious and benign traffic. The statistics are shown as follows:
| Datasets | Tasks | Abbrev. | #Sample |
| ---------------- | ---------------------------- | ------- | ------- |
| USTC TFC 2016 | Malware Traffic Detection | MTD | 50.7K |
| ISCX Botnet 2014 | Botnet Detection | BND | 25.0K |
| DoHBrw 2020 | Malicious DoH Detection | MDD | 47.8K |
| CSIC 2010 | Web Attack Detection | WAD | 34.5K |
| DAPT 2020 | APT Attack Detection | AAD | 10.0K |
| ISCX VPN 2016 | Encrypted VPN Detection | EVD | 64.8K |
| ISCX Tor 2016 | Tor Behavior Detection | TBD | 40.0K |
| CSTNET 2023 | Encrypted App Classification | EAC | 97.6K |
| CW-100 2018 | Website Fingerprinting | WF | 7.4K |
| APP-53 2023 | Concept Drift | CD | 109.8K |
Getting Started
<span id='all_catelogue'/>Table of Contents:
- <a href='#chapter-1'>1. Environment Preparation</a>
- <a href='#chapter-2'>2. Training TrafficLLM</a>
- <a href='#chapter-2.1'>2.1. Preparing Pre-trained Checkpoint</a>
- <a href='#chapter-2.2'>2.2. Preprocessing Dataset</a>
- <a href='#chapter-2.3'>2.3. Training Traffic-Domain Tokenizer (Optional)</a>
    - <a href='#chapter-2.4'>2.4. Natural Language Instruction Tuning</a>
- <a href='#chapter-2.5'>2.5. Task-Specific Traffic Tuning</a>
- <a href='#chapter-2.6'>2.6. Extensible Adaptation with PEFT (EA-PEFT)</a>
- <a href='#chapter-3'>3. Evaluating TrafficLLM</a>
- <a href='#chapter-3.1'>3.1. Preparing Checkpoints and Data</a>
- <a href='#chapter-3.2'>3.2. Running Evaluation</a>
<span id='chapter-1'/>1. Environment Preparation <a href='#all_catelogue'>[Back to Top]</a>
Please clone the repo and install the required environment by running the following commands.
```shell
conda create -n trafficllm python=3.9
conda activate trafficllm
# Clone our TrafficLLM
git clone https://github.com/ZGC-LLM-Safety/TrafficLLM.git
cd TrafficLLM
# Install required libraries
pip install -r requirements.txt
# If training
pip install rouge_chinese nltk jieba datasets
```
<span id='chapter-2'/>
2. Training TrafficLLM <a href='#all_catelogue'>[Back to Top]</a>
TrafficLLM employs three core techniques: traffic-domain tokenization to process instructions and traffic data, a dual-stage tuning pipeline to understand text semantics and learn traffic patterns across different tasks, and EA-PEFT to update model parameters for adaptation to new scenarios.
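As a rough illustration of the EA-PEFT idea (a sketch under assumed names, not the repository's actual API), one can picture the framework as a registry that maps each traffic task to its own small adapter, so that a new task or a drifted traffic pattern only requires registering or retraining one adapter while all others stay untouched:

```python
# Minimal sketch of an EA-PEFT-style adapter registry. Task names follow
# the abbreviations in the tables above; the checkpoint paths are
# hypothetical.
adapters = {
    "MTD": "checkpoints/peft/malware_detection",
    "EVD": "checkpoints/peft/vpn_detection",
}

def update_adapter(registry: dict, task: str, path: str) -> dict:
    """Register or replace the PEFT adapter for one task without
    touching the adapters of the other tasks."""
    new_registry = dict(registry)  # copy so the old registry is preserved
    new_registry[task] = path
    return new_registry

# Malware traffic patterns drift: only the MTD adapter is swapped.
updated = update_adapter(adapters, "MTD", "checkpoints/peft/malware_detection_v2")
```

The frozen base LLM plus per-task adapters is what keeps the update cost low: retraining one adapter touches only a small fraction of the total parameters.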
<span id='chapter-2.1'/>2.1 Preparing Pre-trained Checkpoint <a href='#all_catelogue'>[Back to Top]</a>
TrafficLLM is trained based on existing open-sourced LLMs. Please follow the instructions to prepare the checkpoints.
- ChatGLM2: Prepare the base model ChatGLM, an open-sourced LLM with lightweight deployment requirements. Please download its weights here. We generally utilize the v2 model with 6B parameters.
- Other LLMs: To adapt other LLMs for traffic analysis tasks, you can reuse the training data in the repo and modify their training scripts according to the official instructions. For instance, Llama2 requires registering the new dataset in the configs. See llm for more details.
<span id='chapter-2.2'/>2.2 Preprocessing Dataset <a href='#all_catelogue'>[Back to Top]</a>
To extract training data suitable for LLM learning from the raw traffic datasets, we design specialized extractors to preprocess the traffic datasets for different tasks. The preprocessing code takes the following parameters to configure:
- input: The raw traffic dataset path (the main directory path that contains the raw traffic data)
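A minimal sketch of how such a preprocessing entry point might parse its `input` option is shown below; the script structure and option spelling are assumptions for illustration, not the repository's actual interface:

```python
import argparse

# Hypothetical command-line front end for a dataset extractor; only the
# documented `input` parameter is modeled here.
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Preprocess raw traffic datasets for TrafficLLM tuning")
    parser.add_argument("--input", required=True,
                        help="main directory containing the raw traffic dataset")
    return parser

# Example invocation with an assumed dataset directory name.
args = build_parser().parse_args(["--input", "datasets/ustc-tfc-2016"])
```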