PyVideoProc
🚀 High-Performance Multi-Stream Video Processing in Python with CUDA
Install / Use
/learn @lmk123568/PyVideoProcREADME
PyVideoProc 提供基于 CUDA 加速优化的高性能 Python SDK,可高效实现多路、多卡、多模型的视频解码、AI 推理与编码,显著降低开发复杂度并提升吞吐性能
PyVideoProc provides a high-performance Python SDK optimized with CUDA acceleration, enabling efficient multi-stream, multi-GPU, and multi-model video decoding, AI inference, and encoding—significantly reducing development complexity and boosting throughput performance.
https://github.com/user-attachments/assets/402f9080-004e-457e-8d36-3fefdb56f21d
⭐ 多进程绕过 GIL 限制,提升 Python 并发性能
⭐ 减少 Host-Device 数据传输,降低 GPU 显存冗余拷贝,提升推理速度
⭐ 尽可能在 GPU 上计算,以降低 CPU 计算负担
⭐ 开箱即用,简单易懂,扩展性强,适合中小型项目快速部署
| | Open Source | Learning Curve | Developer Friendliness | Performance | Architecture Design | | :-------------------------------------------------------: | :---------: | :------------------------------: | :------------------------------: | :---------: | :----------------------------: | | DeepStream | ❌ | High | Low | High | Single-process, multi-threaded | | VideoPipe | ✅ | medium(requires cpp knowledge) | Medium(requires cpp knowledge) | Medium | Single-process, multi-threaded | | Our | ✅ | ≈ 0 | High | Medium | Multi-process, single-threaded |
🔗 Bilibili: https://www.bilibili.com/video/BV12TcvzbEcZ
🔗 YouTube: https://youtu.be/-soVRjH1rb4
Quick Start
本项目推荐 Docker 容器运行,首先确保本地环境满足以下三个条件:
- Docker >= 24.0.0
- NVIDIA Driver >= 590
- NVIDIA Container Toolkit >= 1.13.0
1. 生成镜像
clone 本项目,生成包含完整开发环境的镜像
git clone https://github.com/lmk123568/PyVideoProc.git
cd PyVideoProc/docker
docker build -t pyvideoproc:cuda12.8 .
镜像生成后,进入容器,不报错即成功
docker run -it \
--gpus all \
-e NVIDIA_DRIVER_CAPABILITIES=all \
-v {your_path}/PyVideoProc:/workspace \
pyvideoproc:cuda12.8 \
bash
后续示例代码默认在容器内/workspace运行
⚠️ 不推荐自己本地装环境,如果一定要自己装,请参考 Dockerfile
2. 编译加速包
python scripts/setup.py install
这里面包含了硬件编解码、YOLO26 推理优化的 C++ 实现,并通过 Pybind11 给 Python 调用
3. 训练模型权重转换
将通过 ultralytics 训练的pt模型导入到当前目录(/workspace)下(示例模型为 yolo26n.pt)
python scripts/pt2trt.py --w ./yolo26n.pt --fp16
转换过程中会与 ultralytics 官方结果进行推理对齐
💡 TensorRT 编译生成 .engine 过程中,推理尺寸默认设置为
(576,1024),可以跳过letterbox降低计算开销
💡 遇到警告
requirements: Ultralytics requirement ['onnxruntime-gpu'] not found ...可以Ctrl + C跳过
4. 运行
开启 MPS(Multi-Process Service)
nvidia-cuda-mps-control -d
# echo quit | nvidia-cuda-mps-control 关闭 MPS
阅读理解其代码并运行
python main.py
Benchmark
测试日期: 2026-01-25
测试硬件: AMD Ryzen 9 5950 X + NVIDIA GeForce RTX 3090
测试任务: 4 × RTSP Decoders → YOLO26 (TensorRT) → 4 × RTMP Encoders
| | CPU | RAM | GPU VRAM | GPU-Util | | ------------------------- | ------- | ------- | -------- | ------------ | | VideoPipe(ffmpeg codec) | 511.6 % | 1.5 GiB | 2677 MiB | 16 % | | Our | 40 % | 1.2GiB | 3932 MiB | 12 % |
