LLMDataDistill

Distill large-scale web page text

Install / Use

/learn @EastTower16/LLMDataDistill

LLM Data Distill

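Preparing the data

Taking the WuDao dataset as an example, download the open-source WuDao corpus (200 GB): https://www.scidb.cn/en/detail?dataSetId=c6a3fe684227415a9db8e21bac4a15ab
After downloading, extract it to a directory.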
The approximate number of documents per category is as follows (keys are the dataset's own category labels): {'经济': 1055142, '娱乐': 1538285, '文化': 609237, '军事': 411410, '游戏': 742239, '汽车': 1308636, '科技': 1219031, '农业': 1074709, '体育': 648958, '国际': 596095, '教育': 1051980, '社会': 433914, '旅行': 746855, '房产': 378339, '法律': 34910, '股票': 1134, '豆瓣话题': 169600, '博客': 11628495, '日报': 13571, '评论': 10757, '酒业': 230, '资讯': 1049332, '科普文章': 47066, '孕育常识': 39660, '百科': 9456851, '小红书攻略': 153601, '经验': 456112, '财经': 54040, '健康': 14992, '医学问答': 252159, '亲子': 35, '网页文本': 98745, '新闻': 828978, '生活': 22, '百家号文章': 333216, '黄金': 215, '时尚': 1113, '文旅': 2910, '观点': 1242, '党建': 1, '保险': 70, '期货': 328, '理论': 209, '快讯': 41, '国内': 14, '美容': 7, '国学': 603, '信托': 62, '公益': 14, '能源': 7, '创新': 6, '户外': 5, '海外': 4, '天气': 538, '水利资讯': 9}
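
A tally like the one above could be reproduced with a short script. The following is a minimal sketch, not part of this repository. It assumes the extracted directory contains *.json files that each hold one JSON array of records, and that the category label is stored under a "dataType" field. Both are assumptions about the corpus layout and should be checked against the downloaded files. The function name count_categories is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def count_categories(wudao_dir):
    """Tally documents per category across every *.json file in wudao_dir."""
    counts = Counter()
    for path in sorted(Path(wudao_dir).glob("*.json")):
        # Assumption: each file holds a single JSON array of document records.
        with open(path, encoding="utf-8") as f:
            records = json.load(f)
        # Assumption: the category label lives under the "dataType" field.
        counts.update(rec.get("dataType", "unknown") for rec in records)
    return counts
```

Calling count_categories("<downloaded_wudao_dir>").most_common(5) would then list the five largest categories. Each file is loaded whole, so peak memory follows the largest single file rather than the full corpus.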

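Deduplication

Blog (博客) is the largest category in the dataset and also the lowest in quality, so by default the program deduplicates the blog category.
Install Bazel 5.1 or later.
Install and configure a CUDA driver, version 11.0 or later.
Build the deduplication program: bazel build --config=cuda dedup:dedup_main
Run: bazel-bin/dedup/dedup_main --wudao_dir <downloaded_wudao_dir> --output_path <dup_key_path>
Memory usage is about 16 GB.
The program writes the keys of the duplicate documents to the output file.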
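
To illustrate what "output the keys of duplicates" means, here is a small CPU-only sketch of exact-duplicate detection by content hash. It is not the algorithm used by dedup_main, which this README does not describe and which runs on the GPU. The field names "dataType", "content", and "uniqueKey", the one-array-per-file layout, and the function names are all assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def normalize(text):
    """Drop all whitespace so copies that differ only in spacing hash alike."""
    return "".join(text.split())


def find_duplicate_keys(wudao_dir, category="博客"):
    """Return keys of documents whose normalized content was already seen."""
    seen = set()
    dup_keys = []
    for path in sorted(Path(wudao_dir).glob("*.json")):
        # Assumption: each file holds a single JSON array of document records.
        with open(path, encoding="utf-8") as f:
            records = json.load(f)
        for rec in records:
            # Only the chosen category is checked (blog by default).
            if rec.get("dataType") != category:
                continue
            data = normalize(rec.get("content", "")).encode("utf-8")
            digest = hashlib.blake2b(data, digest_size=16).digest()
            if digest in seen:
                # Assumption: "uniqueKey" is the field that identifies a record.
                dup_keys.append(rec["uniqueKey"])
            else:
                seen.add(digest)
    return dup_keys
```

The first occurrence of each text is kept and every later copy is reported. Exact hashing misses near-duplicates such as pages with small edits, so it should be read only as an illustration of the output format.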

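Filtering low-quality content

Low-quality detection currently handles only web pages with heavy marketing content.
Build: bazel build --config=cuda distill:distill_page_main
Download the trained low-quality model: https://pan.baidu.com/s/1RvSl1mfUGXJ3Z2WoF8vzdw?pwd=qsbi (extraction code: qsbi)
Run: bazel-bin/distill/distill_page_main --wudao_dir <downloaded_wudao_dir> --model_path <download_model_path> --distilled_output_path <output_keys_path>
By default, the output is the IDs of documents with a marketing tendency.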
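
Once a list of flagged documents exists, it can be applied to the corpus to keep only the remaining records. The sketch below is one possible way to do that. It assumes the output file contains one key or ID per line and that records carry that value under a field such as "uniqueKey". The README specifies neither, so both are assumptions, and the helper names are hypothetical.

```python
def load_keys(path):
    """Read one key per line, skipping blank lines (assumed file layout)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def drop_flagged(records, flagged, field="uniqueKey"):
    """Keep only records whose `field` value is not in the flagged set."""
    # Values are compared as strings because keys read from a file are strings.
    return [rec for rec in records if str(rec.get(field)) not in flagged]
```

For example, drop_flagged(records, load_keys("<output_keys_path>")) would return the records that were not flagged. If the output holds document IDs rather than keys, pass the matching field name through the field argument.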

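Limitations

For speed, the program currently has to run on a CUDA-capable GPU.
It currently targets the open-source WuDao dataset; more flexible configuration is needed later to adapt to custom corpora.
The low-quality model needs to support more types of low-quality content.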
