# DeepEnlighten: Generalization from EQ to IQ
DeepEnlighten is a lightweight replication study of the DeepSeek-R1-Zero framework. This project investigates the use of pure reinforcement learning (RL) without supervised fine-tuning (SFT) to post-train base models for social reasoning capabilities.
It leverages the following key components:
- RL Framework: verl
- RL Algorithms: REINFORCE++
- RL Dataset: Social IQa
- Base Models: Qwen2.5 (3B), Llama3.2 (3B)
- Math Evaluation: DeepSeek-Math
## Dataset
Social IQa:
- Designed to probe emotional and social intelligence in everyday scenarios.
- Example:
- Q: "Jordan wanted to tell Tracy a secret, so Jordan leaned towards Tracy. Why did Jordan do this?"
- A: "To make sure no one else could hear."
- Dataset preprocessing is implemented in `DeepEnlighten/examples/data_preprocess/social_iqa.py`.
- Raw and processed datasets can be found in `DeepEnlighten/data`. Note that Llama3.2-Instruct and Qwen2.5-Instruct use different instruction-tuning templates, so separate datasets are required for each.
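For illustration, a minimal sketch of what such template-specific preprocessing might look like. The field names (`context`, `question`, `answerA`/`answerB`/`answerC`) follow the public Social IQa release, and the special tokens follow the Qwen2.5 and Llama 3 chat formats, but the instruction wording and helper names are assumptions, not the repo's actual code:

```python
# Sketch: turn one Social IQa record into a model-specific prompt string.
# Because Qwen2.5-Instruct and Llama3.2-Instruct use different chat
# templates, the same record yields two different training prompts.

QWEN_TEMPLATE = "<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
LLAMA_TEMPLATE = (
    "<|start_header_id|>user<|end_header_id|>\n\n{prompt}"
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)

def build_prompt(example: dict, template: str) -> str:
    """Format a Social IQa record as a multiple-choice question."""
    question = (
        f"{example['context']} {example['question']}\n"
        f"A. {example['answerA']}\n"
        f"B. {example['answerB']}\n"
        f"C. {example['answerC']}\n"
        "Think step by step, then give the letter of the best answer."
    )
    return template.format(prompt=question)
```

In practice the preprocessing script would map this over the raw dataset once per base model and save the two resulting prompt sets separately.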
## Rule-Based Rewards
- Reward modelling is implemented in `DeepEnlighten/verl/utils/reward_score/socialiqa.py`.
- Rules:
  - Format Reward: +2 if valid, -2 if invalid.
  - Answer Reward: +2 if correct, -2 if incorrect, -3 if invalid.
  - Language Consistency Reward or Others: not applied.
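The rules above can be sketched as a reward function. This is an illustrative reconstruction, not the repo's actual `socialiqa.py`; in particular, the `<think>`/`<answer>` tag format is an assumption:

```python
import re

def format_reward(response: str) -> float:
    """+2 if the response follows the expected format, -2 otherwise.
    The <think>...</think><answer>...</answer> layout is assumed here."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$"
    return 2.0 if re.match(pattern, response, re.DOTALL) else -2.0

def answer_reward(response: str, ground_truth: str) -> float:
    """+2 if the extracted answer matches the gold answer, -2 if it is
    wrong, -3 if no answer can be extracted at all (invalid)."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if m is None:
        return -3.0
    return 2.0 if m.group(1).strip().lower() == ground_truth.strip().lower() else -2.0

def total_reward(response: str, ground_truth: str) -> float:
    """Combine the two rule-based rewards."""
    return format_reward(response) + answer_reward(response, ground_truth)
```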
## Training

After configuring your WandB, GPUs, and other settings, run the training script:

`bash run_rl_trainer_xxx.sh`
## Key Findings

For details, refer to:

- DeepEnlighten Training Report
- `analysis` directory: log analysis of CoT, language mixing, and the "aha moment"
- `evaluation` directory: evaluation results on math benchmarks
### 1. Generalization from EQ to IQ
- Social reasoning can generalize to out-of-distribution (OOD) tasks requiring mathematical reasoning.
Table: Accuracy in Mathematical Reasoning CoT Tests
(Base Model = Llama3.2-3B-Instruct, 1000 Steps RL, Number of Samples in Parentheses)

| Task | DeepEnlighten-3B | Llama3.2-3B-Instruct |
|------|------------------|----------------------|
| math-cot-test | 0.4419 (3750) | 0.2672 (3750) |
| cmath-cot-test | 0.5995 (824) | 0.5480 (823) |
| gsm8k-cot-test | 0.7576 (330) | 0.7660 (329) |
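As an illustrative sanity check (not part of the repo), the math-cot-test gap in the table (0.4419 vs. 0.2672 over 3750 samples each) can be tested for significance with a standard two-proportion z-test using only the standard library:

```python
from math import sqrt, erf

def two_proportion_z(p1: float, n1: int, p2: float, n2: int):
    """Two-sided z-test for the difference between two proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# math-cot-test numbers from the table above
z, p = two_proportion_z(0.4419, 3750, 0.2672, 3750)
```

With these sample sizes the z-statistic is well above conventional thresholds, so the math-cot-test improvement is unlikely to be sampling noise.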
### 2. Longer CoT and Overthinking Phenomenon
- Longer CoT does not consistently appear across different experiments.
- Longer CoT likely emerges only when the task is challenging; on easier tasks the model may fall back on memorization rather than genuine reasoning.
- Llama-Instruct as a base model tends to overthink in social reasoning, even though this paper suggests that Llama-Instruct is the least likely to overthink in math reasoning.
- Further experiments are required to validate these observations.
### 3. Longer CoT ≠ Higher EQ
- While CoT grows longer and mean rewards increase over training, longer CoT does not correlate with higher answer accuracy.
- This aligns with superficial self-reflection findings from OAT-ZERO.
Figures (Base Model = Llama3.2-3B-Instruct):
- Left Figure: Answer accuracy versus token count distribution.
- Right Figure: Regression analysis of accuracy against token count.
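An analysis like the right figure can be approximated with a point-biserial (Pearson) correlation between token count and per-sample correctness. The sketch below uses toy data for illustration only; on real logs, a negative `r` would indicate longer CoT co-occurring with lower accuracy:

```python
def pearson_r(xs: list, ys: list) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Toy data: CoT token counts and 0/1 correctness flags (illustrative only).
tokens = [120, 300, 450, 600, 800, 950]
correct = [1, 1, 0, 1, 0, 0]
r = pearson_r(tokens, correct)
```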
### 4. Language Mixing Does Exist
- While language mixing is observed, it is not prevalent.
- Example: "购买电影票是娱乐的行为,是一种人性性行为,反映了人 Seekingjoy, pleasure and entertainment's需要。" (roughly: "Buying a movie ticket is an act of entertainment, a human behavior that reflects people's need for seeking joy, pleasure, and entertainment.")
Table: Language Distribution in Model Thinking
(Base Model = Llama3.2-3B-Instruct)
| Category | Count | Percentage |
|----------|-------|------------|
| Only English | 96674 | 98.23% |
| Only Chinese | 0 | 0.00% |
| Mixed (English & Chinese) | 1727 | 1.75% |
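A table like this can be produced by a simple script-based classifier over the model's thinking traces. The sketch below is an assumption about how the categories might be assigned (CJK Unified Ideographs as a proxy for Chinese, ASCII letters as a proxy for English), not the repo's actual analysis code:

```python
def classify_language(text: str) -> str:
    """Bucket a thinking trace into the table's categories by script.
    CJK Unified Ideographs stand in for Chinese and ASCII letters for
    English; this is a deliberate simplification for illustration."""
    has_zh = any("\u4e00" <= ch <= "\u9fff" for ch in text)
    has_en = any(ch.isascii() and ch.isalpha() for ch in text)
    if has_zh and has_en:
        return "Mixed (English & Chinese)"
    if has_zh:
        return "Only Chinese"
    if has_en:
        return "Only English"
    return "Other"
```

Running this over every trace and counting the buckets would reproduce the distribution above.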
## Acknowledgements
This project builds upon and references several open-source works:
- Logic-RL-Lite: Reproduction of R1-Zero on logic puzzles.
- verl Framework: Reinforcement learning framework.
- DeepSeek-Math: Mathematical reasoning benchmarks.
- Social IQa Dataset: Social reasoning dataset.
