# DeepEnlighten: Generalization from EQ to IQ
DeepEnlighten is a lightweight replication study of the DeepSeek-R1-Zero framework. This project investigates the use of pure reinforcement learning (RL) without supervised fine-tuning (SFT) to post-train base models for social reasoning capabilities.
It leverages the following key components:
- RL Framework: verl
- RL Algorithms: REINFORCE++
- RL Dataset: Social IQa
- Base Models: Qwen2.5 (3B), Llama3.2 (3B)
- Math Evaluation: DeepSeek-Math
## Dataset
Social IQa:
- Designed to probe emotional and social intelligence in everyday scenarios.
- Example:
- Q: "Jordan wanted to tell Tracy a secret, so Jordan leaned towards Tracy. Why did Jordan do this?"
- A: "To make sure no one else could hear."
- Dataset preprocessing is implemented in `DeepEnlighten/examples/data_preprocess/social_iqa.py`.
- Raw and processed datasets can be found in `DeepEnlighten/data`. Note that Llama3.2-Instruct and Qwen2.5-Instruct use different instruction-tuning templates, so separate datasets are required for each.
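For illustration, a minimal sketch of what such template-specific preprocessing might look like. The field names (`context`, `question`, `answerA`/`answerB`/`answerC`) follow the public Social IQa release, and the special tokens follow the Qwen2.5 and Llama 3 chat formats, but the instruction wording and helper names are assumptions, not the repo's actual code:

```python
# Sketch: turn one Social IQa record into a model-specific prompt string.
# Because Qwen2.5-Instruct and Llama3.2-Instruct use different chat
# templates, the same record yields two different training prompts.

QWEN_TEMPLATE = "<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
LLAMA_TEMPLATE = (
    "<|start_header_id|>user<|end_header_id|>\n\n{prompt}"
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)

def build_prompt(example: dict, template: str) -> str:
    """Format a Social IQa record as a multiple-choice question."""
    question = (
        f"{example['context']} {example['question']}\n"
        f"A. {example['answerA']}\n"
        f"B. {example['answerB']}\n"
        f"C. {example['answerC']}\n"
        "Think step by step, then give the letter of the best answer."
    )
    return template.format(prompt=question)
```

In practice the preprocessing script would map this over the raw dataset once per base model and save the two resulting prompt sets separately.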
## Rule-Based Rewards
- Reward modelling is implemented in `DeepEnlighten/verl/utils/reward_score/socialiqa.py`.
- Rules:
  - Format Reward: +2 if valid, -2 if invalid.
  - Answer Reward: +2 if correct, -2 if incorrect, -3 if invalid.
  - Language Consistency Reward or Others: not applied.
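The rules above can be sketched as a reward function. This is an illustrative reconstruction, not the repo's actual `socialiqa.py`; in particular, the `<think>`/`<answer>` tag format is an assumption:

```python
import re

def format_reward(response: str) -> float:
    """+2 if the response follows the expected format, -2 otherwise.
    The <think>...</think><answer>...</answer> layout is assumed here."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$"
    return 2.0 if re.match(pattern, response, re.DOTALL) else -2.0

def answer_reward(response: str, ground_truth: str) -> float:
    """+2 if the extracted answer matches the gold answer, -2 if it is
    wrong, -3 if no answer can be extracted at all (invalid)."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if m is None:
        return -3.0
    return 2.0 if m.group(1).strip().lower() == ground_truth.strip().lower() else -2.0

def total_reward(response: str, ground_truth: str) -> float:
    """Combine the two rule-based rewards."""
    return format_reward(response) + answer_reward(response, ground_truth)
```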
## Training

After configuring your WandB, GPUs, and other settings, run the training script:

`bash run_rl_trainer_xxx.sh`
## Key Findings

For details, refer to:

- DeepEnlighten Training Report
- `analysis` directory: log analysis of CoT, language mixing, and the "aha moment"
- `evaluation` directory: evaluation results on math benchmarks
### 1. Generalization from EQ to IQ
- Social reasoning can generalize to out-of-distribution (OOD) tasks requiring mathematical reasoning.
Table: Accuracy in Mathematical Reasoning CoT Tests
(Base Model = Llama3.2-3B-Instruct, 1000 Steps RL, Number of Samples in Parentheses)

| Task | DeepEnlighten-3B | Llama3.2-3B-Instruct |
|------|------------------|----------------------|
| math-cot-test | 0.4419 (3750) | 0.2672 (3750) |
| cmath-cot-test | 0.5995 (824) | 0.5480 (823) |
| gsm8k-cot-test | 0.7576 (330) | 0.7660 (329) |
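As an illustrative sanity check (not part of the repo), the math-cot-test gap in the table (0.4419 vs. 0.2672 over 3750 samples each) can be tested for significance with a standard two-proportion z-test using only the standard library:

```python
from math import sqrt, erf

def two_proportion_z(p1: float, n1: int, p2: float, n2: int):
    """Two-sided z-test for the difference between two proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# math-cot-test numbers from the table above
z, p = two_proportion_z(0.4419, 3750, 0.2672, 3750)
```

With these sample sizes the z-statistic is well above conventional thresholds, so the math-cot-test improvement is unlikely to be sampling noise.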
### 2. Longer CoT and Overthinking Phenomenon
- Longer CoT does not consistently appear across different experiments.
- Longer CoT likely emerges only when the task is challenging; on easier tasks the model may fall back on memorization rather than genuine reasoning.
- Llama-Instruct as a base model tends to overthink in social reasoning, even though this paper suggests that Llama-Instruct is the least likely to overthink in math reasoning.
- Further experiments are required to validate these observations.
### 3. Longer CoT ≠ Higher EQ
- While CoT grows longer and mean rewards increase over training, longer CoT does not correlate with higher answer accuracy.
- This aligns with superficial self-reflection findings from OAT-ZERO.
Figures (Base Model = Llama3.2-3B-Instruct):
- Left Figure: Answer accuracy versus token count distribution.
- Right Figure: Regression analysis of accuracy against token count.
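An analysis like the right figure can be approximated with a point-biserial (Pearson) correlation between token count and per-sample correctness. The sketch below uses toy data for illustration only; on real logs, a negative `r` would indicate longer CoT co-occurring with lower accuracy:

```python
def pearson_r(xs: list, ys: list) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Toy data: CoT token counts and 0/1 correctness flags (illustrative only).
tokens = [120, 300, 450, 600, 800, 950]
correct = [1, 1, 0, 1, 0, 0]
r = pearson_r(tokens, correct)
```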
### 4. Language Mixing Does Exist
- While language mixing is observed, it is not prevalent.
- Example: "购买电影票是娱乐的行为,是一种人性性行为,反映了人 Seekingjoy, pleasure and entertainment's需要。" (roughly: "Buying a movie ticket is an act of entertainment, a human behavior that reflects people's need for seeking joy, pleasure, and entertainment.")
Table: Language Distribution in Model Thinking
(Base Model = Llama3.2-3B-Instruct)
| Category | Count | Percentage |
|----------|-------|------------|
| Only English | 96674 | 98.23% |
| Only Chinese | 0 | 0.00% |
| Mixed (English & Chinese) | 1727 | 1.75% |
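A table like this can be produced by a simple script-based classifier over the model's thinking traces. The sketch below is an assumption about how the categories might be assigned (CJK Unified Ideographs as a proxy for Chinese, ASCII letters as a proxy for English), not the repo's actual analysis code:

```python
def classify_language(text: str) -> str:
    """Bucket a thinking trace into the table's categories by script.
    CJK Unified Ideographs stand in for Chinese and ASCII letters for
    English; this is a deliberate simplification for illustration."""
    has_zh = any("\u4e00" <= ch <= "\u9fff" for ch in text)
    has_en = any(ch.isascii() and ch.isalpha() for ch in text)
    if has_zh and has_en:
        return "Mixed (English & Chinese)"
    if has_zh:
        return "Only Chinese"
    if has_en:
        return "Only English"
    return "Other"
```

Running this over every trace and counting the buckets would reproduce the distribution above.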
## Acknowledgements
This project builds upon and references several open-source works:
- Logic-RL-Lite: Reproduction of R1-Zero on logic puzzles.
- verl Framework: Reinforcement learning framework.
- DeepSeek-Math: Mathematical reasoning benchmarks.
- Social IQa Dataset: Social reasoning dataset.
