
DeepEnlighten: Generalization from EQ to IQ

DeepEnlighten is a lightweight replication study of the DeepSeek-R1-Zero framework. This project investigates the use of pure reinforcement learning (RL) without supervised fine-tuning (SFT) to post-train base models for social reasoning capabilities.

It leverages the following key components:

  1. RL Framework: verl
  2. RL Algorithms: REINFORCE++
  3. RL Dataset: Social IQa
  4. Base Models: Qwen2.5 (3B), Llama3.2 (3B)
  5. Math Evaluation: DeepSeek-Math
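To make the algorithmic component concrete, here is a minimal sketch of the REINFORCE++-style advantage estimation: outcome rewards are normalized across the global batch and broadcast to every token of the trajectory. This is a simplification of verl's implementation, which also applies a per-token KL penalty against the reference model; function names are illustrative.

```python
# Sketch of REINFORCE++-style advantages: normalize sequence-level rewards
# across the global batch (no critic network, unlike PPO). Simplified; the
# real verl implementation also adds a token-level KL penalty.
from statistics import mean, stdev


def reinforce_pp_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """One scalar advantage per trajectory, mean-centered and scaled."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the baseline is the batch mean rather than a learned value function, this keeps training lightweight, which matches the project's goal of a low-cost R1-Zero replication.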

Dataset

Social IQa:

  • Designed to probe emotional and social intelligence in everyday scenarios.
  • Example:
    • Q: "Jordan wanted to tell Tracy a secret, so Jordan leaned towards Tracy. Why did Jordan do this?"
    • A: "To make sure no one else could hear."
  • Dataset preprocessing is implemented in DeepEnlighten/examples/data_preprocess/social_iqa.py.
  • Raw and processed datasets can be found in DeepEnlighten/data. Note that Llama3.2-Instruct and Qwen2.5-Instruct use different instruction tuning templates, so separate datasets are required for each.
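The preprocessing step can be sketched as follows, assuming the public Social IQa schema (`context`, `question`, `answerA`-`answerC`, 1-indexed `label`). The prompt template here is an assumption for illustration; the actual script emits model-specific chat templates, which is why separate datasets are needed per base model.

```python
# Hypothetical preprocessing mirroring examples/data_preprocess/social_iqa.py.
# Field names follow the public Social IQa schema; the prompt wording is an
# assumption, and the real script applies model-specific chat templates.
def make_example(sample: dict) -> dict:
    letters = ["A", "B", "C"]
    choices = [sample["answerA"], sample["answerB"], sample["answerC"]]
    options = "\n".join(f"{l}. {c}" for l, c in zip(letters, choices))
    prompt = (
        f"{sample['context']}\n"
        f"Question: {sample['question']}\n"
        f"{options}\n"
        "Answer with the letter of the best option."
    )
    # Social IQa labels are 1-indexed strings ("1".."3")
    answer = letters[int(sample["label"]) - 1]
    return {"prompt": prompt, "ground_truth": answer}
```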

Rule-Based Rewards

  • Reward modelling is implemented in DeepEnlighten/verl/utils/reward_score/socialiqa.py.
  • Rules:
    • Format Reward: +2 if valid, -2 if invalid.
    • Answer Reward: +2 if correct, -2 if incorrect, -3 if invalid.
    • Language Consistency Reward and other auxiliary rewards: not applied.
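The rules above can be sketched as a single scoring function. This is an illustration under assumed R1-Zero-style `<think>`/`<answer>` tags; see `verl/utils/reward_score/socialiqa.py` for the actual implementation and tag format.

```python
# Hypothetical rule-based reward mirroring the rules listed above.
# The <think>/<answer> tag format is an assumption; the real scorer lives in
# verl/utils/reward_score/socialiqa.py.
import re

FORMAT_RE = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)


def compute_reward(response: str, ground_truth: str) -> float:
    """Format reward: +2 valid / -2 invalid.
    Answer reward: +2 correct / -2 incorrect / -3 if no answer is extractable."""
    m = FORMAT_RE.search(response)
    if m is None:
        return -2.0 + -3.0  # invalid format and no extractable answer
    answer = m.group(1).strip()
    format_reward = 2.0
    answer_reward = 2.0 if answer == ground_truth.strip() else -2.0
    return format_reward + answer_reward
```

Summing the two components gives a reward in {-5, 0, +4}, so a correct, well-formatted response is cleanly separated from every failure mode.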

Training

After configuring your WandB, GPUs, and other settings, execute the training:

```bash
bash run_rl_trainer_xxx.sh
```

Key Findings

For details, refer to:

  • DeepEnlighten Training Report
  • analysis directory: Contains log analysis of CoT, language mixing, and "aha moment".
  • evaluation directory: Contains evaluation results on math benchmarks.

1. Generalization from EQ to IQ

  • Social reasoning can generalize to out-of-distribution (OOD) tasks requiring mathematical reasoning.

Table: Accuracy in Mathematical Reasoning CoT Tests

(Base Model = Llama3.2-3B-Instruct, 1000 Steps RL, Number of Samples in Parenthesis)

| Task | DeepEnlighten-3B | Llama3.2-3B-Instruct |
|----------------|------------------|----------------------|
| math-cot-test  | 0.4419 (3750)    | 0.2672 (3750)        |
| cmath-cot-test | 0.5995 (824)     | 0.5480 (823)         |
| gsm8k-cot-test | 0.7576 (330)     | 0.7660 (329)         |


2. Longer CoT and Overthinking Phenomenon

  • Longer CoT does not consistently appear across different experiments.
  • Longer CoT likely emerges only when the task is challenging; on easier tasks the model may resort to memorization rather than genuine reasoning.
  • Llama-Instruct as a base model tends to overthink in social reasoning, whereas the referenced paper suggests Llama-Instruct is the least likely to overthink in math reasoning.
  • Further experiments are required to validate these observations.

3. Longer CoT ≠ Higher EQ

  • While CoT becomes longer and the mean rewards increase, longer CoT does not correlate with higher accuracy.
  • This aligns with superficial self-reflection findings from OAT-ZERO.

Figures (Base Model = Llama3.2-3B-Instruct):

  • Left Figure: Answer accuracy versus token count distribution.
  • Right Figure: Regression analysis of accuracy against token count.
<div style="display: flex; justify-content: space-between; gap: 1px;"> <img src="analysis/Llama3.2-3B-Instruct/plots/barplot_answer_vs_tokens_20250312_150316.png" alt="Barplot: Answer Accuracy vs Token Count" style="width: 48%;"> <img src="analysis/Llama3.2-3B-Instruct/plots/regression_answer_vs_tokens_20250312_150316.png" alt="Regression: Answer Accuracy vs Token Count" style="width: 48%;"> </div>
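The right figure's analysis can be sketched as a least-squares fit of correctness against CoT length, assuming per-sample pairs of (thinking-token count, correct/incorrect) parsed from the logs; a near-zero or negative slope supports the "longer CoT ≠ higher EQ" finding.

```python
# Sketch of the regression in the right figure: fit accuracy (0/1 correctness)
# against CoT token count. Input format is an assumption; the actual analysis
# scripts live in the analysis directory.
import numpy as np


def accuracy_vs_length_fit(token_counts, correct):
    """Return (slope, intercept) of a degree-1 least-squares fit."""
    x = np.asarray(token_counts, dtype=float)
    y = np.asarray(correct, dtype=float)  # 1.0 = correct, 0.0 = incorrect
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept
```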

4. Language Mixing Does Exist

  • While language mixing is observed, it is not prevalent.
  • Example: "购买电影票是娱乐的行为,是一种人性性行为,反映了人 Seekingjoy, pleasure and entertainment's需要。" (roughly: "Buying a movie ticket is an act of entertainment, a kind of human behavior, reflecting people's need for seeking joy, pleasure and entertainment.")

Table: Language Distribution in Model Thinking

(Base Model = Llama3.2-3B-Instruct)

| Category                  | Count | Percentage |
|---------------------------|-------|------------|
| Only English              | 96674 | 98.23%     |
| Only Chinese              | 0     | 0.00%      |
| Mixed (English & Chinese) | 1727  | 1.75%      |
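The counts above can be reproduced with a simple script-based classifier, assuming detection of Chinese via the CJK Unified Ideographs range (an assumption; the actual counting logic lives in the analysis directory).

```python
# Sketch of the language classifier behind the table above: label each
# thinking trace by the scripts it contains. CJK detection via the Unicode
# "CJK Unified Ideographs" block is an assumption of this sketch.
def classify_language(text: str) -> str:
    has_cjk = any("\u4e00" <= ch <= "\u9fff" for ch in text)
    has_english = any(ch.isascii() and ch.isalpha() for ch in text)
    if has_cjk and has_english:
        return "mixed"
    if has_cjk:
        return "only_chinese"
    return "only_english"
```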


Acknowledgements

This project builds upon and references several open-source works:
