SelfIE: Self-Interpretation of Large Language Model Embeddings


This repository contains the code and data for the paper SelfIE: Self-Interpretation of Large Language Model Embeddings by Haozhe Chen, Carl Vondrick, and Chengzhi Mao.

Abstract

The expanding impact of Large Language Models (LLMs) increasingly requires an answer to the question: how do LLMs obtain their answers? The ability to explain and control an LLM's reasoning process is key for reliability, transparency, and future model development. We propose SelfIE (Self-Interpretation of Embeddings), which enables LLMs to interpret their own embeddings in natural language by leveraging their ability to respond to inquiries about a given passage. Capable of interpreting open-world concepts in hidden embeddings, SelfIE reveals LLM internal reasoning in cases such as making ethical decisions, internalizing prompt injections, and recalling harmful knowledge. SelfIE's text descriptions of hidden embeddings also open up new avenues for controlling LLM reasoning. We propose Supervised Control, which allows editing open-ended concepts while only requiring gradient computation for an individual layer. We also extend RLHF to hidden embeddings and propose Reinforcement Control, which erases harmful knowledge from an LLM without supervision targets.

Installation

To install selfie from the GitHub repository's main branch, run:

git clone https://github.com/tonychenxyz/selfie.git
cd selfie
pip install -e .

The code has been tested with transformers==4.34.0.

Quickstart

Load a model with Hugging Face Transformers. The library currently supports all LLaMA models.

from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "meta-llama/Llama-2-7b-chat-hf"  # any LLaMA checkpoint path
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

Create an interpretation prompt from a tuple. Placeholder positions, where hidden embeddings will be injected, are denoted with 0.

from selfie.interpret import InterpretationPrompt
interpretation_prompt = InterpretationPrompt(tokenizer, ("[INST]", 0, 0, 0, 0, 0, "[/INST] Sure, I will summarize the message:"))
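As a conceptual illustration only (not part of the selfie API, where the injected values are tensors rather than strings), each 0 in the tuple can be thought of as a slot that a hidden embedding fills at interpretation time. A plain-Python sketch of that substitution, using a hypothetical fill_template helper:

```python
def fill_template(template, embeddings):
    """Replace each 0 placeholder in the template with the next embedding.

    template mirrors the tuple passed to InterpretationPrompt; embeddings
    is a list of stand-ins for the hidden states to inject.
    """
    filled = []
    it = iter(embeddings)
    for part in template:
        filled.append(next(it) if part == 0 else part)
    return filled

template = ("[INST]", 0, 0, "[/INST] Sure, I will summarize the message:")
# Each placeholder is consumed in order by one embedding stand-in.
print(fill_template(template, ["<emb_0>", "<emb_1>"]))
```

In the actual library, the number of 0s in the tuple determines how many embedding slots the interpretation prompt has.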

Specify the original input prompt with original_prompt and the (layer, token index) pairs to interpret in tokens_to_interpret. Get the interpretations as a dictionary with interpret.

from selfie.interpret import interpret

original_prompt = "[INST] What's the highest mountain in the world? [/INST]"
tokens_to_interpret = [(10, 5), (10, 6)]  # (layer, token index) pairs
bs = 2
max_new_tokens = 10
k = 1

interpretation_df = interpret(
    original_prompt=original_prompt,
    tokens_to_interpret=tokens_to_interpret,
    model=model,
    interpretation_prompt=interpretation_prompt,
    bs=bs,
    max_new_tokens=max_new_tokens,
    k=k,
    tokenizer=tokenizer,
)
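Since each entry in tokens_to_interpret is a (layer, token index) pair, (10, 5) requests the hidden state of token 5 at layer 10. To interpret a whole span of tokens at one layer, the list can be built programmatically (a sketch; in practice the token count would come from tokenizing original_prompt):

```python
layer = 10      # layer whose hidden states we want to interpret
n_tokens = 8    # assumed token count of the original prompt
tokens_to_interpret = [(layer, t) for t in range(n_tokens)]
print(tokens_to_interpret[:3])  # → [(10, 0), (10, 1), (10, 2)]
```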

See full example code in demo.ipynb.

Reasoning Control

Check out the notebooks in the examples directory for examples of supervised and reinforcement control.

Citation

If you find this repository helpful, please consider citing our paper:

@misc{chen2024selfie,
      title={SelfIE: Self-Interpretation of Large Language Model Embeddings}, 
      author={Haozhe Chen and Carl Vondrick and Chengzhi Mao},
      year={2024},
      eprint={2403.10949},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
