ComfyUI-IPAdapter-WAN 2.0
This extension adapts the InstantX IP-Adapter for SD3.5-Large to work with Wan 2.1 and other UNet-based video/image models in ComfyUI.
Unlike the original SD3 version (which depends on joint_blocks from MMDiT), this version performs sampling-time identity conditioning by dynamically injecting into attention layers — making it compatible with models like Wan 2.1, AnimateDiff, and other non-SD3 pipelines.
🚀 Features
Core Functionality
- 🔁 Injects identity embeddings during sampling via attention block patching
- 🧠 Works with Wan 2.1 and other UNet-style models (no SD3/MMDiT required)
- 🛠️ Built on top of ComfyUI's IPAdapter framework
- 🎨 Enables consistent face/identity across frames in video workflows
Version 2.0 Upgrades
- ⏱️ Timestep-based Scheduling: Fine-grained control over when IP conditioning is active during denoising (start/end percentage)
- 🎯 Type-Safe Architecture: Extensive type hints, dataclasses, and runtime validation for better reliability
- 💾 Enhanced Memory Management: FP16 by default, explicit device placement, optimized batched processing
- 🔧 Structured Configuration: Centralized config system with `IPAdapterConfig` and `VisionEncoderConfig`
- 📊 Time-Conditioned Resampler: Advanced `TimeResampler` with Adaptive Layer Normalization (adaLN) for timestep-aware conditioning
- 🛡️ Fail-Fast Validation: Input validation at construction time to catch errors early
- 📚 Comprehensive Documentation: Detailed docstrings with computational complexity analysis
- 🎨 True Dynamic Resolution Support: Perceiver-style resampler architecture enables native support for variable-resolution inputs from SigLIP2 so400m naflex - no fixed sequence length assumptions
📦 Installation
- Clone the repo into your ComfyUI custom nodes directory:
  `git clone https://github.com/your-username/ComfyUI-IPAdapter-WAN.git`
- Download the IP-Adapter weights and place them in:
  `ComfyUI/models/ipadapter/`
- Download the CLIP Vision model. SigLIP2 so400m patch16 naflex (recommended): `google/siglip2-so400m-patch16-naflex`. Place the model file in:
  `ComfyUI/models/clip_vision/`
  This model supports dynamic resolution with native flexible (naflex) attention, providing better performance for variable aspect ratios.
- Note: this node only supports SigLIP2 so400m patch16 naflex. The vision encoder type is automatically set to SigLIP2; no configuration is needed.
Dynamic Resolution: The IPAdapter architecture fully supports variable-resolution inputs. The Perceiver-style resampler compresses any sequence length from SigLIP2 to a fixed number of queries, enabling true dynamic resolution support without hardcoded size assumptions.
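The compression idea described above can be sketched in a few lines. This is a minimal illustration of a Perceiver-style, timestep-conditioned resampler, not the extension's actual `TimeResampler`: the class name, layer layout, and toy dimensions here are all assumptions for demonstration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeResamplerSketch(nn.Module):
    """Sketch of a Perceiver-style resampler: compresses a variable-length
    sequence of vision tokens to a fixed set of learned queries, with the
    queries modulated by a timestep embedding via adaptive layer norm (adaLN)."""

    def __init__(self, dim: int = 256, num_queries: int = 64, t_dim: int = 256):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # adaLN: the timestep embedding predicts a per-channel scale and shift
        self.ada = nn.Linear(t_dim, 2 * dim)
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, tokens: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) with arbitrary N; t_emb: (B, t_dim)
        b = tokens.shape[0]
        scale, shift = self.ada(t_emb).chunk(2, dim=-1)           # (B, dim) each
        q = self.queries.unsqueeze(0).expand(b, -1, -1)           # (B, Q, dim)
        q = self.norm(q) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        k, v = self.to_kv(tokens).chunk(2, dim=-1)
        out = F.scaled_dot_product_attention(self.to_q(q), k, v)  # cross-attention
        return self.proj(out)                                     # (B, Q, dim), independent of N
```

The key property is that the output shape depends only on the number of queries, so a 100-token and a 700-token SigLIP2 output both compress to the same fixed-size identity representation.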
🧠 How It Works
Wan models use a UNet structure instead of the DiT transformer blocks used in SD3. To make IPAdapter work with Wan:
- The extension dynamically scans all attention blocks (modules with `.to_q` and `.to_k`).
- It injects IPAdapter's attention processors (`IPAttnProcessor`) directly into those blocks.
- Version 2.0 Enhancement: identity embeddings are updated based on the current sampling timestep using a time-conditioned resampler (`TimeResampler`) with:
  - Adaptive Layer Normalization (adaLN) for timestep-aware feature modulation
  - Perceiver-style cross-attention for efficient compression of vision embeddings, enabling true dynamic resolution support by compressing variable-length sequences to fixed queries
  - Timestep embedding integration for denoising-aware conditioning
- Dynamic Resolution Support: the resampler architecture is specifically designed to handle variable-length vision encoder outputs (from SigLIP2 so400m naflex). It compresses any sequence length to a fixed number of queries (64), making it resolution-agnostic. This means:
  - ✅ Works with any input resolution/aspect ratio from SigLIP2
  - ✅ No hardcoded sequence length assumptions
  - ✅ Efficient O(n) scaling with variable token counts
- Timestep Scheduling: the extension respects the `start_percent` and `end_percent` parameters, only applying IP conditioning during the specified denoising range.
This means it works without requiring joint_blocks or specific architectural assumptions — making it plug-and-play for many custom models.
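The scan-and-inject step above can be sketched as follows. This is an illustrative outline, assuming only what the text states (attention blocks are identified by having `.to_q` and `.to_k`); the `make_processor` factory is hypothetical, standing in for the extension's real `IPAttnProcessor` construction.

```python
import torch.nn as nn

def patch_attention_blocks(model: nn.Module, make_processor):
    """Walk the module tree and attach an IP-Adapter style processor to
    every attention block, identified by the presence of .to_q and .to_k.
    No architecture-specific assumptions (joint_blocks, MMDiT) are needed."""
    patched = []
    for name, module in model.named_modules():
        if hasattr(module, "to_q") and hasattr(module, "to_k"):
            # The processor replaces the block's default attention path
            module.processor = make_processor(name)
            patched.append(name)
    return patched
```

Because the check is purely structural, the same loop finds attention blocks in Wan 2.1, AnimateDiff, or any other UNet-style model.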
🛠 Usage
- In ComfyUI, use the following nodes:
  - `Load IPAdapter WAN Model`: loads the IP-Adapter checkpoint with the SigLIP2 so400m configuration
  - `Apply IPAdapter WAN Model`: applies IP conditioning to your diffusion model
- Connect the `CLIP Vision` embedding (from a face image) and your diffusion model to the adapter.
  - Note: this node automatically uses the SigLIP2 so400m configuration; no vision encoder selection is needed.
- New in 2.0: configure timestep scheduling:
  - `start_percent` (default: 0.0): when to start applying IP conditioning (0.0 = beginning of denoising)
  - `end_percent` (default: 1.0): when to stop applying IP conditioning (1.0 = end of denoising)
  - This allows fine-grained control over when identity conditioning is active during the denoising process.
- Use a weight of ~0.5 as a good starting point (adjustable from -1.0 to 5.0).
- You can apply this in video workflows to maintain consistent identity across frames.
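The scheduling semantics can be expressed as a simple gate on denoising progress. The helper below is hypothetical (the extension exposes `start_percent`/`end_percent` as node inputs, not this function), but it shows the intended behavior of the range check.

```python
def ip_conditioning_active(step: int, total_steps: int,
                           start_percent: float = 0.0,
                           end_percent: float = 1.0) -> bool:
    """Return True when IP conditioning should apply at this sampling step.

    Progress runs from 0.0 (first denoising step) to 1.0 (last step);
    conditioning is active only inside [start_percent, end_percent].
    """
    progress = step / max(total_steps - 1, 1)
    return start_percent <= progress <= end_percent
```

For example, `start_percent=0.0, end_percent=0.5` applies identity conditioning only during the first half of denoising, when coarse structure (and face layout) is being decided.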
📁 Example Workflows
Example .json workflows will be available soon in the workflows/ folder.
✅ Compatibility
| Model            | Status              |
| ---------------- | ------------------- |
| Wan 2.1          | ✅ Works            |
| AnimateDiff      | ✅ Works            |
| SD3 / SDXL       | ❌ Use original repo |
| Any UNet variant | ✅ Likely to work   |
✨ Version 2.0 Technical Improvements
Architecture Enhancements
- Type-Safe Design: Protocol-based interfaces, dataclasses for configuration, comprehensive type hints
- Memory Optimization: FP16 by default, efficient batched processing, zero-copy state sharing
- Error Handling: Fail-fast validation with clear error messages
- Modular Configuration: Centralized config system enabling easy hyperparameter tuning
Performance Optimizations
- Efficient Attention: Uses `F.scaled_dot_product_attention` for hardware-accelerated attention
- Memory-Efficient Reshaping: Fused tensor operations to minimize memory copies
- Lazy Evaluation: IP conditioning computed only when active (based on timestep schedule)
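The attention pattern these optimizations serve can be sketched as two SDPA calls: the base cross-attention plus a second attention over the identity tokens, blended by the node's weight. This is a simplified single-head sketch, not the extension's exact processor code.

```python
import torch
import torch.nn.functional as F

def ip_attention(q, k, v, ip_k, ip_v, weight: float = 0.5):
    """IP-Adapter style attention: base result plus a weighted second
    attention over the identity tokens. Both passes use
    F.scaled_dot_product_attention, which dispatches to fused
    (flash / memory-efficient) kernels when available."""
    base = F.scaled_dot_product_attention(q, k, v)     # normal conditioning
    ip = F.scaled_dot_product_attention(q, ip_k, ip_v) # identity conditioning
    return base + weight * ip
```

At `weight=0.0` the output reduces to the base attention, which is why skipping the IP pass outside the scheduled timestep range (the lazy evaluation noted above) is exact, not an approximation.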
Developer Experience
- Comprehensive Documentation: Detailed docstrings with computational complexity analysis
- Extensible Design: Easy to add new vision encoder types or modify architecture
- Runtime Validation: Input validation prevents common errors before execution
🔧 TODOs
- Allow multiple adapters without conflict
- Auto-detect model parameters (hidden size, num layers)
- Convert `.bin` to `safetensors` format
- Add more workflows for different models
- Support additional vision encoder types beyond SigLIP2
🧑💻 Credits
- Adapted from: InstantX IPAdapter for SD3.5
- Version 2.0 enhancements include:
  - Type-safe architecture with comprehensive validation
  - Time-conditioned resampler with adaLN
  - Timestep-based scheduling for fine-grained control
  - Enhanced memory management and performance optimizations
📝 Changelog
Version 2.0
- ✨ Added timestep-based scheduling (start_percent/end_percent)
- ✨ Implemented time-conditioned resampler with Adaptive Layer Normalization
- ✨ Enhanced type safety with dataclasses, enums, and protocols
- ✨ Improved memory management (FP16 default, optimized batching)
- ✨ Added comprehensive documentation and error handling
- ✨ Structured configuration system for easier customization
Feel free to contribute or suggest improvements via GitHub Issues or Pull Requests.