# Loomis Painter: Reconstructing the painting process
<p align="center">
  <a href='https://arxiv.org/abs/2511.17344'>
    <img src='https://img.shields.io/badge/Arxiv-Pdf-A42C25?style=flat&logo=arXiv&logoColor=white'></a>
  <a href='https://markus-pobitzer.github.io/lplp'>
    <img src='https://img.shields.io/badge/Project-Page-green?style=flat&logo=Google%20chrome&logoColor=white'></a>
  <a href='https://huggingface.co/Markus-Pobitzer/wlp-lora'>
    <img src='https://img.shields.io/badge/Model-checkpoint-blue?logo=huggingface&logoColor=white'></a>
</p>

This is a research project: WAN Learns Painting (WLP). This repo contains the code for fine-tuning WAN 2.1.
<table>
  <tr>
    <td align="center">
      <img src="assets/base.gif" width="380" alt="Generated Video" />
      <br />
      <sub>Generated Video</sub>
    </td>
    <td align="center">
      <img src="assets/reference_image.png" width="380" alt="Input" title="Haystacks by Claude Monet. Source: Wikiart." />
      <br />
      <sub>Input</sub>
    </td>
  </tr>
</table>

## Hugging Face Inference
The easiest way to get started is with Hugging Face and the following checkpoint.
<details>
<summary>Code for inference with Hugging Face</summary>

```python
import torch
from diffusers import AutoencoderKLWan, WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image
from transformers import CLIPVisionModel
from huggingface_hub import hf_hub_download
from typing import Tuple, Union
from PIL import Image, ImageOps


def pil_resize(
    image: Image.Image,
    target_size: Tuple[int, int],
    pad_input: bool = False,
    padding_color: Union[str, int, Tuple[int, ...]] = "white",
) -> Image.Image:
    """Resize the image to the target size.

    Args:
        image: Input image to be processed.
        target_size: Target size (width, height).
        pad_input: If set, resizes the image while keeping the aspect ratio and pads the unfilled part.
        padding_color: The color for the padded pixels.

    Returns:
        The resized image.
    """
    if pad_input:
        # Resize image, keep aspect ratio
        image = ImageOps.contain(image, size=target_size)
        # Pad while keeping image in center
        image = ImageOps.pad(image, size=target_size, color=padding_color)
    else:
        image = image.resize(target_size)
    return image


def undo_pil_resize(
    image: Image.Image,
    target_size: Tuple[int, int],
) -> Image.Image:
    """Undo the resizing and padding of the input image, returning a new image of size target_size.

    Args:
        image: Input image to be processed.
        target_size: Target size (width, height).

    Returns:
        The resized image.
    """
    tmp_img = Image.new(mode="RGB", size=target_size)
    # Get the resized image size
    tmp_img = ImageOps.contain(tmp_img, size=image.size)
    # Undo padding by center cropping
    width, height = image.size
    tmp_width, tmp_height = tmp_img.size
    left = int(round((width - tmp_width) / 2.0))
    top = int(round((height - tmp_height) / 2.0))
    right = left + tmp_width
    bottom = top + tmp_height
    cropped = image.crop((left, top, right, bottom))
    # Undo resizing
    ret = cropped.resize(target_size)
    return ret


# Set to True if you have a GPU with less than 80 GB VRAM --> very slow inference!
enable_sequential_cpu_offload = True

# Download the LoRA file
lora_path = hf_hub_download(repo_id="Markus-Pobitzer/wlp-lora", filename="base.safetensors")
print(f"LoRA path: {lora_path}")

# Load the pipeline
model_id = "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
image_encoder = CLIPVisionModel.from_pretrained(
    model_id, subfolder="image_encoder", torch_dtype=torch.float32
)
# Takes more than 100 GB of disk space
pipe = WanImageToVideoPipeline.from_pretrained(
    model_id, vae=vae, image_encoder=image_encoder, torch_dtype=torch.bfloat16
)

# Load LoRA
pipe.load_lora_weights(lora_path)
pipe.fuse_lora()

# Either offload or move the pipeline directly to the GPU
if enable_sequential_cpu_offload:
    pipe.enable_sequential_cpu_offload()
else:
    pipe.to("cuda")

### INFERENCE ###
image = load_image(
    "https://uploads3.wikiart.org/images/claude-monet/haystacks-at-giverny.jpg"
)
og_size = image.size
height = 480
width = 832
# Resize and pad
ref_image = pil_resize(image, target_size=(width, height), pad_input=True)

prompt = "Painting process step by step."

output = pipe(
    image=ref_image,
    prompt=prompt,
    height=height,
    width=width,
    num_frames=81,
    output_type="pil",
    guidance_scale=1.0,
).frames[0]

# Resize frames back to the original image size and reverse the frame
# order so the video ends at the finished painting
output = [undo_pil_resize(img, og_size) for img in output][::-1]

# Save video
export_to_video(output, "output.mp4", fps=3)
```

</details>
## Art Media Transfer
To transfer from one art medium to another, use the following LoRA:

```python
lora_path = hf_hub_download(repo_id="Markus-Pobitzer/wlp-lora", filename="art_media_transfer.safetensors")
```
Make sure that you also change the prompt accordingly. The supported art media are:
- acrylic
- colored pencils
- loomis
- pencil
- oil
The prompt has the following format:

```python
art_media = "..."
painting_desc = "..."
prompt = f"<{art_media}> Painting process step by step. {painting_desc}"
```
For acrylic, colored pencils, and oil, the prompt can contain color descriptions, e.g.:

```python
prompt = "<acrylic> Painting process step by step. The image depicts a serene landscape with a small brown and green island in the center of a body of water, surrounded by green trees and a few boats. The sky is blue with scattered clouds, and there are birds flying in the background."
```
For the loomis and pencil art media, we left color information out during fine-tuning, so the prompt should omit color descriptions, e.g.:

```python
prompt = "<pencil> Painting process step by step. The image depicts a serene landscape with a small island in the center of a body of water, surrounded by trees and a few boats. There are scattered clouds, and birds flying in the background."
```
Note that the loomis method only works on portrait photos/paintings; on other inputs it seems to fall back to another art medium.
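
Putting the pieces together, here is a minimal sketch of art-media-transfer inference. It assumes the pipeline, `ref_image`, `height`, and `width` from the Hugging Face inference example above; the painting description is illustrative.

```python
from huggingface_hub import hf_hub_download

# Swap in the art-media-transfer LoRA (instead of base.safetensors)
lora_path = hf_hub_download(
    repo_id="Markus-Pobitzer/wlp-lora", filename="art_media_transfer.safetensors"
)
pipe.load_lora_weights(lora_path)
pipe.fuse_lora()

art_media = "oil"  # one of: acrylic, colored pencils, loomis, pencil, oil
painting_desc = (
    "The image depicts a serene landscape with a small island in the "
    "center of a body of water, surrounded by trees and a few boats."
)
prompt = f"<{art_media}> Painting process step by step. {painting_desc}"

output = pipe(
    image=ref_image,  # resized and padded reference image, as above
    prompt=prompt,
    height=height,
    width=width,
    num_frames=81,
    output_type="pil",
    guidance_scale=1.0,
).frames[0]
```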
## Installation
Until the Loomis Painter paper is accepted by a conference, we cannot release the full code required to reproduce the dataset and results. However, all the code necessary to fine-tune the WAN 2.1 model is available in this repository.
This project uses uv to manage dependencies. Please follow this guide to install it: https://docs.astral.sh/uv/getting-started/installation/
### Create the Environment & Install Packages

Navigate to this project's root folder (where this README.md file is) in your terminal and run:

```bash
uv sync
```
### Activate the Virtual Environment

```bash
source .venv/bin/activate
```
## Dataset
The code for constructing the dataset is not included in this repository, and the dataset itself will not be publicly available. However, the dataset loader is provided and can be found at `src/wlp/dataset/video_pkl_dataset.py`.
If you are interested in the training data, feel free to reach out via e-mail.
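
For orientation, below is a hypothetical usage sketch of the loader; the class name `VideoPklDataset` and its constructor arguments are assumptions, so check `src/wlp/dataset/video_pkl_dataset.py` for the actual interface.

```python
# Hypothetical sketch: the class name and constructor arguments are
# assumptions; see src/wlp/dataset/video_pkl_dataset.py for the real API.
from torch.utils.data import DataLoader
from wlp.dataset.video_pkl_dataset import VideoPklDataset  # assumed name

dataset = VideoPklDataset("path/to/training_pkls")  # assumed signature
loader = DataLoader(dataset, batch_size=1, shuffle=True)
sample = next(iter(loader))  # a painting-process video plus its prompt (assumed)
```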
## Fine Tuning
We used the script `scripts/train_snellius.sh` to fine-tune the WAN 2.1 model on a SLURM cluster. For fine-tuning the base model we used 4 H100 GPUs for 24 hours, which corresponds to 14 epochs on our dataset of 690 training videos. The model also shows good results when fine-tuned for only 7 epochs.
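
On a SLURM cluster the script would typically be submitted with `sbatch`; the invocation below is a sketch, assuming the resource requests (4 H100 GPUs, 24 hours) are set via `#SBATCH` directives inside the script or overridden on the command line.

```bash
# Sketch: submit the fine-tuning job; the flags shown are illustrative
# and may duplicate #SBATCH directives already inside the script.
sbatch --gpus=4 --time=24:00:00 scripts/train_snellius.sh
```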
## Citation
If you use this work, please cite:
```bibtex
@misc{pobitzer2025loomispainter,
  title={Loomis Painter: Reconstructing the Painting Process},
  author={Markus Pobitzer and Chang Liu and Chenyi Zhuang and Teng Long and Bin Ren and Nicu Sebe},
  year={2025},
  eprint={2511.17344},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.17344},
}
```
## Acknowledgments
We would like to thank the following projects and teams for their contributions and inspiration:
- WAN 2.1 for their video generation model
- DiffSynth-Studio, without whose code training WAN would have been much harder
- Hugging Face
- PaintsUndo for the inspiration
- UniAnimate-DiT for code on how to fine-tune WAN 2.1
