UniCon is adapted from a pretrained image diffusion model with additional joint cross-attention modules and LoRA adapters.
Given a pair of image-condition inputs, UniCon processes them concurrently in two parallel branches. Features from the two branches attend to each other in the injected joint cross-attention modules. LoRA adapters are applied to the condition branch and to the joint cross-attention modules.
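The joint cross-attention described above can be sketched as follows. This is a hypothetical PyTorch illustration, not the paper's implementation: the class name `JointCrossAttention`, the use of `nn.MultiheadAttention`, and the residual connections are all assumptions made for clarity. The key idea it shows is symmetric cross-attention, where each branch's features serve as queries against the other branch's features.

```python
import torch
import torch.nn as nn

class JointCrossAttention(nn.Module):
    """Illustrative sketch: features from two parallel branches attend
    to each other. Not the paper's actual module."""

    def __init__(self, dim, heads=8):
        super().__init__()
        # Separate attention weights for each direction of attention.
        self.img_to_cond = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cond_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat_img, feat_cond):
        # Image-branch features query the condition-branch features...
        out_img, _ = self.img_to_cond(feat_img, feat_cond, feat_cond)
        # ...and vice versa, so information flows in both directions.
        out_cond, _ = self.cond_to_img(feat_cond, feat_img, feat_img)
        # Residual connections keep each branch's original features.
        return feat_img + out_img, feat_cond + out_cond
```

Because each branch only queries the other, the two branches keep separate feature streams while still exchanging information at every injected module.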
The model is trained on image-condition pairs. During training, we sample timesteps independently for each input and compute the loss over both branches.
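A minimal sketch of this training step is shown below, assuming a standard denoising objective; the noise schedule, the function names (`add_noise`, `unicon_training_step`), and the model's call signature are illustrative assumptions, not details from the paper. What it demonstrates is the two points from the text: independent timestep sampling per branch, and a loss summed over both branches.

```python
import torch
import torch.nn.functional as F

def add_noise(x, noise, t, num_timesteps=1000):
    # Simple linear alpha-bar schedule for illustration only;
    # the actual schedule depends on the base diffusion model.
    alpha_bar = 1.0 - t.float() / num_timesteps
    a = alpha_bar.view(-1, *([1] * (x.dim() - 1)))
    return a.sqrt() * x + (1 - a).sqrt() * noise

def unicon_training_step(model, x_img, x_cond, num_timesteps=1000):
    """Hypothetical sketch of one training step on an image-condition pair."""
    b = x_img.shape[0]
    # Timesteps are sampled separately for the two branches.
    t_img = torch.randint(0, num_timesteps, (b,))
    t_cond = torch.randint(0, num_timesteps, (b,))
    noise_img = torch.randn_like(x_img)
    noise_cond = torch.randn_like(x_cond)
    z_img = add_noise(x_img, noise_img, t_img, num_timesteps)
    z_cond = add_noise(x_cond, noise_cond, t_cond, num_timesteps)
    # The model is assumed to process both noisy inputs jointly and
    # predict the noise for each branch.
    pred_img, pred_cond = model(z_img, z_cond, t_img, t_cond)
    # Loss is computed over both branches.
    return F.mse_loss(pred_img, noise_img) + F.mse_loss(pred_cond, noise_cond)
```

Sampling the two timesteps independently exposes the model to all combinations of noise levels across the pair, which is what lets a single trained model later be run with either input fixed as the condition.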
@article{li2024unicon,
  title={A Simple Approach to Unifying Diffusion-based Conditional Generation},
  author={Li, Xirui and Herrmann, Charles and Chan, Kelvin CK and Li, Yinxiao and Sun, Deqing and Yang, Ming-Hsuan},
  journal={arXiv preprint arXiv:2410.11439},
  year={2024}
}