The first open model with a Stable Diffusion 3-like architecture is JUST out, but it is not SD3!
It is Tencent-Hunyuan/HunyuanDiT by Tencent, a 1.5B parameter DiT (diffusion transformer) text-to-image model, trained with multilingual CLIP and multilingual T5 text encoders for English and Chinese understanding
The Stable Diffusion 3 research paper broken down, including some overlooked details!
Model
2 base model variants mentioned: 2B and 8B sizes
New architecture at all abstraction levels:
- UNet is out, Multimodal Diffusion Transformer (MMDiT) is in; bye cross-attention
- Rectified flows for the diffusion process
- Still a Latent Diffusion Model
3 text encoders: 2 CLIPs and one T5-XXL; plug-and-play: removing the larger one (T5-XXL) maintains competitiveness
Dataset was deduplicated with SSCD, which helped reduce memorization (no more details about the dataset, though)
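For the rectified flows item above, here is a rough sketch of the idea as the SD3 paper frames it: a noisy latent is a straight-line mix of data and noise, z_t = (1 - t) * x0 + t * noise, and the network is trained to predict the constant velocity noise - x0. Everything below is an illustrative assumption, not the paper's code: the function names are made up, plain Python lists stand in for latent tensors, and the sampler is a basic Euler loop.

```python
import random


def rectified_flow_pair(x0, t, rng):
    """Build one training example for a rectified flow (hypothetical helper).

    x0  : clean latent, as a list of floats
    t   : time in [0, 1]; t=0 is pure data, t=1 is pure noise
    rng : random.Random instance used to draw Gaussian noise
    Returns (z_t, target, noise), where target is the velocity noise - x0.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    # Straight-line interpolation between data and noise.
    z_t = [(1.0 - t) * x + t * n for x, n in zip(x0, noise)]
    # The regression target is the (constant) direction of that line.
    target = [n - x for x, n in zip(x0, noise)]
    return z_t, target, noise


def euler_sample(velocity_fn, noise, steps=10):
    """Integrate from t=1 (noise) down to t=0 (data) with Euler steps.

    velocity_fn(z, t) stands in for the trained network (an assumption here).
    """
    z = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v = velocity_fn(z, t)
        # Moving against the velocity walks the line back toward the data.
        z = [zi - dt * vi for zi, vi in zip(z, v)]
    return z
```

If the velocity function returned the exact target for a given pair, the Euler loop would walk the straight line back to x0 regardless of the step count, which is the intuition behind why straight paths are attractive for few-step sampling.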
Variants
A DPO fine-tuned model showed great improvement in prompt understanding and aesthetics
An Instruct Edit 2B model was trained, and learned how to do text replacement
Results
State of the art in automated evals for composition and prompt understanding
Best win rate in human preference evaluation for prompt understanding, aesthetics and typography (missing some details on how many participants and the design of the experiment)