Between Stable Diffusion 3.5 and FLUX.1: A Renaissance in Text-to-Image Generation
Between Stable Diffusion 3.5 and FLUX.1, this year has seen another renaissance for text-to-image generation. These models have taken a step forward in prompt adherence, added the ability for open-source models to spell, and continued to improve the quality of their aesthetic outputs. Nonetheless, the core mechanic behind these models has remained fundamentally the same: use a text prompt, with either an empty latent image or an image primer, to generate a single image.
OmniGen: A New Architecture for Text-to-Image Generation
In this article, we want to shine a spotlight on an incredibly promising new architecture for text-to-image generation: OmniGen. Inspired by similar unification efforts in the Large Language Model research community, OmniGen is the first fully unified diffusion model framework, supporting downstream tasks like image editing, subject-driven generation, and visual-conditional generation in addition to text-to-image synthesis (Source).
Follow along for a breakdown of the architecture that makes OmniGen possible, an exploration of the model's capabilities, and a demonstration of how to run and test OmniGen using a GPU Droplet.
Prerequisites
- Python: The content of this article is highly technical. We recommend this piece to readers experienced with both Python and basic concepts in Deep Learning.
- Cloud GPU: Running OmniGen will require a sufficiently powerful GPU. We recommend a machine with at least 40 GB of VRAM.
The OmniGen Framework
OmniGen Architecture
OmniGen is composed of two parts: a Variational AutoEncoder (VAE) and a large pretrained Transformer. The VAE extracts continuous visual features from images, while the Transformer generates images based on the input conditions. Specifically, OmniGen uses the VAE from Stable Diffusion XL, which was frozen during training, and a Transformer initialized from Microsoft’s Phi-3. This pairs the strength of the VAE with a Transformer that has inherited significant textual processing capability. Put together, this creates a simple but strong pipeline that removes the need for additional encoders and thereby simplifies things considerably: OmniGen inherently encodes conditional information by itself.
“Furthermore, OmniGen jointly models text and images within a single model, rather than independently modeling different input conditions with separate encoders as in existing works which lacks interaction between different modality conditions” (Source).
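To make this two-part design concrete, here is a minimal structural sketch in Python (our own illustration, not the official implementation): a frozen SDXL VAE supplies the latent space, and a Phi-3-initialized Transformer handles the interleaved text-and-image sequence. The checkpoint IDs are the public Hugging Face ones; the helper function is hypothetical.
import torch
from diffusers import AutoencoderKL
from transformers import AutoModelForCausalLM, AutoTokenizer

# Frozen VAE from Stable Diffusion XL supplies the continuous latent space.
vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="vae"
)
vae.requires_grad_(False)

# The Transformer backbone is initialized from Microsoft's Phi-3; OmniGen
# fine-tunes it to denoise image latents alongside ordinary text tokens.
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
transformer = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# Hypothetical helper: conditioning images enter the sequence as VAE latents,
# so no CLIP/T5 text encoder or separate image encoder is needed.
def encode_condition_image(pixels: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        return vae.encode(pixels).latent_dist.sample()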
For the attention mechanism, the authors use a modified version of causal attention. Specifically, the mechanism applies causal attention across the elements of the sequence while simultaneously applying bidirectional attention within each image sequence. This makes it possible for each patch to “pay attention” to the other patches in the same image, while ensuring that each image can only consider image or text sequences that appeared earlier in the sequence (Source).
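As a toy illustration of this hybrid masking scheme (our own sketch, not OmniGen's code), the function below starts from a standard causal mask and then opens up bidirectional attention inside each image's patch span:
import torch

def omnigen_style_mask(seq_len: int, image_spans: list[tuple[int, int]]) -> torch.Tensor:
    """Return a boolean mask where True means 'position i may attend to j'.

    image_spans lists the [start, end) index ranges occupied by image patches.
    """
    # Standard causal mask: position i attends to positions <= i.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Within each image span, allow full bidirectional attention.
    for start, end in image_spans:
        mask[start:end, start:end] = True
    return mask

# Example: 4 text tokens followed by a 6-patch image. The patches attend to
# each other freely, but the image as a whole only sees the earlier text.
print(omnigen_style_mask(10, [(4, 10)]))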
To generate an image, they sample Gaussian noise and apply flow matching to predict the target velocity, iterating over the set number of inference steps to produce the image's latent representation. The VAE then decodes this latent into the final image output (Source).
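The sampling procedure can be sketched as a simple Euler integration of the predicted velocity field. The following is a generic flow-matching loop, assuming a model that maps a noisy latent and a timestep to a velocity; it is a schematic of the idea, not the repository's exact sampler:
import torch

@torch.no_grad()
def flow_matching_sample(model, shape, num_steps=50, device="cuda"):
    x = torch.randn(shape, device=device)  # start from pure Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0],), i * dt, device=device)
        v = model(x, t)   # model predicts the target velocity at (x, t)
        x = x + v * dt    # Euler step along the learned flow
    return x              # final latent, which the VAE decodes into an image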
What Can We Do with OmniGen?
OmniGen is capable of numerous tasks and, more importantly, abstracts away extra steps from the increasingly long process of image generation and editing with AI tools. Let's briefly go over each capability before jumping into the coding demo.
- Text-to-image generation: Like Stable Diffusion or FLUX, OmniGen is perfectly capable of generating high-quality images on its own. In our experience, the quality is very similar to that of baseline Stable Diffusion XL. It will be interesting to see whether the same process could be applied to the more modern FLUX and SD 3.5 Large models to give them this same capability.
- Text-based image editing: OmniGen makes it easy to edit images in a single step with a text instruction. It uses an LLM-style prompting format, so asking the model to change the image subject's hair color is a simple request (see the prompt examples after this list).
- Image compositing: OmniGen makes it easy to combine two subjects seamlessly into novel environments.
- Depth/pose estimation: OmniGen integrates the same technologies that make ControlNets so effective, and it is capable of extracting the depth map or pose itself.
- And much more! OmniGen is probably the most versatile single model pipeline ever released. Be sure to try all the examples provided by the demo.
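For reference, here are the prompt formats these tasks use, following the examples in the project's README: plain text for generation, and <img><|image_1|></img> placeholders to reference input images for editing and compositing. The specific prompt wording below is our own:
# Plain text-to-image:
prompt = "A photo of a corgi wearing a red scarf in the snow."

# Text-based editing of a supplied image:
edit_prompt = "<img><|image_1|></img> Change the subject's hair color to silver."

# Compositing two subjects into a new scene:
composite_prompt = (
    "Two people sit together in a cafe. The first person is "
    "<img><|image_1|></img>. The second person is <img><|image_2|></img>."
)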
OmniGen Code Demo
Set Up the GPU Droplet
Now that we have walked through everything OmniGen brings to the table, we are ready to begin the code demo.
Install the Packages & Clone the Repository
Once you have successfully SSH'd into your Droplet, we can continue. First, we need to make sure we are in the right directory and clone the repo. Paste the following commands into the terminal:
cd ../home
sudo apt-get install git-lfs
git-lfs clone https://huggingface.co/spaces/Shitao/OmniGen
cd OmniGen/
pip3 install -r requirements.txt
This will do everything we need to set up the environment with all the packages required to run OmniGen. All that's left is to run the demo!
python3 app.py --share
Click the shareable public link to open the Gradio application in any browser window.
Running the Demo
Now we are ready to run the demo! Begin by testing generation with a simple text prompt; we found the results very similar to baseline SDXL. Afterwards, test out the provided examples at the bottom of the page to get a feel for using the model. We recommend using their skeletons as templates when constructing new prompts, especially if you plan to use the model going forward: the authors have found the best ways to use their own model.
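If you would rather skip the Gradio interface and call the model directly from Python, the repository also ships a pipeline class. The sketch below follows the project's README; argument names and defaults may differ between releases, and the input file name is hypothetical:
from OmniGen import OmniGenPipeline

pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")

# Text-to-image
images = pipe(
    prompt="A photo of a corgi wearing a red scarf in the snow.",
    height=1024,
    width=1024,
    guidance_scale=2.5,
    seed=0,
)
images[0].save("corgi.png")

# Text-based editing with a reference image (subject.png is a placeholder)
edited = pipe(
    prompt="<img><|image_1|></img> Change the subject's hair color to silver.",
    input_images=["subject.png"],
    height=1024,
    width=1024,
    guidance_scale=2.5,
    img_guidance_scale=1.6,
)
edited[0].save("edited.png")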
Closing Thoughts
OmniGen is a fascinating step forward for image generation models. In particular, we are impressed by the consolidation of the entire pipeline into a single model that is capable of such a diverse array of tasks, including editing, image composition, and much more. We look forward to the release of the next versions of OmniGen in the coming months, and to seeing this pipeline framework spread to other models.