The base prompt and prompt continuation from above are combined and then sent to an SDXL-Turbo diffusion model, which generates the corresponding image. The semantic latent space, or h-space, of the model, which sits at the bottom of the UNet, is then adjusted during the image generation process. For this adjustment, a precomputed direction is scaled by the scale values given above and added to the h-space activations. The direction is computed by running the model multiple times with a base prompt and an adjusted prompt and then taking the average h-space difference between the two.
As you can see above, the directions are far from perfect and don't work for every image. When generating directions this way, correlated attributes often get mixed in. For example, the prompt "a portrait of a smiling person" will usually zoom in more on the face, so the smiling direction may also contain a zoom-in component. In fact, for the smiling direction shown above, the zoom-in seems to dominate, while the change in how much the person is smiling is barely visible.
Here is a simplified excerpt from the model repo:
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained('stabilityai/sdxl-turbo', torch_dtype=torch.float16, variant="fp16").to('cuda')

def get_representation(text, count=200):
    # Average the UNet mid-block (h-space) output over `count` generations.
    reprs = []
    def get_repr(module, input, output): reprs.append(output.detach())
    for i in range(count):
        with torch.no_grad(), pipe.unet.mid_block.register_forward_hook(get_repr):
            pipe(prompt=text, num_inference_steps=1, guidance_scale=0.0)
    return torch.cat(reprs).mean(axis=0).cpu()

direction = get_representation("a portrait of a smiling person") - get_representation("a portrait of a person")
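A direction computed like this can then be added back in during generation. The following is a minimal sketch of how that could look (it is not taken verbatim from the repo, and the scale of 3.0 is just an illustrative value); it reuses the `pipe` and `direction` from the excerpt above and shifts the mid-block output with a forward hook:

def add_direction(direction, scale):
    # Forward hook: shift the h-space (mid-block output) by the scaled direction.
    # Returning a value from a forward hook replaces the module's output.
    def hook(module, input, output):
        return output + scale * direction.to(device=output.device, dtype=output.dtype)
    return hook

handle = pipe.unet.mid_block.register_forward_hook(add_direction(direction, scale=3.0))
image = pipe(prompt="a portrait of a person", num_inference_steps=1, guidance_scale=0.0).images[0]
handle.remove()  # remove the hook to restore the unmodified model

Using a hook keeps the model weights untouched; only the activations at the bottleneck are shifted.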
The h-space has a dimensionality of 1280×16×16 for SDXL-Turbo. The last two dimensions are spatial, so even at the very bottom of the diffusion UNet, we still have 16×16 spatial blocks. This also affects the visualization above, as the h-space modification is unique for each block. This can cause problems if some parts of, e.g., a human face are at a different position in the generated image compared to the precalculated direction.
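As a quick, hypothetical sanity check (not part of the repo), the shape can be read off directly from a captured mid-block output, reusing the `pipe` from the excerpt above:

shapes = []
with torch.no_grad(), pipe.unet.mid_block.register_forward_hook(lambda module, input, output: shapes.append(tuple(output.shape))):
    pipe(prompt="a portrait of a person", num_inference_steps=1, guidance_scale=0.0)
print(shapes[0])  # expected: (1, 1280, 16, 16) at the default 512x512 resolution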
If you want to know more about h-space modification in diffusion models, here are some papers:
You can check out the source code for this project on GitHub: Model and Webapp. Furthermore, this project uses Replicate for running the wrapped SDXL-Turbo model, and Vercel for webapp deployment.
Project by Jonas Loos, last updated 2024-02-07.