Follow the Mean: Reference-Guided Flow Matching
Show it examples.
Change what it generates: no retraining, no gradients, no reward signals.
Pedro M. P. Curvo · Maksim Zhdanov · Floor Eijkelboom · Jan-Willem van de Meent
University of Amsterdam
The idea
Guide with examples, not rewards.
Think about how humans actually learn. Not by being scored on every move, but by watching someone do it first. You learn to cook by seeing a dish made, not by getting a number between 0 and 1 after each attempt. For most things that matter, we learn from examples.
The same intuition applies to generative models. If you ask an artist to paint in the style of Van Gogh and hand them a reward signal at every brushstroke, they'll be lost. Show them a handful of Van Gogh paintings instead, and suddenly the task is clear: not because the examples are a perfect specification, but because they shift the artist's mental picture of where the painting should go.
That's exactly what this paper does, but for flow matching models. The velocity field at any point in time is fully determined by a conditional endpoint mean: a kind of running best guess of where the generation is headed. Swap in a handful of reference images, and you shift that guess. The flow follows. No fine-tuning, no reward network, no repeated sampling. A single forward pass on a frozen model.
The reference images don't need to be perfect. They just need to shift the mean in the right direction.






Method
How it works
01
Recover the endpoint mean
For any pretrained rectified-flow model, the endpoint mean is just μt(x) = x + (1 − t)ut(x). No extra computation needed: it falls out of the velocity prediction.
02
Compute a reference mean
Given M reference images, compute their posterior mean in closed form as a softmax-weighted sum. This is exact under a Gaussian bridge: the same math that underlies cross-attention.
03
Apply the correction
Add a scheduled velocity correction proportional to the difference between reference mean and model mean. No gradients, no backward pass, no auxiliary model.
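The three steps above can be sketched in a few lines of NumPy. Everything here is illustrative: the velocity `u` would come from the frozen pretrained network, and the bridge-noise scale `sigma` is an assumed parameter, not a value from the paper.

```python
import numpy as np

def endpoint_mean(x, u, t):
    """Step 01: recover the endpoint mean from the model velocity.
    For a rectified flow, mu_t(x) = x + (1 - t) * u_t(x)."""
    return x + (1.0 - t) * u

def reference_mean(x, refs, t, sigma=1.0):
    """Step 02: softmax-weighted posterior mean over M reference points.
    Exact under a Gaussian bridge: weight each reference y_i by how likely
    x is under the bridge x_t | x_1 = y_i at time t. `sigma` is an assumed
    noise scale for illustration."""
    # squared distance of x to each scaled reference t * y_i
    d2 = ((x[None, :] - t * refs) ** 2).sum(axis=1)
    logits = -d2 / (2.0 * ((1.0 - t) * sigma) ** 2)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return (w[:, None] * refs).sum(axis=0)

def guided_velocity(x, u, refs, t, beta):
    """Step 03: scheduled velocity correction toward the reference mean."""
    mu_theta = endpoint_mean(x, u, t)
    mu_ref = reference_mean(x, refs, t)
    return u + beta * (mu_ref - mu_theta) / (1.0 - t)
```

With `beta = 0` the correction vanishes and the pretrained velocity is returned unchanged; the schedule interpolates between the two.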
From flow matching to endpoint means
Start with the linear flow-matching bridge between noise x0 and data x1. For a fixed intermediate state xt = x, there are many possible endpoint pairs that could have produced it. The model therefore cannot use a single displacement; the loss-minimizing marginal velocity is the average bridge velocity over all endpoint pairs consistent with the current state:
xt = (1 − t)x0 + tx1
ut(x) = E[x1 − x0 | xt = x]
ut(x) = (μt(x) − x) / (1 − t)  ⇒  μt(x) = x + (1 − t) ut(x)
Once the pretrained model gives μtθ, a reference bank gives an empirical target mean μ̂tρ. Interpolating between those endpoint means yields the closed-form velocity correction below.
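The identity μt(x) = x + (1 − t)ut(x) can be sanity-checked numerically: estimate both conditional expectations with a narrow kernel around a fixed xt and compare. The toy endpoint distributions below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, t = 200_000, 0.6

# toy endpoint distributions (illustrative): standard-normal noise x0,
# bimodal 1-D "data" x1
x0 = rng.normal(size=n)
x1 = rng.normal(loc=rng.choice([-2.0, 2.0], size=n), scale=0.3)
xt = (1 - t) * x0 + t * x1

# condition on x_t ~= x with a narrow Gaussian kernel
x = 0.5
w = np.exp(-((xt - x) ** 2) / (2 * 0.01 ** 2))
w /= w.sum()

u = np.sum(w * (x1 - x0))   # marginal velocity  E[x1 - x0 | x_t = x]
mu = np.sum(w * x1)         # endpoint mean      E[x1 | x_t = x]

# mu_t(x) = x + (1 - t) u_t(x) should hold up to the kernel width
print(mu, x + (1 - t) * u)
```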
Reference-guided endpoint mixture
To turn a reference set into guidance, define a target endpoint distribution as a geometric mixture of the pretrained data distribution p1 and the reference distribution ρ1:
π(x1) ∝ p1(x1)^(1−βt) · ρ1(x1)^βt
μtπ(x) ≃ (1 − βt) μtθ(x) + βt μ̂tρ(x)
Since the linear-bridge velocity is just the endpoint mean minus the current state divided by the remaining time, replacing the model mean with this mixed mean leaves the original velocity plus a reference-mean correction.
Guided velocity
utπ(x) ≃ utθ(x) + βt · (μ̂tρ(x) − μtθ(x)) / (1 − t)
βt is a scalar guidance schedule, with quadratic decay by default. No classifier, no reward, no search.
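Putting the pieces together, guided sampling is an ordinary Euler integration of the corrected velocity. This is a minimal sketch under stated assumptions: a quadratic-decay schedule βt = β0(1 − t)², a unit-variance Gaussian bridge for the reference weights, and `velocity_fn` standing in for the frozen pretrained model.

```python
import numpy as np

def quadratic_beta(t, beta0=0.8):
    """Assumed quadratic-decay schedule: strong early, off at t = 1."""
    return beta0 * (1.0 - t) ** 2

def sample_rmg(velocity_fn, refs, dim, steps=50, beta0=0.8, seed=0):
    """Euler integration of the reference-guided velocity field.
    `velocity_fn(x, t)` is the frozen model; the rest is the
    closed-form correction from the text."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=dim)                    # x_0 ~ N(0, I)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        u = velocity_fn(x, t)
        mu_theta = x + (1.0 - t) * u            # model endpoint mean
        # Gaussian-bridge posterior mean of the reference bank
        d2 = ((x[None, :] - t * refs) ** 2).sum(axis=1)
        logits = -d2 / (2.0 * (1.0 - t) ** 2 + 1e-8)
        w = np.exp(logits - logits.max())
        w /= w.sum()
        mu_ref = (w[:, None] * refs).sum(axis=0)
        # scheduled correction; eps guards the final step
        beta = quadratic_beta(t, beta0)
        u_guided = u + beta * (mu_ref - mu_theta) / (1.0 - t + 1e-8)
        x = x + dt * u_guided
    return x
```

Setting `beta0=0` recovers plain Euler sampling of the pretrained flow, which is a useful check that the correction is purely additive.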
Why it's mathematically sound
The guidance rule follows from two constructions that both yield valid bridge marginals. The geometric mixture at the endpoint level is valid by construction and recovers the same velocity formula under a Gaussian posterior approximation, which holds exactly in VAE latent spaces. An arithmetic mixture of the two distributions gives the same velocity via an exact Bayes' rule computation. Both support the same guidance update.
Reference banks
What the reference sets look like
Each reference bank contains 20 images. They don't need to match the prompt: they just need to encode the target attribute. Toy-like pink elephants steer toward pink. Keyhole silhouettes steer toward keyhole compositions.




Reference-Mean Guidance · Results
Swap the reference set, change the output
Prompt and noise seed are fixed within each column. Only the reference set changes. The output shifts systematically in color, object identity, and style.
elephant in a jungle

baseline

pink elephants

blue elephants
a cat

baseline

studio photos

Van Gogh
a house in a forest

baseline

sketches

cinematic
animal in a savanna

baseline

giraffes

zebras
Structural control
It can even transfer geometry
Structural correctness is hard to define as a reward: VLMs struggle to judge whether a hand is correctly oriented or a silhouette matches a target shape. Reference-mean guidance sidesteps this entirely: just show it examples of the target structure. No pose estimator, no spatial loss, no gradient.
Keyhole shape
a miniature forest...all inside a keyhole on a black background

baseline

reference-guided

nearest reference
Hand pose
a hand doing the sign of the horns

baseline

reference-guided

nearest reference
Gymnastics pose
a gymnast performing a ring leap...

baseline

reference-guided

nearest reference
The nearest-reference column confirms that RMG transfers structural priors rather than copying reference content.
Comparison of control interfaces · GenEval
One forward pass. No reward model.
This comparison is between control interfaces, not a controlled ablation. RMG uses a fixed 20-image visual reference bank per category; baselines must express the same constraint through text prompts, classifier scores, or reward gradients. All methods share the same backbone, resolution, sampler, prompts, and seeds. The largest gains appear on compositional categories: position (+28.75) and two-object generation (+8.08), where text alone is insufficient to ground the target structure.
| Method | Time ↓ | NFE ↓ | Aux. ↓ | Mean ↑ | Two ↑ | Position ↑ | Attribution ↑ |
|---|---|---|---|---|---|---|---|
| FLUX.2-klein (4B) baseline | 1.00x | 1x | - | 80.10 | 91.41 | 65.25 | 58.75 |
| Search-based | |||||||
| + Prompt Opt. | 7.87x | 8x | 8C+2L | 84.18 | 95.45 | 69.75 | 64.00 |
| + Best-of-4 | 4.07x | 4x | 4C | 83.35 | 95.96 | 67.75 | 65.25 |
| + SMC | 6.17x | 4x | 81C | 80.28 | 95.71 | 61.75 | 57.25 |
| Gradient-based | |||||||
| + ReNO | 19.44x | 4x | 4C | 83.46 | 93.18 | 65.50 | 64.75 |
| Ours (RMG) | 1.02x | 1x | - | 91.17 | 99.49 | 94.00 | 75.25 |
This comparison is not intended to isolate the effect of additional visual information; it measures whether reference sets provide a practical and efficient interface for test-time control.
Semi-Parametric Guidance
A learned variant that knows what to copy
The closed-form reference mean is efficient but blunt: it can accidentally transfer nuisance correlations from the bank, like a shared white background. SPG fixes this by amortizing the same idea into a learnable architecture. A cross-attention anchor summarizes the reference set; a learned residual refiner decides what to keep and what to suppress. Reference set still swappable at inference time. Generation quality unchanged.
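The two SPG components can be sketched as follows. Shapes, weight matrices, and the MLP refiner are all illustrative assumptions, not the paper's exact architecture: a cross-attention anchor summarizes the reference bank, and a residual refiner edits that summary.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def spg_anchor(h, refs, Wq, Wk, Wv):
    """Cross-attention anchor (illustrative shapes): the current state
    embedding h queries the reference-set embeddings; the output is a
    learned summary of the bank."""
    q = h @ Wq                              # (d,)
    K = refs @ Wk                           # (M, d)
    V = refs @ Wv                           # (M, d)
    attn = softmax(K @ q / np.sqrt(q.shape[0]))
    return attn @ V                         # (d,)

def spg_refine(anchor, h, W1, W2):
    """Residual refiner (assumed small MLP): decides what to keep from
    the anchor and what to suppress, e.g. a shared background."""
    z = np.concatenate([anchor, h])
    return anchor + np.tanh(z @ W1) @ W2
```

Because the reference embeddings enter only through the anchor, the bank can still be swapped at inference time without touching any weights.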
Unconditional generation quality on AFHQv2
| Method | FID ↓ | KID ↓ | IS ↑ |
|---|---|---|---|
| DiT-B/4 baseline | 23.111 | 0.012 | 6.554 |
| SPG (ours) | 23.256 | 0.013 | 6.227 |
SPG matches DiT-B/4 baseline (FID 23.26 vs 23.11), confirming the reference-set anchor does not degrade generation quality.
Inference-time control
Generated class proportions closely track the reference-set composition across a wide range of reference sizes M, demonstrating continuous control without modifying model parameters.
Reference-set swaps · same model, same noise
full reference set: generated





nearest neighbors (latent space): not copies





cat-only reference set





dog-only reference set





Broader direction
Generative models that adapt through data, not parameter updates
If control is represented by parameters, adaptation requires optimization and can create interference between concepts. If control is represented by a reference set, adaptation becomes a data operation: adding, removing, or reweighting examples changes the posterior mean and therefore the flow. This makes reference-guided flows naturally suited to low-data regimes, personalization, and continual adaptation settings where the target distribution changes faster than retraining can keep up.
Citation
@article{curvo2026follow,
title={Follow the Mean: Reference-Guided Flow Matching},
author={Pedro M. P. Curvo and Maksim Zhdanov and Floor Eijkelboom and Jan-Willem van de Meent},
year={2026},
}