Pre-Print

Follow the Mean: Reference-Guided Flow Matching

Show it examples.
Change what it generates: no retraining, no gradients, no reward signals.

Pedro M. P. Curvo · Maksim Zhdanov · Floor Eijkelboom · Jan-Willem van de Meent

University of Amsterdam

arXiv · GitHub (coming soon)

The idea

Guide with examples, not rewards.

Think about how humans actually learn. Not by being scored on every move, but by watching someone do it first. You learn to cook by seeing a dish made, not by getting a number between 0 and 1 after each attempt. For most things that matter, we learn from examples.

The same intuition applies to generative models. If you ask an artist to paint in the style of Van Gogh and hand them a reward signal at every brushstroke, they'll be lost. Show them a handful of Van Gogh paintings instead, and suddenly the task is clear: not because the examples are a perfect specification, but because they shift the artist's mental picture of where the painting should go.

That's exactly what this paper does, but for flow matching models. The velocity field at any point in time is fully determined by a conditional endpoint mean: a kind of running best guess of where the generation is headed. Swap in a handful of reference images, and you shift that guess. The flow follows. No fine-tuning, no reward network, no repeated sampling. A single forward pass on a frozen model.

The reference images don't need to be perfect. They just need to shift the mean in the right direction.

Baseline elephant vs. pink-elephant-guided output: prompt, seed, and model weights are fixed. Changing the reference set changes the output.

Method

How it works

Recover the endpoint mean

For any pretrained rectified-flow model, the endpoint mean is just μ_t(x) = x + (1 − t)u_t(x). No extra computation needed: it falls out of the velocity prediction.

Compute a reference mean

Given M reference images, compute their posterior mean in closed form as a softmax-weighted sum. The sum is exact under a Gaussian bridge and has the same form as cross-attention.

Apply the correction

Add a scheduled velocity correction proportional to the difference between reference mean and model mean. No gradients, no backward pass, no auxiliary model.
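The three steps above can be sketched on toy vectors. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the velocity `u` stands in for the frozen model's prediction, and the softmax logits assume standard Gaussian noise (so x_t given x₁ is Gaussian with mean t·x₁ and variance (1 − t)²) with a uniform prior over the references. Plain Python lists are used for portability; a real system would operate on latent tensors.

```python
import math


def endpoint_mean(x, u, t):
    """Step 1: mu_t(x) = x + (1 - t) * u_t(x), applied elementwise."""
    return [xi + (1.0 - t) * ui for xi, ui in zip(x, u)]


def reference_mean(x, refs, t):
    """Step 2: softmax-weighted mean of the reference endpoints.

    Assumption (not stated in the text): x_t | x_1 ~ N(t * x_1, (1 - t)^2 I)
    and every reference has the same prior weight. Requires t < 1.
    """
    var = (1.0 - t) ** 2
    logits = [
        -sum((xi - t * ri) ** 2 for xi, ri in zip(x, r)) / (2.0 * var)
        for r in refs
    ]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    z = sum(weights)
    return [sum(w * r[d] for w, r in zip(weights, refs)) / z for d in range(len(x))]


def guided_velocity(x, u, refs, t, beta_t):
    """Step 3: u + beta_t * (reference mean - model mean) / (1 - t)."""
    mu_model = endpoint_mean(x, u, t)
    mu_ref = reference_mean(x, refs, t)
    return [
        ui + beta_t * (mr - mm) / (1.0 - t)
        for ui, mr, mm in zip(u, mu_ref, mu_model)
    ]


if __name__ == "__main__":
    x, u, t = [0.0, 0.0], [1.0, 2.0], 0.5
    refs = [[2.0, 0.0]]
    print(guided_velocity(x, u, refs, t, beta_t=0.0))  # model velocity, unchanged
    print(guided_velocity(x, u, refs, t, beta_t=1.0))  # velocity aimed at the reference mean
```

With `beta_t = 0` the function returns the model velocity unchanged; with `beta_t = 1` the endpoint mean implied by the guided velocity equals the reference mean.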

From flow matching to endpoint means

Start with the linear flow-matching bridge between noise x₀ and data x₁. For a fixed intermediate state x_t = x, there are many possible endpoint pairs that could have produced it. The model therefore cannot use a single displacement; the loss-minimizing marginal velocity is the average bridge velocity over all endpoint pairs consistent with the current state:

x_t = (1 − t)x₀ + tx₁

u_t(x) = E[x₁ − x₀ | x_t = x]

u_t(x) = (μ_t(x) − x) / (1 − t) ⇒ μ_t(x) = x + (1 − t)u_t(x)
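The step from the conditional expectation to the endpoint mean is short enough to write out. The derivation below is a sketch that uses only the bridge definition; μ_t(x) denotes E[x₁ | x_t = x], and t < 1 is assumed so that the division is defined.

```latex
\begin{align*}
x_t = (1-t)\,x_0 + t\,x_1
  &\;\Longrightarrow\; x_0 = \frac{x_t - t\,x_1}{1-t}, \qquad t < 1,\\
x_1 - x_0 &= \frac{(1-t)\,x_1 - x_t + t\,x_1}{1-t} = \frac{x_1 - x_t}{1-t},\\
u_t(x) = \mathbb{E}[x_1 - x_0 \mid x_t = x]
  &= \frac{\mathbb{E}[x_1 \mid x_t = x] - x}{1-t} = \frac{\mu_t(x) - x}{1-t}.
\end{align*}
```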

Once the pretrained model gives μ_t^θ, a reference bank gives an empirical target mean μ̂_t^ρ. Interpolating between those endpoint means yields the closed-form velocity correction below.

Reference-guided endpoint mixture

To turn a reference set into guidance, define a target endpoint distribution as a geometric mixture of the pretrained data distribution p₁ and the reference distribution ρ₁:

π(x₁) ∝ p₁(x₁)^(1 − β_t) ρ₁(x₁)^(β_t)

μ_t^π(x) ≃ (1 − β_t) μ_t^θ(x) + β_t μ̂_t^ρ(x)

Since the linear-bridge velocity is just the endpoint mean minus the current state divided by the remaining time, replacing the model mean with this mixed mean leaves the original velocity plus a reference-mean correction.

Guided velocity

u_t^π(x) ≃ u_t^θ(x) + β_t (μ̂_t^ρ(x) − μ_t^θ(x)) / (1 − t)
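The guided velocity follows from substituting the mixed endpoint mean into the identity that relates velocity and endpoint mean. Written out as a short sketch, using the same symbols as the surrounding text:

```latex
\begin{align*}
u_t^{\pi}(x) &= \frac{\mu_t^{\pi}(x) - x}{1-t}
  \simeq \frac{(1-\beta_t)\,\mu_t^{\theta}(x) + \beta_t\,\hat{\mu}_t^{\rho}(x) - x}{1-t}\\
 &= \frac{\mu_t^{\theta}(x) - x}{1-t}
    + \beta_t\,\frac{\hat{\mu}_t^{\rho}(x) - \mu_t^{\theta}(x)}{1-t}
  = u_t^{\theta}(x) + \beta_t\,\frac{\hat{\mu}_t^{\rho}(x) - \mu_t^{\theta}(x)}{1-t}.
\end{align*}
```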

β_t is a scalar guidance schedule (quadratic decay by default). No classifier, no reward, no search.
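The text says only that the default schedule decays quadratically; the exact form is not given here. A minimal sketch, assuming β_t = β₀(1 − t)² with a hypothetical base strength `beta0`, shows one convenient property of such a choice: the coefficient that actually multiplies the mean difference, β_t / (1 − t), reduces to β₀(1 − t) and stays bounded as t approaches 1.

```python
def beta_schedule(t, beta0=0.5):
    """Assumed quadratic decay: beta_t = beta0 * (1 - t)^2 (form and default are illustrative)."""
    return beta0 * (1.0 - t) ** 2


def correction_gain(t, beta0=0.5):
    """Coefficient on (reference mean - model mean): beta_t / (1 - t) = beta0 * (1 - t)."""
    return beta0 * (1.0 - t)


if __name__ == "__main__":
    for t in (0.0, 0.5, 0.9, 0.99):
        print(t, beta_schedule(t), correction_gain(t))
```

Under this assumed schedule the guidance is strongest early, when the endpoint is most uncertain, and vanishes at t = 1.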

Why it's mathematically sound

The guidance rule follows from two constructions that both yield valid bridge marginals. The geometric mixture at the endpoint level is valid by construction and recovers the same velocity formula under a Gaussian posterior approximation, which holds exactly in VAE latent spaces. An arithmetic mixture of the two distributions gives the same velocity via an exact Bayes' rule computation. Both support the same guidance update.
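One way to see why the geometric mixture leads to interpolated means is sketched below. The first display uses only Bayes' rule. The second adds an assumption that is stronger than what the text states: both posteriors are taken to be Gaussian with a shared covariance Σ. With unequal covariances the mean would be precision-weighted rather than a plain interpolation.

```latex
\begin{align*}
\pi(x_1 \mid x_t)
 &\propto p(x_t \mid x_1)\,p_1(x_1)^{1-\beta_t}\,\rho_1(x_1)^{\beta_t}
  = \big[p(x_t \mid x_1)\,p_1(x_1)\big]^{1-\beta_t}\,
    \big[p(x_t \mid x_1)\,\rho_1(x_1)\big]^{\beta_t}\\
 &\propto p_1(x_1 \mid x_t)^{1-\beta_t}\,\rho_1(x_1 \mid x_t)^{\beta_t}.
\end{align*}
% Assumption: both posteriors are Gaussian with a shared covariance \Sigma.
\begin{align*}
\mathcal{N}(x_1;\, \mu_t^{\theta}, \Sigma)^{1-\beta_t}\;
\mathcal{N}(x_1;\, \hat{\mu}_t^{\rho}, \Sigma)^{\beta_t}
 \;\propto\;
\mathcal{N}\!\big(x_1;\; (1-\beta_t)\,\mu_t^{\theta} + \beta_t\,\hat{\mu}_t^{\rho},\; \Sigma\big).
\end{align*}
```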

Reference banks

What the reference sets look like

Each reference bank contains 20 images. They don't need to match the prompt: they just need to encode the target attribute. Toy-like pink elephants steer toward pink. Keyhole silhouettes steer toward keyhole compositions.

Reference-Mean Guidance · Results

Swap the reference set, change the output

Prompt and noise seed are fixed within each column. Only the reference set changes. The output shifts systematically in color, object identity, and style.

Figure grid (prompt: baseline, then two reference sets):

elephant in a jungle: baseline, pink elephants, blue elephants

a cat: baseline, studio photos, Van Gogh

a house in a forest: baseline, sketches, cinematic

animal in a savanna: baseline, giraffes, zebras

Structural control

It can even transfer geometry

Structural correctness is hard to define as a reward: VLMs struggle to judge whether a hand is correctly oriented or a silhouette matches a target shape. Reference-mean guidance (RMG) sidesteps this entirely: just show it examples of the target structure. No pose estimator, no spatial loss, no gradient.

The nearest-reference column confirms that RMG transfers structural priors rather than copying reference content.

Comparison of control interfaces · GenEval

One forward pass. No reward model.

This comparison is between control interfaces, not a controlled ablation. RMG uses a fixed 20-image visual reference bank per category; baselines must express the same constraint through text prompts, classifier scores, or reward gradients. All methods share the same backbone, resolution, sampler, prompts, and seeds. The largest gains appear on compositional categories: position (+28.75) and two-object generation (+8.08), where text alone is insufficient to ground the target structure.

Method	Time ↓	NFE ↓	Aux. ↓	Mean ↑	Two ↑	Position ↑	Attribution ↑
FLUX.2-klein (4B) baseline	1.00x	1x	-	80.10	91.41	65.25	58.75
Search-based
+ Prompt Opt.	7.87x	8x	8C+2L	84.18	95.45	69.75	64.00
+ Best-of-4	4.07x	4x	4C	83.35	95.96	67.75	65.25
+ SMC	6.17x	4x	81C	80.28	95.71	61.75	57.25
Gradient-based
+ ReNO	19.44x	4x	4C	83.46	93.18	65.50	64.75
Ours (RMG)	1.02x	1x	-	91.17	99.49	94.00	75.25

This comparison is not intended to isolate the effect of additional visual information; it measures whether reference sets provide a practical and efficient interface for test-time control.
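The gains quoted above can be reproduced directly from the table. The snippet below is only a convenience check: the dictionary names are illustrative, and the numbers are copied from the baseline and RMG rows.

```python
# GenEval scores copied from the table (FLUX.2-klein baseline vs. RMG).
baseline = {"mean": 80.10, "two": 91.41, "position": 65.25, "attribution": 58.75}
rmg = {"mean": 91.17, "two": 99.49, "position": 94.00, "attribution": 75.25}

# Absolute gain of RMG over the baseline, rounded to two decimals.
gains = {name: round(rmg[name] - baseline[name], 2) for name in baseline}

if __name__ == "__main__":
    for name, gain in gains.items():
        print(f"{name}: +{gain:.2f}")
```

The position and two-object entries come out to +28.75 and +8.08, matching the figures in the text.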

Semi-Parametric Guidance

A learned variant that knows what to copy

The closed-form reference mean is efficient but blunt: it can accidentally transfer nuisance correlations from the bank, like a shared white background. SPG fixes this by amortizing the same idea into a learnable architecture. A cross-attention anchor summarizes the reference set; a learned residual refiner decides what to keep and what to suppress. Reference set still swappable at inference time. Generation quality unchanged.

Unconditional generation quality on AFHQv2

Method	FID ↓	KID ↓	IS ↑
DiT-B/4 baseline	23.111	0.012	6.554
SPG (ours)	23.256	0.013	6.227

SPG is on par with the DiT-B/4 baseline (FID 23.26 vs. 23.11, KID 0.013 vs. 0.012), indicating that the reference-set anchor does not degrade generation quality.

Inference-time control

Generated class proportions closely track the reference-set composition across a wide range of reference sizes M, demonstrating continuous control without modifying model parameters.

Reference-set swaps · same model, same noise

Panels: generated samples with the full reference set; nearest neighbors in latent space (not copies); cat-only reference set; dog-only reference set.

Broader direction

Generative models that adapt through data, not parameter updates

If control is represented by parameters, adaptation requires optimization and can create interference between concepts. If control is represented by a reference set, adaptation becomes a data operation: adding, removing, or reweighting examples changes the posterior mean and therefore the flow. This makes reference-guided flows naturally suited to low-data regimes, personalization, and continual adaptation settings where the target distribution changes faster than one would like to retrain a model.
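Treating adaptation as a data operation can be made concrete with a reweighted reference mean. The sketch below is an illustration under stated assumptions rather than the paper's code: the function name and the `prior_weights` interface are hypothetical, the likelihood assumes x_t given x₁ is Gaussian with mean t·x₁ and variance (1 − t)², and a weight of zero is treated as removing the example from the bank.

```python
import math


def weighted_reference_mean(x, refs, prior_weights, t):
    """Posterior mean over a reweighted reference bank (requires t < 1).

    Each reference's posterior weight is proportional to
    prior_weight * exp(-||x - t * ref||^2 / (2 * (1 - t)^2)).
    """
    var = (1.0 - t) ** 2
    scored = []
    for ref, w in zip(refs, prior_weights):
        if w <= 0.0:
            continue  # a zero weight removes the reference from the bank
        sq_dist = sum((xi - t * ri) ** 2 for xi, ri in zip(x, ref))
        scored.append((math.log(w) - sq_dist / (2.0 * var), ref))
    if not scored:
        raise ValueError("at least one reference needs a positive weight")
    top = max(logit for logit, _ in scored)  # stabilize the softmax
    exps = [(math.exp(logit - top), ref) for logit, ref in scored]
    z = sum(e for e, _ in exps)
    return [sum(e * ref[d] for e, ref in exps) / z for d in range(len(x))]


if __name__ == "__main__":
    x, t = [0.0, 0.0], 0.5
    refs = [[1.0, 0.0], [-1.0, 0.0]]
    print(weighted_reference_mean(x, refs, [1.0, 1.0], t))  # balanced bank
    print(weighted_reference_mean(x, refs, [3.0, 1.0], t))  # upweight the first example
    print(weighted_reference_mean(x, refs, [1.0, 0.0], t))  # remove the second example
```

No model parameter changes in any of the three calls; only the bank and its weights do, and the target mean moves accordingly.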

Citation

@article{curvo2026follow,
  title={Follow the Mean: Reference-Guided Flow Matching},
  author={Pedro M. P. Curvo and Maksim Zhdanov and Floor Eijkelboom and Jan-Willem van de Meent},
  year={2026},
}