Pre-Print

    Follow the Mean: Reference-Guided Flow Matching

    Show it examples.
    Change what it generates: no retraining, no gradients, no reward signals.

    Pedro M. P. Curvo · Maksim Zhdanov · Floor Eijkelboom · Jan-Willem van de Meent

    University of Amsterdam

    Controlling what a generative model produces usually means one of three things: fine-tune it on new data, attach a reward network that scores outputs, or run expensive search at inference time. All three are slow, brittle, or require labeled signal. This paper does something different: it steers a frozen model using only a handful of example images, exploiting a structural property of flow matching that makes this exact and efficient.

    The idea

    Guide with examples, not rewards.

    Think about how humans actually learn. Not by being scored on every move, but by watching someone do it first. You learn to cook by seeing a dish made. If you ask an artist to paint in the style of Van Gogh and hand them a reward signal at every brushstroke, they'll be lost. Show them a handful of Van Gogh paintings instead, and suddenly the task is clear — not because the examples are a perfect specification, but because they shift the artist's mental picture of where the painting should go.

    The same intuition applies here, and flow matching makes it precise. At every step of generation, the model's velocity is entirely determined by one thing: where the model expects to end up — its conditional endpoint mean. That expectation is not a side product; it is the velocity. Shift the expectation, and the strokes change. Which means the examples aren't just intuition — they're the exact object you need to control.

    And it turns out the math is simpler than you'd expect. Given a finite reference set, the reference endpoint mean has a closed form under a Gaussian bridge: a softmax-weighted sum over your reference images, where the weights are posterior probabilities computed from where you are in the trajectory. No neural network, no optimization loop. You compute it, apply a velocity correction proportional to the difference from the model's prediction, and you're done.

    Best of all: the reference images don't need to match the prompt. They don't need to be high quality. They just need to encode the attribute you want to transfer. Twenty pink toy elephants will steer toward pink. Twenty Van Gogh paintings will steer toward impasto brushstrokes. The model never sees them during training. It sees them once, at inference time, and follows.

    Follow the MeanReference-Guided Flow MatchingNoise
    Referencesetr₁r₂r₃r₄arXiv:2605.10302

    Method

    How it works

    The mechanism has three steps, each requiring only what the frozen model already computes.

    01

    Recover the endpoint mean

    For any pretrained rectified-flow model, the endpoint mean is just μt(x) = x + (1 − t)ut(x). No extra computation needed: it falls out of the velocity prediction.

    02

    Compute a reference mean

    Given M reference images, compute their posterior mean in closed form as a softmax-weighted sum. This is exact under a Gaussian bridge: the same math that underlies cross-attention.

    03

    Apply the correction

    Add a scheduled velocity correction proportional to the difference between reference mean and model mean. No gradients, no backward pass, no auxiliary model.

    Under the hood

    The derivation

    The three steps above have a clean derivation. This is where the endpoint-mean correction comes from, and why the reference mean takes the form of a softmax over examples.

    Linear bridge

    Same state

    Expected velocity

    Endpoint mean

    Mean shift

    Reference target

    Guided mean

    Guided velocity

    Reference posterior

    Attention form

    xt = (1 - t)x0 + tx1
    (x0, x1) such that xt = x
    ut(x) = E[x1 - x0 | xt = x]
    ut(x) =μt(x) - x1 - tμt(x) = E[x1 | xt = x]
    utπ(x) - ut(x) =μtπ(x) - μt(x)1 - t
    π(x1) ∝ p1(x1)1-βt·ρ1(x1)βt
    μtπ(x) = (1 - βt) μt(x) + βt μtρ(x)
    utπ(x)= utθ(x) + βμ̂tρ(x) - μtθ(x)1 - t
    μ̂tρ(x)= Σm αt(m)(x)·r(m)
    αt(m)(x) = Softmaxm(txTr(m) - ½t2||r(m)||2(1 - t)2)

    Flow matching begins with a straight path from source noise to a data endpoint.

    Many endpoint pairs can pass through the same intermediate point, so the flow averages over them. At inference time you don't know which endpoint you're heading toward — you only know where you are now. The marginal field has to be consistent with all possible endpoints that could have produced this state.

    The marginal vector field is the conditional expectation of those bridge velocities.

    For a Gaussian source, integrating out x0 leaves a velocity that depends on nothing except the expected endpoint — the model's running best guess of where the generation will land. This is the lever. Control the endpoint mean, and you control the flow.

    You don't need to retrain anything, modify the architecture, or define a reward. To change where the model goes, you only need to shift one quantity: the conditional endpoint mean. The velocity correction falls out as a ratio — mean shift divided by time remaining. That's the entire mechanism.

    We want a target distribution that interpolates between the model's prior and the reference density. A geometric mixture does this: β = 0 recovers the base model, β = 1 follows the reference entirely. This is an approximation — exact computation under the bridge is intractable — but under the Gaussian posterior assumption that holds in FLUX.2's VAE latent space, it gives the closed-form result that follows.

    The geometric mixture produces a mean that is a convex combination of the model's prediction and the reference mean. βt is scheduled over the trajectory — usually near zero early, when structure is forming, and larger later, when details are resolved. You control how hard the references pull, and when.

    Substituting that mean shift gives the practical correction used during sampling. The only missing ingredient is μ̂tρ, the approximate reference mean, which the next two steps compute from a finite bank of examples.

    For a finite reference bank, the reference mean is a posterior-weighted average of examples. The weights αt(m) are posterior probabilities: how likely is each reference image to be the true endpoint, given where you are now in the trajectory? Under a Gaussian bridge, this posterior has a closed form. No inference network, no optimization.

    The weights are a softmax over dot products between the current state xt and each reference image r(m), normalized by the bridge variance (1−t)2. This is cross-attention — not by analogy, but exactly. The current state is the query. The reference images are keys and values.

    x_tμ_tμ_t^ρmean shift
    β = 0model meanβ = 1reference meanguided mean
    Cross-attention over R

    Part I

    Reference-Mean Guidance

    Training-free. Closed-form. Applied to a frozen 4B-parameter FLUX.2-klein model. The next three sections show what it does.

    Reference-Mean Guidance · Results

    Swap the reference set, change the output

    Take the prompt "elephant in a jungle." A frozen FLUX.2-klein generates a reasonable image — grey elephant, green canopy. Now hand it twenty images of pink toy elephants. None of them contain jungles. None of them match the prompt. The prompt hasn't changed. The seed hasn't changed. The 4-billion model weights haven't changed. The output is now pink.

    Swap the reference bank to blue elephants. The output turns blue.

    That's the whole interface. A folder of images. Swappable at inference time, with no retraining. The nearest-neighbor column shows the closest image in the reference bank, so you can see that the model is being guided rather than simply copying.

    The reference banks used below

    Pink elephants
    Pink elephants
    Giraffes
    Giraffes
    Van Gogh style
    Van Gogh style
    Keyhole shapes
    Keyhole shapes

    Examples don't need to be perfect

    Style change

    Any reference set R that shifts the endpoint mean in the right direction works.

    Examples don't need to be perfect

    Attribute change

    Any reference set R that shifts the endpoint mean in the right direction works.

    Examples don't need to be perfect

    Compositionality

    Any reference set R that shifts the endpoint mean in the right direction works.

    Style change: baseline
    baseline
    Style change: ours
    ours
    Style change: nearest neighbor
    nearest neighbor
    Attribute change: baseline
    baseline
    Attribute change: ours
    ours
    Attribute change: nearest neighbor
    nearest neighbor
    Compositionality: baseline
    baseline
    Compositionality: ours
    ours
    Compositionality: nearest neighbor
    nearest neighbor

    Structural control

    It can even transfer geometry

    We didn't expect this to work for structure.

    Structural correctness is hard to define as a reward: VLMs struggle to judge whether a hand is correctly oriented or a silhouette matches a target shape. Reference-mean guidance sidesteps this entirely: just show it examples of the target structure. No pose estimator, no spatial loss, no gradient.

    Keyhole shape

    a miniature forest...all inside a keyhole on a black background

    Hand pose

    a hand doing the sign of the horns

    Gymnastics pose

    a gymnast performing a ring leap...

    Keyhole shape: baseline
    baseline
    Keyhole shape: ours
    ours
    Keyhole shape: nearest neighbor
    nearest neighbor
    Hand pose: baseline
    baseline
    Hand pose: ours
    ours
    Hand pose: nearest neighbor
    nearest neighbor
    Gymnastics pose: baseline
    baseline
    Gymnastics pose: ours
    ours
    Gymnastics pose: nearest neighbor
    nearest neighbor

    GenEval benchmark

    One forward pass. No reward model.

    This comparison is between control interfaces, not a controlled ablation. RMG uses a fixed 20-image visual reference bank per category; baselines must express the same constraint through text prompts, classifier scores, or reward gradients. All methods share the same backbone, resolution, sampler, prompts, and seeds. The largest gains appear on compositional categories: position (+28.75) and two-object generation (+8.08), where text alone is insufficient to ground the target structure.

    MethodTime downNFE downAux. downMean upTwo upPosition upAttribution up
    FLUX.2-klein (4B) baseline1.00x1x-80.1091.4165.2558.75
    Search-based
    + Prompt Opt.7.87x8x8C+2L84.1895.4569.7564.00
    + Best-of-44.07x4x4C83.3595.9667.7565.25
    + SMC6.17x4x81C80.2895.7161.7557.25
    Gradient-based
    + ReNO19.44x4x4C83.4693.1865.5064.75
    Ours (RMG)1.02x1x-91.1799.4994.0075.25

    The largest gains are on the categories where text prompts consistently fail: position (+28.75 points) and two-object composition (+8.08). Showing the model examples of correct spatial arrangements works where describing them in text does not.

    Semi-Parametric Guidance

    A learned variant that knows what to copy

    Reference-Mean Guidance has a limitation. The closed-form correction uses all of the reference bank's information — including things you don't want. A bank of pink toy elephants, all photographed against a white studio backdrop, will steer toward pink backgrounds as well as pink subjects. RMG can't distinguish between the attribute you care about and the ones that happen to correlate with it.

    Semi-Parametric Guidance fixes this by amortizing the same idea into a learned architecture. Instead of applying the closed-form correction directly, SPG uses a cross-attention anchor to summarize the reference set, and a learned residual refiner to decide what to keep and what to suppress. The reference set is still swappable at inference time — nothing about that changes. But now the model can learn which aspects of the reference distribution are relevant to the generation task.

    Trained on AFHQv2, SPG matches DiT-B/4 baseline quality (FID 23.26 vs 23.11), meaning the reference-set anchor doesn't degrade generation. And by swapping the reference set at inference time, you get continuous control over what the model generates: show it only cats, it generates cats; show it only dogs, it generates dogs; show it a mixture, it tracks the mixture proportions.

    Unconditional generation quality on AFHQv2

    MethodFID downKID downIS up
    DiT-B/4 baseline23.1110.0126.554
    SPG (ours)23.2560.0136.227

    SPG matches DiT-B/4 baseline (FID 23.26 vs 23.11), confirming the reference-set anchor does not degrade generation quality.

    Inference-time control

    Generated class proportions closely track the reference-set composition across a wide range of reference sizes M, demonstrating continuous control without modifying model parameters.

    Reference-set swaps · same model, same noise

    full reference set: generated

    SPG full reference setSPG full reference setSPG full reference setSPG full reference setSPG full reference set

    nearest neighbors (latent space): not copies

    Nearest neighborsNearest neighborsNearest neighborsNearest neighborsNearest neighbors

    cat-only reference set

    Cat-only referenceCat-only referenceCat-only referenceCat-only referenceCat-only reference

    dog-only reference set

    Dog-only referenceDog-only referenceDog-only referenceDog-only referenceDog-only reference

    Conclusion

    What this points toward

    If control is represented by parameters, adaptation requires optimization and can create interference between concepts. If control is represented by a reference set, adaptation becomes a data operation: adding, removing, or reweighting examples changes the posterior mean and therefore the flow. This makes reference-guided flows naturally suited to low-data regimes, personalization, and continual adaptation settings where the target distribution changes faster than one would like to retrain a model.

    Fine-tuning, reward networks, test-time search — these are the standard answers to controllable generation. Reference-guided flow matching suggests a different answer: a reference set. Swappable, interpretable, and requiring no gradient.

    Citation

    @misc{curvo2026followmeanreferenceguidedflow,
          title={Follow the Mean: Reference-Guided Flow Matching},
          author={Pedro M. P. Curvo and Maksim Zhdanov and Floor Eijkelboom and Jan-Willem van de Meent},
          year={2026},
          eprint={2605.10302},
          archivePrefix={arXiv},
          primaryClass={cs.LG},
          url={https://arxiv.org/abs/2605.10302},
    }