Pre-Print

    Follow the Mean: Reference-Guided Flow Matching

    Show it examples.
    Change what it generates: no retraining, no gradients, no reward signals.

    Pedro M. P. Curvo · Maksim Zhdanov · Floor Eijkelboom · Jan-Willem van de Meent

    University of Amsterdam

arXiv · GitHub (coming soon)

    The idea

    Guide with examples, not rewards.

    Think about how humans actually learn. Not by being scored on every move, but by watching someone do it first. You learn to cook by seeing a dish made, not by getting a number between 0 and 1 after each attempt. For most things that matter, we learn from examples.

    The same intuition applies to generative models. If you ask an artist to paint in the style of Van Gogh and hand them a reward signal at every brushstroke, they'll be lost. Show them a handful of Van Gogh paintings instead, and suddenly the task is clear: not because the examples are a perfect specification, but because they shift the artist's mental picture of where the painting should go.

    That's exactly what this paper does, but for flow matching models. The velocity field at any point in time is fully determined by a conditional endpoint mean: a kind of running best guess of where the generation is headed. Swap in a handful of reference images, and you shift that guess. The flow follows. No fine-tuning, no reward network, no repeated sampling. A single forward pass on a frozen model.

    The reference images don't need to be perfect. They just need to shift the mean in the right direction.

[Paired examples: baseline vs. pink-elephant guided · baseline vs. giraffe guided · baseline vs. reference-guided gymnast ring leap]
    Prompt, seed, and model weights are fixed. Changing the reference set changes the output.

    Method

    How it works

    01

    Recover the endpoint mean

    For any pretrained rectified-flow model, the endpoint mean is just μt(x) = x + (1 − t)ut(x). No extra computation needed: it falls out of the velocity prediction.

    02

    Compute a reference mean

    Given M reference images, compute their posterior mean in closed form as a softmax-weighted sum. This is exact under a Gaussian bridge: the same math that underlies cross-attention.

    03

    Apply the correction

    Add a scheduled velocity correction proportional to the difference between reference mean and model mean. No gradients, no backward pass, no auxiliary model.
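The three steps above admit a compact sketch. Below is a minimal numpy version, assuming a linear rectified-flow bridge with standard-normal noise at t = 0 (so x_t | x_1 ~ N(t·x_1, (1 − t)²I)); the function and variable names are illustrative, not the paper's code:

```python
import numpy as np

def endpoint_mean(x, t, velocity):
    """Step 01: recover the endpoint mean from a rectified-flow velocity:
    mu_t(x) = x + (1 - t) * u_t(x)."""
    return x + (1.0 - t) * velocity

def reference_mean(x, t, refs):
    """Step 02: softmax-weighted posterior mean over a reference bank.
    Under the Gaussian bridge x_t | x_1 = y ~ N(t*y, (1-t)^2 I), the posterior
    weight of reference y_i is a softmax over negative scaled distances."""
    # refs: (M, d) reference endpoints; x: (d,) current state
    var = (1.0 - t) ** 2
    logits = -np.sum((x - t * refs) ** 2, axis=1) / (2.0 * var)
    logits -= logits.max()          # numerical stability
    w = np.exp(logits)
    w /= w.sum()
    return w @ refs                 # (d,) weighted endpoint mean

def guided_velocity(x, t, velocity, refs, beta):
    """Step 03: add the scheduled correction toward the reference mean."""
    mu_model = endpoint_mean(x, t, velocity)
    mu_ref = reference_mean(x, t, refs)
    return velocity + beta * (mu_ref - mu_model) / (1.0 - t)
```

At beta = 0 the guided velocity reduces exactly to the model's own velocity, so the correction is a strict superset of ordinary sampling.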

    From flow matching to endpoint means

    Start with the linear flow-matching bridge between noise x0 and data x1. For a fixed intermediate state xt = x, there are many possible endpoint pairs that could have produced it. The model therefore cannot use a single displacement; the loss-minimizing marginal velocity is the average bridge velocity over all endpoint pairs consistent with the current state:

xt = (1 − t) x0 + t x1

ut(x) = E[x1 − x0 | xt = x]

ut(x) = (μt(x) − x) / (1 − t)  ⇒  μt(x) = x + (1 − t) ut(x)

    Once the pretrained model gives μtθ, a reference bank gives an empirical target mean μ̂tρ. Interpolating between those endpoint means yields the closed-form velocity correction below.

    Reference-guided endpoint mixture

    To turn a reference set into guidance, define a target endpoint distribution as a geometric mixture of the pretrained data distribution p1 and the reference distribution ρ1:

π(x1) ∝ p1(x1)^(1 − βt) · ρ1(x1)^(βt)

    μtπ(x) ≃ (1 − βt) μtθ(x) + βt μ̂tρ(x)

    Since the linear-bridge velocity is just the endpoint mean minus the current state divided by the remaining time, replacing the model mean with this mixed mean leaves the original velocity plus a reference-mean correction.
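Making that substitution explicit: dividing the mixed endpoint mean by the remaining time and splitting terms gives

```latex
u_t^{\pi}(x) \;=\; \frac{\mu_t^{\pi}(x) - x}{1 - t}
\;\simeq\; \frac{(1 - \beta_t)\,\mu_t^{\theta}(x) + \beta_t\,\hat{\mu}_t^{\rho}(x) - x}{1 - t}
\;=\; u_t^{\theta}(x) \;+\; \beta_t\,\frac{\hat{\mu}_t^{\rho}(x) - \mu_t^{\theta}(x)}{1 - t}
```

where the last step uses (1 − βt)μθ + βt μ̂ρ − x = (μθ − x) + βt(μ̂ρ − μθ).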

    Guided velocity

utπ(x) ≃ utθ(x) + βt · (μ̂tρ(x) − μtθ(x)) / (1 − t)

βt is a scalar guidance schedule, with quadratic decay by default. No classifier, no reward, no search.
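A minimal sampling-loop sketch using the guided velocity. Assumptions are mine: βt = β0(1 − t)² is one plausible reading of "quadratic decay" (the paper's exact schedule may differ), `model_velocity` stands in for the frozen pretrained network, and plain Euler integration is used:

```python
import numpy as np

def beta_schedule(t, beta0=1.0):
    # Assumed quadratic-decay schedule; the paper's exact form may differ.
    return beta0 * (1.0 - t) ** 2

def euler_sample(model_velocity, refs, x0, steps=50, beta0=1.0):
    """Integrate the guided ODE from noise x0 at t=0 toward a sample at t=1."""
    x, dt = x0.copy(), 1.0 / steps
    for k in range(steps):
        t = k * dt                                     # t stays below 1
        u = model_velocity(x, t)                       # frozen model, forward pass only
        mu_model = x + (1.0 - t) * u                   # recovered endpoint mean
        d = -np.sum((x - t * refs) ** 2, axis=1) / (2.0 * (1.0 - t) ** 2)
        w = np.exp(d - d.max()); w /= w.sum()
        mu_ref = w @ refs                              # softmax-weighted reference mean
        u = u + beta_schedule(t, beta0) * (mu_ref - mu_model) / (1.0 - t)
        x = x + dt * u                                 # Euler step
    return x
```

Because t never reaches 1 inside the loop, the 1/(1 − t) factors stay finite; the decaying schedule additionally damps the correction near the end of sampling.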

    Why it's mathematically sound

    The guidance rule follows from two constructions that both yield valid bridge marginals. The geometric mixture at the endpoint level is valid by construction and recovers the same velocity formula under a Gaussian posterior approximation, which holds exactly in VAE latent spaces. An arithmetic mixture of the two distributions gives the same velocity via an exact Bayes' rule computation. Both support the same guidance update.

    Reference banks

    What the reference sets look like

    Each reference bank contains 20 images. They don't need to match the prompt: they just need to encode the target attribute. Toy-like pink elephants steer toward pink. Keyhole silhouettes steer toward keyhole compositions.

Pink elephants · Giraffes · Van Gogh style · Keyhole shapes

    Reference-Mean Guidance · Results

    Swap the reference set, change the output

    Prompt and noise seed are fixed within each column. Only the reference set changes. The output shifts systematically in color, object identity, and style.

elephant in a jungle: baseline · pink elephants · blue elephants

a cat: baseline · studio photos · Van Gogh

a house in a forest: baseline · sketches · cinematic

animal in a savanna: baseline · giraffes · zebras

    Structural control

    It can even transfer geometry

    Structural correctness is hard to define as a reward: VLMs struggle to judge whether a hand is correctly oriented or a silhouette matches a target shape. Reference-mean guidance sidesteps this entirely: just show it examples of the target structure. No pose estimator, no spatial loss, no gradient.

Keyhole shape — "a miniature forest...all inside a keyhole on a black background": baseline · reference-guided · nearest reference

Hand pose — "a hand doing the sign of the horns": baseline · reference-guided · nearest reference

Gymnastics pose — "a gymnast performing a ring leap...": baseline · reference-guided · nearest reference

    The nearest-reference column confirms that RMG transfers structural priors rather than copying reference content.

    Comparison of control interfaces · GenEval

    One forward pass. No reward model.

    This comparison is between control interfaces, not a controlled ablation. RMG uses a fixed 20-image visual reference bank per category; baselines must express the same constraint through text prompts, classifier scores, or reward gradients. All methods share the same backbone, resolution, sampler, prompts, and seeds. The largest gains appear on compositional categories: position (+28.75) and two-object generation (+8.08), where text alone is insufficient to ground the target structure.

| Method | Time ↓ | NFE ↓ | Aux. ↓ | Mean ↑ | Two ↑ | Position ↑ | Attribution ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FLUX.2-klein (4B) baseline | 1.00x | 1x | - | 80.10 | 91.41 | 65.25 | 58.75 |
| Search-based: + Prompt Opt. | 7.87x | 8x | 8C+2L | 84.18 | 95.45 | 69.75 | 64.00 |
| Search-based: + Best-of-4 | 4.07x | 4x | 4C | 83.35 | 95.96 | 67.75 | 65.25 |
| Search-based: + SMC | 6.17x | 4x | 81C | 80.28 | 95.71 | 61.75 | 57.25 |
| Gradient-based: + ReNO | 19.44x | 4x | 4C | 83.46 | 93.18 | 65.50 | 64.75 |
| Ours (RMG) | 1.02x | 1x | - | 91.17 | 99.49 | 94.00 | 75.25 |

    This comparison is not intended to isolate the effect of additional visual information; it measures whether reference sets provide a practical and efficient interface for test-time control.

    Semi-Parametric Guidance

    A learned variant that knows what to copy

The closed-form reference mean is efficient but blunt: it can accidentally transfer nuisance correlations from the bank, like a shared white background. SPG fixes this by amortizing the same idea into a learnable architecture: a cross-attention anchor summarizes the reference set, and a learned residual refiner decides what to keep and what to suppress. The reference set remains swappable at inference time, and generation quality is unchanged.
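The page names the two SPG components without architectural detail, so the following numpy sketch is hypothetical: the shapes, projection matrices, single-head attention, and MLP refiner are all my assumptions, meant only to show how a cross-attention anchor and a residual refiner could compose:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                   # latent width (illustrative)

# Hypothetical projections; in SPG these would be learned end to end.
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
W1 = rng.standard_normal((d, 2 * d)) / np.sqrt(d)
W2 = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)

def attention_anchor(x, refs):
    """Summarize the reference set with single-head cross-attention:
    the current state queries the bank, producing a reference anchor."""
    q, k, v = x @ Wq, refs @ Wk, refs @ Wv
    logits = k @ q / np.sqrt(d)
    w = np.exp(logits - logits.max()); w /= w.sum()
    return w @ v

def residual_refiner(anchor):
    """A small MLP residual deciding what of the anchor to keep or suppress."""
    h = np.maximum(anchor @ W1, 0.0)     # ReLU
    return anchor + h @ W2

x = rng.standard_normal(d)               # current latent
refs = rng.standard_normal((20, d))      # encoded 20-image reference bank
guidance_feature = residual_refiner(attention_anchor(x, refs))
```

Swapping `refs` for a different bank changes `guidance_feature` with no weight update, which is the property the text emphasizes.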

    Unconditional generation quality on AFHQv2

| Method | FID ↓ | KID ↓ | IS ↑ |
| --- | --- | --- | --- |
| DiT-B/4 baseline | 23.111 | 0.012 | 6.554 |
| SPG (ours) | 23.256 | 0.013 | 6.227 |

    SPG matches DiT-B/4 baseline (FID 23.26 vs 23.11), confirming the reference-set anchor does not degrade generation quality.

    Inference-time control

    Generated class proportions closely track the reference-set composition across a wide range of reference sizes M, demonstrating continuous control without modifying model parameters.

    Reference-set swaps · same model, same noise

full reference set: generated samples

nearest neighbors (latent space): not copies

cat-only reference set

dog-only reference set

    Broader direction

    Generative models that adapt through data, not parameter updates

If control is represented by parameters, adaptation requires optimization and can create interference between concepts. If control is represented by a reference set, adaptation becomes a data operation: adding, removing, or reweighting examples changes the posterior mean and therefore the flow. This makes reference-guided flows naturally suited to low-data regimes, personalization, and continual adaptation settings where the target distribution changes faster than a model can reasonably be retrained.
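A toy illustration of adaptation as a data operation, reusing the softmax-weighted reference mean from RMG (the 2-D "class prototypes" and the specific setup are invented for this example): editing the bank's composition moves the posterior mean continuously, with no parameter update anywhere.

```python
import numpy as np

def reference_mean(x, t, refs):
    """Softmax-weighted posterior mean over reference endpoints (Gaussian bridge)."""
    d = -np.sum((x - t * refs) ** 2, axis=1) / (2.0 * (1.0 - t) ** 2)
    w = np.exp(d - d.max()); w /= w.sum()
    return w @ refs

cat, dog = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # stand-in class prototypes
x, t = np.zeros(2), 0.5

# "Adaptation" is editing the bank, not the model: vary the cat/dog mix.
for n_cat in (10, 5, 0):
    bank = np.vstack([cat] * n_cat + [dog] * (10 - n_cat))
    print(n_cat, reference_mean(x, t, bank))
```

Because x is equidistant from both prototypes here, the weights are uniform and the mean tracks the bank's composition exactly, mirroring the class-proportion tracking reported for SPG.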

    Citation

    @article{curvo2026follow,
      title={Follow the Mean: Reference-Guided Flow Matching},
      author={Pedro M. P. Curvo and Maksim Zhdanov and Floor Eijkelboom and Jan-Willem van de Meent},
      year={2026},
    }