FiD
Definition
FiD (Fusion-in-Decoder)
FiD (Izacard & Grave, 2021) is a retrieval-augmented seq2seq architecture for open-domain QA that encodes each retrieved passage independently alongside the question, then fuses all passage representations in the decoder via cross-attention. By separating encoding (per-passage, parallel) from fusion (joint, in the decoder), FiD scales to 100+ passages where naive concatenation hits context limits and where RAG’s marginalization is slow.
Intuition
Split the work: encode locally, fuse globally
Naive RAG concatenates all passages into one prompt — the self-attention cost grows quadratically with the total number of tokens, so you can only afford a handful of passages. Lewis et al.’s RAG avoids this by marginalizing over documents, but inference is slow and each output is grounded in one document at a time.
FiD’s trick: run the encoder separately on each (question, passage) pair. Encoding $k$ passages of length $n$ costs $O(k n^2)$ instead of $O((kn)^2)$, which is linear rather than quadratic in the number of passages. The decoder then attends over the concatenation of all passage encodings, so cross-attention can synthesize evidence across passages to produce answers not stated verbatim in any single one.
Mathematical Formulation
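Given question $q$ and retrieved passages $z_1, \dots, z_k$, FiD builds one encoder representation per passage and concatenates them for the decoder:
$$x_i = [\,\text{question: } q;\ \text{title: } t_i;\ \text{context: } z_i\,], \qquad H_i = \mathrm{Enc}(x_i), \qquad H = [H_1; \dots; H_k]$$
$$p(y \mid q, z_1, \dots, z_k) = \prod_{j} p_{\mathrm{Dec}}(y_j \mid y_{<j}, H)$$
where:
- $q$: the input question; $z_i$: the $i$-th retrieved passage (with title $t_i$), prepended with special markers (question:, title:, context:)
- $H_i$: encoder hidden states for passage $i$, computed independently (no attention across passages in the encoder)
- $[\cdot\,;\cdot]$: concatenation; $H$: the full set of token representations the decoder attends to
- $\mathrm{Dec}$: autoregressive T5 decoder; its cross-attention ranges over all of $H$, so a single answer can draw on multiple passages
- $y$: the generated answer; $y_{<j}$: previously generated tokens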
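The encode-then-concatenate step can be illustrated with a minimal sketch. Here `toy_encoder` is a hypothetical stand-in for the T5 encoder (it just emits one small vector per whitespace token), and `format_input` and `fuse` are illustrative names, not part of any FiD codebase. The point is only the data flow: each (question, passage) input is encoded on its own, and the results are joined along the token axis.

```python
from typing import List, Tuple

Vector = List[float]


def format_input(question: str, title: str, text: str) -> str:
    """Build one per-passage encoder input with the question/title/context markers."""
    return f"question: {question} title: {title} context: {text}"


def toy_encoder(x: str, dim: int = 4) -> List[Vector]:
    """Hypothetical stand-in for the T5 encoder: one vector per whitespace token.

    It only ever sees a single input string, so no information can flow
    between passages at this stage.
    """
    return [[float((len(tok) + j) % 7) for j in range(dim)] for tok in x.split()]


def fuse(question: str, passages: List[Tuple[str, str]]) -> List[Vector]:
    """Encode each (question, passage) pair separately, then concatenate token-wise."""
    fused: List[Vector] = []
    for title, text in passages:
        fused.extend(toy_encoder(format_input(question, title, text)))
    return fused


question = "What is the capital of France?"
passages = [
    ("Paris", "Paris is the capital of France."),
    ("France", "France is a country in Western Europe."),
]
H = fuse(question, passages)
print(len(H), "token vectors of width", len(H[0]))
```

The length of `H` is the sum of the per-passage token counts, which is what the decoder's cross-attention ranges over.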
Cost: linear vs. quadratic in passages
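Independent encoding makes the dominant encoder self-attention cost linear in the number of passages $k$: $O(k n^2)$ for passages of $n$ tokens each, versus $O((kn)^2)$ for a single concatenated input. Only the (much cheaper) decoder cross-attention sees all tokens jointly.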
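A quick back-of-the-envelope check makes the gap concrete. The sketch below counts only query-key pairs in self-attention and ignores layers, heads, and constant factors. The values of `k` (passages) and `n` (tokens per passage) are illustrative assumptions, and the function names are hypothetical.

```python
def self_attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs in one self-attention pass over num_tokens tokens."""
    return num_tokens * num_tokens


def concat_cost(k: int, n: int) -> int:
    """All k passages of n tokens in one sequence: (k * n)^2 pairs."""
    return self_attention_pairs(k * n)


def fid_cost(k: int, n: int) -> int:
    """k independent encoder passes over n tokens each: k * n^2 pairs."""
    return k * self_attention_pairs(n)


k, n = 100, 250  # illustrative values only
ratio = concat_cost(k, n) / fid_cost(k, n)
print(concat_cost(k, n), fid_cost(k, n), ratio)  # the ratio equals k
```

Under this simplified count, independent encoding is cheaper by exactly a factor of `k`, so the saving grows with the number of passages.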
Key Properties / Variants
- Scalability: Performance on Natural Questions improves monotonically with passage count up to 100+ passages, unlike concatenation which saturates/degrades at the context limit.
- Cross-document synthesis: Decoder cross-attention fuses evidence; answers can be assembled from facts spread across several passages.
- Architectural simplicity: Standard seq2seq (T5) with only modified input formatting — no new parameters, no marginalization machinery.
- Disconnected retriever: Vanilla FiD uses a fixed retriever (BM25 or DPR); there is no end-to-end gradient flow to the retriever (contrast with end-to-end RAG). Atlas later closes this gap with a learnable, periodically reindexed dense Bi-Encoder.
- Inference cost grows with $k$: Decoder cross-attention scans all passage tokens each step; tractable but linear in $k$.
- Non-adaptive: Always retrieves a fixed number of passages regardless of query difficulty (contrast with adaptive Self-RAG).
- Mitigates “lost in the middle”: Independent encoding sidesteps the positional attention decay that plagues single-context concatenation (each passage is encoded as if at the start).
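Two of the properties above, cross-document synthesis and inference cost that grows with the number of passages, both follow from one decoder query attending over every concatenated token. The following is a simplified sketch of a single cross-attention step: one head, no learned projections, and hand-picked toy vectors, all of which are assumptions made for illustration rather than details of the actual T5 decoder.

```python
import math
from typing import List


def cross_attention_weights(query: List[float], memory: List[List[float]]) -> List[float]:
    """Softmax of scaled dot products between one decoder query and every memory vector."""
    d = len(query)
    scores = [sum(q * m for q, m in zip(query, vec)) / math.sqrt(d) for vec in memory]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy encoder outputs for two passages (3 tokens and 2 tokens, width 2).
H_1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
H_2 = [[2.0, 0.0], [0.0, 2.0]]
H = H_1 + H_2  # concatenation along the token axis

weights = cross_attention_weights([1.0, 0.5], H)
mass_per_passage = [sum(weights[:len(H_1)]), sum(weights[len(H_1):])]
print(len(weights), mass_per_passage)
```

One score is computed per token in `H`, so the work per decoding step scales with the total number of passage tokens, and the attention mass is shared across both passages rather than confined to one.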
Algorithm: FiD (Fusion-in-Decoder) — Inference
────────────────────────────────────────────────
Input: question q, retriever R, corpus D, generator (Encoder, Decoder), k
1. Z ← R.retrieve(q, D, top_k = k)                # fixed retriever (BM25 / DPR)
2. for each z_i in Z:                             # independent, parallelizable
       x_i ← "question: " + q +
             " title: " + title(z_i) +
             " context: " + text(z_i)
       H_i ← Encoder(x_i)                         # per-passage hidden states
3. H ← concat(H_1, ..., H_k)                      # fuse in decoder input
4. y ← Decoder.generate(cross_attend_over = H)    # cross-attention over all passages
return y
FiD vs. RAG: what "fusion" means
In RAG (Lewis et al.), documents are latent variables: the model marginalizes over them, $p(y \mid q) = \sum_i p(z_i \mid q)\, p(y \mid q, z_i)$, and gradients flow to the query encoder. In FiD there is no marginalization: all passages are fed jointly to the decoder and the retriever is frozen. FiD trades end-to-end learnability for a large gain in how many passages it can exploit (44.5 EM for RAG vs. up to 68.2 EM for FiD with gold passages on Natural Questions).
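To make the contrast concrete, here is a small sketch of what marginalizing over documents means, in the sequence-level form where each document yields its own distribution over whole answers. The retrieval probabilities and per-document answer probabilities are invented purely for illustration, and `rag_sequence_marginal` is a hypothetical helper name.

```python
from typing import Dict, List


def rag_sequence_marginal(
    doc_probs: List[float],
    answer_probs_per_doc: List[Dict[str, float]],
) -> Dict[str, float]:
    """Compute p(y | q) = sum_i p(z_i | q) * p(y | q, z_i) over the retrieved documents."""
    answers = sorted(set().union(*[set(d) for d in answer_probs_per_doc]))
    return {
        a: sum(p * d.get(a, 0.0) for p, d in zip(doc_probs, answer_probs_per_doc))
        for a in answers
    }


# Invented numbers: two retrieved documents and two candidate answers.
doc_probs = [0.6, 0.4]                      # p(z_i | q)
answer_probs_per_doc = [
    {"Paris": 0.9, "Lyon": 0.1},            # p(y | q, z_1)
    {"Paris": 0.5, "Lyon": 0.5},            # p(y | q, z_2)
]
marginal = rag_sequence_marginal(doc_probs, answer_probs_per_doc)
print(marginal)
```

Each term in the sum conditions on a single document, so evidence is only combined at the level of output probabilities. FiD skips this step entirely and instead lets the decoder read all passage encodings at once.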
Connections
- Compared with: RAG (marginalizes over latent documents, end-to-end), Self-RAG (adaptive retrieval), naive concatenation
- Extended by: Atlas (adds a learnable, periodically reindexed dense retriever on top of the FiD generator)
- Retriever used: BM25 or DPR (a dense Bi-Encoder retriever), which supplies the passages to the FiD generator
- Built on: Transformers (T5 seq2seq), decoder cross-attention
- Part of: Retrieval-Augmented Generation family