Document Identifiers

In autoregressive retrieval and the Differentiable Search Index (DSI), Document Identifiers (DocIDs) are the target sequences that a generative model is trained to produce for each document. The choice of identifier scheme is critical because it determines what the model must predict at each decoding step, and therefore how it “navigates” the document space.

Identifier Schemes

| Scheme | Description | Pros/Cons |
| --- | --- | --- |
| Atomic IDs | Each document is assigned a single, unique token (e.g., doc_742). | Scales poorly because the output vocabulary grows with the corpus; IDs carry no semantic meaning. |
| Naive String IDs | Documents are identified by their content (e.g., the first 10 words or the title). | Good for known-item search; bad for long docs. |
| Numeric Strings | Documents get random or sequential numbers (e.g., 1, 2, 3). | Easy to assign; hard for the model to learn relationships between documents. |
| Semantic IDs | IDs based on a hierarchy (e.g., 10.2.5.1) created via hierarchical clustering of document embeddings. | Often the best-performing scheme; related documents share ID prefixes. |
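
To make the Semantic IDs row concrete, here is a minimal sketch of hierarchical ID assignment by recursive clustering. It is an illustration under stated assumptions, not a reference implementation: the 2-D vectors stand in for real document embeddings, the tiny pure-Python k-means replaces whatever clustering library a real system would use, and the names `kmeans`, `assign_semantic_ids`, `k`, and `max_leaf` are hypothetical.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Tiny k-means: return a cluster index in range(k) for every point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members; leave empty clusters alone.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


def assign_semantic_ids(embeddings, k=2, max_leaf=2, prefix=()):
    """Map each document name to a tuple ID; similar documents share prefixes."""
    names = sorted(embeddings)
    flat = {name: prefix + (i,) for i, name in enumerate(names)}
    if len(names) <= max_leaf:
        return flat  # small enough: just number the documents in this cluster
    labels = kmeans([embeddings[n] for n in names], min(k, len(names)))
    groups = [
        {n: embeddings[n] for n, lab in zip(names, labels) if lab == c}
        for c in range(k)
    ]
    groups = [g for g in groups if g]
    if len(groups) < 2:
        return flat  # clustering could not split the set; stop recursing
    ids = {}
    for c, group in enumerate(groups):
        # The cluster index becomes the next component of the ID.
        ids.update(assign_semantic_ids(group, k, max_leaf, prefix + (c,)))
    return ids


# Toy 2-D "embeddings" (assumed values): two ML documents, two cooking documents.
embeddings = {
    "deep_learning": (1.0, 0.1),
    "neural_networks": (0.9, 0.0),
    "sourdough": (0.0, 1.0),
    "pasta": (0.1, 0.9),
}
ids = assign_semantic_ids(embeddings)
for name, doc_id in sorted(ids.items()):
    print(name, ".".join(map(str, doc_id)))
```

In this toy run the two machine-learning documents end up under one first-level cluster and the two cooking documents under the other, so each pair shares its leading ID component. Which cluster receives which number depends on the random initialization, so the concrete digits carry no meaning beyond grouping.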

The Power of Semantic Structure

If two documents cover closely related topics, such as “Deep Learning” and “Neural Networks”, a good semantic ID scheme might assign them IDs that start with the same prefix (e.g., 5.1.x). This helps the model generalize: once the model has predicted the first few components of the ID correctly, it has already narrowed the search down to the correct topical “neighborhood.”
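
The narrowing effect can be sketched in a few lines. The corpus, the ID values, and the helper name `candidates` below are hypothetical and chosen only to mirror the 5.1.x example; a real system would obtain the prefix from the model's decoder rather than from a hard-coded tuple.

```python
# Hypothetical toy corpus: document name -> semantic ID (one integer per level).
DOC_IDS = {
    "deep_learning_intro": (5, 1, 0),
    "neural_network_basics": (5, 1, 1),
    "graph_databases": (5, 2, 0),
    "sourdough_baking": (8, 0, 0),
}


def candidates(prefix):
    """Return the documents whose ID starts with the decoded prefix."""
    prefix = tuple(prefix)
    return sorted(
        name for name, doc_id in DOC_IDS.items()
        if doc_id[:len(prefix)] == prefix
    )


# Simulate decoding the ID 5.1.1 one component at a time.
target = (5, 1, 1)
for step in range(len(target) + 1):
    prefix = target[:step]
    print(".".join(map(str, prefix)) or "(empty)", candidates(prefix))
```

With each decoded component the candidate set shrinks from four documents to three, then two, then one. After only two components the remaining candidates are exactly the two deep-learning documents, which is the topical neighborhood the prose describes.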
