LoRA

Definition

LoRA (Low-Rank Adaptation)

LoRA is a parameter-efficient fine-tuning (PEFT) method that adapts a large pretrained model to a downstream task without updating its original weights. Instead of fine-tuning a weight matrix $W_0 \in \mathbb{R}^{d \times k}$ directly, LoRA freezes $W_0$ and learns a low-rank additive update $\Delta W = BA$, where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ with rank $r \ll \min(d, k)$. Only $A$ and $B$ are trained, so the number of trainable parameters drops by orders of magnitude.

In this course LoRA is the mechanism that makes the LLM-as-RS formulation tractable: a frozen LLM backbone plus a small trainable LoRA adapter is fine-tuned on recommendation data (e.g. TALLRec, LLaRA, CoLLM).

Intuition

Fine-tune the change, not the weights — and keep it skinny

Full fine-tuning of a billion-parameter LLM means learning a dense update $\Delta W$ the same size as $W_0$, which is expensive in compute, memory, and storage (one full copy of the model per task). The empirical observation behind LoRA is that the update needed to adapt a model has a low “intrinsic rank”: the meaningful change lives in a tiny subspace. So instead of a full $d \times k$ matrix, we factor the update through a narrow $r$-dimensional bottleneck, $\Delta W = BA$. With $r$ as small as 4–16, you train a fraction of a percent of the parameters yet recover most of the quality of full fine-tuning.
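The bottleneck idea can be checked numerically. The following is a minimal NumPy sketch, assuming the usual LoRA shapes (a d × r factor B and an r × k factor A); the sizes and the random factors are hypothetical stand-ins for trained adapter weights, not values from the course material.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 48, 4                 # hypothetical layer sizes and adapter rank

B = rng.standard_normal((d, r))     # stand-in for a trained up-projection
A = rng.standard_normal((r, k))     # stand-in for a trained down-projection

delta_W = B @ A                     # dense-looking d x k update
rank = np.linalg.matrix_rank(delta_W)

print("update shape:", delta_W.shape)
print("numerical rank:", rank)
print("values stored: dense", d * k, "vs factored", r * (d + k))
```

Although the product has the full d × k shape, it can never have rank above r, which is why storing and training only the two thin factors is enough.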

For recommendation this is the enabling trick: a vanilla LLM never saw click signal, Top-K ranking, or item structure during pretraining (the alignment problem). LoRA lets you inject that recommendation-specific signal cheaply while leaving the frozen LLM’s world knowledge intact — the snowflake (frozen LLM) + flame (trainable LoRA) picture from the LLM-as-RS slides.

Mathematical Formulation

Low-Rank Weight Update

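For a pretrained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA replaces the forward pass with

$$h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} B A x$$

where:

  • $W_0 \in \mathbb{R}^{d \times k}$: frozen pretrained weights (not updated)
  • $A \in \mathbb{R}^{r \times k}$: trainable down-projection, initialized random Gaussian
  • $B \in \mathbb{R}^{d \times r}$: trainable up-projection, initialized to $0$
  • $r$: rank of the adapter, $r \ll \min(d, k)$ (the bottleneck width)
  • $\alpha$: scaling constant; the factor $\alpha / r$ rescales the update so its magnitude is roughly $r$-independent
  • $x \in \mathbb{R}^{k}$: layer input; $h \in \mathbb{R}^{d}$: layer output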

Because $B$ is initialized to $0$, $\Delta W = BA = 0$ at the start, so training begins exactly at the pretrained model and only learns to deviate as needed.
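The forward pass and its initialization can be sketched in a few lines of NumPy. This is an illustrative sketch, assuming the standard LoRA notation (frozen W0, down-projection A, up-projection B, scaling alpha / r); the toy sizes and the function name `lora_forward` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 6, 4, 2, 4.0             # hypothetical toy sizes

W0 = rng.standard_normal((d, k))          # frozen pretrained weights
A = 0.01 * rng.standard_normal((r, k))    # down-projection, small Gaussian init
B = np.zeros((d, r))                      # up-projection, zero init

def lora_forward(x, W0, A, B, alpha):
    """Compute h = W0 x + (alpha / r) * B A x without forming the d x k update."""
    r = A.shape[0]
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(k)
h = lora_forward(x, W0, A, B, alpha)

# With B = 0 the adapter contributes nothing, so h equals the pretrained output.
print(np.allclose(h, W0 @ x))
```

Applying A first and B second keeps the extra work at roughly r(d + k) multiplications per input instead of d·k.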

Trainable Parameter Count

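A full update $\Delta W$ has $d \cdot k$ parameters; the LoRA update has only $r(d + k)$. For example, with illustrative values $d = k = 4096$ and $r = 8$: $65{,}536$ trainable params vs $\approx 16.8$M, a $256\times$ reduction per adapted matrix.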
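The count is easy to reproduce. Below is a small sketch, assuming a d × k weight matrix adapted with rank r; the helper name `lora_param_counts` and the sizes passed to it are illustrative choices, not figures taken from a specific model.

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int, float]:
    """Return (full-update params, LoRA params, reduction factor) for one matrix."""
    full = d * k            # dense update has the same shape as the weight matrix
    lora = r * (d + k)      # B is d x r, A is r x k
    return full, lora, full / lora

full, lora, ratio = lora_param_counts(d=4096, k=4096, r=8)
print(f"full update: {full:,} params")
print(f"LoRA update: {lora:,} params")
print(f"reduction:   {ratio:.0f}x")
```

The reduction factor is d·k / (r(d + k)), so it grows with the layer size and shrinks as the rank increases.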

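In the LLM-as-RS training objective the LoRA adapter is optimized with the same next-token / instruction loss as the LLM, e.g. supervised fine-tuning on instances like “Given the user’s liked/disliked items and a target item, answer Yes/No”:

$$\min_{A, B} \; \mathcal{L}(A, B) = -\sum_{(x, y) \in \mathcal{D}} \sum_{t=1}^{|y|} \log p_{W_0 + \frac{\alpha}{r} BA}\left(y_t \mid x, y_{<t}\right)$$

where $x$ here denotes the instruction prompt and $y$ the target response, and the gradient flows only into $A$ and $B$; $W_0$ stays frozen. The adapted model outputs item titles as text (LLM-as-RS); more generally, the adapted LLM carries the injected collaborative signal.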
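To make "gradient flows only into the adapter" concrete, here is a toy sketch with hand-written gradients. It assumes a single linear output layer with logits (W0 + (alpha / r) B A) x and a cross-entropy loss on one target token; the sizes, the target index, and the learning rate are hypothetical, and a real setup would rely on an autodiff framework rather than manual gradients.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def loss_and_grads(x, target, W0, A, B, scale):
    """Cross-entropy for one target token, with gradients w.r.t. A and B only."""
    u = A @ x                                # (r,) bottleneck activation
    logits = W0 @ x + scale * (B @ u)        # (vocab,)
    p = softmax(logits)
    loss = -np.log(p[target])
    g = p.copy()
    g[target] -= 1.0                         # dL/dlogits for softmax cross-entropy
    grad_B = scale * np.outer(g, u)          # (vocab, r)
    grad_A = scale * np.outer(B.T @ g, x)    # (r, k)
    return loss, grad_A, grad_B

rng = np.random.default_rng(0)
vocab, k, r, alpha = 5, 4, 2, 4.0            # hypothetical toy sizes
scale = alpha / r
W0 = rng.standard_normal((vocab, k))         # frozen: never updated below
W0_before = W0.copy()
A = 0.1 * rng.standard_normal((r, k))
B = np.zeros((vocab, r))
x = rng.standard_normal(k)
target = 3                                   # hypothetical index of the "Yes" token

loss_start, _, _ = loss_and_grads(x, target, W0, A, B, scale)
lr = 0.05
for _ in range(10):
    _, grad_A, grad_B = loss_and_grads(x, target, W0, A, B, scale)
    A -= lr * grad_A                         # only the adapter factors move
    B -= lr * grad_B
loss_end, _, _ = loss_and_grads(x, target, W0, A, B, scale)

print(f"loss: {loss_start:.4f} -> {loss_end:.4f}")
```

Note that with B initialized to zero the first gradient with respect to A is exactly zero, so B moves first and A starts to learn only once B is nonzero.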

Key Properties / Variants

  • No inference latency. After training, the scaled update can be merged into the frozen weights: $W = W_0 + \frac{\alpha}{r} BA$. The merged model is identical in shape to the original, so unlike adapter-layer methods LoRA adds zero extra latency at serving time.
  • Cheap task switching. Each task is just a small $(A, B)$ pair (megabytes, not gigabytes). You keep one frozen backbone and swap adapters, which suits multi-domain / multi-task recommendation.
  • Where it is applied. Typically injected into the attention projection matrices ($W_q$, $W_k$, $W_v$, $W_o$) of each Transformer block; sometimes the feed-forward layers too.
  • Hyperparameters. Rank $r$ trades capacity vs cost; scaling $\alpha$ controls the update magnitude. Both are typically small (e.g. $r = 8$, $\alpha = 16$).
  • In the LLM-based GR taxonomy LoRA appears in two of the three alignment paradigms:
    • Text prompting (paradigm ①): TALLRec uses lightweight LoRA fine-tuning on natural-language preference instructions.
    • Inject collaborative signal (paradigm ②): iLoRA, LLaRA, CoLLM project a learned CF embedding into the LLM’s token-embedding space and fine-tune a LoRA adapter on top of the frozen LLM. iLoRA additionally instance-customizes the adapter.
  • Relation to other PEFT. Belongs to the broader parameter-efficient fine-tuning family (alongside prefix-tuning, prompt-tuning, soft prompts, and adapter layers); LoRA is the most widely used, largely because the adapter can be merged into the weights with zero added inference latency.
  • Contrast with full fine-tuning / SFT. LoRA can implement SFT cheaply, and is compatible with later preference optimization or RL stages on the same frozen backbone.
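The merging and adapter-swapping properties listed above can be illustrated with a short NumPy sketch. It assumes the merge rule W = W0 + (alpha / r) B A; the adapter names ("movies", "books"), the random factors standing in for trained adapters, and the helper `merge` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 6, 4, 2, 4.0                # hypothetical toy sizes
scale = alpha / r

W0 = rng.standard_normal((d, k))             # one shared frozen backbone weight

# Two hypothetical task adapters, each stored as a small (A, B) pair.
adapters = {
    "movies": (rng.standard_normal((r, k)), rng.standard_normal((d, r))),
    "books": (rng.standard_normal((r, k)), rng.standard_normal((d, r))),
}

def merge(W0, A, B, scale):
    """Fold the low-rank update into a single matrix of the original shape."""
    return W0 + scale * (B @ A)

x = rng.standard_normal(k)
A, B = adapters["movies"]
h_unmerged = W0 @ x + scale * (B @ (A @ x))  # adapter kept as a side path
W_merged = merge(W0, A, B, scale)
h_merged = W_merged @ x                      # one matmul, same cost as the base layer

# Task switching: subtract one adapter's update, then merge another.
W_restored = W_merged - scale * (B @ A)
W_books = merge(W_restored, *adapters["books"], scale)

print(np.allclose(h_merged, h_unmerged), W_merged.shape == W0.shape)
```

Keeping the adapters unmerged makes switching tasks a matter of swapping small arrays, while merging removes the extra matrix products at serving time; which one is preferable depends on how often the task changes.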

Connections

Appears In