Embedding Drift as a Sensitive Indicator of Internal Representation Instability in Language Model Fine-Tuning
Authors: Harshit Chaturvedy
Affiliation: Forsyth Central High School
Publication date: 2026-04-30
Journal/archive name: NSRI Student Research Journal
Volume: 1 Issue: 1 Pages/article: Pending
DOI: Pending DOI assignment
Abstract
Fine-tuning a pre-trained language model alters both its performance metrics and its internal representations, yet conventional loss measurements often fail to capture subtle shifts in high-dimensional embedding space. In this study, we introduce embedding drift, a scalar metric defined as the mean cosine distance between the hidden-state vectors of fixed probe sentences before and during training, to quantify representational change. Each sentence is mapped to a 768-dimensional vector via mean pooling over token embeddings, and drift is computed as $\mathrm{drift} = \frac{1}{N} \sum_{i=1}^{N} \left( 1 - \frac{e_i^{(0)} \cdot e_i^{(t)}}{\lVert e_i^{(0)} \rVert \, \lVert e_i^{(t)} \rVert} \right)$, where $N$ is the number of probe sentences, $e_i^{(0)}$ is the initial embedding of sentence $i$, and $e_i^{(t)}$ is its embedding at training step $t$. We performed two controlled experiments with DistilBERT on a subset of IMDb: a baseline run (learning rate 5e-5) and a high-learning-rate run (learning rate 5e-4). Drift increased steadily from ~0.03 to ~0.26 in the baseline, whereas the high learning rate induced a rapid jump from ~0.22 to ~0.76, even though training loss showed minimal change. These results demonstrate that embedding drift provides a quantitative, vector-space measure of representational instability that conventional loss metrics may overlook, offering insight into internal model dynamics during fine-tuning.
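The drift definition can be illustrated with a minimal sketch in plain Python. This is not the study's code: the function names (`mean_pool`, `cosine_distance`, `embedding_drift`) are hypothetical, and the inputs are assumed to be plain lists of floats standing in for hidden-state vectors that would, in practice, be extracted from the model at step 0 and at step t.

```python
import math


def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(tok[d] for tok in token_vectors) / n for d in range(dim)]


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def embedding_drift(initial, current):
    """Mean cosine distance between paired probe embeddings.

    initial: sentence vectors e_i^(0), one per probe sentence
    current: sentence vectors e_i^(t) for the same sentences, same order
    """
    if not initial or len(initial) != len(current):
        raise ValueError("need two non-empty, equally sized embedding lists")
    total = sum(cosine_distance(e0, et) for e0, et in zip(initial, current))
    return total / len(initial)


# Toy 2-dimensional stand-ins for the 768-dimensional vectors in the abstract.
probes_step0 = [mean_pool([[1.0, 0.0], [1.0, 0.0]]), [0.0, 1.0]]
probes_stept = [[1.0, 0.0], [1.0, 0.0]]
print(embedding_drift(probes_step0, probes_stept))
```

In the toy example the first probe sentence is unchanged and the second has rotated to an orthogonal direction, so the two per-sentence distances average out to a mid-range drift. Because cosine distance ignores vector norm, a metric defined this way responds to changes in direction only, not to uniform rescaling of the embeddings.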
Keywords
Applied Science - Engineering, Applied Science - Computer Science
Citation
References
Reference metadata is pending and must be finalized before DOI deposit.