What's Happening in Representation Learning? A Look at REP4NLP 2025
Representation Learning
Representation learning is one of the primitives of intelligence. In a broad sense, it’s the foundation of all abstraction and compression, necessary precursors for solving any complex problem in math and science. More narrowly, learning complex representations by changing weights through gradient descent has been at the heart of the effectiveness of deep neural nets since their inception, and I’ve found it a compelling part of ML and NLP for years. More specifically, although there have been many different iterations of the “learn meaningful representations of language” problem, the one I’ve always been quite fond of is Word2Vec. It was one of the first ML for NLP techniques I learned about, and it always deserves a mention when talking about representation learning, for its elegance and its ability to reveal the underlying “geometry of language” even back in 2013. This first encounter with representation learning for NLP led to an enduring interest in the field and its developments, of which there were plenty in the 2010s. The story of NLP in the 2010s is essentially the story of representation learning for language driving massive improvements across downstream tasks, a story I followed with interest.
Nevertheless, as the NLP community at large has moved more and more towards applied work on agents, prompting and the like, I’ve also paid less attention to developments in representation learning, despite there being many. Working at a question-answering-from-data company, I’ve been drawn more toward the practical applications too. However, in the last few weeks I’ve decided to take a closer look at the current state of representation learning for NLP by surveying the papers accepted to the REP4NLP 2025 workshop held this past May at NAACL in Albuquerque, New Mexico. Here’s what I found interesting.
Before we begin, some caveats. First and foremost, the papers in this workshop will only provide a limited snapshot of the field and by their very nature will tend to be a lagging indicator of what people have been working on and talking about. Second, I don’t claim to be the most up-to-date member of the NLP community, so some references or methods might be lost on me, but I do have a strong interest in the subfield and have followed it for many years. That said, let’s start!
REP4NLP 2025
I’ve talked about representation learning more broadly above, but the call for papers for the workshop asked for work on several key themes: efficient learning and inference as models scale up (with respect to training data, computing time, and energy consumption), investigating representation dynamics during training, evaluating existing representations for generalization and robustness, understanding the relationship between representations and model behaviors, exploring beyond English textual representations (cross-modal, cross-lingual, knowledge-informed approaches), and developing new representations using various methods from language model objectives to neuro-symbolic approaches. The accepted papers reflect this broad scope and can be split into different categories. We’ll take a general look at these broad categories and then focus on some of the papers I found most interesting.
🔬 Interpretability and Understanding Model Representations/Behavior
-
Tracking Universal Features Through Fine-Tuning and Model Merging Suggestion: Analyze (Standout) | 🔗 ACL Anthology 🏷️ Tags: #feature-analysis #model-merging #sparse-autoencoders
-
A Comparative Study of Learning Paradigms in Large Language Models via Intrinsic Dimension Suggestion: Analyze (Standout) | 🔗 ACL Anthology 🏷️ Tags: #intrinsic-dimension #in-context-learning #supervised-fine-tuning
-
Reverse Probing: Evaluating Knowledge Transfer via Finetuned Task Embeddings for Coreference Resolution Suggestion: Open | 🔗 ACL Anthology 🏷️ Tags: #probing #knowledge-transfer #evaluation-methodology
Notes: I like the methodological flip of typical probing. Instead of probing complex task representations on simple tasks, they probe simple task embeddings on complex tasks. Neat finding that semantic similarity tasks (paraphrase detection) transfer best to coreference resolution. More generally, they discuss how to solve some interesting problems, like where to take embeddings from in LLMs and how to combine them.
📝 Text Embeddings
-
Prompt Tuning Can Simply Adapt Large Language Models to Text Encoders Suggestion: Read (If you like embeddings) | 🔗 ACL Anthology 🏷️ Tags: #prompt-tuning #text-encoders
Notes: Interesting comparison between bidirectional and unidirectional attention. Cool that it is about extracting meaningful embeddings from LLMs. Would be interesting to combine this with diffusion LLMs to see how the results would differ there.
-
Large Language Models Are Overparameterized Text Encoders Suggestion: Read (If you like embeddings) | 🔗 ACL Anthology 🏷️ Tags: #model-pruning #text-encoders
Notes: Super cool finding: you can prune 30% of layers with negligible impact, and 80% with a modest drop. Big question: if 30% of parameters do nothing semantically, what ARE they doing? Regularization? Optimization dynamics? Generation-specific computation not needed for encoding? Their method is very simple (3 lines of code) yet effective. It raises some questions about parameter efficiency and what different model components actually contribute.
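To make the "prune layers, keep the encoder" idea concrete, here is a minimal toy sketch. It is not the paper's code: the encoder is a hypothetical stack of random residual layers in NumPy, and I am assuming that the pruned layers are the final ones and that embeddings come from mean pooling. The names `make_toy_layers`, `prune_last_layers`, and `encode` are made up for illustration.

```python
import numpy as np


def make_toy_layers(num_layers: int, dim: int, seed: int = 0) -> list:
    """Build a hypothetical stack of layers, each just a (dim, dim) weight matrix."""
    rng = np.random.default_rng(seed)
    return [rng.normal(scale=0.1, size=(dim, dim)) for _ in range(num_layers)]


def prune_last_layers(layers: list, fraction: float) -> list:
    """Drop the last `fraction` of layers (assumption: pruning removes final layers)."""
    num_removed = round(len(layers) * fraction)
    keep = max(1, len(layers) - num_removed)  # always keep at least one layer
    return layers[:keep]


def encode(token_vectors: np.ndarray, layers: list) -> np.ndarray:
    """Run tokens through the residual stack and mean-pool into one text embedding."""
    hidden = token_vectors
    for weight in layers:
        hidden = hidden + np.tanh(hidden @ weight)  # toy residual block
    return hidden.mean(axis=0)
```

With a real model the pruning step would amount to truncating the list of transformer blocks before extracting embeddings; comparing `encode(x, layers)` against `encode(x, prune_last_layers(layers, 0.3))` on a downstream task is the kind of experiment the notes above describe.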
🏗️ Alternative Architectures & Pre-training Objectives
-
DEPTH: Discourse Education through Pre-Training Hierarchically Suggestion: Skim | 🔗 ACL Anthology 🏷️ Tags: #discourse-learning #pre-training-objective
Notes: Always very cool to see different training objectives. Part of a long line of attempts at inserting sentence-level tasks into LLM pre-training. Smells of the bitter lesson though, and seems a bit too complex to me. Unclear if the discourse-level issues for GPT-style autoregressive models brought up here are real. Incompatibility with flash attention makes it hard to compare efficiency gains.
-
State Space Models are Strong Text Rerankers Suggestion: Skim | 🔗 ACL Anthology 🏷️ Tags: #mamba #text-reranking #information-retrieval
Notes: Main thing here is that there are a lot of experiments, on different SSMs and different LLMs. Very thorough empirical study on the models + tasks combination, gold for the right person. Conclusions are that “(1) Mamba architectures achieve competitive text ranking performance, comparable to transformer-based models of similar size; (2) they are less efficient in training and inference compared to transformers with flash attention”
-
Punctuation Restoration Improves Structure Understanding without Supervision Suggestion: Skim | 🔗 ACL Anthology 🏷️ Tags: #punctuation-restoration #structural-understanding #pre-training-objective
Notes: The core idea is quite cool. The concept of finding increasingly complex objectives to learn better representations is fun. Shows that punctuation restoration improves structure-related tasks (NER, chunking, POS tagging) by ≥2% in 16/18 experiments. Suggests current pretraining objectives (MLM, autoregressive) might miss important structural knowledge.
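As a rough illustration of what a punctuation restoration objective looks like at the data level, the sketch below builds (input, target) pairs by stripping punctuation from raw text. This is my own simplification, not the paper's preprocessing: the helper name `make_restoration_pair` is hypothetical, and the real setup may treat capitalization or specific punctuation marks differently.

```python
import string


def make_restoration_pair(text: str) -> tuple:
    """Return (stripped_input, original_target) for a punctuation restoration task.

    Assumption: every ASCII punctuation character is removed from the input,
    and the model is trained to reproduce the original text.
    """
    table = str.maketrans("", "", string.punctuation)
    stripped = " ".join(text.translate(table).split())  # also normalizes whitespace
    return stripped, text
```

For example, `make_restoration_pair("Hello, world! How are you?")` yields the input `"Hello world How are you"` paired with the original sentence as the target. Note that this naive version also removes apostrophes inside words, which a real pipeline would likely want to handle more carefully.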
⚡ Efficiency Gains
-
Choose Your Words Wisely: Domain-adaptive Masking Makes Language Models Learn Faster Suggestion: Open | 🔗 ACL Anthology 🏷️ Tags: #domain-adaptation #efficient-training #masked-language-modeling #biomedical-nlp
-
Amuro & Char: Analyzing the Relationship between Pre-Training and Fine-Tuning of Large Language Models Suggestion: Open | 🔗 ACL Anthology 🏷️ Tags: #pre-training #fine-tuning #continual-learning #model-analysis
-
Vocabulary-level Memory Efficiency for Language Model Fine-tuning Suggestion: Open | 🔗 ACL Anthology 🏷️ Tags: #memory-efficiency #vocabulary-optimization #fine-tuning #resource-optimization
🧠 Multi-Modal or Task-specific
-
Cross-Modal Learning for Music-to-Music-Video Description Generation Suggestion: Open | 🔗 ACL Anthology 🏷️ Tags: #cross-modal #music-video #multimodal-learning #generation
-
Efficient Document-level Event Relation Extraction Suggestion: Open | 🔗 ACL Anthology 🏷️ Tags: #event-extraction #efficiency #document-level #two-stage-framework
-
Investigating Adapters for Parameter-efficient Low-resource Automatic Speech Recognition Suggestion: Open | 🔗 ACL Anthology 🏷️ Tags: #adapters #parameter-efficiency #speech-recognition #low-resource
Standout Papers
- Tracking Universal Features Through Fine-Tuning and Model Merging -
Niels Nielsen Horn, Desmond Elliott
Summary
Intro: This paper offers an excellent window into how Sparse Autoencoders (SAEs) are being used in interpretability research for NLP. Although we might be past the peak of interest that Anthropic’s detailed reports generated in late 2023, SAEs remain our best tools for peering directly into transformer activations and getting qualitative accounts of what they represent.
Setup: Building on this notion and prior work, Horn and Elliott provide a concise and easy-to-follow analysis of SAE feature persistence after fine-tuning and model merging for a 1-layer Mistral-like transformer model. More specifically, they start from a base model trained on an equal split of general English tokens and Python code. Then they fine-tune it separately on two different datasets, one of Lua code and one of English children's stories. Finally, a third model is created by merging the two fine-tuned models using Spherical Linear Interpolation, a fancy, more geometrically sound form of weight averaging.
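For readers who have not met Spherical Linear Interpolation (SLERP) before, here is a minimal sketch of the standard formula applied to a pair of weight tensors. How exactly the authors apply it (per tensor, per layer, which interpolation factor) is not covered in my notes, so treat the per-tensor application and the fallback to plain linear interpolation as my assumptions.

```python
import numpy as np


def slerp(w_a: np.ndarray, w_b: np.ndarray, t: float, eps: float = 1e-6) -> np.ndarray:
    """Spherically interpolate between two weight tensors of the same shape.

    t = 0 returns w_a, t = 1 returns w_b. The angle is measured between the
    flattened tensors; nearly parallel or zero tensors fall back to linear
    interpolation (an assumption made here for numerical safety).
    """
    a, b = w_a.ravel(), w_b.ravel()
    norm_a, norm_b = np.linalg.norm(a), np.linalg.norm(b)
    if norm_a < eps or norm_b < eps:
        return (1.0 - t) * w_a + t * w_b
    cos_omega = np.clip(a @ b / (norm_a * norm_b), -1.0, 1.0)
    omega = np.arccos(cos_omega)  # angle between the two weight vectors
    if np.sin(omega) < eps:
        return (1.0 - t) * w_a + t * w_b
    coef_a = np.sin((1.0 - t) * omega) / np.sin(omega)
    coef_b = np.sin(t * omega) / np.sin(omega)
    return coef_a * w_a + coef_b * w_b
```

The difference from plain averaging shows up in the norm: halfway between the unit vectors `[1, 0]` and `[0, 1]`, SLERP stays on the unit circle, while a linear average shrinks to length of about 0.71.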
Results: For this 1-layer model the authors find that:
- Features learned on the base model persist after fine-tuning: 63% of the top-100 base features (by activation frequency) remain detectable in both fine-tuned models. These features are mostly “universal” low-level patterns: whitespace, brackets, word-pieces. Higher-level features (e.g., “Python try/except”) often do disappear.
- Merging has a positive but limited effect on recovering lost features: the merge recovers ~11% of base features that had vanished in one branch but stayed alive in the other. Only ~4% of features that were present in both branches are corrupted by merging.
- “Robust” features are also useful: features that survive both fine-tunes and the merge contribute ~45% of the total log-prob improvement on a mixed validation set, despite being <10% of all discovered features.
My thoughts
- Very interesting to see SAEs in action
- Interesting connection to LoRA or Federated Learning or model merging in general
- Did not really buy the “robust features” argument tbh
- Would explore connection to feature universality to see how many of the top-100 features would be in models only trained on the finetuned datasets (maybe 50%?) or how many for models with the same dataset but different training objectives
- Should these results be compared with simpler “Linear Probe” approaches, as suggested by “Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research”?
- A Comparative Study of Learning Paradigms in Large Language Models via Intrinsic Dimension -
Saahith Janapati, Yangfeng Ji
Summary
Intro: Deciding between Supervised Fine-Tuning (SFT) and In-Context Learning (ICL) is a complex decision for many NLP practitioners. The authors compare the two mechanisms through the lens of a metric called Intrinsic Dimension (ID) to give us more insight into these two techniques.
Setup: They analyze Llama-3-8B, Llama-2-13B/7B, and Mistral-7B-v0.3 across 8 English benchmarks (AG News, SST-2, CoLA, CommonsenseQA, MMLU, QQP, QNLI, MNLI). For SFT, they use LoRA adapters on Q/K/V/O projection matrices with 1k training examples over 15 epochs, logging checkpoints to track ID dynamics. For ICL, they test k-shot prompts with k ∈ {0,1,2,5,10,12,14,16,18,20}. They measure layer-wise ID via the TwoNN estimator, which is computed from the distribution of the ratio between each point's distances to its second and first nearest neighbors, over the points in the training dataset for the layer at hand.
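To give a feel for the TwoNN idea, here is a small NumPy sketch. It uses the maximum-likelihood form of the estimator (the ratio of second to first neighbor distance follows a Pareto law whose exponent is the intrinsic dimension), which is a common variant and not necessarily the exact fitting procedure used in the paper. The function name `twonn_id` is my own.

```python
import numpy as np


def twonn_id(points: np.ndarray) -> float:
    """Estimate intrinsic dimension from two-nearest-neighbor distance ratios.

    points: array of shape (n_points, n_features), e.g. one layer's hidden states.
    Uses the maximum-likelihood variant: ID = n / sum(log(r2 / r1)).
    """
    # Pairwise squared Euclidean distances via the Gram matrix.
    sq_norms = np.sum(points**2, axis=1)
    dist_sq = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    dist_sq = np.maximum(dist_sq, 0.0)  # clip tiny negatives from rounding
    np.fill_diagonal(dist_sq, np.inf)   # a point is not its own neighbor

    # Two smallest distances per row: first and second nearest neighbor.
    nearest = np.partition(dist_sq, 1, axis=1)[:, :2]
    r1, r2 = np.sqrt(nearest[:, 0]), np.sqrt(nearest[:, 1])

    valid = r1 > 0                      # drop exact duplicates
    log_mu = np.log(r2[valid] / r1[valid])
    return float(valid.sum() / log_mu.sum())
```

A quick sanity check is to sample points from a 2D plane, embed them linearly in a 10-dimensional space, and confirm that the estimate lands near 2 even though the ambient dimension is 10. That gap between ambient and intrinsic dimension is exactly what the paper tracks across layers.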
Results:
- Fine-tuning dynamics - ID may decrease initially but then increases steadily, somewhat unintuitively.
- ICL vs k - ID rises from 0-shot up to ~5-10 shots, then plateaus or declines; the k where AUC peaks usually matches where accuracy saturates.
- Paradigm comparison - For k ≥ 5, ICL induces consistently higher IDs than SFT across all (model, dataset) pairs, even though SFT reaches better accuracy.
- Cool finding - ID can be used as a heuristic to pick SFT checkpoints before overfitting and choose optimal k for ICL performance
My thoughts
- Intrinsic dimension is a cool concept for understanding model representations, really like their definition for it. “Intrinsic dimension (ID) is a useful metric for assessing the geometric complexity of a model’s representations. It quantifies the number of degrees of freedom in the representation space, serving as a measure of the complexity of the underlying manifolds where the embeddings reside.”
- This is very cool too: Yin et al. (2024) explore the use of Local Intrinsic Dimension (LID) to detect untruthful outputs from LLMs. Their study reveals that truthful outputs typically exhibit lower LIDs compared to hallucinated ones, suggesting that LID can serve as a signal for truthfulness in LLM generations. They also identify a positive relationship between the ID of data representations and validation performance during fine-tuning.
- Unclear why SFT would have continuously increasing ID. Maybe overfitting, but it is unclear.
- Interesting comparison between in-context learning and SFT
- Connection to ARC Challenge: Could higher ID representations help with abstract reasoning tasks?
- Bonus: From Tokens to Thoughts - How LLMs and Humans Trade Compression for Meaning -
Chen Shani, Dan Jurafsky, Yann LeCun, Ravid Shwartz-Ziv
Summary
Intro: LeCun, Jurafsky and co attempt to answer a very interesting question: how do human representations differ from the ones formed by LLMs? Their analysis focuses on compression vs richer abstractions and uses “Rate-Distortion Theory and the Information Bottleneck principle, to quantitatively compare” the different representations. Specifically, they investigate three research questions: “[RQ1]: To what extent do concepts emergent in LLMs align with human-defined conceptual categories? [RQ2]: Do LLMs and humans exhibit similar internal geometric structures within these concepts, especially concerning item typicality? [RQ3]: How do humans and LLMs differ in their strategies for balancing representational compression with the preservation of semantic fidelity when forming concepts?”
Setup: The authors develop an information-theoretic framework, drawing from Rate-Distortion Theory and the Information Bottleneck principle, to quantitatively compare LLM and human conceptual representations. They analyze token embeddings from a diverse suite of around 30 LLMs. For human baselines, they use cognitive psychology datasets such as the categorization studies by Rosch (1973, 1975) and work by McCloskey & Glucksberg (1978) on typicality judgments covering ~3k common words across various conceptual categories. They then compare how humans and LLMs organize concepts, measuring how well each balances compression (grouping similar concepts together) against meaning preservation (keeping important distinctions).
Results:
- LLM-derived clusters significantly align with human-defined conceptual categories, suggesting they capture key aspects of human conceptual organization. Notably, certain encoder models exhibit surprisingly strong alignment, sometimes outperforming much larger models, highlighting that factors beyond sheer scale influence human-like categorical abstraction.
- Limited Capture of Semantic Nuance: While LLMs effectively form broad conceptual categories, their internal representations demonstrate only modest alignment with human-perceived fine-grained semantic distinctions, such as item typicality or psychological distance to category prototypes. This suggests a divergence in how LLMs and humans structure information within concepts.
- LLMs demonstrate markedly superior information-theoretic efficiency in their conceptual representations compared to human conceptual structures. Evaluated via the authors’ L-objective, LLM-derived clusters consistently achieve a more “optimal” balance (by this measure) between representational complexity (compression) and semantic distortion. Human conceptualizations, while richer, appear less statistically compact, suggesting optimization for pressures beyond pure statistical compressibility.
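To make the compression versus distortion trade-off tangible, here is a toy rate-distortion-style score for a hard clustering of embeddings. This is not the paper's L-objective: I am assuming, purely for illustration, that complexity is the entropy of the cluster assignments, that distortion is the mean squared distance to the cluster centroid, and that the two are combined with a weight `beta`. All function names are hypothetical.

```python
import numpy as np


def cluster_entropy(labels: np.ndarray) -> float:
    """Complexity term (assumed): entropy in bits of the cluster assignment."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-(probs * np.log2(probs)).sum())


def mean_distortion(embeddings: np.ndarray, labels: np.ndarray) -> float:
    """Distortion term (assumed): mean squared distance to the cluster centroid."""
    total = 0.0
    for label in np.unique(labels):
        members = embeddings[labels == label]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return float(total / len(embeddings))


def tradeoff_objective(embeddings: np.ndarray, labels: np.ndarray, beta: float = 1.0) -> float:
    """Lower is better: few, tight clusters score well; beta weighs distortion."""
    return cluster_entropy(labels) + beta * mean_distortion(embeddings, labels)
```

Under a score like this, putting every item in its own cluster has zero distortion but maximal complexity, while one giant cluster is maximally compressed but badly distorted. The paper's claim, in these terms, is that LLM-derived clusters sit closer to the optimum of such a trade-off than human categories do.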
My thoughts
- Great question, very interesting formulation
- Cool concept of cognitive heritage as geography of the human mind
- This paper deserves a deeper dive; the L-objective is cool and worth follow-up work.
Final thoughts and Next Steps
-
Research building on previous work: The SAE work takes Anthropic’s sparse autoencoders and asks what happens during fine-tuning and model merging. The intrinsic dimension paper applies differential geometry to compare in-context learning vs. supervised fine-tuning, both widely used adaptation paradigms. The Discourse Education through Pre-Training Hierarchically paper clearly cites and mentions alternative approaches to adding discourse-level info into LLM pretraining. Representation learning has steadily built up from Word2Vec through BERT to more recent LLMs. Cool to see it continuing today.
-
Alternative objectives beyond next-token prediction are fun: The punctuation restoration work and DEPTH paper both explore training objectives that go beyond standard autoregressive or masked language modeling. These papers make me think about diffusion models for text - could denoising objectives at different levels of abstraction (character, word, sentence, document) teach richer representations than just predicting the next token? Worth exploring.
-
Intrinsic Dimension and SAEs are incredibly cool and seem to be relevant for understanding something about the representations of language models at every level.
-
Would like to compare the ID and SAE features of different architectures or different training data. Worth doing a paper comparing diffusion models vs BERT vs GPT-2 representations?
-
Universal representation hypothesis: is it true? Stay tuned for the next UniReps workshop at NeurIPS. Universal representation + L-objective work?