
What’s Happening in Representation Learning? A Look at REP4NLP 2025

2025-06-27T00:00:00+00:00

Representation Learning

Representation Learning is one of the primitives of intelligence. In a broad sense, it’s the foundation of all abstractions and compressions, necessary precursors for solving any complex problems in math and science. More narrowly, learning complex representations by changing weights through gradient descent has been at the heart of the effectiveness of deep neural nets since their inception, and I’ve found it a compelling part of ML and NLP for years. More specifically, although there have been many different iterations of the “learn meaningful representations of language” problem, the one I’ve always been quite fond of is Word2Vec. It was one of the first ML for NLP techniques I learned about, and one that always deserves a mention when talking about representation learning, for its elegance and ability to reveal the underlying “geometry of language” even back in 2013. This first encounter with representation learning for NLP led to an enduring interest in the field and its developments, of which there were plenty in the 2010s. The story of NLP in the 2010s is essentially the story of representation learning for language driving massive improvements across downstream tasks, a story I followed with interest.

Nevertheless, as the NLP community at large has moved more and more towards applied work on agents, prompting and the like, I’ve also paid less attention to developments in representation learning, despite there being many. Working at a question-answering-from-data company, I’ve been drawn more toward the practical applications too. However, in the last few weeks I’ve decided to take a closer look at the current state of representation learning for NLP by surveying the papers accepted to the REP4NLP 2025 workshop held this past May at NAACL in Albuquerque, New Mexico. Here’s what I found interesting.

Before we begin, some caveats. First and foremost, the papers in this workshop will only provide a limited snapshot of the field and by their very nature will tend to be a lagging indicator of what people have been working on and talking about. Second, I don’t claim to be the most up-to-date member of the NLP community, so some references or methods might be lost on me, but I do have a strong interest in the subfield and have followed it for many years. That said, let’s start!

REP4NLP 2025

I’ve talked about representation learning more broadly above, but the call for papers for the workshop asked for work on several key themes: efficient learning and inference as models scale up (with respect to training data, computing time, and energy consumption), investigating representation dynamics during training, evaluating existing representations for generalization and robustness, understanding the relationship between representations and model behaviors, exploring beyond English textual representations (cross-modal, cross-lingual, knowledge-informed approaches), and developing new representations using various methods from language model objectives to neuro-symbolic approaches. The accepted papers reflect this broad scope and can be split into different categories. We’ll take a general look at these broad categories and then focus on some of the papers I found most interesting.

🔬 Interpretability and Understanding Model Representations/Behavior

Tracking Universal Features Through Fine-Tuning and Model Merging Suggestion: Analyze (Standout) | 🔗 ACL Anthology 🏷️ Tags: #feature-analysis #model-merging #sparse-autoencoders
A Comparative Study of Learning Paradigms in Large Language Models via Intrinsic Dimension Suggestion: Analyze (Standout) | 🔗 ACL Anthology 🏷️ Tags: #intrinsic-dimension #in-context-learning #supervised-fine-tuning
Reverse Probing: Evaluating Knowledge Transfer via Finetuned Task Embeddings for Coreference Resolution Suggestion: Open | 🔗 ACL Anthology 🏷️ Tags: #probing #knowledge-transfer #evaluation-methodology

Notes: Like the methodological flip of typical probing. Instead of probing complex task representations on simple tasks, they probe simple task embeddings on complex tasks. Neat finding that semantic similarity tasks (paraphrase detection) transfer best to coreference resolution and in general they talked about how to solve some interesting problems like where to take embeddings from LLMs and how to combine them.

📝 Text Embeddings

Prompt Tuning Can Simply Adapt Large Language Models to Text Encoders Suggestion: Read (If you like embeddings) | 🔗 ACL Anthology 🏷️ Tags: #prompt-tuning #text-encoders

Notes: Interesting comparison between bidirectional and unidirectional attention. Cool that it is about taking meaningful embeddings from LLMs. Would be interesting to combine with diffusion LLMs to see how the results would differ there.
Large Language Models Are Overparameterized Text Encoders Suggestion: Read (If you like embeddings) | 🔗 ACL Anthology 🏷️ Tags: #model-pruning #text-encoders

Notes: Super cool finding: can prune 30% of layers with negligible impact, 80% with modest drop. Big question: if 30% of parameters do nothing semantically, what ARE they doing? Regularization? optimization dynamics? Generation-specific computation not needed for encoding? Their method is very simple (3 lines of code) yet effective. Raises some questions about parameter efficiency and what different model components actually contribute.

🏗️ Alternative Architectures & Pre-training Objectives

DEPTH: Discourse Education through Pre-Training Hierarchically Suggestion: Skim | 🔗 ACL Anthology 🏷️ Tags: #discourse-learning #pre-training-objective

Notes: Always very cool to see different training objectives. Part of a long line of attempts at inserting sentence-level tasks into LLM pre-training. Smells of bitter lesson though, seems a bit too complex to me. Unclear if the discourse-level issues for GPT-style autoregressive models brought up here are real. Incompatibility with flash attention makes it hard to compare efficiency gains.
State Space Models are Strong Text Rerankers Suggestion: Skim | 🔗 ACL Anthology 🏷️ Tags: #mamba #text-reranking #information-retrieval

Notes: Main thing here is that there are a lot of experiments, on different SSMs and different LLMs. Very thorough empirical study on the models + tasks combination, gold for the right person. Conclusions are that “(1) Mamba architectures achieve competitive text ranking performance, comparable to transformer-based models of similar size; (2) they are less efficient in training and inference compared to transformers with flash attention”
Punctuation Restoration Improves Structure Understanding without Supervision Suggestion: Skim | 🔗 ACL Anthology 🏷️ Tags: #punctuation-restoration #structural-understanding #pre-training-objective

Notes: The core idea is quite cool. The concept of finding increasingly complex objectives to learn better representations is fun. Shows that punctuation restoration improves structure-related tasks (NER, chunking, POS tagging) by ≥2% in 16/18 experiments. Suggests current pretraining objectives (MLM, autoregressive) might miss important structural knowledge.

⚡ Efficiency Gains

Choose Your Words Wisely: Domain-adaptive Masking Makes Language Models Learn Faster Suggestion: Open | 🔗 ACL Anthology 🏷️ Tags: #domain-adaptation #efficient-training #masked-language-modeling #biomedical-nlp
Amuro & Char: Analyzing the Relationship between Pre-Training and Fine-Tuning of Large Language Models Suggestion: Open | 🔗 ACL Anthology 🏷️ Tags: #pre-training #fine-tuning #continual-learning #model-analysis
Vocabulary-level Memory Efficiency for Language Model Fine-tuning Suggestion: Open | 🔗 ACL Anthology 🏷️ Tags: #memory-efficiency #vocabulary-optimization #fine-tuning #resource-optimization

Cross-Modal Learning for Music-to-Music-Video Description Generation Suggestion: Open | 🔗 ACL Anthology 🏷️ Tags: #cross-modal #music-video #multimodal-learning #generation
Efficient Document-level Event Relation Extraction Suggestion: Open | 🔗 ACL Anthology 🏷️ Tags: #event-extraction #efficiency #document-level #two-stage-framework
Investigating Adapters for Parameter-efficient Low-resource Automatic Speech Recognition Suggestion: Open | 🔗 ACL Anthology 🏷️ Tags: #adapters #parameter-efficiency #speech-recognition #low-resource

Standout Papers

- Tracking Universal Features Through Fine-Tuning and Model Merging -

Niels Nielsen Horn, Desmond Elliott

Summary

Intro: This paper offers an excellent window into how Sparse Autoencoders (SAEs) are being used in interpretability research for NLP. Although we might be past the peak of interest that Anthropic’s detailed reports generated in late 2023, SAEs remain our best tools for peering directly into transformer activations and getting qualitative accounts of what they represent.

Setup: Building on this notion and prior work, Horn and Elliott provide a concise and easy-to-follow analysis of SAE feature persistence after fine-tuning and model merging for a 1-layer Mistral-like transformer model. More specifically, they start from a base model trained on an equal split of general English tokens and Python code. Then they finetune separately on two different datasets, one of Lua code and one of English children’s stories. Finally, a third model is created by merging the two finetuned models using Spherical Linear Interpolation, a fancy, more geometrically sound form of weight averaging.
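To make the “fancy weight averaging” concrete, here is a minimal sketch of Spherical Linear Interpolation on two flat weight vectors. It is only an illustration: the function name `slerp`, the choice of t = 0.5, and treating the weights as a single flat list are my assumptions, not details taken from the paper.

```python
import math


def slerp(w_a, w_b, t):
    """Spherical linear interpolation between two flat weight vectors.

    Hypothetical helper for illustration; not the paper's merging code.
    """
    dot = sum(a * b for a, b in zip(w_a, w_b))
    norm_a = math.sqrt(sum(a * a for a in w_a))
    norm_b = math.sqrt(sum(b * b for b in w_b))
    # Angle between the two vectors, clamped for numerical safety.
    cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if abs(sin_omega) < 1e-6:
        # (Anti)parallel vectors: fall back to plain linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(w_a, w_b)]
    coef_a = math.sin((1 - t) * omega) / sin_omega
    coef_b = math.sin(t * omega) / sin_omega
    return [coef_a * a + coef_b * b for a, b in zip(w_a, w_b)]


# Two orthogonal unit vectors standing in for two finetuned models.
merged = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
averaged = [0.5, 0.5]  # plain weight averaging, for comparison

merged_norm = math.hypot(*merged)      # stays on the unit circle
averaged_norm = math.hypot(*averaged)  # shrinks towards the origin
```

The comparison at the end is the point of the “geometrically sound” remark: the plain average of two unit vectors is shorter than either of them, while the SLERP result keeps the same norm.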

Results: For this 1-layer model the authors find that:

Features learned on the base model persist after fine-tuning: 63% of the top-100 base features (by activation frequency) remain detectable in both finetuned models. These features are mostly “universal” low-level patterns: whitespace, brackets, word-pieces. Higher-level features (e.g., “Python try/except”) often do disappear.
Merging has a positive but limited effect on recovering lost features: the merge recovers ~11% of base features that had vanished in one branch but stayed alive in the other. Only ~4% of features that were present in both branches are corrupted by merging.
“Robust” features are also useful: features that survive both fine-tunes and the merge contribute ~45% of total log-prob improvement on a mixed validation set, despite being <10% of all discovered features.

My thoughts

Very interesting to see SAEs in action
Interesting connection to LoRA or Federated Learning or model merging in general
Did not really buy the “robust features” argument tbh
Would explore connection to feature universality to see how many of the top-100 features would be in models only trained on the finetuned datasets (maybe 50%?) or how many for models with the same dataset but different training objectives
Should these results be compared with simpler “Linear Probe” approaches, as suggested by Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research?

- A Comparative Study of Learning Paradigms in Large Language Models via Intrinsic Dimension -

Saahith Janapati, Yangfeng Ji

Summary

Intro: Deciding between Supervised Finetuning and In Context Learning is a complex decision for many NLP practitioners. The authors compare the two mechanisms through the lens of a metric called Intrinsic Dimension to give us more insights into these two adaptation techniques.

Setup: They analyze Llama-3-8B, Llama-2-13B/7B, and Mistral-7B-v0.3 across 8 English benchmarks (AG News, SST-2, CoLA, CommonsenseQA, MMLU, QQP, QNLI, MNLI). For SFT, they use LoRA adapters on Q/K/V/O projection matrices with 1k training examples over 15 epochs, logging checkpoints to track ID dynamics. For ICL, they test k-shot prompts with k ∈ {0,1,2,5,10,12,14,16,18,20}. They measure layer-wise ID via the TwoNN estimator, which looks at the distribution of the ratio between each point’s distances to its second and first nearest neighbors, computed over the representations of the training examples at the layer at hand.
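To get a feel for what TwoNN computes, here is a minimal pure-Python sketch. It uses the maximum-likelihood form of the estimator (number of points divided by the sum of log distance ratios), which is a simplification I chose for brevity; the paper may well use a different fitting procedure, and the function name `twonn_id` and the toy data are my own assumptions.

```python
import math
import random


def twonn_id(points):
    """Maximum-likelihood TwoNN estimate of intrinsic dimension.

    Hypothetical helper for illustration; brute-force O(n^2) neighbor search.
    """
    log_ratios = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]
        if r1 > 0:  # skip exact duplicates
            log_ratios.append(math.log(r2 / r1))
    return len(log_ratios) / sum(log_ratios)


# Toy data: a 2-dimensional sheet embedded linearly in 5 dimensions.
rng = random.Random(0)
sheet = []
for _ in range(400):
    u, v = rng.random(), rng.random()
    sheet.append((u, v, u + v, u - v, 0.5 * u))

estimate = twonn_id(sheet)  # should come out close to 2, not 5
```

The ambient space has five coordinates, but the points only have two degrees of freedom, and the estimate reflects the latter. In the paper the “points” are the hidden representations of the training examples at a given layer.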

Results:

Fine-tuning dynamics - ID may decrease initially but then increases steadily, somewhat unintuitively.
ICL vs k - ID rises from 0-shot up to ~5-10 shots, then plateaus or declines; the k where AUC peaks usually matches where accuracy saturates
Paradigm comparison - For k ≥ 5, ICL induces consistently higher IDs than SFT across all (model, dataset) pairs—even though SFT reaches better accuracy
Cool finding - ID can be used as a heuristic to pick SFT checkpoints before overfitting and choose optimal k for ICL performance

My thoughts

Intrinsic dimension is a cool concept for understanding model representations, really like their definition for it. “Intrinsic dimension (ID) is a useful metric for assessing the geometric complexity of a model’s representations. It quantifies the number of degrees of freedom in the representation space, serving as a measure of the complexity of the underlying manifolds where the embeddings reside.”
This is very cool too: Yin et al. (2024) explore the use of Local Intrinsic Dimension (LID) to detect untruthful outputs from LLMs. Their study reveals that truthful outputs typically exhibit lower LIDs compared to hallucinated ones, suggesting that LID can serve as a signal for truthfulness in LLM generations. They also identify a positive relationship between the ID of data representations and validation performance during fine-tuning.
Unclear why SFT would have continuously increasing ID; maybe overfitting, but it is unclear
Interesting comparison between in-context learning and SFT
Connection to ARC Challenge: Could higher ID representations help with abstract reasoning tasks?

- Bonus: From Tokens to Thoughts - How LLMs and Humans Trade Compression for Meaning -

Chen Shani, Dan Jurafsky, Yann LeCun, Ravid Shwartz-Ziv

Summary

Intro: LeCun, Jurafsky and co attempt to answer a very interesting question: how do human representations differ from the ones formed by LLMs? Their analysis focuses on compression vs richer abstractions and uses “Rate-Distortion Theory and the Information Bottleneck principle, to quantitatively compare” the different representations. Specifically, they investigate three research questions: “[RQ1]: To what extent do concepts emergent in LLMs align with human-defined conceptual categories? [RQ2]: Do LLMs and humans exhibit similar internal geometric structures within these concepts, especially concerning item typicality? [RQ3]: How do humans and LLMs differ in their strategies for balancing representational compression with the preservation of semantic fidelity when forming concepts?”

Setup: The authors develop an information-theoretic framework drawing from Rate-Distortion Theory and the Information Bottleneck principle to quantitatively compare LLM and human conceptual representations. They analyze token embeddings from a diverse suite of LLMs totaling around 30 different models. For human baselines, they use cognitive psychology datasets like the categorization studies by Rosch (1973, 1975) and work on typicality judgments covering ~3k common words across various conceptual categories by McCloskey & Glucksberg (1978). They measure how well different systems balance compression (grouping similar concepts together) versus meaning preservation (keeping important distinctions). They compare how humans and LLMs organize concepts, measuring both how efficiently they compress information and how much semantic detail they retain in the process.

Results:

LLM-derived clusters significantly align with human-defined conceptual categories, suggesting they capture key aspects of human conceptual organization. Notably, certain encoder models exhibit surprisingly strong alignment, sometimes outperforming much larger models, highlighting that factors beyond sheer scale influence human-like categorical abstraction.

Limited Capture of Semantic Nuance: While LLMs effectively form broad conceptual categories, their internal representations demonstrate only modest alignment with human-perceived fine-grained semantic distinctions, such as item typicality or psychological distance to category prototypes. This suggests a divergence in how LLMs and humans structure information within concepts.

LLMs demonstrate markedly superior information-theoretic efficiency in their conceptual representations compared to human conceptual structures. Evaluated via the authors’ L-objective, LLM-derived clusters consistently achieve a more “optimal” balance (by this measure) between representational complexity (compression) and semantic distortion. Human conceptualizations, while richer, appear less statistically compact, suggesting optimization for pressures beyond pure statistical compressibility.

My thoughts

Great question, very interesting formulation
Cool concept of cognitive heritage as geography of the human mind
This paper deserves a deeper dive; the L-objective is cool and worth follow-up work.

Final thoughts and Next Steps

Research building on previous work: The SAE work takes Anthropic’s sparse autoencoders and asks what happens during fine-tuning and model merging. The intrinsic dimension paper applies differential geometry to compare in-context learning vs. supervised fine-tuning, both widely used adaptation paradigms. The Discourse Education through Pre-Training Hierarchically paper clearly cites and mentions alternative approaches to adding discourse-level info into LLM pretraining. Representation learning has steadily built up from Word2Vec through BERT to more recent LLMs. Cool to see it continuing today.
Alternative objectives beyond next-token prediction are fun: The punctuation restoration work and DEPTH paper both explore training objectives that go beyond standard autoregressive or masked language modeling. These papers make me think about diffusion models for text - could denoising objectives at different levels of abstraction (character, word, sentence, document) teach richer representations than just predicting the next token? Worth exploring.
Intrinsic Dimension and SAEs are incredibly cool and seem to be relevant for understanding something about the representations of language models at every level.
Would like to compare the ID and SAE features of different architectures or different training data. Worth doing a paper comparing diffusion models vs BERT vs GPT-2 representations?
Universal representation hypothesis: is it true? Stay tuned for the next UniReps workshop at NeurIPS. Universal representation + L-objective work?

Expensive ways to get tired

2025-02-13T00:00:00+00:00

What is Homomorphic Encryption? – OpenMined

2020-08-13T00:00:00+00:00

This post is part of our Privacy-Preserving Data Science, Explained series.

Check out the companion video to this article on YouTube.

Input privacy is one of the most relevant issues in private ML. We explored one solution to this problem in Secure Multi-Party Computation. However, if secret sharing was not an option due to the limited number of participants, what’s the alternative?

Homomorphic encryption (HE) is a kind of encryption that allows computation on encrypted data. In short, HE ensures that performing operations on encrypted data and decrypting the result is equivalent to performing analogous operations without any encryption. So like SMPC, we can use HE to achieve input privacy but with only one party needed to encrypt and decrypt the data.

HE is not only used to protect data owners; model owners have some of the same privacy concerns around their valuable intellectual property (IP) as clients do around their data. Therefore it is crucial, when using a model in an untrustworthy environment, to keep its parameters encrypted.

Homomorphic encryption has numerous applications that range from healthcare to smart electric grids; from education to machine learning as a service (MLaaS). All sectors where input privacy is paramount and making use of the data is usually already complex due to: regulations, the significance of the data, and security concerns. Other notable uses of the technology involve non-intrusive, privacy-preserving security, i.e., systems capable of detecting nefarious activity from an encrypted and private data source. A useful metaphor for these systems is to think of them as the sniffer dogs of digital data sources. They don’t infringe upon anyone’s privacy thanks to the encryption, their accuracy can be empirically verified, and since their parameters are kept private, they are not easily reverse engineered.
Here’s a link to Andrew Trask’s article on privacy-preserving security if you want to dive deeper.

PySyft supports the CKKS leveled homomorphic encryption scheme and the Paillier partially homomorphic encryption scheme, which is limited to addition but is much faster. More details on CKKS and Paillier are available below in the “theory behind the implementation” section. Here we’ll focus on how to use HE in PySyft.

After importing and hooking torch with syft’s additional functionalities we generate the private and public keys. Using the public key we can encrypt syft tensors and with the + operator we can homomorphically sum them. Intuitively, the private key allows us to decrypt the result of our operations which we can easily print.

To use the more complex and powerful CKKS scheme we can follow similar steps. Besides importing syft and torch for CKKS we also use functions from syft.frameworks.tenseal which integrates the TenSEAL package for performing HE operations on tensors.

Since 1978, when the idea of Fully Homomorphic Encryption (FHE) was first formalized, several cryptosystems have been invented to get closer to computing arbitrary functions on ciphertext. However, until Craig Gentry introduced the crucial technique of bootstrapping in ’09, they were all partially or somewhat homomorphic encryption schemes. To understand why that is the case, it’s important to define more precisely what “computing arbitrary functions” means.

At the level of bits and logic gates, successive combinations of AND and XOR can express any boolean function. Since these two gates behave respectively as operations of binary multiplication and addition we can conclude that a scheme under which we can perform homomorphic addition and multiplication on ciphertext should be capable of achieving FHE, with some caveats.

The operations of addition and multiplication suggest the use of a ring as the underlying algebraic structure.
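Before going further into rings, the bit-level claim above can be checked directly. The following is a small sketch, not part of the original post: the function names are hypothetical, and it simply verifies that XOR and AND on bits coincide with addition and multiplication modulo 2, and that NOT and OR can be built from them (given the constant 1).

```python
from itertools import product


def xor_gate(a, b):
    """XOR on bits is addition modulo 2."""
    return (a + b) % 2


def and_gate(a, b):
    """AND on bits is multiplication modulo 2."""
    return (a * b) % 2


def not_gate(a):
    """NOT a = a XOR 1 (needs the constant 1)."""
    return xor_gate(a, 1)


def or_gate(a, b):
    """a OR b = a XOR b XOR (a AND b)."""
    return xor_gate(xor_gate(a, b), and_gate(a, b))


# Exhaustively compare against Python's built-in bitwise operators.
for a, b in product((0, 1), repeat=2):
    assert xor_gate(a, b) == a ^ b
    assert and_gate(a, b) == a & b
    assert or_gate(a, b) == a | b
    assert not_gate(a) == 1 - a
```

Since the other common gates can be assembled from these, a scheme that supports both operations on encrypted bits can in principle evaluate any boolean circuit on them.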
A set of objects, like integers, is considered a ring if we can perform on it an “addition-like” and a “multiplication-like” operation and if it admits neutral elements for both operations, respectively 0 and 1. Examples of rings are, as we have said, the set of all integers but also the set of all possible remainders of an integer division by a specific n, known as “all integers modulo n”. For instance, the set of integers modulo 5, Z_5 = {0, 1, 2, 3, 4}, is a ring because we have both 0 and 1 and the addition-like operation that, after summing two numbers, takes as its result the remainder of the integer division by 5 of that sum. So 3+4 = 7 -> 7%5 = 2. Multiplication works in the same way: 4*4 = 16 -> 16%5 = 1. As we’ll detail later we can also build rings with polynomials.

Besides homomorphic addition and multiplication what are the other characteristics common to most FHE schemes?

They tend to be based on schemes that are capable of “somewhat” homomorphic encryption. These schemes can only perform a limited number of successive multiplication and addition operations on ciphertext before the results become unreliable and impossible to decrypt. This limitation arises directly from the way these systems guarantee security by relying on noise or error to make relatively simple problems computationally intractable.

Recall that a ciphertext in our scheme has the form C(m) = m + 2r + p*q, where p is the secret key, r is a small noise term and q is a random integer. Using multiplication we can operate in a similar way because C(m1)*C(m2) = m1*m2 + 2(r2*m1 + r1*m2 + 2*r1*r2) + p(m1*q2 + 2*r1*q2 + q1*m2 + 2*q1*r2 + p*q1*q2) is a valid encryption of m1*m2. Here the noise grows much faster than in addition.

Before talking about why a growing noise term is such a crucial issue let’s focus on why we need it in the first place. If we were to remove the noise from our scheme it would still be capable of performing homomorphic addition and multiplication.
However, to break the noise-less scheme an attacker would just need to get a hold of two encrypted messages and proceed by simply calculating the greatest common divisor, for which there is the Euclidean algorithm, which runs linearly with the number of digits in the smallest input.

If we add noise, the problem becomes much more difficult. In the literature it is known as the Approximate Greatest Common Divisor (GCD) or Approximate Common Divisor (ACD) problem and with reasonable parameters it is considered hard to solve.

Approximate GCD is not the only problem used to secure FHE; in fact, numerous recent schemes employ the Learning With Errors (LWE) problem, which is also conjectured to be hard to solve.

In its most basic formulation the problem states that, given a random uniform integer matrix A and a very small integer error vector e, if we multiply A by a secret vector s and then add e, recovering s from the resulting vector b = As + e without knowing e is computationally intractable, even when the matrix A is known. Similarly to the scheme we explained before, As + e can be thought of as an encryption of 0 and, from that starting point, a different scheme can be developed, as presented by prof. Daniele Micciancio in this talk, with intuitive addition and multiplication.

Even in schemes that rely on LWE, however, we see the noise rise as successive additions are performed, and even more so with multiplications. To understand why that is a problem we’ll go back to our first scheme with an example.

Let’s say, to keep things simple, that we would like to sum three copies of the same number. Cp(0) = 0 + 2 * 5 + 68 * 29 = 1982 is the number m=0 encrypted with secret key p=29. Then Cp(0) + Cp(0) + Cp(0) = 5946 = 0 + 2(15) + (204)*29. As always, 5946 doesn’t tell us much about the result of our calculation, but with p we can decrypt it as we detailed above.
We start with 5946%29 = 1 and then we check whether the result is even or odd. 1 is odd, so we conclude 3m = 1, but of course we know that is not the case, so what went wrong?

The error term has gotten too big and, instead of being even, it’s now odd. Since 30 is bigger than p=29, the modulo p operation, which usually doesn’t affect the error because it’s too small, has now changed it, corrupting the encryption. This example is specific to our scheme and the noise was a little too high to begin with. The behavior of all somewhat homomorphic encryption (SWHE) schemes becomes unpredictable as the error grows larger.

So is there a way to decrease the error? Yes, bootstrapping is a technique that involves running the decryption procedure homomorphically, without revealing the message and using the encrypted secret key.

To give a sense of how this is possible we should keep in mind that we can produce a series of XOR and AND gates, or additions and multiplications, that perform the decryption operation: in our scheme, the modulo p followed by modulo 2. Furthermore, the scheme we presented here encrypts binary messages. To get an encryption of the secret key we simply need to have an ordered set of encryptions of its bits.

With these two ingredients we can perform homomorphic decryption and eliminate the noise produced by previous operations. Homomorphic decryption however injects some noise of its own, like any other function. As long as we can still reliably perform one operation of addition or of multiplication before needing bootstrapping again, we have reached FHE. This is the recipe for most proposed FHE schemes: an underlying SWHE scheme that supports addition and multiplication, usually secured by adding noise, and a way to reduce the noise when it grows too large, usually by bootstrapping.

The following are some examples of HE schemes: some partial, others fully homomorphic.
I have provided short descriptions and links to resources to understand them better.

In 1999 Pascal Paillier invented a partially homomorphic, asymmetric cryptosystem now bearing his last name. Paillier’s scheme is homomorphic with respect to addition. For more details on the actual algorithm take a look at the article dedicated to it on our blog. In short, it achieves HE with respect to addition by encrypting the message as an exponent of the public key. This way, when multiplying two ciphertexts encrypted with the same key, the result is a valid encryption of the sum.

Original paper: Public-Key Cryptosystems Based on Composite Degree Residuosity Classes
Video resource: Implementation of Homomorphic Encryption: Paillier

CKKS has been developed by researchers at Seoul National University and UC San Diego and it is characterized by using the approximate nature of floating-point arithmetic as part of the leveled homomorphic encryption (LHE) scheme. OpenMined uses CKKS as the primary way to encrypt tensors on which we want to perform both addition and multiplication.

OpenMined demo on CKKS: Homomorphic Encryption in PySyft with SEAL and PyTorch
Original paper: Homomorphic Encryption for Arithmetic of Approximate Numbers
Video resource: Introduction to CKKS (Approximate Homomorphic Encryption)

BFV uses rings over polynomials and has an approachable construction similar to the scheme we described in this post but using LWE. It was developed by Brakerski, Fan and Vercauteren. We have a beginner friendly post on how to implement it in Python.

OM Implementation: Build an Homomorphic Encryption Scheme from Scratch with Python
Great Blogpost: A Homomorphic Encryption Illustrated Primer
Original Paper: Somewhat Practical Fully Homomorphic Encryption

GSW uses LWE applied to linear algebra where the messages are encrypted as eigenvalues of matrices which have a common eigenvector.
GSW was developed by Craig Gentry, Amit Sahai, and Brent Waters.

Original Paper: Homomorphic Encryption from Learning with Errors: Conceptually-Simpler, Asymptotically-Faster, Attribute-Based
Video resource: Fully Homomorphic Encryption

BGV can use modulus switching, an alternative technique for noise management. BGV was developed by Zvika Brakerski, Craig Gentry and Vinod Vaikuntanathan.

Original paper: Fully Homomorphic Encryption without Bootstrapping

SEAL is an open-source library developed by Microsoft that implements the BFV and CKKS schemes in both their symmetric and asymmetric versions. The library is written in C++ but has many wrappers for Python and JavaScript.

HElib supports BGV and CKKS with a focus on ciphertext “packing” techniques that increase the efficiency of the base schemes. To describe the library approach to HE the developers have written that in HElib “The underlying cryptosystem serves as the equivalent of a “hardware platform”, in that it defines a set of operations that can be applied homomorphically, and specifies their cost.” HElib has been developed in C++ by researchers at IBM and it is open-source.

Python-paillier is an open-source implementation in Python of the Paillier scheme.

TFHE implements HE at the binary gate level with a ring-variant of the GSW scheme and applies gate-by-gate bootstrapping. cuFHE is the CUDA-enabled version of TFHE.
The library is open-source and written in C.

Palisade is an open-source library supported by DARPA that implements the BFV, BGV, CKKS, TFHE, FHEW schemes and provides other useful features to support lattice-based cryptography.
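As a recap of the toy integer scheme used in the noise discussion earlier in this post, here is a minimal sketch that replays the worked example. The ciphertext form m + 2r + p*q is inferred from the example Cp(0) = 0 + 2 * 5 + 68 * 29 and from the decryption rule “modulo p followed by modulo 2”; the function names and the second pair of ciphertexts are assumptions added for illustration, and the parameters are far too small to offer any security.

```python
def encrypt(m, r, q, p):
    """Encrypt bit m with noise r, random multiplier q and secret key p.

    Hypothetical helper: r and q are passed explicitly so the example
    is reproducible; a real scheme would sample them at random.
    """
    return m + 2 * r + p * q


def decrypt(c, p):
    """Reduce modulo the secret key p, then modulo 2."""
    return (c % p) % 2


p = 29

# The ciphertext of m = 0 from the post: 0 + 2*5 + 68*29 = 1982.
c0 = encrypt(0, r=5, q=68, p=p)
one_copy = decrypt(c0, p)                # noise 10 < 29: decrypts to 0
two_copies = decrypt(c0 + c0, p)         # noise 20 < 29: still 0
three_copies = decrypt(c0 + c0 + c0, p)  # noise 30 > 29: decrypts to 1, wrong

# With smaller noise, both operations behave homomorphically on bits.
ca = encrypt(1, r=1, q=7, p=p)           # 206
cb = encrypt(1, r=2, q=3, p=p)           # 92
sum_bit = decrypt(ca + cb, p)            # 1 + 1 = 0 modulo 2
product_bit = decrypt(ca * cb, p)        # 1 * 1 = 1
```

The third sum is exactly the failure described in the post: the accumulated noise term 2 * 15 = 30 exceeds p = 29, so the reduction modulo p alters it and the parity no longer matches the message.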

What is Federated Learning? – OpenMined

2020-05-19T00:00:00+00:00

This post is part of our Privacy-Preserving Data Science, Explained series.

Update as of November 18, 2021: The version of PySyft mentioned in this post has been deprecated. Any implementations using this older version of PySyft are unlikely to work. Stay tuned for the release of PySyft 0.6.0, a data-centric library for use in production targeted for release in early December.

In this article of the introductory series on Private ML, we will introduce Federated Learning (FL), explaining what FL is, when to use it, and how to implement it with OpenMined tools. The information in this article will be digestible for a broad audience, but section by section, we will go more into the weeds to understand and use Federated Learning.

For more info about the series, check out the intro article or take a look at the other posts to learn more about the techniques that can enable privacy-preserving ML with OpenMined’s libraries.

Initially proposed in 2015, federated learning is an algorithmic solution that enables the training of ML models by sending copies of a model to the place where data resides and performing training at the edge, thereby eliminating the need to move large amounts of data to a central server for training purposes.

The data remains at its source devices, a.k.a. the clients, which receive a copy of the global model from the central server. This copy of the global model is trained locally with the data of each device. The model weights are updated via local training, and then the local copy is sent back to the central server. Once the server receives the updated models, it aggregates the updates, improving the global model without revealing any of the private data on which it was trained.

One of the first applications of FL was to improve word recommendation in Google’s Android keyboard without uploading the data, i.e. a user’s text, to the cloud.
More recently, Apple has detailed how it employs federated learning to improve Siri’s voice recognition. Intuitively, keeping the data at its source is valuable in any privacy-preserving application, especially in healthcare or with confidential business and government data.

To get started, we will use the classical MNIST data set, which will stand in for our clients’ data. PySyft will provide all the components needed to demo federated learning and test it locally on this data set. To imagine a reasonably close application, suppose that the MNIST characters are part of the digital signatures of our clients, produced when signing documents on their smartphones, and that we would like to use them to train a character recognition model. In this scenario we would like to provide strong privacy assurances to our users by not uploading their signatures to a central server. In PySyft, the clients’ devices, that is, the entities performing model training, are called workers.

At this point in our mock example, we have to send the data to the workers using PySyft’s Federated Data Loader.

The Federated Data Loader, like the standard torch data loader on which it is based, enables lazy loading during training. In the train function, because the data batches are now distributed across Alice and Bob, you need to send the model to the right location for each batch using model.send(…). Then you perform all the operations remotely with the same syntax as if you were doing everything locally in PyTorch. When you’re done, you get back the updated model and the loss, which can be logged, using the .get() method.

At this point, the model has been trained on the data of both Alice and Bob by the respective workers, and their data has not left their devices.

Federated learning alone, however, is not enough to ensure privacy because, using the model updates, an “honest-but-curious” server could reconstruct the samples from which the updates were computed.
This is where Secure Multi-Party Computation, homomorphic encryption, and differential privacy come in to provide stronger guarantees of security to data owners. We will explore all these topics in this series.

As we mentioned above, the larger vision for the technology goes beyond any single application or service. With the help of federated learning, data owners can more easily maintain control of their data, which can be used to train models without leaving the owners’ systems. These guarantees, besides being positive for all users of data-intensive applications, have the potential to make available whole new data sets in sectors like healthcare where, to follow HIPAA or the health-related provisions of GDPR, privacy is the top priority.

To contribute to making this vision a reality, OpenMined is working on PyGrid, a peer-to-peer platform that uses the PySyft framework for Federated Learning and data science. Data owners and data scientists can connect on the platform, where the data owners can feel safe in the knowledge that their data will never leave their node, and data scientists can perform their analysis without infringing on anyone’s privacy rights. Today, this type of interaction could take from weeks to months in sectors working on sensitive data, but with PyGrid it could all be just a few lines of code away. To learn more about PyGrid, here is a deeper dive into the platform and the use cases it enables.
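
The training round described earlier in this post (the server sends out the global model, each client trains on its own data, and the server aggregates the returned models) can be sketched in plain Python. This is only an illustration and does not use PySyft: the 1-D linear model, the client data, the function names `local_train` and `federated_round`, and the choice of a sample-weighted average as the aggregation step are all assumptions made for the example.

```python
# Minimal federated training sketch in plain Python (no PySyft).
# Assumptions: a 1-D linear model y = w * x, two hypothetical clients,
# and a sample-weighted average of client weights as the aggregation.

def local_train(w, data, lr=0.01, epochs=5):
    """Run a few gradient-descent steps on one client's private data."""
    for _ in range(epochs):
        # Gradient of the mean squared error for the model y = w * x.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_round(global_w, clients):
    """Send the global weight to every client, then average what comes back."""
    updates = [(local_train(global_w, data), len(data)) for data in clients.values()]
    total = sum(n for _, n in updates)
    # Only trained weights and sample counts reach the server, never raw data.
    return sum(w * n for w, n in updates) / total

# Each client's data follows y = 3 * x and stays in its own entry.
clients = {
    "alice": [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)],
    "bob": [(4.0, 12.0), (5.0, 15.0)],
}

global_w = 0.0
for _ in range(50):
    global_w = federated_round(global_w, clients)
```

After the loop, `global_w` should be very close to 3, the slope shared by both clients’ data, even though the server only ever handled weights. As noted above, those weights can still leak information, which is why the sketch is not private on its own.
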
OpenMined would like to thank Antonio Lopardo, Emma Bluemke, Théo Ryffel, Nahua Kang, Andrew Trask, Jonathan Lebensold, Ayoub Benaissa, Madhura Joshi, Shaistha Fathima, Nate Solon, Robin Röhm, Sabrina Steinert, Michael Höh, and Ben Szymkow for their contributions to various parts of this series.

What is Secure Multi-Party Computation? – OpenMined

2020-05-19T00:00:00+00:00

This post is part of our Privacy-Preserving Data Science, Explained series.

As we mentioned in one of the previous posts in this series, federated learning is not enough to develop privacy-preserving ML applications. In fact, to keep the model from merely copying what it receives, the data needs to be kept secret while still permitting training and inference. One way to achieve this objective, with both significant advantages and trade-offs, is secret sharing in secure multi-party computation. Today we’ll explore secure multi-party computation (SMPC) and how it can help us achieve input privacy. Similar to the post on FL, we hope that all the information in this article will be digestible for a broad audience, but section by section, we will go more into the weeds to understand and use this technique. For more info about the series, check out the intro article or take a look at the other posts to learn more about the technologies that can enable privacy-preserving ML with OpenMined’s libraries.

Broadly speaking, SMPC techniques are ways for parties to compute a function jointly while keeping their inputs secret. In the case of ML, this function might be a model’s loss function during training, or it could be the model itself in inference.

SMPC tends to have a significant communication overhead but has the advantage that, unless a substantial proportion of the parties are malicious and coordinating, the input data will remain private even against an attacker with unlimited time and resources. Secret sharing in SMPC can protect both models’ parameters and training/inference data.

Machine Learning as a Service is one of the most significant use cases of SMPC, as it would allow companies to offer their models to perform inference on private data sent by their clients, while ensuring the utmost privacy.
For example, in the medical field, a cloud service provider could run a trained classification model on secretly shared patient data and send the secretly shared result (e.g., a prediction of a disease) back to the patient.

Other notable applications of the technology involve non-intrusive, privacy-preserving security, i.e. systems capable of detecting nefarious activity from an encrypted and private data source. A useful metaphor for these systems is to think of them as the sniffer dogs of digital data sources. They don’t infringe upon anyone’s privacy thanks to the protection provided by secret sharing, their accuracy can be empirically verified, and since their parameters are kept private, they shouldn’t be easily reverse engineered. For more info on this use case, check out Andrew Trask’s great blog post that goes more in-depth on similar applications using homomorphic encryption to protect the data; secret sharing in SMPC can be used in much the same way.

PySyft implements secret sharing and fixed precision encoding. We’ll detail both below, but in PySyft they are two very simple tensor methods.

One of the easiest-to-understand implementations of secret sharing in SMPC is additive secret sharing, as explained in the Udacity course on Private AI. Additive secret sharing boils down to the idea that a number, let’s say x=5, can be split into several shares, let’s say two, share_1=2 and share_2=3, managed independently by two participants, let’s call them Alice and Candice. At this point, if we were to apply any number of addition operations on the shares individually and then sum the results, this sum would be the same as applying those same additions on x=5. To facilitate the encoding of negative numbers and increase the security of the protocol, we use the modulo operation with a very large prime number. The addition operations could also be performed with other encrypted numbers by adding up the shares of each of the addends.
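
The additive scheme described above can be sketched in a few lines of plain Python. This is an illustration rather than PySyft’s implementation: the modulus `Q`, the helper names `share`, `reconstruct`, and `add_shared`, and the convention of decoding the upper half of the range as negative numbers are all assumptions made for the example.

```python
import secrets

# A large prime modulus. This particular value (2**61 - 1) is an assumption
# chosen for illustration; it is not taken from PySyft.
Q = 2**61 - 1

def share(secret, n_shares=2):
    """Split an integer into n additive shares modulo Q."""
    shares = [secrets.randbelow(Q) for _ in range(n_shares - 1)]
    # The last share is chosen so that all shares sum to the secret mod Q.
    shares.append((secret - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    """Add the shares back together; values above Q // 2 decode as negatives."""
    total = sum(shares) % Q
    return total - Q if total > Q // 2 else total

def add_shared(xs, ys):
    """Each participant adds the two shares it holds, with no communication."""
    return [(x + y) % Q for x, y in zip(xs, ys)]

x = share(5)      # one share for Alice, one for Candice
y = share(-12)
z = add_shared(x, y)
print(reconstruct(z))  # -7
```

Because every share except the last is drawn uniformly at random, a single share on its own reveals nothing about the secret; only the sum of all shares does.
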
We can also perform multiplication between one encrypted number and a non-encrypted number by viewing the operation as a series of additions.

For multiplication between encrypted numbers, PySyft implements the SPDZ protocol, an extension of additive secret sharing: encrypt, decrypt, and add are the same, but it enables more complex operations than addition, such as multiplication. SPDZ keeps the encrypted numbers private during the computation by using a triple of numbers generated by a crypto provider that is not otherwise involved in the computation. In the code at the beginning of the implementation section, the crypto provider is secure_worker. Taking a closer look at the operations, with alpha = x - a and beta = y - b, where a, b, and a*b form the triple, we can see that since alpha * beta == xy - xb - ay + ab, b * alpha == bx - ab, and a * beta == ay - ab, if we add them all together and then add a*b, we will effectively return a privately shared version of xy. For a more in-depth look at SPDZ and how it implements other functions, check out these blog posts by Morten Dahl.

This scheme appears quite simple, but it already permits all operations, as combinations of additions and multiplications, between two secretly shared numbers. Indeed, like the more complex homomorphic encryption schemes that work with a single party, SPDZ allows computation on ciphertexts, generating an encrypted result which, when decrypted, matches the result of the operations as if they had been performed on the plaintext. In this case, splitting the data into shares is the encryption, adding the shares back together is the decryption, while the shares are the ciphertext on which to operate.

This technique is adequate for integers, covering the encryption of things like the values of the pixels in images or the counts of entries in a database. The parameters of many ML models like neural networks, however, are floats, so how can we use additive secret sharing in ML?
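
Returning briefly to the triple-based multiplication described above, the identity can be checked with a small plain-Python sketch. It covers only the multiplication step and leaves out the integrity checks that a full protocol would add. The modulus `Q` and the helper names `share`, `reconstruct`, `make_triple`, and `multiply` are assumptions made for illustration and are not PySyft’s API.

```python
import secrets

Q = 2**61 - 1  # illustrative prime modulus (an assumption, not PySyft's)

def share(secret, n=2):
    """Split an integer into n additive shares modulo Q."""
    shares = [secrets.randbelow(Q) for _ in range(n - 1)]
    return shares + [(secret - sum(shares)) % Q]

def reconstruct(shares):
    """Sum the shares; values above Q // 2 decode as negative numbers."""
    total = sum(shares) % Q
    return total - Q if total > Q // 2 else total

def make_triple(n=2):
    """Crypto provider: shares of random a and b, and of their product a*b."""
    a, b = secrets.randbelow(Q), secrets.randbelow(Q)
    return share(a, n), share(b, n), share(a * b % Q, n)

def multiply(xs, ys, triple):
    """Multiply two shared numbers using one triple from the crypto provider."""
    a_sh, b_sh, ab_sh = triple
    # The parties open alpha = x - a and beta = y - b. The random masks a and b
    # hide x and y, so publishing these two values does not reveal the inputs.
    alpha = sum(x - a for x, a in zip(xs, a_sh)) % Q
    beta = sum(y - b for y, b in zip(ys, b_sh)) % Q
    # Each party computes its share of b*alpha + a*beta + a*b locally.
    out = [(alpha * b + beta * a + ab) % Q for a, b, ab in zip(a_sh, b_sh, ab_sh)]
    # The public term alpha*beta is added by one party only.
    out[0] = (out[0] + alpha * beta) % Q
    return out

product = multiply(share(6), share(7), make_triple())
print(reconstruct(product))  # 42
```

Each triple should be used for a single multiplication: reusing the same a and b for different inputs would let the opened values alpha and beta leak differences between those inputs.
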
We need to introduce a new ingredient, Fixed Precision Encoding, an intuitive technique that enables computation to be performed on floats encoded as integer values. In base 10, the encoding is as simple as removing the decimal point while keeping as many decimal places as indicated by the precision. For example, with three decimal places of precision, 1.25 is encoded as the integer 1250.

SMPC is also one of the pillars of PyGrid, OpenMined’s peer-to-peer platform that uses the PySyft framework for Federated Learning and data science. The platform uses secure multi-party computation in cases when the overhead in communication is manageable, for example, when using a model only for inference. In those cases, this technique protects both the data and the model’s parameters and enables the kind of Private MLaaS applications that we introduced in this article.

OpenMined would like to thank Antonio Lopardo, Emma Bluemke, Théo Ryffel, Nahua Kang, Andrew Trask, Jonathan Lebensold, Ayoub Benaissa, Madhura Joshi, Shaistha Fathima, Nate Solon, Robin Röhm, Sabrina Steinert, Michael Höh, and Ben Szymkow for their contributions to various parts of this series.


blank

What’s Happening in Representation Learning? A Look at REP4NLP 2025

Representation Learning

REP4NLP 2025

🔬 Interpretability and Understanding Model Representations/Behavior

📝 Text Embeddings

🏗️ Alternative Architectures & Pre-training Objectives

⚡ Efficiency Gains

🧠 Multi-Modal or Task-specific

Standout Papers

- Tracking Universal Features Through Fine-Tuning and Model Merging -

Summary

My thoughts

- A Comparative Study of Learning Paradigms in Large Language Models via Intrinsic Dimension -

Summary

My thoughts

- Bonus: From Tokens to Thoughts - How LLMs and Humans Trade Compression for Meaning -

Summary

My thoughts

Final thoughts and Next Steps

Expensive ways to get tired

What is Homomorphic Encryption? – OpenMined

What is Federated Learning? – OpenMined

What is Secure Multi-Party Computation? – OpenMined
