Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning
Julian Minder*, Clement Dumas*, Caden Juang, Bilal Chughtai, Neel Nanda
ICLR 2025 Workshop on Sparsity in LLMs (SLLM)
I am a PhD student in DLAB at EPFL, where I am supervised by Prof. Robert West and co-advised by Prof. Ryan Cotterell (ETH Zurich).
I am passionate about understanding and improving artificial intelligence systems. My work focuses on making models more transparent and trustworthy through interpretability research, with the goal of improving robustness, reducing bias, and making these systems safer.
I completed my master's degree in computer science at ETH Zurich in 2024, following earlier studies in computer science and neuroinformatics at the University of Zurich. I wrote my master's thesis at EPFL under Robert West and Chris Wendler, investigating the mechanistic effects of fine-tuning language models. The thesis was awarded the ETH Medal for an outstanding master's thesis.
I was a research scholar in MATS 7, working with Clement Dumas under the mentorship of Neel Nanda to study the differences between base and instruct models. I was awarded a 1-year scholarship for this work.
Currently, I'm mostly working on model diffing, a research area focused on comparing language models to understand the mechanistic effects of fine-tuning.
Please feel free to reach out anytime!
If you're interested in doing a project with me, please reach out via email with the subject "[STUDENT PROJECT] ...", telling me a bit about yourself and your interests. EPFL students: please also apply via our lab application system.
Denis Sutter, Julian Minder, Thomas Hofmann, Tiago Pimentel
Preprint
Julian Minder*, Kevin Du*, Niklas Stoehr, Giovanni Monea, Chris Wendler, Robert West, Ryan Cotterell
The Thirteenth International Conference on Learning Representations (ICLR 2025)
Julian Minder, Florian Grötschla, Joël Mathys, Roger Wattenhofer
(Extended Abstract) Second Learning on Graphs Conference (LoG 2023)
Narrow finetunes leave clearly readable traces: activation differences between base and finetuned models on the first few tokens of unrelated text reliably reveal the finetuning domain.
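To make the idea concrete, here is a minimal sketch of such a comparison. The model names, prompt, and layer index are illustrative placeholders, not the exact setup from the post:

```python
# Minimal sketch: compare per-token activations of a base model and a
# fine-tuned variant on text unrelated to the fine-tuning domain.
# Model names and the layer index are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "google/gemma-2-2b"     # placeholder base model
CHAT = "google/gemma-2-2b-it"  # placeholder fine-tuned counterpart
LAYER = 12                     # placeholder layer to inspect

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE)
chat = AutoModelForCausalLM.from_pretrained(CHAT)

# Any text unrelated to the fine-tuning domain works here.
inputs = tok("The weather in Paris today is", return_tensors="pt")

with torch.no_grad():
    h_base = base(**inputs, output_hidden_states=True).hidden_states[LAYER][0]
    h_chat = chat(**inputs, output_hidden_states=True).hidden_states[LAYER][0]

# Per-token activation differences; the first few positions already
# carry a readable signal about the fine-tuning domain.
diff = h_chat - h_base        # (seq_len, d_model)
print(diff[:5].norm(dim=-1))  # L2 norm at the first five positions
```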
This post presents our motivation for working on model diffing, some of our first results using sparse dictionary methods, and our next steps.
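For context, below is a minimal sketch of the kind of sparse dictionary method used in this line of work: a crosscoder trained on paired activations from two models. The architecture and loss loosely follow the standard crosscoder objective; the class, names, and hyperparameters here are illustrative, not the exact training setup from the paper.

```python
# Minimal crosscoder sketch (illustrative, not the paper's exact setup):
# a shared sparse code is learned over paired activations from two models,
# with a separate decoder per model.
import torch
import torch.nn as nn

class Crosscoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(2 * d_model, d_hidden)   # joint encoder
        self.dec_base = nn.Linear(d_hidden, d_model)  # decoder for model A
        self.dec_chat = nn.Linear(d_hidden, d_model)  # decoder for model B

    def forward(self, a_base: torch.Tensor, a_chat: torch.Tensor):
        # Sparse latent code shared across both models.
        f = torch.relu(self.enc(torch.cat([a_base, a_chat], dim=-1)))
        return f, self.dec_base(f), self.dec_chat(f)

def crosscoder_loss(model, a_base, a_chat, l1_coef=3e-4):
    f, r_base, r_chat = model(a_base, a_chat)
    recon = (r_base - a_base).pow(2).sum(-1) + (r_chat - a_chat).pow(2).sum(-1)
    # Sparsity penalty weighted by the summed decoder column norms, so each
    # latent pays for the reconstruction norm it uses in both models.
    norms = model.dec_base.weight.norm(dim=0) + model.dec_chat.weight.norm(dim=0)
    sparsity = (f * norms).sum(-1)
    return (recon + l1_coef * sparsity).mean()
```

Because each latent has its own decoder direction in each model, latents whose decoder norm concentrates in only one model can be flagged as model-specific, which is what makes crosscoders useful for diffing a base model against its chat-tuned counterpart.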