Georgia Tech Researchers to Present Breakthrough AI Interpretability Methods
A team of researchers from the AI Safety Initiative (AISI) at Georgia Tech is set to present groundbreaking work on understanding and controlling advanced AI systems at two prestigious conferences in 2025: the International Conference on Learning Representations (ICLR) and the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Their research focuses on novel techniques to make large language models (LLMs) and diffusion models more interpretable and controllable, advances that are crucial as AI systems become increasingly powerful and widely deployed.
New Methods for Steering AI Behavior
Yixiong Hao leads the team's work on contrastive activation engineering (CAE), which offers a new way to guide LLM outputs through targeted modifications to a model's internal representations. Unlike traditional methods that require extensive computational resources, CAE can be applied during inference with minimal overhead.
"We've made significant progress in understanding the capabilities and limitations of CAE techniques," Hao explained. "Our research reveals that while CAE can be effective for in-distribution contexts, it has clear boundaries that practitioners need to be aware of."
The team also reported practical guidance for implementing CAE, including how many samples are needed to build effective steering vectors and how those vectors behave under adversarial inputs. In addition, they found that larger models are more resistant to the performance degradation that steering can induce.
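For readers curious about the mechanics, the sketch below illustrates the general idea behind contrastive activation steering: a steering vector is built from the difference in mean activations between two contrasting prompt sets, then added to a transformer block's output at inference time. The GPT-2 model, layer index, and scaling factor here are illustrative assumptions, not the team's actual experimental setup.

```python
# Minimal sketch of contrastive activation steering (illustrative, not the
# paper's implementation). GPT-2, the layer index, and the scale are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")
model.eval()

LAYER = 6  # which transformer block to steer (assumption)

def mean_activation(prompts):
    """Average last-token activation at the output of block LAYER."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[LAYER + 1] is the output of block LAYER
        acts.append(out.hidden_states[LAYER + 1][0, -1])
    return torch.stack(acts).mean(dim=0)

# Contrast two small prompt sets that differ only in the target behavior.
pos = ["I feel wonderful today.", "What a delightful morning."]
neg = ["I feel terrible today.", "What a miserable morning."]
steering_vector = mean_activation(pos) - mean_activation(neg)

def steer_hook(module, inputs, output):
    # Add the steering vector to the block's hidden states during inference.
    hidden = output[0] + 4.0 * steering_vector  # 4.0 is an illustrative scale
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steer_hook)
ids = tok("My day has been", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
handle.remove()
```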
Decoding How AI Models Learn From Context
In parallel research, Stepan Shabalin collaborated with Google DeepMind researchers to adapt sparse autoencoder circuits to work with the larger Gemma-1 2B model, providing key insights into how AI systems learn from context.
"We've demonstrated that task vectors in large language models can be approximated by a sparse sum of autoencoder latents," said Shabalin. "This gives us a deeper understanding of how models recognize and execute tasks based on context."
Extending Techniques to Image Generation Models
A third paper, co-authored by Shabalin, Hao, and Ayush Panda, applies similar interpretability techniques to text-to-image diffusion models. Their research uses Sparse Autoencoders (SAEs) and Inference-Time Decomposition of Activations (ITDA) with the state-of-the-art Flux 1 diffusion model.
"By developing an automated interpretation pipeline for vision models, we've been able to extract semantically meaningful features," noted Panda. Their results show these methods outperform standard approaches on interpretability metrics, enabling new possibilities for controlled image generation.
Importance for AI Safety
Parv Mahajan, Collaborative Initiative Lead at AISI, emphasized the significance of the research: "These papers represent important advances in our ability to understand and control the behavior of increasingly complex AI systems. As these models become more powerful and widely deployed, interpretability research like this becomes essential for ensuring their safe and beneficial use."
The team will present their work at dedicated workshops during ICLR and CVPR, creating opportunities for collaboration with other researchers. Their work aligns with AISI's mission to make frontier AI systems more transparent, controllable, and aligned with human values.