Georgia Tech Researchers to Present Breakthrough AI Interpretability Methods
A team of researchers from the AI Safety Initiative (AISI) at Georgia Tech is set to present groundbreaking work on understanding and controlling advanced AI systems at two prestigious conferences in 2025: the International Conference on Learning Representations (ICLR) and the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Their research focuses on novel techniques to make large language models (LLMs) and diffusion models more interpretable and controllable, advances that are crucial as AI systems become increasingly powerful and widely deployed.
New Methods for Steering AI Behavior
Yixiong Hao leads the team's work on contrastive activation engineering (CAE), which offers a new way to guide LLM outputs through targeted modifications to a model's internal representations. Unlike traditional methods that require extensive computational resources, CAE can be applied during inference with minimal overhead.
"We've made significant progress in understanding the capabilities and limitations of CAE techniques," Hao explained. "Our research reveals that while CAE can be effective for in-distribution contexts, it has clear boundaries that practitioners need to be aware of."
The team also reported practical guidance for implementing CAE, including how many samples are needed to build effective steering vectors and how those vectors behave under adversarial inputs. In addition, they found that larger models are more resistant to the performance degradation that steering can induce.
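For readers curious about the mechanics, the sketch below illustrates the general idea behind contrastive activation steering: a steering vector is built from the difference in mean activations between two contrasting prompt sets, then added to a transformer block's output at inference time. The GPT-2 model, layer index, and scaling factor here are illustrative assumptions, not the team's actual experimental setup.

```python
# Minimal sketch of contrastive activation steering (illustrative, not the
# paper's implementation). GPT-2, the layer index, and the scale are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")
model.eval()

LAYER = 6  # which transformer block to steer (assumption)

def mean_activation(prompts):
    """Average last-token activation at the output of block LAYER."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[LAYER + 1] is the output of block LAYER
        acts.append(out.hidden_states[LAYER + 1][0, -1])
    return torch.stack(acts).mean(dim=0)

# Contrast two small prompt sets that differ only in the target behavior.
pos = ["I feel wonderful today.", "What a delightful morning."]
neg = ["I feel terrible today.", "What a miserable morning."]
steering_vector = mean_activation(pos) - mean_activation(neg)

def steer_hook(module, inputs, output):
    # Add the steering vector to the block's hidden states during inference.
    hidden = output[0] + 4.0 * steering_vector  # 4.0 is an illustrative scale
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steer_hook)
ids = tok("My day has been", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
handle.remove()
```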
Decoding How AI Models Learn From Context
In parallel research, Stepan Shabalin collaborated with Google DeepMind researchers to adapt sparse autoencoder circuits to work with the larger Gemma-1 2B model, providing key insights into how AI systems learn from context.
"We've demonstrated that task vectors in large language models can be approximated by a sparse sum of autoencoder latents," said Shabalin. "This gives us a deeper understanding of how models recognize and execute tasks based on context."
Extending Techniques to Image Generation Models
A third paper, co-authored by Shabalin, Hao, and Ayush Panda, applies similar interpretability techniques to text-to-image diffusion models. Their research uses Sparse Autoencoders (SAEs) and Inference-Time Decomposition of Activations (ITDA) with the state-of-the-art Flux 1 diffusion model.
"By developing an automated interpretation pipeline for vision models, we've been able to extract semantically meaningful features," noted Panda. Their results show these methods outperform standard approaches on interpretability metrics, enabling new possibilities for controlled image generation.
Importance for AI Safety
Parv Mahajan, Collaborative Initiative Lead at AISI, emphasized the significance of the research: "These papers represent important advances in our ability to understand and control the behavior of increasingly complex AI systems. As these models become more powerful and widely deployed, interpretability research like this becomes essential for ensuring their safe and beneficial use."
The team will present their work at dedicated workshops during ICLR and CVPR, creating opportunities for collaboration with other researchers. Their work aligns with AISI's mission to make frontier AI systems more transparent, controllable, and aligned with human values.