PhD Defense by Ali Hassani
Title: Neighborhood Attention: Fast and Flexible Sparse Attention
Ali Hassani
Ph.D. Student in Computer Science
School of Interactive Computing
Georgia Institute of Technology
Date: Wednesday, January 7th, 2026
Time: 13:00-15:00 EST
Location: Coda C1115 Druid Hills
Remote option (Zoom):
https://gatech.zoom.us/j/92667338016
Meeting ID: 926 6733 8016
Committee:
Dr. Humphrey Shi (Advisor) - School of Interactive Computing, Georgia Institute of Technology
Dr. Wen-mei Hwu - Electrical & Computer Engineering, University of Illinois at Urbana-Champaign
Dr. Kartik Goyal - School of Interactive Computing, Georgia Institute of Technology
Dr. Judy Hoffman - School of Interactive Computing, Georgia Institute of Technology
Dr. Zsolt Kira - School of Interactive Computing, Georgia Institute of Technology
Abstract:
Attention is at the heart of most foundational AI models, across tasks and modalities.
In many of these models it accounts for a significant share of the computation, since its cost
grows quadratically with sequence length, which is often cited as one of its greatest limitations.
As a result, many sparse attention methods have been proposed to alleviate this issue, one of
the most common being masking or otherwise reducing the attention span.
In this work, we revisit sliding window approaches, which were commonly believed to
be inherently inefficient, and propose a new framework called Neighborhood Attention
(NA). Through it, we address design flaws in earlier sliding window attention works, attempt
to implement the approach efficiently on modern hardware accelerators, specifically GPUs,
and conduct experiments that highlight the strengths and weaknesses of these approaches.
At the same time, we bridge the parameterization and properties of convolution and attention
by showing that NA exhibits inductive biases and receptive fields similar to those of
convolutions, while remaining capable of capturing both short- and long-range dependencies,
as attention does.
We then show the necessity of strong infrastructure and the challenges that arise from it,
especially in the context of modern implementations such as Flash Attention, and develop
even more efficient, performance-optimized implementations of NA, targeting the most recent
and popular AI hardware accelerators, the NVIDIA Hopper and Blackwell GPUs.
We build models based on the NA family, highlighting its superior quality and efficiency
compared to existing approaches, and also plug NA into existing foundational models,
showing that it can accelerate them by up to 1.6× end-to-end without further training,
and by up to 2.6× end-to-end with training. We further demonstrate that our methodology
can create sparse attention patterns that realize the theoretical limit of their speedups.
This work is open-sourced through the NATTEN project at natten.org.
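
For readers unfamiliar with the core idea, below is a minimal, self-contained PyTorch sketch of one-dimensional neighborhood attention as described in the abstract: each query attends only to a fixed-size window of its nearest neighbors, and the window shifts inward at sequence boundaries so every query keeps the same number of keys. The function name, window size, and masking-based formulation are illustrative choices for this sketch only; the NATTEN library implements the approach with fused, hardware-optimized kernels rather than by masking a full attention matrix as done here.

import torch
import torch.nn.functional as F

def neighborhood_attention_1d(q, k, v, window=7):
    # q, k, v: (seq_len, dim) tensors for a single head.
    # Each query attends to exactly `window` keys centered on its own
    # position, with the window clamped to stay inside the sequence.
    seq_len, dim = q.shape
    assert window <= seq_len, "window must not exceed sequence length"
    scores = (q @ k.T) / dim ** 0.5  # full (seq_len, seq_len) scores; still O(n^2) in this naive sketch
    idx = torch.arange(seq_len)
    start = (idx - window // 2).clamp(0, seq_len - window)
    allowed = (idx[None, :] >= start[:, None]) & (idx[None, :] < (start + window)[:, None])
    scores = scores.masked_fill(~allowed, float("-inf"))
    return F.softmax(scores, dim=-1) @ v  # (seq_len, dim)

# Example: 16 tokens of dimension 8, each attending to its 7 nearest neighbors.
q = k = v = torch.randn(16, 8)
out = neighborhood_attention_1d(q, k, v, window=7)
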
Thesis PDF: https://alihassanijr.com/files/Hassani-Dissertation-2025-10-11.pdf