event

UPS Delivers Optimal Phase Diagram for High Dimensional Variable Selection

Primary tabs

TITLE:   UPS Delivers Optimal Phase Diagram for High Dimensional Variable Selection

SPEAKER:  Jiashun Jin

ABSTRACT:

Consider a linear  regression model  \begin{equation*}
Y =  X \beta + z, \qquad z \sim N(0, I_n),   \qquad X = X_{n, p},
\end{equation*} where both $p$ and $n$ are large but $p >  n$.   The vector $\beta$ is unknown but is  sparse  in the sense that only a small proportion of   its coordinates is  nonzero, and we are interested in  identifying these nonzero ones.  We model the coordinates of $\beta$ as  samples from a two-component mixture $(1 - \eps) \nu_0 + \eps  {\pi}$,   and the rows of $X$ as  samples from $N(0, \frac{1}{n}\Omega)$, where $\nu_0$ is the point mass at $0$,  $\pi$ is a  distribution, and $\Omega$ is a $p$ by $p$ correlation matrix which is unknown but is presumably sparse.

We propose a two-stage variable selection procedure which we call the {\it UPS}.   This  is a Screen and Clean procedure,  in which   we screen with the  Univariate thresholding, and clean with the Penalized MLE. In many situations,  the UPS possesses two important properties: Sure Screening and Separable After Screening (SAS). These properties enable us to reduce  the original regression problem to many small-size regression problems that can be fitted separately.  As a result, the UPS is effective both in theory and in computation.

We measure the performance of  variable selection procedure by the Hamming distance,  and use an asymptotic framework where $p \goto \infty$ and $(\eps, \pi, n, \Omega)$ depend on $p$. We find that  in many situations, the UPS achieves the optimal rate of convergence.
We also find that in the $(\eps_p, \pi_p)$ space, there is a  three-phase diagram shared    by many choices of $\Omega$.  In the first phase, it is possible to   recover  all   signals. In the second phase, exact recovery is impossible, but it is possible to recover most of the signals.
In the third phase, successful variable selection is impossible. The UPS partitions the phase space  in the same way that  the optimal procedures do, and recovers most of the signals
as long as successful variable selection is possible.

The lasso and the subset selection (also known as the $L^1$- and $L^0$-penalization methods, respectively) are well-known approaches to variable selection. However,
somewhat surprisingly, there are regions in the phase space where  neither the lasso nor the subset selection is rate optimal, even  for very simple $\Omega$. The lasso is non-optimal because it is too loose in  filtering out  fake signals (i.e. noise that is highly correlated with a signal), and the subset selection is non-optimal  because it tends to kill one or more signals   in correlated pairs, triplets, etc..


Status

  • Workflow Status:Published
  • Created By:Anita Race
  • Created:03/24/2011
  • Modified By:Fletcher Moore
  • Modified:10/07/2016

Categories

  • No categories were selected.

Keywords

  • No keywords were submitted.