Non-Gaussian Hyperplane Tessellations and Robust One-Bit Compressed Sensing

We show that a tessellation generated by a small number of random affine hyperplanes can be used to approximate Euclidean distances between any two points in an arbitrary bounded set $T$, where the random hyperplanes are generated by subgaussian or heavy-tailed normal vectors and uniformly distributed shifts. We derive quantitative bounds on the number of hyperplanes needed for constructing such tessellations in terms of natural metric complexity measures of $T$ and the desired approximation error. Our work extends significantly prior results in this direction, which were restricted to Gaussian hyperplane tessellations of subsets of the Euclidean unit sphere. As an application, we obtain new reconstruction results in memoryless one-bit compressed sensing with non-Gaussian measurement matrices. We show that by quantizing at uniformly distributed thresholds, it is possible to accurately reconstruct low-complexity signals from a small number of one-bit quantized measurements, even if the measurement vectors are drawn from a heavy-tailed distribution. Our reconstruction results are uniform in nature and robust in the presence of pre-quantization noise on the analog measurements as well as adversarial bit corruptions in the quantization process. Moreover we show that if the measurement matrix is subgaussian then accurate recovery can be achieved via a convex program.


Introduction
In this article we study a fundamental geometric question: can distances between points in a given set $T \subset \mathbb{R}^n$ be accurately encoded using a small number of random hyperplanes? To formulate this question more precisely, let $H_{X_i,\tau_i} = \{x \in \mathbb{R}^n : \langle X_i, x\rangle + \tau_i = 0\}$, $i = 1, \ldots, m$, be a set of affine hyperplanes with normal vectors $X_i$ and shift parameters $\tau_i$. These hyperplanes tessellate the set $T$ into (at most) $2^m$ cells and, for any $x \in T$, the bit string $(\operatorname{sign}(\langle X_i, x\rangle + \tau_i))_{i=1}^m \in \{-1,1\}^m$ encodes the cell in which $x$ is located (see Figures 1 and 2). Moreover, for any two points $x, y \in T$, the normalized Hamming distance between their bit strings
$$\frac{1}{m}\,\big|\{i : \operatorname{sign}(\langle X_i, x\rangle + \tau_i) \neq \operatorname{sign}(\langle X_i, y\rangle + \tau_i)\}\big| \tag{1.1}$$
counts the fraction of hyperplanes separating $x$ and $y$. In what follows we are interested in quantifying the number of random hyperplanes that suffice to ensure that (1.1) approximates the distance between any two points in $T$ that are not 'too close'.
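The encoding just described can be sketched numerically. The snippet below is a minimal illustration, not part of the original text: the function names, the Gaussian choice of normal vectors, and the convention that a point lying exactly on a hyperplane receives the bit $+1$ are all assumptions made for the example.

```python
import numpy as np

def cell_code(X, tau, x):
    """Bit string (sign(<X_i, x> + tau_i))_{i=1}^m in {-1, 1}^m.

    Assumed convention: a point exactly on a hyperplane gets the bit +1.
    """
    return np.where(X @ x + tau >= 0, 1, -1)

def separating_fraction(X, tau, x, y):
    """Normalized Hamming distance (1.1): fraction of hyperplanes separating x and y."""
    return float(np.mean(cell_code(X, tau, x) != cell_code(X, tau, y)))

rng = np.random.default_rng(0)
n, m, lam = 20, 2000, 2.0                  # illustrative sizes, not from the text
X = rng.standard_normal((m, n))            # rows are the normal vectors X_i
tau = rng.uniform(-lam, lam, size=m)       # shifts tau_i, uniform in [-lam, lam]

x = rng.standard_normal(n)
x /= 2 * np.linalg.norm(x)                 # ||x||_2 = 1/2
y = -x                                     # ||x - y||_2 = 1

frac_far = separating_fraction(X, tau, x, y)
frac_same = separating_fraction(X, tau, x, x)
print(frac_far, frac_same)
```

A point is never separated from itself, so `frac_same` is zero, while two points at distance one are separated by a positive fraction of the hyperplanes.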
A beautiful result due to Plan and Vershynin [22] essentially solves this question for subsets of the Euclidean unit sphere with respect to the geodesic distance, using homogeneous Gaussian hyperplanes (i.e., $\tau_i = 0$ for all $i$). They showed that if $T \subset S^{n-1}$ and the normal vectors $X_1, \ldots, X_m$ are independent standard Gaussian vectors, then with probability at least $1 - 2e^{-cm\rho^2}$, for all $x, y \in T$,
$$\Big|\frac{1}{m}\,\big|\{i : \operatorname{sign}(\langle X_i, x\rangle) \neq \operatorname{sign}(\langle X_i, y\rangle)\}\big| - d_{S^{n-1}}(x,y)\Big| \le \rho, \tag{1.2}$$
provided that $m \gtrsim \rho^{-6}\ell_*^2(T)$; here $d_{S^{n-1}}$ denotes the normalized geodesic distance on the sphere and
$$\ell_*(T) = \mathbb{E}\sup_{x \in T} |\langle G, x\rangle|,$$
where $G$ is the standard Gaussian random vector in $\mathbb{R}^n$. Thus, $\ell_*(T)$ is the Gaussian mean width of $T$, a natural geometric parameter that is of central importance in geometry (e.g. Dvoretzky type theorems, see for instance [2]) and in statistics, where it is widely used to capture the difficulty of prediction problems. It follows from (1.2) that if $x$ and $y$ are 'far enough apart', then the fraction of homogeneous Gaussian hyperplanes that separate them concentrates sharply around their geodesic distance.
As far as random homogeneous Gaussian tessellations of $T \subset S^{n-1}$ are concerned, it was conjectured in [22] that $m \gtrsim \rho^{-2}\ell_*^2(T)$ is necessary and sufficient for (1.2) to hold. The best known sufficient condition for an arbitrary $T \subset S^{n-1}$ is $m \gtrsim \rho^{-4}\ell_*^2(T)$, established in [19], while for certain 'simple' subsets of the Euclidean sphere (e.g., if $T$ is a subspace) $m \gtrsim \rho^{-2}\ell_*^2(T)$ is known to be sufficient [19,22].
Figure: Illustration of the hyperplane cut generated by the vector $X_1$ (and shift parameter $0$). The homogeneous hyperplane $H_{X_1}$ divides $\mathbb{R}^n$ into two parts, a "+" and a "−" side. The red and green points are assigned the bit $1$, the orange point is assigned $-1$.
It is natural to ask whether approximating distances via random tessellations is possible in more general situations. The obvious cases that come to mind are other distributions for generating the normal vectors (i.e., not Gaussian), and sets $T$ that need not be subsets of $S^{n-1}$. As it happens, these extensions are not only natural but also of great importance in signal processing, specifically when studying signal reconstruction problems from quantized measurements. We describe the connections between the extended version of the random tessellation problem and signal recovery in detail in Section 1.1.
Unfortunately, it is clear that the two extensions one is interested in are not possible when considering tessellations generated by homogeneous hyperplanes. First of all, it is impossible to separate points lying on a straight line through the origin using a homogeneous hyperplane. And second, it is easy to find very natural distributions for which (1.2) is false. As an extreme case, if the $X_i$ are i.i.d. symmetric Bernoulli random vectors (i.e., the $X_i$'s are selected independently from the uniform distribution on $\{-1,1\}^n$), there are vectors in $S^{n-1}$ that are far apart but still cannot be separated using $H_{X_i}$, even if one uses all possible hyperplanes generated by points in $\{-1,1\}^n$.
A possible solution to both problems stems from a phenomenon that appears in the engineering literature: there is extensive experimental evidence that signal recovery from quantized measurements improves substantially if one adds appropriate 'noise' to the measurements before quantizing. The operation of adding noise before quantization, which was first proposed in [23], is called dithering (see also the survey [12]). In the context of random tessellations, the geometric interpretation of dithering is adding random parallel shifts to the hyperplanes. As we show in what follows, the addition of such random shifts allows one to address the two problems: random tessellations of arbitrary sets $T$ that are generated by rather general distributions can be used to approximate distances in $T$. Moreover, as an added value, our results explain why dithering is such an effective method in signal recovery problems (see Section 1.1 for more details).

Figure: […] and $(e_1 + \lambda e_2)/\sqrt{1+\lambda^2}$ for $-1 < \lambda < 0$ are marked in red) which are far apart, but cannot be separated by a Bernoulli hyperplane. This problem persists in high dimensions. In addition, any two points lying on a straight line through the origin (examples are marked in green) cannot be separated by a homogeneous hyperplane (the latter problem is not specific to the Bernoulli case). Both problems can be solved by using parallel shifts of the hyperplanes.
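The two obstructions, and the way a parallel shift removes them, can be checked exhaustively in a small dimension. The sketch below is an illustration under assumptions: the dimension $n = 4$, the value $-0.9$ for the parameter in $(-1, 0)$, the test point $0.2\,e_1$ and the single shift $-0.5$ are all hypothetical choices, and a point lying on a hyperplane is assigned $+1$.

```python
import itertools
import numpy as np

def sign_pm(v):
    # Map to {-1, +1}; ties at 0 go to +1 (an assumed convention).
    return np.where(v >= 0, 1, -1)

n = 4
t = -0.9                                   # plays the role of lambda in (-1, 0)
e1, e2 = np.eye(n)[0], np.eye(n)[1]
x = e1
y = (e1 + t * e2) / np.sqrt(1 + t**2)      # unit vector, far from e1
z = 0.2 * e1                               # on the same line through the origin as x

# All 2^n Bernoulli normal vectors in {-1, 1}^n.
normals = np.array(list(itertools.product([-1, 1], repeat=n)))

def separated(u, v, shift=0.0):
    """True if some hyperplane {w : <X, w> + shift = 0}, X in `normals`, separates u and v."""
    return bool(np.any(sign_pm(normals @ u + shift) != sign_pm(normals @ v + shift)))

homogeneous_xy = separated(x, y)           # no shift: Bernoulli obstruction
homogeneous_xz = separated(x, z)           # no shift: points on a line through 0
shifted_xy = separated(x, y, shift=-0.5)   # one fixed shift, for illustration
shifted_xz = separated(x, z, shift=-0.5)
dist_xy = float(np.linalg.norm(x - y))
print(dist_xy, homogeneous_xy, homogeneous_xz, shifted_xy, shifted_xz)
```

Without a shift, the sign of $\langle X, y\rangle$ is always the sign of the first coordinate of $X$, exactly as for $x$, so no Bernoulli hyperplane separates the two points; a single shifted hyperplane already does.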
To formulate our results, consider i.i.d. shifts $\tau_i$ that are uniformly distributed in $[-\lambda, \lambda]$ for a well chosen $\lambda$, let $T \subset RB_2^n$, the Euclidean ball of radius $R$, set $X$ to be a random vector in $\mathbb{R}^n$ and let $X_1, \ldots, X_m$ be independent copies of $X$ that are also independent of $(\tau_i)_{i=1}^m$. Although the method we introduce can be used in other situations (see in particular Remark 1.8), our focus is on two scenarios. The first is an $L$-subgaussian scenario, in which $X$ is isotropic$^1$, symmetric, and $L$-subgaussian, that is, for every […]

The following result is a special case of Theorem 2.3, and to formulate it we denote by $\operatorname{conv}(T)$ the convex hull of the set $T$.
Theorem 1.1. There exist constants $c_0, \ldots, c_4$ depending only on $L$ such that the following holds. Fix […] then with probability at least $1 - 8\exp(-c_2 m\rho/R)$, for any $x, y \in \operatorname{conv}(T)$ such that $\|x-y\|_2 \ge \rho$, one has (1.3) […]

Theorem 1.1 shows that to approximate Euclidean distances in $T$ it is sufficient to use a number of hyperplanes that is proportional to the squared Gaussian mean width of $T$. The latter quantity is a natural measure of the 'intrinsic dimension' of the set. For instance, if $E$ is a $d$-dimensional subspace, then $T = E \cap B_2^n$ has mean width $\ell_*^2(T) \sim d$. Another example that plays an important role in what follows is $T = \Sigma_{s,n}$, the set of all $s$-sparse vectors in the unit ball; in this case $\ell_*^2(T) \sim \log\binom{n}{s} \sim s\log(en/s)$. Note that the lower estimate in (1.3) implies that the hyperplanes endow a $\rho$-uniform tessellation: any cell of the tessellation of $T$ has diameter at most $\rho$.
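The two mean-width examples can be checked by simulation. The sketch below is a Monte Carlo illustration under assumptions: it uses the elementary facts that the supremum of $\langle G, x\rangle$ over $\Sigma_{s,n}$ is the Euclidean norm of the $s$ largest entries of $|G|$, and that the supremum over the unit ball of a $d$-dimensional coordinate subspace is the norm of a $d$-dimensional Gaussian vector. The sizes and the number of trials are hypothetical.

```python
import numpy as np

def mean_width_sparse(n, s, trials=2000, seed=0):
    """Monte Carlo estimate of E sup_{x in Sigma_{s,n}} <G, x>.

    For a fixed G the supremum equals the l2-norm of the s largest |G_i|.
    """
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((trials, n))
    top = np.sort(np.abs(G), axis=1)[:, -s:]       # s largest |G_i| per draw
    return float(np.mean(np.linalg.norm(top, axis=1)))

def mean_width_subspace_ball(d, trials=2000, seed=0):
    """Monte Carlo estimate of E sup <G, x> over the unit ball of a d-dim coordinate subspace."""
    rng = np.random.default_rng(seed)
    return float(np.mean(np.linalg.norm(rng.standard_normal((trials, d)), axis=1)))

n, s, d = 1000, 10, 50
ratio_sparse = mean_width_sparse(n, s) ** 2 / (s * np.log(np.e * n / s))
ratio_subspace = mean_width_subspace_ball(d) ** 2 / d
print(ratio_sparse, ratio_subspace)
```

Both ratios stay within constant factors of one, which is what the relations $\ell_*^2(T) \sim s\log(en/s)$ and $\ell_*^2(T) \sim d$ express.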
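In the second scenario we explore heavy-tailed random variables: again $X$ is isotropic and symmetric, but in addition we only assume that $X$ satisfies an $L_1$-$L_2$ equivalence: for every $x \in \mathbb{R}^n$,
$$\|\langle X, x\rangle\|_{L_2} \le L\,\|\langle X, x\rangle\|_{L_1}.$$
In the heavy-tailed scenario a different complexity parameter dictates the required number of hyperplanes. For $K \subset \mathbb{R}^n$ we consider
$$E(K) = \mathbb{E}\sup_{x \in K}\Big|\frac{1}{\sqrt{m}}\sum_{i=1}^m \varepsilon_i \langle X_i, x\rangle\Big|,$$
where $(\varepsilon_i)_{i\ge1}$ is a sequence of independent, symmetric $\{-1,1\}$-valued random variables that is independent of $X_1, \ldots, X_m$. If $X_1, \ldots, X_m$ happen to be isotropic, symmetric and subgaussian, then $E(K) \le c\,\ell_*(K)$ for an absolute constant $c$.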
Remark 1.2. The fact that $E(K)$ is dominated by the Gaussian mean width of $K$ is one of the features of subgaussian processes and is an outcome of Talagrand's majorizing measures theorem [25]. Finding upper bounds on $E(K)$ when $X$ is not subgaussian is a challenging question that has been studied extensively over the last 30 years or so and which will not be pursued here.
Theorem 1.3 is a special case of Theorem 2.2 below. In what follows, given $K \subset \mathbb{R}^n$ and $r > 0$ we denote by $N(K, r)$ the smallest number of Euclidean balls of radius $r$ that are needed to cover $K$.

$^1$ Recall that a random vector is isotropic if its covariance matrix is the identity; thus, for every $x \in \mathbb{R}^n$, $\mathbb{E}\langle X, x\rangle^2 = \|x\|_2^2$.

Theorem 1.3. There exist constants $c_0, \ldots, c_4$ that depend only on $L$ for which the following holds. Fix […] Then with probability at least $1 - 8\exp(-c_3 m(\rho/R)^2)$, for every $x, y \in U$ that satisfy $\|x-y\|_2 \ge \rho$, (1.5) […]

Remark 1.4. It should come as no surprise that the uniform upper estimate on $d(x, y)$ deteriorates the more 'heavy-tailed' the random vector $X$ is, while at the same time the lower bound is universal. It reflects the fact that such lower bounds are due to a small-ball property rather than tail estimates. This universal behaviour implies that almost regardless of the choice of $X$, if $x$ and $y$ are reasonably 'far apart' then their distance is exhibited by the fraction of tessellation hyperplanes that separate the points.
The connection between the number of hyperplanes $m$ and the accuracy $\rho$ is less explicit in Theorem 1.3, because $E(U_r)$ depends on $m$. And even though the uniform central limit theorem shows that $E(U_r)$ converges to $\ell_*(U_r)$ as $m$ tends to infinity, we are interested in quantitative estimates, which are, in general, nontrivial. Since estimating $E(U_r)$ is not the main focus of this article, we shall not pursue this question any further. We only consider the set $T = \Sigma_{s,n}$ for the sake of illustration. In this case, $U, U_r \subset 4($[…] If the latter parameter can be bounded by a multiple $C\ell_*(\Sigma_{s,n})$ of the Gaussian width, then Theorem 1.3 shows that (1.5) holds if […] which is, up to worse dependencies on $R$ and $\rho$, the same scaling as in the subgaussian case.
As it happens, this can be guaranteed under very mild assumptions on the vector $X$ by using techniques from [15,17]. For instance, $X$ can be any isotropic, unconditional, log-concave vector. More striking is the case where the coordinates of $X$ are independent copies of a mean-zero random variable $\xi$ satisfying, for some $c > 0$, $(\mathbb{E}|\xi|^p)^{1/p} \lesssim c p^{\alpha}$ for all $p \le \log n$.
In this case, the assumption is satisfied with $C \lesssim c e^{2\alpha-1}$. Thus, one only needs control of the first $\log n$ moments and higher moments may not even exist. This condition is satisfied by a wide variety of extremely heavy-tailed random variables. We refer to [9, Section V] for proofs and many explicit examples.
Before we present the proofs of Theorems 1.1 and 1.3, let us explore the connection between random hyperplane tessellations and signal recovery problems. Readers who are solely interested in hyperplane tessellations can safely skip straight to Section 2, where the proofs may be found.
1.1. Application to one-bit compressed sensing. One good reason for studying non-Gaussian random hyperplane tessellations of arbitrary sets comes from signal recovery problems involving quantized measurements. By quantization we mean converting analog measurements of a signal into a finite number of bits. This essential step is part of any signal processing procedure and allows one to digitally transmit, process, and reconstruct signals. The area of quantized compressed sensing investigates how to design a measurement procedure, quantizer, and reconstruction algorithm that together recover low-complexity signals, such as signals that have a sparse representation in a given basis. An efficient system has to be able to reconstruct signals based on a minimal number of measurements, each of which is quantized to the smallest number of bits, and to do so via a computationally efficient reconstruction algorithm. In addition, the system should be reliable: it should be robust to both pre-quantization noise (noise in the analog measurement process) and post-quantization noise (bit corruptions that occur during the quantization process).
Our interest is in the popular one-bit compressed sensing model, in which one observes quantized measurements of the form
$$q = \operatorname{sign}(Ax + \nu_{\mathrm{noise}} + \tau_{\mathrm{thres}}), \tag{1.6}$$
where $A \in \mathbb{R}^{m\times n}$, $m \ll n$, $\operatorname{sign}$ is the sign function applied element-wise, $\nu_{\mathrm{noise}} \in \mathbb{R}^m$ is a vector modelling the noise in the analog measurement process and $\tau_{\mathrm{thres}} \in \mathbb{R}^m$ is a (possibly random) vector consisting of quantization thresholds. We restrict ourselves to memoryless quantization, meaning that the thresholds are set in a non-adaptive manner. In this case, the one-bit quantizer $\operatorname{sign}(\cdot + \tau_{\mathrm{thres}})$ can be implemented efficiently in practice, and because of its efficiency it has been very popular in the engineering literature, especially in applications in which analog-to-digital converters represent a significant factor in the energy consumption of the measurement system (see e.g. [5,18]).
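For concreteness, the measurement model (1.6), together with adversarial bit flips after quantization, can be simulated as follows. This is a sketch under assumptions: the helper names, the symmetric Bernoulli measurement matrix, the Gaussian noise level and all sizes are illustrative choices, and a zero analog value is quantized to $+1$.

```python
import numpy as np

def one_bit_measurements(A, x, tau, noise=None):
    """q = sign(A x + noise + tau), with values in {-1, +1} (0 mapped to +1 by convention)."""
    y = A @ x + tau
    if noise is not None:
        y = y + noise
    return np.where(y >= 0, 1, -1)

def corrupt(q, idx):
    """Flip the bits of q at the positions idx (post-quantization corruption)."""
    q_corr = q.copy()
    q_corr[idx] *= -1
    return q_corr

rng = np.random.default_rng(0)
m, n, s, lam = 60, 200, 5, 1.5                    # m << n; illustrative sizes
A = rng.choice([-1.0, 1.0], size=(m, n))          # symmetric Bernoulli rows
x = np.zeros(n)
x[rng.choice(n, size=s, replace=False)] = rng.standard_normal(s)
x /= max(1.0, np.linalg.norm(x))                  # s-sparse vector in the unit ball

tau = rng.uniform(-lam, lam, size=m)              # memoryless uniform thresholds
noise = 0.05 * rng.standard_normal(m)             # pre-quantization noise
q = one_bit_measurements(A, x, tau, noise)
q_corr = corrupt(q, rng.choice(m, size=3, replace=False))
print(int(np.sum(q != q_corr)))                   # number of corrupted bits
```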
In spite of its popularity, there are few rigorous results that show that one-bit compressed sensing is viable: the vast majority of the mathematical literature (see e.g. [3,13,14,20,21]) has focused on the special case where $A$ is a standard Gaussian matrix, and the practical relevance of such results is limited, since Gaussian matrices cannot be realized in a real-world measurement setup. As an additional difficulty, it is well known that one-bit compressed sensing may perform poorly outside the Gaussian setup. In fact, it can very easily fail, even if the measurement matrix is known to perform optimally in 'unquantized' compressed sensing. For example, if the threshold vector $\tau_{\mathrm{thres}} = 0$, there are $2$-sparse vectors that cannot be distinguished based on their one-bit Bernoulli measurements (see Figure 3).
As an application of the new hyperplane tessellation results described in the previous section, we show that one-bit compressed sensing can actually perform well in scenarios that are far more general than the Gaussian setting. What makes all the difference is the rather striking effect that dithering (that is, adding well-designed 'noise' to the measurements before quantizing) has on the one-bit quantizer. Indeed, we show that thanks to dithering, accurate recovery from one-bit measurements is possible even if the measurement vectors are drawn from a heavy-tailed distribution. Moreover, the recovery results we establish are robust to both adversarial and potentially heavy-tailed stochastic noise on the analog measurements, as well as to adversarial bit corruptions that may occur during quantization. In what follows we explain why dithering has such an effect: the geometric interpretation of dithering leads to random tessellations that can be used to approximate distances between signals, and the ability to approximate distances has a crucial impact on the performance of recovery procedures.
To understand the connection between hyperplane tessellations and signal recovery from one-bit quantized measurements, let us first assume that no bit corruptions occur in the quantization process and that there is no pre-quantization noise ($\nu_{\mathrm{noise}} = 0$). In this case, we observe $q = \operatorname{sign}(Ax + \tau_{\mathrm{thres}})$. If we let $X_1, \ldots, X_m$ denote the rows of $A$ and $\tau_1, \ldots, \tau_m$ the entries of $\tau_{\mathrm{thres}}$, then $q$ exactly encodes the cell of the hyperplane tessellation in which the signal $x$ is located. A popular strategy to recover $x$ is to search for a vector $x^\# \in T$ that is quantization consistent, i.e., $q = \operatorname{sign}(Ax^\# + \tau_{\mathrm{thres}})$. For instance, if $T = \Sigma_{s,n}$, the set of all $s$-sparse vectors in the unit ball, then we can find such a vector by solving (1.7) […]

Geometrically, a quantization consistent vector is simply a vector lying in the same cell as $x$.
We can ensure that $\|x^\# - x\|_2 \le \rho$ by showing that $\|x - y\|_2 \le \rho$ for any $y \in T$ located in the same cell as $x$. Since we have no further information on the identity of the cell in which $x$ is located, one has to ensure that any pair of points in $T$ located in the same cell are at distance at most $\rho$ from each other, i.e., the hyperplanes $H_{X_i,\tau_i}$ must induce a $\rho$-uniform tessellation of $T$. Phrased differently, if $x, y \in T$ are at distance at least $\rho$, then that fact must be exhibited by the hyperplanes $H_{X_i,\tau_i}$: at least one of the hyperplanes must separate $x$ and $y$. Thus, given a $\rho$-uniform tessellation of $T$, one can uniformly recover signals from $T$ using only $\operatorname{sign}(Ax + \tau_{\mathrm{thres}})$ as data. Moreover, the reverse direction is clearly true: the degree of accuracy in uniform recovery results in $T$ is determined by the largest diameter (in $T$) of a cell of the tessellation endowed by the hyperplanes $H_{X_i,\tau_i}$.

Unfortunately, even if $(H_{X_i,\tau_i})_{i=1}^m$ induces a uniform tessellation of $T$ there is still the question of pre- and post-quantization noise one has to contend with. To understand the effect of post-quantization noise (i.e., bit corruptions that occur during quantization), assume that one observes a corrupted sequence of bits $q_{\mathrm{corr}} \in \{-1,1\}^m$, where the $i$-th bit being corrupted means that instead of receiving $q_i = \operatorname{sign}(\langle X_i, x\rangle + \tau_i)$ from the quantizer, one observes $(q_{\mathrm{corr}})_i = -\operatorname{sign}(\langle X_i, x\rangle + \tau_i)$; thus, one is led to believe that $x$ is on the 'wrong side' of the $i$-th hyperplane $H_{X_i,\tau_i}$. As a consequence, recovery methods that search for a quantization consistent vector can easily fail even if a single bit is corrupted. For instance, the program (1.7) (with $q$ replaced by $q_{\mathrm{corr}}$) will in the best case scenario search for a vector in the wrong cell of the tessellation, and in the worst case the corrupted bit may cause a conflict and there will be no sparse vector $z$ satisfying $q_{\mathrm{corr}} = \operatorname{sign}(Az + \tau_{\mathrm{thres}})$ (see Figure 4 for an illustration).
Figure 4. The effect of a bit corruption associated with the dashed, red hyperplane $H_{X_i,\tau_i}$. Either the bit corruption leads the program (1.7) (with $q$ replaced by $q_{\mathrm{corr}}$) to search in the wrong cell of the tessellation marked by the red dot (the picture on the l.h.s.) or causes the program to be infeasible (the picture on the r.h.s.).
The effect of pre-quantization noise (i.e., noise in the analog measurement process) is equally problematic: noise simply causes a parallel shift of the hyperplane $H_{X_i,\tau_i}$, and one has no control over the size of this 'noise-induced' shift. Again, the recovery program (1.7) (with $q = \operatorname{sign}(Ax + \nu_{\mathrm{noise}} + \tau_{\mathrm{thres}})$) can easily fail if pre-quantization noise is present (see Figure 5).
One possible way of overcoming this 'infeasibility problem' due to noise is by designing a recovery program that is stable: its output does not change by much even if some of the given bits are misleading. For example, one may try to search for a vector $z \in T$ whose uncorrupted quantized measurements $\operatorname{sign}(Az + \nu_{\mathrm{noise}} + \tau_{\mathrm{thres}})$ are closest to the observed corrupted vector $q_{\mathrm{corr}}$. However, since one does not have access to $\nu_{\mathrm{noise}}$, one can only try to match its proxy $\operatorname{sign}(Az + \tau_{\mathrm{thres}})$ to $q_{\mathrm{corr}}$, i.e., to solve
$$\min_{z \in T}\ d_H\big(q_{\mathrm{corr}}, \operatorname{sign}(Az + \tau_{\mathrm{thres}})\big), \tag{1.8}$$
where $d_H$ denotes the Hamming distance. In the context of sparse recovery, the latter program is
$$\min_{z \in \mathbb{R}^n}\ d_H\big(q_{\mathrm{corr}}, \operatorname{sign}(Az + \tau_{\mathrm{thres}})\big) \quad \text{subject to} \quad \|z\|_0 \le s,\ \|z\|_2 \le 1.$$

Figure 5. The effect of a noise-induced parallel shift of the dashed, blue hyperplane $H_{X_i,\tau_i}$ onto the dashed, red hyperplane $H_{X_i,\nu_i+\tau_i}$. The program (1.7) (with $q = \operatorname{sign}(Ax + \nu_{\mathrm{noise}} + \tau_{\mathrm{thres}})$) searches for a vector $z$ with $\operatorname{sign}(\langle X_i, z\rangle$ […] This means that the program incorrectly searches for a solution located to the right of the dashed, blue hyperplane $H_{X_i,\tau_i}$; as a consequence, a solution is found in the wrong cell of the tessellation marked by the red dot (the picture on the l.h.s.) or it can even happen that no feasible point exists (the picture on the r.h.s.).

To ensure that (1.8) yields an accurate reconstruction, the uniform tessellation has to be finer than in the corruption-free case: even if some signs are 'flipped', the distance between points in the resulting cell and points in the true one should still be small. Our hyperplane tessellation results ensure this: for any $x, y \in T$ that are at least $\rho$-separated there are many hyperplanes that separate the two points, of the order of $\|x - y\|_2\, m$. Thus, even after corrupting $\rho m$ bits one may still detect that $x$ and $y$ are 'far away' from one another.

Finally, although (1.8) can guarantee robust signal recovery, there are no guarantees that it can be solved efficiently. In addition, since (1.8) matches $\operatorname{sign}(Az + \tau_{\mathrm{thres}})$, rather than $\operatorname{sign}(Az + \nu_{\mathrm{noise}} + \tau_{\mathrm{thres}})$, to $q_{\mathrm{corr}}$, it is still quite sensitive to pre-quantization noise. Both problems can be mended by convexification. Indeed, observe that
$$d_H\big(q_{\mathrm{corr}}, \operatorname{sign}(Az + \tau_{\mathrm{thres}})\big) = \frac{1}{2}\sum_{i=1}^m \Big(1 - (q_{\mathrm{corr}})_i \operatorname{sign}(\langle X_i, z\rangle + \tau_i)\Big).$$
One may relax this objective function by replacing $\operatorname{sign}(\langle X_i, z\rangle + \tau_i)$ by $\langle X_i, z\rangle + \tau_i$ […] An equivalent formulation of this program, which only requires the known data $q_{\mathrm{corr}}$ and $A$, is (1.10) […] This program was proposed in [21] and in what follows we explore a regularized version of (1.10): for $\lambda > 0$ we consider (1.11) […]; in the context of sparse recovery, this corresponds to the tractable program […]

Let us formulate our main signal recovery results, which are direct outcomes of the results on random tessellations. Fix a target reconstruction error $\rho$, recall that the quantization thresholds $\tau_i$ are i.i.d. uniformly distributed in $[-\lambda, \lambda]$, assume that the entries $\nu_i$ of $\nu_{\mathrm{noise}}$ are i.i.d. copies of a random variable $\nu$ and that at most $\beta m$ of the bits are arbitrarily corrupted during quantization, i.e., the observed corrupted vector $q_{\mathrm{corr}}$ satisfies $d_H(q_{\mathrm{corr}}, q) \le \beta m$. The adversarial component of the pre-quantization noise $\nu$ is $|\mathbb{E}\nu|$, $\sigma^2$ is its variance and $\|\nu\|_{L_2}$ is its $L_2$ norm. We write $T_r = (T - T) \cap rB_2^n$ for any $r > 0$.
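The Hamming-distance matching program and its relaxation can be illustrated on a toy instance. The sketch below rests on several assumptions that are not taken from the text: the signal set is replaced by a finite, randomly generated candidate set so that (1.8) can be solved by exhaustive search, the function names and sizes are hypothetical, and the relaxation is taken to replace $\operatorname{sign}(t)$ by $t$. The code checks the identity $d_H(q, s) = \frac{1}{2}\sum_i (1 - q_i s_i)$ for $\pm 1$ vectors, and that the relaxed objective differs from $-\frac{1}{2}\langle q_{\mathrm{corr}}, Az\rangle$ only by a term that does not depend on $z$.

```python
import numpy as np

def signs(v):
    # Values in {-1, +1}; 0 is mapped to +1 (an assumed convention).
    return np.where(v >= 0, 1, -1)

def hamming_decode(A, tau, q_corr, candidates):
    """Exhaustive search for (1.8) over a finite candidate set (rows of `candidates`)."""
    codes = signs(candidates @ A.T + tau)            # shape (N, m)
    dists = np.sum(codes != q_corr, axis=1)
    best = int(np.argmin(dists))
    return candidates[best], int(dists[best])

def relaxed_objective(A, tau, q_corr, z):
    """Relaxation of the Hamming objective with sign(t) replaced by t (assumed form)."""
    return 0.5 * np.sum(1 - q_corr * (A @ z + tau))

rng = np.random.default_rng(1)
n, m, lam, N, flips = 10, 400, 2.0, 200, 8
A = rng.standard_normal((m, n))
tau = rng.uniform(-lam, lam, size=m)
candidates = rng.standard_normal((N, n))
candidates /= np.maximum(1.0, np.linalg.norm(candidates, axis=1, keepdims=True))

x = candidates[0]                                    # true signal, a member of the set
q_corr = signs(A @ x + tau)
q_corr[rng.choice(m, size=flips, replace=False)] *= -1

x_hat, dist = hamming_decode(A, tau, q_corr, candidates)

# Identity d_H(q, s) = (1/2) * sum_i (1 - q_i s_i) for +-1 vectors.
s_hat = signs(A @ x_hat + tau)
identity_rhs = 0.5 * np.sum(1 - q_corr * s_hat)

# The relaxed objective is affine in z with linear part -(1/2) <q_corr, A z>.
z1, z2 = candidates[1], candidates[2]
gap_relaxed = relaxed_objective(A, tau, q_corr, z1) - relaxed_objective(A, tau, q_corr, z2)
gap_linear = -0.5 * (q_corr @ (A @ z1) - q_corr @ (A @ z2))
print(dist, identity_rhs, np.isclose(gap_relaxed, gap_linear))
```

Since the true signal is itself a candidate at Hamming distance equal to the number of flipped bits, the minimizer can never be further from $q_{\mathrm{corr}}$ than that. The last check reflects why a formulation of the relaxed program needs only $q_{\mathrm{corr}}$ and $A$: the thresholds enter through a constant.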
Our first recovery result concerns the recovery program (1.8) in the $L$-subgaussian scenario, in which the rows $X_i$ of $A$ are i.i.d. copies of a symmetric, isotropic, $L$-subgaussian vector $X$. In addition, we assume that $\nu$ is $L$-subgaussian: […]

Theorem 1.5. There exist constants $c_0, \ldots, c_4 > 0$ depending only on $L$ such that the following holds. Let […] and that $|\mathbb{E}\nu| \le c_3\rho$, $\sigma \le c_3\rho/\log(e\lambda/\rho)$ and $\beta \le c_3\rho/\lambda$.
Then with probability at least $1 - 10\exp(-c_4 m\rho/\lambda)$, for every $x \in T$, any solution $x^\#$ of (1.8) satisfies $\|x^\# - x\|_2 \le \rho$.

To put Theorem 1.5 in some context, consider an arbitrary $T \subset B_2^n$ and assume $\|\nu\|_{L_2} \le 1$, so that $\lambda$ is a constant that depends only on $L$. By Sudakov's inequality,
$$\log N(T, r) \lesssim \frac{\ell_*^2(T)}{r^2}, \tag{1.12}$$
and trivially $\ell_*(T_r) \le \ell_*(T)$, which means that a sample size of […] suffices for recovery. In the special case of $T = \Sigma_{s,n}$, the subset of $B_2^n$ consisting of $s$-sparse vectors, a much better estimate is possible. Indeed, it is standard to verify that there is an absolute constant $c$ such that for any $1 \le s \le n$,
$$\ell_*(\Sigma_{s,n}) \lesssim \sqrt{s\log(en/s)} \quad \text{and} \quad \log N(\Sigma_{s,n}, r) \le cs\log\Big(\frac{en}{sr}\Big), \tag{1.13}$$
implying that a sample size of
$$m = c(L)\,\rho^{-1} s\log\Big(\frac{en}{s\rho}\Big) \tag{1.14}$$
guarantees that with high probability one can recover any $s$-sparse vector in $B_2^n$ with accuracy $\rho$ via (1.8).
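To get a feeling for the scaling in (1.14), one can tabulate it for a few parameter values. The helper below is purely illustrative: the constant $c(L)$ is not specified in the text, so the default `c=1.0` is a placeholder and the resulting numbers indicate growth rates only, not actual measurement budgets.

```python
import math

def sample_size(n, s, rho, c=1.0):
    """Scaling (1.14): m = c * s * log(e n / (s rho)) / rho.

    The constant c stands in for the unspecified c(L); c=1.0 is a placeholder.
    """
    return math.ceil(c * s * math.log(math.e * n / (s * rho)) / rho)

m_coarse = sample_size(n=10_000, s=20, rho=0.5)
m_fine = sample_size(n=10_000, s=20, rho=0.1)
print(m_coarse, m_fine)
```

Reducing the target accuracy from $0.5$ to $0.1$ increases the required number of measurements roughly by the factor $\rho^{-1}$, with only a logarithmic correction.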
To illustrate this result, assume $\|\nu\|_{L_2} \le 1$ and consider $T = \Sigma_{s,n}$, so that $\lambda$ is a constant depending only on $L$. Since $T_r \subset r\Sigma_{2s,n}$, the first term in (1.15) is bounded by $E^2(\Sigma_{2s,n})$. The latter can be bounded by $\ell_*^2(\Sigma_{2s,n})$ under the assumptions on $X$ mentioned after Theorem 1.3. Taking into account (1.13), it follows that even for these heavy-tailed vectors the sample size (1.14) is sufficient for recovery.
Let us compare Theorems 1.5 and 1.6 to existing work. As was mentioned previously, almost all the signal reconstruction results in (memoryless) one-bit compressed sensing concern standard Gaussian measurement matrices, see e.g. [8] for an overview. The most closely related work is [13], which concerns the situation when there is no dithering ($\tau_{\mathrm{thres}} = 0$). Recall that in that case it is only possible to recover signals located on the unit sphere. It was shown in [13, Theorem 2] that if $A \in \mathbb{R}^{m\times n}$ is standard Gaussian and $m \gtrsim \rho^{-1}s\log(n/\rho)$ then, with high probability, any $s$-sparse $x, x' \in S^{n-1}$ for which $\operatorname{sign}(Ax) = \operatorname{sign}(Ax')$ satisfy $\|x - x'\|_2 \le \rho$. In particular, one can approximate $x$ with accuracy $\rho$ by solving the non-convex program […]

In comparison, Theorem 1.5 shows that the same result holds in the subgaussian scenario, and at the same time extends it to sparse vectors in the unit ball and makes it robust to pre- and post-quantization noise. Clearly, such a generalization is possible thanks to the effect of dithering. Remarkably, Theorem 1.6 shows that this result can be extended further to a large class of heavy-tailed measurements. This is the first known recovery result involving quantized heavy-tailed measurements.
In [3,14] the authors study sparse recovery with Gaussian measurements and introduce standard Gaussian dithering to derive recovery results for sparse vectors in the unit ball. The idea behind these results is to use a 'lifting trick': for instance, in [3] one interprets the dithered measurements $\operatorname{sign}(Ax + \tau)$ as $\operatorname{sign}([A\ \tau][x, 1]/\|[x, 1]\|_2)$, where $[A\ \tau]$ is obtained by appending $\tau$ to $A$ as an additional column. Since $[A\ \tau]$ is again standard Gaussian, recovery methods for sparse vectors with unit norm can be used to find an approximation of […] by the distance between the latter two vectors. Since this lifting argument is based on a reduction to the one-bit compressed sensing model with zero thresholds, it 'imports' the strong limitations of that model; in particular, it cannot be used to derive recovery results for a general class of non-Gaussian measurements. In addition, since the recovery methods in [3,14] rely on enforcing quantization consistency, they are not robust to post-quantization noise. In contrast, thanks to our new geometric understanding of the effect of dithering, we find robust recovery results for non-Gaussian measurement matrices and general signal sets.
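The algebra behind the lifting trick is easy to verify numerically: appending the dither as a column and the constant $1$ as a coordinate does not change the bits, and neither does normalizing the lifted vector, because the sign is invariant under positive scaling. The sketch below only checks this identity; the variable names and sizes are assumptions, and it says nothing about the recovery guarantees of [3,14].

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 50, 8                                   # illustrative sizes
A = rng.standard_normal((m, n))
tau = rng.standard_normal(m)                   # Gaussian dither, as in the lifting argument

x = rng.standard_normal(n)
x *= 0.5 / np.linalg.norm(x)                   # a vector inside the unit ball

lifted_matrix = np.column_stack([A, tau])      # [A tau], an m x (n+1) matrix
lifted_x = np.append(x, 1.0)                   # [x, 1]
lifted_unit = lifted_x / np.linalg.norm(lifted_x)

q_dithered = np.sign(A @ x + tau)              # dithered measurements of x
q_lifted = np.sign(lifted_matrix @ lifted_unit)  # homogeneous measurements of the lifted unit vector
print(np.array_equal(q_dithered, q_lifted))
```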
Finally, let us present our main recovery result for (1.11). We only analyze this recovery program in the $L$-subgaussian scenario. We again assume that $\nu$ is $L$-subgaussian with variance $\sigma^2$. We assume for the sake of simplicity that $\mathbb{E}\nu = 0$ and denote $U = \operatorname{conv}(T)$ and $U_\rho = (U - U) \cap \rho B_2^n$.

Theorem 1.7. There exist constants $c_0, \ldots, c_4$ that depend only on $L$ for which the following holds. Let $T \subset RB_2^n$, fix $\rho > 0$, set […] and let $r = c_1\rho/\log(e\lambda/\rho)$. If $m$ and $\beta$ satisfy […] then, with probability at least $1 - 8\exp(-c_4 m\rho^2/\lambda^2)$, for any $x \in T$ the solution $x^\#$ of (1.11) satisfies $\|x^\# - x\|_2 \le \rho$.
As an example, let $T = \sqrt{s}B_1^n \cap B_2^n$, the set of approximately $s$-sparse vectors in the Euclidean unit ball, and assume that $\sigma \le 1$. Observe that $T = U$ and that one may set $\lambda = c_0(L)\log(e/\rho)$. Also, for $0$ […], and it is standard to verify that $\ell_*(U_\rho) \lesssim \sqrt{s\max\{\log(en\rho^2/s), 1\}}$. Taking the estimate (1.12) for $\log N(T, r)$ into account, it is evident that if $m \gtrsim \rho^{-4}\, s\log(en/s)\log^3(e/\rho)$ then with high probability one may recover any $x \in T$ using the convex recovery procedure (1.11), even in the presence of pre- and post-quantization noise.
In the context of Gaussian measurement matrices, Theorem 1.7 improves upon work of Plan and Vershynin [21], who considered the situation when there is no dithering ($\tau_{\mathrm{thres}} = 0$). They introduced the convex program (1.10) and proved recovery results for signal sets $T \subset S^{n-1}$ of two different flavours. In a non-uniform recovery setting they showed that $m \gtrsim \rho^{-4}\ell_*^2(T)$ measurements suffice to reconstruct a fixed signal, even if pre-quantization noise is present and quantization bits are randomly flipped with a probability that is allowed to be arbitrarily close to $1/2$. In the uniform recovery setting, they showed that if $m \gtrsim \rho^{-12}\ell_*^2(T)$, one can achieve a reconstruction error $\rho$ even if a fraction $\beta = \rho^2$ of the received bits are corrupted in an adversarial manner while quantizing. Theorem 1.7 extends the latter result to subgaussian measurements with a better condition on $m$ and $\beta$, and at the same time incorporates pre-quantization noise and allows the reconstruction of signals that need not be located on the unit sphere.
When the measurements are not standard Gaussian, there are very few reconstruction results available. The work [1] generalized the non-uniform recovery results from [21] to subgaussian measurements under additional restrictions. For a fixed $x \in T$ with $T \subset S^{n-1}$ they showed that $m \gtrsim \rho^{-4}\ell_*^2(T)$ measurements suffice to reconstruct $x$ up to error $\rho$ via (1.10) provided that either $\|x\|_\infty \le \rho^4$ (meaning that the signal must be sufficiently spread) or the total variation distance between the subgaussian measurements and the standard Gaussian distribution is at most $\rho^{16}$. Theorem 1.7 significantly improves on these results.
Remark 1.8. At the expense of substantial additional technicalities, the proof strategies developed in this work lead to recovery results for sparse vectors when $A$ is a random partial circulant matrix generated by a subgaussian random vector. The latter model occurs in several practical measurement setups, including SAR radar imaging, Fourier optical imaging and channel estimation (see e.g. [24] and the references therein). To keep this work accessible to a general audience and clearly expose the main ideas, we choose to defer the additional technical developments needed for the circulant case to a companion work [10].
1.2. Notation. We use $\|x\|_p$ to denote the $\ell_p$-norm of $x \in \mathbb{R}^n$ and $B_p^n$ denotes the $\ell_p$-unit ball in $\mathbb{R}^n$. For a subgaussian random variable $\xi$ we let […] We use $\mathcal{U}$ to denote the uniform distribution. For $k \in \mathbb{N}$ we set $[k] = \{1, \ldots, k\}$ and for a set $S$ we let $|S|$ denote its cardinality. $d_H$ is the (unnormalized) Hamming distance on the discrete cube and $\Sigma_{s,n} = \{x \in \mathbb{R}^n : \|x\|_0 \le s, \|x\|_2 \le 1\}$ is the set of $s$-sparse vectors in the Euclidean unit ball. For $T \subset \mathbb{R}^n$ we set $T_r = (T - T) \cap rB_2^n$ and denote by $\operatorname{conv}(T)$ its convex hull. As before, we use $\ell_*(T)$ to denote the Gaussian mean width of $T$ and, for any $r > 0$, we denote by $N(T, r)$ the smallest number of Euclidean balls of radius $r$ that are needed to cover $T$. Finally, $c$ and $C$ denote absolute constants; their value may change from line to line. $c_\alpha$ or $c(\alpha)$ denotes a constant that depends only on the parameter $\alpha$. We write $a \lesssim_\alpha b$ if $a \le c_\alpha b$, and $a \sim_\alpha b$ means that both $a \lesssim_\alpha b$ and $a \gtrsim_\alpha b$ hold.

Random tessellations
This section is devoted to the proof of our main tessellation results, Theorems 2.2 and 2.3, which are generalizations of Theorems 1.3 and 1.1, respectively. Before we formulate the results let us define a mild structural property of a subset of a metric space.

Definition 2.1. Let $(\mathcal{X}, d)$ be a metric space. A set $T \subset \mathcal{X}$ is $(r, \gamma)$-metrically convex in $\mathcal{X}$ if for every $x, y \in T$ there are $z_1, \ldots, z_\ell \in \mathcal{X}$ such that […] where we set $z_0 = x$, $z_{\ell+1} = y$. If $\mathcal{X} = T$, then we say that $T$ is $(r, \gamma)$-metrically convex.
The idea behind this notion is straightforward: it implies that controlling 'local oscillations' of a function $f$ ensures that it satisfies a Lipschitz condition for long distances. Indeed, assume that $\sup_{\{w, v \in \mathcal{X} : d(w,v) \le r\}} |f(w) - f(v)| \le \kappa$ and for any $x, y \in T$ that satisfy $d(x, y) \ge 2r$ let $(z_i)_{i=0}^{\ell+1}$ be as in Definition 2.1. Then […] Therefore, $f$ satisfies a Lipschitz condition for long distances with constant $\kappa/\gamma^2 r$.
Observe that if $T$ is a convex subset of a normed space then it is $(r, 1)$-metrically convex for any $r > 0$; also, every subset of a normed space is $(r, 1)$-metrically convex in its convex hull. Finally, $\Sigma_{s,n}$ is $(r, \gamma)$-metrically convex in $\Sigma_{2s,n}$ for an absolute constant $\gamma$. We omit the straightforward proofs of these claims.
Let us first state our main result in the heavy-tailed scenario. We consider a random vector $X$ that is isotropic, symmetric, and satisfies an $L_1$-$L_2$ norm equivalence: i.e., that for every

Theorem 2.2. There exist constants $c_0, \dots, c_4$ that depend only on $L$ for which the following holds. Let $T \subset RB_2^n$ and set $\lambda \ge c_0 R$. Suppose that $0 < r < \rho < \lambda$ satisfy $r \le c_1 \rho^2/\lambda$ and assume that

Then with probability at least $1 - 8\exp(-c_3 m(\rho/\lambda)^2)$, for every $x, y \in T$ that satisfy $\|x - y\|_2 \ge \rho$,

Moreover, if $T$ is $(r, \gamma)$-metrically convex then on the same event, if $\|x - y\|_2 \ge 2r$,

Proof of Theorem 1.3. Apply Theorem 2.2 to the set $U = \mathrm{conv}(T)$, which is $(r, 1)$-metrically convex for any $r > 0$, and for the parameters $\lambda = c_0 R$ and $r = c_1 \rho^2/R$. With these choices Theorem 1.3 follows immediately.
When X is L-subgaussian one may establish a sharper result.
In the context of tessellations, Theorem 2.2 and the first part of Theorem 2.3 improve the estimate from (1.2) in several ways. Firstly, Theorem 2.2 holds for a very general collection of random vectors $X$: the vector has to satisfy a small-ball condition rather than being Gaussian. Secondly, both are valid for any subset of $\mathbb{R}^n$ and not just for subsets of the sphere. Finally, if $X$ happens to be $L$-subgaussian, Theorem 2.3 yields the best known estimate on the diameter of each 'cell' in the random tessellation, even when $X$ is Gaussian and $T$ is a subset of $S^{n-1}$.
2.1. The heavy-tailed scenario. A fundamental question that is at the heart of our arguments has to do with stability: given two points $x$ and $y$, how 'stable' is the set

$(\ast) \quad \{i : \mathrm{sign}(\langle X_i, x\rangle + \tau_i) \neq \mathrm{sign}(\langle X_i, y\rangle + \tau_i)\}$

to perturbations? If one believes that the cardinality of $(\ast)$ reflects the distance $\|x - y\|_2$, it stands to reason that if $r$ is significantly smaller than $\|x - y\|_2$ and $\|x - x'\|_2 \le r$, $\|y - y'\|_2 \le r$, then $|\{i : \mathrm{sign}(\langle X_i, x'\rangle + \tau_i) \neq \mathrm{sign}(\langle X_i, y'\rangle + \tau_i)\}|$ should not be very different from $|(\ast)|$.
Unfortunately, stability is not true in general. If either $x$ or $y$ is 'too close' to many of the separating hyperplanes, then even a small shift in either one of them can have a dramatic effect on the signs of $\langle X_i, \cdot\rangle + \tau_i$ and destroy the separation. Thus, to ensure stability one requires a stronger property than mere separation: points need to be separated by a large margin.
Definition 2.4. The hyperplane $H_{X_i,\tau_i}$ $\theta$-well-separates $x$ and $y$ if • sign(

Denote by $I_{x,y}(\theta) \subset [m]$ the set of indices for which $H_{X_i,\tau_i}$ $\theta$-well-separates $x$ and $y$.

The condition that
is precisely what ensures that perturbations of $x$ or $y$ of the order of $\|x - y\|_2$ do not spoil the fact that the hyperplane $H_{X_i,\tau_i}$ separates the two points.
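The notion of well-separation can be explored numerically. The following is a minimal sketch, not the authors' construction: it assumes that $\theta$-well-separation means opposite signs together with a margin of at least $\theta\|x - y\|_2$ on both sides (which is how the surrounding text describes it), and it uses Gaussian normal vectors purely for illustration. The helper names `dot` and `well_separating_indices` and all numerical parameters are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def well_separating_indices(normals, shifts, x, y, theta):
    """Indices i whose hyperplane separates x and y with margin theta * ||x - y||_2.

    Assumed reading of the definition: the two affine values have opposite
    signs, and both |<X_i, x> + tau_i| and |<X_i, y> + tau_i| are at least
    theta * ||x - y||_2.
    """
    dist = math.dist(x, y)
    indices = []
    for i, (X, tau) in enumerate(zip(normals, shifts)):
        a = dot(X, x) + tau
        b = dot(X, y) + tau
        if a * b < 0 and min(abs(a), abs(b)) >= theta * dist:
            indices.append(i)
    return indices

rng = random.Random(0)
n, m, lam = 5, 2000, 4.0  # illustrative dimension, number of hyperplanes, shift range
normals = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]  # Gaussian, for illustration only
shifts = [rng.uniform(-lam, lam) for _ in range(m)]                    # tau_i ~ U[-lam, lam]
x = [0.5, 0.0, 0.0, 0.0, 0.0]
y = [-0.5, 0.0, 0.0, 0.0, 0.0]

separating = well_separating_indices(normals, shifts, x, y, theta=0.0)
well_separating = well_separating_indices(normals, shifts, x, y, theta=0.1)
print(len(separating) / m, len(well_separating) / m)
```

With `theta=0.0` the function returns the plain separating set, so the well-separating set for any positive `theta` is always a subset of it.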
We begin by showing that even in the heavy-tailed scenario and with high probability, $|I_{x,y}(\theta)|$ is proportional to $m\|x - y\|_2$ for any two (fixed) points $x$ and $y$. Let us stress that the high probability estimate is crucial: it will lead to a uniform control on a net of a large cardinality.

Theorem 2.5. There are constants $c_1, \dots, c_4$ that depend only on $L$ for which the following holds. Let $x, y \in RB_2^n$ and set $\lambda \ge c_1 R$. With probability at least

The proof of Theorem 2.5 requires two preliminary observations. Consider a random variable $\tau$ that satisfies the small ball estimate

and let $Z$ be independent of $\tau$. Then clearly (2.4) and $(\tau_i)_{i=1}^m$ are independent copies of $Z$ and $\tau$ respectively, then with probability at least $1 - 2\exp(-cm\varepsilon/\lambda)$, (2.5) $|\{i$

The second observation is somewhat more involved. Consider a random variable $\tau$ that satisfies (2.6)

Let $Z$ and $W$ be square integrable whose difference satisfies a small-ball condition: there are constants $\kappa$ and $\delta$ such that

Lemma 2.6. There are absolute constants $c_0$ and $c_1$ and constants $c_2, c_3 \sim c_\tau \kappa \delta$ such that the following holds. Assume that $Z$ and $W$ are independent of $\tau$ and that

If $(\tau_i)_{i=1}^m$, $(Z_i)_{i=1}^m$ and $(W_i)_{i=1}^m$ are independent copies of $\tau$, $Z$ and $W$ respectively, then with probability at least

Proof.
Set θ to be named later and observe that where c 1 is an absolute constant; a similar estimate holds for (W i ) m i=1 .At the same time, recall that The above shows that there is an event A of (Z, W )-probability at least 1 − 2 exp(−c 3 δm) on which the following holds: there exists J ⊂ [m] of cardinality at least δm/4 such that for every j ∈ J,

Now fix two sequences of numbers $(z_i)_{i=1}^m$ and $(w_i)_{i=1}^m$ and consider the independent events

Recall that by (2.6), for every
Hence, for every realization of (Z i ) m i=1 and (W i ) m i=1 from the event A, It follows that there are absolute constants c 4 and c 5 , such that with τ -probability at least 1 Thus, with the desired probability with respect to Next, let us consider the random variable τ and the random vector X from Theorem 2.2: τ ∼ U[−λ, λ] and X is isotropic, symmetric and satisfies an L 1 -L 2 norm equivalence with constant L. By the Paley-Zygmund inequality (see, e.g., [6]) there are constants κ and δ that depend only on L for which, for every t ∈ R n , Therefore, τ satisfies (2.6) with constant c τ = 1/(2λ) and the random variables Z = X, x and W = X, w satisfy Lemma 2.6 with constants κ and δ that depend only on the equivalence constant L.
Proof of Theorem 2.5.Clearly, by Lemma 2.6, x − y 2 λ with the promised probability, using the fact that One has to show that in addition, | X i , x + τ i | and | X i , x + τ i | are also reasonably large.To that end, one may apply (2.4) twice, for Z = X, x and Z = X, y , to see that for any ε > 0, Therefore, with probability at least 1 − 2 exp −c ε λ m , there are at most 4εm/λ indices i for which min Next, one has to use the individual high probability estimate from Theorem 2.5 to obtain a uniform estimate in T .The idea is to use a covering argument combined with a simple stability property: Lemma 2.7.Fix a realization of X and τ and set r > 0. Assume that w − v 2 ≥ r and that If v and w are θ-well separated by H X,τ then x and y are separated by H X,τ .
Figure 6. Picture of a 'good' hyperplane $H_{X_i,\tau_i}$ that well-separates $v$ and $w$. On the one hand, one needs to shift the hyperplane in parallel by a distance proportional to $\theta\|v - w\|_2$ to hit $w$ (shift marked in red). On the other hand, the parallel shift needed to hit $y$ when starting from $w$ is less than half this distance (shift marked in blue). As a consequence, a good hyperplane separates $x$ and $y$.
Proof. Since $v$ and $w$ are $\theta$-well separated by $H_{X,\tau}$, one has

it follows that $\mathrm{sign}(\langle X, x\rangle + \tau) \neq \mathrm{sign}(\langle X, y\rangle + \tau)$ (see Figure 6 for an illustration).
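The mechanism behind this stability property is elementary and can be checked on scalars. The sketch below is only an illustration under an assumed form of the hypotheses: the margin is taken to be $\theta\|v - w\|_2$, and the perturbations $\langle X, x - v\rangle$ and $\langle X, y - w\rangle$ are assumed to be at most half the margin in absolute value, as the figure caption suggests. The function names are hypothetical.

```python
def separates(a, b):
    """True if the affine values a = <X, p> + tau and b = <X, q> + tau have opposite signs."""
    return a * b < 0

def stable_under_shift(a_v, a_w, margin, shift_x, shift_y):
    """Check, for a single hyperplane, that separation survives small perturbations.

    a_v, a_w : values <X, v> + tau and <X, w> + tau at the two reference points.
    margin   : assumed well-separation margin, theta * ||v - w||_2.
    shift_x  : <X, x - v>; shift_y : <X, y - w>. Both are assumed to be
               at most margin / 2 in absolute value.
    """
    # hypotheses: v and w are separated with the given margin
    assert separates(a_v, a_w) and min(abs(a_v), abs(a_w)) >= margin
    # hypotheses: the perturbations are smaller than half the margin
    assert max(abs(shift_x), abs(shift_y)) <= margin / 2
    # conclusion: the perturbed points x and y are still separated
    return separates(a_v + shift_x, a_w + shift_y)

result = stable_under_shift(a_v=0.3, a_w=-0.4, margin=0.3, shift_x=-0.1, shift_y=0.1)
print(result)
```

Since each affine value moves by at most half the margin, it cannot cross zero, which is why the conclusion holds whenever the two hypotheses do.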
The key component in the proof of Theorem 2.2 is the following fact: Theorem 2.8.There exist constants c 0 , . . ., c 6 that depend only on L for which the following holds.Let λ ≥ c 0 R, r ≤ λ/2, and r ≤ r /4.Assume that and that Then with probability at least 1−8 exp −c 4 m(r /λ) 2 , for every x, y ∈ T such that x−y 2 ≥ 2r , and for every x, y ∈ T such that x − y 2 ≤ r /2, Proof.Let V ⊂ T be an r -cover of T .We apply (2.5) to every Z = X, v , v ∈ V , and Theorem 2.5 to every pair of points from V .Let c 1 ≤ min{c, c 2 }/2, where c and c 2 are as in (2.5) and Theorem 2.5, respectively.If then by the union bound there is an event A 1 of probability at least 1 − 6 exp (−c 2 mr /λ) such that for every v ∈ V , where the constants c 2 , c 3 and c 4 depend only on L. Now fix x, y ∈ T that satisfy x − y 2 ≥ 2r and let v, w be the nearest points in V to x and y respectively.By Lemma 2.7, if i ∈ I v,w (c 3 ) and then x and y are separated by , let c3 = min{c 3 , 1} and set A 2 to be the event which is the wanted lower bound.At the same time, if x − v 2 ≤ r then by combining (2.10) and (2.11), one has the upper bound All that is left is to estimate the probability of the event A 2 .Note that and by the bounded differences inequality (see e.g.[4, Theorem 6.2]), for a suitable absolute constant c.The claim follows with the choice of t = (c 4 /4) • (r /λ).
Proof of Theorem 2.2.We apply Theorem 2.8 for the choice r = ρ/2.Let us identify the conditions on r one has to impose to ensure that (2.8) is satisfied.By the Giné-Zinn symmetrization theorem [11] and the contraction inequality for Bernoulli processes [16], one has To satisfy (2.8) it suffices to bound both terms by cmρ/λ.The required estimate on (2) holds once and to ensure a suitable estimate on (1) it suffices that The claim follows by setting r = r .This immediately yields the lower bound in Theorem 2.2.To complete the proof of the upper bound, recall that T is (r, γ)-metrically convex.For given x, y with x − y 2 ≥ 2r let (z j ) j=1 be as in Definition 2.1.Then |{i : sign( X i , x +τ i ) = sign( X i , y +τ i )}| ≤ j=0 |{i : sign( X i , z j +τ i ) = sign( X i , z j+1 +τ i )}|, and the claim follows from the 'local' upper bound (2.9).
2.2.The subgaussian scenario.When X is an L-subgaussian random vector one may establish an improved version of Theorem 2.8: first, by showing that one may take r to be of the order of r up to a logarithmic factor; and second, by providing a better probability estimate on the outcome.Moreover, thanks to the subgaussian property, one may replace the empirical parameter E(T r ) by its Gaussian counterpart, * (T r ).
The only difference between the proof of Theorem 2.9 and that of Theorem 2.8 is the control one has on the probability that (2.14) sup

When $X$ merely satisfies an $L_1$-$L_2$ norm equivalence, one has to resort to the bounded differences inequality for a high probability estimate. However, when $X$ is $L$-subgaussian one has more machinery at one's disposal. Specifically, we use the following fact.

We omit the proof of Theorem 2.10, which is standard. It is based on generic chaining (see e.g. [7, Theorem 3.2]) combined with Talagrand's majorizing measures theorem [25].
Proof of Theorem 2.9.Observe that if (a i ) m i=1 is a sequence of nonnegative numbers then the k-largest element satisfies With that in mind, one has to ensure that for k = Cmr /λ, The rest of the proof of Theorem 2.9 is identical to that of Theorem 2.8 and is omitted.Now one may complete the proof of Theorem 2.3, by setting ρ = 2r and r = r /2, and noting that Theorem 2.9 yields the lower bound and the 'local' upper bound.The upper bound follows directly from the local upper bound and the metric convexity assumption (see the end of the proof of Theorem 2.2 for this argument).
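The observation on the $k$-largest element at the start of the proof of Theorem 2.9 can be checked numerically. The sketch below assumes the inequality in question is the elementary bound $a_k^* \le \frac{1}{k}\max_{|I| \le k}\sum_{i \in I} a_i$ for nonnegative $a_i$, where $a_k^*$ is the $k$-th largest entry; the displayed formula is not reproduced in the text, so this reading is an assumption. The helper names and test data are hypothetical.

```python
import random

def kth_largest(values, k):
    """Return the k-th largest entry (k = 1 gives the maximum)."""
    return sorted(values, reverse=True)[k - 1]

def max_subset_sum(values, k):
    """Maximum of sum_{i in I} values[i] over index sets I with |I| <= k.

    For nonnegative values this is the sum of the k largest entries.
    """
    return sum(sorted(values, reverse=True)[:k])

rng = random.Random(1)
a = [abs(rng.gauss(0.0, 1.0)) for _ in range(50)]  # arbitrary nonnegative test data

# gap between the averaged top-k sum and the k-th largest entry, for every k
gaps = [max_subset_sum(a, k) / k - kth_largest(a, k) for k in range(1, len(a) + 1)]
print(min(gaps))
```

The bound holds because the $k$-th largest entry is no larger than the average of the $k$ largest entries.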

Random tessellations - noisy measurements
With the machinery developed for the proofs of Theorem 2.2 and Theorem 2.3 at our disposal, let us present the proofs of Theorem 1.5 and Theorem 1.6.
Recall that $T \subset RB_2^n$ and that $X$ is an isotropic and symmetric random vector, while $\tau \sim \mathcal{U}[-\lambda, \lambda]$. The rows of the measurement matrix $A$ are $(X_i)_{i=1}^m$ and the given observations are the coordinates of the vector $q_{\mathrm{corr}}$, which is a corrupted version of $\mathrm{sign}(Ax + \nu_{\mathrm{noise}} + \tau_{\mathrm{thres}}) = (\mathrm{sign}(\langle X_i, x\rangle + \nu_i + \tau_i))_{i=1}^m$, by at most $\beta m$ 'sign flips'. In the first scenario, $X$ and $\nu$ satisfy an $L_1$-$L_2$ norm equivalence with constant $L$, while in the second they are $L$-subgaussian.
The goal is to show that there is a constant $C$ depending only on $\lambda$ and $L$ so that with high probability, for any $x, y \in T$ that satisfy $\|x - y\|_2 \ge \rho$,

and at the same time,

Together these conditions imply that the recovery program (1.8), which minimizes the Hamming distance between $q_{\mathrm{corr}}$ and $(\mathrm{sign}(\langle X_i, z\rangle + \tau_i))_{i=1}^m$ with respect to $z \in T$, achieves reconstruction accuracy $\rho$ as long as the fraction of the corrupted bits is at most $\beta \le C\rho/4$. Indeed, (3.2) implies that any solution $x^\#$ of (1.8) must satisfy $d_H(q_{\mathrm{corr}}, (\mathrm{sign}(\langle X_i, x^\#\rangle + \tau_i))_{i=1}^m) < Cm\rho/2$ and then (3.1) shows that $\|x^\# - x\|_2 \le \rho$.
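To make the Hamming-distance minimization concrete, here is a toy sketch. It is not the program (1.8) itself: the set $T$ is replaced by a small finite list of candidate points so that the minimization can be done by brute force, the pre-quantization noise $\nu$ is set to zero, and the normal vectors are Gaussian for illustration only. All function names and parameters are hypothetical.

```python
import random

def sign(t):
    # convention assumed here: sign(0) = +1
    return 1 if t >= 0 else -1

def one_bit(normals, shifts, z):
    """Bit string (sign(<X_i, z> + tau_i))_{i=1}^m of the point z."""
    return [sign(sum(a * b for a, b in zip(X, z)) + tau)
            for X, tau in zip(normals, shifts)]

def hamming(p, q):
    """Unnormalized Hamming distance between two bit strings."""
    return sum(1 for a, b in zip(p, q) if a != b)

def recover(candidates, normals, shifts, q_corr):
    """Brute-force minimizer of the Hamming distance over a finite candidate set."""
    return min(candidates, key=lambda z: hamming(q_corr, one_bit(normals, shifts, z)))

rng = random.Random(2)
n, m, lam, flips = 3, 400, 2.0, 8  # illustrative sizes; flips plays the role of beta * m
normals = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
shifts = [rng.uniform(-lam, lam) for _ in range(m)]
candidates = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(30)]  # finite stand-in for T

x = candidates[0]                       # the signal to be recovered
q_corr = one_bit(normals, shifts, x)    # noiseless one-bit measurements
for i in rng.sample(range(m), flips):   # adversary stand-in: flip a few bits
    q_corr[i] = -q_corr[i]

x_hat = recover(candidates, normals, shifts, q_corr)
residual = hamming(q_corr, one_bit(normals, shifts, x_hat))
print(residual, x_hat == x)
```

Because the true signal is itself a candidate and disagrees with `q_corr` on exactly the flipped bits, the minimizer's residual can never exceed the number of flips; this is the toy analogue of the role played by (3.2).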
The proofs of (3.1) in both scenarios follow from minor modifications of the results established in the previous section.Rather than repeating the arguments, let us sketch the adjustments one has to make.
First, one has to consider a modified notion of being 'well-separated' by a hyperplane:

Definition 3.1. The hyperplane $H_{X_i,\tau_i}$ $\theta$-well-separates $x$ and $y$ if

Denote by $J_{x,y}(\theta) \subset [m]$ the set of indices for which $H_{X_i,\tau_i}$ $\theta$-well-separates $x$ and $y$.
Next, one has to establish the analog of Theorem 2.5 and show that every pair $x, y$ is well separated by a fraction of the hyperplanes that is proportional to $\|x - y\|_2$.

Theorem 3.2. There are constants $c_1, \dots, c_4$ that depend only on $L$ for which the following holds. Let $x, y \in RB_2^n$ and set $\lambda$

Just as in Theorem 2.5, the proof is an outcome of Lemma 2.6, only this time with the choice $Z = \langle X, x\rangle + \nu$ and $W = \langle X, y\rangle$. To that end, one has to verify that $Z - W$ satisfies a small-ball condition, and that is immediate from the following observation and the Paley-Zygmund inequality.
Lemma 3.3. Let $Z$ and $W$ be as above. Then

Proof. Let $\alpha, \beta \in \mathbb{R}$ and set $\varepsilon$ to be a symmetric, $\{-1, 1\}$-valued random variable. Clearly,

Since $X$ is a symmetric random vector, $\langle X, x - y\rangle$ has the same distribution as $\varepsilon\langle X, x - y\rangle$, where $\varepsilon$ is independent of $X$ and of $\nu$. Hence,

The other components needed for ensuring separation in the sense of Definition 3.1 follow from an identical argument used in the proof of Theorem 2.5, by conditioning on $X_i$ and $\nu_i$ rather than just on $X_i$.

Theorem 3.2 allows one to control the set of 'centers' $V$ of a cover of $T$, and all that remains now is to show that if $x$ is close to a center $x'$ and $y$ is close to a center $y'$ then there will be few indices for which $\mathrm{sign}(\langle X_i, x\rangle + \nu_i + \tau_i) \neq \mathrm{sign}(\langle X_i, x'\rangle + \nu_i + \tau_i)$, or $\mathrm{sign}(\langle X_i, y\rangle + \tau_i) \neq \mathrm{sign}(\langle X_i, y'\rangle + \tau_i)$. In both cases, and using the notation of the previous section, one may follow the argument in the proof of Theorem 2.8. Thus, it suffices to show that (3.3) sup

and that concludes the proof of the bound (3.1) with $C \sim_L 1/\lambda$.
The proof in the subgaussian case is analogous and therefore omitted.

Robust recovery via a convex program
This section is devoted to the proof of Theorem 1.7. Set (4.1)

Recall that $U = \mathrm{conv}(T)$ and that $U_\rho = (U - U) \cap \rho B_2^n$; $X_1, \dots, X_m$ are the rows of the matrix $A$ which are distributed according to an isotropic, symmetric, $L$-subgaussian random vector $X$; and $\tau \sim \mathcal{U}[-\lambda, \lambda]$. Here we assume for the sake of simplicity that $\nu$ has mean zero and variance $\sigma^2$, though the modifications needed to handle the case in which $\nu$ has a nontrivial adversarial component are straightforward. Finally, as before $q_{\mathrm{corr}} \in \{-1, 1\}^m$ satisfies (4.2) $d_H(q_{\mathrm{corr}}, \mathrm{sign}(Ax + \nu_{\mathrm{noise}} + \tau_{\mathrm{thres}})) \le \beta m$.
As in most regularized procedures, the idea is to study the 'excess functional' $\varphi(z) - \varphi(x)$: for a reconstruction error $\rho$, we determine a sufficient condition on the number of measurements $m$ which guarantees that $\varphi(z) - \varphi(x) < 0$ whenever $z \in U$ and $\|x - z\|_2 \ge \rho$. Clearly, that implies that the solution $x^\#$ to (4.1) satisfies $\|x^\# - x\|_2 \le \rho$. As before, we wish to obtain a uniform estimate, i.e., the high probability event under which the above holds should not depend on the identity of $x \in T$.
The first step towards a uniform estimate is a decomposition of the excess functional.Note that We use this decomposition to find constants C and ρ > 0 and a high probability event on which, for every x ∈ T and z ∈ U , Estimating (3).The starting point is a straightforward observation: for τ ∼ U[−λ, λ] and any z ∈ R, Lemma 4.1.There exist absolute constants C and c for which the following holds.Let Z and W be random variables and let τ ∼ U[−λ, λ] be independent of Z and W . Then Proof.By (4.4), Hence, E sign(Z + τ )W = 1 λ EW Z + E( * ), and all that is left to show is for absolute constants C and c.
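The effect of a uniformly distributed threshold on the expected sign can be verified directly. The sketch below checks the elementary identity $\mathbb{E}\,\mathrm{sign}(z + \tau) = z/\lambda$ for $|z| \le \lambda$ (and $\mathrm{sign}(z)$ for $|z| > \lambda$) when $\tau \sim \mathcal{U}[-\lambda, \lambda]$; this is presumably the observation recorded in (4.4), although the displayed formula is not reproduced in the text. The function name `mean_sign` and the grid size are hypothetical choices.

```python
def mean_sign(z, lam, grid=100_000):
    """Midpoint-rule approximation of E sign(z + tau) for tau ~ U[-lam, lam]."""
    total = 0
    for j in range(grid):
        tau = -lam + 2.0 * lam * (j + 0.5) / grid  # midpoint of the j-th cell
        total += 1 if z + tau >= 0 else -1
    return total / grid

lam = 2.0
inside = mean_sign(0.6, lam)    # |z| <= lam: should be close to z / lam
negative = mean_sign(-1.0, lam) # |z| <= lam with z < 0
outside = mean_sign(5.0, lam)   # |z| > lam: the sign is deterministic
print(inside, negative, outside)
```

In the regime $|z| \le \lambda$ the threshold 'linearizes' the sign function in expectation, which is why correlating one-bit measurements with $W$ recovers $\frac{1}{\lambda}\mathbb{E}WZ$ up to a tail correction.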
Corollary 4.2.There exist absolute constants c and C for which the following holds.For every x, z ∈ R n , 1 m E sign(Ax + ν noise + τ thres ), A(z − x) , where, as always, L is the subgaussian constant of X and of ν.
In particular, Proof.The first statement follows from Lemma 4.1 for the choices of Z i = X i , x + ν i and W i = X i , z − x : recalling that the ν i are centred, have variance σ 2 and are independent of X i , 2 ) ≤ cL 2 ( x 2 2 + σ 2 ) ≤ cL 2 (R 2 + σ 2 ); and, because X is isotropic, The 'in particular' part is evident because Corollary 4.2 leads to the wanted estimate on (3): Observe that for every x, z, and η i = (q corr ) i − sign( X i , x + ν i + τ i ), one has that To that end, observe the following: if f : R n → R + is positive homogeneous and W ⊂ R n is star-shaped around 0, i.e., θw ∈ W for all w ∈ W and 0 < θ < 1, then sup {w∈W : w 2 ≥ρ} f (w/ w 2 2 ) = sup {w∈W : w 2 =ρ} f (w)/ρ 2 .
We will refer to this argument, which reflects the general fact that star-shaped sets become 'relatively richer' close to their centres, as a 'star-shape argument'.Then with probability at least 1−2 exp(−c 3 βm log(e/β)), for every x, z ∈ U such that z −x 2 ≥ ρ one has Proof.Since U − U is star-shaped around 0, the above star-shape argument implies that for any I ⊂ Let A be the event from Lemma 4.6.Recall that P(A) ≥ 1 − 2 exp(−cmr /λ) and that on A, for any v ∈ V and x ∈ T such that x − v 2 ≤ r , sign ( X i , x + ν i + τ i ) differs from sign ( X i , v + ν i + τ i ) on at most (3r /λ)m indices.Therefore, for every v ∈ V , Let us begin by estimating the right hand side of (4.10).For every fixed v ∈ V and conditioned on (X i , ν i , τ i ) m i=1 , this is the supremum of a Bernoulli process indexed by a set of the form m i=1 a i w i : w ∈ W , where (a i ) m i=1 is a fixed vector of signs and W ⊂ X i , u m i=1 : u ∈ U ρ .By the contraction inequality for Bernoulli processes [16] applied conditionally on (X i , ν i , τ i ) m i=1 , it follows that Hence, for any t ≥ 1, with probability at least 1 − e −t ,

Figure 3. Bernoulli vectors in $\mathbb{R}^2$ can only generate two different homogeneous hyperplanes. As a result, there exist two points on the sphere (the examples $e_1$ and $(e_1 + \lambda e_2)/\sqrt{1 + \lambda^2}$ for $-1 < \lambda < 0$ are marked in red) which are far apart, but cannot be separated by a Bernoulli hyperplane. This problem persists in high dimensions. In addition, any two points lying on a straight line through the origin (examples are marked in green) cannot be separated by a homogeneous hyperplane (the latter problem is not specific to the Bernoulli case). Both problems can be solved by using parallel shifts of the hyperplanes.
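The first example in this caption is small enough to enumerate. The sketch below lists all Bernoulli normal vectors in $\mathbb{R}^2$ and checks that none of the corresponding homogeneous hyperplanes separates $e_1$ from $(e_1 + \lambda e_2)/\sqrt{1 + \lambda^2}$, while a suitable shift does. The value $-0.5$ for the caption's parameter $\lambda$ (called `slope` in the code to avoid confusion with the shift range) and the shift `tau = -0.7` are arbitrary illustrative choices.

```python
import itertools
import math

def sign(t):
    # convention assumed here: sign(0) = +1
    return 1 if t >= 0 else -1

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

slope = -0.5                       # the caption's parameter, any value in (-1, 0)
norm = math.hypot(1.0, slope)
x = (1.0, 0.0)                     # e_1
y = (1.0 / norm, slope / norm)     # (e_1 + slope * e_2) / sqrt(1 + slope^2)

bernoulli_normals = list(itertools.product((-1.0, 1.0), repeat=2))

# homogeneous hyperplanes (no shift): normals that separate x and y
homogeneous = [X for X in bernoulli_normals
               if sign(dot(X, x)) != sign(dot(X, y))]

# the same normals with a parallel shift tau
tau = -0.7
shifted = [X for X in bernoulli_normals
           if sign(dot(X, x) + tau) != sign(dot(X, y) + tau)]

print(homogeneous, shifted)
```

Without a shift, both inner products have the sign of the first coordinate of the normal vector, so no separation is possible; the shift places the hyperplane between the two projections.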

Theorem 2.10. Let $X$ be an isotropic $L$-subgaussian random vector and let $S \subset \mathbb{R}^n$. If $1 \le k \le m$ and $u \ge 1$ then with probability at least $1 - 2\exp(-c_1 u^2 k \log(em/k))$,
$$\sup_{z \in S} \max_{|I| \le k} \Big( \sum_{i \in I} \langle X_i, z\rangle^2 \Big)^{1/2} \le c_2 \Big( \ell_*(S) + u\, d_S \sqrt{k \log(em/k)} \Big),$$
where $c_1$ and $c_2$ depend only on $L$ and $d_S = \sup_{z \in S} \|z\|_2$.