On efficient algorithms for computing near-best polynomial approximations to high-dimensional, Hilbert-valued functions from limited samples

Sparse polynomial approximation has become indispensable for approximating smooth, high- or infinite-dimensional functions from limited samples. This is a key task in computational science and engineering, e.g., surrogate modelling in UQ where the function is the solution map of a parametric or stochastic PDE. Yet, sparse polynomial approximation lacks a complete theory. On the one hand, there is a well-developed theory of best $s$-term polynomial approximation, which asserts exponential or algebraic rates of convergence for holomorphic functions. On the other hand, there are increasingly mature methods such as (weighted) $\ell^1$-minimization for computing such approximations. While the sample complexity of these methods has been analyzed in detail, the matter of whether or not these methods achieve such rates is not well understood. Furthermore, these methods are not algorithms per se, since they involve exact minimizers of nonlinear optimization problems. This paper closes these gaps. Specifically, we pose and answer the following question: are there robust, efficient algorithms for computing approximations to finite- or infinite-dimensional, holomorphic and Hilbert-valued functions from limited samples that achieve best $s$-term rates? We answer this in the affirmative by introducing algorithms and theoretical guarantees that assert exponential or algebraic rates of convergence, along with robustness to sampling, algorithmic, and physical discretization errors. We tackle both scalar- and Hilbert-valued functions, this being particularly relevant to parametric and stochastic PDEs. Our work involves several significant developments of existing techniques, including a novel restarted primal-dual iteration for solving weighted $\ell^1$-minimization problems in Hilbert spaces. Our theory is supplemented by numerical experiments demonstrating the practical efficacy of these algorithms.


Introduction
A fundamental task in computational science and engineering involves accurately approximating a smooth target function from limited data. Such a task arises notably in the study of parametric models of physical processes. Here the variables represent the parameters in the system, e.g., material properties, forcing terms, or boundary information, and the parametric model is often represented as a (system of) Differential Equations (DEs) depending on these parameters. Important objectives include understanding how the choice of such parameters affects the output(s) of the system and, in the stochastic setting, how uncertainty in the parameter values propagates to the output, the latter being one of the key tasks in Uncertainty Quantification (UQ) [59,89,121,124].

High-dimensional function approximation from limited samples
Abstractly, this task can be recast as that of approximating an unknown target function f : U → V from the samples

{(y_i, f(y_i))}_{i=1}^m ⊂ U × V.  (1.1)

Here, the input space U is typically a subset of R^d (in the finite-dimensional case) or R^N (in the infinite-dimensional case). The output space V could be a scalar field, a finite-dimensional vector space or an infinite-dimensional Banach or Hilbert space. This problem is challenging in a number of ways. First, the dimension d is high, since modern parametric models typically involve many parameters. It may also be infinite, e.g., in the case of a random field represented via its Karhunen-Loève expansion. Therefore, the curse of dimensionality is an ever-present issue. Furthermore, the number of samples m is often highly limited. For example, in the parametric DE setting, each evaluation of f involves an expensive computational simulation. The data (1.1) is also always corrupted by errors, due to noise in physical experiments or numerical error in solving a DE. And finally, since the output f(y) is often the solution of a DE parametrized by the vector y, it may consequently take values in an infinite-dimensional Banach or Hilbert space. While it is commonplace to circumvent this issue in practice by considering scalar-valued quantities of interest (i.e., functions of the form g(y) = Q(f(y)) for some known map Q : V → C), approximating the full function f is both of theoretical interest and practical importance [54].

Remark 1.1 As a further consideration, we note that in many scenarios one may have substantial flexibility to choose the sample points y_1, …, y_m ∈ U in (1.1). However, in other scenarios they may be fixed, e.g., when dealing with legacy data. In this work, we consider Monte Carlo sampling, which may be regarded either as a chosen sampling strategy or a fixed one, depending on the setting. Here, the samples are drawn randomly and independently from an underlying probability measure on U. This is very common in practice, in particular in UQ settings.
Holomorphy motivates the use of polynomial approximation techniques, and in particular, best s-term polynomial approximation. Here, the function f is approximated by an s-term expansion corresponding to its largest s coefficients (measured in the V-norm) with respect to a polynomial basis. Common choices include Taylor polynomials, tensor-product Legendre and Chebyshev polynomials on bounded hypercubes, or tensor-product Hermite and Laguerre polynomials on R^d or [0, ∞)^d. Over the last fifteen years, there has been a significant effort in developing the approximation theory of such techniques (see the aforementioned references, plus those in §1.6). Signature results have established exponential and algebraic convergence rates for the best s-term approximation. The former assert that the error decays at least exponentially fast in s^{1/d} in finite dimensions for any holomorphic function. The latter assert that the error decays algebraically fast; specifically, like s^{1/2−1/p} for some 0 < p < 1. These algebraic rates also hold in infinite dimensions, thus establishing best s-term approximation as a (theoretical) means to approximate holomorphic functions of infinitely many variables. We review several such results in §2.6.

Computing sparse polynomial approximations
Unfortunately, the best s-term approximation cannot generally be computed from the samples (1.1). Indeed, constructing it theoretically involves computing and then searching over infinitely-many coefficients. Both tasks are generally impossible. Therefore, there has also been a focus on methods to compute accurate polynomial approximations from the samples.
One line of work focuses on least-squares methods, wherein a polynomial approximation (or sequence of approximations) is computed in a fixed polynomial subspace (or sequence of nested subspaces). See §1.6 for relevant references. Such methods are essentially optimal if a (sequence of) polynomial subspace that gives a quasi-best s-term approximation is known.
However, this information is generally unavailable in practice (although it may be for certain simple parametric DEs). It essentially equates to knowing the region of holomorphy of the underlying function, which is itself similar to knowing the order of importance of the parametric variables, and their relative strengths. To counter this, there are adaptive least-squares methods [34,36,42,47,60,98,99]. Here one strives to construct such subspaces adaptively using the given data (1.1), typically via a greedy procedure. However, these currently lack theoretical guarantees [36,42].
To overcome this limitation, there has also been a substantial focus on methods inspired by compressed sensing [11,58,138]. See §1.6 for references. These methods seek a polynomial approximation in a larger subspace, whose coefficients are defined as a minimizer of an ℓ^1- or weighted ℓ^1-minimization problem. A substantial focus in this area has been determining the sample complexity of such schemes, i.e., quantifying how many (Monte Carlo) samples m are sufficient to obtain an approximation with a certain guaranteed error bound, related to a (weighted) best approximation error. Unfortunately, such an error bound does not easily translate into precise rates of approximation. Another key limitation of existing work is that such methods are not algorithms per se. Indeed, they consider exact minimizers of nonlinear optimization problems, which cannot be computed exactly in finitely-many arithmetic operations.

Main contributions
Least-squares and compressed sensing techniques are commonly applied to compute polynomial approximations to parametric and stochastic DEs. However, there is a key gap between theory and practice. The theory of best s-term approximation asserts the existence of polynomial approximations that attain specific algebraic or exponential rates of convergence for arbitrary holomorphic functions. Yet, it is currently unknown whether or not similar rates in terms of the number of samples m can be obtained via an algorithm that computes a polynomial approximation from the samples (1.1). The purpose of this work is to close this gap. We focus on the symmetric hypercube U = [−1, 1]^d, where d ∈ N or d = ∞, and the case of Chebyshev or Legendre polynomials. We assume that V is an arbitrary separable Hilbert space. We then construct algorithms that, given an input in the form of noisy sample values

d_i = f(y_i) + n_i, i = 1, …, m,  (1.2)

of an unknown function f : U → V, compute the coefficients of a polynomial approximation f̂ that satisfies an error bound of the form

‖f − f̂‖_{L^2(U;V)} ≤ E_app + E_samp + E_disc + E_alg  (1.3)

with probability at least 1 − ε with respect to the (Monte Carlo) draw of the sample points y_i. Here ‖·‖_{L^2(U;V)} is the Lebesgue-Bochner norm, where the underlying measure is the Chebyshev (arcsine) or uniform measure. Our error bound (1.3) provides a complete accounting of the main sources of error in the problem:
• E_app is a polynomial approximation error term. Depending on the specific setup, it decays algebraically or exponentially with respect to m (up to several log terms). For instance, in the infinite-dimensional setting, this term is given by

E_app = C · (m/(c_0 L))^{1/2 − 1/p},  (1.4)

where c_0 ≥ 1 is a universal constant, C is a constant depending on (the region of holomorphy of) f only, L is a polylogarithmic factor in m, and 0 < ε < 1 is the failure probability of (1.3). It is completely equivalent to the corresponding algebraic decay rate for the best s-term approximation error, except with s replaced by m/(c_0 L).
• E_samp is the sampling error and is equal to

E_samp = ‖n‖_{2;V} = ( Σ_{i=1}^m ‖n_i‖_V^2 )^{1/2},

i.e., the norm of the error in the samples (1.2). In other words, the algorithm is stable to noise in the samples.
• E_disc is the physical discretization error. This term accounts for the fact that an algorithm cannot work with (i.e., take as input, or perform computations in) V when it is an infinite-dimensional Hilbert space. The algorithm therefore works in a finite-dimensional discretization space V_h ⊆ V. This is a standard step in parametric DEs, where discretization is often performed via techniques such as the Finite Element Method (FEM). In this case, V_h is a finite element space. The term E_disc quantifies the effect of this error. It is given by

E_disc = ‖f − P_h(f)‖_{L^∞(U;V)},

where P_h : V → V_h is the orthogonal projection onto V_h. In other words, the effect of working in V_h instead of V is determined by the error of the (pointwise) best approximation P_h(f) to f from V_h. If V has finite dimension we assume V_h = V, which implies that E_disc = 0.
• E_alg is the algorithmic error. It depends on the number of iterations t performed by the algorithm that computes the coefficients of the polynomial approximation f̂. We construct one type of algorithm where this term is O(1/t) as t → ∞. This decay is relatively slow, especially in the regime where E_app is exponentially small in m. However, we also present an efficient algorithm for which this term decays exponentially fast in t (specifically, O(e^{−t}) as t → ∞), subject to an additional theoretical constraint. This constraint is seemingly an artefact of the proof. Our numerical experiments suggest it is unnecessary in practice.
Note that these computational cost estimates also depend polynomially on the dimension of the discretization space V h .

Discussion and further contributions
This work bridges a gap between best s-term polynomial approximation theory and algorithms for computing such approximations from sample values. In particular, it asserts that algebraic and exponential rates with respect to the number of samples m, highly similar to those of the best approximation, can be achieved by such algorithms. In other words, polynomial approximations of holomorphic functions can be achieved in a sample-efficient manner. Furthermore, they can be computed in subexponential or algebraic computational cost. Our main results assume holomorphy of the underlying function in order to attain these rates. However, they assume no a priori knowledge of the region of holomorphy. As discussed, if such information is available, then least-squares methods can be used more straightforwardly to compute an approximation. The holomorphy assumption is made in order to have concrete algebraic and exponential rates. However, our algorithms exist independently of the smoothness assumption. It would be possible to also provide rates for other classes of functions, e.g., those possessing finite orders of (mixed) smoothness. We use holomorphy as our assumption due to its strong connections with the theory of parametric DEs.
Our algorithms and analysis are based on compressed sensing theory and involve computing approximate minimizers of certain weighted ℓ^1-minimization problems. Here we make several additional contributions: (i) We provide precise error rates for polynomial approximation via compressed sensing. As noted, most prior work on compressed sensing involves quantifying the sample complexity to obtain a certain (weighted) best approximation error. Subject to a holomorphy assumption, we use this to obtain specific algebraic and exponential rates.
(ii) Prior works consider polynomial approximations formed by exact minimizers of nonlinear optimization problems. We introduce novel, efficient algorithms to compute approximate minimizers in finite computational time (see also below).
(iii) While these algorithms are motivated by the desire to have full error bounds, they are also completely practical. We present a series of numerical experiments demonstrating their practical efficacy. In fact, our experiments show that these algorithms work even better than our theoretical results suggest.
(iv) Most prior works on compressed sensing (with the exception of [54]) focus on scalar-valued functions, e.g., quantities of interest of parametric DEs. We develop algorithms that work in the Hilbert-valued setting, and, crucially, provide error bounds that take into account discretization error.
More precisely, our algorithms first formulate the approximation problem as the recovery of a finite, Hilbert-valued vector (i.e., an element of V N ) via a so-called weighted, Square-Root LASSO (SR-LASSO) optimization problem. The use of the SR-LASSO, as opposed to the classical LASSO or various constrained formulations, is crucial to this work. It is noise-blind. Hence it allows us to devise algorithms that do not require any a priori (and generally unavailable) estimates on the measurement error n i in (1.2) or the truncation error with respect to the finite polynomial space in which the approximation is constructed.
To develop algorithms, we use two key ideas. First, we use a powerful, general-purpose first-order optimization method for solving nonsmooth, convex optimization problems. Second, we use the technique of restarts to drastically accelerate its convergence. For the former, we employ the primal-dual iteration (also known as the Chambolle-Pock algorithm) [30,31]. We present error bounds for this method for solving the Hilbert-valued, weighted SR-LASSO, which decay like O(1/t), where t is the iteration number. Next, we use a novel restarting procedure, recently introduced in [50,51], to obtain faster, exponential decay of the form O(e^{−t}).
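To make the two ingredients concrete, the following is a minimal sketch of the primal-dual (Chambolle-Pock) iteration applied to a scalar-valued, unweighted SR-LASSO problem, together with a simple warm-start restarting loop. The function names, step-size choices, and the plain warm-start restart are our illustrative assumptions; the restarted scheme analyzed in this work involves additional parameter updates at each restart.

```python
import numpy as np

def srlasso_pd(A, b, lam, tau, sigma, iters, x0=None):
    """Primal-dual (Chambolle-Pock) iteration for the SR-LASSO
    min_x lam * ||x||_1 + ||A x - b||_2 (scalar-valued, unweighted sketch).
    Convergence requires tau * sigma * ||A||_2^2 <= 1."""
    m, N = A.shape
    x = np.zeros(N) if x0 is None else x0.copy()
    xbar = x.copy()
    xi = np.zeros(m)
    for _ in range(iters):
        # Dual step: the prox of the Fenchel conjugate of ||. - b||_2
        # is a shifted projection onto the unit l2-ball.
        z = xi + sigma * (A @ xbar) - sigma * b
        xi = z / max(1.0, np.linalg.norm(z))
        # Primal step: soft-thresholding, the prox of tau * lam * ||.||_1.
        w = x - tau * (A.T @ xi)
        x_new = np.sign(w) * np.maximum(np.abs(w) - tau * lam, 0.0)
        xbar = 2.0 * x_new - x  # over-relaxation
        x = x_new
    return x

def restarted_srlasso(A, b, lam, restarts, iters_per_restart):
    """Restarting sketch: feed the previous output back in as a warm start."""
    nrm = np.linalg.norm(A, 2)
    tau = sigma = 0.9 / nrm  # so that tau * sigma * ||A||_2^2 = 0.81 <= 1
    x = None
    for _ in range(restarts):
        x = srlasso_pd(A, b, lam, tau, sigma, iters_per_restart, x0=x)
    return x
```

Note that, unlike the LASSO, the data-fidelity term here is the (unsquared) norm ‖Ax − b‖_2, whose prox involves a ball projection rather than a linear solve; this is what makes the formulation noise-blind.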
To the best of our knowledge, this is the first time either the primal-dual iteration or a restarting scheme has been applied to the problem of sparse polynomial approximation. Many existing works use black-box solvers such as SPGL1 [136,137]. See [54] for a forward-backward splitting technique in combination with Bregman iterations and fixed-point continuation and [135] for an approach based on Douglas-Rachford splitting. Besides its amenability to theoretical analysis, the primal-dual scheme is also particularly attractive because of its insensitivity to parameter choices and the possibility of performing acceleration via restarts.

Related work
The systematic study of best s-term polynomial approximation of high- or infinite-dimensional holomorphic functions began around 2010. See [22,23,25,27,37,43,44,73,110,132,134] and, in particular, [42] and [8, Chpt. 3]. Note that many of these works assume the function is a solution of a parametric PDE, and therefore first demonstrate that such a function is holomorphic. However, other works avoid this step and use specific properties of the DE to obtain refined estimates. See, e.g., [18,19].
Our work combines and extends several key elements of this literature. First, weighted ℓ^1-minimization, which was developed in [1-3, 5, 38, 111, 115, 144] and [8]. Second, the notions of lower and anchored sets (see §2.7). These have been extensively studied in the best s-term polynomial approximation literature. Compressed sensing techniques aiming to exploit such structures were first considered in [2,3,38] and [8, Chpt. 7]. Third, the extension of classical compressed sensing theory from vectors in R^N (or C^N) to Hilbert-valued vectors in V^N. This was developed in [54]. In order to prove our main results, we also extend this framework to the weighted setting.
See [30-32] for more on the primal-dual iteration and [116-118] for the general notion of restarts in continuous optimization. Note that there are also various non-optimization based techniques in the compressed sensing literature (see, e.g., [58]), including iterative thresholding and greedy methods (the latter are closely related to the adaptive least-squares methods discussed earlier [8, §6.2.5]). However, these do not currently possess theoretical guarantees in the weighted setting.
There have been several previous attempts to connect compressed sensing theory for analyzing the sample complexity of polynomial approximations via (weighted) ℓ^1-minimization and best s-term polynomial approximation theory. In [113], the authors consider approximating scalar quantities of interest of solutions to affine parametric operator equations in Banach spaces. Assuming a certain weighted summability criterion, they first show holomorphy of the parametric solution map and then use a weighted ℓ^1-minimization procedure in combination with Chebyshev polynomials to derive algebraic rates of convergence, similar to (1.4). Our work is more general, since its starting point is a holomorphic function, not a solution of a parametric operator equation. We also consider Hilbert-valued functions, i.e., the whole solution map, not a scalar quantity of interest of it. Moreover, the work of [113] is based on exact minimizers of certain constrained, weighted ℓ^1-minimization problems, whereas we construct full algorithms. Recently, and concurrently with the writing of this paper, some similar results were presented in the book [8] written by two of the authors. However, these only consider the scalar-valued case and do not address algorithms, which is the main focus of this work.

Outline
The remainder of this paper proceeds as follows. We commence in §2 with various preliminaries, including key notation and best s-term polynomial approximation theory. Next, in §3 we first formally define the problem and then state our main results on the existence of algorithms. In §4 we derive these algorithms. Then in §5 we present numerical experiments demonstrating their practical performance. §6-10 are devoted to the proofs of the main results. See §6 for a detailed overview of these sections. Finally, in §11 we present our conclusions.

Preliminaries
In this section, we introduce key preliminary material needed later in the paper. After some initial notation, we define the domains (the symmetric hypercubes), probability measures (the uniform and Chebyshev measures, respectively) and the Lebesgue-Bochner spaces. We next formalize our main smoothness assumption: namely, holomorphy in suitable (unions of) Bernstein polyellipses. We then introduce orthogonal polynomial expansions and best s-term polynomial approximations, before discussing sequence spaces and best s-term approximations of sequences. Finally, we conclude by reviewing algebraic and exponential rates of convergence for best s-term polynomial approximations, before a short discussion on lower and anchored sets.

Notation
We first introduce some notation. For 1 ≤ p ≤ ∞, we write ‖·‖_p for the usual vector ℓ^p-norm and for the induced matrix ℓ^p-norm. When 0 < p < 1, we use the same notation to denote the ℓ^p-quasinorm. For 1 ≤ p, q < ∞ we define the matrix ℓ^{p,q}-norm of an m × n matrix G = (G_{ij})_{i,j=1}^{m,n} as

‖G‖_{p,q}^q := Σ_{j=1}^n ( Σ_{i=1}^m |G_{ij}|^p )^{q/p},

and similarly when p = ∞ or q = ∞.
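For concreteness, the mixed norm can be computed directly from its definition: take the ℓ^p-norm of each column of G and combine the results in the ℓ^q sense. A small sketch (the function name is ours):

```python
def matrix_pq_norm(G, p, q):
    """||G||_{p,q} = ( sum_j ( sum_i |G_ij|^p )^{q/p} )^{1/q}:
    the l^p-norm of each column of G, combined in the l^q sense."""
    col_norms = [sum(abs(G[i][j]) ** p for i in range(len(G))) ** (1.0 / p)
                 for j in range(len(G[0]))]
    return sum(c ** q for c in col_norms) ** (1.0 / q)
```

For p = q = 2 this recovers the Frobenius norm, and for p = q = 1 the entrywise ℓ^1-norm.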
Throughout this paper, we consider sets of multi-indices. Let d ∈ N. Then we define the multi-index set F as the set of nonnegative multi-indices, i.e.

F = N_0^d = { ν = (ν_1, …, ν_d) : ν_k ∈ N_0 }.
When d = ∞, we consider multi-indices in N_0^N with at most finitely-many nonzero terms, i.e. we define

F = { ν = (ν_k)_{k=1}^∞ ∈ N_0^N : ν_k ≠ 0 for at most finitely many k }.

In either finite or infinite dimensions, we write 0 and 1 for the multi-indices consisting of all zeros and all ones, respectively. Finally, the inequality µ ≤ ν is understood componentwise for any multi-indices µ and ν.

Domains and function spaces
In either finite or infinite dimensions, for 1 ≤ p ≤ ∞ we write L^p(U) for the corresponding weighted Lebesgue spaces of complex scalar-valued functions over U and ‖·‖_{L^p(U)} for their norms.
Throughout, we let V be a separable Hilbert space over C (it presents few difficulties to consider the complex field instead of the real field). We write ⟨·,·⟩_V and ‖·‖_V for its inner product and norm. We define the weighted (Lebesgue-)Bochner space L^p(U; V) as the space consisting of (equivalence classes of) strongly measurable functions f : U → V for which ‖f‖_{L^p(U;V)} < ∞. When V is infinite dimensional, we usually cannot work directly with it. Hence, we consider a finite-dimensional discretization

V_h ⊆ V, dim(V_h) < ∞.  (2.6)

Here h > 0 denotes a discretization parameter, e.g., the mesh size in the case of a finite element discretization (as is common in parametric DEs). In the context of finite elements, assuming (2.6) corresponds to considering so-called conforming discretizations. We let {ϕ_k}_{k=1}^K be a (not necessarily orthonormal) basis of V_h, where K = K(h) = dim(V_h). We write P_h : V → V_h for the orthogonal projection onto V_h and, for f ∈ L^2(U; V), we let P_h f ∈ L^2(U; V_h) be the function defined almost everywhere as

(P_h f)(y) = P_h(f(y)), y ∈ U.  (2.7)
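As a small coordinate-level illustration of the projection P_h (assuming, for the sketch only, V = C^n with the Euclidean inner product and a hypothetical non-orthonormal basis), the coefficients of P_h f in the basis {ϕ_k} solve a linear system whose matrix is the Gram matrix of the basis:

```python
import numpy as np

def orthogonal_projection_coeffs(phi, f):
    """Coefficients a of P_h f = sum_k a_k phi_k for a (not necessarily
    orthonormal) basis phi = [phi_1, ..., phi_K] of V_h: solve
    G a = (<f, phi_k>)_k, where G_kl = <phi_l, phi_k> is the Gram matrix."""
    Phi = np.column_stack(phi)      # each column is one basis vector
    G = Phi.conj().T @ Phi          # Gram matrix of the basis
    rhs = Phi.conj().T @ f          # inner products <f, phi_k>
    return np.linalg.solve(G, rhs)
```

If the basis were orthonormal, G would be the identity and the coefficients would reduce to the plain inner products ⟨f, ϕ_k⟩.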

Holomorphy
Here we recall the definition of holomorphy and holomorphic extension for Hilbert-valued functions. We note that equivalent definitions are possible (see, e.g., [76, Chapter 2]) and that the definition employed in this work is based on the notion of the Gateaux partial derivative. For other details on differentiability of Hilbert-valued functions we refer to [21, Chapter 17] and the references therein. Note the following definitions apply in both the finite- (d ∈ N) and infinite- (d = ∞) dimensional settings, where we recall that [d] = N and C^d = C^N when d = ∞. We are interested in approximating Hilbert-valued functions f : U → V that are holomorphic in suitable complex regions containing U; specifically, regions defined by Bernstein (poly)ellipses. When d = 1 the Bernstein ellipse of parameter ρ > 1 is defined by

E(ρ) = { (z + z^{−1})/2 : z ∈ C, 1 ≤ |z| ≤ ρ } ⊂ C.

This is an ellipse with ±1 as its foci and major and minor semi-axis lengths given by (ρ + ρ^{−1})/2 and (ρ − ρ^{−1})/2, respectively. For d ∈ N ∪ {∞}, given ρ = (ρ_j)_{j=1}^d ∈ R^d with ρ > 1, we define the Bernstein polyellipse as the Cartesian product

E(ρ) = E(ρ_1) × E(ρ_2) × ⋯ ⊂ C^d.

We denote the class of Hilbert-valued functions that are holomorphic in E(ρ) with norm at most one as B(ρ). In infinite dimensions, we also consider a class of functions that are holomorphic in a certain union of Bernstein polyellipses. Let 0 < p < 1, ε > 0 and b = (b_j)_{j∈N} ∈ ℓ^p(N). We define

R(b, ε) = ⋃ { E(ρ) : ρ ≥ 1, Σ_{j=1}^∞ ((ρ_j + ρ_j^{−1})/2 − 1) b_j ≤ ε }.

In analogy with B(ρ), we use the corresponding notation for the space of functions that are holomorphic in R(b, ε) with norm at most one.
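The Bernstein ellipse is conveniently parametrized via the Joukowski map z ↦ (z + z^{−1})/2 applied to the circle of radius ρ. A small sketch (function names ours):

```python
import cmath

def bernstein_ellipse_point(rho, theta):
    """Boundary point of the Bernstein ellipse E(rho): the image of
    rho * e^{i theta} under the Joukowski map z -> (z + 1/z) / 2."""
    z = rho * cmath.exp(1j * theta)
    return (z + 1.0 / z) / 2.0

def semi_axes(rho):
    """Major and minor semi-axis lengths (rho + 1/rho)/2 and (rho - 1/rho)/2;
    the foci are at +1 and -1 for every rho > 1."""
    return (rho + 1.0 / rho) / 2.0, (rho - 1.0 / rho) / 2.0
```

A useful consistency check: for every boundary point w, the sum of distances to the foci satisfies |w − 1| + |w + 1| = ρ + ρ^{−1}, confirming the foci are at ±1.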

Orthogonal polynomials, polynomial expansions and best s-term polynomial approximation
Under mild assumptions on a probability measure on [−1, 1], there exists a corresponding sequence of orthonormal polynomials {Ψ^{(1)}_ν}_{ν∈N_0}, where each Ψ^{(1)}_ν is a polynomial of degree ν. For the measures (2.3), these are the Legendre and Chebyshev polynomials, respectively. Given the corresponding tensor-product measure on U = [−1, 1]^d, we construct an orthonormal basis via

Ψ_ν(y) = ∏_{k∈[d]} Ψ^{(1)}_{ν_k}(y_k), y ∈ U, ν ∈ F.

Note that Ψ^{(1)}_0 = 1 since the underlying univariate measure is a probability measure. Therefore, since ν ∈ F has only finitely-many nonzero entries, in infinite dimensions this is equivalent to

Ψ_ν(y) = ∏_{k : ν_k ≠ 0} Ψ^{(1)}_{ν_k}(y_k),

which is a product of finitely-many terms.
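In the Chebyshev case, the univariate orthonormal basis with respect to the arcsine probability measure is ψ_0 = 1 and ψ_n = √2 T_n for n ≥ 1, so Ψ_ν can be evaluated over the finite support of ν. A sketch (the sparse dictionary encoding of ν is our convention):

```python
import math

def cheb_psi_1d(n, y):
    """Orthonormal Chebyshev polynomial w.r.t. the arcsine measure:
    psi_0 = 1 and psi_n = sqrt(2) * T_n(y) for n >= 1."""
    if n == 0:
        return 1.0
    return math.sqrt(2.0) * math.cos(n * math.acos(y))

def cheb_psi(nu, y):
    """Tensor-product basis function Psi_nu(y) = prod_k psi_{nu_k}(y_k).
    nu is given sparsely as a dict {index k: nu_k}, so only the finitely
    many nonzero entries enter the product (the d = infinity convention)."""
    out = 1.0
    for k, nk in nu.items():
        out *= cheb_psi_1d(nk, y[k])
    return out
```

For example, for ν with ν_1 = 1 and ν_3 = 2, the product has exactly two factors regardless of the ambient dimension.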
Let f ∈ L^2(U; V). Then it has the convergent expansion (in L^2(U; V)) given by

f = Σ_{ν∈F} c_ν Ψ_ν, where c_ν = ∫_U f(y) Ψ_ν(y) dϱ(y),  (2.10)

where the coefficients c_ν are elements of V. Now let S ⊂ F be a finite index set and

P_{S;V} := { Σ_{ν∈S} c_ν Ψ_ν : c_ν ∈ V }  (2.11)

be the corresponding space of V-valued polynomials with indices in S. Given this, the L^2(U; V)-norm best s-term polynomial approximation f_s of f is defined as

f_s ∈ argmin { ‖f − g‖_{L^2(U;V)} : g ∈ P_{S;V}, S ⊂ F, |S| ≤ s }.  (2.12)

Note that f_s has the explicit expression

f_s = Σ_{ν∈S*} c_ν Ψ_ν,

where S* ⊂ F, |S*| = s, is a set consisting of the multi-indices of the largest s values of the coefficient norms (‖c_ν‖_V)_{ν∈F}. By Parseval's identity, the error satisfies

‖f − f_s‖_{L^2(U;V)}^2 = Σ_{ν∉S*} ‖c_ν‖_V^2.  (2.14)

Sequence spaces and best s-term approximation of sequences
The equivalence (2.14) motivates studying s-term approximation of the sequences of polynomial coefficients. To do this, we now introduce the necessary further notation. Let Λ ⊆ F denote a (possibly infinite) multi-index set. We write v = (v_ν)_{ν∈Λ} for a sequence with V-valued entries, v_ν ∈ V. For 1 ≤ p ≤ ∞, we define the space ℓ^p(Λ; V) as the set of those sequences v for which the norm

‖v‖_{p;V} := ( Σ_{ν∈Λ} ‖v_ν‖_V^p )^{1/p}

is finite (with the usual modification when p = ∞). Note that ℓ^2(Λ; V) is a Hilbert space with inner product

⟨u, v⟩_{2;V} := Σ_{ν∈Λ} ⟨u_ν, v_ν⟩_V.

On occasion, we will consider complex, scalar-valued sequences. In this case, V = (C, |·|) in the various definitions above. For ease of notation, we simply write ℓ^p(Λ), ‖·‖_p, ⟨·,·⟩_2 and so forth in this case. For s ∈ N, the ℓ^p-norm best s-term approximation error of v ∈ ℓ^p(Λ; V) is

σ_s(v)_{p;V} := min { ‖v − z‖_{p;V} : z ∈ ℓ^p(Λ; V), z has at most s nonzero entries }.

Let c = (c_ν)_{ν∈F} be the coefficients of some function f ∈ L^2(U; V), as defined in (2.10). Then, when p = 2, we have the following:

‖f − f_s‖_{L^2(U;V)} = σ_s(c)_{2;V},

where f_s is its best s-term polynomial approximation (2.12). Thus, we can study the error of f_s by studying the quantity σ_s(c)_{2;V}.
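By the identity above, computing σ_s(c)_{2;V} only requires the coefficient norms ‖c_ν‖_V: sort them, keep the s largest, and take the ℓ^2-norm of the tail. A minimal sketch for a finite list of coefficient norms:

```python
def sigma_s(coeff_norms, s):
    """sigma_s(c)_{2;V}: the l2-norm of the tail of the coefficient
    norms (||c_nu||_V) once the s largest are kept. By Parseval, this
    equals the best s-term error ||f - f_s|| in the L2 Bochner norm."""
    tail = sorted(coeff_norms, reverse=True)[s:]
    return sum(t * t for t in tail) ** 0.5
```

Note that only the norms of the V-valued coefficients enter; the best s-term approximation itself keeps the s coefficients realizing the largest norms.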

Rates of best s-term polynomial approximation
As noted, best s-term polynomial approximation of holomorphic functions is a well-studied subject, especially in the context of solutions of parametric DEs. See, e.g., [22,23,25,27,37,43,44,73,110,132,134] and, in particular, [42] and [8, Chpt. 3]. In this section, we recap two standard types of error decay rates for this approximation, those of algebraic and exponential type, respectively. Note that these results are for Chebyshev and Legendre polynomial approximations, the main focus of this work. The latter type of decay rate holds in finite dimensions, while the former holds in both finite and infinite dimensions. In this work, these error decay rates serve as the optimal benchmark against which to compare the approximations computed from sample values.
The following two results are standard, and have appeared in various different guises in the aforementioned works.

Theorem 2.5 (Algebraic rates of convergence; infinite-dimensional case). Let 0 < p < 1, ε > 0 and b = (b_j)_{j∈N} ∈ ℓ^p(N), and let f be holomorphic in R(b, ε) with norm at most one. Let c = (c_ν)_{ν∈F} be as in (2.10). Then, for every s ≥ 1, there exist a lower set S_1 ⊂ F and an anchored set S_2 ⊂ F with |S_1|, |S_2| ≤ s such that

‖f − f_{S_i}‖_{L^2(U;V)} ≤ C · s^{1/2 − 1/p}, i = 1, 2,

where f_{S_i} = Σ_{ν∈S_i} c_ν Ψ_ν for i = 1, 2 and C = C(b, ε, p) > 0 depends on b, ε and p only.
We next state a result on exponential convergence in finite dimensions. Such rates have been established in various different works (see, e.g., [22,23,42,110,134]). The following result is a minor modification of [8, Thm. 3.25], in which we allow arbitrary s ≥ 1 at the expense of a constant C in the error bound. Theorem 2.6 (Exponential rates of convergence; finite-dimensional case). Let 0 < p ≤ 2 and f ∈ B(ρ) for some ρ > 1. Let c = (c_ν)_{ν∈N_0^d} be as in (2.10). Then, for every s ≥ 1 there is a set S ⊂ F, |S| ≤ s, such that

‖f − f_S‖_{L^2(U;V)} ≤ C · exp(−γ s^{1/d})  (2.19)

for a constant C = C(ρ, p, d) > 0, where γ = γ(ρ, d) > 0 is as in (2.20).
In Appendix A we show how these three theorems can be obtained as immediate consequences of several more general results.
Remark 2.7 It is possible to improve the rate (2.19) by removing the (d+1) −1 factor in (2.20) [134]. The difficulty in doing this is that such rates are not necessarily attained in lower sets (this is, however, true if ρ is sufficiently large -see [8,Lem. 7.20]). As we discuss next, lower sets are a crucial ingredient in our analysis. Conversely, the rates described in Theorem 2.6 can always be attained in lower sets.

Lower and anchored sets
Our objective in this work is to construct a polynomial approximation that satisfies similar error bounds to those of the best s-term approximation f_s, for any holomorphic function f. Hence, ideally, we would have access to the multi-index set S corresponding to the largest s coefficients of f (measured in the V-norm). As discussed, this is not possible in general, since the only information we have about f is its values at a finite number of sample points. Another problem is that such coefficients could occur at arbitrarily-large multi-indices, thus necessitating a search over infinitely-many multi-indices. Fortunately, it is well known that near-best s-term polynomial approximations can be constructed using sets of multi-indices with additional structure. These are lower sets (used in the finite-dimensional case) and anchored sets (used in the infinite-dimensional case). Classical references for lower and anchored sets include [52,88,93,131]. More recently, these structures have been used extensively in the construction of interpolation, least-squares and compressed sensing schemes for polynomial approximation with desirable sample complexity bounds (see, e.g., [8] and references therein).
Definition 2.8. A set Λ ⊆ F is lower if the following holds for every ν, µ ∈ F:

(ν ∈ Λ and µ ≤ ν) ⟹ µ ∈ Λ.

A set Λ ⊆ F is anchored if it is lower and if the following holds for every j ∈ N:

e_j ∈ Λ ⟹ {e_1, e_2, …, e_j} ⊆ Λ,

where e_j denotes the multi-index with 1 in its jth entry and 0 in all others. Lower sets are typically used in finite-dimensional settings, with anchored sets being employed in infinite dimensions. They are a key notion we exploit in this paper. To underscore the usefulness of these structures, we remark in passing that the rates articulated in Theorems 2.4-2.6 can, up to possible changes in the constants, also be attained using s-term approximations in lower or anchored sets S_i. See Appendix A.
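Both conditions are straightforward to verify for a finite set of multi-indices: lowerness need only be checked on the single decrements ν − e_j, and anchoring additionally requires that the canonical multi-indices present form an initial segment e_1, …, e_J. A sketch, encoding multi-indices as tuples with trailing zeros trimmed:

```python
def trim(nu):
    """Drop trailing zeros so each multi-index has a canonical tuple form."""
    while nu and nu[-1] == 0:
        nu = nu[:-1]
    return nu

def is_lower(Lam):
    """Lower: nu in Lam and mu <= nu (componentwise) implies mu in Lam.
    It suffices to check the single decrements nu - e_j."""
    S = {trim(tuple(nu)) for nu in Lam}
    for nu in S:
        for j, nj in enumerate(nu):
            if nj > 0 and trim(nu[:j] + (nj - 1,) + nu[j + 1:]) not in S:
                return False
    return True

def is_anchored(Lam):
    """Anchored: lower, and e_j in Lam implies e_i in Lam for all i <= j."""
    S = {trim(tuple(nu)) for nu in Lam}
    if not is_lower(S):
        return False
    # after trimming, e_j is exactly the tuple of length j ending in a 1
    present = {len(nu) for nu in S if sum(nu) == 1}
    return present == set(range(1, max(present) + 1)) if present else True
```

For instance, {0, e_2} is lower but not anchored, since e_2 belongs to the set while e_1 does not.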

Problem statement and main results
In this section, we first formally define the problem we aim to solve before stating our main results. This paper concerns algorithms for computing approximations of Hilbert-valued functions from finitely-many sample values. We define this concept formally in a moment. For now, though, we consider that an algorithm must take a finite input and produce a finite output. Hence, in order to discuss algorithms, we first need to define what these finite inputs and outputs are in our setting.

Samples
Let f ∈ L^2(U; V) be the function we seek to approximate. Throughout this work, we consider m sample points y_1, …, y_m ∈ U drawn randomly and independently according to the underlying probability measure. Corresponding to each sample point, we consider the noisy sample values

d_i = f(y_i) + n_i ∈ V_h, i = 1, …, m,

where n = (n_i)_{i=1}^m ∈ V^m is an error term, referred to as the sampling error. Observe that the sample values d_i are assumed to be elements of the finite-dimensional space V_h. This is a natural assumption to make. Indeed, in the context of parametric DEs, the value f(y) (the solution of the DE with parameter value y) is typically computed via a (finite element) discretization of the DE, thus yielding an element of V_h, which is the corresponding discrete (finite element) space.
As a result of the assumption d_i ∈ V_h, the error term n_i encompasses the error involved in approximating f(y_i) ∈ V by an element of V_h (e.g., the (finite element) discretization error in the context of a parametric DE). Note that we do not specify precisely how such an approximation is performed, nor how large an error it incurs. In other words, we consider the computation that evaluates f at y_i as a black box. A particular case of interest is when the d_i are the orthogonal projections of the exact sample values f(y_i), i.e.,

d_i = P_h(f(y_i)), i = 1, …, m.
However, we do not assume this in what follows, since in practice the numerical computation that yields the d_i may not involve computing the projection P_h. Our objective is to develop algorithms for which the error scales linearly in ‖n‖_{2;V}, the norm of the noise, thus accounting for any black-box mechanism for computing the samples.
Recall that we consider a basis {φ_k}_{k=1}^K for V_h. We assume that the computation that evaluates f(y_i) produces the coefficients of the sample values d_i in this basis (i.e., the finite element coefficients in the aforementioned example). Therefore, we now write the sample values as d_i = Σ_{k=1}^K d_{ik} φ_k, i = 1, ..., m, (3.1) and consider the values d_{ik} ∈ C as the data we obtain by sampling f.

Problem statement
We now formally define the input and output of the algorithm. The input of the algorithm is the collection of sample points (y_i)_{i=1}^m and the array of mK values (d_{i,k})_{i,k=1}^{m,K} ∈ C^{m×K} defined by (3.1). We now define the output. To this end, we first fix a multi-index set Λ ⊂ F of size |Λ| = N for some N ≥ 1. This set defines a polynomial space P_{Λ;V_h}, as in (2.11), within which we shall construct the resulting polynomial approximation. Hence, we consider an approximation of the form f̂ ∈ P_{Λ;V_h} given by f̂ = Σ_{j=1}^N Σ_{k=1}^K ĉ_{jk} φ_k Ψ_{ν_j}, (3.2) where ν_1, ..., ν_N is some indexing of the multi-indices in Λ. In this way, we formally define the output of the algorithm as the coefficients (ĉ_{jk})_{j,k=1}^{N,K} ∈ C^{N×K}. Finally, in order to define an algorithm we need one additional ingredient. Let G = (⟨φ_l, φ_k⟩_V)_{k,l=1}^K ∈ C^{K×K} (3.3) denote the Gram matrix of the basis {φ_k}_{k=1}^K ⊂ V_h. Note that G is self-adjoint and positive definite. However, G is only equal to the identity when {φ_k}_{k=1}^K is orthonormal. In what follows, we assume that it is possible to perform matrix-vector multiplications with G. In other words, we have access to the function T_G : C^K → C^K, x ↦ Gx. For convenience, we write F(G) for the maximum number of arithmetic operations and comparisons required to evaluate T_G(x) for arbitrary x. Note that F(G) ≤ K^2 in general. However, this may be smaller when G is structured. For instance, in the case of a finite element discretization, this computation can often be performed in O(K) operations.
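To illustrate the role of T_G and F(G), the following minimal sketch shows how a structured Gram matrix permits an O(K) matrix-vector multiplication. The tridiagonal matrix here is a hypothetical example (the mass matrix of 1D piecewise-linear finite elements on a uniform grid, whose entries 2h/3 and h/6 are standard for that case); it is not taken from the paper.

```python
import numpy as np

def make_tridiag_matvec(main, off):
    """Return T_G : x -> G x for a symmetric tridiagonal Gram matrix G,
    stored via its main diagonal and off-diagonal. Each call costs O(K)."""
    def T_G(x):
        y = main * x
        y[:-1] = y[:-1] + off * x[1:]
        y[1:] = y[1:] + off * x[:-1]
        return y
    return T_G

# Hypothetical example: mass matrix of P1 finite elements on a uniform grid.
K, h = 6, 0.1
main = np.full(K, 2 * h / 3)   # diagonal entries 2h/3
off = np.full(K - 1, h / 6)    # off-diagonal entries h/6
T_G = make_tridiag_matvec(main, off)

# Agreement with the dense matrix-vector product.
G = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)
x = np.arange(1.0, K + 1)
assert np.allclose(T_G(x), G @ x)
```

Here each evaluation of T_G uses O(K) operations rather than the O(K^2) of a dense matvec, which is exactly the situation in which F(G) is much smaller than K^2.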
Definition 3.1 (Algorithm for polynomial approximation of Hilbert-valued functions). Let Λ ⊂ F of size |Λ| = N be given, along with an indexing ν_1, ..., ν_N of the multi-indices in Λ. An algorithm for polynomial approximation of Hilbert-valued functions from sample values is a mapping A for which the evaluation of A((y_i), (d_{i,k})) involves only finitely-many arithmetic operations (including square roots), comparisons and evaluations of the matrix-vector multiplication function T_G. The computational cost of an algorithm A is the maximum number of arithmetic operations and comparisons (including those used in the evaluation of T_G) used to compute the output from any input.
Remark 3.2 As formulated above, it is up to the user to choose a suitable multi-index set Λ. Fortunately, as we see in our main results below, this multi-index set is given simply and explicitly in terms of m and another parameter ε (a failure probability). In particular, no 'oracle' knowledge of the function being approximated is required. Thus, one could also make the stronger assertion in what follows in which the algorithm takes the same input, but outputs both the desired index set Λ and the polynomial coefficients. For ease of presentation, we shall not do this.

Remark 3.3
When d = ∞, each sample point y_i is an infinite sequence of real numbers. It is implicit in Definition 3.1 that the algorithm only accesses finitely-many entries of this sequence. This does not cause any problems. As noted, the polynomial approximation is obtained in the index set Λ, which is a finite subset of F. Hence, the multi-indices in Λ are nonzero only in their first n entries, for some n. Therefore, it is only necessary to access the first n entries of each sequence y_i. More concretely, in our main results below, the polynomial approximation in infinite dimensions is obtained in a multi-index set Λ = Λ^HCI_n in which only the first n terms can be nonzero, where n is an integer given explicitly in terms of m and ε.

Main results
We now present the main results of this paper. We reiterate at this stage that these results are formulated for Chebyshev and Legendre polynomials. See §11 for some further discussion on other polynomial systems.
As noted above, these results employ specific choices of the index set Λ in order to obtain the desired approximation rates. Specifically, in finite dimensions, we consider the hyperbolic cross index set Λ^HC_{n,d} = { ν = (ν_k)_{k=1}^d ∈ N_0^d : ∏_{k=1}^d (ν_k + 1) ≤ n }. (3.4) We term n the order of the hyperbolic cross. Note that it is common to consider (3.4) as the hyperbolic cross of order n − 1. We use n here as it is slightly more convenient for this work. When defined this way, Λ^HC_{n,d} is in fact the union of all lower sets (see Definition 2.8) in d dimensions of size at most n (see, e.g., [8, Prop. 2.5]). Thus, this set is a natural choice for polynomial approximation.
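As an illustration, the set Λ^HC_{n,d} can be enumerated by a simple recursion on the coordinates, using the fact that fixing ν_1 = v reduces the product budget n to ⌊n/(v+1)⌋ for the remaining coordinates. The following sketch is our own (the function name and strategy are not from the paper):

```python
def hyperbolic_cross(n, d):
    """Enumerate the hyperbolic cross index set of order n in d dimensions:
    all multi-indices nu in N_0^d with prod_k (nu_k + 1) <= n."""
    out = []
    def extend(prefix, budget):
        if len(prefix) == d:
            out.append(tuple(prefix))
            return
        v = 0
        while v + 1 <= budget:
            # Remaining coordinates must satisfy the product bound with
            # the reduced (integer) budget floor(budget / (v + 1)).
            extend(prefix + [v], budget // (v + 1))
            v += 1
    extend([], n)
    return out

# In one dimension the set is {0, ..., n-1}; in two dimensions with n = 2
# it is the lower set {(0,0), (0,1), (1,0)}.
assert len(hyperbolic_cross(4, 1)) == 4
assert sorted(hyperbolic_cross(2, 2)) == [(0, 0), (0, 1), (1, 0)]
```

Note that every set produced this way is lower, consistent with the fact that Λ^HC_{n,d} is a union of lower sets.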
In infinite dimensions, we define the following index set: Λ^HCI_n = { ν = (ν_k)_{k=1}^∞ ∈ F : ∏_{k=1}^∞ (ν_k + 1) ≤ n, ν_k = 0 for k > n }. (3.5) Similarly, the union of all anchored sets (Definition 2.8) of size at most n in infinite dimensions is a subset of Λ^HCI_n (see, e.g., [8, Prop. 2.18]). Note that Λ^HCI_n is isomorphic to Λ^HC_{n,n} under the restriction of its elements to their first n entries. For convenience, we now also define Θ(n, d) := |Λ^HC_{n,d}| for d < ∞ and Θ(n, ∞) := |Λ^HCI_n| (3.6) as the cardinality of the index set employed. In general, the exact behaviour of Θ(n, d) is unknown. However, it admits a variety of different bounds, which are summarized in (3.7) for d < ∞. These bounds are based on [33, 86]. See also [8]. Finally, we also define α = 1 in the Legendre case and α = log(3)/log(4) in the Chebyshev case, (3.8) and, given m ≥ 3 and ε ∈ (0, 1), a polylogarithmic factor L = L(m, d, ε), defined in (3.9).

Theorem 3.4 (Existence of a mapping; algebraic case, finite dimensions). Let {Ψ_ν}_{ν∈F} be either the orthonormal Chebyshev or Legendre basis and {φ_k}_{k=1}^K be a basis for V_h. Then for every m ≥ 3, 0 < ε < 1 and K ≥ 1, there is a mapping M : U^m × C^{m×K} → C^{N×K}, where N = Θ(n, d) is as in (3.6) with n = m/L and L = L(m, d, ε) as in (3.9), with the following property. Let f ∈ B(ρ) for arbitrary ρ > 1, draw y_1, ..., y_m randomly and independently according to ϱ, let (d_{ik})_{i,k=1}^{m,K} ∈ C^{m×K} be as in (3.1) for arbitrary noise terms n = (n_i)_{i=1}^m, and define the approximation f̂ as in (3.2) based on the index set Λ = Λ^HC_{n,d}. Then the following holds with probability at least 1 − ε. The error satisfies (3.10), where ζ is as in (3.11).
We now make several remarks about this result. The same remarks apply (with obvious modifications) to all subsequent results as well. First, notice how the index set Λ in which the approximation is constructed is given completely explicitly in terms of m, d and ε. Thus, as claimed in Remark 3.2, no 'oracle' information about the function being approximated is required. Indeed, notice that the mappings and algorithms described in this theorem are universal, in the sense that they apply equally to any function f ∈ B(ρ) and any ρ > 1.
A key aspect of this theorem is the factor ζ, defined in (3.11), which determines the error bounds (3.10). As claimed in §1.4, it incorporates three key errors arising in the approximation process:
(i) The approximation error. This is the algebraically-decaying term E_app = C·(m/(c_0 L))^{1/2−1/p}. It is completely equivalent to the best s-term approximation error bound in Theorem 2.4, except with s replaced by m/(c_0 L).
(ii) The sampling error. This is the term E_samp, which is proportional to ‖n‖_{2;V}/√m. In other words, the effect of any errors in computing the sample values f(y_i) enters linearly in the overall error bound.
(iii) The physical discretization error. This is the term E_disc, which accounts for the error committed by working in the finite-dimensional space V_h rather than V. Notice that (i) also describes the sample complexity of the scheme. Indeed, Theorem 3.4 asserts that there is a polynomial approximation that can be obtained from m samples that attains the best s-term rate s^{1/2−1/p}, where s = m/(c_0 L) scales like m up to the polylogarithmic factor L. Theorem 3.4 asserts the existence of a mapping that takes sample values as its input and produces the coefficients of a polynomial approximation attaining a desired error bound as its output. This mapping, as we see later, arises as a minimizer of a certain weighted ℓ^1-minimization problem. Thus, it is not an algorithm in the sense of Definition 3.1. In the next two theorems we assert the existence of algorithms that attain the same error, plus additional algorithmic error terms.
Theorem 3.5 (Existence of an algorithm; algebraic case, finite dimensions). Consider the setup of Theorem 3.4. Then, for every t ≥ 1, there exists an algorithm A_t in the sense of Definition 3.1 such that the same property holds, except with (3.10) replaced by (3.12), which contains the additional algorithmic error term 1/t, where c_1, c_2 ≥ 1 are as in (3.10) and ζ is as in (3.11). The computational cost of the algorithm is bounded by (3.13), where n = m/L is as in Theorem 3.4, Θ(n, d) is as in (3.6), α is as in (3.8) and c_3 > 0 is a universal constant.
The key element of this theorem is that the same error bound as in Theorem 3.4 is attained, up to an additional term. In particular, we have the three sources of errors (i)-(iii), plus the following: (iv) The algorithmic error. This is the error E alg = 1/t committed by the algorithm A t in approximately computing the mapping M in Theorem 3.4. It is given in terms of the parameter t, which also enters linearly into the computational cost estimate (3.13).
Unfortunately, the 1/t decay rate of the algorithmic error is slow. Thus, it may be computationally expensive to compute an approximation to within a desired error bound. Fortunately, as we now explain, it is possible to improve it to e −t subject to an additional technical assumption.
Theorem 3.6 (Existence of an efficient algorithm; algebraic case, finite dimensions). Consider the setup of Theorem 3.4. Then, for every t ≥ 1 and ζ' > 0, there exists an algorithm in the sense of Definition 3.1 such that the same property holds whenever ζ' ≥ ζ, except with (3.10) replaced by (3.14), where c_1, c_2 ≥ 1 are as in (3.10) and ζ is as in (3.11). The computational cost of the algorithm is bounded analogously to (3.13). We refer to this as an 'efficient' algorithm, since the parameter t enters linearly in the computational cost but the algorithmic error scales like e^{−t}. The main limitation of this result is that the algorithm parameter ζ' needs to be an upper bound for the true error bound ζ in order for (3.14) to hold. This is a technical assumption used in the proof, and it does not appear necessary in practice. We demonstrate this through numerical experiments in §5.

Algebraic rates of convergence, infinite dimensions
We now consider algebraic rates of convergence in the infinite-dimensional setting. The next three results should be compared against the best s-term approximation result, Theorem 2.5.
Theorem 3.7 (Existence of a mapping; algebraic case, infinite dimensions). Let {Ψ_ν}_{ν∈F} be either the orthonormal Chebyshev or Legendre basis and {φ_k}_{k=1}^K be a basis for V_h. Then for every m ≥ 3, 0 < ε < 1 and K ≥ 1, there is a mapping M : U^m × C^{m×K} → C^{N×K}, where N = Θ(n, ∞) with n = m/L and L = L(m, ∞, ε) as in (3.9), with the following property. Let δ > 0, 0 < p < 1 and b ∈ ℓ^p(N), b > 0, be monotonically nonincreasing. Let f ∈ B(b, δ), draw y_1, ..., y_m randomly and independently according to ϱ, let (d_{ik})_{i,k=1}^{m,K} ∈ C^{m×K} be as in (3.1) for arbitrary noise terms n = (n_i)_{i=1}^m, and define the approximation f̂ as in (3.2) based on the index set Λ = Λ^HCI_n. Then the following holds with probability at least 1 − ε. The error satisfies (3.15), where c_0, c_1, c_2 ≥ 1 are universal constants and C = C(b, δ, p) depends on b, δ and p only.
Theorem 3.8 (Existence of an algorithm; algebraic case, infinite dimensions). Consider the setup of Theorem 3.7. Then, for every t ≥ 1, there exists an algorithm in the sense of Definition 3.1 such that the same property holds, except with (3.15) replaced by a bound containing the additional algorithmic error term 1/t, where c_1, c_2 ≥ 1 are as in (3.15) and ζ is as in (3.16). The computational cost of the algorithm is bounded analogously to (3.13). In finite dimensions, the computational cost estimate (3.13) is somewhat difficult to interpret, since its behaviour depends on the relative sizes of m and d. Fortunately, in infinite dimensions we can give a more informative assessment. Suppose, for simplicity, that K is fixed (for example, K = 1 in the case of a scalar-valued function approximation problem). Then the computational cost is bounded in terms of t, Θ(n, ∞) and a constant c_K > 0 depending on K only. Recall from (3.6) that Θ(n, ∞) = |Λ^HCI_n| = |Λ^HC_{n,n}|. Now, when d = n and n is sufficiently large, the minimum in (3.7) is attained by the second term, e·n^{2+log(n)/log(2)}. Substituting this into the above expression and recalling that n = m/L, where L = L(m, ∞, ε), and that log(m) ≥ 1 since m ≥ 3, we deduce that the computational cost admits the upper bound c_K · t · m^{1+(α+1)·log(4m)/log(2)}.
We conclude that the computational cost (for fixed K and t) is subexponential in m. Further, if we choose t = m 1/p−1/2 in accordance with the algebraically-decaying term in (3.16), then we conclude the following: it is possible to approximate a holomorphic function of infinitely-many variables with error decaying algebraically fast in m via an algorithm whose computational cost is subexponential in m. Whether or not this can be reduced to an algebraic cost is an open problem.
Theorem 3.9 (Existence of an efficient algorithm; algebraic case, infinite dimensions). Consider the setup of Theorem 3.7. Then, for every t ≥ 1 and ζ' > 0, there exists an algorithm in the sense of Definition 3.1 such that the same property holds whenever ζ' ≥ ζ, except with (3.15) replaced by a bound analogous to (3.14), where c_1, c_2 ≥ 1 are as in (3.15) and ζ ≤ ζ' is as in (3.16). The computational cost of the algorithm is bounded in terms of n = m/L (as in Theorem 3.7), Θ(n, ∞) (as in (3.6)), α (as in (3.8)) and a universal constant c_3 > 0.

Exponential rates of convergence, finite dimensions
Finally, we consider exponential rates of convergence in finite dimensions. The following results should be compared against Theorem 2.6.
Theorem 3.10 (Existence of a mapping; exponential case, finite dimensions).
Let {Ψ_ν}_{ν∈F} be either the orthonormal Chebyshev or Legendre basis and {φ_k}_{k=1}^K be a basis for V_h. Then for every m ≥ 3, 0 < ε < 1 and K ≥ 1, there is a mapping M : U^m × C^{m×K} → C^{N×K}, where N = Θ(n, d) is as in (3.6) with n as in (3.19) and L as in (3.9), with the following property. Draw y_1, ..., y_m randomly and independently according to ϱ. Then, with probability at least 1 − ε, the following holds. Let f ∈ B(ρ) for arbitrary ρ > 1, let (d_{ik})_{i,k=1}^{m,K} ∈ C^{m×K} be as in (3.1) for arbitrary noise terms n = (n_i)_{i=1}^m, and define the approximation f̂ as in (3.2) based on the index set Λ = Λ^HC_{n,d}. Then the error satisfies (3.20), where ζ is as in (3.21).
Theorem 3.11 (Existence of an algorithm; exponential case, finite dimensions). Consider the setup of Theorem 3.10. Then, for every t ≥ 1, there exists an algorithm in the sense of Definition 3.1 such that the same property holds, except with (3.20) replaced by (3.22), where c_1, c_2 ≥ 1 are as in (3.20) and ζ is as in (3.21). The computational cost of the algorithm is bounded in terms of t, n (as in (3.19)), Θ(n, d) (as in (3.6)), α (as in (3.8)) and a universal constant c_3 > 0.
Theorem 3.12 (Existence of an efficient algorithm; exponential case, finite dimensions). Consider the setup of Theorem 3.10 and suppose that there is a known upper bound ζ' ≥ ζ, where ζ is as in (3.21). Then, for every t ≥ 1, there exists an algorithm in the sense of Definition 3.1 for which the same property holds, except with (3.20) replaced by (3.23), where c_1, c_2 ≥ 1 are as in (3.20) and ζ is as in (3.21). The computational cost of the algorithm is bounded as in Theorem 3.11, where n is as in (3.19), Θ(n, d) is as in (3.6), α is as in (3.8) and c_3 > 0 is a universal constant.
As before, suppose that K is fixed and, since we consider exponential rates, that d is also fixed. Then, using the third estimate in (3.7), we deduce that the computational cost of this algorithm is polynomial in n. Using the crude bound n ≤ m, we deduce that, for fixed t, the computational cost is polynomial in m as m → ∞. In particular, with the efficient algorithm of Theorem 3.12 (subject to the caveat that an upper bound for the error is known) we deduce the following: in fixed dimension d, it is possible to approximate a holomorphic function with error decaying exponentially fast in m via an algorithm whose computational cost is polynomial in m. Whether or not the polynomial growth rate described above is sharp is an open problem.

Remark 3.13
There is a subtle difference between the algebraic and exponential results. The former are nonuniform in the sense that a single draw of the sample points y 1 , . . . , y m is sufficient for recovery of a fixed function f with high probability up to the specified error bound. The latter are uniform, since a single draw of the sample points y 1 , . . . , y m is sufficient for recovery of any function with high probability up to the specified error bound. The reason for this difference stems from bounding a discrete error term (8.10), which is a random variable dependent on f and the sample points. In the algebraic case, in order to obtain the desired algebraic exponent 1/2 − 1/p we bound this term with high probability for each fixed f . See Step 4 of the proof of Theorem 8.2. This renders the ensuing result nonuniform. Conversely, in the exponential case (where the appearance of small algebraic factors is not a concern, since they can be absorbed into the exponentially-decaying term) we bound this term with probability one for any f . See Step 4 of the proof of Theorem 8.4. Note that one could also derive uniform guarantees in the algebraic case by considering a fixed value of p and letting M and A depend on p, or by considering a restricted range 0 < p ≤ p * < 1. Both strategies involve a larger value of n, with its size depending on p or p * . See [8, §7.6.2] for further discussion.

Construction of the algorithms
In this section, we describe the construction of the algorithms asserted in our main results. These are based on techniques from compressed sensing [8,11,58], which rely on the premise that the polynomial coefficients of a holomorphic function are approximately sparse. There are several main differences between standard compressed sensing and what we develop below. First, following [2,7,8,38,113,115], we work in a weighted setting in order to promote sparsity in lower or anchored sets (recall §2.7). Second, following [54], we work with Hilbert-valued vectors, whose entries take values in the Hilbert space V. Finally, so as to avoid unrealistic assumptions on the functions being approximated, we consider noise-blind decoders, as in [3]. See also Remark 4.1.

Recovery via Hilbert-valued, weighted ℓ^1-minimization
We first require some additional notation. Given N ∈ N, we let V^N be the vector space of Hilbert-valued vectors v = (v_i)_{i=1}^N of length N, where v_i ∈ V, i = 1, ..., N. Next, given Λ ⊆ F and a vector of positive weights w = (w_ν)_{ν∈Λ}, where w > 0, we define the weighted ℓ^p_w(Λ; V) space via the norm ‖v‖_{p,w;V} = ( Σ_{ν∈Λ} w_ν^{2−p} ‖v_ν‖^p_V )^{1/p}. Notice that ℓ^2_w(Λ; V) coincides with the unweighted space ℓ^2(Λ; V). Now, let Λ ⊂ F be a finite multi-index set of size |Λ| = N and consider the ordering Λ = {ν_1, ..., ν_N}. Note that we will, in practice, choose either Λ = Λ^HC_{n,d} when d < ∞ or Λ = Λ^HCI_n when d = ∞, where the order n is as described in the corresponding theorem (Theorems 3.4-3.12).
With this in mind, given f ∈ L^2(U; V), define f_Λ = Σ_{ν∈Λ} c_ν Ψ_ν (4.1) as the truncated expansion of f based on the index set Λ, and c_Λ = (c_{ν_i})_{i=1}^N ∈ V^N (4.2) as the finite vector of coefficients of f with indices in Λ. As explained in §3.2, our objective is, in effect, to approximate these coefficients. We do this as follows. Given y_1, ..., y_m ∈ U, we define the normalized measurement matrix A = ( Ψ_{ν_j}(y_i)/√m )_{i,j=1}^{m,N} ∈ C^{m×N} (4.3) and the normalized measurement and error vectors b = (d_i/√m)_{i=1}^m ∈ V^m_h, e = (n_i/√m)_{i=1}^m ∈ V^m. (4.4) For ease of notation, we make no distinction between the matrix A ∈ C^{m×N} and the corresponding linear operator in what follows. Using this, we obtain b = A c_Λ + e + e', (4.6) where e' = ((f − f_Λ)(y_i)/√m)_{i=1}^m. We have now formulated the recovery of c_Λ as the solution of a noisy linear system (4.6), where the noise term e + e' encompasses both the noise e = (n_i)_{i=1}^m/√m in the sample values and the error e' due to the truncation (4.1) of the infinite expansion (2.10) via the index set Λ.
Due to the discussion in §2.5-2.7, we expect the coefficients c_Λ to be not only approximately sparse, but also well approximated by a subset of s coefficients whose indices define a lower or anchored set. In classical compressed sensing, one exploits sparse structure by minimizing an ℓ^1-norm. To exploit sparse and lower structure, we follow ideas of [2,7,8,38] and use a weighted ℓ^1-norm penalty. Specifically, we compute an approximate solution via the Hilbert-valued, weighted SR-LASSO problem min_{z ∈ V^N_h} λ‖z‖_{1,w;V} + ‖Az − b‖_{2;V}. (4.7) Here λ > 0 is a tuning hyperparameter.

Remark 4.1
As an alternative to solving this Hilbert-valued compressed sensing problem, we could use a formulation based on a constrained basis pursuit or unconstrained LASSO problem. However, we consider the SR-LASSO problem (4.7) instead. While other approaches are arguably more common, based on [3] the SR-LASSO has the desirable property that the optimal value of its hyperparameter λ is independent of the noise term (in this case e + e'). This is not the case for other formulations, whose hyperparameters need to be chosen in terms of the (unknown) magnitude of the noise in order to ensure good theoretical and practical performance (see, e.g., [11, Chpt. 6]). This is particularly problematic in the setting of function approximation, where such terms are function-dependent (for instance, the term e' depends on the expansion tail f − f_Λ) and therefore generally unknown. See [3] and [8, §6.6] for further discussion.
Notice that (4.7) is solved over V^N_h, not V^N, since the latter would not be numerically solvable in general. As we see below, it can be reformulated as an optimization problem over C^{N×K}, where K = dim(V_h). However, since the true coefficients of f are elements of V and not V_h, this discretization inevitably results in an additional error, which must also be accounted for in the analysis. This leads precisely to the physical discretization error (term (iii) in §3.3.1).
Finally, we now also specify the weights. Following [2,7,38] (see also [8, Rem. 2.14]), a good choice of weights (for promoting lower or anchored structure) is given by the so-called intrinsic weights w_ν = ‖Ψ_ν‖_{L^∞(U)}. In particular, for Chebyshev and Legendre polynomials these are given explicitly by w_ν = 2^{‖ν‖_0/2} (Chebyshev) and w_ν = ∏_k √(2ν_k + 1) (Legendre), (4.8) where ‖ν‖_0 := |supp(ν)|. Typically, we index these weights over the multi-indices ν ∈ Λ. However, we will, for convenience, write w_i instead of w_{ν_i} in what follows, where, as above, {ν_1, ..., ν_N} is an ordering of Λ.
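A small sketch of these weights follows (the function and its signature are illustrative, not from the paper; the formulas are the intrinsic weights for the orthonormal Chebyshev and Legendre bases stated above):

```python
import numpy as np

def intrinsic_weights(Lambda, basis="legendre"):
    """Intrinsic weights w_nu = ||Psi_nu||_inf for orthonormal tensor-product
    Chebyshev or Legendre polynomials on [-1, 1]^d."""
    w = []
    for nu in Lambda:
        if basis == "chebyshev":
            # Orthonormal Chebyshev: sup-norm sqrt(2) per active coordinate,
            # giving 2^(|supp(nu)|/2) overall.
            w.append(2.0 ** (sum(k > 0 for k in nu) / 2.0))
        else:
            # Orthonormal Legendre: sup-norm sqrt(2 nu_k + 1) per coordinate.
            w.append(float(np.prod([np.sqrt(2 * k + 1) for k in nu])))
    return np.array(w)

Lambda = [(0, 0), (1, 0), (1, 1)]
assert np.allclose(intrinsic_weights(Lambda, "chebyshev"), [1.0, np.sqrt(2), 2.0])
assert np.allclose(intrinsic_weights(Lambda, "legendre"), [1.0, np.sqrt(3), 3.0])
```

Note how both choices equal 1 at ν = 0 and grow with the support and degree of ν, which is what penalizes indices outside lower or anchored sets.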

Reformulation as a matrix recovery problem and the mappings in Theorems 3.4, 3.7 and 3.10
We now describe the mappings whose existence is asserted in Theorems 3.4, 3.7 and 3.10. These maps all arise via exact solutions of weighted SR-LASSO optimization problems. However, since (4.7) yields a vector in V^N_h and the mappings should yield outputs in C^{N×K}, we first need to reformulate (4.7) using the basis {φ_k}_{k=1}^K. Any v ∈ V_h can be written as v = Σ_{k=1}^K d_k φ_k with ‖v‖^2_V = d*Gd, where d = (d_k)_{k=1}^K ∈ C^K and G ∈ C^{K×K} is the Gram matrix for {φ_k}_{k=1}^K, given by (3.3). Since G is positive definite, it has a unique positive definite square root matrix G^{1/2}. Hence we may write ‖v‖_V = ‖G^{1/2}d‖_2. We now use some additional notation. Given 1 ≤ p ≤ ∞ and 1 ≤ q ≤ 2, we define the weighted ℓ^{p,q}_w-norm of a matrix Z ∈ C^{N×K} with rows z_1, ..., z_N as the weighted ℓ^p_w-norm of the vector (‖z_i‖_q)_{i=1}^N. Further, if p = q = 2, then this is just the unweighted ℓ^{2,2}-norm of a matrix (which is simply its Frobenius norm). In this case, we typically write ‖·‖_{2,2}. Now let z ∈ V^N_h be arbitrary, Z ∈ C^{N×K} be the corresponding matrix of coefficients and z_i ∈ C^K be the ith row of Z. Let A and b be as in (4.3) and (4.4), respectively, and let B ∈ C^{m×K} be the matrix corresponding to b. Then ‖z‖_{1,w;V} = ‖ZG^{1/2}‖_{1,2;w} and ‖Az − b‖_{2;V} = ‖(AZ − B)G^{1/2}‖_{2,2}. Therefore, we now consider the minimization problem min_{Z ∈ C^{N×K}} λ‖ZG^{1/2}‖_{1,2;w} + ‖(AZ − B)G^{1/2}‖_{2,2}. (4.10) This is equivalent to (4.7) in the following sense. A vector ĉ = (ĉ_{ν_i})_{i=1}^N ∈ V^N_h is a minimizer of (4.7) if and only if the matrix Ĉ = (ĉ_{ik})_{i,k=1}^{N,K} ∈ C^{N×K} with entries defined by the relation ĉ_{ν_i} = Σ_{k=1}^K ĉ_{ik} φ_k is a minimizer of (4.10).
With this in hand, we are now ready to define the mappings used in Theorems 3.4, 3.7 and 3.10. These are described in Table 1. Note that these are indeed well-defined mappings, since the minimizer of (4.10) with smallest ℓ^{2,2}-norm is unique (this follows from the facts that (4.10) is a convex problem, so its set of minimizers is a convex set, and that the function Z ↦ ‖Z‖^2_{2,2} is strongly convex). This particular choice is arbitrary, and is made solely so as to have a well-defined mapping. It is of no consequence whatsoever. Indeed, the various error bounds we prove later hold for any minimizer of (4.10).

The primal-dual iteration
To derive the algorithms described in the other main theorems, we need methods for approximately solving the optimization problems (4.7) and (4.10). We use the primal-dual iteration [30] (also known as the Chambolle-Pock algorithm) to this end.

Table 1: The mappings used in Theorems 3.4, 3.7 and 3.10.
• Let m, ε and n be as given in the particular theorem and set Λ = Λ^HC_{n,d} (Theorems 3.4 and 3.10) or Λ = Λ^HCI_n (Theorem 3.7).
• Let Y = (y_i)_{i=1}^m and D = (d_{ik})_{i,k=1}^{m,K} be an input, as in (3.1), and set B = (1/√m)·D.
• Define the output Ĉ = M(Y, D) as the minimizer of (4.10) with smallest ℓ^{2,2}-norm.

We first briefly describe the primal-dual iteration in the general case (see [30-32], as well as [11, §7.5], for more detailed treatments), before specializing to the weighted SR-LASSO problem in the next subsection. Let (X, ⟨·,·⟩_X) and (Y, ⟨·,·⟩_Y) be (complex) Hilbert spaces, g : X → R ∪ {∞} and h : Y → R ∪ {∞} be proper, lower semicontinuous and convex functions, and A : X → Y be a bounded linear operator, and consider the primal problem min_{x∈X} g(x) + h(Ax). (4.11) Under this setting, the (Fenchel-Rockafellar) dual problem is max_{ξ∈Y} −g*(−A*ξ) − h*(ξ), (4.12) where g* and h* are the convex conjugate functions of g and h, respectively. Recall that, for a function f : X → R ∪ {∞}, its convex conjugate is defined by f*(ξ) = sup_{x∈X} { Re⟨x, ξ⟩_X − f(x) }. (4.13) The Lagrangian of (4.11) is defined by L(x, ξ) = g(x) + Re⟨Ax, ξ⟩_Y − h*(ξ), and the primal-dual iteration seeks a solution (x̂, ξ̂) of the associated saddle-point problem by solving the fixed-point equations x̂ = prox_{τg}(x̂ − τA*ξ̂), ξ̂ = prox_{σh*}(ξ̂ + σAx̂), where τ, σ > 0 are stepsize parameters and prox is the proximal operator, which is defined by prox_f(x) = argmin_{z∈X} { f(z) + (1/2)‖x − z‖^2_X }. To be precise, given initial values (x^(0), ξ^(0)) ∈ X × Y, the primal-dual iteration defines a sequence {(x^(n), ξ^(n))}_{n=1}^∞ ⊂ X × Y as follows: x^(n+1) = prox_{τg}(x^(n) − τA*ξ^(n)), ξ^(n+1) = prox_{σh*}(ξ^(n) + σA(2x^(n+1) − x^(n))). (4.16)

The primal-dual iteration for the weighted SR-LASSO problem
We now apply this scheme to (4.7) and (4.10). We first describe an algorithm to approximately solve the Hilbert-valued problem (4.7), before using the equivalence between elements of V N h and C N ×K to obtain an algorithm for approximately solving (4.10).
Consider (4.7). We define X = (V^N_h, ⟨·,·⟩_{2;V}), Y = (V^m_h, ⟨·,·⟩_{2;V}) and g : X → R ∪ {∞}, h : Y → R ∪ {∞} as the proper, lower semicontinuous and convex functions g(z) = λ‖z‖_{1,w;V} and h(v) = ‖v − b‖_{2;V}. We first find the proximal maps of g and h*. Using (4.13), we see that h*(ξ) = Re⟨b, ξ⟩_{2;V} + δ_B(ξ), (4.17) where δ_B is the indicator function of the unit ball B = {ξ ∈ Y : ‖ξ‖_{2;V} ≤ 1}, taking value δ_B(ξ) = 0 when ξ ∈ B and +∞ otherwise. Using this, we obtain prox_{σh*}(ξ) = arg min_{η∈B} ‖(ξ − σb) − η‖_{2;V} = proj_B(ξ − σb), where proj_B is the projection onto B, which is given explicitly by proj_B(ξ) = ξ/max{1, ‖ξ‖_{2;V}}. On the other hand, applying the definition of the proximal operator to the function τg with parameter τ > 0 and adapting [21, Ex. 14.5] to the ‖·‖_V-norm, we deduce that prox_{τg} acts blockwise as the soft-thresholding operator (prox_{τg}(z))_i = max{0, 1 − τλw_i/‖z_i‖_V}·z_i, i = 1, ..., N.

Algorithm 1: primal-dual-wSRLASSO - the primal-dual iteration for the weighted SR-LASSO problem (4.7). Its inputs include the measurement matrix A, the measurements b, the weights w, the parameter λ, the stepsizes τ, σ, the number of iterations T and initial values (c^(0), ξ^(0)); its output is an approximate minimizer of (4.7).

With this in hand, we are now ready to define the primal-dual iteration for (4.7). As we see later, the convergence analysis of the primal-dual iteration is given in terms of the ergodic sequence c̄^(j) = (1/j)·Σ_{i=1}^j c^(i), where c^(i) ∈ V^N_h is the primal variable obtained at the ith step of the iteration. Hence, we include the computation of these sequences in the primal-dual iteration for the weighted SR-LASSO problem (4.7), and take this as the output. The resulting procedure is described in Algorithm 1.
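A compact sketch of this iteration in the scalar-valued case (V = C, G = I), combining the soft-thresholding and ball-projection steps above with the ergodic average, could read as follows (the function name and the small magnitude clamp are our own; this is an illustration, not the paper's Algorithm 1):

```python
import numpy as np

def pd_wsrlasso(A, b, w, lam, tau, sigma, T):
    """Primal-dual iteration for the weighted SR-LASSO
    min_c lam * sum_i w_i |c_i| + ||A c - b||_2 (scalar case, G = I).
    Returns the ergodic average of the primal iterates."""
    m, N = A.shape
    c = np.zeros(N)
    xi = np.zeros(m)
    c_bar = np.zeros(N)
    for it in range(1, T + 1):
        c_old = c
        # Primal step: prox of tau * lam * ||.||_{1,w} is soft-thresholding.
        z = c - tau * (A.conj().T @ xi)
        c = np.maximum(0.0, 1.0 - tau * lam * w / np.maximum(np.abs(z), 1e-15)) * z
        # Dual step: shift by sigma * b, then project onto the unit ball.
        v = xi + sigma * (A @ (2 * c - c_old)) - sigma * b
        xi = v / max(1.0, np.linalg.norm(v))
        # Running ergodic average (1/it) * sum of the primal iterates.
        c_bar = c_bar + (c - c_bar) / it
    return c_bar

# Toy demonstration on a synthetic 1-sparse recovery problem.
rng = np.random.default_rng(0)
m, N = 20, 10
A = rng.standard_normal((m, N)) / np.sqrt(m)
c_true = np.zeros(N); c_true[0] = 1.0
b = A @ c_true
step = 0.9 / np.linalg.norm(A, 2)
c_hat = pd_wsrlasso(A, b, np.ones(N), lam=0.05, tau=step, sigma=step, T=3000)
obj = lambda z: 0.05 * np.sum(np.abs(z)) + np.linalg.norm(A @ z - b)
assert obj(c_hat) < obj(np.zeros(N))
```

In the Hilbert-valued case, |c_i| is replaced by the block norm ‖c_i‖_V and the thresholding acts on whole blocks, exactly as in the prox formula above.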
Having done this, we next adapt Algorithm 1 in the way mentioned previously to obtain an algorithm for (4.10). This is given in Algorithm 2.
Remark 4.2 Note that even though the square-root matrix G^{1/2} is used in Algorithm 2, this matrix never needs to be computed explicitly. Indeed, for a vector d ∈ C^K we have ‖G^{1/2}d‖_2 = √(d*Gd), and for a matrix C ∈ C^{N×K} we have ‖CG^{1/2}‖^2_{2,2} = Σ_{i=1}^N c_i*Gc_i, where c_i ∈ C^K is the ith row of C. In particular, computing ‖G^{1/2}d‖_2 involves c(F(G) + K) arithmetic operations, and computing ‖CG^{1/2}‖_{2,2} involves cN(F(G) + K) arithmetic operations, for some universal constant c > 0.
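This observation can be checked numerically. In the following sketch the SPD matrix is synthetic, and the eigendecomposition is used only to form a reference G^{1/2} for comparison:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
G = M.T @ M + 4 * np.eye(4)   # a synthetic SPD Gram matrix
d = rng.standard_normal(4)

# V-norm via a single matvec with G: ||G^{1/2} d||_2 = sqrt(d^* G d).
via_matvec = np.sqrt(np.real(np.vdot(d, G @ d)))

# Reference: form G^{1/2} explicitly via an eigendecomposition.
vals, vecs = np.linalg.eigh(G)
G_half = vecs @ np.diag(np.sqrt(vals)) @ vecs.T
assert np.isclose(via_matvec, np.linalg.norm(G_half @ d))
```

The matvec route costs F(G) + O(K) operations, whereas forming G^{1/2} costs O(K^3), which is why the algorithms only ever access G through T_G.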
Proof. We proceed line-by-line. Line 2 involves a matrix-matrix multiplication and a matrix subtraction, for a total of at most c·m·N·K arithmetic operations, for some universal constant c. Now consider lines 3-5. By the previous remark, we may calculate ‖G^{1/2}p_i‖_2 = √(p_i*Gp_i) using one multiplication with the matrix G, one inner product of vectors of length K and one square root (recall from Definition 3.1 that we count square roots as arithmetic operations). This involves at most c·(F(G) + K) arithmetic operations. After simplifying, we deduce that lines 2-10 involve at most c·(m·N·K + (K + F(G))·(N + m)) operations. The result now follows by multiplying this by the number of iterations T.

The algorithms in Theorems 3.5, 3.8 and 3.11
We are now almost ready to specify the algorithms used in Theorems 3.5, 3.8 and 3.11. Notice that Algorithms 1 and 2 require the measurement matrix A as an input. Hence, we first describe the computation of this matrix for Chebyshev and Legendre polynomials. This is summarized in Algorithm 3. Notice that line 5 of this algorithm involves evaluating the first k one-dimensional Chebyshev or Legendre polynomials. This can be done efficiently via the three-term recurrence relation, as explained in the proof of the following result.

Table 2: The algorithms used in Theorems 3.5, 3.8 and 3.11.
• Compute A = construct-A(Y, Λ).
• Let G and w be as in (3.3) and (4.8), respectively.

Proof. Consider line 5 of the algorithm. Evaluation of the first k + 1 Chebyshev or Legendre polynomials can be done via the three-term recurrence relation. In the Chebyshev case, this is T̄_{j+1}(y) = 2y·T̄_j(y) − (c_j/c_{j−1})·T̄_{j−1}(y), where c_j = 1 if j ≥ 1 and 1/√2 otherwise, and in the Legendre case, it is P̄_{j+1}(y) = (√(2j+3)/(j+1))·(√(2j+1)·y·P̄_j(y) − (j/√(2j−1))·P̄_{j−1}(y)) (recall that these polynomials are normalized with respect to their respective probability measures). Hence the computational cost for line 5 is bounded by c·n·k. The computational cost for lines 6-8 is precisely N·(k − 1). Hence, the computational cost for forming each row of A is bounded by c·(n·k + N·k). The result now follows.
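For concreteness, here is a sketch of the normalized Legendre evaluation via this recurrence (the function name is ours; the scaling √(2j+1) matches orthonormality with respect to the uniform probability measure on [−1, 1]):

```python
import numpy as np

def orthonormal_legendre(y, k):
    """Evaluate the first k+1 Legendre polynomials at y, orthonormal w.r.t.
    the uniform probability measure on [-1, 1] (so Pbar_j = sqrt(2j+1) P_j),
    using the classical three-term recurrence."""
    P = np.zeros(k + 1)
    P[0] = 1.0
    if k >= 1:
        P[1] = y
    for j in range(1, k):
        # (j+1) P_{j+1} = (2j+1) y P_j - j P_{j-1}  (unnormalized recurrence)
        P[j + 1] = ((2 * j + 1) * y * P[j] - j * P[j - 1]) / (j + 1)
    return P * np.sqrt(2 * np.arange(k + 1) + 1)

# Pbar_1(y) = sqrt(3) y, and Pbar_j(1) = sqrt(2j+1) since P_j(1) = 1.
assert np.isclose(orthonormal_legendre(0.5, 3)[1], np.sqrt(3) * 0.5)
assert np.allclose(orthonormal_legendre(1.0, 3), np.sqrt(2 * np.arange(4) + 1))
```

Each additional degree costs O(1) operations, so all k + 1 values cost O(k), which is the linear-in-k cost used in the proof above.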
With this in hand, we are now ready to specify the algorithms used in Theorems 3.5, 3.8 and 3.11. These are given in Table 2.

4.6 An efficient restarting procedure for the primal-dual iteration and the algorithms used in Theorems 3.6, 3.9 and 3.12

While the primal-dual iteration converges under very general conditions, it typically does so slowly, with the error in the objective function decreasing like O(1/n), where n is the iteration number. To obtain exponential convergence (down to some controlled tolerance), we employ a restarting procedure. This is based on the recent work of [50,51].
Restarting is a general concept in optimization, whereby the output of an algorithm after a fixed number of steps is fed back into the algorithm as input, after suitably scaling the parameters of the algorithm [116-118]. In the case of the primal-dual iteration for the weighted SR-LASSO problem, this procedure involves three hyperparameters: a tolerance ζ' > 0 and scale parameters 0 < r < 1 and s > 0. After applying one step of the scheme (Algorithm 1 or 2) yielding an output c^(1), it then scales this vector and the right-hand side vector b by an exponentially-decaying factor a_l (defined in terms of ζ', r and s), before feeding these values back into the primal-dual iteration as input.

Algorithm 4: primal-dual-rst-wSRLASSO - the restarted primal-dual iteration for the weighted SR-LASSO problem (4.7). Inputs: measurement matrix A ∈ C^{m×N}, measurements b ∈ V^m_h, positive weights w = (w_i)_{i=1}^N, parameter λ > 0, stepsizes τ, σ > 0, number of primal-dual iterations T ≥ 1, number of restarts R ≥ 1, tolerance ζ' > 0, scale parameter 0 < r < 1, constant s > 0, initial values c^(0) = 0 ∈ V^N_h, ξ^(0) = 0 ∈ V^m_h. Output: c̄ = primal-dual-rst-wSRLASSO(A, b, w, λ, τ, σ, T, R, ζ', r, s), an approximate minimizer of (4.7).
We explain the motivations behind the specific form of the restart procedure for the primal-dual iteration later in §9.2. For now, we simply state the procedures in the case of the weighted SR-LASSO problems (4.7) and (4.10). These are given in Algorithms 4 and 5, respectively. With these in hand, we can also give the algorithms used in Theorems 3.6, 3.9 and 3.12. See Table 3.
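The restart logic can be sketched schematically as follows. The scaling schedule a_l shown here is a hypothetical stand-in for the one defined via ζ', r and s, and the inner solver is a toy gradient method, not Algorithm 1:

```python
import numpy as np

def restarted_solver(inner_solver, b, T, R, zeta, r, s):
    """Schematic restart wrapper in the spirit of Algorithm 4: run T inner
    iterations, rescale the iterate and the right-hand side by a decaying
    factor a_l, and feed the result back in. The schedule a_l = s*zeta*r**l
    is a hypothetical stand-in for the factor defined via (zeta', r, s)."""
    c = np.zeros_like(b)
    for l in range(R):
        a_l = s * zeta * r**l              # exponentially-decaying factor
        c = a_l * inner_solver(c / a_l, b / a_l, T)   # solve scaled problem
    return c

# Toy inner solver: T gradient-descent steps on 0.5*||x - b||^2, which
# contracts the error by 2^(-T); restarting then compounds the contraction.
def gd(x0, b, T):
    x = x0.copy()
    for _ in range(T):
        x = x - 0.5 * (x - b)
    return x

b = np.array([1.0, -2.0, 3.0])
assert np.allclose(restarted_solver(gd, b, T=10, R=5, zeta=1.0, r=0.5, s=1.0), b, atol=1e-6)
```

The point of the sketch is structural: each restart runs a fixed number T of inner iterations, so R restarts cost R·T iterations while the error decays exponentially in R, matching the e^{−t} algorithmic error of the 'efficient' algorithms.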
Note that these algorithms involve a number c, which is a universal constant. It is possible to provide a precise numerical value of this constant by carefully tracking the constants in several of the proof steps. Since doing so is not especially illuminating, we forgo this additional effort. Instead, we now give a little more detail on this constant.

Remark 4.5. From (10.10) we see that c = 3296√c_0, where c_0 is the universal constant that arises in (3.11). As shown in the proof of Theorem 8.2, the constant c_0 needs to be chosen sufficiently large so that the measurement matrix A satisfies the so-called weighted RIP. In particular, it is related to the universal constant c > 0 defined in Lemma 8.1. See, in particular, (8.2). A numerical value for this constant can indeed be found using results shown in [38]. With this in hand, one can then keep track of the constant c_0 in the proof of Theorem 8.2 to find its numerical value. This discussion also highlights why tracking the value of c is not particularly illuminating. Indeed, it is well known that universal constants appearing in RIP estimates in compressed sensing are generally very pessimistic [8,11,58].

Experimental setup
We first describe the experimental setup.

Hyperparameter values
The algorithms used in the main theorems (see Tables 2 and 3) are designed to ensure the desired error bounds. In our numerical experiments, we deviate from these values in a number of minor ways. However, our hyperparameter choices are still closely based on theory. We now discuss the hyperparameter choices used in the experiments. These choices are summarized in Table 4.
First, we take the parameter λ to be λ = (5√m)^{-1}. This differs somewhat from the value λ = (4√(m/L))^{-1} used in the theoretical algorithms. The rationale for this is that L is, in practice, a polylogarithmic factor that arises from the compressed sensing theory. It is well known that logarithmic factors appearing in compressed sensing theory are generally quite pessimistic [8,11,58]. Therefore, we avoid using L. The choice λ = (5√m)^{-1} was obtained in [8, App. A] after manual tuning.
As shown later, the primal-dual iteration converges subject to the condition ‖A‖_2^2 ≤ (τσ)^{-1}. See Lemma 9.2. Since the error bound (9.2) scales linearly in τ^{-1} and σ^{-1}, a standard choice for these parameters is

τ = σ = ‖A‖_2^{-1}.   (5.1)

In Tables 2 and 3 we instead choose τ = σ = (Θ(n, d))^{-α}, since (Θ(n, d))^α is an upper bound for ‖A‖_2, i.e., ‖A‖_2 ≤ (Θ(n, d))^α. See (10.9). This bound is arguably quite crude. The reason for using it in our main theorems is to avoid having to compute ‖A‖_2, since this generally cannot be done in finitely many arithmetic operations. However, in our numerical experiments we simply use (5.1), since it is simpler and ‖A‖_2 can be approximated efficiently in practice. For the restarting scheme, we also have the scale parameter 0 < r < 1, the constant s > 0 and the number of inner iterations T. These parameters are inferred from Theorem 9.4, which shows that the error of the restarted primal-dual iteration after l restarts is bounded as in (5.2).
Table 4: Hyperparameter values used in the numerical experiments. The first three parameters are used in both the unrestarted and restarted primal-dual iterations. The final three parameters are used in the restarted scheme only.
Here, as discussed in Theorem 9.4, C > 0 is a numerical constant that arises from the compressed sensing theory. This and the choice (5.1) lead immediately to a value for s. Unfortunately, the constant C is difficult to determine exactly (it is closely related to the constant c discussed in Remark 4.5). In our experiments, we simply pick the value C = 1, which immediately yields the value of s given in Table 4. Finally, to determine a value of r we consider the error bound (5.2), which is based on [51]. After l restarts, the total number of iterations is t = T·l. Substituting the value of T and ignoring the ceiling function, it makes sense to choose 0 < r < 1 to minimize the function r ↦ r log(r). This attains its minimum value of −e^{-1} at r = e^{-1}. Hence we use this value.
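The claim about r ↦ r log(r) is easy to verify numerically; a quick check on a fine grid:

```python
import numpy as np

# The function r -> r*log(r) attains its minimum -1/e on (0, 1) at r = 1/e.
r = np.linspace(1e-6, 1 - 1e-6, 2_000_000)
phi = r * np.log(r)
i = phi.argmin()
print(r[i], phi[i])   # approximately 0.3679 and -0.3679
```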

Test functions
We consider four test functions. The first two, f_1 and f_2, are scalar-valued. These are standard test functions (see, e.g., [8, §A.1]). The first function varies very little with respect to y; hence it is expected to be very well approximated by a sparse polynomial approximation. The second has more variation in y, so we expect a larger approximation error. We also consider two Hilbert-valued functions. Both arise as solutions of the parametric elliptic diffusion equation

−∇·(a(x, y)∇u(x, y)) = g(x), ∀x ∈ D, y ∈ U,   u(x, y) = 0, ∀x ∈ ∂D, y ∈ U,   (5.6)

which is a standard problem in the parametric PDE literature. We take the physical domain D = (0, 1)^2. For simplicity, we also choose g(x) ≡ 10 to be constant. In this case, the solution map is a Hilbert-valued function with codomain the Sobolev space H^1_0(D). We consider two different setups, leading to smooth and less smooth Hilbert-valued functions, which we denote by f_3 and f_4, respectively. The first is a simple two-dimensional problem with lognormal diffusion coefficient

a(x, y) = 5 + exp(x_1 y_1 + x_2 y_2).   (5.7)

For the second, we consider the diffusion coefficient from [6, Eqn. (24)], modified from an earlier example in [109, Eqn. (5.2)], with 30-dimensional parametric dependence and one-dimensional (layered) spatial dependence, given by (5.8).
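As an illustration of how samples of a Hilbert-valued function are generated, the following solves a one-dimensional analogue of (5.6) by finite differences, with a coefficient in the spirit of (5.7). This is only a sketch under our own simplifications; the paper's experiments use the 2D domain and FEniCS/Dolfin finite elements.

```python
import numpy as np

def solve_diffusion_1d(y, n=64):
    """Solve -(a(x,y) u')' = 10 on (0,1), u(0) = u(1) = 0, by central finite
    differences; a(x, y) = 5 + exp(x*y) is a 1D stand-in for (5.7)."""
    h = 1.0 / n
    x = np.linspace(0.0, 1.0, n + 1)
    a_half = 5.0 + np.exp((x[:-1] + h / 2) * y)   # coefficient at midpoints
    # Tridiagonal stiffness matrix for the n-1 interior nodes
    main = (a_half[:-1] + a_half[1:]) / h**2
    off = -a_half[1:-1] / h**2
    K = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)
    g = 10.0 * np.ones(n - 1)
    u = np.zeros(n + 1)
    u[1:-1] = np.linalg.solve(K, g)
    return x, u
```

For y = 0 the coefficient is the constant 6, the exact solution is the quadratic (5/6)x(1 − x), and the second-order scheme reproduces it up to roundoff.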

Error metrics and finite element discretization
In our experiments, we consider the relative L^2(U)-norm error for the scalar-valued functions f_1 and f_2 and the relative L^2(U; H^1_0(D))-norm error for the Hilbert-valued functions f_3 and f_4. To (approximately) compute this error we use a high-order isotropic Smolyak sparse grid quadrature rule based on Clenshaw-Curtis points. This rule is generated using the TASMANIAN software package [122]. We set the level of the quadrature rule in each experiment as large as possible within the constraints of computational time and memory. We now describe the discretization V_h for the Hilbert-valued functions f_3 and f_4. This is obtained via the finite element method as implemented by Dolfin [92], and accessed through the Python-based FEniCS project [16]. We generate a regular triangulation T_h of D composed of triangles T of equal diameter h_T = h. We consider a conforming discretization, which results in a finite-dimensional subspace V_h ⊂ V = H^1_0(D), where V_h is the space spanned by the usual Lagrange finite elements {ϕ_i}_{i=1}^K of order k = 1. We rely on the Dolfin UnitSquareMesh method to generate a mesh with 33 nodes per side, corresponding to a finite element triangulation with K = 1089 nodes, 2048 elements and mesh size h = √2/32. See [6,54] for further implementation details. Explicit forms of the Hilbert-valued functions f_3 and f_4 are not available. Therefore, computing the relative error requires first computing a reference solution. This is usually done by using a finite element discretization with mesh size an order of magnitude smaller than that used to compute the various approximations. However, our main focus in these experiments is on the polynomial approximation and algorithmic errors E_app and E_alg. Since our theoretical results assert that the approximations are robust to physical discretization error, we do not perform this additional (and costly) computational step.
Instead, we compute reference solutions using the same finite element discretization as that used to construct the various approximations. In other words, there is no physical discretization error present in these experiments.
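The error metric itself can be sketched as follows. This is a one-dimensional stand-in using Gauss-Legendre quadrature rather than the Clenshaw-Curtis Smolyak sparse grids generated with TASMANIAN, and it assumes the uniform probability measure on U = [−1, 1].

```python
import numpy as np

def relative_l2_error(f, f_approx, level=50):
    """Relative L^2(U)-norm error on U = [-1, 1] with the uniform probability
    measure, computed with a Gauss-Legendre quadrature rule."""
    nodes, weights = np.polynomial.legendre.leggauss(level)
    weights = weights / 2.0                          # normalize to mass 1
    diff2 = (f(nodes) - f_approx(nodes)) ** 2
    return np.sqrt(np.sum(weights * diff2) / np.sum(weights * f(nodes) ** 2))
```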

Numerical results 1: the optimization error
Our first experiments, Figures 1-4, compare the behaviour of the unrestarted primal-dual iteration to that of the restarted primal-dual iteration with several different values of the tolerance parameter ζ′.
In all cases, we observe a consistent improvement from the restarted scheme. This is particularly noticeable for the functions f_1 and f_3, since the underlying approximation error ζ is smaller in these cases. Recall that these functions are well approximated by polynomials. As predicted by our theoretical results, the error for the restarted scheme decays exponentially fast with respect to the number of iterations, down to this limiting accuracy. For example, in the case of f_1 the restarted scheme (with sufficiently small ζ′) achieves a relative error of less than 10^{-6} using only 500 iterations, whereas the unrestarted scheme only achieves an error of around 10^{-3} after 1000 iterations.
An important takeaway from these experiments is the insensitivity of the algorithm to the parameter ζ′. Our theoretical results only show exponential convergence (with respect to the iteration number) when ζ′ ≥ ζ, where ζ is a certain upper bound for the error. This appears unnecessary in practice. For instance, in Figures 2 and 4 we expect the underlying error ζ to be roughly 10^{-2} in magnitude, since this is the limiting error achieved by the unrestarted scheme. Yet setting ζ′ = 10^{-10} has no noticeable effect on the performance of the restarted scheme.
Another noticeable feature of these experiments is the close agreement between the theoretical rate of exponential decay of the restarted scheme, given by the right-hand side of (5.3), and what is observed in practice. Since the value r = e^{-1} is used in these experiments, in Figures 1-4 we also plot the corresponding theoretical bound versus the iteration number t. This theoretical curve exactly predicts the observed rate of exponential decay of the restarted schemes. Finally, in all four figures we also show the error of the (restarted) primal-dual iterates, as well as the ergodic sequences. Despite the theoretical results only holding for the latter, we see similar error decay for the iterates. In fact, the iterates give slightly better performance in the case of the unrestarted scheme. As expected, the ergodic sequence reduces the variation in the error for the restarted scheme. Moreover, by plotting the ergodic sequence we can see more clearly the benefit of restarting over not restarting.

Numerical results 2: the approximation error
In the second set of experiments, our aim is to study the approximation error versus the number of samples m. Having compared different solvers in the previous experiments, we now limit our attention to the restarted primal-dual iteration. The only modification we make is to introduce a stopping criterion for the number of restarts. Specifically, given a tolerance ζ′, we halt the iteration as soon as the difference between two consecutive iterates, measured in the relevant norm for the scalar- or Hilbert-valued case, is less than 5ζ′; if this occurs after l restarts, then we take the output c̃^(l) of the restarted primal-dual iteration as the polynomial coefficients of the resulting approximation. In the following experiments, we perform multiple trials for each value of m. For each trial, we generate a set of Monte Carlo sample points y_1, …, y_m, then compute the relative error (5.9) or (5.10) of the approximation using a sparse grid quadrature as before. Having done this, we compute the sample mean and (corrected) sample standard deviation after a log transformation. See [8, §A.1.3] for further discussion of, and the rationale behind, this computation.

Figure 5 shows the results for f_1. As discussed, this function is expected to be well approximated by polynomials. Accordingly, the error decreases rapidly, achieving roughly 10^{-7} relative L^2 error when m ≈ 200. This is in broad agreement with the exponential decay rate of the error shown in our main theorems. In Figure 6 we consider the more challenging, higher-dimensional function f_2. Here, as expected, the error decreases significantly more slowly. Figure 7 displays the performance of the restarted scheme on the Hilbert-valued function f_3. Here we also observe a rapid decrease in the error with respect to the number of samples m, with relative L^2 error approximately 10^{-6} when m ≈ 200.
Finally, Figure 8 shows the results for the less smooth high-dimensional Hilbert-valued function f 4 . For this function, we expect slower decrease in the error with respect to m, which is reflected in this set of results. Nonetheless, despite its high dimensionality (d = 30) we still achieve two digits of relative accuracy using only m ≈ 1000 samples.
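The stopping rule used in these experiments can be sketched as follows. The inner `step` below is a toy contraction standing in for one cycle of the restarted primal-dual iteration, and the Euclidean norm stands in for the norms of the scalar- and Hilbert-valued settings.

```python
import numpy as np

def run_with_stopping(step, c0, zeta, max_restarts=100):
    """Restart until two consecutive outputs differ by less than 5*zeta."""
    c_prev = c0
    for l in range(1, max_restarts + 1):
        c = step(c_prev)
        if np.linalg.norm(c - c_prev) < 5 * zeta:
            return c, l
        c_prev = c
    return c, max_restarts

# Toy contraction standing in for one restarted primal-dual cycle:
target = np.array([1.0, -2.0, 0.5])
step = lambda c: target + 0.5 * (c - target)   # halves the error per restart
c, l = run_with_stopping(step, np.zeros(3), zeta=1e-6)
```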

Overview of the proofs
The rest of this paper is devoted to proving the main results. Since these involve a number of technical steps, we now give a brief overview of how these proofs proceed.
We commence in §7 by developing compressed sensing theory for Hilbert-valued vectors. We introduce the so-called weighted robust Null Space Property (rNSP) over V, and then show in Lemma 7.4 that it implies certain error bounds for inexact minimizers of the Hilbert-valued, weighted SR-LASSO problem. Next, we introduce the weighted Restricted Isometry Property (RIP), and then in Lemma 7.6 we show that this property over C implies the weighted rNSP over V.
In §8 we focus on the polynomial approximation problem. We first give a sufficient condition in terms of m for the measurement matrix (4.3) to satisfy the weighted RIP with high probability (Lemma 8.1). Next, we state and prove three general results (Theorems 8.2-8.4) that give error bounds for polynomial approximations obtained as inexact minimizers of the Hilbert-valued, weighted SR-LASSO problem. These results are split into the three cases considered in our main results, i.e., the algebraic and finite-dimensional case, the algebraic and infinite-dimensional case, and the exponential case. The error bounds in these results split into terms corresponding to the polynomial approximation error, the physical discretization error, the sampling error, and the error in the objective function at the inexact minimizer.
With this in mind, in the next section, §9, we first present error bounds for inexact minimizers obtained by finitely-many iterations of the primal-dual iteration. See Lemma 9.2. Having done this, we then have the ingredients needed to derive the restarting scheme. We derive this scheme and present an error bound for it in Theorem 9.4.
We conclude in §10 with the final arguments. We use the three key theorems (Theorems 8.2-8.4) and then proceed to estimate each of the aforementioned error terms. For the polynomial approximation error, we appeal to several results given in Appendix A. For the error in the objective function, we use the results shown in §9. After straightforwardly bounding the other two error terms, we finally obtain the main results.

Hilbert-valued compressed sensing
In this section, we develop Hilbert-valued compressed sensing theory. Here, rather than the classical setting of a vector in C^N, one seeks to recover a Hilbert-valued vector in V^N. This was considered in [54] for the classical sparsity model with ℓ^1-minimization. Here, we consider the weighted sparsity model and weighted ℓ^1-minimization. This model was first developed in [115]. See also [2,38] and [8, Chpt. 6]. Note that in this section, we shall write V rather than V_h, as is done in (4.7). Of course, all the results shown below for V apply in the case of V_h.
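Computationally, a V_h-valued vector can be stored as an array whose rows are coefficient vectors, with the relevant norms and proximal maps acting row-wise. A small illustration; the blockwise shrinkage formula is the standard proximal map of a sum of Euclidean norms, stated here with unit weights as an assumption of ours rather than quoted from the paper.

```python
import numpy as np

def norm_1w(c, w):
    """||c||_{1,w;V} = sum_nu w_nu * ||c_nu||_V, where row nu of the array c
    holds the coefficients of c_nu in an orthonormal basis of V_h."""
    return float(np.sum(w * np.linalg.norm(c, axis=1)))

def block_soft_threshold(c, t):
    """Row-wise shrinkage: prox of t*||.||_{1;V} with unit weights shrinks
    each row's V_h-norm by t, zeroing rows with norm at most t."""
    norms = np.linalg.norm(c, axis=1, keepdims=True)
    scale = np.where(norms > t, 1 - t / np.where(norms > 0, norms, 1.0), 0.0)
    return scale * c
```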

Weighted sparsity and weighted best approximation
Let Λ ⊆ F and w = (w_ν)_{ν∈Λ} > 0 be positive weights. Given a set S ⊆ Λ, we define its weighted cardinality as |S|_w := Σ_{ν∈S} w_ν^2.
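A direct implementation of the weighted cardinality (with unit weights it reduces to the ordinary cardinality |S|):

```python
def weighted_cardinality(S, w):
    """|S|_w = sum of w_nu^2 over nu in S, for weights indexed over Lambda."""
    return sum(w[nu] ** 2 for nu in S)

w = {0: 1.0, 1: 2.0, 2: 1.5}       # toy weights on Lambda = {0, 1, 2}
```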
The following two definitions extend Definitions 2.2 and 2.3 to the weighted setting.

Definition 7.1 (Weighted sparsity). Let Λ ⊆ F. A V-valued sequence c = (c_ν)_{ν∈Λ} is weighted (k, w)-sparse for some k ≥ 0 and weights w = (w_ν)_{ν∈Λ} > 0 if |supp(c)|_w ≤ k, where supp(c) = {ν : ‖c_ν‖_V ≠ 0} is the support of c. The set of such vectors is denoted by Σ_{k,w}.
Definition 7.2 (Weighted best (k, w)-term approximation error). Let Λ ⊆ F, 0 < p ≤ 2, w > 0, c ∈ ℓ^p_w(Λ; V) and k ≥ 0. The ℓ^p_w-norm weighted best (k, w)-term approximation error of c is

σ_k(c)_{p,w;V} := min{‖c − z‖_{p,w;V} : z ∈ Σ_{k,w}}.

Notice that this is equivalent to

σ_k(c)_{p,w;V} = min{‖c − c_S‖_{p,w;V} : S ⊆ Λ, |S|_w ≤ k}.

Here and elsewhere, for a sequence c = (c_ν)_{ν∈Λ} and a set S ⊆ Λ, we define c_S as the sequence with νth entry equal to c_ν if ν ∈ S and zero otherwise.
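For the ℓ^1_w-norm and small index sets, the weighted best (k, w)-term approximation error can be computed by brute force over all feasible supports. A sketch with scalar entries standing in for the norms ‖c_ν‖_V:

```python
import itertools

def best_kw_term_error(c, w, k):
    """sigma_k(c)_{1,w}: minimum of ||c - c_S||_{1,w} over all supports S
    with weighted cardinality |S|_w <= k (exponential cost; tiny N only)."""
    N = len(c)
    best = float("inf")
    for size in range(N + 1):
        for S in itertools.combinations(range(N), size):
            if sum(w[i] ** 2 for i in S) <= k:
                err = sum(w[i] * abs(c[i]) for i in range(N) if i not in S)
                best = min(best, err)
    return best
```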

The weighted robust null space property
For the rest of this section, we consider the index set Λ = {1, …, N} for some N ∈ N. Our analysis of the weighted SR-LASSO problem is presented in terms of the so-called weighted robust null space property. Let w > 0 and k > 0. A bounded linear operator A ∈ B(V^N, V^m) has the weighted robust Null Space Property (rNSP) over V of order (k, w) with constants 0 < ρ < 1 and γ > 0 if

‖x_S‖_{2;V} ≤ (ρ/√k) ‖x_{S^c}‖_{1,w;V} + γ ‖Ax‖_{2;V}, ∀x ∈ V^N,

for every S ⊆ Λ with |S|_w ≤ k. Importantly, the weighted rNSP implies distance bounds in the ℓ^1_w- and ℓ^2-norms. The following lemma is standard in the scalar case (see, e.g., [8, Lem. 6.24]). We omit the proof of its extension to the Hilbert-valued case, since it follows almost exactly the same arguments.

Lemma 7.3 (Weighted rNSP implies ℓ^1_w- and ℓ^2-distance bounds). Suppose that A ∈ B(V^N, V^m) has the weighted rNSP over V of order (k, w) with constants 0 < ρ < 1 and γ > 0. Let x, z ∈ V^N. Then the ℓ^1_w- and ℓ^2-distance bounds hold with constants C_1, C_2 and C_1′, C_2′ depending only on ρ and γ. Fortunately, the weighted rNSP also implies bounds for approximate minimizers, such as those obtained by a finite number of steps of the primal-dual iteration.

Lemma 7.4 (Weighted rNSP implies error bounds for inexact minimizers).
Suppose that A ∈ B(V^N, V^m) has the weighted rNSP over V of order (k, w) with constants 0 < ρ < 1 and γ > 0. Let x ∈ V^N, b ∈ V^m and e = Ax − b ∈ V^m, and consider the problem (7.5) with parameter λ satisfying (7.6). Then, for any x̃ ∈ V^N, the corresponding error bounds hold, where C_1, C_2, C_1′ and C_2′ are as in Lemma 7.3.
Proof. First notice that C_1/C_2 ≤ C_1′/C_2′, since 0 < ρ < 1, where C_1, C_2, C_1′ and C_2′ are as in Lemma 7.3. Hence the condition on λ implies the required bound. Using Lemma 7.3 and this bound, we deduce a bound on the distance between x̃ and x. The definition of G in (7.5) then gives the first result. The second follows in an analogous manner.

The weighted rNSP and weighted restricted isometry property
In the next section, we give explicit conditions in terms of m under which the measurement matrices (4.3) satisfy the weighted rNSP over V. It is well known that showing the (weighted) rNSP directly can be difficult. In the classical, scalar setting, this is overcome by showing that the (weighted) rNSP is implied by the so-called (weighted) restricted isometry property. Hence, in this subsection, we first introduce this property and describe its relation to the (weighted) rNSP. Let w > 0 and k > 0. A bounded linear operator A ∈ B(V^N, V^m) has the weighted Restricted Isometry Property (RIP) over V of order (k, w) if there exists a constant 0 < δ < 1 such that

(1 − δ)‖x‖_{2;V}^2 ≤ ‖Ax‖_{2;V}^2 ≤ (1 + δ)‖x‖_{2;V}^2, ∀x ∈ Σ_{k,w}.

The smallest constant δ such that this property holds is called the (k, w)th weighted Restricted Isometry Constant (wRIC) of A, and is denoted δ_{k,w}. It is first convenient to show an equivalence between the scalar weighted RIP over C and the Hilbert-valued weighted RIP over V.

Lemma 7.5 (Weighted RIP over C is equivalent to the weighted RIP over V). Let w > 0, k > 0 and A = (a_{ij}) ∈ C^{m×N} be a matrix. Then A satisfies the weighted RIP over C of order (k, w) with constant 0 < δ < 1 if and only if the corresponding bounded linear operator A ∈ B(V^N, V^m), defined by applying the matrix A componentwise, satisfies the weighted RIP over V of order (k, w) with the same constant δ.
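For tiny scalar examples, the weighted restricted isometry constant over C can be computed by enumerating all supports of weighted cardinality at most k and examining the spectrum of the corresponding Gram matrices. A brute-force sketch (exponential cost; illustration only):

```python
import itertools
import numpy as np

def weighted_ric(A, w, k):
    """Largest deviation of the spectrum of A_S^* A_S from 1, over all
    supports S with |S|_w <= k; equals the weighted RIC delta_{k,w}."""
    N = A.shape[1]
    delta = 0.0
    for size in range(1, N + 1):
        for S in itertools.combinations(range(N), size):
            if sum(w[i] ** 2 for i in S) <= k:
                G = A[:, list(S)].conj().T @ A[:, list(S)]
                eig = np.linalg.eigvalsh(G)
                delta = max(delta, abs(eig[0] - 1), abs(eig[-1] - 1))
    return delta
```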

Error bounds for polynomial approximation via the Hilbertvalued, weighted SR-LASSO
Having developed the necessary tools for compressed sensing in the Hilbert-valued setting, we now specialize to the case introduced in §4.1 of polynomial approximation via the Hilbert-valued, weighted SR-LASSO problem (4.7). Our main results in this section, Theorems 8.2-8.4, yield error bounds for (inexact) minimizers of this problem in terms of the best polynomial approximation error, the Hilbert space discretization error and the noise.
Proof. The proof uses ideas that are now standard. The matrix A is a specific type of measurement matrix associated to the bounded orthonormal system {Ψ_ν}_{ν∈Λ} (see, e.g., [8, Sec. 6.4.3] or [58, Chpt. 12]). Such a matrix satisfies the weighted RIP of order k > 0 with constant δ_{k,u} ≤ δ whenever m satisfies the standard sample complexity estimate, for a potentially different universal constant c. Next, we use (3.7) (and recall that |Λ^HCI_n| = |Λ^HC_{n,n}|) to estimate

log(2N) ≤ c · min{d + log(n), log(2d) · log(2n)}, d < ∞,

for a potentially different universal constant. The result now follows after substituting this into the previous expression.

Bounds for polynomial approximations obtained as inexact minimizers
We now present the main results of this section. These three results provide error bounds for polynomial approximations obtained as (inexact) minimizers of the weighted SR-LASSO problem (4.7). Each theorem corresponds to one of the three scenarios in our main results in §3.3. Hence, we label them accordingly as algebraic and finite-dimensional, algebraic and infinite-dimensional, and exponential. In order to state these results, we now define some additional notation. Given f ∈ L^2(U; V) and Λ ⊆ F, where F is as in (2.1)-(2.2), we define the approximation errors E_{Λ,2}(f) and E_{Λ,∞}(f), where f_Λ is as in (4.1), and, given a subspace V_h ⊆ L^2(U; V), we define the discretization error E_{Λ,disc}(f), where P_h(f) is as in (2.7).
Proof. We divide the proof into several steps.
Step 1: Splitting the error into separate terms. Consider the L^2(U; V)-norm error first. By the triangle inequality and the fact that P_h is a projection, we can split this error, and by orthonormality we can bound the resulting terms. Similarly, we can split the L^∞(U; V)-norm error, and using the definition (4.8) of the weights u we deduce the corresponding bound. Therefore, the rest of the proof is devoted to establishing these bounds, in particular (8.4). We do this in the next two steps by first asserting that A has the weighted rNSP (Step 2) and then by applying the error bounds of Lemma 7.4 (Steps 3 and 4).
Step 2: Asserting the weighted rNSP. We now show that A has the weighted rNSP over V_h of order (k, u) with probability at least 1 − ε/2. This is based on Lemma 8.1, where L is defined as in Lemma 8.1, and therefore (again assuming a suitably large choice of c_0) (8.2) holds with k replaced by 2k. We deduce that A satisfies the weighted RIP of order (2k, u) with constant δ_{2k,u} ≤ 1/4, with probability at least 1 − ε/2. Then, we deduce from Lemmas 7.5 and 7.6 that A has (with the same probability) the weighted rNSP of order (k, u) over V_h with constants ρ = 2√2/3 and γ = 2√5/3.
Step 3: Bounding ‖P_h(c_Λ) − c̃‖ using the weighted rNSP. We use Lemma 7.4. First, consider the value of λ. Since c_0 ≥ 1 we have m/L ≥ m/(c_0 L) = k. Hence, recalling the values for ρ and γ obtained in the previous step, we obtain (8.5), and therefore (7.6) holds. We now apply this lemma with V = V_h, x = P_h(c_Λ), x̃ = c̃ and e = AP_h(c_Λ) − b. Notice first that the best (k, u)-term approximation error (7.2) admits the required bound, since P_h is a projection. Hence, applying Lemma 7.4 and using the lower bound in (8.5), we obtain (8.7) with probability at least 1 − ε/2. Therefore, to show (8.4), and thereby complete the proof, it suffices to show that (8.8) holds with probability at least 1 − ε/2. The overall result then follows by the union bound.
Step 4: Showing that (8.8) holds. For this final step, we follow near-identical arguments to those found in [8, Lem. 7.11]. This shows that (8.8) holds with probability at least 1 − ε/2, provided m ≥ 2k log(2/ε). However, this follows from the assumptions on m and the arguments given in Step 2. Thus we obtain (8.8) and the proof is complete.
Let {Ψ_ν}_{ν∈Λ} be either the orthonormal Chebyshev or Legendre basis, V_h ⊆ L^2(U) be a subspace of L^2(U) and Λ = Λ^HC_{n,d} be the hyperbolic cross index set with n as in (3.19). Draw y_1, …, y_m randomly and independently according to the underlying probability measure. Then, with probability at least 1 − ε, the following holds. Let f ∈ L^2(U; V) and suppose that A, b and e are as in (4.3) and (4.4). Consider the Hilbert-valued, weighted SR-LASSO problem (4.7) with weights w = u as in (4.8) and λ = (4√(m/L))^{-1}. Then there exist universal constants c_0, c_1, c_2 ≥ 1 such that any c̃ = (c̃_ν)_{ν∈Λ} satisfies the stated error bounds, where c_Λ is as in (4.2), P_h(c_Λ) = (P_h(c_ν))_{ν∈Λ}, k = m/(c_0 L) for L = L(m, d, ε) as in (3.9), and n is as in (4.4).
Proof. The proof has the same structure as that of Theorem 8.2.
Step 1 is identical, and reduces the proof to showing that (8.4) holds. We now describe the modifications needed in Steps 2-4.

Step 2: Asserting the weighted rNSP. We now show that A has the weighted rNSP over V_h of order (k, u) with probability at least 1 − ε. This step is essentially the same, except for the choice of n and the probability 1 − ε instead of 1 − ε/2.
Step 3: Bounding ‖P_h(c_Λ) − c̃‖ using the weighted rNSP. Since λ and k are the same as in Theorem 8.2, the bound (8.5) also holds in this case. We then follow the same arguments, leading to (8.7) holding with probability at least 1 − ε. Finally, rather than (8.8), we ask for a slightly modified bound to hold with probability one.
Step 4: Showing that (8.11) holds. By the same argument, we see that (8.9) holds. Instead of the probabilistic bound for E_{Λ,disc}(f), we now simply bound it deterministically. This immediately implies (8.11).
Finally, we observe that we can simplify the previous estimates in this case using the bound E Λ,2 (f ) ≤ E Λ,∞ (f ).
9 Error bounds and the restarting scheme for the primal-dual iteration

Theorems 8.2-8.4 reduce the problem of proving the main results (Theorems 3.4-3.12) to two tasks. The first involves bounding the error in the objective function, i.e. the term G(c̃) − G(P_h(c_Λ)), where c̃ is either an exact minimizer or an approximate minimizer obtained via the primal-dual iteration. The second involves the various approximation error terms depending on f and its polynomial coefficients.
In this section, we address the first task. We first provide an error bound for the (unrestarted) primal-dual iteration when applied to the Hilbert-valued weighted SR-LASSO problem (7.5), and then use this to derive the specific restarting scheme.

Error bounds for the primal-dual iteration
We now return to the general setting of the primal-dual iteration, where it is applied to the problem (4.11) and takes the form (4.16). The following result from [31, Theorem 5.1] establishes an important error bound for the Lagrangian difference.
The following lemma shows a 1/n decay rate of the objective function error for the primal-dual iteration applied to the problem (7.5). It is an extension of [11, Lem. 8.6] to the weighted and Hilbert-valued setting.
Lemma 9.2. Consider the sequence {(x^(n), ξ^(n))}_{n=1}^∞ generated by the primal-dual iteration (4.16) applied to (7.5) with x^(0) ∈ V^N and ξ^(0) = 0 ∈ V^m. Then, for any x ∈ V^N, the bound (9.1) holds.

Proof. Using (4.14) and (4.17), the left-hand side of (9.1) can be expressed in terms of the indicator function δ_B, where B is the unit ball in V^m. Observe that the term ξ^(n) produced by this iteration satisfies ‖ξ^(n)‖_{2;V} ≤ 1. This follows from the observation shown in §4.4 that the proximal mapping involves the projection onto the unit ball B. Hence the ergodic sequence ξ̄^(n) satisfies ‖ξ̄^(n)‖_{2;V} ≤ 1 as well. Suppose now that Ax̄^(n) − b ≠ 0 and choose ξ accordingly. Then δ_B(ξ) = δ_B(ξ̄^(n)), and the corresponding terms can be compared. Clearly, the same bound also holds in the case Ax̄^(n) − b = 0, where ξ is an arbitrary unit vector. Hence Theorem 9.1 and the fact that ‖ξ − ξ_0‖_{2;V} = ‖ξ‖_{2;V} = 1 give the result.

The restarting scheme
For convenience, we now introduce some new notation and slightly modify some existing notation. First, we redefine the objective function G of the Hilbert-valued weighted SR-LASSO problem (7.5) to make the dependence on the term b explicit: namely, we set G(x, b) := λ‖x‖_{1,w;V} + ‖Ax − b‖_{2;V}. We then let E(x, z, b) := G(x, b) − G(z, b) be the corresponding error in the objective function. Now consider the ergodic sequence x̄^(n) produced by n iterations of the primal-dual iteration (4.16) applied to (7.5) with parameters τ, σ > 0, x_0 ∈ V^N and ξ_0 = 0 ∈ V^m. For reasons that will become clear in a moment, we now make the dependence on the vector b in (7.5), the number of iterations n and the initial vector x_0 explicit, by defining P(x_0, b, n) := x̄^(n). With this in hand, we conclude this discussion by noting the following two scaling properties:

G(ax, ab) = aG(x, b),   (9.3)
E(ax, az, ab) = aE(x, z, b).   (9.4)

These hold for any a > 0 and for any x, z ∈ V^N and b ∈ V^m.

Lemma 9.3. Suppose that A ∈ B(V^N, V^m) has the weighted rNSP over V of order (k, w) with constants 0 < ρ < 1 and γ > 0. Consider the Hilbert-valued weighted SR-LASSO problem (7.5) with parameter λ = c/√k, where 0 < c ≤ (1+ρ)^2/((3+ρ)γ). Let E and P be as defined above and let τ, σ satisfy ‖A‖^2_{B(V^N,V^m)} ≤ (τσ)^{-1}. Then the error in the objective function of the scaled output aP(x_0/a, b/a, n) admits a bound in terms of the error at the initial value, the scaling parameter a, the number of iterations n and the term ξ defined in (9.6), where

C = 2 max{C_1/c, C_2},   (9.5)

and C_1, C_2 are as in Lemma 7.3.

Proof. The scaling property (9.4) and Lemma 9.2 give an initial bound. Now consider the term ‖x − x_0‖_{2;V}. Since A has the weighted rNSP and λ satisfies (7.6), we may use Lemma 7.4 to bound this term. Substituting this into the previous expression now gives the result.
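The scaling properties above rest on the positive 1-homogeneity of the SR-LASSO objective, which is easy to check numerically on toy scalar-valued data:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))
w = 1.0 + rng.random(5)       # positive weights
lam = 0.1

def G(x, b):
    """Weighted SR-LASSO objective: lam*||x||_{1,w} + ||A x - b||_2."""
    return lam * np.sum(w * np.abs(x)) + np.linalg.norm(A @ x - b)

x, b, a = rng.standard_normal(5), rng.standard_normal(8), 3.7
assert np.isclose(G(a * x, a * b), a * G(x, b))   # positive 1-homogeneity
```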
This lemma gives the rationale behind the restarted scheme. It says that the error in the objective function of the scaled output aP(x_0/a, b/a, n) of the primal-dual iteration with initial value x_0 can be bounded in terms of the error in the objective function at the initial value, plus terms depending on the scaling parameter a, the number of iterations n and the compressed sensing error term ξ. By choosing these parameters suitably and iterating this procedure, we obtain the restarting scheme. We summarize this in the following theorem.

Theorem 9.4 (Restarting scheme). Suppose that A ∈ B(V^N, V^m) has the weighted rNSP over V of order (k, w) with constants 0 < ρ < 1 and γ > 0. Consider the Hilbert-valued weighted SR-LASSO problem (7.5). Let ζ′ ≥ ξ, where ξ is as in (9.6), let 0 < r < 1 and define the sequence

ε_0 = ‖b‖_{2;V},   ε_{k+1} = r(ε_k + ζ′),   k = 0, 1, 2, ….
Let E and P be as defined above, let τ, σ satisfy ‖A‖^2_{B(V^N,V^m)} ≤ (τσ)^{-1}, and set the number of inner iterations n and the scale factors a_k accordingly, where C is as in (9.5). Then the iteration x̃^(0), x̃^(1), x̃^(2), … defined by the restarting procedure satisfies E(x̃^(k), x, b) ≤ ε_k for every k.

Proof. We use induction on k. Suppose first that k = 0. Then the claim holds by definition. Now suppose that the result holds for k. The previous lemma gives a bound for E(x̃^(k+1), x, b). We now substitute the values of n and a_k to obtain E(x̃^(k+1), x, b) ≤ r(ε_k + ζ′) = ε_{k+1}. This completes the proof.
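The tolerance sequence of Theorem 9.4 is easy to tabulate: it decays like r^k down to the limiting value rζ′/(1 − r), which is the controlled tolerance mentioned earlier. A small sketch:

```python
import numpy as np

def restart_tolerances(b_norm, r, zeta, K):
    """eps_0 = ||b||, eps_{k+1} = r*(eps_k + zeta). Closed form:
    eps_k = r^k * eps_0 + r*zeta*(1 - r^k)/(1 - r)."""
    eps = [b_norm]
    for _ in range(K):
        eps.append(r * (eps[-1] + zeta))
    return np.array(eps)

r, zeta = np.exp(-1), 1e-8
eps = restart_tolerances(1.0, r, zeta, 40)
```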
This theorem states that the restarted primal-dual iteration x̃^(0), x̃^(1), x̃^(2), … yields an objective function error E(x̃^(k), x, b) that converges exponentially fast in the number of restarts k. Further, each (inner) primal-dual iteration involves a number of steps n that depends on the parameters C, r, σ and τ. In other words, n is a constant independent of k. Hence, the restarted scheme converges exponentially fast in the total number of primal-dual iterations as well.
As discussed in §5.1.1, it is typical to use this theorem to optimize the choice of r. Recall that this leads to the explicit choice r = e^{-1}. We use this value in our algorithms; see Table 3.

Final arguments
We are now ready to prove the main results, Theorems 3.4-3.12. In several of these proofs, we require the following definition. Let s ∈ N and define

k(s) := max{|S|_u : S ⊂ N_0^d, |S| ≤ s, S lower}.

We now estimate the term G(ĉ) − G(P_h(c_Λ)) using Lemma 9.2. In order to apply this lemma, we first need to estimate ‖A‖_{2;V}. Let x ∈ V_h^N and define p(y) = Σ_{ν∈Λ} x_ν Ψ_ν. Now the set Λ is lower and of cardinality |Λ| = Θ(n, d). Hence, by (10.2) with s = N, we have |Λ|_u ≤ (Θ(n, d))^{2α}, where α is as in (3.8). Since x was arbitrary, we get ‖A‖_{2;V} ≤ (Θ(n, d))^α. Since the primal-dual iteration in §4.5 is used with τ = σ = (Θ(n, d))^{-α}, we have that ‖A‖^2_{2;V} ≤ (τσ)^{-1}. Hence we may apply Lemma 9.2. Since the iteration is also initialized with the zero vector and run for a total of T = 2(Θ(n, d))^α t iterations (see §4.5 once more), this gives the desired bound. Here, in the last step, we use the fact that f ∈ B(ρ), and therefore ‖f‖_{L^2(U;V)} ≤ ‖f‖_{L^∞(U;V)} ≤ 1.
Using this and the value of T, we deduce the required bound on the objective function error. Substituting this into (10.4) and combining with the other estimates (10.5)-(10.7) derived in Step 2 of the proof of Theorem 3.4 now gives the desired error bound. It remains to estimate the computational cost. We do this via Lemmas 4.3 and 4.4. First observe that the value k in Lemma 4.4 is equal to k = d in this case, since the index set Λ = Λ^HC_{n,d} is a d-dimensional hyperbolic cross index set. Similarly, the value n in Lemma 4.4 is bounded by the order n of this hyperbolic cross. As Λ is a lower set, we also have n ≤ N. Hence, the computational cost of forming the matrix A is bounded by c·m·N·d. We now use Lemma 4.3 to bound the computational cost of the algorithm. Finally, we recall that N = Θ(n, d) and T = 2(Θ(n, d))^α t in this case.
Proof of Theorem 3.6. As in the previous proof, we only need to estimate the term $G(\hat{c}) - G(P_h(c_\Lambda))$. Recall from Table 3 that in this case $\hat{c} = \check{c}^{(R)}$ is the output of the restarted primal-dual iteration with $R$ restarts. Our goal is to apply Theorem 9.4 to the problem (4.7) with weights $w = u$ as in (4.8), $\lambda = (4\sqrt{m/L})^{-1}$ and $x = P_h(c_\Lambda)$. We first show that the conditions of this theorem hold. Recall from Step 2 of the proof of Theorem 8.2 that the matrix $A$ has the weighted rNSP of order $(k, u)$ over $\mathcal{V}_h$ with constants $\rho = 2\sqrt{2}/3$ and $\gamma = 2\sqrt{5}/3$. In particular, we now use (8.5) to see that, for a sufficiently large choice of $c_0$. Next, with this choice of $x$, we see that

Algebraic rates of convergence, infinite dimensions
Proof of Theorem 3.7. The proof is similar to that of Theorem 3.4, except that it uses Theorem 8.3 in place of Theorem 8.2. In particular, we see that (10.3) also holds in this case with $\xi$ as in (10.4) and $k = m/(c_0 L)$.
Step 2 is identical. The only differences occur in Step 1. We now describe the changes needed in this case. First consider the term $\sigma_k(c_\Lambda)_{1,u;\mathcal{V}}/\sqrt{k}$. To bound this, we use (ii) of Theorem A.3 with $q = 1 > p$. This gives. Finally, for $E_{\Lambda,\infty}(f)$, we use (iii) of Theorem A.3 once more (with $q = 1 > p$) to get. Having done this, we also observe that $G(\hat{c}) - G(P_h(c_\Lambda)) \le 0$ in this case, since $\hat{c}$ is once more an exact minimizer. Using this and the previously-derived bounds, we conclude that $\xi \le \zeta$, where $\zeta$ is as in (3.16). This gives the result.
Proof of Theorem 3.8. The argument is similar to that of Theorem 3.5. Here $\hat{c} = \bar{c}^{(T)}$ is again the ergodic sequence obtained after $T$ steps of the primal-dual iteration applied to (4.7).
We recall that the set $\Lambda$ is lower and of cardinality $|\Lambda| = \Theta(n,d)$ with $d = \infty$. Hence, by (10.2) with $s = N$, we have $|\Lambda|_u \le (\Theta(n,d))^{2\alpha}$, where $\alpha$ is as in (3.8). Using this, we get $\|A\|_{2;\mathcal{V}} \le (\Theta(n,d))^{\alpha}$, as before. Since the primal-dual iteration in Table 3 is used with $\tau = \sigma = (\Theta(n,d))^{-\alpha}$, we have that $\|A\|^2_{2;\mathcal{V}} \le (\tau\sigma)^{-1}$. Hence, following the same steps, we deduce that. Substituting this into (10.4) and combining with the other estimates (10.5)-(10.7) derived in Step 2 of the proof of Theorem 3.4 now gives the desired error bound. The computational cost estimate is similar to that in the proof of Theorem 3.5. In this case, observe that the value $k$ in Lemma 4.4 is equal to $n$. Hence the computational cost of forming $A$ is bounded by $c \cdot m \cdot N \cdot n$ in this case. The computational cost of the algorithm is given by Lemma 4.3. To complete the estimate, we substitute the values $N = \Theta(n,d)$ and $T = 2(\Theta(n,d))^{\alpha} t$, as before.
Proof of Theorem 3.9. The proof is similar to that of Theorem 3.6 and involves estimating the term $G(\hat{c}) - G(P_h(c_\Lambda))$. Using the same steps, we deduce that $\xi(x,b) \le \zeta$. However, this bound also clearly holds for all $m \ge 1$, up to a change in the constant $C(d, \gamma, \rho)$. After relabelling the universal constant $4c_0$ as $c_0$, we deduce that $\xi \le \zeta$, where $\zeta$ is as in (3.21). This concludes the proof.
Proof of Theorem 3.11. The argument is the same as the proof of Theorem 3.5. The difference lies in the fact that $\zeta$ now satisfies the following bound. To estimate the final term, we argue exactly as in the proof of Theorem 3.5. The computational cost estimate is likewise identical.
Proof of Theorem 3.12. We now use (8.5) to see that, for a sufficiently large choice of $c_0$, as before. Next, with the choice $x = P_h(c_\Lambda)$ as before, we see that
$$\xi(x, b) = \frac{\sigma_k(P_h(c_\Lambda))_{1,u;\mathcal{V}}}{\sqrt{k}} + \|A P_h(c_\Lambda) - b\|_{2;\mathcal{V}}.$$
Using (8.11), we get, with probability $1 - \epsilon$. It now follows from the proof of Theorem 3.10 that, with probability at least $1 - \epsilon$, where $\zeta$ is as in (3.21). Hence, $\xi(x, b) \le \zeta$. The rest of the proof follows the same steps as the proof of Theorem 3.6.

Conclusions
Sparse polynomial approximation is a useful tool in parametric model problems, including surrogate model construction in UQ. The theory of best $s$-term approximation supports the use of polynomial-based methods, and techniques such as least squares and compressed sensing are known to possess desirable sample complexity bounds for computing polynomial approximations. In this work, we have closed a key gap between these two areas of research by showing the existence of algorithms that achieve the algebraic and exponential rates of the best $s$-term approximation with respect to the number of samples $m$. Thus, sparse polynomial approximation can be practically realized in a provably sample-efficient manner. As our numerical experiments confirm, our algorithms are practical, and in fact perform better than the theory suggests.
There are a number of avenues for further research. First, this work has focused on Chebyshev and Legendre polynomials on the hypercube $[-1,1]^d$. It is plausible that it can be extended to general ultraspherical or Jacobi polynomials. A more significant challenge involves Hermite or Laguerre polynomials on $\mathbb{R}^d$ or $[0,\infty)^d$. This is an interesting problem for future research.
It is notable that the algorithms developed in this paper do not generally compute $m$-term polynomial approximations. Indeed, (inexact) minimizers of the SR-LASSO problem will generally be nonsparse vectors of length $N = \Theta(n,d)$. It would be interesting to investigate whether one can develop algorithms that achieve the same error bounds while computing $m$-term polynomial approximations. In classical compressed sensing, one can typically compute sparse solutions by using a greedy or iterative procedure (see, e.g., [58]). Unfortunately, it is not clear how to extend these procedures to the weighted case with theoretical guarantees. Nonetheless, certain weighted greedy methods appear to work well in practice for sparse polynomial approximation [5].
Another motivation for considering different algorithms is to see whether the computational cost estimates can be reduced. While this is often not the main computational bottleneck in parametric model problems (generally, computing the samples is the most computationally-intensive step), it is still an important issue. We have shown that the computational cost is at worst subexponential in $m$ in infinite dimensions, and algebraic in $m$ (for fixed $d$) in finite dimensions. Whether or not these rates are optimal is an interesting open problem. Here, ideas from sublinear-time algorithms [39,40] may be particularly useful.
In the case of the exponential rates, it is notable that the best $s$-term approximation error is exponentially small in $\gamma \cdot s^{1/d}$ (see Theorem 2.6), whereas the exponents in §3.3.3 are $(\gamma/2)(m/(c_0 L))^{1/d}$ (Chebyshev) and $\gamma (m/(c_0 L))^{1/(2d)}$ (Legendre). The reason for this can be traced to the sample complexity estimate for computing a sparse (and lower) polynomial approximation via compressed sensing with Monte Carlo sampling, i.e., $m \approx c_0 \cdot 2^d \cdot s \cdot L$ (Chebyshev) or $m \approx c_0 \cdot s^2 \cdot L$ (Legendre). To see why this is the case, combine Lemma 8.1 with (10.2). In the setting of least squares, in which the desired polynomial subspace is known, it is possible to change the sampling measure to obtain sample complexity bounds that are log-linear in $s$, and therefore near-optimal. See, e.g., [46,69]. More recently, several works have also introduced sampling schemes that achieve sample complexity linear in $s$, i.e., optimal up to a constant [20,45,84,90,130]. Unfortunately, it is unknown whether linear or log-linear sample complexity is possible in the compressed sensing setting, where the target subspace is unknown. See [10] for further discussion of this issue.
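The trade-off just described can be checked numerically. The following sketch tabulates the two quoted sample complexity estimates, taking $c_0 = L = 1$ for simplicity (an illustrative normalization, not the paper's constants): the Legendre scaling $s^2$ beats the Chebyshev scaling $2^d \cdot s$ precisely when $s < 2^d$.

```python
def m_chebyshev(s, d, c0=1.0, L=1.0):
    """Compressed sensing + Monte Carlo sampling, Chebyshev: m ~ c0 * 2^d * s * L."""
    return c0 * 2**d * s * L

def m_legendre(s, c0=1.0, L=1.0):
    """Compressed sensing + Monte Carlo sampling, Legendre: m ~ c0 * s^2 * L."""
    return c0 * s**2 * L

# compare the two regimes: Legendre wins for small s, Chebyshev for large s
for d in (2, 4, 8):
    for s in (10, 100, 1000):
        better = "Legendre" if m_legendre(s) < m_chebyshev(s, d) else "Chebyshev"
        print(f"d={d:2d} s={s:5d}  Chebyshev m~{m_chebyshev(s, d):10.0f}  "
              f"Legendre m~{m_legendre(s):10.0f}  ({better} smaller)")
```

The crossover at $s = 2^d$ follows from $s^2 < 2^d s \iff s < 2^d$, which makes the dimension-dependence of the Chebyshev estimate explicit.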

A.1 The finite-dimensional case
We first consider the finite-dimensional case, where $U = [-1,1]^d$ for $d < \infty$ and $f : U \to \mathcal{V}$ is a Hilbert-valued function (in fact, the following results also apply in the more general setting of Banach-valued functions; however, we shall not consider this explicitly). We now summarize the various approximation error bounds in the following theorem. This result combines various well-known results in the literature. It is essentially the same as [8, Thm. 3.25]. However, we have made a number of minor edits to fit the notation and setup of this paper (see Remark A.2 below).
Theorem A.1 (Best $s$-term decay rates; finite dimensions). Let $d \in \mathbb{N}$, $f \in \mathcal{B}(\rho)$ for some $\rho > 1$, where $\mathcal{B}(\rho)$ is as in (2.8), and let $c = (c_\nu)_{\nu \in \mathbb{N}^d_0}$ be its Chebyshev or Legendre coefficients. Then the following best $s$-term decay rates hold:

(i) for any $0 < p \le q \le 2$ and $s \in \mathbb{N}$, there exists a lower set $S \subset \mathbb{N}^d_0$ of size $|S| \le s$ such that
$$\sigma_s(c)_{q;\mathcal{V}} \le \|c - c_S\|_{q;\mathcal{V}} \le \|c - c_S\|_{q,u;\mathcal{V}} \le C \cdot s^{1/q - 1/p},$$
where $\sigma_s(c)_{q;\mathcal{V}}$ is as in Definition 2.3 (with $\Lambda = \mathbb{N}^d_0$), $u$ is as in (4.8) and $C = C(d, p, \rho) > 0$ depends on $d$, $p$ and $\rho$ only;

(ii) for any $0 < p \le q \le 2$ and $k > 0$,
$$\sigma_k(c)_{q,u;\mathcal{V}} \le C \cdot k^{1/q - 1/p},$$
where $\sigma_k(c)_{q,u;\mathcal{V}}$ is as in Definition 7.2, $u$ is as in (4.8) and $C = C(d, p, \rho) > 0$ depends on $d$, $p$ and $\rho$ only;

(iii) for any $0 < p \le 2$, where $\sigma_s(c)_{p;\mathcal{V}}$ is as in Definition 2.3 (with $\Lambda = \mathbb{N}^d_0$), $u$ is as in (4.8) and $C = C(d, \gamma, p, \rho) > 0$ depends on $d$, $\gamma$, $p$ and $\rho$ only.
Remark A.2. There are several differences between Theorem A.1 and [8, Thm. 3.25]. A minor difference is that we do not specify the various constants $C$ appearing in the result. Another difference is in the presentation of (iii). Here we allow arbitrary $s \ge 1$ (instead of $s \ge \bar{s}$) at the expense of a larger (and unspecified) constant $C$. The main difference, however, is the additional term $\|c - c_S\|_{q,u;\mathcal{V}}$ appearing in (i). This can be shown as follows. First, one defines the sequence $\bar{c} = (u_\nu^{2/q-1} c_\nu)_{\nu \in \mathbb{N}^d_0}$, so that $\|c - c_S\|_{q,u;\mathcal{V}} = \|\bar{c} - \bar{c}_S\|_{q;\mathcal{V}}$, and then uses Stechkin's inequality in lower sets (see, e.g., [8, Lem. 3.9]) to show that $\|\bar{c} - \bar{c}_S\|_{q;\mathcal{V}} \le s^{1/q - 1/p} \|\bar{c}\|_{p,M;\mathcal{V}}$, where $\|\cdot\|_{p,M;\mathcal{V}}$ is the norm on the majorant $\ell^p$ space $\ell^p_M(\mathbb{N}^d_0; \mathcal{V})$ (see, e.g., [8, Defn. 3.8]). Finally, it can be shown that $\|\bar{c}\|_{p,M;\mathcal{V}} \le C(d, p, \rho)$ using standard arguments. See, e.g., [8, Lem. 7.19] (this lemma only considers the scalar-valued case; however, the extension to the Hilbert-valued case is straightforward).
Note that Theorem A.1 immediately implies Theorems 2.4 and 2.6. For the former, we note that $\|f - f_{S_1}\|_{L^2(U;\mathcal{V})} = \|c - c_{S_1}\|_{2;\mathcal{V}}$ and $\|f - f_{S_2}\|_{L^\infty(U;\mathcal{V})} \le \|c - c_{S_2}\|_{1,u;\mathcal{V}}$. We then apply (i) with $q = 2$ or $q = 1$. For the latter, we use (iii) with $p = 1$.
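The Stechkin-type inequalities underpinning these results are easy to verify numerically in the simplest setting. The sketch below checks the classical unweighted bound $\sigma_s(c)_q \le s^{1/q - 1/p} \|c\|_p$ (valid for $0 < p \le q$) on an algebraically decaying test sequence; it is only the plain $\ell^p$ version, not the lower- or anchored-set variants with majorant norms cited from [8].

```python
import numpy as np

def best_s_term_error(c, s, q):
    """sigma_s(c)_q: the l^q norm of the tail left after keeping
    the s largest-magnitude entries of c."""
    tail = np.sort(np.abs(np.asarray(c, dtype=float)))[::-1][s:]
    return float(np.sum(tail**q) ** (1.0 / q)) if tail.size else 0.0

def stechkin_bound(c, s, p, q):
    """Classical Stechkin bound s^{1/q - 1/p} * ||c||_p, valid for 0 < p <= q."""
    norm_p = float(np.sum(np.abs(np.asarray(c, dtype=float))**p) ** (1.0 / p))
    return s ** (1.0 / q - 1.0 / p) * norm_p
```

For example, with $c_k = (k+1)^{-2}$ and $(p, q) = (1, 2)$, the bound reads $\sigma_s(c)_2 \le s^{-1/2} \|c\|_1$, which holds for every $s$.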

A.2 The infinite-dimensional case
We now consider the infinite-dimensional case, where $d = \infty$ and $U = [-1,1]^{\mathbb{N}}$.

Theorem A.3 (Best $s$-term decay rates; infinite-dimensional case). Let $d = \infty$, $0 < p < 1$, $\epsilon > 0$, $b \in \ell^p(\mathbb{N})$ with $b > 0$ and $f \in \mathcal{B}(b, \epsilon)$, where $\mathcal{B}(b, \epsilon)$ is as in (2.9). Let $c = (c_\nu)_{\nu \in \mathcal{F}}$ be the Chebyshev or Legendre coefficients of $f$. Then the following best $s$-term decay rates hold:

(i) For any $p \le q < \infty$ and $s \in \mathbb{N}$, there exists a lower set $S \subset \mathcal{F}$ of size $|S| \le s$ such that
$$\sigma_s(c)_{q;\mathcal{V}} \le \|c - c_S\|_{q;\mathcal{V}} \le \|c - c_S\|_{q,u;\mathcal{V}} \le C \cdot s^{1/q - 1/p},$$
where $\sigma_s(c)_{q;\mathcal{V}}$ is as in Definition 2.3 (with $\Lambda = \mathcal{F}$), $u$ is as in (4.8) and $C = C(b, \epsilon, p) > 0$ depends on $b$, $\epsilon$ and $p$ only.
(ii) For any $p \le q \le 2$ and $k > 0$,
$$\sigma_k(c)_{q,u;\mathcal{V}} \le C \cdot k^{1/q - 1/p},$$
where $\sigma_k(c)_{q,u;\mathcal{V}}$ is as in Definition 7.2, $u$ is as in (4.8) and $C = C(b, \epsilon, p) > 0$ depends on $b$, $\epsilon$ and $p$ only.
(iii) Suppose that $b$ is monotonically nonincreasing. Then, for any $p \le q < \infty$ and $s \in \mathbb{N}$, there exists an anchored set $S \subset \mathcal{F}$ of size $|S| \le s$ such that
$$\sigma_s(c)_{q;\mathcal{V}} \le \|c - c_S\|_{q;\mathcal{V}} \le \|c - c_S\|_{q,u;\mathcal{V}} \le C \cdot s^{1/q - 1/p},$$
where $\sigma_s(c)_{q;\mathcal{V}}$ is as in Definition 2.3 (with $\Lambda = \mathcal{F}$), $u$ is as in (4.8) and $C = C(b, \epsilon, p) > 0$ depends on $b$, $\epsilon$ and $p$ only.

As in Remark A.2, the main difference is the assertion of the bound on $\|c - c_S\|_{q,u;\mathcal{V}}$. This can be established through similar arguments, using either the majorant $\ell^p$ space $\ell^p_M(\mathcal{F}; \mathcal{V})$ or the anchored $\ell^p$ space $\ell^p_A(\mathcal{F}; \mathcal{V})$ (see, e.g., [8, Defn. 3.31]) and then Stechkin's inequality in lower or anchored sets (see, e.g., [8, Lem. 3.32]). See also [8, Lem. 7.23] (this lemma only considers the scalar-valued case; however, the extension to the Hilbert-valued case is straightforward).
Note that Theorem A.3 implies Theorem 2.5. This follows from (i) with q = 2 or q = 1.