Distributed function estimation: adaptation using minimal communication

We investigate whether in a distributed setting, adaptive estimation of a smooth function at the optimal rate is possible under minimal communication. It turns out that the answer depends on the risk considered and on the number of servers over which the procedure is distributed. We show that for the $L_\infty$-risk, adaptively obtaining optimal rates under minimal communication is not possible. For the $L_2$-risk, it is possible over a range of regularities that depends on the relation between the number of local servers and the total sample size.

In this paper we investigate this problem for a basic distributed architecture, in which we have $m$ local servers over which the data is distributed and which each carry out a statistical procedure using their local data, independently of each other. They communicate their result to a central server that performs some aggregation and produces a final estimate of the quantity of interest. The two goals of high accuracy and little communication are conflicting in this setting. It is intuitively clear that to achieve high accuracy it is beneficial to have a lot of data in one server, which is only possible if the total number of local servers $m$ is small, which we will not assume, or if the local servers are allowed to communicate a lot of information to the central server, which we consider undesirable.
The problem becomes most interesting if the unknown object is high- or infinite-dimensional. To be specific, we will consider a distributed signal estimation problem in which the goal is to estimate a function $f \in L_2[0,1]$ with (Besov) regularity $s > 0$. (A precise description of the model is given in Section 2.) The best accuracy that can be achieved with respect to the $L_2$-norm can be described by minimax lower bounds. In the classical, non-distributed setting the minimax lower bound over Besov balls of regularity $s$ is known to be of the order $n^{-s/(1+2s)}$, where $n$ is the sample size, or signal-to-noise ratio (e.g. [12]). Recently established lower bounds for distributed nonparametric methods under communication constraints (see [25], [28], and Section 2 ahead) show that this optimal rate can also be achieved by distributed methods, but only if each local machine is allowed to communicate at least order $n^{1/(1+2s)}$ bits of information to the central machine. This is what the authors of [28] call the sufficient regime.
A distributed strategy that achieves the rate $n^{-s/(1+2s)}$ under the restriction that the local machines communicate at most the minimal order $n^{1/(1+2s)}$ bits is easily constructed (see Theorem 2.2). However, this simple strategy uses knowledge of the regularity $s$ of the unknown signal. The really interesting question is whether this can be done adaptively, without knowing $s$. This greatly complicates the problem, since we not only want the estimator to adapt to smoothness, but also require that the local machines determine the maximally allowed number of bits in a purely data-driven manner.
It turns out that whether or not this is possible for the $L_2$-risk depends on the relation between the number of machines $m$ and the total sample size, or signal-to-noise ratio, $n$. We prove that if $m = n^p$ for some $p \in (0, 1/2)$, then:
• If $s_2 < 1/(4p) - 1/2$, there exists a distributed estimator that is adaptive over the range of regularities $[s_1, s_2]$, achieving the optimal rate and transmitting the minimal amount of bits.
• If $s_1 > 1/(4p) - 1/2$, however, then there exists no distributed procedure that achieves the optimal rate for every signal $f$ with regularity in $\{s_1, s_2\}$ while transmitting the minimal amount of bits.
Stated differently, when considering the $L_2$-risk, adaptively achieving the optimal rate using minimal communication over a range of regularities $[s_1, s_2]$ is possible if and only if $(2 + 4s_2)\log m < \log n$.
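The condition above is easy to check numerically. The following snippet (not from the paper; the concrete values of $n$, $m$ and $s_2$ are illustrative) simply restates the displayed inequality:

```python
import math

def adaptation_feasible(n, m, s2):
    """Restates the condition (2 + 4*s2) * log(m) < log(n) under which,
    per the discussion above, adaptive rate-optimal estimation with
    minimal communication is possible for the L2-risk."""
    return (2 + 4 * s2) * math.log(m) < math.log(n)

# With m = n^p the condition reads p*(2 + 4*s2) < 1, i.e. s2 < 1/(4p) - 1/2.
n = 10 ** 8
print(adaptation_feasible(n, m=round(n ** 0.1), s2=1.0))  # True  (p = 0.1)
print(adaptation_feasible(n, m=round(n ** 0.4), s2=1.0))  # False (p = 0.4)
```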
This shows that it is problematic if either the number of machines is too large or the range of regularities to which adaptation is required is too large. The adaptive, minimal-communication procedure that we propose in the first case implicitly exploits the fact that for the $L_2$-risk there is a difference between lower bounds for estimation and testing, see for instance [12, 15]. Indeed, we employ the testing result of [9] to extract sufficient information about the regularity of the unknown signal in the local servers, which we then use in the subsequent estimation procedure. This approach depends crucially on the fact that we consider the $L_2$-risk. For the $L_\infty$-risk there is no difference between testing and estimation rates and this approach breaks down. In fact, we prove that for the $L_\infty$-norm, adaptive estimation at the optimal rate under minimal communication is never possible!
The impossibility results all derive from the fact that in the local servers the sample size is too small to extract sufficient information about the regularity of a general signal. This suggests that if we restrict to a class of "nice" signals for which we do have access to such smoothness information from limited data, we should be able to obtain optimal rates and minimal communication adaptively. We prove that this is indeed the case if we consider the class of self-similar functions, first introduced in [5] in the context of nonparametric confidence regions, where closely related issues occur. See also for instance [6, 7, 13, 20, 22, 23].
The remainder of the paper is organized as follows. In the next section we first present the minimax lower bounds under communication restrictions, which show that if we want to attain the optimal rate $n^{-s/(1+2s)}$ for estimating $s$-smooth functions in the distributed setting, we need to transmit at least order $n^{1/(1+2s)}$ bits from the local machines to the central one. For completeness we show that it is easy to obtain the optimal rate under minimal communication if $s$ is known. We also prove that if it is assumed that $s$ belongs to some known range $(s_0, s_{\max})$, then adaptation to smoothness over that range is possible while transmitting order $n^{1/(1+2s_0)}$ bits. After this we present our main results. Theorems 2.4 and 2.5 and Corollary 2.6 assert that whether simultaneous adaptation over a range of regularities and minimal communication is possible for the $L_2$-risk depends on the relation between the range of regularities and the number of local machines. Theorem 2.7 shows that simultaneous adaptation and minimal communication is not possible when the $L_\infty$-risk is considered. Finally, Theorem 2.8 asserts that it is possible under a self-similarity assumption. Proofs and auxiliary results are deferred to Sections 3–5 and the appendices.

2. Main results.
In our analysis we work with the distributed Gaussian white noise model, also considered for instance in [24] and [28], which can be seen as an idealized version of the nonparametric regression model. Our results can in principle be derived in the regression context as well, similarly to what we did in [25]. However, since the additional technical issues would seriously lengthen the already long paper and would add no fundamental insight, we formulate everything in the signal-in-white-noise setting.
We assume that we have $m$ machines and that in the $i$th machine we observe the random function $X^{(i)}_t$ given by the stochastic differential equation
$$dX^{(i)}_t = f_0(t)\,dt + \sqrt{m/n}\,dW^{(i)}_t, \qquad t \in [0,1], \quad i = 1, \ldots, m, \tag{2.1}$$
where $W^{(1)}, \ldots, W^{(m)}$ are independent standard Wiener processes and $f_0$ is the unknown function of interest. It is common to assume that the unknown true function $f_0$ belongs to some regularity class. In our analysis we work with Besov smoothness classes; more specifically, we assume that $f_0 \in B^s_{2,\infty}(L)$ or $f_0 \in B^s_{\infty,\infty}(L)$, see Appendix B for a rigorous introduction of these smoothness classes. The first class is of Sobolev type, while the second one is of Hölder type.
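To make the model concrete, the following sketch simulates its sequence-space version: each machine observes every wavelet coefficient of $f_0$ corrupted by independent Gaussian noise at the local noise level $\sqrt{m/n}$. The signal, its decay exponent, and all parameter values are illustrative choices, not from the paper.

```python
import numpy as np

def simulate_machines(f0_coeffs, m, n, rng):
    """Sequence-space view of the distributed white noise model: machine i
    observes every coefficient of f0 plus iid N(0, m/n) noise,
    independently across the m machines."""
    sd = np.sqrt(m / n)
    return [f0_coeffs + sd * rng.standard_normal(f0_coeffs.size)
            for _ in range(m)]

rng = np.random.default_rng(0)
# Hypothetical s-smooth signal: coefficients of size 2^{-j(s+1/2)} at level j.
s, J = 1.0, 8
f0 = np.concatenate([2.0 ** (-j * (s + 0.5)) * np.ones(2 ** j)
                     for j in range(J)])
local = simulate_machines(f0, m=20, n=2 ** 14, rng=rng)
# Averaging all m machines recovers the global noise level sqrt(1/n):
resid = np.mean(local, axis=0) - f0
print(resid.std())  # roughly (1/n)^{1/2} = 2^{-7}
```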
Parallel to each other, the local machines carry out a local statistical procedure and transmit the results to the central machine, which provides the final inference about the functional parameter of interest $f_0$ by somehow aggregating the local outcomes. There are, however, constraints on the communication between the local and global machines. Local machine $i$ is allowed to send at most $B^{(i)}$ bits (on average) to the central machine. The central machine then collects the transmitted bits from the local computers and combines them into a global, aggregated answer. More formally, for a target function class $\mathcal{F}$, we write $\hat f_n \in \mathcal{F}_{\mathrm{dist}}(B^{(1)}, \ldots, B^{(m)}; \mathcal{F})$ if $\hat f_n$ is a measurable function of binary messages $Y^{(1)}, \ldots, Y^{(m)}$ sent from the local machines whose lengths satisfy $E_{f_0}|Y^{(i)}| \le B^{(i)}$ for every $i$ and every $f_0 \in \mathcal{F}$. For simplicity, we will focus on the case $B^{(1)} = \cdots = B^{(m)}$, i.e. that the communication restriction is the same for every local machine.
2.1. Distributed minimax rates. As a first step we give lower bounds for the minimax risk for the $L_2$-norm. We assume that in each local machine we have the same communication budget, i.e. $B^{(1)} = \cdots = B^{(m)} = B$. The corresponding minimax $L_2$ estimation rates are given in Theorem 2.1 below; see also [25, 28].

Proof. See Section A.1.
The result shows that it is indeed only possible to obtain the optimal rate $n^{-s/(1+2s)}$ over Besov balls of regularity $s$ if, up to a logarithmic factor, every machine is allowed to transmit order $n^{1/(1+2s)}$ bits to the central machine.
The following theorem shows that this result is indeed sharp (up to log-factors), i.e. if order $n^{1/(1+2s)}$ bits are allowed, then the optimal rate can indeed be achieved by some procedure. In fact, the theorem considers the first two cases of the preceding one, i.e. $(n\log n/m^{2+2s})^{1/(1+2s)} \le B$. The third case is not interesting, since in that case distributed methods do not perform better than a standard technique applied on a single local server.
Proof. See Section A.2.

One can also derive similar matching lower and upper bounds for the $L_\infty$-norm for $f_0 \in B^s_{\infty,\infty}(L)$ in the Gaussian white noise model, as in [25], where the nonparametric regression model was considered. Since our focus in this paper is not on deriving minimax rates, we have deferred this result to Section A.3 in the appendix.

2.2. Simultaneous adaptation to smoothness and minimal communication.
In view of the preceding two theorems we can conclude that when the goal is to estimate $s$-smooth functions at the rate $n^{-s/(1+2s)}$, the optimal, minimal number of transmitted bits is $n^{1/(1+2s)}$ (up to a logarithmic factor). Transmitting fewer bits results in a (polynomially) sub-optimal convergence rate for any distributed method, while by transmitting at least the optimal number of bits one can construct distributed estimators reaching the convergence rate of non-distributed techniques.

2.2.1.
Adaptation in $L_2$. The procedure $\hat f$ exhibited in (the proof of) Theorem 2.2 has the desirable property that if $f_0 \in B^s_{2,\infty}(L)$, then, up to log-factors, it achieves the optimal rate $n^{-s/(1+2s)}$ using the minimal communication. This procedure is, however, not adaptive: it uses knowledge of the regularity level $s$ of the unknown function. In this section we investigate the more relevant question under which conditions we can simultaneously achieve the optimal convergence rate and minimal communication without using any information about the smoothness of the truth.
If we are willing to assume that the true regularity satisfies $s \ge s_0$ for some known $s_0 > 0$, and are in addition willing to allow order $n^{1/(1+2s_0)}$ bits to be communicated between the local and the central machines, then it is straightforward to achieve adaptation to smoothness.

Proposition 2.3. Let $s_{\max} > s_0 > 0$, $L > 0$, $m \le n$, and $B_0 = n^{1/(1+2s_0)}\log n$. Then there exists a distributed estimator that transmits at most a multiple of $B_0$ bits per machine and attains, up to logarithmic factors, the rate $n^{-s/(1+2s)}$ simultaneously for every $s \in (s_0, s_{\max})$.

Proof. See Section 3.1.

The problem with the above method is that it always transmits a multiple of $n^{1/(1+2s_0)}\log n$ bits, which can be substantially more than the optimal $n^{1/(1+2s)}$ if the true smoothness $s$ happens to be larger than the assumed lower bound $s_0$. The question naturally arises: is it possible to achieve adaptation to smoothness while at the same time automatically transmitting the minimal amount of bits?
We show that in the case of the $L_2$-norm one can only adapt over a limited range of smoothness levels (depending on the number of local machines), and beyond that one incurs rates that are sub-optimal by a polynomial factor.
Theorem 2.4. Suppose that $m = n^p$ for some $p \in (0, 1/2)$. Then for any regularity parameters $s_2 > s_1 > 1/(4p) - 1/2$ there does not exist a distributed method which transmits the minimal number of bits and at the same time achieves the minimax risk, i.e. there is no distributed method with $\sup_{i \in \{1,\ldots,m\}} B^{(i)} \le n^{1/(1+2s_1)+\varepsilon_1}\log n$ which, for both $l = 1, 2$, attains a risk over $B^{s_l}_{2,\infty}(L)$ within a factor $n^{\varepsilon_2}$ of the minimax rate, for some small enough constants $\varepsilon_1, \varepsilon_2 > 0$ depending only on $s_1$, $s_2$ and $p$.
The above theorem tells us that even considering just two regularity classes (with regularities above some threshold level), there does not exist any distributed method which transmits the optimal number of bits multiplied by some (small) polynomial factor and reaches the minimax rate in both smoothness classes up to a (small) polynomial factor. This negative result delivers a strong message, as the non-existence cannot be resolved by allowing extra logarithmic factors, but holds on the polynomial level.
The phenomenon behind the negative result is that with many local machines (large $m$) it becomes more difficult to test locally between the regularity classes (as the local "sample size" decreases in $m$), and moreover the "local regularity" of the function, which one can judge at noise level $m/n$, might be completely different from the "global regularity" of the truth, which can be judged at the smaller noise level $1/n$. Although full adaptation is not possible, it turns out that on a limited range of regularity levels it is possible to construct adaptive methods. Below we derive the complement of the preceding result and show that for regularities below the threshold $1/(4p) - 1/2$ we can adapt to smoothness and transmit the minimal number of bits at the same time.
The proposed procedure has two stages. First we "estimate" the smoothness of the underlying functional parameter of interest in every local machine, parallel to each other, and based on that transmit the right amount of information to the central machine. In the second stage we aggregate the locally transmitted information and provide a "global" adaptive estimator. The difficulty, as also discussed above, arises from the higher noise level in the local problems, which results in less accurate tests between the smoothness classes. The existence of an estimator which can achieve adaptation (in a limited range of smoothness classes) is a consequence of the difference between the nonparametric testing and estimation rates in the case of the $L_2$-norm, see for instance [12, 15]. Since one can test between smoothness classes at a faster rate than the corresponding estimation rate, this can compensate (up to some extent) for the higher local noise level $m/n$.
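The two-stage procedure just described can be caricatured in code. The sketch below is only a toy version under simplifying assumptions: the local test of [9] is replaced by a crude slope estimate of the level energies, and all constants, thresholds and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical s-smooth signal: 2^j coefficients of size 2^{-j(s+1/2)} at
# level j, so the signal energy at level j decays like 2^{-2js}.
s_true, J, m, n = 1.0, 7, 16, 2 ** 18
f0 = np.concatenate([2.0 ** (-j * (s_true + 0.5)) * np.ones(2 ** j)
                     for j in range(J)])
noise_sd = np.sqrt(m / n)                 # local noise level sqrt(m/n)
local = [f0 + noise_sd * rng.standard_normal(f0.size) for _ in range(m)]

def local_smoothness(x, noise_sd, J):
    """Stage 1 (a crude stand-in for the test of [9]): regress log2 of the
    noise-corrected level energies on the level index, using only levels
    that rise clearly above the local noise level, and read off s from the
    slope (energy ~ 2^{-2js})."""
    js, logs, lo = [], [], 0
    for j in range(J):
        e = np.sum(x[lo:lo + 2 ** j] ** 2) - 2 ** j * noise_sd ** 2
        lo += 2 ** j
        if e > 10 * np.sqrt(2 * 2 ** j) * noise_sd ** 2:  # reliably observed
            js.append(j)
            logs.append(np.log2(e))
    return -np.polyfit(js, logs, 1)[0] / 2

# Stage 2: each machine reports a smoothness estimate; the center takes the
# median cutoff N = n^{1/(1+2s_hat)} and averages the transmitted coefficients.
s_hats = [local_smoothness(x, noise_sd, J) for x in local]
N = int(np.median([n ** (1.0 / (1 + 2 * sh)) for sh in s_hats]))
f_hat = np.zeros_like(f0)
f_hat[:N] = np.mean([x[:N] for x in local], axis=0)
print(round(float(np.median(s_hats)), 2), N)
```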
The preceding result can be extended to a scale of smoothness classes as well.
The idea of the proof of this corollary is to introduce a grid of regularities in the interval [s 1 , s 2 ] and test between which two grid points the true regularity lies.Then one can apply the distributed method introduced in the proof of Theorem 2.5 to derive the stated results.

2.2.2.
Adaptation in $L_\infty$. Next we deal with the $L_\infty$-norm case. Here we show that, in contrast to the $L_2$-case, adaptation is not possible even on a limited range of smoothness classes. The reason is that in this case the minimax testing and estimation rates coincide and hence there is no room left to compensate for the higher local noise level.
Proof. See Section 4.1.

Next we introduce an additional restriction on the true function of interest under which adaptation is possible in the distributed setting. To do so we consider the so-called self-similarity assumption, where, loosely speaking, we assume that the true function has similar smoothness at every resolution level. This will allow us to estimate the regularity $s$ of the functional parameter of interest and therefore to transmit the right amount of bits from the local machines to the central one.
We first introduce the necessary notation. Let $\psi_{jk}$ be the wavelet basis functions described in Appendix B. For $f \in L_2[0,1]$ and natural numbers $j_1 \le j_2$ we define the block of the wavelet expansion of $f$ between resolution levels $j_1$ and $j_2$, and, following [5], we say that the function is self-similar if every such block carries comparable information about its regularity. The self-similarity property was introduced (amongst other places) in the context of adaptive confidence bands. It was shown that under self-similarity one can construct adaptive $L_\infty$ confidence bands whose size also adapts to the level of regularity, see for instance [5, 13, 19]. The underlying idea is the same as here: under this assumption one can provide a consistent estimator for the smoothness and, based on that, construct the band corresponding to the function class.
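Loosely, self-similarity asks that each resolution level (or block of levels) carries energy compatible with the same regularity $s$. The toy check below is illustrative only; the constants and the exact form of the condition differ from the precise definition in [5].

```python
def is_self_similar(level_energies, s, c_lo=0.5, c_hi=2.0):
    """Toy version of a self-similarity check: every resolution level j must
    carry energy within constant factors of the envelope 2^{-2js}.  The
    constants and the level-by-level form are illustrative; the precise
    definition (following [5]) compares blocks of levels."""
    env = [2.0 ** (-2 * j * s) for j in range(len(level_energies))]
    return all(c_lo * e <= x <= c_hi * e
               for x, e in zip(level_energies, env))

energies = [2.0 ** (-2 * j) for j in range(8)]   # exactly matches s = 1
print(is_self_similar(energies, s=1.0))          # True
energies[5] = 0.0                                # a "hole": level w/o energy
print(is_self_similar(energies, s=1.0))          # False
```

A function failing such a check can hide its roughness at levels that are invisible at the local noise level, which is exactly what drives the negative results above.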
The following theorem shows that under the self-similarity assumption there exists a distributed method which adapts to regularity and at the same time transmits the minimal amount of bits (again up to logarithmic factors).
Theorem 2.8. Consider the distributed Gaussian white noise model with $m \le n^\delta$ for some $\delta \in (0, 1)$, and assume that $f_0 \in B^s_{\infty,\infty}(L)$ is self-similar for some $s \in [s_1, s_2]$ (where $0 < s_1 < s_2$ are arbitrary). Then there exists a distributed method for which the number of transmitted bits is minimal up to a logarithmic factor and which attains the minimax rate.

Proof. See Section 4.2.

3. Proofs for the adaptation results. In the proofs we work with the wavelet decomposition of the functional parameter $f_0$. In our analysis we consider the Daubechies wavelets $\psi_{jk}(t)$, for $j = 0, 1, \ldots$, $k = 1, \ldots, 2^j$, $t \in [0,1]$, and denote by $f_{0,jk} = \int_0^1 \psi_{jk}(t) f_0(t)\,dt$ the corresponding wavelet coefficients. In Section B we have collected a few properties of Daubechies wavelets which we apply throughout the proofs.
We note that, following from the orthonormality of the Daubechies wavelets, the Gaussian white noise model can be written in the sequence representation
$$X^{(i)}_{jk} = f_{0,jk} + \sqrt{m/n}\,\varepsilon^{(i)}_{jk}, \tag{3.1}$$
where the $\varepsilon^{(i)}_{jk}$ are iid standard normal random variables.
Averaging the transmitted approximations over the $m$ local machines, the central machine has access to coefficients of the form
$$\bar Y_{jk} = f_{0,jk} + n^{-1/2} Z_{jk} + \varepsilon_{jk},$$
where the $Z_{jk}$ are iid standard Gaussian random variables and the $|\varepsilon_{jk}| \le n^{-1/2}$ are random variables representing the error arising from transmitting only the first $0.5\log n$ digits of the observations. These error terms are in fact negligible. Then, using an arbitrary adaptation technique (for instance Lepski's method [18]), one can construct an estimator $\hat f$ achieving the minimax risk for every $f \in B^s_{2,\infty}(L)$, $s_0 \le s \le s_{\max}$.
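Lepski's method, mentioned here as one possible adaptation technique, can be sketched as follows for choosing the truncation level in the sequence model. This is an illustrative implementation with ad hoc thresholds and parameter values, not the construction used in the paper.

```python
import numpy as np

def lepski_cutoff(y, sigma, J):
    """Lepski-type choice of the truncation level: take the smallest level j
    such that the projection estimators at all finer levels j' > j stay
    within (an illustrative multiple of) their noise radius of the one at
    level j."""
    n_coef = lambda j: 2 ** (j + 1) - 1     # number of coefficients up to level j
    for j in range(J):
        ok = True
        for jp in range(j + 1, J):
            # Squared distance between the projection estimators at levels
            # j and jp is the energy of y on levels j+1..jp.
            diff = np.sum(y[n_coef(j):n_coef(jp)] ** 2)
            if diff > 4 * sigma ** 2 * (n_coef(jp) - n_coef(j)):
                ok = False
                break
        if ok:
            return j
    return J - 1

rng = np.random.default_rng(2)
s, J, sigma = 1.0, 10, 2.0 ** (-7)          # noise level ~ n^{-1/2}
f0 = np.concatenate([2.0 ** (-j * (s + 0.5)) * np.ones(2 ** j)
                     for j in range(J)])
y = f0 + sigma * rng.standard_normal(f0.size)
print(lepski_cutoff(y, sigma, J))  # balances squared bias ~ 2^{-2js} vs variance
```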
3.2. Proof of Theorem 2.4. We argue by contradiction. We assume that the inequalities (2.2) and (2.3) hold. Then we construct a finite but large enough set $\mathcal{F}_0 \subset B^{s_1}_{2,\infty}(L)$ such that there does not exist a consistent test between the elements of the set and the zero function, which clearly belongs to the smoother class $B^{s_2}_{2,\infty}(L)$. Using this non-existence result we arrive at a contradiction with our assumptions.
As a first step we construct the set $\mathcal{F}_0$. Let us introduce the constants $\varepsilon_3 \in \big(0, \frac{p(1+2s_1)-1/2}{1/2+2s_1}\big)$, where $p(1+2s_1) - 1/2 > 0$ follows from the assumption $s_1 > 1/(4p) - 1/2$, and $\varepsilon_1$ chosen accordingly. In view of the definition of $\delta_n$ this yields a polynomial-in-$n$ lower bound on $\delta_n$, and therefore we can conclude that, for large enough $n$, the elements $f \in \mathcal{F}_0$, defined through their wavelet coefficients, belong to $B^{s_1}_{2,\infty}(L)$ and, in view of the definition of $\delta_n$, are sufficiently separated from zero. Next we take the average likelihood ratio over the class $\mathcal{F}_0$. In view of (6.23) of [12], the testing error in the local problems is bounded from below for every $\eta_n \in (0, 1)$, where the infimum is taken over all local tests. Furthermore, following the steps of the proof of Theorem 6.2.11 c) on pages 493–494 of [12] (with $\gamma'_{n/m} = c_0^2 (n/m)\delta_n$), we get a lower bound on the testing error for some large enough constant $C > 0$, concluding the proof of the non-existence of consistent tests between $\mathcal{F}_0$ and the zero function.
Next we show that (3.6) contradicts our assumptions. Let us define the corresponding test. First note that, by Markov's inequality and assumption (2.2), and therefore in view of (3.6) and the assumption on $B^{(i)}$, the expected number (with respect to the joint distribution of the variables $F$ and $P_f$, $f \in \mathcal{F}_0$) of transmitted bits on the class $\mathcal{F}_0$ is bounded from above by a multiple of $\beta_n$. So the distributed estimator satisfies assertion (A.7) in the proof of Theorem A.1 with $B^{(i)}$ replaced by $C\beta_n$. Hence, in view of the minimax lower bound derived in assertion (A.9) and the definition of $\delta_n$ (with $B^{(i)}$ replaced by $\beta_n$ in the definition of $\delta_n$ in the proof of Theorem A.1), with $\varepsilon_2 = 2\varepsilon_5 s_1/(1+2s_1)$, the risk exceeds the assumed bound, where the last inequality follows from (3.3). This contradicts assumption (2.3), finishing the proof of our statement.

3.3.
Proof of Theorem 2.5. In our proof we work with the equivalent sequence representation (3.1) of the model. As a first step we split the data in each of the local models $i \in \{1, \ldots, m\}$ into two subsets $X^{(i,1)}_{jk}, X^{(i,2)}_{jk}$, $j = 0, 1, 2, \ldots$, $k = 1, \ldots, 2^j$, such that they are pairwise independent and their variance is $2m/n$ (this can be done by adding and subtracting independent Gaussian noise $Z^{(i)}_{jk}$). Let us then denote by $P_{X^{(i,1)}}$ and $P_{X^{(i,2)}}$ the distributions of the first and second subset of observations, respectively, and by $P_{X^{(i,2)}|X^{(i,1)}}$ the conditional distribution of the second subset given the first. The corresponding expected values are denoted by $E_{X^{(i,1)}}$, $E_{X^{(i,2)}}$, and $E_{X^{(i,2)}|X^{(i,1)}}$, respectively. Finally, let us introduce the notations $X^l = (X^{(1,l)}, \ldots, X^{(m,l)})$, $l = 1, 2$, and denote by $P_{X^l}$ and $E_{X^l}$ the corresponding probability distributions and expected values.
Next, note that it was shown in [9] that there exists a consistent composite test between the classes $B^{s_2}_{2,\infty}(L)$ and $B^{s_1}_{2,\infty}(L)$ in the local problem, using the first subset of observations $X^{(i,1)}$, provided the classes are at least $(n/m)^{-s_1/(1/2+2s_1)}$ separated. The test proposed in Section 3 of [9] (carried out in the local machines using the first subset of observations $X^{(i,1)}$) is based on the projections $\Pi_l f$ of the function $f$ to the resolution levels $l > 0$, where $\Pi_l \hat f^{(i)}_{n/m}$ is the wavelet estimate of $f$ in the $i$th local machine using the observations $X^{(i,1)}$, see the top of page 6 of [9], and $z_0 = 1$ (since for notational convenience we take $J_0 = 0$, see Section B, we have $z_0 = 2^{J_0} = 1$). In view of Lemma 5.4 we have, for all $\alpha \in (0, 1)$ and $0 < m \le n$, the corresponding bound on the testing error, with a constant $c$ not depending on $\alpha$, $n$, $m$.
with the corresponding separation sequence tending to infinity (where the positivity of the exponent follows from the assumption $s_1 < 1/(4p) - 1/2$), there exists a consistent test $\Psi$. Using this test function, we define the local smoothness estimate $\hat s^{(i)}_{n/m}$. In each local model we transmit the first $n^{1/(1+2\hat s^{(i)}_{n/m})}$ approximated coefficients, and note that $P_{X^2}(E^c) \le n^2 e^{-c'n} \le e^{-cn}$ for any $0 < c < c'$, hence the number of transmitted bits conditioned on the first subsample is of the right order. Let us denote by $\tilde N$ the median of the values $n^{1/(1+2\hat s^{(i)}_{n/m})}$, $i = 1, \ldots, m$, and by $\hat s$ the corresponding regularity estimator. Then we construct our estimator $\hat f$ as the average of the transmitted observations (for the first $\tilde N$ coefficients), i.e. $\hat f_{n,jk} = \mathrm{mean}\{Y^{(i)}_{jk} : i \in M_{jk}\}$, where $M_{jk}$ is the collection of local machines satisfying $2^j + k \le n^{1/(1+2\hat s^{(i)}_{n/m})}$, i.e. the machines from which the local approximations $Y^{(i)}_{jk}$ are transmitted. We show that this procedure achieves the minimax convergence rate and transmits the optimal amount of bits (up to a logarithmic factor). First note that $B^{(i)} \lesssim n^{1/(1+2s_1)}\log n$ follows immediately by construction, verifying that the number of transmitted bits is indeed optimal.
Next we provide optimal upper bounds for the risk. First let us consider the machines in which the local test selects the smaller regularity. Let us introduce the notation $M$ for the number of such machines in $\{1, \ldots, m\}$, and note that $M$ has a binomial distribution. Then, by Hoeffding's inequality and in view of the almost sure inequality $\tilde N \le n^{1/(1+2s_1)}$, we have that
$$E\tilde N \le n^{1/(1+2s_2)} + n^{1/(1+2s_1)} e^{-m/5} \le (1 + o(1))\, n^{1/(1+2s_2)},$$
for $m \ge 5\log n \ge 10\frac{s_2 - s_1}{(2s_1+1)(2s_2+1)}\log n$. Then, similarly to the proof of Theorem 2.2 (with $m$ replaced by $|M_{jk}|$), we get on the set $E$ the corresponding bound for all $i, j, k$. Using this reformulation of the estimator and the notation $\hat j_n = \lfloor \log \tilde N \rfloor$, we obtain the stated bound on the supremum risk.
It remains to deal with the intermediate set.
3.4. Proof of Corollary 2.6. We adapt the method and proof of Theorem 2.5 to the collection of regularity classes $s_0 \in [s_1, s_2]$, where $s_0$ denotes the regularity of the truth we want to adapt to. Similarly to the discrete case, we divide the data in each machine into two independent samples $X^{(i,1)}$ and $X^{(i,2)}$. Let $S_n$ denote a $1/\log n$-grid of the interval $[s_1, s_2]$, i.e. $S_n = \{s_1, s_1 + 1/\log n, \ldots, s_2\}$, and denote by $\underline{s} = s_1 + \gamma_n/\log n$, for some $0 \le \gamma_n \le \lceil (s_2 - s_1)\log n \rceil$, $\gamma_n \in \mathbb{N}$, the lower bound of the $1/\log n$-bin containing $s_0$, i.e. $s_0 \in [\underline{s}, \underline{s} + 1/\log n]$. We next describe a testing procedure for the regularity hyper-parameter $s_0$. Let us compute the tests $\Psi^{(i)}_{n/m}(M^{-1}_{n,t}, t, s)$ for all $t < s$, $s, t \in S_n$, and take $\hat s^{(i)}_{n/m}$ to be the largest regularity $s$ for which the null hypothesis was retained for every $t < s$.
The aggregated regularity estimator $\hat s$ and the distributed estimator $\hat f$ are then constructed in the same way as in the proof of Theorem 2.5, using the above defined $\hat s^{(i)}_{n/m}$. The probability of undersmoothing is bounded from above by $(\gamma_n - 1)^2 \le (s_2 - s_1)^2 \log^2 n$ times the probability of rejecting the correct null hypothesis. Hence, in view of assertion (3.8) and the monotone decreasing property of the function $s \mapsto M_{n,s}$, we obtain for all $i \in \{1, \ldots, m\}$, similarly to assertions (3.10) and (3.11), bounds of order $e^{-m/5}$ for $m \ge 5\log n$.
It remains to show that our procedure adapts to the minimax risk. First note that, in view of assertions (3.13) and (4.1), the risk splits into three terms, which we deal with separately. In view of assertion (4.1) and $\|f_0\|_2^2 \le L^2$, the contribution of the regularities $s < \underline{s}$ is negligible, and the term corresponding to the correct bin is easily bounded. Then, for arbitrary $s > \underline{s}$, $s \in S_n$, Hoeffding's inequality gives the required bound. Combining the upper bounds above we obtain the stated result, concluding the proof of the corollary.
4.1. Proof of Theorem 2.7. The proof follows the same lines of reasoning as the proof of Theorem 2.4; here we highlight only the differences.
First of all, the set of functions $\mathcal{F}_0$ is defined slightly differently. Let us introduce the notation $\bar\delta_n = \delta_n \wedge (m/n)$, with (4.4). By elementary computations one can deduce that $\delta_n \ge n^{\varepsilon_1/2 - 1}$ and therefore $\bar\delta_n \ge n^{(\varepsilon_1/2 \wedge p) - 1}$. (4.5) Next, let us denote by $K_j$ the largest set of Daubechies wavelets with disjoint supports at resolution level $j$. Note that $|K_j| \ge c_0 2^j$ (for large enough $j$ and sufficiently small $c_0 > 0$). Then we consider the class of functions built on these wavelets. Since the functions in $\mathcal{F}_0$ have disjoint supports, the required sup-norm separation follows from the definition of $\bar\delta_n$. Hence it is not possible to test between the zero function and the set $\mathcal{F}_0$ in the local servers.
Using the notation $Z$ for the likelihood ratio introduced in the proof of Theorem 2.4, we note that in view of the proof of Theorem 6.2.11 b) on page 493 of [12] the testing error is again bounded from below. This means that the expected number (with respect to the joint distribution of the variables $F$ and $P_f$, $f \in \mathcal{F}_0$) of transmitted bits on the class $\mathcal{F}_0$ is bounded from above by a multiple of $\beta_n$. So the distributed estimator satisfies assertion (A.7) with $B^{(i)}$ replaced by $C\beta_n$. Hence, in view of the minimax lower bound derived in assertion (A.13) (with $B^{(i)}$ replaced by $\beta_n$ in the definition of $\delta_n$ in the proof of Theorem A.3) and the definition of $\bar\delta_n$, the risk exceeds the assumed bound, where the last inequality follows from (4.5). This contradicts assumption (2.5), finishing the proof of our statement.

4.2. Proof of Theorem 2.8. First note that in Lemma 5.2 of [5] it was shown that the smoothness can be consistently estimated under the self-similarity condition, i.e. there exists an estimator $\hat s^{(i)}_{n/m}$ such that for every $i \in \{1, \ldots, m\}$ and $c > 0$ there exists $C > 0$ satisfying the corresponding concentration bound. By choosing $c = 1/(1-p)$ we have $(m/n)^c = 1/n$. Then we propose an estimation method similar to that of Theorem 2.5. First we split the data into $X^{(i,1)}$ and $X^{(i,2)}$ and use the first sample $X^{(i,1)}$ to construct the estimator $\hat s^{(i)}_{n/m}$ for the smoothness parameter $s$. Next we transmit the approximations of the first $\tilde N^{(i)}$ coefficients (as in Theorem 2.5) of the second subset of observations $X^{(i,2)}$, following Algorithm 1. Then $B^{(i)} \le (n/\log n)^{1/(1+2s_1)}\log n$, and the median $\tilde N$ of the values $\tilde N^{(i)}$ satisfies the corresponding bounds for some large enough constants $C_1, C_2 > 0$.
Similarly to before, let $\hat j_n = \lfloor \log \tilde N \rfloor$ and $f_{0, j \le \hat j_n} = \sum_{j \le \hat j_n} \sum_{k=1}^{2^j} f_{0,jk}\psi_{jk}$. Then, using the notation $E$ introduced in (3.9), we obtain the corresponding risk decomposition, and in view of (4.8) this concludes the proof of our statement.
5. Technical lemmas. The first lemma slightly extends Shannon's source coding theorem by also allowing non-prefix codes, see Lemma 5.1 of [25].
Lemma 5.1. Let $Y$ be a random finite binary string. Then its expected length satisfies the corresponding entropy inequality.

Let us take an arbitrary $x \in \mathbb{R}$ and write it in scientific binary representation. Then let us take $y$ consisting of the same digits as $x$ up to the $(D\log_2 n)$th digit after the binary dot, for some $D > 0$ (and truncated there), unless $|x|$ exceeds $\sqrt{n}$, in which case we set $y$ to zero; see also Algorithm 1, a slightly modified version of Algorithm 1 from [25]. In the algorithm the function $x \mapsto \mathrm{sign}(x)$ is one if $x \ge 0$ and zero otherwise.
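The coding step just described can be sketched as follows. This is a simplified stand-in for Algorithm 1 (the header bits of the scientific representation and the exact overflow rule are omitted); the parameter $D$ and the rounding convention are illustrative.

```python
import math

def quantize(x, n, D=1.0):
    """Sketch of the coding step: keep the sign of x and its first
    D*log2(n) binary digits after the binary dot (truncating there);
    values of magnitude at least sqrt(n) are mapped to zero."""
    if abs(x) >= math.sqrt(n):
        return 0.0
    digits = int(D * math.log2(n))
    scale = 2 ** digits
    return math.copysign(math.floor(abs(x) * scale) / scale, x)

n = 2 ** 20          # with D = 1 we keep 20 binary digits
x = math.pi / 10
y = quantize(x, n)
print(abs(x - y) < 2 ** -20)   # True: truncation error below the resolution
```

With $D = 1/2$ this reproduces the approximation error of order $n^{-1/2}$ used in the proof of Proposition 2.3.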
The next lemma gives an upper bound on the number of transmitted bits and on the accuracy of the procedure described in Algorithm 1. It is a slightly reformulated version of Lemma 2.3 of [25], accommodating an almost sure upper bound on the code length.

Lemma 5.2. For $X \sim N(\mu, \sigma^2)$ with $|\mu| \le M$ and $\sigma \le 1$, let $Y$ be the approximation of $X$ given in Algorithm 1 and denote by $E_X$ the event that $|X| \le \sqrt{n}$. Then, for large enough $n$, the stated bounds hold for some $c > 0$.
Proof. It is straightforward to see that the last two inequalities of the statement hold; the first follows by direct computation.

Next we provide an extended version of Lemma 4.2 of [9] with tighter upper bounds for small $\Delta > 0$. The main difference in the proof is that instead of Chebyshev's inequality we apply a more accurate concentration inequality, see Lemma 8.1 of [3].
for $c = 3/2$ and $z_0 = 2^{J_0}$ the number of father wavelets (at resolution level $J_0$), and $\Pi_l f = \sum_{k=1}^{2^l} f_{lk}\psi_{lk}$ the projection of $f$ onto wavelet resolution level $l$.
Proof. Note that for the wavelet estimator $\hat f$ with signal-to-noise ratio $n$ we get that $\|\Pi_l \hat f\|_2^2 = \sum_k \hat f_{lk}^2$, where $\hat f_{lk} - f_{lk} \overset{iid}{\sim} N(0, 1/n)$. Hence, in view of Lemma 8.1 of [3] (with degrees of freedom $D = 2^l$, noncentrality parameter $B = n\sum_{k=1}^{2^l} f_{lk}^2$ and $x = 1/(2\sqrt{\delta_l})$), we get the stated concentration bound for $\delta_l \le 1/4$. By the definition of $T_n(l)$ and a union bound, these results imply the required inequality. Setting, similarly to Lemma 4.2 of [9], the parameters $\delta_l = (2^{-(j-l)/2} + 2^{-l/4})\Delta/12$ and $\delta_{J_0} = \Delta/12$, and using $z_0 \ge 1$, concludes the proof of the lemma.
The next lemma is a slightly rewritten version of Theorem 3.1 of [9] with tighter error bounds (for small $\alpha > 0$).

Lemma 5.4. Let $\alpha > 0$. The test $\Psi_n(\alpha)$ satisfies the stated error bounds for all $\alpha > 0$ and $n > 0$.

Proof. The proof goes in the same way as that of Theorem 3.1 of [9], with the only difference that we apply Lemma 5.3 instead of Lemma 4.2 of [9].
We also recall a slight modification of Fano's inequality, see Corollary 1 of [11] or Theorem A.6 of [25]. Given a finite set $\mathcal{F}_0 \subset \mathcal{F}$, we use the notations below, where $\mathcal{E}(Y)$ denotes the set of all estimators depending only on $Y$ and the function class $\mathcal{F}$, and $F$ is a uniformly distributed random variable on $\mathcal{F}_0$.
The next lemma gives an upper bound for the mutual information between the uniform random variable $F$ on $\mathcal{F}_0 \subset \mathbb{R}^d$ and the set of observations in all local machines $Y = (Y^{(1)}, \ldots, Y^{(m)})$ in the $d$-dimensional many-normal-means model.

Lemma 5.6. Let $F = \delta\beta$, with $\delta^2 \le 2^{-10} m/(n\log(md))$ and $\beta$ a uniformly distributed random variable over $\{-1, 1\}^d$. Furthermore, suppose that $X = (X^{(1)}, \ldots, X^{(m)})$, where the $X^{(i)}$ are $d$-dimensional random variables such that $X^{(i)}_j$ depends on $F$ only through $F_j$, and $F_j$ is independent of $F_{-j}$. Note that the inequality $I(X^{(i)}; Y^{(i)}) \le H(Y^{(i)})$ holds. Then, by plugging the above inequalities into (5.1), using the inequalities $e^x \le 1 + 2x$ for $x \le 0.4$ and $C^2 \le 2$, together with the data-processing inequality and the convexity of the KL divergence, we conclude the statement.

The next theorem provides an upper bound for the mutual information, see Theorem A.9 in [25] or Lemma 3 of [26].
Theorem 5.7. Let us consider the Markov chain $F \to X^{(i)} \to Y^{(i)}$, where $F$ is the uniform distribution on $F_0 \subset \mathbb{R}^d$, for a constant $C \ge 1$ and density $p(x_j \mid f_j)$. Then the stated bound holds, where $I(X^{(i)}; Y^{(i)})$ is the mutual information between $X^{(i)}$ and $Y^{(i)}$.

APPENDIX A: PROOFS FOR THE MINIMAX RATES IN THE GAUSSIAN WHITE NOISE MODEL
A.1. Proof of Theorem 2.1. The proof of the theorem follows from the following, more general theorem by taking $B^{(1)} = \ldots = B^{(m)} = B$. The proof is a slight extension, to a larger set of estimators, and an adaptation to the Gaussian white noise setting of the proof of Theorem 2.1 of [25].
Theorem A.1. Let the sequence $\delta_n = o(1)$ be defined as the solution to equation (A.1). Then in the distributed Gaussian white noise model (2.1), for any $s > 0$, the minimax risk over the class of distributed estimators $F_{dist}(B^{(1)}, \ldots, B^{(m)})$ is bounded from below as stated.

Proof of Theorem A.1. Note that without loss of generality we can multiply $\delta_n$ by an arbitrary constant. In the proof we define $\delta_n$ as the solution to (A.1). We note, however, that all the computations below hold for arbitrary $\delta'_n \le \delta_n$ as well.
We prove the desired lower bound for the minimax risk using a modified version of Fano's inequality, given in Theorem 5.5. As a first step we construct a finite subset $F_0 \subset B^s_{2,\infty}(L)$. We use the wavelet notation outlined in Appendix B and define $j_n = \lfloor (\log \delta_n^{-1})/(1+2s) \rfloor$. For $\beta \in \{-1,1\}^{2^{j_n}}$, let $f_\beta \in L_2[0,1]$ be the function with the wavelet coefficients given below. Therefore, for an arbitrary set of estimators $F$, the stated lower bound holds. To prove the statement of the theorem we take the set of distributed estimators $F = F_{dist}(B^{(1)}, \ldots, B^{(m)}; B^s_{2,\infty}(L))$, but the inequality holds more generally.
For this set of functions $F_0$, the maximum and minimum number of elements in balls of radius $t > 0$ can be bounded as stated. Recall the notations $X = (X^{(1)}, \ldots, X^{(m)})$ for the data available at the local machines and $Y = (Y^{(1)}, \ldots, Y^{(m)})$ for the binary messages transmitted to the central machine satisfying the distribution protocol, and consider the Markov chain $F \to X \to Y$, where $F$ is a uniform random element in $F_0$. It then follows from Theorem 5.5 that the corresponding risk bound holds. Hence, recalling that $2^{j_n} \asymp \delta_n^{-1/(1+2s)}$, to derive the statement of the theorem it is sufficient to show the inequality below. Observe that for the class of distributed estimators $F = F_{dist}(B^{(1)}, \ldots, B^{(m)}; B^s_{2,\infty}(L))$, by definition the stated inequality holds, and the aggregated estimator has wavelet coefficients given by $\hat f_{jk} = \mathrm{mean}\{Y^{(i)}_{jk} : i \in A_{jk}\}$, where $A_{jk}$ denotes the collection of machines transmitting the $(j,k)$th coefficient. The procedure is summarized as Algorithm 2.
Algorithm 2: Algorithm for the $L_2$-norm

In the algorithm described above each machine transmits approximations of at most $n^{1/(1+2s)} \wedge (B/\log n)$ noisy coefficients. Note that $|Y^{(i)}_{jk}| \le \sqrt{n}$ and $l(Y^{(i)}_{jk}) \le \log n$ outside the set $E$, where $E$ was defined in (3.9) and satisfies $P_X(E) \le e^{-cn}$ for some $c > 0$. Therefore we need at most $B$ bits to transmit $n^{1/(1+2s)} \wedge (B/\log n)$ coefficients, hence $\hat f \in F_{dist}(B, \ldots, B; B^s_{2,\infty}(L))$. Next, for convenience, we introduce the notation $A_{jk} = \{\lfloor \mu_{jk} m/\eta \rfloor + 1, \ldots, \lfloor (\mu_{jk}+1) m/\eta \rfloor\}$ for the collection of machines transmitting the $(j,k)$th coefficient and note that $\#(A_{jk}) \asymp m/\eta$. Then our aggregated estimator $\hat f$ on the set $E$ satisfies, for $2^j + k \le \eta B/\log n$ (i.e. the total number of different coefficients transmitted), a representation with $Z_{jk} \overset{iid}{\sim} N(0,1)$.
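The aggregation step of Algorithm 2 can be illustrated with a small numerical sketch. The snippet below is not the paper's algorithm verbatim (the quantization step is omitted, and the names `m`, `eta`, `true_coeffs`, `local_message` are illustrative): it only shows how splitting $m$ machines into $\eta$ groups, letting the group $A_{\mu}$ transmit the coefficient it is responsible for, and averaging in the central machine reduces the noise variance by a factor $\#(A_\mu) \asymp m/\eta$.

```python
import numpy as np

rng = np.random.default_rng(0)

m, eta = 12, 3                              # 12 machines, 3 groups of size m/eta = 4
true_coeffs = np.array([0.5, -0.2, 0.1])    # one wavelet coefficient per group (illustrative)

def local_message(mu, i):
    """Machine i in group mu observes its group's coefficient with independent noise."""
    return true_coeffs[mu] + rng.normal(scale=0.1)

# Central machine: average the copies arriving from each group A_mu.
group_size = m // eta
aggregated = np.array([
    np.mean([local_message(mu, i)
             for i in range(mu * group_size, (mu + 1) * group_size)])
    for mu in range(eta)
])

# Averaging over #(A_mu) = m/eta machines shrinks the noise standard deviation
# from 0.1 to 0.1 / sqrt(m/eta) = 0.05, so the aggregate tracks true_coeffs closely.
print(np.abs(aggregated - true_coeffs).max())
```

The same grouping scheme is reused (with a different choice of $\eta$) in the $L_\infty$ algorithm later on.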
Let $j_n = \lfloor \log\big(n^{1/(1+2s)} \wedge (\eta B/\log n)\big) \rfloor$. Then the risk of the aggregated estimator is bounded as stated, where we have used that for $f_0 \in B^s_{2,\infty}(L)$ we have $|f_{0,jk}| \le L$ for any $j \ge 0$, $k = 1, \ldots, 2^j$. The above inequality, together with the preceding display, concludes the proof of the theorem.
This theorem is actually a direct consequence of the following more general theorem where the communication thresholds can vary between the machines.
Theorem A.3. Consider $s, L > 0$, communication constraints $B^{(1)}, \ldots, B^{(m)} > 0$, and let the sequence $\delta_n = o(1)$ be defined as the solution to equation (A.1). Then in the distributed Gaussian white noise model (2.1) the stated lower bound holds.

Proof. First of all we note that in the non-distributed case, where all the information is available in the global machine, the minimax $L_\infty$-risk is of order $(n/\log n)^{-s/(1+2s)}$. Since the class of distributed estimators is clearly a subset of the class of all estimators, this is also a lower bound for the distributed case. The rest of the proof goes similarly to the proof of Theorem A.1.
First we construct a finite subset $F_0 \subset B^s_{\infty,\infty}(L)$ and then give a lower bound for the minimax risk over it. Let us denote by $K_j$ the largest set of Daubechies wavelets at resolution level $j$ with disjoint supports. Note that $|K_j| \ge c_0 2^j$ (for large enough $j$ and sufficiently small $c_0 > 0$). Let us again multiply $\delta_n$ by a sufficiently small constant and work with this $\delta_n$ in the rest of the proof. Furthermore, if $f_\beta \neq f_{\beta'}$, then there exists a $k' \in K_{j_n}$ such that $\beta_{k'} \neq \beta'_{k'}$. Then, due to the disjoint supports of the corresponding Daubechies wavelets $\psi_{j_n,k}$, $k \in K_{j_n}$, the $L_\infty$-distance between the two functions is bounded from below accordingly. Next observe that the stated inequality holds for an arbitrary set of estimators $F$. Now let $F$ be a uniform random variable on the set $F_0$. Then, in view of Fano's inequality (see Theorem 5.5 with $t = \delta_n^{s/(1+2s)}$ and $p = 1$), we get the stated chain of inequalities, where the second inequality follows from Theorem 5.1 and assertion (A.7) for $F = F_{dist}(B^{(1)}, \ldots, B^{(m)}; B^s_{\infty,\infty}(L))$, and the third by the definition of $\delta_n$, see (A.11). Hence we can conclude the corresponding lower bound over $F_{dist}(B^{(1)}, \ldots, B^{(m)}; F_0)$. Note that we have used the properties of the distributed estimation class $F$ only in assertion (A.7); hence the same lower bound holds for any class of distributed estimators $F$ satisfying this inequality. Next we give an algorithm providing matching upper bounds in the first two cases. Note that the last case, similarly to the $L_2$-norm case, is less relevant, as using the data available on a single machine alone would provide at least as good an estimator as any distributed algorithm. The algorithm is very similar to the $L_2$-case, i.e. Algorithm 2, and is essentially a rewrite of Algorithm 4 of [25] tailored to the Gaussian white noise model. Here we just highlight the differences compared to Algorithm 2.
We divide the machines into $\eta = \big(\big\lfloor \big(L^2 n(\log_2 n)^{2s}/B^{1+2s}\big)^{\frac{1}{2+2s}} \big\rfloor \wedge m\big) \vee 1$ equal-sized groups ($\eta = 1$ corresponds to case (ib), while $\eta > 1$ corresponds to case (iib)). Similarly to before, machines with indexes $1 \le i \le m/\eta$ transmit the approximations $Y^{(i)}_{jk}$. Then in the central machine we average the corresponding transmitted coefficients in the obvious way, similarly to the $L_2$-norm case: $\hat f_{jk} = \mathrm{mean}\{Y^{(i)}_{jk} : \mu_{jk} m/\eta < i \le (\mu_{jk}+1) m/\eta\}$, and construct $\hat f = \sum_{j,k} \hat f_{jk} \psi_{jk}$. The procedure is summarized as Algorithm 3 and its (up to a logarithmic factor) optimal behaviour is given in Theorem A.4 below.

1.1. Notations. For two positive sequences $a_n, b_n$ we use the notation $a_n \lesssim b_n$ if there exists a universal positive constant $C$ such that $a_n \le C b_n$. Along the same lines, $a_n \asymp b_n$ denotes that $a_n \lesssim b_n$ and $b_n \lesssim a_n$ hold simultaneously. In the proofs we use the notation $C$ and $c$ for universal constants whose value can differ from line to line, and denote by $\#S$ or $|S|$ the cardinality of a finite set $S$. Furthermore, let $l(Y)$ denote the length of a binary string $Y$, and let $\log x$ denote the logarithm with base 2, i.e. $\log_2 x$.

3.1. Proof of Proposition 2.3. Consider the sequence representation of the distributed Gaussian white noise model, see (3.1), using at least $s_{\max}$-regular Daubechies wavelets. Then by transmitting $n^{1/(1+2s_0)} \log n$ bits (the first $n^{1/(1+2s_0)}$ elements of the sequence representation of the model, each up to the first $0.5 \log n$ digits in its binary representation, see Algorithm 1) to the central machine and averaging the transmitted local data we arrive at the global sequence model. Each local machine also transmits the coefficients in the second subset of observations in the sequence representation, i.e. $X^{(i,2)}_{jk}$ with $2^j + k \le n^{1/(1+2\hat s^{(i)}_{n/m})}$. Since these numbers might not have a finite binary representation, we transmit their approximations $Y^{(i)}_{jk}$ following Algorithm 1. Note that in view of Lemma 5.2 (with $\mu = f_{0,jk}$) we have that $l(Y^{(i)}_{jk}) \le \log n$ with approximation error |ε
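The binary approximation step can be made concrete with a small sketch. The snippet below is illustrative rather than the paper's Algorithm 1 verbatim (the helper `truncate_binary` and the choice of test value are ours): a real-valued coefficient generally has no finite binary representation, so a machine keeps only its first $b$ binary digits, which costs $b$ bits and incurs an approximation error below $2^{-b}$.

```python
import math

def truncate_binary(x, b):
    """Keep the first b binary digits of x in [0, 1)."""
    return math.floor(x * 2**b) / 2**b

n = 1024
b = math.ceil(0.5 * math.log2(n))   # 0.5 log n digits, as in Algorithm 1; here b = 5

x = 1 / 3                           # infinite binary expansion 0.010101...
y = truncate_binary(x, b)           # 10/32 = 0.3125

err = abs(x - y)                    # 1/48, comfortably below 2**-b = 1/32
assert err < 2**-b
print(b, y, err)
```

In the paper the transmitted strings also carry the sign and integer part, which is why Lemma 5.2 bounds the total length $l(Y^{(i)}_{jk})$ by $\log n$ rather than $0.5\log n$.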