Wasserstein distance and metric trees

We study the Wasserstein (or earthmover) metric on the space $P(X)$ of probability measures on a metric space $X$. We show that, if a finite metric space $X$ embeds stochastically with distortion $D$ in a family of finite metric trees, then $P(X)$ embeds bi-Lipschitz into $\ell^1$ with distortion $D$. Next, we revisit the closed formula for the Wasserstein metric on finite metric trees due to Evans-Matsen \cite{EvMat}. We advocate that the right framework for this formula is real trees, and we give two proofs of extensions of this formula: one making the link with Lipschitz-free spaces from Banach space theory, the other one algorithmic (after reduction to finite metric trees).


Introduction
Embeddings of metric spaces, especially discrete metric spaces like graphs, into the Banach spaces ℓ^1 or L^1, form a well-established part of metric geometry, with applications ranging from computer science to topology: we refer to [Na18], part I of [DL97] or Chapter 1 in [Os13]. In this paper we will be concerned with embeddings of Wasserstein spaces, which we now recall.
Let (X, d) be a metric space and let P_1(X) be the space of probability measures µ on X with finite first moment, i.e. such that ∫_X d(x_0, x) dµ(x) < +∞ for some (hence any) base-point x_0 ∈ X. For compact X, the space P_1(X) coincides with the space P(X) of all probability measures on X.
The Wasserstein metric is a distance function on P_1(X). Intuitively, given µ, ν ∈ P_1(X), the distance W_a(µ, ν) represents the amount of work necessary to transform µ into ν. More precisely, a probability measure π ∈ P(X × X) is a coupling between µ and ν if its marginals are µ and ν, i.e. µ(A) = π(A × X) and ν(A) = π(X × A) for any Borel subset A ⊂ X. And the Wasserstein distance W_a(µ, ν) is defined as

W_a(µ, ν) = inf { ∫_{X×X} d(x, y) dπ(x, y) : π coupling between µ and ν }.
Note that X embeds isometrically in P_1(X) by x ↦ δ_x (the Dirac mass at x). See Chapter 5 in [Sa15] or Chapter 7 in [Vi15] for more on the Wasserstein distance, also called Kantorovich-Rubinstein distance or earthmover distance (EMD) in computer science papers. We denote by W_a(X) the space P_1(X) endowed with the Wasserstein distance, and call it the Wasserstein space of X. For a coupling π, the cost of π is the quantity ∫_{X×X} d(x, y) dπ(x, y).
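To make the definitions above concrete, here is a small pure-Python illustration (the space and measures are our own toy choices, not from the text): when ν is a Dirac mass δ_{y_0}, the only coupling between µ and ν is the product µ ⊗ δ_{y_0}, so the infimum defining W_a(µ, ν) is attained trivially and equals ∫_X d(x, y_0) dµ(x).

```python
# Cost of a coupling pi between two measures on a finite metric space.
# Here X = {0, 1, 3} with the metric inherited from the real line.

def cost(pi, d):
    """Transport cost of a coupling pi, encoded as a dict (x, y) -> mass."""
    return sum(mass * d(x, y) for (x, y), mass in pi.items())

d = lambda x, y: abs(x - y)

mu = {0: 0.5, 1: 0.5}   # mu = (delta_0 + delta_1) / 2
nu = {3: 1.0}           # nu = delta_3, a Dirac mass

# Since nu is a Dirac mass, the only coupling of (mu, nu) is mu x delta_3:
pi = {(x, 3): m for x, m in mu.items()}

# marginal checks: row sums give mu, total mass is 1
assert all(abs(sum(m for (x, _), m in pi.items() if x == a) - mu[a]) < 1e-12
           for a in mu)
assert abs(sum(pi.values()) - 1.0) < 1e-12

print(cost(pi, d))   # 0.5 * 3 + 0.5 * 2 = 2.5
```

Here the infimum over couplings is degenerate; for two non-Dirac measures the choice of coupling genuinely matters, which is what the Wasserstein distance captures.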
Let Y = (Y_i, d_i)_{i∈I} be a finite family of metric spaces. We say that a metric space (X, d) embeds stochastically in Y with distortion D ≥ 1 if there exist nonnegative numbers (p_i)_{i∈I} summing up to 1, and maps f_i : X → Y_i (for each i ∈ I) such that:

• Each f_i is non-contracting, i.e. for every x, y ∈ X we have d_i(f_i(x), f_i(y)) ≥ d(x, y).
• For every x, y ∈ X we have Σ_{i∈I} p_i d_i(f_i(x), f_i(y)) ≤ D · d(x, y).

The first aim of this paper is to prove the following result.

Theorem 1.1. Assume that the finite metric space (X, d) embeds stochastically with distortion D into a family of finite metric trees. Then W_a(X) embeds bi-Lipschitz into ℓ^1 with distortion at most D.
Here, by a metric tree, we mean a tree T = (V, E) endowed with a positive weight function w : E → R_{>0} : e ↦ w_e. For x, y ∈ V we denote by [x, y] the set of edges on the unique path from x to y, and we endow V with the distance d_T(x, y) = Σ_{e∈[x,y]} w_e.
We learned Theorem 1.1 from the paper [IT03] by P. Indyk and N. Thaper, who obtain the less precise bound O(D) for the distortion of the embedding into ℓ^1, and provide a rather frustrating comment that prompted our desire to give a direct proof of Theorem 1.1. It was shown by J. Fakcharoenphol, S. Rao and K. Talwar [FRT04] that any finite metric space on n points embeds stochastically with distortion O(log n) into a family of finite metric trees (and this bound is optimal). Using this, it was shown by F. Baudier, P. Motakis, Th. Schlumprecht and A. Zsák (Corollary 8 in [BMSZ20]) that, for X a finite metric space on n points, the lamplighter metric space La(X) embeds into ℓ^1 with distortion O(log n) = O(log log |La(X)|). Using the same result from [FRT04], our Theorem 1.1 immediately implies:

Corollary 1.2. For any finite metric space X on n points, the Wasserstein space W_a(X) embeds bi-Lipschitz into ℓ^1 with distortion O(log n).
Combining with the isometric embedding X → W_a(X) : x ↦ δ_x, we obtain as a corollary a celebrated result of J. Bourgain [Bo85]:

Corollary 1.3. Any finite metric space on n points embeds bi-Lipschitz into ℓ^1 with distortion O(log n).
It turns out that on finite metric trees there is a remarkable closed formula for the Wasserstein distance. It originated in computer science papers around 2002 and probably earlier: see Charikar [Ch02] for measures supported on the leaves of the tree. For general probability measures on a finite metric tree, the formula appears in a paper in biomathematics (see section 2 in S. N. Evans and F. A. Matsen [EM12]). We believe it deserves to be better known in mathematical circles. To understand it, let T = (V, E) be a metric tree and fix a base-vertex x_0 ∈ V (so that T appears as a rooted tree). Any edge e ∈ E separates T into two half-trees, and we denote by T_e the set of vertices of the half-tree not containing x_0: if we view the tree as hanging from the root, T_e is the subtree hanging below the edge e.
Theorem 1.4. Let T = (V, E) be a finite, rooted metric tree. Then for µ, ν ∈ P(V):

W_a(µ, ν) = Σ_{e∈E} w_e |µ(T_e) − ν(T_e)|.   (1)

This formula has numerous implications: first, the right-hand side is independent of the choice of the root; second, it shows that the Wasserstein metric on T is an L^1-metric (see lemma 2.4 below).
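On a finite metric tree the right-hand side of the formula in Theorem 1.4 is computable in a single traversal. The following sketch (our own illustration; the function, variable names and example tree are ours) evaluates Σ_{e∈E} w_e |µ(T_e) − ν(T_e)| and checks numerically that the result does not depend on the chosen root.

```python
from collections import defaultdict

def evans_matsen(edges, mu, nu, root):
    """Right-hand side of the closed formula: the sum over the edges e of
    w_e * |mu(T_e) - nu(T_e)|, where T_e is the half-tree not containing
    the root.  `edges` is a list of triples (u, v, w_e) describing the tree."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    total = 0.0

    def visit(v, par):
        # returns (mu(T), nu(T)) for the half-tree T hanging below v
        nonlocal total
        s_mu, s_nu = mu.get(v, 0.0), nu.get(v, 0.0)
        for u, w in adj[v]:
            if u != par:
                a, b = visit(u, v)
                total += w * abs(a - b)   # contribution of the edge {v, u}
                s_mu += a
                s_nu += b
        return s_mu, s_nu

    visit(root, None)
    return total

# A small tree: r - a (weight 1), r - b (weight 2), a - c (weight 1).
edges = [("r", "a", 1.0), ("r", "b", 2.0), ("a", "c", 1.0)]
mu, nu = {"c": 1.0}, {"b": 1.0}           # two Dirac masses, d(c, b) = 4
print(evans_matsen(edges, mu, nu, "r"))   # 4.0
# The value does not depend on the chosen root:
assert all(evans_matsen(edges, mu, nu, r) == 4.0 for r in "rabc")
```

For Dirac masses the formula simply sums the weights along the path, recovering d_T; the root-independence observed here is the first implication noted after Theorem 1.4.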
Our second aim in this paper is to give two new proofs of Theorem 1.4. The first one advocates that the right framework for Theorem 1.4 is real trees: by exploiting a connection with the theory of Lipschitz-free spaces from Banach space theory, we will extend the result to metric trees with countably many vertices. The second proof is by double inequality: the inequality W_a(µ, ν) ≥ Σ_{e∈E} w_e |µ(T_e) − ν(T_e)| follows by considering the canonical embedding of the tree into ℓ^1 and its barycentric extension to P_1(V). The converse inequality is proved by first reducing to finite metric trees and, for those, given µ, ν ∈ P(V), by providing an algorithmic construction of a coupling π with ∫_{V×V} d(x, y) dπ(x, y) = Σ_{e∈E} w_e |µ(T_e) − ν(T_e)|.

The paper is organized as follows. In section 2 we prove Theorem 1.1, taking Theorem 1.4 for granted. Sections 3 and 4 present our two proofs of Theorem 1.4, suitably generalized to metric trees with countably many vertices (see Theorem 3.3). Finally the Appendix provides a comparison between various σ-algebras of sets on a real tree that appeared in the literature.

2 Proof of Theorem 1.1
The second lemma was suggested to us by F. Baudier.
Lemma 2.2. If the finite metric space (X, d) embeds stochastically into Y = (Y_i, d_i)_{i∈I} with distortion D, then W_a(X) embeds stochastically into (W_a(Y_i))_{i∈I} with distortion D.
Proof. For i ∈ I, let p_i ≥ 0 and f_i : X → Y_i realize the stochastic embedding with distortion D of X into Y. Consider then the pushforward maps (f_i)_* : P(X) → P(Y_i) : µ ↦ (f_i)_*(µ). We claim that the stochastic embedding with distortion D of W_a(X) into the family (W_a(Y_i))_{i∈I} is realized by the p_i's and the (f_i)_*'s; to see this, we check the two points in the definition of a stochastic embedding. Fix µ, ν ∈ W_a(X).
• For any coupling π′ between (f_i)_*(µ) and (f_i)_*(ν), pulling π′ back through f_i (which is injective, being non-contracting) yields a coupling between µ and ν of cost at most the cost of π′, where the inequality between the costs follows from the fact that f_i is non-contracting. Taking the infimum over π′, we get W_a((f_i)_*(µ), (f_i)_*(ν)) ≥ W_a(µ, ν). So (f_i)_* is non-contracting as well.
• Let π ∈ P(X × X) be a coupling between µ and ν realizing W_a(µ, ν) (such an optimal coupling exists, X being finite). For each i ∈ I, the pushforward (f_i × f_i)_*(π) is a coupling between (f_i)_*(µ) and (f_i)_*(ν), so

Σ_{i∈I} p_i W_a((f_i)_*(µ), (f_i)_*(ν)) ≤ ∫_{X×X} Σ_{i∈I} p_i d_i(f_i(x), f_i(y)) dπ(x, y) ≤ D ∫_{X×X} d(x, y) dπ(x, y) = D · W_a(µ, ν),

where the second inequality follows from the fact that the f_i's provide a stochastic embedding. This concludes the proof.
Combining lemmas 2.1 and 2.2 we immediately get Corollary 2.3. To prove Theorem 1.1, in view of Corollary 2.3, it is therefore enough to observe:

Lemma 2.4. For T = (V, E) a finite metric tree, the space W_a(T) embeds isometrically into ℓ^1(E, w).

Proof. Fix a root x_0 ∈ V and, for any edge e ∈ E, let T_e ⊂ V be defined as in the Introduction. The map

W_a(T) → ℓ^1(E, w) : µ ↦ (µ(T_e))_{e∈E}

is an isometric embedding of W_a(T), by Theorem 1.4. This concludes the proof of Theorem 1.1 (taking Theorem 1.4 for granted).
3 First proof of Theorem 1.4

Lipschitz-free spaces
For a metric space (X, d) with a base-point x_0 ∈ X, we denote by Lip_0(X) the Banach space of Lipschitz functions on X vanishing at x_0, endowed with the Lipschitz norm. The space Lip_0(X) has a canonical pre-dual, called the Lipschitz-free space of X (see e.g. Chapter 2 in [We99], Chapter 10 in [Os13]) and denoted by F(X): it is the closed linear subspace of the dual space Lip_0(X)^* generated by the point evaluations δ_x (x ∈ X \ {x_0}).
For µ ∈ P_1(X), the linear form f ↦ ∫_X f(x) dµ(x) defines an element of the dual Lip_0(X)^*: this way we get an embedding of W_a(X) into Lip_0(X)^*. When X is a complete separable metric space, it can be shown that this is actually an isometric embedding of W_a(X) into F(X) (see Theorem 1.13 in [OO19] or section 2 in [NS07]).

Real trees
A real tree (T, d) is a geodesic metric space which is 0-hyperbolic in the sense of Gromov. For x, y ∈ T, we denote by [x, y] the segment between x and y, i.e. the unique arc joining them. A point x ∈ T is a branching point if T \ {x} has at least 3 connected components; we denote by Branch(T) the set of branching points of T. Fix a base-point x_0 ∈ T. For x ∈ T, we set

T_x = {y ∈ T : x ∈ [x_0, y]},

so that, letting T hang from the root x_0, the set T_x is the part of T lying below x.
Following A. Godard [Go10], we say that a subset A ⊂ T is measurable if, for every x, y ∈ T, the set A ∩ [x, y] is Lebesgue-measurable in [x, y]. On the σ-algebra G of measurable subsets, there is a unique measure λ such that λ([x, y]) = d(x, y): we call λ the length measure. It is defined as follows: for S a segment in T, let λ_S denote the Lebesgue measure on S; then, for A ∈ G,

λ(A) = sup { Σ_{i=1}^n λ_{S_i}(A ∩ S_i) : S_1 ⊔ ⋯ ⊔ S_n ∈ R },   (2)

where R is the set of subsets of T that can be expressed as finite disjoint unions of segments S_1, ..., S_n.
For A a closed subset of T containing x_0, still following [Go10], one defines a measure µ_A on A. Assume from now on that the real tree T is complete and separable. Then, by the previous subsection, for A a closed subset of T containing Branch(T), the space W_a(A) isometrically embeds into L^1(A, µ_A). This embedding is not written explicitly in [Go10]; by making it explicit we get a closed formula for the Wasserstein distance on closed subsets of real trees.
Proposition 3.1. Let (T, d) be a complete, separable real tree and let A be a closed subset of T containing Branch(T). For µ, ν ∈ W_a(A) we have:

W_a(µ, ν) = ∫_A |µ(T_x ∩ A) − ν(T_x ∩ A)| dµ_A(x).   (3)

Proof. By the proof of Theorem 3.2 in [Go10], the map

Φ : L^∞(A, µ_A) → Lip_0(A) : g ↦ (x ↦ ∫_{[x_0,x]} g dµ_A)

is an isometric isomorphism which is weak*-weak* continuous, so its transpose Φ_* realizes the desired isometric isomorphism F(A) → L^1(A, µ_A). Denoting by χ_{[x,y]} the characteristic function of the segment [x, y], the previous formula may be re-written:

Φ(g)(x) = ∫_A χ_{[x_0,x]}(y) g(y) dµ_A(y).

For ν ∈ W_a(A), we compute Φ_*(ν). For g ∈ L^∞(A, µ_A), we have

⟨Φ_*(ν), g⟩ = ∫_A Φ(g)(x) dν(x) = ∫_A ∫_A χ_{[x_0,x]}(y) g(y) dµ_A(y) dν(x).

As the measure µ_A is σ-finite (here we use that the tree T is separable), we may appeal to Fubini:

⟨Φ_*(ν), g⟩ = ∫_A g(y) ν({x ∈ A : y ∈ [x_0, x]}) dµ_A(y) = ∫_A g(y) ν(T_y ∩ A) dµ_A(y).

Since this holds for every g ∈ L^∞(A, µ_A) we deduce that, for almost every x ∈ A: Φ_*(ν)(x) = ν(T_x ∩ A). Equation (3) follows.
Remark 3.2. When A = T, Proposition 3.1 becomes, for T a complete separable real tree and µ, ν ∈ W_a(T):

W_a(µ, ν) = ∫_T |µ(T_x) − ν(T_x)| dλ(x).   (4)

When T is the geometric realization of a finite metric tree, equation (4) appears as equation (5) in [EM12]; the proof there is different.
Theorem 3.3. Let T = (V, E) be a rooted metric tree with countably many vertices. Then for µ, ν ∈ P_1(V):

W_a(µ, ν) = Σ_{e∈E} w_e |µ(T_e) − ν(T_e)|.

Proof. For an edge e ∈ E, write e^+ and e^− for its two extremities, chosen so that d(x_0, e^+) < d(x_0, e^−). View V as a closed subset of the geometric realization of T: it contains the branching points, and T_{e^−} ∩ V = T_e for every edge e. The result then follows by making the right-hand side of equation (3) explicit for A = V.

4 Second proof of Theorem 1.4

Proposition 4.1. Let (X, d) be a metric space, let E be a Banach space, and let β : X → E be a C-Lipschitz map. Then β admits a C-Lipschitz barycentric extension β̄ : P_1(X) → E : µ ↦ ∫_X β(x) dµ(x).

Proof. Let x_0 be a base-point in X. Composing β with a translation in E, we may assume that β(x_0) = 0. Then, as ‖β(x) − β(y)‖ ≤ C · d(x, y) for any x, y ∈ X, we get ‖β(x)‖ ≤ C · d(x_0, x), so that β̄(µ) = ∫_X β(x) dµ(x) is well-defined for every µ ∈ P_1(X). To check that β̄ is C-Lipschitz, observe that for µ, ν ∈ P_1(X) and π a coupling between µ and ν, we have:

‖β̄(µ) − β̄(ν)‖ = ‖∫_{X×X} (β(x) − β(y)) dπ(x, y)‖ ≤ C ∫_{X×X} d(x, y) dπ(x, y).

The result follows by taking the infimum over all couplings π.

Remark 4.2. Observe that, if β in Proposition 4.1 is bi-Lipschitz, in general its extension β̄ is not. Indeed take E = R and let X ⊂ R be any subset with at least 3 elements: the inclusion β : X → R is isometric, but β̄ is not even injective.
Let T = (V, E) be a metric tree; we denote by χ_{[x,y]} the characteristic function of the set of edges in [x, y]. There is a well-known isometric embedding β : V → ℓ^1(E, w) : x ↦ χ_{[x_0,x]} (it is hard to locate the first appearance of this embedding in the literature; we learned it from [Ha79]). By Proposition 4.1, we extend it to a 1-Lipschitz map β̄ : P_1(V) → ℓ^1(E, w). Ultimately we will see that β̄ is isometric. For the moment we prove:

Proposition 4.3. Let T = (V, E) be a metric tree. For µ, ν ∈ P_1(V):

W_a(µ, ν) ≥ ‖β̄(µ) − β̄(ν)‖_1 = Σ_{e∈E} w_e |µ(T_e) − ν(T_e)|.

Proof. The inequality follows from Proposition 4.1; we focus on the equality. But

‖β̄(µ) − β̄(ν)‖_1 = Σ_{e∈E} w_e |β̄(µ)(e) − β̄(ν)(e)|,

so it is enough to prove that β̄(µ)(e) = µ(T_e). So we compute:

β̄(µ)(e) = ∫_V χ_{[x_0,x]}(e) dµ(x) = µ({x ∈ V : e ∈ [x_0, x]}) = µ(T_e).

Our aim now is to prove that the inequality in Proposition 4.3 is actually an equality, i.e. for metric trees T = (V, E) with countably many vertices we wish to prove the reverse inequality

W_a(µ, ν) ≤ Σ_{e∈E} w_e |µ(T_e) − ν(T_e)|.   (5)

Theorem 1.13 of [OO19] implies that the set of finitely supported probability measures is dense in (P_1(V), W_a). Of course W_a(·, ·) : P_1(V) × P_1(V) → R is continuous, and (µ, ν) ↦ Σ_{e∈E} w_e |µ(T_e) − ν(T_e)| is continuous too, as an immediate consequence of Proposition 4.3. So to show (5) we may restrict to finitely supported measures, i.e. we may restrict to finite metric trees.
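The embedding β and its barycentric extension β̄ are completely explicit, which the following sketch illustrates (the small example tree and all names are our own choices): it checks the isometry ‖β(x) − β(y)‖_1 = d_T(x, y) and the identity β̄(µ)(e) = µ(T_e) computed above.

```python
from collections import defaultdict

# Rooted tree: parent and edge weight of each non-root vertex (our example).
parent = {"a": "r", "b": "r", "c": "a"}
weight = {"a": 1.0, "b": 2.0, "c": 1.0}   # weight of the edge towards the parent

def path_edges(x):
    """Edges of the segment [x0, x]; an edge is named by its lower vertex."""
    out = set()
    while x in parent:
        out.add(x)
        x = parent[x]
    return out

def beta(x):
    """beta(x) = characteristic function of [x0, x], as a vector of l1(E, w)."""
    return {e: 1.0 for e in path_edges(x)}

def l1_dist(f, g):
    return sum(w * abs(f.get(e, 0.0) - g.get(e, 0.0)) for e, w in weight.items())

def tree_dist(x, y):
    ex, ey = path_edges(x), path_edges(y)
    return sum(weight[e] for e in ex ^ ey)   # edges of the path [x, y]

vertices = ["r", "a", "b", "c"]
assert all(l1_dist(beta(x), beta(y)) == tree_dist(x, y)
           for x in vertices for y in vertices)      # beta is isometric

def beta_bar(mu):
    """Barycentric extension: beta_bar(mu)(e) = integral of beta(x)(e) dmu(x)."""
    out = defaultdict(float)
    for x, m in mu.items():
        for e in path_edges(x):
            out[e] += m
    return out

mu = {"r": 0.25, "c": 0.75}
bb = beta_bar(mu)
# beta_bar(mu)(e) = mu(T_e): T_a = {a, c}, so the coordinate at edge a is 0.75
assert abs(bb["a"] - 0.75) < 1e-12 and abs(bb["c"] - 0.75) < 1e-12
assert bb.get("b", 0.0) == 0.0
```

The symmetric difference of the two edge sets is exactly the set of edges of [x, y], which is why the ℓ^1 distance between the characteristic vectors reproduces d_T.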

An algorithm for finite metric trees
Let T = (V, E) be a finite metric tree. Recall from the proof of Theorem 3.3 that if e ∈ E is an edge, we write e^+ and e^− for its two extremities, chosen so that d(x_0, e^+) < d(x_0, e^−). Moreover, if v, w ∈ V, we say that w is a descendant of v if v ∈ [x_0, w] (notice that a vertex is its own descendant), and we say that w is a child of v (and that v is the parent of w) if w is a descendant of v and the path [v, w] consists of the single edge {v, w}. If v ∈ V we write T_v for the half-tree whose vertex set is the set of all descendants of v; hence T_{x_0} = T, and if e ∈ E then T_e = T_{e^−}.
To show (5), we provide an algorithm which transforms a probability measure µ′, initially set to µ, into ν. In parallel, this algorithm keeps track of a variable (here a matrix) π′ = (π′(x, y))_{x,y∈V} =: (π′_{x,y})_{x,y∈V} that, throughout the running of the algorithm, provides a coupling between µ and µ′. When the algorithm stops we will have µ′ = ν, and the cost of the coupling π′ will be Σ_{e∈E} w_e |µ(T_e) − ν(T_e)|. This algorithm runs in two phases; intuitively speaking, the first phase brings up (towards the root) the excess of mass from those subtrees T_e with µ(T_e) > ν(T_e), and the second phase lets that mass fall (towards the leaves) into the subtrees T_e with µ(T_e) < ν(T_e). Still intuitively, for every vertex x, π′_{x,y} is the mass attributed by µ′ to x coming from y; the coupling remembers where the mass comes from. We consider that the vertices of T are numbered 1, 2, ..., n := |V| (e.g. in such a way that, given two vertices at distinct depths in the tree, the deeper one is assigned a lower number than the other). The algorithm moves first the mass coming from vertices with a low number.
Here is the structure of the algorithm. After initialization (µ′ ← µ, and π′_{x,x} ← µ(x), π′_{x,y} ← 0 for x ≠ y), we set M ← 0 (this variable is used just for the proof).

Phase (1): for N depth level, from the deepest up to 1:
  for all subtrees T_e whose root e^− is at depth N:   % Loop (*)
    if µ′(T_e) > ν(T_e), "we bring (µ′(T_e) − ν(T_e)) up one level".

Phase (2): for N depth level, from 0 to the deepest level in the tree − 1:
  for all subtrees T whose root r is at depth N:   % Loop (**)
    let s_1, ..., s_n be the children of r;
    if µ′(r) > ν(r), "we let the excess at r fall into the subtrees below that are still in deficit".

We must now prove that the algorithm works as intended, that is, π′ is always a coupling between µ and µ′, and when the algorithm terminates we have µ′ = ν and the cost of π′ is Σ_{e∈E} w_e |µ(T_e) − ν(T_e)|.

Proof. A probability measure on a tree T is determined by the masses attributed to all subtrees T_e. To see that µ′ = ν when the algorithm terminates, we thus show that µ′(T_e) = ν(T_e) for every subtree T_e:

• If µ(T_e) = ν(T_e) then neither phase (1) nor phase (2) modifies µ′(T_e) = µ(T_e) = ν(T_e) (even though the distribution within T_e may vary).
• If µ(T_e) > ν(T_e) then phase (1) removes the adequate quantity of mass from µ′(T_e), so that once phase (1) is over we have µ′(T_e) = ν(T_e). Then phase (2) does not change the quantity µ′(T_e) = ν(T_e) (even though it could change the distribution on that subtree).

• If µ(T_e) < ν(T_e), phase (1) does not change the quantity µ′(T_e) = µ(T_e) (even though it could change the distribution on that subtree). We write µ_N for the measure µ′ after all subtrees whose root is at depth N have been treated by phase (2) (N going from 0 to the deepest level in the tree − 1).
Then we proceed by induction on N, assuming that e^+ is at depth N. The initial step consists in seeing that µ_{N=0}, the measure µ′ just after phase (1), is a probability measure on T; ν being one too, it follows that µ_{N=0}(T) = ν(T) = 1. For the induction step, we write e^− = v_1, ..., v_m for the children of e^+ and assume (induction hypothesis) that µ_N(T_{e^+}) = ν(T_{e^+}). Since phase (1) is over, we have µ_N(T_{v_i}) ≤ ν(T_{v_i}) for all i = 1, ..., m, hence µ_N(e^+) ≥ ν(e^+). Then phase (2) of the algorithm moves the missing quantity ν(T_{v_i}) − µ_N(T_{v_i}) down from e^+ into each subtree T_{v_i}, so that µ_{N+1}(T_{v_i}) = ν(T_{v_i}); in particular we have the desired fact for i = 1.
Eventually, when the algorithm stops, µ′ = ν. The variables µ′ and π′ are modified only during loops (*) and (**); we write µ_M and π_M for the values of µ′ and π′ after M rounds through loops (*) or (**). Then π_M = (π_M(x, y))_{x,y∈V} =: (π^M_{x,y})_{x,y∈V} is a coupling between µ and µ_M: just after initialization it is clear that π′ = π_0 is a coupling between µ and µ_0 = µ, and it follows by induction that π_M is a coupling between µ and µ_M (treating separately the case where moving from M to M + 1 is done during phase (1) and the case where this move is done during phase (2)). About the cost of the coupling: if moving from M to M + 1 is done during phase (1), in loop (*) the cost increases by x · d(s, r), where r is the root of the current subtree, s its parent and x the mass moved. Here we used that d(s, j) − d(r, j) = d(s, r), and that every i such that π^M_{r,i} contributes to the sum Σ_{i<j} (d(s, i) − d(r, i)) π^M_{r,i} satisfies d(s, i) − d(r, i) = d(s, j) − d(r, j) = d(s, r) ≥ 0. Indeed, each vertex i such that π^M_{r,i} ≠ 0 is (non-strictly) below r in the rooted tree (since π^M_{r,i} is the mass attributed by µ_M to r coming from i; we let the reader check this formally). Hence the vertices i contributing to the sum Σ_{i<j} (d(s, i) − d(r, i)) π^M_{r,i} are (non-strictly) below r, and for those we have d(s, i) ≥ d(r, i), whence d(s, i) − d(r, i) = d(s, r), since s is the parent of r. By definition of j, π^M_{r,j} ≠ 0, and thus j is (non-strictly) below r, hence d(s, j) − d(r, j) = d(s, r) ≥ 0.
If moving from M to M + 1 is done during phase (2), in loop (**), we conclude similarly that the cost of the coupling is increased by x · d(s_i, r). During phase (1), excess measure is always brought up one level at a time; in loop (*) we thus always have x = µ′(T) − ν(T) = µ(T) − ν(T), and phase (1) brings up excess measure exactly from those subtrees T_e with µ(T_e) > ν(T_e). During phase (2), measure is always brought down one level at a time; in loop (**) we thus always have x = ν(T) − µ′(T) = ν(T) − µ(T), and phase (2) brings down the adequate quantity of measure exactly into those subtrees T_e with µ(T_e) < ν(T_e). Since just after initialization π′ has null cost, the cost of π′ at the end of the algorithm is Σ_{e∈E} w_e |µ(T_e) − ν(T_e)|.
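For concreteness, here is a compact Python condensation of the two phases (a sketch with our own data layout; it tracks the total cost of the moves rather than the full coupling matrix π′). It relies on the observation above: the mass lifted across an edge e during phase (1) is max(0, µ(T_e) − ν(T_e)), and the mass dropped across e during phase (2) is max(0, ν(T_e) − µ(T_e)), so the total cost is Σ_{e∈E} w_e |µ(T_e) − ν(T_e)|.

```python
from collections import defaultdict

def two_phase_transport(parent, weight, mu, nu):
    """Two-phase mass-moving scheme on a rooted metric tree (sketch).
    `parent` maps each non-root vertex to its parent, `weight` to the weight
    of the edge towards the parent; mu, nu are dicts of point masses summing
    to 1.  Returns (cost, final measure)."""
    children = defaultdict(list)
    for v in parent:
        children[parent[v]].append(v)
    root = next(p for p in parent.values() if p not in parent)
    order = [root]                       # breadth-first: root first, leaves last
    for v in order:
        order.extend(children[v])

    def subtree_mass(measure):
        s = defaultdict(float)
        for v in reversed(order):
            s[v] = measure.get(v, 0.0) + sum(s[c] for c in children[v])
        return s

    smu, snu = subtree_mass(mu), subtree_mass(nu)
    m = defaultdict(float, mu)           # the evolving measure mu'
    cost = 0.0
    # Phase (1): bottom-up, lift the excess mu(T_v) - nu(T_v) one level up.
    for v in reversed(order):
        if v != root and smu[v] > snu[v]:
            x = smu[v] - snu[v]
            m[v] -= x
            m[parent[v]] += x
            cost += x * weight[v]
            assert m[v] > -1e-12         # masses stay nonnegative
    # Phase (2): top-down, let the missing mass nu(T_v) - mu(T_v) fall.
    for v in order:
        if v != root and snu[v] > smu[v]:
            x = snu[v] - smu[v]
            m[parent[v]] -= x
            m[v] += x
            cost += x * weight[v]
            assert m[parent[v]] > -1e-12
    return cost, m

# Example: the tree r - a (1), r - b (2), a - c (1); move delta_c to delta_b.
parent = {"a": "r", "b": "r", "c": "a"}
weight = {"a": 1.0, "b": 2.0, "c": 1.0}
cost, m = two_phase_transport(parent, weight, {"c": 1.0}, {"b": 1.0})
print(cost)                              # 4.0 = d(c, b)
assert abs(m["b"] - 1.0) < 1e-12 and abs(m["c"]) < 1e-12
```

Processing vertices bottom-up in phase (1) and top-down in phase (2) is what guarantees that each move is possible, i.e. that the evolving measure never becomes negative, mirroring the inductive argument above.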
Remark 4.4. Let (X, d) be a Polish metric space. For µ, ν ∈ P_1(X), we have from the Kantorovich-Rubinstein duality:

W_a(µ, ν) = sup_f ( ∫_X f dµ − ∫_X f dν ),

where the supremum is taken over all 1-Lipschitz functions f : X → R: see Theorem 1.3 in [Vi15]; see also [Ed10] for a short proof. We observe that, for a finite metric tree, our second proof of Theorem 1.4 does not appeal to Kantorovich-Rubinstein duality (in contrast e.g. with the proof in [EM12]).
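Duality gives handy one-line optimality certificates. As a small self-contained illustration (the space, measures, coupling and test function below are our own choices): on X = {0, 1, 3} ⊂ R, any 1-Lipschitz f yields a lower bound on W_a(µ, ν), and any coupling yields an upper bound; when the two bounds meet, both the coupling and f are optimal.

```python
# Kantorovich-Rubinstein certificate on X = {0, 1, 3} from the real line:
# for any coupling pi and any 1-Lipschitz f,
#   int f dnu - int f dmu  <=  Wa(mu, nu)  <=  cost(pi),
# so equality of the two outer quantities pins down Wa(mu, nu) exactly.
mu = {0: 0.5, 1: 0.5}
nu = {1: 0.5, 3: 0.5}

f = lambda x: x                      # 1-Lipschitz on the real line

lower = (sum(m * f(x) for x, m in nu.items())
         - sum(m * f(x) for x, m in mu.items()))

pi = {(0, 1): 0.5, (1, 3): 0.5}      # a candidate coupling of (mu, nu)
upper = sum(m * abs(x - y) for (x, y), m in pi.items())

assert lower == upper == 1.5         # hence Wa(mu, nu) = 1.5
```

The algorithmic proof above avoids this duality entirely, but the certificate viewpoint is often the quickest way to verify a transport plan by hand.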
5 Appendix: σ-algebras on real trees

Let (T, d) be a real tree. Apart from Godard's construction from [Go10] of the σ-algebra G recalled above, we are aware of other constructions of σ-algebras on T and of corresponding length measures:

• The σ-algebra S generated by segments, see [Va90].
• The Borel σ-algebra B generated by open subsets, see [EPW06] for compact real trees and [AEW13] for locally compact real trees.
All these constructions have in common that the length measure of a segment [x, y] is exactly d(x, y). In order to clarify the relation between S, B and G, we also introduce the σ-algebra B_0 generated by open balls (so that B_0 ⊂ B) and the σ-algebra S̄ obtained by completing S with respect to λ-negligible subsets.
The following proposition explains our choice to work with Godard's σ-algebra G.
Proposition 5.1. Let T be a real tree.

1. We have S ⊂ B_0 ⊂ G and S̄ ⊂ G.

2. If T is separable, then B_0 = B and S̄ = G.

Proof. 1. To show that S ⊂ B_0, fix x, y ∈ T and let (z_n)_{n>0} be a dense sequence in [x, y]. Then the equality [x, y] = ∩_{m≥1} ∪_{n>0} B(z_n, 1/m) shows that [x, y] ∈ B_0. Now, let B be an open ball in T. For any x, y ∈ T, the intersection B ∩ [x, y] is convex in [x, y], so it is a sub-interval of [x, y]. In particular B ∩ [x, y] is Lebesgue-measurable in [x, y], so B ∈ G. Finally, the inclusion S̄ ⊂ G follows from the fact that G is complete, as can be seen from the definitions.

2. The equality B_0 = B holds in every separable metric space (any open set being then a countable union of open balls). To prove the inclusion G ⊂ S̄, we consider the subset T_0 := ∪_{x,y∈T} ]x, y[ and its complement L = T \ T_0: the latter is the set of leaves of T. For every segment [x, y] we have L ∩ [x, y] ⊂ {x, y}, so that L ∈ G; moreover λ(L) = 0 by equation (2). Then we take A ∈ G. To show A ∈ S̄ we use separability: let D be a countable dense subset of T. It is easy to see that T_0 = ∪_{x,y∈D} ]x, y[, which implies that T_0 is S-measurable, as well as its complement L. On the one hand A ∩ L ⊂ L and λ(L) = 0, so A ∩ L is λ-negligible and thus S̄-measurable. On the other hand, A ∩ T_0 ∈ S̄, since A ∩ ]x, y[ ∈ S̄ for all x, y ∈ D: indeed the σ-algebra of Lebesgue-measurable subsets of a segment is the completion of the σ-algebra generated by sub-intervals. This concludes the proof.