Gaussianity and typicality in matrix distributional semantics



Introduction
A research programme, "Linguistic Matrix Theory", for understanding the characteristics of randomness in natural language, specifically in the matrix/tensor datasets arising from type-driven compositional distributional semantics, using the framework of random matrix/tensor theories, was initiated in [1].
Distributional or vector space models of meaning in natural language semantics argue that meanings of words are representable by the contexts in which they often occur. The ideas that have inspired this way of reasoning about meaning go back to the works of Firth [2] and of Harris [3], the former of whom famously said: "You shall know a word by the company it keeps." These ideas were implemented via vectors of co-occurrence contexts [4,5]. Contexts, e.g. words in a fixed neighbourhood window of size k, are taken to be the basis of a vector space whose elements represent meanings of other words. The coefficient of a word vector w over a basis vector b_i is a function of the number of times w occurred in the context of b_i. These co-occurrence frequencies are collected from large corpora of data, such as crawls of web domains, Google's news and book archives, and Wikipedia. The distances between the word vectors represent their semantic similarity/relatedness, e.g. see [6].
In order to extend the distributional model from words to phrases and sentences, one has to take into account grammatical structure. Type-logical approaches to grammar, e.g. Combinatory Categorial Grammar [7] and the Lambek Calculus [8], have been shown to have a straightforward interface to the vector space models of meaning. The ideas behind these grammatical formalisms are the same, although they follow different notational conventions and syntactic rules; in this paper we adopt the terminology of the Lambek Calculus. In a type-logical grammar, some words, such as nouns, have atomic types and others, such as adjectives and verbs, have functional (or functor) types. If we start with the set {n, s} of atomic types, n for the type of a noun and s for the type of a sentence, then an adjective will have type n → n: this type says that an adjective is a function that takes an argument of type noun, modifies it, and returns an adjective-noun phrase of type n. For instance, the adjective "red" takes the noun "cat" as an argument and, after modifying it, returns the phrase "red cat" as an adjective-noun phrase. An intransitive verb has type n → s, i.e. a function that takes an argument of type noun and returns a sentence. An example here is the verb "snore", which takes the noun "cats" as argument and returns the sentence "cats snore". A transitive verb has type n → n → s; this is a function that takes an argument of type noun and returns a verb phrase of type n → s, which in turn takes an argument of type noun and returns a sentence. An example is the verb "like": it takes the noun "fish" and returns the verb phrase "like fish", which in turn takes the noun "cats" and returns the sentence "cats like fish".
Type-driven distributional models of meaning start from a type-driven analysis of grammar and assign a compositional vector semantics to natural language constructions [9,10,11,12]. These models are based on the central argument that words with atomic types should be represented as vectors, but words with functional types as linear or multilinear maps, equivalently matrices, cubes, or higher order tensors, depending on the number of arguments they take. If we assign the vector space N to the type n and the vector space S to the type s, then adjectives become elements of the tensor space N ⊗ N, while intransitive verbs and verb phrases, which take a single argument, become elements of N ⊗ S. These are represented as matrices. Transitive and ditransitive verbs have two and three arguments respectively: they become elements of the tensor spaces N ⊗ N ⊗ S and N ⊗ N ⊗ N ⊗ S and are represented as cubes and hypercubes. When a word with a functional type composes with a word with an atomic type, the composition is represented by the application of the corresponding linear/multilinear map to the vector of the atomic word. As an example, consider the distributional meaning of "red cat", which becomes the result of multiplying the matrix of "red" with the vector of "cat". Denoting the former by M^red_ij ∈ N ⊗ N and the latter by V^cat_j ∈ N, we obtain M^red_ij × V^cat_j as the meaning of "red cat". Similarly, denoting the meaning of "snore" by M^snore_ij ∈ N ⊗ S, we obtain M^snore_ij × V^cats_j as the meaning of "cats snore", and the meaning of "cats like fish" becomes M^like_ijk × V^fish_k × V^cats_j, and so on. The collections of matrices associated to the adjectives and intransitive verbs of a corpus have large matrix sizes, ranging from 100 up to 10K and 40K; see the original work of [13] for the higher dimensions and the sequel work of [14] for the lower ones. While the AI inspired tasks are focused on extracting linguistic structure, e.g. word similarity, from these matrices, such a large collection inevitably has elements of randomness. Any corpus is a finite, even if large, sample selected from everything written in a language. Even if it is a good approximation to everything written, the written words in a corpus are influenced by the experience of the authors, subject for example to a wide range of interactions with the environment and other humans. We may ask if there are universal patterns in the randomness existing in the large datasets of matrices encoding the complex natural system that is human language. The experience of random matrix theory has indeed shown that the patterns in the distribution of energy eigenvalues of complex nuclei [15,16] also occur in a wide variety of complex systems (see for example [17,18,19,20]).
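As an illustration, the tensor contraction semantics described above can be sketched in a few lines of numpy. The word representations here are random placeholders standing in for corpus-derived vectors and tensors, we take S to have the same dimension as N for simplicity, and the variable names are ours, not from the cited works:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4  # toy noun-space dimension; real models use D in the hundreds or thousands

# hypothetical word representations: nouns are vectors, adjectives and
# intransitive verbs are matrices, transitive verbs are rank-3 tensors
cat   = rng.random(D)            # V^cat_j    in N
fish  = rng.random(D)            # V^fish_k   in N
red   = rng.random((D, D))       # M^red_ij   in N ⊗ N
snore = rng.random((D, D))       # M^snore_ij in N ⊗ S
like  = rng.random((D, D, D))    # M^like_ijk in N ⊗ N ⊗ S

red_cat        = red @ cat                                  # M^red_ij V^cat_j
cats_snore     = snore @ cat                                # M^snore_ij V^cats_j
cats_like_fish = np.einsum('ijk,k,j->i', like, fish, cat)   # M^like_ijk V^fish_k V^cats_j
```

Each composition reduces the tensor rank by contracting one index per argument, so "red cat" and "cats snore" are again vectors, as is the sentence "cats like fish".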
In the Linguistic Matrix Theory (LMT) programme of [1], one of the first steps was to identify the appropriate type of symmetry. Here it was useful to consider the kinds of mathematical expressions which are used in distributional semantics to extract the meaning encoded in words. For vector, matrix and tensor data in D dimensions, some of these expressions are invariant under the orthogonal group of all rotations in D dimensions, but the generic expressions are only invariant under the smaller symmetry of all permutations of D objects, the symmetric group S_D. This motivated us to consider matrix models with S_D symmetry. The polynomial functions of matrix variables M_ij which are S_D invariant have an elegant classification in terms of polynomials labelled by directed graphs. The degree of the polynomial is the number of edges in the graph: the number of nodes is unconstrained. There are two graphs at linear order, each associated with a permutation invariant polynomial. A general permutation invariant linear function is a sum of these two polynomials with arbitrary coefficients. We restrict these linear coefficients to be real numbers μ_1, μ_2. There are eleven independent quadratic functions. As a simple toy model we considered three quadratic polynomials with three associated coefficients Λ_1, Λ_2, Λ_3. We defined a function S(μ_1, μ_2, Λ_1, Λ_2, Λ_3) and considered a probability distribution defined by the partition function

Z = ∫ dM e^{−S(M)}.

Given any permutation invariant polynomial, which we will henceforth refer to as an observable and denote O(M), we can calculate a theoretical expectation value

⟨O(M)⟩ = (1/Z) ∫ dM O(M) e^{−S(M)}.

The expectation values of the linear and quadratic observables O are expressible as simple functions of the μ_a, Λ_i. In order to match these probability distributions with experimental data, the experimental expectation values for these five observables were computed as averages over the words in the dataset,

⟨O(M)⟩_EXPT = (1/N_words) Σ_A O(M_A),

where A is a label for the words in the dataset and N_words is the number of words in the dataset. Equating these to the theoretical expectation values, we determined the μ_a, Λ_i parameters of the model for a given dataset.
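A minimal sketch of how such experimental averages are computed, using a stand-in dataset of random word matrices in place of the corpus-derived ones:

```python
import numpy as np

rng = np.random.default_rng(1)
D, n_words = 50, 200
# stand-in dataset: one D x D matrix per word (random here, corpus-derived in practice)
matrices = rng.normal(size=(n_words, D, D))

# the two linear permutation invariant observables
def O_diag(M):
    return M.trace()   # sum_i M_ii

def O_all(M):
    return M.sum()     # sum_{i,j} M_ij

# experimental expectation value: the average of O(M_A) over the words A
def expt_average(observable, mats):
    return np.mean([observable(M) for M in mats])

lin_averages = (expt_average(O_diag, matrices), expt_average(O_all, matrices))
```

These averages, together with those of the quadratic observables, are the experimental inputs that are equated to the theoretical expectation values to fix the model parameters.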
The theoretical model was also used to calculate the expectation values of a number of cubic and quartic observables. These theoretical values, using the input of the μ_a, Λ_i determined as above, give the predictions of the 5-parameter Gaussian model for these observables. We calculated the ratios of the theoretical to experimental values, with a ratio close to 1 indicating good agreement between theory and experiment. The best ratios were approximately 60%, but for a number of observables the ratios were very low, the lowest being around 0.6%. We argued that a more complete treatment with a general Gaussian model that includes all eleven parameters would likely give better ratios.
The theoretical model with all eleven quadratic parameters was solved in [21]. It was useful to employ a representation theoretic approach to the space of quadratic permutation invariant functions. The eleven parameters were organised according to four irreducible representations V_0, V_H, V_2, V_3 of S_D: Λ_{V_0} is a symmetric 2 × 2 matrix with three real parameters, Λ_{V_H} is a symmetric 3 × 3 real matrix with six parameters, and Λ_{V_2}, Λ_{V_3} are each real numbers. We have an action S(M) which defines a probability distribution and associated partition function

Z = ∫ dM e^{−S(M)}.

Convergence of the measure requires that Λ_{V_0}, Λ_{V_H} are positive semi-definite matrices, and Λ_{V_2} ≥ 0, Λ_{V_3} ≥ 0. The first main goal of this paper is to report on the application of this 13-parameter Gaussian model from [21] to the same dataset constructed in [1], testing its effectiveness at predicting cubic and quartic expectation values along the lines of the approach in [1].
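The convergence criteria can be checked mechanically. A sketch, with illustrative parameter values rather than the fitted values from the paper:

```python
import numpy as np

def convergence_criteria_satisfied(L_V0, L_VH, L_V2, L_V3, tol=1e-12):
    """Check the convergence conditions of the 13-parameter Gaussian model:
    Lambda_{V0} (2x2) and Lambda_{VH} (3x3) positive semi-definite,
    Lambda_{V2} >= 0 and Lambda_{V3} >= 0."""
    psd = lambda A: bool(np.all(np.linalg.eigvalsh(np.asarray(A)) >= -tol))
    return psd(L_V0) and psd(L_VH) and L_V2 >= 0 and L_V3 >= 0

# illustrative values only (not the parameters fitted to the linguistic data)
ok = convergence_criteria_satisfied(
    [[2.0, 0.3], [0.3, 1.0]],
    [[1.5, 0.1, 0.0], [0.1, 2.0, 0.2], [0.0, 0.2, 0.8]],
    0.5, 0.7)
```

Using `eigvalsh` (for symmetric matrices) with a small tolerance avoids false negatives from floating-point noise at eigenvalues near zero.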
It is useful at this point to discuss our investigations of Gaussianity in language in the broader context of studying statistical aspects of language and of applications of matrix models in physics. The two central elements of the programme initiated in [1] are Gaussianity and permutation symmetry. It is worthwhile discussing a potential objection to Gaussianity in linguistics. Zipf's Law [22,23] is the observation that in corpora of natural language, e.g. collections of written text in a language such as English, the frequency of a word is inversely proportional to its rank. A power law is of course nothing like a Gaussian, so a quick argument is that Gaussians are not what one typically gets in the statistics of linguistic corpora.
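For concreteness, an idealized Zipfian frequency profile has slope −1 on a log-log plot of frequency against rank; a minimal numerical check, with exact inverse-rank frequencies rather than corpus counts:

```python
import numpy as np

# Zipf's law: frequency of the r-th ranked word ~ C / r
ranks = np.arange(1, 1001)
freqs = 1.0 / ranks   # idealized Zipfian frequencies

# the slope of log(freq) against log(rank) is -1 for an exact Zipf law
slope = np.polyfit(np.log(ranks), np.log(freqs), 1)[0]
```

Real corpus frequencies only approximate this line, but the power-law shape is the point of contrast with a Gaussian.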
Since the 1990s, starting from the works [24,25,26], matrix theories have seen an explosion of applications in theoretical physics, matching their widespread use in applied physics along the lines of the original work of Wigner and Dyson. A prominent area of application is gauge-string duality, where quantum field theories with matrix degrees of freedom in diverse dimensions have an emergent dual description in terms of a string theory. The present application of matrix models to language can be viewed as the exploration of another instance of emergence from matrix theory. In the present case, the emergence is that of universal aspects of randomness in natural language from the mathematics of matrix theory.
One of the lessons from the applications of matrix models in gauge-string duality, and indeed more broadly from applications of quantum field theory, is that the validity of Gaussianity or approximate Gaussianity often needs to be carefully delineated in a complex system. Following the analogy between integrals and the path integrals describing quantum field theories, field theoretic realizations of Gaussianity or near-Gaussianity are phenomena described by free quantum field theories, where the actions are quadratic in the field variables, and by perturbations of these free theories by small higher order corrections. Taking the field theory to be the gravitational space-time field theory in the AdS string background, in the context of the AdS/CFT correspondence [27], an important instance of this is the failure of perturbative gravity in accounting for high energy graviton interactions [28] even at a qualitative level, which is related to the phenomenon of giant gravitons [29]. In the dual matrix CFT description of this physics, giant graviton physics appears from the combinatorics of large composite operators associated with particular shapes of Young diagrams [30]. The Gaussian regime of perturbative gravitons arises from correlators of low order gauge invariant polynomials. The lesson is that the phenomena described by matrix theories are rich and diverse, with Gaussianities emerging from the identification of the appropriate observables.
The same theme is evident in the applications of quantum field theory to the real world. In the context of quantum field theories applied to particle physics phenomena, free-field behaviour (approximate Gaussianity) arises in the high energy (ultra-violet) regime for theories such as quantum chromodynamics (QCD), while it arises in the low-energy (infra-red) regime for quantum electrodynamics (see for example textbooks in quantum field theory such as [31]). In cosmology, the detailed study of the approximate Gaussianity of fluctuations in the cosmic microwave background, which originate from a very early period in the history of the universe, and bounds on the non-Gaussianities, are used to constrain the space of theoretical models of inflation [32]. In this setting, the fluctuations of the temperature of the CMB in different directions in the sky are experimentally measurable observables which can be related to theoretical models of inflation described by a path integral for a scalar field. In the present context of type-driven compositional distributional semantics, the experimental data consists of matrices constructed from linguistic corpora, which can be used to compute averages of their polynomial functions. The Gaussian matrix theory we consider, and potential perturbations thereof one might consider in the future, are the theoretical analogs of the path integrals for inflation considered in cosmology. This natural science perspective on distributional semantics based on matrix models offers an interesting complement to the artificial intelligence perspectives which drive much research on distributional semantics. Beyond the question of quantifying Gaussianity, our investigations of linguistic matrix data in this paper are guided by the intriguing interfaces between the two perspectives of natural science and artificial intelligence.
Going back to the first goal of this paper, we find that low order permutation invariant polynomials, and specifically the 13-parameter Gaussian permutation invariant matrix models, are indeed the right objects with which to detect strong evidence of Gaussianity. While the best theory/expt ratios achieved by the 5-parameter model were near 60%, the best ratios are now near 99%, and for a number of cubic and quartic observables these ratios are above 90%. The lowest ratio is 16%, so the Gaussian model still predicts the right order of magnitude of the expectation value even in the worst case. In all the experiments studied, we find that the linear and quadratic expectation values lead to theoretical parameters μ, Λ consistent with the convergence criteria.
Since the comparison of experiment with theory in the above discussion has only used, for each observable, the experimental average of the observable O(M) over all the words, it is oblivious to the detailed distribution of the observable over the set of words used to calculate the average. This distribution has a standard deviation (δO(M))_EXPT. As a further test of Gaussianity, we can use the standard deviations of the linear and quadratic observables from the data to determine perturbed theoretical parameters μ + δμ, Λ + δΛ, and then use the theoretical equations to determine theoretical predictions (δO(M))_THEO for the standard deviations of the higher order observables. We find that the theory/experiment ratios for the standard deviations range over 26% to 95%. This is a very good success rate, which we confirm by comparing to a simple random walk model for the standard deviations. In our prediction of the standard deviations, we are using, for the same dataset, a range of possible values of the couplings in the Gaussian matrix model, effectively a Gaussian model with a distribution of couplings. The success of these predictions of the standard deviations is our second main result.
Our tables of theory/experiment ratios for ⟨O(M)⟩ and δO(M) show that some pairs of observables have distinctly similar characteristics, whether we are looking at expectation values or standard deviations. Each observable can also be used to rank the words in the dataset, starting from the word with the lowest O(M) and ending with the one with the highest. Since ranked lists of words form a standard tool in distributional semantics, it is natural to ask whether observables which have very similar matrix model characteristics also produce similar ranked lists. We find evidence for a positive answer.
The plan of the paper is as follows. Section 2 is a technical introduction describing the range of experimental data we will be analysing. Section 2.1 gives the system of equations for the theoretical expectation values of the two linear and eleven quadratic observables as functions of the theoretical parameters μ, Λ. In Section 3, for each experiment, the thirteen experimental expectation values are matched with the theoretical ones to determine the appropriate μ, Λ. Section 4 gives the ratios of theoretical to experimental expectation values for cubic and quartic observables. Section 5 explains the motivations, from possible applications in distributional semantics, for our investigations of typicality. It then proceeds to explain the experiment/theory comparisons and presents the results. Section 6 provides evidence showing that observables with similar matrix model characteristics, in terms of expectation values and dispersions, produce similar ranked lists. The comparison of ranked lists is done with the Spearman ρ statistic as well as two-dimensional rank correlation plots. We conclude with a discussion of our results and future directions.
Appendix A lists the equations for the expectation values of cubic and quartic observables in terms of the theoretical parameters μ, Λ. A number of these equations are reproduced, in one or two instances with typos corrected, from [21], and there are four new observables which are computed by the same methods explained there.

Experiments and the 13 theoretical parameters
In this paper, we will be using the matrices for adjectives and verbs that were constructed for the paper [1]. The detailed algorithm is explained there. The matrices are of size D × D, where D ranges in steps of 100 from 300 to 2000.
As explained in [1], the counting of linearly independent permutation invariant polynomials (observables) of a fixed degree is equivalently given by the counting of directed graphs. The nodes correspond to the indices, and a matrix element M_ij corresponds to an edge going from node i to node j. The graphs corresponding to the 11 quadratic polynomials are given in Appendix B of [1]. One minor technical point: in this paper, we find it convenient to associate unrestricted sums to graphs, e.g. to a graph having two edges from one node to another we associate Σ_{i,j} M_ij², and not Σ_{i≠j} M_ij² as in [1]. There are 52 cubic observables/graphs and 296 quartic ones. In [21] the thirteen parameter model was solved, and the computation of expectation values was given for a selection of four cubic and two quartic observables. In this paper, we have developed the theoretical formulae for an additional four observables. The graphs corresponding to the set of ten cubic/quartic observables under consideration in this paper are given in Appendix C. The theoretical equations for the ten expectation values are given in Appendix A.
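A few of these graph-labelled invariants, with their unrestricted sums, can be written directly in numpy. The check at the end verifies invariance under a simultaneous permutation of rows and columns, which is the defining S_D symmetry:

```python
import numpy as np

rng = np.random.default_rng(2)
D = 30
M = rng.normal(size=(D, D))

# linear observables: the two one-edge directed graphs
lin1 = M.trace()   # sum_i M_ii     (a single node with a self-loop)
lin2 = M.sum()     # sum_{i,j} M_ij (an edge between two nodes)

# a few of the quadratic observables; note the unrestricted sums,
# e.g. sum_{i,j} M_ij^2 rather than sum_{i != j} M_ij^2
quad_examples = {
    'sum_ij M_ij^2':      np.sum(M**2),
    'sum_ij M_ij M_ji':   np.sum(M * M.T),
    '(sum_i M_ii)^2':     M.trace()**2,
    'sum_ijk M_ij M_ik':  np.sum(M.sum(axis=1)**2),
}

# invariance check: relabelling nodes, M_ij -> M_{s(i) s(j)} for a
# permutation s, leaves every graph observable unchanged
perm = rng.permutation(D)
Mp = M[np.ix_(perm, perm)]
assert np.isclose(np.sum(Mp**2), quad_examples['sum_ij M_ij^2'])
assert np.isclose(Mp.sum(), lin2)
```

The selection of quadratic invariants above is illustrative; the full list of eleven corresponds to the graphs in Appendix B of [1].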

The system of equations for the 13 parameters
The strategy for the comparison of experiment with theory we use here is exactly as in [1], the only difference being that we now have the full 13-dimensional parameter space of permutation invariant Gaussian matrix models.
From [21], we take the two equations expressing the expectation values of the linear permutation invariant functions of M, ⟨Σ_i M_ii⟩ and ⟨Σ_{i,j} M_ij⟩, in terms of the μ, Λ parameters of the Gaussian model, given in equation (2.1), together with the eleven equations expressing the expectation values of the quadratic permutation invariant functions of M in terms of the μ, Λ parameters, given in equations (2.2)–(2.12).

Parameter values for adjectives at D = 2000
To 3 significant figures, the parameter values for D = 2000 are given below.
The values of the determinants of the coupling matrices for each irreducible representation of S_D, calculated by entering the experimental linear and quadratic expectation values into the system of equations in Section 2.1, are all positive (to 3 significant figures). Since positivity is what the convergence criteria require, the criteria are satisfied. This is evidence for the Gaussian ansatz.

Parameter values for verbs at D = 2000
The parameters of the model, calculated to three significant figures, are tabulated below; for example, μ_1 = 4.29. The convergence criteria for verbs are also satisfied.

Theory/Expt comparisons for expectation values of observables: evidence for Gaussianity

In this section, we describe the comparisons for expectation values of cubic and quartic observables. We find significant agreement at very high levels of accuracy, in the range 90–99%, for a number of observables. This is to be compared with the 57% accuracy that was achieved as the optimum ratio with the 5-parameter model [1]. The lowest theory/expt ratio with the 13-parameter model is 16%, so we have the right order of magnitude even in this worst case.
There are regularities in the nature of high ratio versus low ratio observables, in terms of simple characteristics of the observable-graph, notably the number of nodes. The very high Gaussianities, reflected in ratios ⟨O(M)⟩_THEO / ⟨O(M)⟩_EXPT close to 1, occur for graphs with four or more nodes. The number of nodes corresponds to the number of indices being summed, hence also to a D-scaling of the number of terms in the defining sum.
In detail, the results for the cubic and quartic ratios for the 13-parameter model are presented in the tables below.

Results as a function of dimension
Upon further testing, the convergence criteria were confirmed to be satisfied for all dimensions of both the verb and adjective datasets. The explicit values of the criteria calculations for dimensions 700 and 1300 are provided below. The parameters can also be cast as functions of D and plotted to evaluate the dependence. Included here are example plots for two selected parameters, detailing their values for dimensions ranging from 300 to 2000 (see figures 1 and 2). We tend to see an onset of simple scaling behaviours at around D = 700, hence we also present the calculations of the theory/expt ratios for D = 700 and D = 1300.
Remark. It is worth mentioning that some of the cubic ratios which are very low at D = 2000 improve at D = 700. The construction of vectors for nouns and noun phrases, which is subsequently used to construct matrices for adjectives and verbs, relies on identifying sets of target nouns t along with some context words c. There is a reasonable and well-defined prescription for dealing with the cases where a target is equal to a context word [1]. However, these cases are perhaps more subtle. It is conceivable that the low theory/experiment ratios at higher D might be due to a higher number of c = t cases. This can be investigated by repeating the experiments with datasets which have the c = t cases filtered out. We hope to return to this investigation in the future.

Typicality
We have found that the postulate of Gaussianity allows the prediction, to a high degree of accuracy, of expectation values of a large number of cubic and quartic observables in type-driven compositional distributional semantics. These expectation values are calculated by taking averages over large numbers of large matrices, one for each adjective/verb. A more detailed characterisation of the data in type-driven compositional distributional semantics gives, for each observable, a distribution of frequencies over a space of possible values of the observable. This can be visualized in terms of a histogram for each observable. The mean of the distribution is the expectation value, but we may also look at the spread or variance of the observable. We may ask, for example, whether these distributions become very narrow in the limit of large D. The observables are sums of large numbers of matrix elements, these numbers being D^p for some p which is a characteristic of each observable. In fact p is the number of indices in the sums, which is equal to the number of nodes in the graph. For example, for the first graph/observable in the tables of Sections 3 and 4 we have p = 1, while for the last observable we have p = 8. If we calculate the observables normalized by D^p, and assume a simplistic model of random walks [33] where each term in the sum is a step, then the unnormalized sums would have standard deviations of order D^{p/2}. This simplistic model thus suggests that the standard deviations of the D^p-normalised observables would behave like D^{−p/2} and vanish at large D. This can also be argued as a consequence of the "law of large numbers" [34]. The qualitative expectation of a vanishing of the standard deviations in the limit of large numbers is indeed consistent with the standard deviations we find, so in this sense the distributions are consistent with typicality; in other words, the distributions become peaked at large D.
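The D^{−p/2} scaling suggested by the random walk model is easy to see numerically. The sketch below takes the two-node observable Σ_{i,j} M_ij (so p = 2) with i.i.d. Gaussian entries as a stand-in for the linguistic matrices, so doubling D should roughly halve the spread of the D^p-normalised observable:

```python
import numpy as np

rng = np.random.default_rng(3)

def normalized_std(D, n_samples=1000):
    # the observable sum_{i,j} M_ij has p = 2 nodes, i.e. D^2 terms in the sum;
    # normalize by D^p and measure the spread over random matrices
    vals = rng.normal(size=(n_samples, D, D)).sum(axis=(1, 2)) / D**2
    return vals.std()

# D^{-p/2} = 1/D for p = 2, so doubling D should roughly halve the spread
s1, s2 = normalized_std(40), normalized_std(80)
ratio = s1 / s2
```

For i.i.d. entries the sum of D² terms has standard deviation of order D, so after dividing by D² the spread falls like 1/D, in line with the random-walk estimate.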
It is interesting, however, to ask if we can get a more precise prediction of the standard deviations observed in the permutation invariant observables using the permutation invariant Gaussian matrix models.A precise understanding of these standard deviations, or degrees of typicality for each observable, is motivated both by theoretical physics and the AI goals of distributional semantics.
In many physical systems with large numbers of degrees of freedom, considerations of typicality are of fundamental interest. A common aspect in discussions of typicality is the statement that a large majority of the members of a large collection share some specified characteristic [35]. A typicality characteristic of quantum states of composite systems, made of a physical system of interest together with its environment, is explained as the origin of thermodynamic equilibrium states in quantum statistical thermodynamics [36,37,35]. Typicality has also been used in the context of the AdS/CFT correspondence as a proposal to account for the emergence of gravitational thermal states such as black holes, or closely related "superstar geometries" [38,39].
There are also practical motivations from computational linguistics for a systematic understanding of the typicality properties of the observables. The matrices of type-driven distributional semantics have applications to word and sentence similarity, disambiguation, and inference tasks [40,41,42]. In the similarity tasks, the goal is to decide how similar a pair of language units, such as words, phrases, sentences, and eventually paragraphs and texts, are to each other. Examples of sentence pairs from the three bands of HIGH, MED, and LOW similarity are the following:

(Project presented problem, Report discussed difficulties): HIGH
(Gentleman closed his eyes, man shot the door): MED
(Project presented problem, Gentleman closed his eyes): LOW

Human judgements for these pairs are collected, often using a crowd-sourcing engine such as Amazon Mechanical Turk, and the degree of correlation between these judgements and the model measurements is computed. The measure of correlation often used is Spearman's ρ, calculated between two sets of values: the average human judgements per pair of sentences, and the measurement of the model for the same pair, mainly obtained by computing the cosine of the angle between the vectors of the sentences of the pair. At the word level, in a type-driven setting one builds matrices for adjectives and intransitive verbs (and cubes and hypercubes for transitive and ditransitive verbs) and computes the degree of correlation between the human annotations and the model similarity measures; see [14] for an adjective similarity task on the adjective subset of the MEN word similarity dataset [43] and [44] for a verb similarity task on the VerbSim3500 dataset [45]. The inference task is slightly different, in that instead of a degree of similarity in the unit interval one works with a Boolean value: 1 indicates that the first sentence entails the second one, as in the pair (A cat danced, An animal moved), and 0 says that it does not, as in (A cat danced, The report presented a problem). Here, asymmetric measures, such as the Kullback-Leibler divergence, are computed and compared with the Boolean measures.
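The two similarity-task statistics mentioned above, cosine similarity and Spearman's ρ, can be sketched as follows. The model scores and human judgements here are invented numbers for illustration, not drawn from any of the cited datasets:

```python
import numpy as np

def cosine(u, v):
    # cosine of the angle between two meaning vectors
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def spearman_rho(x, y):
    # Spearman's rho = Pearson correlation of the ranks (no-ties case)
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])

# toy evaluation: hypothetical model scores vs. average human judgements
# for five sentence pairs
model_scores = np.array([0.91, 0.40, 0.15, 0.75, 0.33])
human_judges = np.array([6.2, 3.1, 1.5, 5.0, 2.8])
rho = spearman_rho(model_scores, human_judges)
```

Because Spearman's ρ depends only on ranks, a model is rewarded for ordering the pairs the way the annotators do, regardless of the absolute scale of its scores.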
An important problem in all of these tasks is to devise ways to efficiently construct the matrices for the large collection of words that have functional types. Recall that these are the majority of the words of a language, ranging over adjectives, verbs, adverbs, wh-words, auxiliaries, and many more. The methods of constructing matrices are computationally expensive: one has to first parse the corpora of data to tag the words with their grammatical types, aka their part of speech or POS tags. This procedure determines which words have atomic types and which ones are functional. Despite recent advances in parsers via the use of neural network algorithms, these procedures are still error-prone and, given the large quantities of data that are needed to build the matrices, they take long periods of time to train. The question is whether we can supplement existing algorithms for producing the matrices by using universal statistical characteristics of the existing ones. It is conceivable that the methods of linguistic matrix theory can be used to aid the construction. Imagine a sample of adjective matrices has been constructed. We would then determine the expectation values of some observables from the data of these matrices. Suppose that an observable in question is a high typicality observable. Then, if we wish to construct a new word matrix, we can devise algorithms which take this predicted average as an input. This would require the development of algorithms which construct the matrices while constraining their values for these high typicality observables to be very near the known averages.
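A deliberately simple sketch of the kind of constrained construction envisaged here: draw a random matrix and then enforce a known average of one high-typicality observable, taken here to be the normalised sum Σ_{i,j} M_ij / D², with a hypothetical target value:

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_matrix_matching_mean(D, target, rng):
    """Draw a random D x D matrix, then shift it uniformly so that the
    observable sum_{i,j} M_ij / D^2 hits a given average. A toy stand-in
    for algorithms constraining generation by high-typicality observables."""
    M = rng.normal(size=(D, D))
    M += target - M.mean()   # enforce sum_{i,j} M_ij / D^2 == target
    return M

known_average = 0.37   # hypothetical value measured from existing word matrices
M_new = sample_matrix_matching_mean(100, known_average, rng)
```

A realistic version would have to respect several observables at once, and to stay close to the statistics of the corpus-derived matrices rather than to a Gaussian draw; the point of the sketch is only the shape of the constraint.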
To describe precisely the typicality properties of an observable, we can plot its histogram. For a given observable, we consider the range of its values over the words. We divide the range into a set of small bins, and we draw a histogram where the height of the rectangle in each bin is the number of words (i.e. adjectives or intransitive verbs) whose value for the specified observable falls in that bin. These histograms can be constructed both for the observables that parametrise the model, denoted with superscript p (see figure 3), and for the higher order observables, denoted with superscript h. The histogram for a quadratic observable is given in Figure 3 and that for a cubic observable in Figure 4. There is significant diversity in the behaviour of the standard deviations of the ten observables O^(h)_{G_i} as a function of D. As observed at the beginning of this section, when these observables are divided by D^p, we get standard deviations which go to zero as a function of D in the large D region near 2000. Interestingly, for three of the observables, including G_3, the dispersions go to zero in this region even before dividing by D^p. Despite this diversity of behaviours in the standard deviations as a function of D, the theoretical predictions based on the Gaussian matrix models work well for the whole range of observables considered, predicting the correct orders of magnitude for all of the observables.
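Such a histogram can be computed directly. The sketch below uses random stand-in word matrices and the D^p-normalised quadratic observable Σ_{i,j} M_ij²:

```python
import numpy as np

rng = np.random.default_rng(5)
D, n_words = 40, 500
matrices = rng.normal(loc=0.1, size=(n_words, D, D))  # stand-in word matrices

# value of a quadratic observable for each word, normalized by D^p (p = 2 here)
values = np.array([np.sum(M**2) for M in matrices]) / D**2

# histogram over the words: bin heights are numbers of words per bin
counts, bin_edges = np.histogram(values, bins=20)
mean, std = values.mean(), values.std()
```

The mean of `values` is the experimental expectation value of the observable, and `std` is the dispersion whose smallness relative to the mean is the signature of typicality.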

Theoretical predictions for typicality
The permutation invariant Gaussian matrix model (PIGMM) with fixed $\mu, \Lambda$, determined by matching the linear and quadratic experimental averages, gives expectation values for the higher order polynomial invariants. By considering variations $\delta\mu, \delta\Lambda$ to fit the experimental expectation values shifted by their standard deviations, we can calculate the corresponding shifts in the theoretical expectation values. These shifts can be compared to the standard deviations in the higher order expectation values. This effectively involves using the PIGMM with a distribution of values of the $\mu$ and $\Lambda$ parameters to predict both the expectation values and the dispersions of higher order observables. It is interesting to note that physical models with random couplings are widely studied in condensed matter physics (e.g. [46]) and a class of these (SYK models [47,48]) have recently attracted interest because of their potential links [49] to tensor model holography.
We now describe in more detail this prediction of the dispersions of the higher order observables. We consider the Gaussian model parameterised by the $\mu_a$ parameters, for $a \in \{1, 2\}$, and the eleven parameters organised as matrix elements of $\Lambda_V$, for $V \in \{V_0, V_H, V_2, V_3\}$. We have an action $S(\mu_a, \Lambda_V)$ which determines the Gaussian measure. We have experiments parameterised by a binary choice (verbs or adjectives) and a choice of $D$. In section 3, we have used, for each experiment, two linear experimental expectation values and eleven quadratic expectation values. We will refer to these thirteen experimental expectation values as $O^{(p)}_{G_i}$; the superscript $p$ refers to the fact that these observables are used to parameterize the theoretical models. The subscript $G$ refers to the fact that the structure of the polynomial corresponds to a graph. This experimental input has been used to determine theoretical parameters $\mu_a, \Lambda_V$. We have then used these theoretical parameters to determine theoretical cubic and quartic expectation values $(O^{(h)}_G)_{TH}$. The superscript $h$ refers to observables of higher order than linear or quadratic. We tabulated the ratios for a number of these experiments.
Using the histograms, for each parameterising observable we can determine a standard deviation $(\delta O^{(p)})_{EXPT}$. We can define positively shifted values
$$(O^{(p)})^{+} = (O^{(p)})_{EXPT} + (\delta O^{(p)})_{EXPT}$$
and use the linear system in section 2.1 to calculate shifted parameters $\mu_a^{+}, \Lambda_V^{+}$. These shifted parameters are used to calculate theoretical shifted expectation values $(O^{(h)})^{+}_{TH}$. We can repeat these steps with the negatively shifted expectation values
$$(O^{(p)})^{-} = (O^{(p)})_{EXPT} - (\delta O^{(p)})_{EXPT}.$$
Using the equations in Section 2.1, these lead to $\mu_a^{-}, \Lambda_V^{-}$. The positively and negatively shifted parameters define shifted theoretical values for the higher order observables using the equations (Appendix A) derived from the matrix model. Using the positively shifted theoretical parameters, we can define a magnitude of theoretical shift in the higher order observables
$$(\delta O^{(h)})^{+}_{TH} = \left| (O^{(h)})^{+}_{TH} - (O^{(h)})_{TH} \right|.$$
Similarly the negatively shifted theoretical parameters define a magnitude of theoretical shift
$$(\delta O^{(h)})^{-}_{TH} = \left| (O^{(h)})^{-}_{TH} - (O^{(h)})_{TH} \right|.$$
A measure of the theoretically predicted shift in expectation value is taken as the average
$$(\delta O^{(h)})_{TH} = \frac{1}{2} \left( (\delta O^{(h)})^{+}_{TH} + (\delta O^{(h)})^{-}_{TH} \right).$$
In these theoretical predictions of the dispersion for the higher order polynomial invariants, we have taken as experimental input 26 parameters (13 expectation values and 13 standard deviations for linear and quadratic observables) from the data, which are being used alongside the equations of the permutation invariant Gaussian matrix model.
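A minimal numerical sketch of this plus/minus shift procedure, under stated assumptions: `params_from_expectations` is a placeholder for the linear system of section 2.1, and `higher_order_theory` is a placeholder for the Appendix A formulas. Both maps are hypothetical illustrations, not the paper's actual equations.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical stand-ins for the paper's maps: a random invertible linear
# system in place of section 2.1, a cubic formula in place of Appendix A.
A = rng.normal(size=(13, 13))

def params_from_expectations(O_p):
    return np.linalg.solve(A, O_p)       # placeholder linear map to (mu, Lambda)

def higher_order_theory(params):
    return float(np.sum(params ** 3))    # placeholder higher order prediction

O_p = np.abs(rng.normal(size=13))        # the 13 expectation values (O^(p))_EXPT
dO_p = 0.1 * O_p                         # (delta O^(p))_EXPT from the histograms

central = higher_order_theory(params_from_expectations(O_p))
plus = higher_order_theory(params_from_expectations(O_p + dO_p))
minus = higher_order_theory(params_from_expectations(O_p - dO_p))

# Predicted dispersion: average of the magnitudes of the two theoretical shifts
dO_h_theory = 0.5 * (abs(plus - central) + abs(minus - central))
print(dO_h_theory >= 0.0)  # True
```

The quantity `dO_h_theory` plays the role of $(\delta O^{(h)})_{TH}$, to be compared against the experimental standard deviation of the same observable.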
In section 3 we used the expectation values for the higher order observables $O^{(h)}_G$. A more refined look considers histograms for each observable. The mean value extracted from the histogram is the expectation value used earlier. The standard deviation of each histogram determines a $(\delta O^{(h)})_{EXPT}$. In the tables below, for a number of the experiments, we tabulate the ratios $(\delta O^{(h)})_{TH} / (\delta O^{(h)})_{EXPT}$. Using the above methodology, the standard deviation ratios between theory and experiment for the 10 cubic/quartic observables $O^{(h)}_{G_i}$ were calculated and are provided in the tables below.
Cubic and quartic standard deviation ratios for the 13 parameter model: For half the observables, the prediction agrees with the data at a level above 70%, the best ratios between theoretical and experimental standard deviations reaching 95%, while the worst are at 26%. Considering that the best ratios (for expectation values) obtained with the 5-parameter model were at 57% and the worst at 0.6%, this can be considered another significant success of the thirteen parameter models. Another way to understand the range of ratios is to compare with a simple Gaussian random walk model. Given that each of the observables involves a sum over a number of indices ranging from 1 to $D$, the number of terms in each observable is $D^p$, where $p$ ranges from 1 (for $O_{G_1}$) to 8 (for $O_{G_{10}}$). A simple random walk model for the dispersions in the first table is $\sigma D^{p/2}$ for some constant $\sigma$. Fixing $\sigma$ to match exactly the last dispersion, we find
$$\sigma = 0.205 \qquad (5.12)$$
and the following list of ratios for $\sigma D^{p/2} / (\delta O^{(h)}_{G_i})_{EXPT}$:
$$\{26, 151, 2830, 255, 0.19, 15.9, 0.52, 0.97, 0.57, 1\} \qquad (5.13)$$
The comparison of the range 0.19 to 2830 of these ratios with the range 26%-95% from the 13-parameter PIGMM, with random couplings, is another way to see the effectiveness of our theoretical framework for predicting the dispersions of the observables.
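The random walk estimate can be sketched as follows. The per-observable index counts `p` and the experimental dispersions below are hypothetical placeholders (the text only states $p=1$ for $O_{G_1}$ and $p=8$ for $O_{G_{10}}$, and fixes $\sigma$ by matching the last dispersion, which gives $\sigma = 0.205$ for the paper's data).

```python
import numpy as np

D = 2000
# Hypothetical index counts p for the ten observables (middle values illustrative)
p = np.array([1, 2, 3, 4, 5, 6, 6, 7, 7, 8])

# Hypothetical experimental dispersions (delta O^(h)_Gi)_EXPT
rng = np.random.default_rng(3)
disp_expt = np.abs(rng.normal(size=10)) * D ** (p / 2.0)

# Fix sigma so the random walk estimate sigma * D^(p/2) matches the last dispersion
sigma = disp_expt[-1] / D ** (p[-1] / 2.0)

ratios = sigma * D ** (p / 2.0) / disp_expt
print(np.isclose(ratios[-1], 1.0))  # True: the last ratio is 1 by construction
```

With the paper's actual dispersions in place of the placeholders, `ratios` reproduces the list in equation (5.13).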

Matrix model characteristics and correlations of word rankings
The inspection of the data on Gaussianity and typicality of the observables allows us to rank these observables in terms of how alike they are. A refined look at each observable can be used to produce a ranked list of the adjectives, using the value of the observable as a ranking criterion, for example listing the adjective with the smallest $(O^{(h)}_G)_{EXPT}$ first and the one with the highest $(O^{(h)}_G)_{EXPT}$ last. Many computational tasks in distributional semantics work with ranked lists of words and compare their degrees of correlation. The main task here is ranking pairs of strings of words that are semantically related to each other. For example, SimLex-999 is a dataset that quantifies the degree of similarity or relatedness of 999 pairs of words, such as (cup, mug) and (cup, coffee). It includes adjective, noun and verb pairs. Each pair is assigned a set of rankings as judged by numerous human annotators and as predicted by different models. A degree of correlation is computed between rankings of different annotators and a set of different models. The models that better correlate with human annotations are returned as the "better predicting" models. Often, the human annotators are also correlated with each other in order to find out how much they agree with each other and to compute an inter-annotator agreement. These datasets have often been specialised to only contain specific grammatical structures, e.g. adjective noun phrases, as in [50], which contains pairs of adjective noun phrases such as (last number, vast majority) together with gold-standard human similarity judgements. We also have the sentence similarity datasets mentioned in the previous section on Typicality. Adjective and verb similarity datasets, also mentioned in the section on Typicality, are other examples. A further slightly different task is inspired by the dataset of [51], which consists of a set of unobserved acceptable phrases such as "ethical statute" and a set of deviant
phrases such as "cultural acne". The task here is to measure how well different models can distinguish between these two different sets of phrases. A future direction of our project is to find out whether our model can predict and be applicable to any of these tasks.
As a first step in this direction, we investigate whether the patterns observed in the matrix model characteristics of the different observables are also reflected in the properties of the ranked lists for the observables. If two observables are very similar in terms of matrix model characteristics, do they produce very similar ranked lists?
We have investigated this question by comparing the Spearman $\rho$ for the four observables using the adjectives data set at dimension $D = 2000$. The correlation values are displayed in the following table. Another way of comparing two ranked lists $L^{(1)}, L^{(2)}$ is to use a two-dimensional correlation plot. We can write the elements of the first list as $L^{(1)} = (L_1, L_2, \dots, L_n)$. In the second list $L^{(2)}$, suppose $L_1$ appears in position $i_1$, $L_2$ in position $i_2$, etc. We can plot on the $x$-$y$ plane the points $(1, i_1), (2, i_2), \dots, (n, i_n)$. If the two ranked lists are identical, then these points fall on a straight line of gradient 1. We conjecture that a systematic study of the degree of similarity between the matrix model characteristics of observables will show that these are indeed very well correlated with the degree of similarity of the associated ranked lists. These regularities in the matrix model characteristics and ranked lists associated with observables are observational properties of the data. Is there a theoretical prediction of this property? Given the usefulness of ranked lists in the tasks of distributional semantics, are these regularities an avenue towards applications of the matrix model perspective as a tool in facilitating concrete tasks in computational linguistics?
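Both comparisons can be sketched in a few lines, using the standard no-ties formula $\rho = 1 - 6\sum d^2 / (n(n^2-1))$; the observable values below are hypothetical stand-ins for the 273 adjective evaluations.

```python
import numpy as np

def rank(values):
    """Position (0-based) of each element in the ascending ranked list."""
    order = np.argsort(values)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(values))
    return ranks

def spearman_rho(x, y):
    """Spearman rho for two observables evaluated on the same words (no ties)."""
    d = (rank(x) - rank(y)).astype(float)
    n = len(x)
    return 1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1))

rng = np.random.default_rng(4)
obs_a = rng.normal(size=273)                # hypothetical observable values per adjective
obs_b = obs_a + 0.1 * rng.normal(size=273)  # a second, strongly correlated observable

rho = spearman_rho(obs_a, obs_b)
print(rho > 0.9)  # True: near-identical rankings give rho close to 1

# Rank correlation plot: one point (rank in list 1, rank in list 2) per word;
# identical lists would put all points on the straight line of gradient 1
points = list(zip(rank(obs_a), rank(obs_b)))
```

Scattering `points` on the $x$-$y$ plane gives exactly the two-dimensional correlation plots shown in Section 6 and Appendix B.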

Summary and Outlook
Matrix models have had widespread success in capturing universal characteristics of randomness in diverse types of complex systems [15,16,17,18,19,20]. The program of Linguistic Matrix Theory (LMT) [1,21] follows the same philosophy and aims to characterise universal features of the randomness in the matrices/tensors constructed in type-driven compositional distributional semantics. Concretely, it postulates Gaussianity in the expectation values of permutation invariant polynomial functions of matrices associated with adjectives or intransitive verbs. In this paper, we have found high levels of success in the predictions of Gaussianity for a significant number of observables. These tests of Gaussianity have been formulated both for the expectation values of observables and for the standard deviations of the observables. Another strong piece of evidence in favour of the Gaussianity hypothesis is that, in all the experiments we have done, the theoretical parameters extracted are compatible with convergent Gaussian measures. These high levels of success show that the Gaussianity hypothesis is fundamentally sound, and this raises a number of questions for further investigation.
• For a small number of observables, $O_{G_2}$, $O_{G_3}$ in Table 1, the theory/experiment ratios are noticeably smaller than the others. One possibility is that certain justifiable improvements in the algorithms for constructing the matrices can increase these ratios. For example, in the present construction of the noun vectors (which are fed into a linear regression method to produce adjective matrices), it can happen that the list of nouns has some overlap with the list of context words. When this happens, there is an issue of how to count the frequency of proximity of a word with itself. The present algorithm uses a reasonable, but perhaps non-unique, choice for handling these cases. A modification of the algorithm would exclude these cases from the construction of word vectors, and investigate the resulting matrices.
• If it turns out that $O_{G_2}$, $O_{G_3}$ really are less Gaussian than the remaining observables (as in Table 1), after any reasonable changes in the construction algorithms for the matrices, then we may ask how to modify the matrix model so as to increase the accuracy of prediction for these expectation values. A natural guess would be to perturb the Gaussian model by adding these specific observables to the action.
• Can we get comparable or higher levels of success in predicting expectation values when using different matrix constructions? How universal are the statistical characteristics we are finding? Different constructions have been used on the computational side to produce the linguistic matrices, e.g. algorithms such as linear regression [13], multi-step linear regression [52], and neural networks, e.g. the extensions of the hierarchical softmax algorithm of the Word2Vec model of [53], developed in [14]. The main idea behind these constructions is the same: they all explore the original intuitions of Firth and Harris, that we can use the contexts of words and the degrees of similarity between them to build matrices for words with functional types. As the methods behind these constructions advance, the matrices become denser and learn to perform better on the tasks that they are trained on.
• In section 6 we have investigated rankings of words directly associated with the observables. The idea of considering rankings was motivated by uses of word and phrase rankings in AI tasks, as discussed in detail in section 5. Relating the rankings of word and phrase similarity tasks to the rankings associated with the observables of the matrix model (section 6) is a very interesting avenue for further investigation.
• It will be interesting to compute the matrix model characteristics (the $\mu$, $\Lambda$ parameters, the theory/experiment ratios for higher order observables, and their standard deviations) for other linguistic corpora, e.g. specialising to particular genres of literature, different domains (e.g. news articles) and modes (e.g. audio and video) of content, or using languages other than English. This will be a way to identify which of the matrix model characteristics are universal and which are corpus-dependent.
The conventional applications of distributional semantics in AI focus on structural aspects of the data related to the meanings assigned by humans to words. LMT focuses on the characteristics of the randomness, successfully predicts some of these characteristics to high accuracy, and demonstrates simple patterns in the success rates in terms of the structures of the observables as encoded in graphs. A very interesting conceptual question is: how do structure and randomness interface in distributional semantics? The observables and experiments in this paper provide some tools for investigating this question. Mathematical perspectives on the broad question of interfaces between structure and randomness are discussed in [54].
Natural language is a very interesting natural and complex system, amenable, through universal perspectives based in matrix theories, to ideas from theoretical physics. We have so far used ideas from theoretical physics to identify regularities in the randomness present in language. An interesting question for the future is whether the characterization of the universality classes of randomness existing in language holds some lessons for theoretical physics, in its quest to understand the complex natural system that is the universe.

B Additional Spearman ρ plots
The remaining rank correlation plots associated to the Spearman $\rho$ calculations made in Table 1 are provided below. Collectively, all the plots in Section 6 and below confirm the conclusion that $O^{(h)}_{G_9}$ and $O^{(h)}_{G_{10}}$ have the best pairwise correlation of the associated ranked lists of words, followed by the pair $O^{(h)}_{G_2}$, $O^{(h)}_{G_3}$, whose ranked lists are also very similar to each other.

Figure 3: Histogram for $O^{(p)}_{G_{13}} = \sum_{i,j,k} M_{ii} M_{jk}$. Data collected from the adjective data set, at dimension $D=2000$.

In the course of developing the theory/experiment comparisons for the typicalities of the observables, we have made use of the histograms for these observables. These histograms are built by dividing the range of values $(O^{(h)}_G)_{EXPT}$ into a number of bins and depicting, in terms of vertical bars, the multiplicity of words which have the evaluation of their $(O^{(h)})_{EXPT}$ in each bin. The dataset for adjectives has a total of 273 adjectives which fall in the various bins.

Figure 5: Rank correlation plot corresponding to the graph 2 and 3 observables.

Figure 7: Rank correlation plot corresponding to the graph 9 and 10 observables.

Table 1: Spearman correlation coefficients and associated p-values for pairs of observable lists.

How close the plots for two lists are to a straight line of gradient 1 can be used as a visual estimate of their degree of similarity. The correlation plots for the significant pairs of lists produced from $\{O_{G_2}, O_{G_3}\}$ and $\{O_{G_9}, O_{G_{10}}\}$ are shown below, with the remaining plots given in Appendix B.