Compositional data analysis in geochemistry: Are we sure to see what really occurs during natural processes?

by A. Buccianti, E. Grunsky

Journal of Geochemical Exploration


Economic Geology / Geochemistry and Petrology




Compositional data analysis

Geochemical data

Simplex geometry

Log-ratio approach

Environmental modelling

Journal of Geochemical Exploration 141 (2014) 1–5

A. Buccianti a,⁎, E. Grunsky b

a Department of Earth Sciences, University of Florence, Italy
b Geological Survey of Canada, Ottawa, Ontario K1A 0E8, Canada

⁎ Corresponding author. Tel.: +39 0552757493; fax: +39 055284571.

Article history:
Received 11 March 2014
Accepted 17 March 2014
Available online 26 March 2014

Abstract

Geochemical data are typically reported as compositions, in the form of proportions such as weight percent, parts per million, etc., subject to a constant sum (e.g. 100%, 1,000,000 ppm). This implies that such data are "closed"; that is, for a composition of D components, only D − 1 components are required. The statistical analysis of compositional data has been a major issue for more than 100 years. The problem of spurious correlation, introduced by Karl Pearson in 1897, affects all data measuring parts of some whole, which are, by definition, constrained; and such measurements are present in all fields of geochemical research. The use of the log-ratio transform was introduced by John Aitchison to overcome these constraints by opening the data into the real number space, within which standard statistical methods can be applied. However, many statisticians and users of statistics in the field of geochemistry are unaware of the problems affecting compositional data, as well as of the solutions that overcome these problems. A look into the ISI Web of Science and Scopus databases shows that most papers in which compositional data are the core of geochemical research continue to ignore methods to correctly manage constrained data. A key question is how we can demonstrate that the interpretation of the behaviour of chemical species in the natural environment and in geochemical processes is improved when the compositional constraint of geochemical data is taken into account through the use of new methods. In order to achieve this aim, this special issue of the Journal of Geochemical Exploration focuses on the correct statistical analysis of compositional data. Applications in exploration, monitoring and environments, considering several geological matrices, are presented and discussed, illustrating that several paths can be followed to understand how geochemical processes work.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

More than 100 years have elapsed since Karl Pearson wrote his paper on spurious correlation in 1897 (Pearson, 1897; Fig. 1) and more than 30 years from a first solution based on log-ratios proposed by John Aitchison (Aitchison, 1982; Fig. 2). Since then, the approach has been characterised by many studies on the natural geometry of the sample space where compositional data are positioned. In order to understand the meaning of these words, it is necessary to link geochemistry to geometry, two fields of research apparently distant but, in reality, closely linked.

The geometry of a composition is the metric of the sample space. For example, when we measure concentrations of some geological material in the laboratory we do not expect to find negative values, only positive values on which an interpretation is based. The sample space is where compositional values are located. Compositions are compared by measuring their distance and translations along linear or nonlinear trends. It is in this sample space that random variables, the mathematical rules to attribute probability to the occurrences of events, are defined for statistical inference.

Some sample spaces may be better than others to exploit the information contained in the data. This is the case of compositional data, where the elements of the composition are non-negative and sum to a constant, e.g. to unity, since they have been scaled by the total of the components as a standardization practice. A consequence of this is that a composition of D parts, $[x_1, x_2, \ldots, x_D]$, can be identified with a closed vector

$$\mathbf{x} = \mathcal{C}[x_1, x_2, \ldots, x_D] = \left[ \frac{x_1 \cdot k}{\sum_{i=1}^{D} x_i}, \frac{x_2 \cdot k}{\sum_{i=1}^{D} x_i}, \ldots, \frac{x_D \cdot k}{\sum_{i=1}^{D} x_i} \right] \tag{1}$$

where $\mathcal{C}$ is called the closure operation to the constant k (Aitchison, 1986). The set of real positive vectors closed to a constant k constitutes the constrained sample space called the simplex of D parts, denoted by $S^D$ and defined as

$$S^D = \left\{ (x_1, x_2, \ldots, x_D) : x_1 > 0, x_2 > 0, \ldots, x_D > 0;\; x_1 + x_2 + \ldots + x_D = k \right\} \tag{2}$$
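The closure operation of Eq. (1) and the simplex conditions of Eq. (2) can be illustrated with a minimal sketch in plain Python. The function names `closure` and `in_simplex`, the tolerance `tol`, and the three-part sample values are assumptions introduced here for illustration; they do not come from the paper.

```python
from typing import List, Sequence


def closure(parts: Sequence[float], k: float = 1.0) -> List[float]:
    """Rescale strictly positive parts so that they sum to k (Eq. 1)."""
    if any(p <= 0 for p in parts):
        raise ValueError("all parts must be strictly positive")
    total = sum(parts)
    return [p * k / total for p in parts]


def in_simplex(x: Sequence[float], k: float = 1.0, tol: float = 1e-9) -> bool:
    """Check the two conditions of Eq. 2: positive parts and constant sum k."""
    positive = all(xi > 0 for xi in x)
    sums_to_k = abs(sum(x) - k) <= tol * max(1.0, abs(k))
    return positive and sums_to_k


# Hypothetical three-part sample (made-up values, arbitrary units).
sample = [12.0, 30.0, 18.0]

# Close the sample to k = 100, i.e. express it in percent.
closed_pct = closure(sample, k=100.0)

print(closed_pct)
print(in_simplex(sample, k=100.0), in_simplex(closed_pct, k=100.0))
```

The raw sample sums to 60 and therefore fails the sum condition for k = 100, whereas its closed version satisfies both conditions.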

Note that geochemical data are always non-negative and are restricted to the positive part of the real sample space, $\mathbb{R}_+^D$. It should be noted that the previous approach gives importance to the sample space (the sum constraint), but compositional data are not necessarily closed a priori.

In most situations it is the analyst who decides that the total of each sample is not relevant and then normalises the data to proportions. All the sets of data are equivalence classes from a mathematical point of view (Buccianti and Pawlowsky-Glahn, 2005).
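One way to read the equivalence-class statement is that samples differing only in their total carry the same relative information and close to the same composition. The sketch below illustrates this under stated assumptions: the `closure` helper, the variable names and the numeric values are hypothetical and chosen only for the example.

```python
import math
from typing import List, Sequence


def closure(parts: Sequence[float], k: float = 1.0) -> List[float]:
    """Rescale strictly positive parts so that they sum to k."""
    total = sum(parts)
    return [p * k / total for p in parts]


# Hypothetical raw measurements (made-up values) and the same sample
# with every part multiplied by 3, e.g. a larger amount of material.
raw = [250.0, 500.0, 250.0]
raw_scaled = [v * 3.0 for v in raw]

# Once the total is declared irrelevant, both vectors are normalised
# to proportions and become indistinguishable.
a = closure(raw)
b = closure(raw_scaled)
same_composition = all(math.isclose(ai, bi) for ai, bi in zip(a, b))

print(a, b, same_composition)
```

Both vectors belong to the same equivalence class in this sense: the choice of total changes the numbers reported, not the proportions among the parts.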

Fig. 1. Karl Pearson (27 March 1857–27 April 1936), the scientist who founded the discipline of mathematical statistics.

The key to understanding compositional data relies on defining a correspondence between the simplex $S^D$ and $\mathbb{R}^D$, the real space governed by Euclidean geometry, through the use of a metric where classical statistics can be applied for an unbiased interpretation of the relationships and patterns of geochemical data.
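As an illustration of such a correspondence, the sketch below uses the centred log-ratio (clr) transform, one of the log-ratio transforms associated with Aitchison's approach. The passage above does not say which transform is intended, so the choice of clr is an assumption made here, as are the function names and the example composition.

```python
import math
from typing import List, Sequence


def clr(x: Sequence[float]) -> List[float]:
    """Centred log-ratio: log of each part over the geometric mean of all parts."""
    logs = [math.log(xi) for xi in x]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [value - mean_log for value in logs]


def clr_inverse(z: Sequence[float], k: float = 1.0) -> List[float]:
    """Map clr coordinates back to the simplex by exponentiating and closing to k."""
    exps = [math.exp(zi) for zi in z]
    total = sum(exps)
    return [e * k / total for e in exps]


# Hypothetical composition closed to 1 (made-up values).
composition = [0.2, 0.5, 0.3]

z = clr(composition)   # unconstrained real coordinates
back = clr_inverse(z)  # recovered composition

print(z)
print(back)
```

The clr coordinates can be negative and sum to zero, and the inverse recovers the original composition, so the mapping loses no information while moving the data into a space where Euclidean distances can be used.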

The equivalence between $S^D$ and $\mathbb{R}^D$ is obtained by defining equivalent operations in $S^D$. The definition of the operations of sum (difference) and product, called perturbation and powering, together with the definition of other properties (norm, distance, inner product), allows us to consider $S^D$ as a space with a structure governed by Euclidean geometry, completely equivalent to the geometry of the corresponding