Talk:Mahalanobis distance


Classification using pooled within-group covariance matrix[edit]

This page says "In order to use the Mahalanobis distance to classify a test point as belonging to one of N classes, one first estimates the covariance matrix of each class, usually based on samples known to belong to each class. Then, given a test sample, one computes the Mahalanobis distance to each class, and classifies the test point as belonging to that class for which the Mahalanobis distance is minimal."

However, I have two sources that both use the single pooled within-group covariance matrix for computing all distances, instead of a covariance matrix per class.

  1. "Analyzing Multivariate Data", Lattin, Carroll, and Green, page 458.
  2. http://people.revoledu.com/kardi/tutorial/LDA/Numerical%20Example.html

How to reconcile these two views? I believe the classification section should be rewritten.

dfrankow (talk) 17:02, 23 December 2008 (UTC)[reply]

It depends on the problem you are examining. Suppose you have a single physical phenomenon that generates all the points. Then it makes sense to estimate the covariance matrix and use it to compute the distance between any two points. The same covariance matrix can be used to compute the distance of a point from the centroid of the pool.
Suppose now that you have several distinct physical phenomena which generate the points, and that you have some method to compute the centroids and the covariances for each of the phenomena. Then, in order to classify a given point, you can use each class's matrix to compute its distance from each of the centroids. Assigning the point to the nearest class is the same as using a maximum likelihood criterion, if the distributions are multivariate Gaussian.
Does this explanation satisfy you? Do you think that the article is not clear in this respect? --Pot (talk) 23:01, 23 December 2008 (UTC)[reply]
I think the article is not clear in this respect. In fact, knowing which one to try is not obvious. When is a single phenomenon, as opposed to several distinct ones, generating some numerical data? Deciding that is likely beyond the scope of the article, but a discussion of the issue in the article would be welcome. For example, in which cases would the results of these two methods differ greatly? dfrankow (talk) 16:44, 29 December 2008 (UTC)[reply]
I'll try to explain it from another point of view. The two cases I described above cannot be mistaken for one another, because each is relevant to a different problem that you are trying to solve:
  1. you have the outcomes of an experiment, i.e. the samples, and you want to tell how far a given sample is from the mean; example: you repeatedly measure the distance between two fixed points – the M. distance of a given measurement is its offset from the mean divided by the standard deviation of the measurement error, which you estimate from the measurement samples themselves
  2. you have the outcomes of two experiments, mixed together, and you want to tell from which experiment a given sample was generated; example: you receive a binary signal whose two amplitude values you know, to which additive Gaussian noise with known variance is added – the M. distance of a given sample from each of the two possible values of the signal is the distance from that value divided by the standard deviation of the noise; in this simple case the covariances are equal for both values --Pot (talk) 00:23, 4 January 2009 (UTC)[reply]
That is reasonable. It should be in the article. dfrankow (talk) 20:13, 1 February 2009 (UTC)[reply]
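To make the two approaches in this thread concrete, here is a minimal sketch in Python/NumPy. The data, class parameters and function names (mahalanobis, classify_per_class, classify_pooled) are invented purely for illustration; this is not taken from the article or from any of the cited sources.

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Mahalanobis distance of point x from a distribution with the given mean and covariance."""
    diff = x - mean
    return np.sqrt(diff @ np.linalg.inv(cov) @ diff)

def classify_per_class(x, means, covs):
    """Assign x to the class whose own covariance yields the smallest distance (the article's description)."""
    return int(np.argmin([mahalanobis(x, m, c) for m, c in zip(means, covs)]))

def classify_pooled(x, means, class_samples):
    """Assign x using a single pooled within-group covariance for all classes (the LDA-style view)."""
    pooled = sum((len(s) - 1) * np.cov(s, rowvar=False) for s in class_samples)
    pooled /= sum(len(s) - 1 for s in class_samples)
    return int(np.argmin([mahalanobis(x, m, pooled) for m in means]))

# Illustrative data: two Gaussian classes with quite different covariances.
rng = np.random.default_rng(0)
a = rng.multivariate_normal([0, 0], [[1, 0], [0, 10]], size=200)
b = rng.multivariate_normal([4, 4], [[10, 0], [0, 1]], size=200)
means = [a.mean(axis=0), b.mean(axis=0)]
covs = [np.cov(a, rowvar=False), np.cov(b, rowvar=False)]

x = np.array([0.0, 6.0])
print(classify_per_class(x, means, covs), classify_pooled(x, means, [a, b]))
```

For a point like this one, the two rules will typically disagree (the per-class rule favours the class that is elongated in the point's direction, the pooled rule does not), which is exactly the difference the question above is about.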

Leverage[edit]

Statistical leverage links here. Is that appropriate?? Seems to me the former concept is broader than this article. — DIV (128.250.204.118 07:46, 9 July 2007 (UTC))[reply]

Yes. There is a section for it Mahalanobis_distance#Relationship_to_leverage. Shyamal 09:29, 9 July 2007 (UTC)[reply]
That section is very uninformative, as it doesn't explain what leverage is. Ideally it would be linked to a page on statistical leverage, but that page doesn't seem to exist. --84.9.85.135 12:21, 30 October 2007 (UTC)[reply]
If someone creates the article it will be located at Leverage (statistics). ~MDD4696 20:10, 3 December 2007 (UTC)[reply]
Brief article now created at Leverage (statistics). Now changing entry on Leverage (disambiguation). Melcombe (talk) 11:07, 17 December 2008 (UTC)[reply]

Intuitive explanation[edit]

That intuitive explanation was very helpful, great work. Matumio (talk) 09:34, 21 December 2007 (UTC)[reply]

I totally agree! --Uliba (talk) 14:41, 21 April 2009 (UTC)[reply]
The intuitive explanation is very well written. Many other wikipedia articles would be much improved by a similar section. Ceoyoyo (talk) 21:11, 6 January 2010 (UTC)[reply]
Yeah I just came to the talk page to thank the person who wrote the intuitive explanation as well. Would be nice to be a standardized section name in Wikipedia maths articles. 81.133.20.197 (talk) 14:01, 15 August 2012 (UTC)[reply]
Fully agree with the above remarks. Sources are more or less irrelevant to this section. If there has to be one, then it should be the author of this excellently cogent explanation. Lawrence Normie (talk) 11:47, 7 July 2022 (UTC)[reply]

Covariance matrix symbol[edit]

The covariance matrix is usually called S rather than Σ. It hampers readability. I am changing it unless there is a specific reason for calling it Σ. Sourangshu (talk) 13:34, 25 February 2008 (UTC) Monday, February 25 2008[reply]

I have changed the text in the third paragraph, where the covariance matrix was denoted by a different symbol than in formula (2). Now the paragraph is consistent with formula (2). Rlopez (talk) 13:25, 17 June 2008 (UTC)[reply]

I understand the desire to use a symbol for covariance that is not confused with 'sum', but every other article makes do, including that for covariance. For that reason, I prefer Σ over S. 192.35.35.34 (talk) 23:27, 13 March 2009 (UTC)[reply]

Maybe I do not understand what you mean, but in covariance Σ is never used. The notation used there is Cov(). --Pot (talk) 08:08, 14 March 2009 (UTC)[reply]

Cov(x) = Σ. Cov(.) is a function on x. Σ is the result of applying that function to x. The usual distinction between Σ and S is that the former indicates a population value and the latter a sample estimate (in statistics). In the present case, that distinction appears moot. Kmarkus (talk) 13:49, 18 September 2009 (UTC)[reply]

I don't think it is moot. I believe that strictly speaking a "Mahalanobis distance" in its original meaning was defined for a sample covariance and was defined for the purpose of being a distance between samples from two populations. Melcombe (talk) 15:43, 18 September 2009 (UTC)[reply]

Probability or likelihood?[edit]

The text ends by saying that using it is the same as finding the group of maximum probability. Isn't it the maximum likelihood?... And AFAIK that holds only if the distributions are the same, with radial symmetry, and also if the groups are assumed to occur with the same prior probability. -- NIC1138 (talk) 20:21, 29 October 2008 (UTC)[reply]

I had independently made the same observation, and came here to see if anyone else had noticed it. If no one objects, I'll soon correct the article by saying that this choice is a maximum likelihood criterion. —Preceding unsigned comment added by Fpoto (talkcontribs) 16:28, 3 November 2008 (UTC)[reply]

User:Aetheling removed the text. Aetheling, can you please explain why? --Pot (talk) 11:58, 27 May 2010 (UTC)[reply]

Sure! I removed the claim in question because it is false for several reasons. The claim read as follows: "Using the probabilistic interpretation given above, this is equivalent to selecting the class with the maximum likelihood, provided that all classes are equally likely."
  • The likelihood that a test point belongs to any given cluster depends very strongly on the probability density function of the points in the given cluster. Consider a two-cluster case, in which one cluster is bivariate Gaussian and the other is bivariate Cauchy. The procedure given for calculating the Mahalanobis distance from sample data will produce a meaningless number when applied to the Cauchy cluster. It may be numerically greater than or less than the distance to the Gaussian cluster, but that will have no bearing whatsoever on which cluster the test point belongs to, and it will have nothing to do with the true likelihood of its membership in the Cauchy cluster. That's because the Cauchy distribution has no finite moments, so every sample moment is completely meaningless, but it still has a meaningful likelihood function.
  • For another counterexample, consider a different two-cluster case. Cluster A has all of its points uniformly distributed around the circumference of a circle. Locate the test point anywhere except exactly on the circle. Thus the likelihood that it is a member of A is exactly zero. Suppose that Cluster B is a bivariate Gaussian located anywhere. The likelihood that the test point lies in Cluster B is positive, no matter where B is located, so by the likelihood criterion the test point is always closer to B. Nevertheless, we can easily locate the test point such that its Mahalanobis distance to A is less than to B (for example, locate the test point at the center of the circle).
  • Even if we introduce the assumption that every cluster is Gaussian with covariance matrix of full rank — a very severe assumption — we can still find pathological examples. Consider the two-cluster case in which the two clusters have very different covariance matrices but identical centroids. A test point also located at the centroid will have a Mahalanobis distance of zero to each cluster, yet different likelihoods of belonging to each cluster.
I could go on, constructing more counterexamples and pathological cases. Rather than trying to fix the statement — a very difficult task, given that restrictions have to be placed on the topology of the clusters — I elected simply to remove it. —Aetheling (talk) 04:53, 28 May 2010 (UTC)[reply]
I think that removing it is a pity, because I suspect that the concept is useful in practical cases. The typical example is having points whose distribution is in fact Gaussian, or unknown and assumed to be Gaussian. In practical cases the centroids are typically distinct and the covariances are full rank: many problems are modelled just like this. Isn't this a case where the maximum likelihood concept turns out to be practically useful? If yes, then the restrictions are satisfied in many significant cases. --Pot (talk) 09:46, 28 May 2010 (UTC)[reply]
The specific use of the Mahalanobis distance measure in clustering is a topic that is already discussed in the article on clustering. The likelihood interpretation of this form of clustering also belongs in the clustering article — perhaps you would like to write that contribution? Meanwhile, there should be something we can say here in this article about likelihoods that will not be so blatantly incorrect. I'm traveling today, but I might be able to look at this again in the next day or two. —Aetheling (talk) 21:59, 28 May 2010 (UTC)[reply]
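For what it's worth, the third counterexample above is easy to check numerically. A minimal sketch in Python (NumPy and SciPy assumed available; the two clusters and the test point are invented for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

mean = np.zeros(2)
cov_a = np.diag([1.0, 1.0])      # cluster A: tight, isotropic
cov_b = np.diag([100.0, 100.0])  # cluster B: same centroid, much wider

x = mean.copy()  # test point at the shared centroid

def mahalanobis(x, mean, cov):
    d = x - mean
    return np.sqrt(d @ np.linalg.inv(cov) @ d)

# Both Mahalanobis distances are zero ...
print(mahalanobis(x, mean, cov_a), mahalanobis(x, mean, cov_b))
# ... yet the Gaussian densities (hence likelihoods) at x differ by a factor of 100.
print(multivariate_normal(mean, cov_a).pdf(x), multivariate_normal(mean, cov_b).pdf(x))
```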

References[edit]

Adding Template:Refimprove banner. I recently added the main reference to Mahalanobis, 1936, but I just noticed it had been previously removed for some obscure reason (the article obviously exists, even if it is hard to find: you need to register with the journal). More generally, this article is insufficiently sourced. Intuitive explanation and Relationship to leverage should have at least 1 reference each. Each application should be referenced. Calimo (talk) 09:57, 7 December 2008 (UTC)[reply]

Requesting references for specific claims using Template:Fact. Challenged material may be removed in a few months. Calimo (talk) 10:08, 15 December 2008 (UTC)[reply]
Please don't. That would be rude. See discussion below. linas (talk) 03:48, 17 December 2008 (UTC)[reply]

oh please, mathworld is not a 'reference'[edit]

Why not? I had added this and it was removed: Hotelling T2 Distribution, The MathWorks. Retrieved on 2008-12-16. --Pot (talk) 14:10, 16 December 2008 (UTC)[reply]

This has been discussed a couple of times by members of WikiProject Mathematics (which includes me) and we came to the conclusion that there are many mistakes in MathWorld so it is not a very reliable source (it's also, as far as we can see, under the control of just one guy, Eric Weisstein). It's certainly not on the level of peer-reviewed journal articles. By the way, MathWorks is different from MathWorld. -- Jitse Niesen (talk) 02:41, 17 December 2008 (UTC)[reply]
Thank you Jitse! If the mathworld article isn't too thin, then I have nothing against using the mathworld template {{mathworld|title=whatever}} as a 'generic' reference (rather than being used to support a single claim). (In this case, the mathworld article seems pretty darned thin.) It's often the case that PlanetMath has good info; it has a template {{planetmath}} and also the WP:PMEX project. Last but not least, the Springer encyclopedia is interesting, although it usually deals with advanced topics presented at an advanced level. The template is {{springer}} .
Thanks to both of you for the explanations. --Pot (talk) 12:34, 17 December 2008 (UTC)[reply]
One minor concern about this article: I noticed that someone placed 'need-reference' tags on some obvious assertions, statements that should be clear as a bell if you spend even just a tiny amount of time thinking about them. This is a misuse of the concept of referencing: references should be for things that are *not* obvious, for things that are not easily found, researched, or verified. Statements like Mahalanobis distance is an example of a Bregman divergence should be forehead-slappingly obvious to anyone who actually looks at both articles (and thus not in need of a reference). Ditto for statements like Mahalanobis distance is used in data mining and cluster analysis (well, duhh). linas (talk) 03:47, 17 December 2008 (UTC)[reply]

Differing variances?[edit]

Suppose I have two uncorrelated 2D random points with zero mean but with different variances, x and y. Suppose

cov(x) = diag(1,10)

and

cov(y) = diag(10,1)

As I understand Mahalanobis distance, if I had x = [0.1, 1] it would have a Mahalanobis distance of 0.1414 from the origin. Likewise if y = [1, 0.1] it would have a Mahalanobis distance of 0.1414 from the origin (would that be a "Mahalanobis norm"?), but y = [0.1, 1] would have a Mahalanobis distance of 1.0 from the origin. What is the Mahalanobis distance between a given x and y? My feeling is that this intuitively measures the unlikelihood of the observed difference between x and y. As such, it should have something to do with the expected value of x − y. Any thoughts? In particular, I'm guessing that we can let d = x − y and use the rules for sums of normally distributed random variables to find that

cov(d) = cov(x) + cov(y)

and so the Mahalanobis distance between x and y would be

D(x, y) = sqrt(d^T cov(d)^(-1) d) = sqrt((x − y)^T (cov(x) + cov(y))^(-1) (x − y)).

Does that sound right? —Ben FrantzDale (talk) 18:02, 30 March 2009 (UTC)[reply]

Looking back at this, I think there's a missing factor of 1/2 under the radical since this should degenerate to the given definition for Mahalanobis distance in the case that the variances are equal. —Ben FrantzDale (talk) 18:11, 30 March 2009 (UTC)[reply]
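If it helps, here is a small sketch (Python/NumPy) of the formula floated in this thread. This is only the conjecture discussed above, treating the difference x − y as Gaussian with covariance cov(x) + cov(y); it is not a definition taken from the article or a reference, and the matrices here are invented for illustration.

```python
import numpy as np

def mahalanobis_between(x, y, cov_x, cov_y):
    """Distance between two independent Gaussian samples: the difference x - y
    has covariance cov_x + cov_y, so standardize the difference against that sum."""
    d = x - y
    return np.sqrt(d @ np.linalg.inv(cov_x + cov_y) @ d)

x = np.array([0.1, 1.0])
y = np.array([1.0, 0.1])
print(mahalanobis_between(x, y, np.diag([1.0, 10.0]), np.diag([10.0, 1.0])))
```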

Distribution[edit]

I'm not editing the page, because I don't know if it's _true_ or not. But this page http://www.aiaccess.net/English/Glossaries/GlosMod/e_gm_mahalanobis.htm claims that the distribution of the Mahalanobis distance squared is the chi-squared distribution with p degrees of freedom. Is this so? (I've experimentally tried it, and the values seem to work, but I know enough to know that that doesn't necessarily prove anything...) If it _is_ the same though, it might be good to add it to the article. 68.174.98.250 (talk) 18:02, 13 March 2010 (UTC)[reply]

I don't think the current statement about the distribution of the Mahalanobis distance is correct. The link above is broken so I cannot check it, but consider the example of two independent N-dimensional random vectors x and y with identical means and a common diagonal covariance matrix with entries σ_i^2. Then

d(x, y)^2 = (x − y)^T S^(-1) (x − y) = sum over i of ((x_i − y_i)/σ_i)^2.

This would have a chi-squared distribution with N degrees of freedom if the terms in brackets were standard normally distributed. But they are not! Both x_i and y_i have variance σ_i^2, so their difference has variance 2σ_i^2, which means the term in brackets has variance 2 and not 1. I think the correct statement is that d(x, μ)^2 is chi-squared-distributed with N degrees of freedom, since there we are subtracting the true mean and not another random vector. Based on the above reasoning I also think that d(x, y)^2/2 is chi-squared-distributed with N degrees of freedom. Unfortunately I don't have a reference for any of this. Any objections to changing the article accordingly? MWiebusch78 10:41, 12 June 2019 (UTC)[reply]
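A quick simulation supports this distinction. The following is only a sketch of such an experimental check (Python with NumPy/SciPy assumed; all names and the dimensionality are illustrative), assuming multivariate normal data with known mean and covariance:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_dim, n_samples = 3, 50_000
mean, cov = np.zeros(n_dim), np.eye(n_dim)

x = rng.multivariate_normal(mean, cov, size=n_samples)
y = rng.multivariate_normal(mean, cov, size=n_samples)

# Squared Mahalanobis distances, computed row-wise.
d2_to_mean = np.einsum('ij,ij->i', x - mean, (x - mean) @ np.linalg.inv(cov))
d2_between = np.einsum('ij,ij->i', x - y, (x - y) @ np.linalg.inv(cov))

# d(x, mu)^2 should follow chi-squared with n_dim degrees of freedom,
# while d(x, y)^2 must be halved first, since x - y has twice the variance.
print(stats.kstest(d2_to_mean, stats.chi2(df=n_dim).cdf).pvalue)
print(stats.kstest(d2_between / 2, stats.chi2(df=n_dim).cdf).pvalue)
# Non-small p-values here indicate the samples are consistent with chi-squared.
```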

Devs from mean[edit]

I find this article obtuse. Is it an overgeneralization to say that Mahalanobis distance is just the multidimensional generalization of "how many standard deviations is x from the mean?" —Ben FrantzDale (talk) 18:33, 29 May 2010 (UTC)[reply]


What you suggest sounds like "normalised" or "standardised" Euclidean distance and, unlike Mahalanobis distance, would not take into account covariance between dimensions. (That is my understanding at least) —Preceding unsigned comment added by 124.184.162.72 (talk) 23:32, 3 February 2011 (UTC)[reply]

I find the following part of the article misleading, hence this comment:
> If each of these axes is re-scaled to have unit variance, then the Mahalanobis distance corresponds to standard Euclidean distance in the transformed space.
As far as I understand, it is not just re-scaling to unit variance that turns the Mahalanobis distance into the Euclidean distance. The dimensions should be independent too, i.e. whitening is needed. Tarekdyen (talk) 14:01, 26 October 2022 (UTC)[reply]
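A small sketch of that point (Python/NumPy; the covariance matrix and point are invented for illustration): per-axis rescaling alone generally does not reproduce the Mahalanobis distance when the dimensions are correlated, whereas a full whitening transform, here via a Cholesky factor, does.

```python
import numpy as np

rng = np.random.default_rng(2)
cov = np.array([[2.0, 1.5],
                [1.5, 2.0]])          # correlated dimensions
x, mu = rng.normal(size=2), np.zeros(2)

# Mahalanobis distance in the original space.
d_m = np.sqrt((x - mu) @ np.linalg.inv(cov) @ (x - mu))

# Rescaling each axis to unit variance alone generally does NOT reproduce it ...
scaled = (x - mu) / np.sqrt(np.diag(cov))
print(d_m, np.linalg.norm(scaled))

# ... but a full whitening transform (here via Cholesky) does.
L = np.linalg.cholesky(cov)           # cov = L @ L.T
whitened = np.linalg.solve(L, x - mu) # L^{-1}(x - mu)
print(d_m, np.linalg.norm(whitened))
```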

Original reference[edit]

The current link to the original reference, Mahalanobis (1936), is not working

I got the paper (for free) from http://www.insa.ac.in/insa_pdf/20005b8c_49.pdf but this did require registration (and I am not sure the link will work without registration, or how to test that - it does at least work for me in a different browser than the one I used to register). —Preceding unsigned comment added by 124.184.162.72 (talk) 23:30, 3 February 2011 (UTC)[reply]

Thanks. That link works for me too and I haven't registered, so I'll add it to the article. Qwfp (talk) 09:59, 4 February 2011 (UTC)[reply]

Non-invertible covariance matrix[edit]

How do you calculate Mahalanobis distance when the covariance matrix has determinant=0 (can't be inverted)? (talk) 0:31, 27 February 2012 (UTC)

I'd imagine there are two answers: 1) It's undefined. 2) It's infinite if the point in question lies any distance along a direction with a zero singular value; otherwise use the pseudoinverse and you have something that's well-defined. —Ben FrantzDale (talk) 02:29, 28 February 2012 (UTC)[reply]

The inversion problem seems to be an inherent issue with covariance matrices. Maybe someone with some expertise in the field could explain the challenges and introduce the pseudoinverse as the solution? — Preceding unsigned comment added by 89.182.26.147 (talk) 20:25, 15 January 2014 (UTC)[reply]
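A minimal sketch of the pseudoinverse suggestion above (Python/NumPy; the function name, tolerance and example matrices are invented for illustration), combining both answers: finite via the pseudoinverse when the offset stays in the column space of the covariance, infinite when it has a component in the null space.

```python
import numpy as np

def mahalanobis_pinv(x, mean, cov, tol=1e-10):
    """Mahalanobis distance with a possibly singular covariance: use the
    Moore-Penrose pseudoinverse, and report infinity if the offset has a
    component in the null space of cov."""
    d = x - mean
    # Component of d outside the column space of cov => infinite distance.
    proj = cov @ np.linalg.pinv(cov) @ d
    if np.linalg.norm(d - proj) > tol:
        return np.inf
    return np.sqrt(d @ np.linalg.pinv(cov) @ d)

cov = np.array([[1.0, 1.0],
                [1.0, 1.0]])          # rank 1, determinant 0
print(mahalanobis_pinv(np.array([1.0, 1.0]), np.zeros(2), cov))   # finite
print(mahalanobis_pinv(np.array([1.0, -1.0]), np.zeros(2), cov))  # infinite
```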

Inner product[edit]

I'm not sure I agree with the characterization of Euclidean distance. Mahalanobis distance is clearly just an inner product defined by the nonnegative definite matrix S^-1. Anyone know why a concept that had been around in mathematics since about the time of Cauchy or before got named for a statistician in the 1930s?

briardew (talk) 17:29, 16 August 2012 (UTC)[reply]

Numerical example[edit]

Maybe a numerical example should be added to the article? Carstensen (talk) 17:48, 10 November 2012 (UTC)[reply]
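In case it is useful for whoever adds one, here is a minimal worked example as a sketch (Python/NumPy; the mean, covariance and point are invented purely for illustration):

```python
import numpy as np

mean = np.array([0.0, 0.0])
cov = np.array([[4.0, 0.0],
                [0.0, 1.0]])
point = np.array([2.0, 2.0])

diff = point - mean
d = np.sqrt(diff @ np.linalg.inv(cov) @ diff)
print(d)  # sqrt(2^2/4 + 2^2/1) = sqrt(5) ≈ 2.236
```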


Non-standard name[edit]

This name "Mahalanobis distance" is completely non-standard. It is a simple concept, showing up frequently in analyses of a dependent variable with correlated errors, and has been around for long before this person. Why should a very old concept be named after somebody from the 20th century? The article is of poor quality and makes statements referring to this as "Mahalanobis' discovery," when this is patently false.

I have a similar gripe, expressed two comments above. I think the answer to both of our comments requires a little more research than I'm willing to do. Indeed, the term "Mahalanobis distance" is used in the statistical literature, so there ought to be some context in which M. added something new. I'd love for one of the maintainers of this article to add some clarity on that point. Or we could just delete the article altogether. Kidding ... well, not really. briardew (talk) 20:48, 10 May 2013 (UTC)[reply]
Actually, I think I've solved this puzzle. The Mahalanobis distance is a metric defined on collections/ensembles of samples. This article describes plain Euclidean distance, which clearly Mahalanobis didn't discover because Euclid is like 2000 years older than him. This is clarified in a great review by McLachlan (http://www.ias.ac.in/resonance/June1999/pdf/June1999p20-26). Unless someone responds to this in the near future (couple months), I'm going to almost completely rewrite this article, so it's actually accurate. briardew (talk) 21:09, 10 May 2013 (UTC)[reply]

"Discussion" section[edit]

This contains the following: "we will get an equation for a metric that looks a lot like the Mahalanobis distance". What does this mean? "Looks a lot like" is vague and completely unhelpful - either it is the same expression or it isn't. If it isn't, the difference should be explicated. Perhaps an expert could rewrite this section clearly - otherwise I think it would be better deleted, but I thought I'd comment here before doing that.

Also perhaps the section should have a more specific heading than "Discussion" - it's far from being a general discussion. — Preceding unsigned comment added by 213.162.107.11 (talk) 11:57, 2 June 2014 (UTC)[reply]

Pseudocode or real data example[edit]

It is extremely necessary. Currently I am unable to understand it BurstPower (talk) 13:08, 18 November 2015 (UTC)[reply]

Definition and Properties[edit]

This section concludes with an opaque statement that "if the data has a nontrivial nullspace, Mahalanobis distance can be computed after projecting the data (non-degenerately) down onto any space of the appropriate dimension for the data." It would really help if someone who understands this statement could rephrase it more precisely while linking to pages that define any necessary jargon. An equation would be ideal. Carroll.ian (talk) 04:10, 20 April 2017 (UTC)[reply]
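One possible concrete reading of that statement, sketched in Python/NumPy (the data, the SVD-based choice of projection and the tolerance are all illustrative assumptions, not taken from the article): when the data lie in a lower-dimensional subspace, project onto a basis of that subspace, where the covariance becomes invertible, and compute the distance there. If the statement is right, any non-degenerate projection of the appropriate dimension should give the same value.

```python
import numpy as np

rng = np.random.default_rng(3)
# 3-D data that actually lies in a 2-D subspace => singular covariance.
basis = rng.normal(size=(3, 2))
data = rng.normal(size=(500, 2)) @ basis.T
mean = data.mean(axis=0)
cov = np.cov(data, rowvar=False)

# Choose a basis of the data's subspace (here: the range of the covariance, via SVD).
_, s, vt = np.linalg.svd(cov)
rank = int((s > 1e-10 * s[0]).sum())
proj = vt[:rank]                       # rows span the data's subspace

low = (data - mean) @ proj.T           # projected, centered data
cov_low = np.cov(low, rowvar=False)    # now invertible

x = data[0]
x_low = (x - mean) @ proj.T
print(np.sqrt(x_low @ np.linalg.inv(cov_low) @ x_low))
```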

Contours of credibility[edit]

Often you want to know, what is the probability for a sample to have a Mahalanobis distance greater than R? I think this should be discussed in the article, including how it maps to sigma values of a univariate distribution, and with reference to confidence regions. Cesiumfrog (talk) 01:35, 10 November 2017 (UTC)[reply]
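For multivariate normal data this follows from the chi-squared result discussed above: the squared distance to the mean is chi-squared with as many degrees of freedom as there are dimensions. A minimal sketch (Python/SciPy; the function name and the example values of R are illustrative only):

```python
from scipy.stats import chi2, norm

def prob_distance_exceeds(r, n_dim):
    """For multivariate normal data, P(Mahalanobis distance from the mean > r):
    the squared distance is chi-squared with n_dim degrees of freedom."""
    return chi2.sf(r**2, df=n_dim)

# In 1-D this reduces to the familiar two-sided sigma levels.
print(prob_distance_exceeds(2.0, 1), 2 * norm.sf(2.0))   # both ~0.0455
print(prob_distance_exceeds(3.0, 2))                      # ~0.0111 in 2-D
```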

Diagrams available[edit]

A search at commons found the following diagrams:

However, since their descriptions (click on images to see them) are in Polish only, some expert is needed to select the most appropriate one and devise a good caption. - Jochen Burghardt (talk) 19:05, 28 March 2019 (UTC)[reply]