Talk:Anscombe's quartet

WikiProject Mathematics (Mid-priority): This article is within the scope of WikiProject Mathematics, a collaborative effort to improve the coverage of mathematics on Wikipedia. It has been rated as Mid-priority on the project's priority scale.

WikiProject Statistics (High-importance): This article is within the scope of WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. It has been rated as High-importance on the importance scale.

Notes

After putting this together, I saw that Anscombe's Quartet is listed under Wikipedia:Requested articles/Mathematics, under Logic / Set Theory, which surprised me. Is there another Anscombe's Quartet out there? Chris24359 20:56, 12 March 2007 (UTC)

There was a mention of it at Correlation#Correlation_and_linearity; it now links back to this page. Schutz 08:30, 13 August 2007 (UTC)

Variances...

The variances of the x and y variables had been miscalculated. Whoever did the first calculation seems to have summed all (x-mean_x)^2, and then divided by 10, instead of N=11. This error seems to have been repeated elsewhere on the internet, but there are webpages (like this one [1]) which give the correct standard deviation/variance.

Could people "refix"ing the values on the page leave a comment here explaining their calculation? Erkcan (talk) 07:10, 18 April 2008 (UTC)

Dividing by N = 11, the x-variances for the 4 datasets are 10, 10, 10, 10. The y-variances are 3.75206280991736, 3.75239008264463, 3.74783636363636, 3.74840826446281. Erkcan (talk) 07:25, 18 April 2008 (UTC)
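
For anyone who wants to check both sets of figures discussed in this thread, here is a minimal Python sketch. It assumes the standard published values of Anscombe's data, typed in by hand here rather than taken from the article, and uses the standard library's `pvariance` (denominator N) and `variance` (denominator N-1). It is only meant as a way to reproduce the numbers, not as a statement about which denominator the article should use.

```python
from statistics import pvariance, variance

# Anscombe's data as commonly published (assumption: typed in by hand).
# Sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

quartet = [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]

for i, (x, y) in enumerate(quartet, start=1):
    # pvariance divides by N (11); variance divides by N-1 (10).
    print(
        f"set {i}: "
        f"x: N -> {pvariance(x):.4f}, N-1 -> {variance(x):.4f}; "
        f"y: N -> {pvariance(y):.4f}, N-1 -> {variance(y):.4f}"
    )
```

With N in the denominator the x-variance comes out as 10 and the y-variances match the figures quoted above; with N-1 the x-variance is 11.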

There is some confusion here between population and sample variance: in the former case the denominator is n (11), in the latter case n-1 (10). Which one is correct depends on whether x and y are the population or a sample. But it doesn't really matter; what is more important is that the variance and mean are the same (however calculated) for each data set. It is incorrect to refer to the mean and variance of "each x" or "each y"; "mean of x" would be better. Also, the lines in the graphs don't intercept the y-axis at 3; I presume the origin is not zero, which is a bit confusing. I would also ask that the other statistics from the original paper be added; this seems to be in hand from the page source. Jmgibbons (talk) 13:57, 2 September 2009 (UTC)

I have rephrased the table to avoid the "each x" usage. Melcombe (talk) 17:24, 12 November 2009 (UTC)

Whoops, I made an anonymous edit refixing those values before I read this page. I actually ran across this as I was writing a minor report on the quartet, and the page threw me off for a while, thinking I was somehow calculating the variance wrongly. So yes, I assure you that correct statistics matter, and I would kind of like to know why people calculated them with n instead of n-1. Also, the image was generated with the n-1 variances, so there's another reason to keep them as such. 81.57.247.167 (talk) 08:09, 12 November 2009 (UTC)

Delete part of a sentence

Finally, the fourth example (bottom right) shows another example when one outlier is enough to produce a high correlation coefficient, even though the relationship between the two variables is not linear.

replaced with

Finally, the fourth example (bottom right) shows another example when one outlier is enough to produce a high correlation coefficient.

REASON: The relation between x and E[Y|x] in this "made-up population" may or may not be linear. With the given "design" of x values there is no basis for testing lack of linear fit. (There are degrees of freedom for pure error only, but NONE for lack of fit when there are only 2 distinct x values.) I think that going into these matters is beyond the scope of the page, so my proposal is just a deletion.

129.1.23.19 (talk) 20:42, 30 September 2011 (UTC)
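
The degrees-of-freedom argument above can be sketched in a few lines of Python. This assumes the usual split for a straight-line fit with replicated x values: pure-error degrees of freedom are n minus the number of distinct x values, and lack-of-fit degrees of freedom are the number of distinct x values minus the number of fitted parameters (2). The function name `split_residual_df` is made up for this illustration, and the x values are the commonly published ones for the fourth dataset.

```python
# x values of the fourth dataset as commonly published (assumption: typed by hand).
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
# x values shared by the first three datasets, for comparison.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]


def split_residual_df(x, n_params=2):
    """Split the residual df of a straight-line fit into (pure error, lack of fit).

    Hypothetical helper for illustration only.
    """
    n = len(x)
    levels = len(set(x))             # number of distinct x values
    pure_error = n - levels          # replicates beyond one observation per level
    lack_of_fit = levels - n_params  # levels left over after fitting the line
    return pure_error, lack_of_fit


print("dataset 4:", split_residual_df(x4))
print("datasets 1-3:", split_residual_df(x123))
```

For the fourth dataset this gives 9 degrees of freedom for pure error and 0 for lack of fit, which is the point made in the comment: two distinct x values always admit a straight line through their two group means.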

What the heck is "d.p." in the first table?

Can anyone substitute in the longer statistical terminology? — Preceding unsigned comment added by 18.111.93.217 (talk) 14:36, 15 October 2011 (UTC)

It means "decimal places". --Rumping (talk) 00:27, 17 November 2011 (UTC)

File:Anscombe's quartet 3.svg to appear as POTD soon

Hello! This is a note to let the editors of this article know that File:Anscombe's quartet 3.svg will be appearing as picture of the day on December 11, 2011. You can view and edit the POTD blurb at Template:POTD/2011-12-11. If this article needs any attention or maintenance, it would be preferable if that could be done before its appearance on the Main Page so Wikipedia doesn't look bad. :) Thanks! howcheng {chat} 18:37, 9 December 2011 (UTC)

Picture of the day

Anscombe's quartet is a group of four data sets that have identical simple statistical properties, yet appear very different when graphed. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analysing it and the effect of outliers on statistical properties. Image: Schutz

Certainly an interesting picture, but shouldn't the discrepancies cited on this discussion page be resolved first? --173.69.135.105 (talk) 03:16, 14 December 2011 (UTC)

Regression lines

The regression lines shown (and mentioned in the body of the article) are least squares regression lines. As other forms of regression calculations can be carried out giving different results, I'm going to insert "least squares" in the first mention of the regression lines. An L1 regression, for example, minimizes the sum of the absolute values of the residuals, and has the property that the outlier in the third dataset will be effectively ignored.

Addendum: after inserting the phrase "least squares" and reviewing the article before saving it, I came to the conclusion that the wording had become overly convoluted. I'm not going to save the edited version of the article, but I am still of the opinion that it needs to make clear in some way that the regression lines shown are least squares lines. Floozybackloves (talk) 04:08, 11 December 2011 (UTC)
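
To illustrate the difference described above, here is a rough Python sketch comparing a least squares fit with an L1 (least absolute deviations) fit on the third dataset. The data are the commonly published values, typed in by hand. The L1 fit uses a brute-force search over lines through pairs of data points, which relies on the fact that an optimal L1 line can always be found passing through at least two observations; this is fine for 11 points but is not how one would do it for large data. The function names are made up for this example.

```python
from itertools import combinations

# Third dataset as commonly published (assumption: typed in by hand).
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def least_squares_line(x, y):
    """Ordinary least squares fit; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def least_absolute_line(x, y):
    """Brute-force L1 fit over all lines through two data points.

    Returns (intercept, slope) minimizing the sum of absolute residuals.
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # vertical line, not a function of x
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        cost = sum(abs(b - (intercept + slope * a)) for a, b in zip(x, y))
        if best is None or cost < best[0]:
            best = (cost, intercept, slope)
    return best[1], best[2]


ols_intercept, ols_slope = least_squares_line(x3, y3)
lad_intercept, lad_slope = least_absolute_line(x3, y3)
print(f"least squares: y = {ols_intercept:.3f} + {ols_slope:.3f} x")
print(f"L1:            y = {lad_intercept:.3f} + {lad_slope:.3f} x")
```

The least squares line is pulled towards the outlier (slope close to 0.5, intercept close to 3), while the L1 line follows the ten nearly collinear points, with a noticeably smaller slope and larger intercept.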

I think that the variance values are wrong

I was doing homework on means and variances and tried one of the datasets from Anscombe's quartet. The variances I calculated were different from the ones given on Wikipedia. At first I thought I had made a mistake and looked for it. Everything seemed correct, so I started searching on the internet. Previous versions of the Wikipedia page had these numbers:

variance of x = 10

variance of y = 3.75

These numbers are the same as the results I found. Can someone check it? — Preceding unsigned comment added by 193.140.194.64 (talk) 18:57, 9 October 2012 (UTC)

The issue is N = 11 versus N-1 = 10 in the denominator in the computation of variance, where N is the number of observations. If the data are a sample from a population and the mean is estimated from the sample, then using N-1 in the denominator has the desirable property that E(sample variance) = population variance. The intuition is that estimating the mean from the sample "uses up" one degree of freedom, so that dividing by N (instead of N-1) would understate the spread in the data. The Wikipedia page on Bessel's correction has a very clear discussion.

The sample variances computed with N-1 in the denominator are:

sample variance of X = 11

sample variance of Y ≈ 4.12

Michaelaoash (talk) 20:19, 8 August 2013 (UTC)
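
The property E(sample variance) = population variance can be checked exactly on a toy example. The sketch below is an illustration only: the six-value population is made up (it is not Anscombe's data), and the samples are all ordered draws of size 3 with replacement, so that each draw is independent. Exact fractions are used so that the comparison does not depend on rounding.

```python
from fractions import Fraction
from itertools import product
from statistics import mean, pvariance, variance

# Hypothetical toy population, chosen only for illustration.
population = [Fraction(v) for v in (2, 4, 4, 5, 7, 9)]
sample_size = 3

# Every ordered sample of the given size, drawn with replacement.
samples = list(product(population, repeat=sample_size))

# Average of the two variance estimates over all possible samples.
avg_with_n_minus_1 = mean(variance(s) for s in samples)
avg_with_n = mean(pvariance(s) for s in samples)

true_variance = pvariance(population)
print("population variance:      ", true_variance)
print("average using N-1:        ", avg_with_n_minus_1)
print("average using N:          ", avg_with_n)
```

Averaged over all samples, the N-1 version equals the population variance exactly, while the N version comes out smaller by the factor (N-1)/N, here 2/3.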

External Links

I wonder if it is appropriate to link to "worksheets" that can be used to explore the topic, especially if: 1) those worksheets are blank, and 2) the link to them does not lead to a space where they can be used interactively, but instead they must be downloaded and run by some other means.

The example I have in mind is: http://nbviewer.ipython.org/github/psychemedia/ou-tm351/blob/master/notebooks-RFC/Anscombe's%20Quartet%20%5Bopen%5D.ipynb

Is there any policy on linking to things: a) like IPython notebooks; b) non-interactive previews of them; c) non-interactive previews of them in their "unexecuted" form, as compared to an "executed" version of the same notebook where example output from the execution of each cell is displayed? — Preceding unsigned comment added by 81.152.226.164 (talk) 17:24, 30 June 2014 (UTC)

I don't think there's a policy which proscribes linking to IPython notebooks per se, but we do tend not to link to "community edited resources" where there's an alternative (see WP:ELNO). I'm generally happy linking to tools for literate programming if the content of the link is something which meets that first ELNO restriction, i.e. it wouldn't be duplicative of the content in a fully fleshed out article. As far as executed versus not, it seems reasonable to link to both, no? E.g.

Anscombe's quartet explored in python: linky linky, with example output.

Does that make sense? Protonk (talk) 18:43, 30 June 2014 (UTC)

In general, linking to both the raw/empty and output-rendered notebooks seems like a sensible approach. I guess the fact that a source document is mutable, e.g. by virtue of being on GitHub, is both a strength (errors are likely to be corrected) and a weakness (the resource as linked to can change). I guess any "accessed on DATE" reference could, in such circumstances, link to an actual commit? Though in the case of nbviewer, I don't know if a particular check-in can be previewed. — Preceding unsigned comment added by 81.152.226.164 (talk) 21:43, 30 June 2014 (UTC)

- Linking to a specific commit (if possible/reasonable) would be great. If not, could you link to a tag (or just make a specific branch for WP)? Forgive me for not knowing too much about nbviewer. Protonk (talk) 23:06, 30 June 2014 (UTC)

4th dataset

We say:

the fourth graph (bottom right) shows an example when one outlier is enough to produce a high correlation coefficient, even though the relationship between the two variables is not linear

Well, I think this misses the point: it appears there is no relationship, linear or otherwise! I'd like to say something like the following instead:

the fourth graph (bottom right) shows an example when one outlier is enough to produce a high correlation coefficient, even though the two variables otherwise are independent/unrelated/uncorrelated

But I haven't checked all the sources, so I won't be bold and change it right now.

By the way, I think one can quite easily do a little better today than Anscombe did, using e.g. Excel Goal Seek. For instance, based on Anscombe's quartet, I've created a sextet where I also include an uncorrelated scatterplot (x and y are independently normally distributed) plus one outlier, as well as an exponential function. But of course this article is and should be only about Anscombe's original dataset. --Nø (talk) 08:20, 31 January 2017 (UTC)

I agree with your point, and the proposed change looks good. Smyth (talk) 18:09, 31 January 2017 (UTC)

Done - I chose unrelated as last word.--Nø (talk) 16:13, 2 February 2017 (UTC)

@Nø: "Unrelated" is incorrect in this case. There clearly is a relation between the two variables: that's what the graph illustrates. It's just not entirely a straight-line relation. Obviously the two variables are not statistically independent either, nor are they uncorrelated. MartinPoulter (talk) 17:52, 3 February 2017 (UTC)

The implication of the graph seems to be that the true value of X is constant, and the outlier is a measurement error. In which case there is no real-world relationship whatsoever between X and Y, even though mathematically there is a correlation. Smyth (talk) 22:42, 3 February 2017 (UTC)

I agree with Smyth and disagree with MartinPoulter. I'll not revert Poulter's revert, but I think someone should, or if possible find a better wording.--Nø (talk) 10:29, 4 February 2017 (UTC)

@Nø: @MartinPoulter: Having thought about it more, there are other possible interpretations of the graph. For example, the underlying model may be some sort of step function. So how about ... one outlier is enough to produce a high correlation coefficient, even though the other data points do not indicate any relationship between the variables? Smyth (talk) 13:24, 5 February 2017 (UTC)

@Smyth: @MartinPoulter: I support that solution.--Nø (talk) 14:01, 5 February 2017 (UTC)

@Smyth: @Nø: That's an improvement. I think we should be very wary that the statements should be about the data, not about subjective personal interpretations of the data, so I'm glad to see the comments go in that direction. MartinPoulter (talk) 17:18, 6 February 2017 (UTC)
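
As a statement purely about the data, the wording agreed above can be checked with a short Python sketch. It assumes the commonly published values of the fourth dataset (typed in by hand) and treats the single point with x = 19 as the outlier. The helper `pearson_r` is written for this example and returns `None` when either variable has zero spread, since the correlation coefficient is undefined in that case.

```python
from math import sqrt

# Fourth dataset as commonly published (assumption: typed in by hand).
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]


def pearson_r(x, y):
    """Pearson correlation coefficient, or None if it is undefined."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # no spread in one variable: correlation undefined
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)


# Correlation using all eleven points.
r_all = pearson_r(x4, y4)

# Correlation after dropping the single point with x = 19.
kept = [(a, b) for a, b in zip(x4, y4) if a != 19]
r_trimmed = pearson_r([a for a, _ in kept], [b for _, b in kept])

print("with outlier:   ", r_all)
print("without outlier:", r_trimmed)
```

With all points the coefficient is roughly 0.82; without the outlier every remaining x equals 8, so the coefficient cannot be computed at all, which is consistent with saying that the other data points do not indicate any relationship.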