Talk:Statistics/Archive 1


Miscellaneous

I was taught statistics starting with the definition "a statistic is a function of data" as the first sentence of the Part 1B Stats course at Cambridge. I think the definition was useful and so it should be included. BozMo (talk) Done

On special:Statistics, what are 'junk pages'? They seem to equal total pages minus (non-talk comma pages + talk pages). How many of these are #REDIRECTs? --Damian Yerrick

Why is the Main Page article counter different than the one in Statistics? --Chuck Smith

It's been some number of years since I studied statistics, but the terms used throughout the article did ring some bells very quietly in the back of my mind. The singular exception was ANOVA, so I followed the link to seek an explanation: Analysis of variance. That was familiar! I was then surprised by the number of hits that Google gave me about ANOVA (197,000). Still, I believe that the full expression is far more meaningful than the acronym, and I don't think that we should be encouraging the use of cute but meaningless acronyms. Eclecticology, Thursday, May 2, 2002

The three topics of statistics -- experimental design, description/exploration and inference -- are excellently described. The ongoing discussion between data miners and modellers (eg. Statistical Modeling: The Two Cultures, Leo Breiman and discussants, Statistical Science 2001;16:199-231) might deserve some more attention. Johannes Hüsing

I wonder if we can improve on the phrase "uncertain observations"? It's not the observations that are uncertain; it's what they entail about the population from which they came, the uncertainty resulting from the random way in which the observations came from the population. Michael Hardy 20:00 17 Jul 2003 (UTC)

Well, unless you're talking about measurement error, in which case the observations are uncertain. Anyway, I agree that the article needs a major rewrite. Oh, I guess that's not what you said... - dcljr 00:15, 9 Aug 2004 (UTC)
Even with measurement error, it's not the observations that are uncertain. You know what number your measuring instrument gave you; what you're uncertain about is what it should have given you. Michael Hardy 01:09, 9 Aug 2004 (UTC)
Hmm. A subtle distinction, indeed. But whatever. As a statistician yourself, surely you can provide us with a better introductory paragraph than the current version.... (See also item "What is statistics?" below.) - dcljr 05:46, 10 Aug 2004 (UTC)

Suggest update to US National Statistical Services to FedStats

Under "National Statistical Services", it appears that for a particular country, that country's main national statistics site is listed, except for the United States. For the US, the American Statistical Association is listed, which is primarily a professional association for statisticians. I would suggest that the FedStats web site, http://www.fedstats.gov, be listed as the web link for the US. The FedStats web site is the US government's gateway portal to its underlying Federal statistical system, with links to more than 100 agencies with statistical information.

Puzzled by definition

Why is human knowledge part of the definition -- is it really necessary? CSTAR 03:26, 10 May 2004 (UTC)

I wouldn't call it a science either. — Miguel 06:28, 2004 May 10 (UTC)

Why not? cf Nelder JA (1999). From statistics to statistical science. The Statistician 48(2), 257-269. Johannes

What is statistics?

I don't like the introductory paragraph. I haven't come up with anything better, but here's a "definition of statistics" I used when I taught the subject to undergraduates:

[Statistics] is a logic and methodology for the measurement of uncertainty and for an examination of the consequences of that uncertainty in the planning and interpretation of experimentation or observation.
— Stephen M. Stigler, The History of Statistics (Belknap/Harvard, 1986)

Of course, I followed it with a lot of explanation...

I propose interested parties list their own preferred definition of statistics (serious ones, I mean) here and maybe we can come up with a consensus on the best one. (And then monkeys... well, nevermind.)

- dcljr 05:46, 10 Aug 2004 (UTC)

For me, statistics is a methodology for the collection, interpretation and presentation of information - I don't feel strongly about the words "methodology" or "information", but I don't like "uncertainty" in the primary definition. You can have statistics on the numbers of Olympic Gold Medal winners so far; they may be right or wrong, but I have yet to see anyone put error bands on them. To me "uncertainty" is part of the collection, interpretation and presentation in many cases, but not always a necessary part. --Henrygb 23:39, 12 Aug 2004 (UTC)
Your discomfort with the word uncertainty seems to stem from the difference between descriptive statistics (your definition) and inferential statistics ("mine"). (continued below)
Hmm. Or not. I just looked at your contributions, Henrygb. Anyway, I still say to do (or describe) meaningful statistics you have to have the idea of uncertainty or randomness in there somewhere. - dcljr 23:07, 31 Aug 2004 (UTC)
In descriptive stats, you usually just take the data as given; whether it's the whole population or just a sample, you can summarize it graphically and numerically in much the same ways. My background is mathematical statistics, so I usually don't even think of the descriptive side when I think statistics. It's my own bias. Anyway, we should try to address both aspects. - dcljr 22:55, 31 Aug 2004 (UTC)

I came to statistics through management science, the applied branch of operations research, and econometrics, an applied branch of mathematical statistics, with a big dose of John Tukey's pragmatism. I wound up with a perspective that some find unusual. For one thing, management science gave me a decision theoretical outlook. Part of that is reserving the word "uncertain" for situations that lack probability distributions. Data are raw materials; there's no information until you interpret descriptive or inferential statistics. I'm not sure what level to shoot for here, but here goes. I've done things like this with more examples and less technical stuff, but that takes more time or space, and I wanted to be brief.

Before you get to description, you have to know about the population the data represent, if any (most online polls, for example, represent no one except those who happened to participate). That includes some sampling theory. Then there's data entry and preparation, including quality checks, etc.

Assuming the data are numeric rather than categoric (counts of people belonging to various political parties, for example), the biggest challenge in description is to get people to pay attention to more than the median or mean. Box plots (aka box-and-whisker diagrams or plots) are critical for understanding data whose center is taken to be the median. The standard deviation is critical if you're assuming the normal distribution (I like to call it Gaussian but that's a small point) and using the mean, etc. Otherwise, you're trapped into the talking head focus on a single number that conveys very little useful information.
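A toy illustration of the point above (the data values here are made up): a single center number hides the spread and skew that a box plot or a fuller numeric summary would reveal.

```python
import statistics

data = [2, 3, 3, 4, 5, 6, 7, 9, 14, 41]
print(statistics.median(data))   # the center a box plot is built around
print(statistics.mean(data))     # pulled well above the median by the outlying 41
print(statistics.stdev(data))    # most meaningful if a Gaussian model is assumed
```

Reporting only the mean here (9.4) would mislead: half the values are 5.5 or below.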

Once I get past description, statistics is about figuring out how much risk you are willing to take. Sometimes that's a guesstimate (choosing between pizza places in a town you've never visited before), sometimes it's as precise as you can make it (choosing the person who will perform open heart surgery on a loved one or yourself). In formal inference, that value is alpha and the decision about whether to reject the applicable null hypothesis comes down to whether the estimated risk that rejecting the null is a Type-I error (the p-value) is larger or smaller than the risk you are willing to take. If p>alpha, there is too much risk of a Type-I error to reject the null given your ex-ante choice of alpha. If alpha>=p, the risk of a Type I error is small enough (according to your ex-ante choice) to reject the null.
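The decision rule in the paragraph above can be sketched in a few lines; the p-value and alpha below are hypothetical numbers, with alpha standing for the ex-ante risk one is willing to take.

```python
def reject_null(p_value, alpha):
    """Reject the null hypothesis when the estimated Type-I error risk
    (the p-value) does not exceed the risk we are willing to take (alpha)."""
    return p_value <= alpha

print(reject_null(0.03, 0.05))  # True: risk small enough to reject the null
print(reject_null(0.08, 0.05))  # False: too much Type-I error risk to reject
```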

A single paragraph along those lines might be something like:

"Statistics is the art and science of seeking to understand a population and predict its future by collecting and using data that represent the population. Data collection includes sampling, data entry, and checking. Using data in statistics has two parts. Descriptive statistics includes estimates of most likely data values, their variation, and graphs. Inferential statistics looks for associations and causal relationships between variables that help to explain observed and predict future values."

That doesn't say anything about data mining, an approach that was taboo in my econometric youth. I haven't kept up with the subject, though, so I'm in no position to say anything about it here. If it's an outgrowth of resampling theory, for example, I'd be sympathetic even though that probably puts me outside mainstream econometrics, but I don't know enough to comment one way or another. --George Brower


Ah, now this paragraph (George's above) is, I think, mainly coming from a practical perspective of statistics as a set of procedures and "best practices" (i.e., what I would call applied statistics). (No offense, oversimplifying your viewpoint like that...) I come at statistics from a more theoretical standpoint (much to the chagrin of my students), emphasizing why those practices work and (ultimately, like in grad school) how to assess their efficacy and develop new and better ones. But my perspective is probably more suited to the mathematical statistics article (part of the reason I created it in the first place — in time I hope it will grow into something "useful").

I accept that this article should remain almost entirely "applied". At the very least we should allude to the following in the first paragraph:

  • data collection (sampling, etc.)
  • data summary (descriptive stats)
  • data interpretation (inference, relationship)

A more detailed outline, which might be the basis of constructing the opening paragraphs (i.e., preferably above the table of contents):

  • basics
    • population
    • sample
    • randomness (uncertainty) and probability (frequentist/subjectivist viewpoints should probably be alluded to but not explained in any detail)
  • focus
  • data collection
    • sampling
    • experimental design
  • data summary: descriptive statistics
    • graphical
    • numerical
  • data interpretation: inferential statistics
    • estimation
    • prediction
    • hypothesis testing
  • relationships and modeling
    • correlation
    • regression/ANOVA
    • time series
    • data mining? (I don't know much about it either!)

Obviously, and not surprisingly given my previous admissions, this reads like a course syllabus. But it does stress what you can actually do with statistics. If we could somehow pack all that information (if only obliquely, and certainly not necessarily in that order) into the opening paragraphs without hopelessly confusing everyone, that would be great!

Subsequent sections can flesh out what it all means and point to "main articles" about each topic for more detail. (Still, obviously I'm envisioning a much lengthier article!)

I think we should also mention above the table of contents the use of "statistics" or "stats" as a synonym for "data" and why that's not quite right.

These are my thoughts at the moment, anyway...

- dcljr 22:55, 31 Aug 2004 (UTC)

My attempt at article lead section

I just discovered the term lead section for what I've been variously calling preamble, intro[duction], introductory paragraphs, and stuff above the table of contents. <g>

Anyway, I'm sure some people thought it would be impossible to include all that stuff (see my previous comment) in the lead, but here's my attempt. I got almost everything in there.

Statistics is a broad mathematical discipline which studies ways to collect, summarize and draw conclusions from sample data. It is applicable to a wide variety of academic disciplines from the physical and social sciences to the humanities, as well as to business, government and industry.
Once data is collected, either through a formal sampling procedure or by recording responses to treatments in an experimental setting (cf experimental design), or by repeatedly observing a process over time (time series), graphical and numerical summaries may be obtained using descriptive statistics.
Randomness and uncertainty in the observations is modeled using probability in order ultimately to draw inferences about the larger population. These inferences may take the form of answers to essentially yes/no questions (hypothesis testing), estimates of numerical characteristics (estimation), prediction of future observations, descriptions of association (correlation), or modeling of relationships (regression).
The framework described above is sometimes referred to as applied statistics. In contrast, mathematical statistics (or simply statistical theory) is the subdiscipline of applied mathematics which uses probability theory and analysis to place statistical practice on a firm theoretical basis.
The word statistics (or stats) is also used colloquially to refer to data collected on an entire population rather than a subset of it. Formally, however, statistics is almost always based on samples. In fact, the word statistic (singular) may be defined as a quantity calculated from sample observations.
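The closing sentence of the draft above can be made concrete: a statistic (singular) is just a quantity calculated from sample observations. The sample values here are invented purely for illustration.

```python
def sample_mean(xs):
    """The sample mean: one of the simplest statistics."""
    return sum(xs) / len(xs)

sample = [4, 5, 3, 6]       # observations drawn from some larger population
print(sample_mean(sample))  # a single number computed from the sample: 4.5
```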

I found that I just couldn't find a good way to stick in the frequentist/subjectivist thing. My concern about that was mainly to point out the difference between "classical" and "bayesian" approaches. Perhaps another short "non-sequitur" paragraph could deal with that. Also, I didn't say anything about ANOVA (which is closely related to hypothesis testing, regression and experimental design, so I didn't feel too bad about not mentioning it by name) or data mining (maybe just doesn't belong in the lead). Oh, and not all the links lead to useful articles at this point. (continued below)

I think there is no need to mention the frequentist/subjectivist split in an article on statistics. As far as "best practices" go, you can use whatever philosophy you like, or none at all, to come up with good statistical practice. In mathematical statistics, everyone must agree that, as mathematical theorems, frequentist and bayesian theorems are all "true". Finally, for a while I have held the opinion that frequentism as a philosophy of probability stems from the erroneous identification of the definition of probability on the one hand, and the measurement of a probability on the other hand. Whatever the meaning one ascribes to the word "probability", there is essentially only one way to determine it empirically, and that is to observe a large random sample and make inferences about it using statistics. — Miguel 07:53, 2 Sep 2004 (UTC)
But the two probability interpretations do lead to (almost) completely different approaches to inference. It probably should be mentioned somewhere, just not in the lead. BTW, despite being educated almost entirely from the frequentist perspective, I'm always a little uncomfortable when relative frequency is presented in textbooks as the "definition" of probability. (IOW, I agree with you.) - dcljr 19:31, 2 Sep 2004 (UTC)

Comments? Suggestions? (...I ask with much trepidation) - dcljr 20:49, 1 Sep 2004 (UTC)

Well it's better than what's there now. The reference to human knowledge in the first sentence of the current article is weird (I can't decide whether it's redundant or just wrong). Your additions will be the object of further modifications, but I suggest you blow away the current lead section. CSTAR 23:44, 1 Sep 2004 (UTC)
Okay, I'll leave it here for a few days so others can comment. If there are no strong objections, I'll move it to the article. - dcljr 19:31, 2 Sep 2004 (UTC)
Be bold in updating pages. Miguel 17:33, 3 Sep 2004 (UTC)
In my opinion: I am happy with your first paragraph except for the word "sample"; the rest of your paragraphs should be in the contents; statistics is not "formally" about samples; nor is your distinction between mathematical statistics and applied statistics particularly clear. --Henrygb 01:04, 5 Sep 2004 (UTC)
Is this a Bayesian/frequentist (/decision theory) thing? As I recall, all the classes I've taken and all (?) the textbooks I've seen talk about the subject in terms of samples — both applied and theoretical approaches. I guess I still don't understand what alternative you're proposing. (If not "uncertainty", if not "samples", then what?? Hmm... Are you the person who added the note about decision theory in the opening paragraph?) And when you say "formally", how formal are we talking? "Let X1, X2, ..., Xn be a random sample" formal? "Let X be a random vector with covariance matrix T" formal? "Let X be absolutely continuous with respect to Lebesgue measure μ" formal? Anyway, as I've already mentioned, I don't think this should be an article about statistical theory. Speaking of which, that's what I mean by mathematical statistics: the theory as opposed to the applications (applied = what you do with statistics; theory = why it works). I'm not sure how I could make that paragraph more clear. Suggestions? - dcljr 18:41, 7 Sep 2004 (UTC)
No. I mean things both like "the population of the United Kingdom is about 59.5 million", and like "the difference between the mean and the median is less than or equal to one standard deviation", neither of which have anything to do with samples, but are about data. Statistics covers both of these, as well as sampling. --Henrygb 00:44, 11 Sep 2004 (UTC)

I'm responding to Henrygb's last comment above (at 00:44, 11 Sep 2004), but the indentation is getting a bit extreme, so it's back to the left margin... Okay. Your examples actually wouldn't (necessarily) be covered by the term "statistics" in my book (especially in an article that's trying to explain what statistics is, as opposed to other, similar disciplines/practices):

  • "the population of the United Kingdom is about 59.5 million"

This figure is a "statistic" only in the colloquial sense of the word. It's presumably based on a census. That's not statistics (as in, "I have a degree in statistics"). In fact, you may be familiar with the controversy over using statistical methods in the U.S. census (see the Census article). It's not allowed under most people's interpretation of the relevant clause in the Constitution. (This only serves to illustrate the difference in the concepts; I'm not saying it's an airtight argument.) One could argue that graphical and numerical summaries of populations fall under the term "descriptive statistics", but no one objects to the use of those techniques to interpret census data. My point is, when the word "statistics" is used by statisticians (or by someone teaching the subject, etc.) it almost always means "inferential statistics", which uses information about a sample to infer something about a larger population. Of course, confusing the whole issue is the use of the word "statistics" by governments to refer to census data and summaries thereof (e.g., "Statistical Abstract of the United States" or the "Bureau of Labor Statistics"). The difference here is akin to the difference between the colloquial use of the term geography to refer to the "lay of the land" of an area, and the academic subject of geography, which studies many other things. In any case, the issue(s) you raise (and I've discussed) here should certainly not be ignored, but should be dealt with directly in the article.

  • "the difference between the mean and the median is less than or equal to one standard deviation"

That statement can be made in probability; you don't need statistics at all for that one. Certainly statistics relies heavily on probability, but they are different fields (just as engineering and physics are very different fields, even though the former relies heavily on the concepts and methods of the latter). This is why a great many Wikipedia articles start out, "In probability and statistics..." and not just "In statistics...." I don't want to offend you, Henrygb, but may I ask what your academic background is, especially as it relates to statistics? As you can see above, at first I thought your objections were based on a philosophical difference among statisticians (Bayesians, etc.), then I thought maybe you were objecting at a deep mathematical/theoretical level. I'd like to know what exactly you're basing your views on. - dcljr 05:17, 13 Sep 2004 (UTC)

A strange request, but I'll play. I have a mathematics degree from the University of Cambridge having concentrated on what was called "applicable mathematics" (i.e. numerical analysis, probability, statistics, mathematical economics, coding theory etc.). I am now a member of the (British) Government Statistical Service. Your turn.
I am saying statistics is about data and its handling, presentation and use for drawing inferences, and that the use of samples is only one part of that. What you describe as the "colloquial sense of the word" (which presumably also refers to topics like baseball statistics) is not only the origin of statistics but one of its major contemporary meanings. While random variables and distributions in probability have descriptive statistics, so too do data sets which are not random. Indeed I would suggest that what you think of as statistics is much more probability based than the broader concept I am considering. Look at the list of statistical topics and my guess is that the majority of the articles do not mention sampling. --Henrygb 00:13, 14 Sep 2004 (UTC)
So... when you're doing inference and not using sampling, then you must be using either Bayesian analysis or some decision-theoretic approach, right? Not classical inference (t-test, ANOVA...). Anyway, nevermind. I give up. If others want to weigh in on this subject, please do. Henrygb, at my User page you can see both my statistics credentials (User:dcljr) and my (latest) revised lead section (User:dcljr/Statistics#Preamble; I know you won't agree with one sentence in there). I haven't done anything to the article yet because I'd like to flesh out a little more of the main article text to complement the extensive lead section I'm proposing. Then others can have at it. - dcljr 06:15, 21 Sep 2004 (UTC) I removed the offending statement from my lead section draft in my last edit. - dcljr 06:36, 21 Sep 2004 (UTC)

Probability

I can't make heads or tails from this paragraph:

However, this can often lead to misunderstandings and dangerous behaviour, because people are unable to distinguish between, e.g., a probability of 10⁻⁴ and a probability of 10⁻⁹, despite the very practical difference between them. If you expect to cross the road about 10⁵ or 10⁶ times in your life, then reducing your risk of being run over per road crossing to 10⁻⁹ will make you safe for your whole life, while a risk per road crossing of 10⁻⁴ will make it very likely that you will have an accident, despite the intuitive feeling that 0.01% is a very small risk.

What is meant by 10-4 or 10-9? Is that meant to be scientific notation (ten to the -4th and 10 to the -9th)?

The example makes little sense either. Why 105 or 106 road crossings and not 100, say. And I don't think reducing the risk to 10-9 means it will make you safe for your whole life, rather than that it will be very unlikely that you will be run over.

Unfortunately, the only statistics I learnt was in high school, so I'm not certain how to improve this article myself.

--Martin Wisse 06:51, 2 Nov 2004 (UTC)

You are ignoring (or not seeing) the superscripts. 10⁻⁴ does indeed mean 10 to the power of −4, i.e. 0.0001 or a 1 in 10000 chance. --Henrygb 21:06, 29 Nov 2004 (UTC)
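The arithmetic behind the quoted road-crossing example can be checked directly (crossing counts taken from the quote; independence of crossings is an assumption): the chance of at least one accident in n independent crossings, each with per-crossing risk p, is 1 − (1 − p)ⁿ.

```python
def lifetime_risk(p, n):
    """Probability of at least one accident in n independent crossings,
    each carrying per-crossing accident probability p."""
    return 1 - (1 - p) ** n

print(lifetime_risk(1e-4, 10**5))  # ≈ 0.99995: an accident is very likely
print(lifetime_risk(1e-9, 10**5))  # ≈ 0.0001: safe for a whole lifetime
```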

Could someone who feels half way competent to do so put some pointers to the philosophical foundations of probability and statistics. Statistical reasoning has always fascinated and amazed me with some breathtaking inferences, and it would be nice to know if there is a way into this stuff. --Publunch 18:08, 22 Dec 2004 (UTC)

Probabilities in Bayesian statistics

The following puzzles me

Use of prior probabilities of 0 (or 1) causes problems in Bayesian statistics, since the posterior distribution is then forced to be 0 (or 1) as well. In other words, the data is not taken into account at all! As Lindley puts it, if a coherent Bayesian attaches a prior probability of zero to the hypothesis that the Moon is made of green cheese, then even whole armies of astronauts coming back bearing green cheese cannot convince him. Lindley advocates (…)

I haven't read Lindley's book, but I am a statistician and Bayesian statistics is my area, and I have no idea what the above is supposed to mean. As it stands it is just nonsense to me.

To keep it simple, let's assume a linear model and a normal (Gaussian) distribution. In this case, the posterior distribution is a weighted average of the prior distribution and the distribution of the observations. Before any observations are gathered, the posterior distribution is identical to the prior distribution. As more and more observations arrive, the posterior distribution will converge to the distribution of the observations. Infinitely many observations would result in a posterior distribution identical to the distribution of the observations, with no weight on the prior distribution at all. No matter what the prior distribution is, it will count less and less as more observations are taken into account. In particular, if we use a degenerate prior distribution with infinite variance (and zero density everywhere), the Bayesian approach gives the same result as a “frequentist” approach. The reason is that the prior distribution has zero density and contributes nothing in the weighted average of the prior distribution and the distribution of the observations, giving a posterior distribution always identical to the distribution of the observations. Anyway, I have no idea why probabilities of 0 or 1 should cause trouble in Bayesian statistics. –Peter J. Acklam 22:36, 18 Jan 2005 (UTC)

I think you've misunderstood. The statement that if the prior probability of a proposition is 0 or 1, then so is the posterior, is correct; it's trivial mathematics. You're being really vague about your proposed model. You wrote:
let's assume a linear model and a normal (Gaussian) distribution. In this case, the posterior distribution is
Posterior distribution of what?? Often one talks about a N(μ, σ²) distribution of some quantity to be observed--call that X, and one speaks of prior and posterior distributions of μ (or of μ and σ, but let's keep it simple, and while we're at it assume σ = 1). That's the conditional distribution of X given μ. OK, simple case: the prior says that μ = 1 or 2, each with probability 1/2. Now keep repeating the experiment. The observations of i.i.d. copies of X are conditionally independent given μ. If μ is really equal to 1, then the posterior will, with probability 1, converge to a probability distribution that assigns probability 1 to μ = 1. The posterior distribution will not "converge to the distribution of the observations", since those will be normally distributed! Michael Hardy 02:50, 19 Jan 2005 (UTC)
a weighted average of the prior distribution and the distribution of the observations. Before any observations are gathered, the posterior distribution is identical to the prior distribution. As more and more observations arrive, the posterior distribution will converge to the distribution of the observations.
I wrote that way too late yesterday. What I had in mind was a case where you are estimating μ or σ². The posterior distribution does not, as you point out, converge to the distribution of the observations, but to a distribution based on the information in the observations. Anyway, never mind. –Peter J. Acklam 08:21, 19 Jan 2005 (UTC)
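The two-point-prior example in this exchange can be simulated directly (the setup below follows the stated assumptions: prior probability 1/2 each on μ = 1 and μ = 2, data i.i.d. N(μ, 1) with true μ = 1); the posterior should pile up on μ = 1 as observations accumulate.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible
data = [random.gauss(1.0, 1.0) for _ in range(200)]  # true mu = 1, sigma = 1

def loglik(mu):
    # Log-likelihood of the sample under N(mu, 1); the normalizing
    # constants cancel when comparing the two candidate means.
    return sum(-0.5 * (x - mu) ** 2 for x in data)

# Bayes' theorem with equal prior weight 1/2 on each candidate mean.
post_mu1 = 1.0 / (1.0 + math.exp(loglik(2.0) - loglik(1.0)))
print(post_mu1)  # very close to 1: the posterior concentrates on mu = 1
```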

I wrote the paragraph quoting Lindley that Peter finds puzzling. Bayes' theorem can be expressed in the form

Posterior probability is proportional to Prior probability x Likelihood

If the prior is zero then so is the posterior, since zero times anything equals zero. A similar argument applies if the prior is 1. The likelihood is the part which mathematically models the information content of the data. In the case where the prior is zero it makes no difference what the likelihood is, since it just gets multiplied by zero to make zero. So by choosing a prior probability of zero (or one) you cut yourself off from the ability to take on board the information contained in the data. I hope this helps. Blaise 23:23, 19 September 2005 (UTC)
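A tiny numerical sketch of that point (the hypothesis labels and likelihood values below are made up): a prior of zero is unmovable, no matter how strongly the data favour the hypothesis.

```python
def posterior(prior, likelihood):
    """Posterior ∝ prior × likelihood, normalized across the hypotheses."""
    unnorm = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypotheses: ["Moon is green cheese", "it is not"], with prior zero on
# cheese. Even a likelihood of 0.999 for the cheese hypothesis cannot
# move its posterior off zero.
print(posterior([0.0, 1.0], [0.999, 0.001]))  # → [0.0, 1.0]
```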

One of the great things about Bayesian stats is that you can do things directly rather than indirectly, as in frequentist stats. For example, in frequentist stats you can't attach a probability to a hypothesis, so you can't talk about the probability of the hypothesis given the data. Instead, you have to mess around with the probability of the data given the hypothesis, which is not really what you want. Bayesian hypothesis testing is thus very simple in principle (though the sums may get hard in practice). You attach a prior probability to the hypothesis, multiply by the likelihood and divide by the probability of the evidence to normalise it. The point I was making was that probabilities of zero and one make poor choices as priors. (I am not saying that Bayesian stats generally has a problem with zero and one as probabilities, just when they are used as priors.) Your mention of linear models and so on suggests that you are thinking in terms of Bayesian methods of parameter estimation, whereas I was thinking in terms of Bayesian hypothesis testing. Blaise 15:28, 20 September 2005 (UTC)

Help needed

Hi there. Could somebody take a look at the trend article? Is the statistical term trend correct? If so, it needs expansion. Thanks. Oleg Alexandrov | talk 03:46, 24 Jan 2005 (UTC)

Removed some external links

I have massaged the External links section a bit and removed the following entries (others could stand to be culled IMHO, but I didn't do so):

  1. http://www.thenakedscientists.com/HTML/Columnists/robstanforthcolumn2.htm The Probability of Co-incidence
  2. Dedicated website (in Italian)

While the first may be an interesting article, it's not really directly relevant to statistics (it would belong at Probability, if anywhere); and I moved (2) to the statistics article at the Italian Wikipedia. - dcljr 23:11, 27 Jan 2005 (UTC)


virtual reality

Probability: What is the meaning of: In reality there is virtually nothing...? In throwing a die, the event "the die has been thrown" has probability exactly 1. What is meant, I assume, is that one cannot be absolutely sure of any future event. But then this "event" itself is absolutely sure!? 130.89.219.54 17:18, 31 Jan 2005 (UTC)

...yes? --justing magpie 14:58, 2 August 2006 (UTC)

Statistical Software - removal of SigmaXL link

You seem to have a problem with links to commercial sites, but in an inconsistent manner. Why then don't you remove STATA's link? The rules of Wikipedia do not forbid links to commercial sites.

STATA seems to actually have helpful information on their page, and allows you to try certain things. While the SigmaXL Excel add-in only tells you to "Download a 30-Day trial". This, together with your repeated insistence, makes me think that you are looking for free advertising. If so, Wikipedia is not the place to go. Oleg Alexandrov 16:33, 6 Mar 2005 (UTC)

For someone who does not know the difference between stata.com and statsoft.com, you should not be editing the Statistical software page. As for my insistence, our product is a significant contribution to the market for powerful, easy to use and inexpensive statistical software. Therefore it deserves a place of mention alongside products like Minitab. I will remove the url, but request that you keep the name up.

This is meant to be a list of statistical packages in common use. I have previously heard of all the packages listed, except SigmaXL, StatPro and MacAnova. Some quick Google tests give 372,000 hits for Minitab, 363,000 hits for GNU Octave, over 6 million for Stata, and over 8 million for R (actually for statistical R). This compares to less than 500 for the StatPro add-in, about 700 for MacAnova, and 33,000 for SigmaXL. I have therefore removed StatPro and MacAnova from the list, and am tempted to remove SigmaXL unless someone can give evidence that it is as commonly used as some of the other remaining packages. -- Avenue 12:41, 21 Mar 2005 (UTC)

The 8 million Google references for Statistical R go down to 41,000 if you enter "R language", or 38,000 for "R project" statistical.

A Google test for R is inevitably going to be subjective, and I admit that Statistical R will include some false hits, but I think those two search phrases are somewhat unnatural. There are 2.86 million results for R Statistical Software, and the first false hit was number 43 in the list, so I believe the true number of references to R would be measured in hundreds of thousands at least. -- Avenue 15:38, 31 Mar 2005 (UTC)
That SigmaXL was put in by an employee of that company could in itself be a good enough reason to remove it. Probably that employee meant well, but we would not want Wikipedia to be a medium of free advertising. Oleg Alexandrov 13:00, 21 Mar 2005 (UTC)
I disagree; I believe their contribution should be judged on its merits. But the fact that, as an employee, they may have an interest in promoting their company's product means some skepticism is probably called for. -- Avenue 15:38, 31 Mar 2005 (UTC)
You are right; just because it was put in by an employee does not mean it should be deleted automatically. It is all up to you whether to keep that link; I know nothing about statistics. I am just wary of people abusing the external links section. Oleg Alexandrov 16:08, 31 Mar 2005 (UTC)

No evidence that SigmaXL is in relatively common use has been provided. I also note that it only has the fourth highest Google pagerank of the add-ins listed here [1]. I will therefore delete it from the list. However, I will also add Google's link. -- Avenue 13:32, 3 Apr 2005 (UTC)

Questions and Suggestions

I'm neither a statistician nor mathematician so bear with me through these comments.

First sentence.

Is "statistics" a science or is "statistical theory" the science and "statistics" the term for the information gathered? Is "human knowledge" compared to "inhuman knowledge" or "non-human knowledge"? Should the separate article "data" be merged with "statistics" and a redirect left at the "data" heading? The separate "information" article within Wikipedia is significant and easily stands alone but the "data" article seems subsidiary to "statistics".

The first sentence would be clearer to me, a layman, if it read as follows: "Statistics are the information (i.e. knowledge) created by the application of mathematics to data."

Rest of first paragraph:

"The branch of mathematics used is statistical theory. Within statistical theory, randomness and uncertainty are modelled by probability theory. Because one aim of statistics is to produce the "best" information from available data, some authors consider statistics a branch of decision theory. Statistical practice includes the planning, summarizing, and interpreting of observations, allowing for variability and uncertainty."

I think the separate articles "data" and "probability theory" should be merged with "statistics".

And I think there needs to be a discussion of the "statistical failure" in the exit polls during the 2004 U.S. Presidential election to explain -- in layman's terms -- the importance of data accumulation, how errors arise, the mathematically lower probability of error with a larger sample, etc.

Someone (either Mark Twain or Benjamin Disraeli) once said: "There are three kinds of lies: lies, damned lies, and statistics." I think there should be a discussion of "false statistics", information produced to prove a point rather than producing "correct" information.

Johnwhunt 18:49, 27 Mar 2005 (UTC)


Statistics is a science; for example, there are Statistics Departments in many universities. These teach the science (and hopefully some of the art) of statistics, including statistical theory and applications. But I agree that our article should probably also mention the more concrete meaning, i.e. statistics = the plural of statistic.
"Human" does seem redundant. I'll delete it and see if anyone complains.
Data has different meanings in statistics and in computer science, with the latter usage becoming more widespread over time. I think the Data article is needed to distinguish them.
Probability theory is a distinct area from statistics or even mathematical statistics, and deserves its own article.
The Misuse of statistics article discusses misleading statistics, and is listed in the "See also" section here.
There is a separate article on problems related to the 2004 exit polls: 2004 U.S. presidential election controversy, exit polls.
-- Avenue 01:38, 28 Mar 2005 (UTC)

A quick question--the sentence

The implication of using probability theory is that statistical results can not provide definitive cause and effect relationships but can only show correlation relationships.

doesn't seem right to me. Being able to determine cause-and-effect relationships vs. correlation relationships has more to do with experimental design than with the practice of statistics itself, no? (For example, double-blind studies on drug efficacy would presumably use statistics to analyze the data, and those studies are certainly after proving causal relationships.) And the problem of providing "definitive" proof is a weakness of science in general, not statistics. I think this sentence should either be removed or replaced with something more to the effect of "The implication of using probability theory is that we can quantify how likely it is that a particular outcome occurred due to random chance rather than another factor."
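For readers unfamiliar with the "quantify random chance" idea suggested above, here is a minimal sketch of a permutation test in Python. The data are entirely made up for illustration; the point is only that statistics can attach a number (a p-value) to "how often would random chance alone produce a difference this large?":

```python
import random

def simulated_p_value(observed_diff, group_a, group_b, n_perm=10_000, seed=0):
    """Permutation test: how often does randomly relabeling the pooled
    data produce a group difference at least as large as observed?"""
    rng = random.Random(seed)
    pooled = group_a + group_b
    n_a = len(group_a)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = (sum(pooled[:n_a]) / n_a) - (sum(pooled[n_a:]) / (len(pooled) - n_a))
        if abs(diff) >= abs(observed_diff):
            count += 1
    return count / n_perm

# Hypothetical outcome scores from a two-group trial (made-up numbers).
drug = [5.1, 4.8, 6.0, 5.5, 5.9, 6.2]
placebo = [4.2, 4.5, 4.1, 4.9, 4.4, 4.6]
obs = sum(drug) / len(drug) - sum(placebo) / len(placebo)
p = simulated_p_value(obs, drug, placebo)
print(f"observed difference = {obs:.2f}, p = {p:.4f}")
```

A small p-value says the difference is unlikely to be due to chance alone; whether it is *causal* still depends on the experimental design, as the comment above points out.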

Plus, could this article include something on statistical weights? Or should there be a different article for that entirely?

Origin

The Origin section could stand to be made more consistent and accurate (esp. by cross-referencing with other Wikis and other sources). ~ Dpr 05:46, 11 Jun 2005 (UTC)

I've fixed it a little. Does anyone know why "most notably astronomy" is there? Was astronomy particularly important in driving the historical development of statistics? Joshuardavis 19:44, 19 February 2006 (UTC)
On the sciences end (vice social sciences), astronomical measurements (for navigation) and biometry were driving forces. JJL 22:22, 19 February 2006 (UTC)

"Random Sample" and "Simple Random Sample"

We have:

looks like they're talking about the same thing. Is there a statistician in the room? Flammifer

They bear the same relation as mammals and monkeys; i.e. a simple random sample is a random sample, but a random sample need not be a simple random sample. For example, cluster samples and stratified samples can be random samples, but are not simple random samples. Avenue 13:15, 23 September 2005 (UTC)
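For readers unfamiliar with the distinction, the mammals/monkeys analogy above can be made concrete with a toy sketch in Python (hypothetical population of 100 units, standard library only). A stratified sample is still a random sample, but it is not a *simple* random sample, because not every subset of the population is a possible outcome:

```python
import random

population = list(range(100))  # toy population of 100 units
rng = random.Random(42)

# Simple random sample: every subset of size 10 is equally likely.
srs = rng.sample(population, 10)

# Stratified random sample: divide into two strata of 50, then draw
# 5 at random from each. Still random, but a subset drawn entirely
# from one stratum now has probability zero, so it is not "simple".
strata = [population[i:i + 50] for i in (0, 50)]
stratified = [unit for s in strata for unit in rng.sample(s, 5)]

print(sorted(srs))
print(sorted(stratified))
```

By construction the stratified sample always contains exactly 5 units from each half of the population, which the simple random sample does not guarantee.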

Probability

I'm not real fond of part of the following paragraph:

The probability of an event is often defined as a number between one and zero. In reality however there is virtually nothing that has a probability of 1 or 0. You could say that the sun will certainly rise in the morning, but what if an extremely unlikely event destroys the sun? What if there is a nuclear war and the sky is covered in ash and smoke?

Based on this logic, one could say that the probability of something having the probability of 1 (or 0) is 0, and thus, by contradiction, the above statement currently in this section is incorrect. Just remove it and the statement following it.

I agree. As a whole, this section adds little to the Statistics article. It seems to connect only to Bayesian statistics, which does not even appear in the article until this point. Essentially, it is a meandering, anecdotal discussion of the real-world applicability of the mathematical notion of probability. On the other hand, parts of this section might belong in Probability. The example of 10⁻⁴ vs. 10⁻⁹ explains quite well how people misunderstand risk. (But I think these numbers should be written 1/10000 and 1/1000000000, since the intended audience is ostensibly not number-savvy.) Joshuardavis 02:44, 4 February 2006 (UTC)

Where do I find stats about Wikipedia?

For example:

  • The number of users,
  • Pages with the most revisions,
  • current number of pages,
  • Number of revisions per day,
  • and the like..

You would need this page: Wikipedia:Statistics. --shaile 03:18, 15 November 2005 (UTC)

Hidden assumption of probability theory

As written on the morning of Nov. 30, 2005, the text contained a strong but hidden assumption. The assumption was that probability theory was preserved. There are situations, though, under which probability theory is empirically invalidated. A failure to recognize the possibility of this happening has torpedoed much of the past research and literature in the engineering field of "nondestructive testing."

Researchers should be warned of the possibility of making this blunder. To accomplish this, I added a paragraph, which falls after the first paragraph. Terry Oldberg, http://www.oldberg.biz, terry@oldberg.biz

I have deleted this paragraph from the article, because the failure of probability theory assumptions is much rarer in typical statistical practice than other mistaken assumptions (such as independence or normality). We should give these more common problems much greater prominence than the ideas listed in the deleted paragraph.
Also, the repeated self-citations give at least the appearance of vanity information, weakening the article. I have placed the deleted text below in case someone feels there is something here worth incorporating, although I would strongly suggest that articles such as Misuse of statistics or Probability theory would be better places to attempt that. Avenue 12:06, 1 December 2005 (UTC)
In virtually every case, the methods of statistics assume probability theory. However, like any other theory, probability theory can be incorrect. Christensen and Reichert (1976), Oldberg and Christensen (1995) and Oldberg (2005) report observations of systems in which the Unit Measure axiom of probability theory is empirically invalidated. A result is that a number of statistical concepts either do not apply or apply only under restrictive circumstances; these concepts include probability, population, sample, sampling unit, signal and noise. It follows that blindly applying the methods of statistics without first checking for preservation of Unit Measure can lead to blunders. Oldberg and Christensen (1995) and Oldberg (2005) report that a blunder of this type plagues an entire field of engineering. The following presentation assumes probability theory.
Christensen, R. and T. Reichert, 1976, "Unit Measure Violations in Pattern Recognition: Ambiguity and Irrelevancy," Pattern Recognition, Oct. 1976, pp. 239-245; Pergamon Press.
Oldberg, T. and R. Christensen, 1995, "Erratic Measure," in NDE for the Energy Industry 1995, pp. 1-6; The American Society of Mechanical Engineers, New York, NY. Republished by ndt.net at http://www.ndt.net/article/v04n05/oldberg/oldberg.htm .
Oldberg, T., 2005, "An Ethical Problem in the Statistics of Defect Detection Test Reliability", ndt.net, http://www.ndt.net/article/v10n05/oldberg/oldberg.htm.
Regarding the suggestion that my posting might have been motivated by vanity, in the hope of pouring water on a possible flaming war, I'll restrict my remarks to pointing out that, for seekers of truth, attacking one's opponent's errors is permissible. Attacking one's opponent is not.
That 2 of the 3 works cited bear my name is a result of the fact that I am unaware of any other works on the topic of violations of the Unit Measure axiom of probability theory in the practice of statistics. If anyone is aware of additional works, I request that they supply citations to them.
Failure to expose the assumptions supporting a conclusion is forbidden in technical writing. A failure of the statistics community to sufficiently expose this one has had dire consequences for the field of nondestructive testing. If you live near a nuclear reactor or refinery, cross bridges, work in a steel framed building, fly on aircraft or own stock in a company that owns any of these devices, your life and property are threatened by the simultaneous assumption of Unit Measure and empirical violation of it in nondestructive testing.
The person who edited out the paragraph which I contributed states that false assumptions of probability theory occur more rarely in the practice of statistics than other false assumptions but supplies no citation to a study or studies supporting this assertion. On the other hand, I supplied citations to peer reviewed articles demonstrating violations of Unit Measure that lay unrecognized in the field of nondestructive testing for an extended period and to the distinct detriment of the people of the world. Decisions about the content of the Wikipedia article on statistics should be made on the basis of peer reviewed articles rather than anecdotes, where possible.
For all of the above reasons, I submit that a warning of the assumption of probability theory and possibility of it being empirically violated is required in this article. Said warning should appear before the second paragraph, wherein the term "sample" appears. Samples do not exist under violations of Unit Measure.
Terry Oldberg Dec. 1, 2005
I have no wish to get into a flame war either. I said that your self-citations gave the appearance of vanity information. I did not mean to imply that you were motivated by vanity, and I apologise if I caused any offence. Perhaps a better way of phrasing my criticism would be that the references provided had overlapping authorship, not meeting Wikipedia's goal of multiple independent sources. References from reputable statistics journals would also carry more weight in the context of this article. (The 1976 reference might qualify, but I think not the others.)
I do agree with you that so far our article does not give enough emphasis to the dangers of unfounded assumptions, but I disagree about which assumptions are most important. I have searched for "unit measure" in the online versions of two reference works on statistics, namely the Encyclopedia of Statistical Sciences (2nd Ed., Wiley, 2005) and StatSoft's Electronic Statistics Textbook. No documents matched, suggesting that this is not a topic of serious concern to the editors of either work. In contrast, there are two articles in Wiley's encyclopedia on testing for normality, or departures from it, and there were over 500 matches both for "normality" and for "independence". I think this strongly supports my belief that our article should cover these assumptions first. Avenue 15:11, 2 December 2005 (UTC)
Avenue: I've added a brief, second paragraph, with a warning that a) statistics assumes probability theory, b) key elements of the terminology of statistics assume probability theory and c) probability theory can be violated empirically. There are no references to my own works or to any works at all, for that matter. I moved a detailed discussion to the topic of "Misuse of statistics", per your suggestion. Thank you for making it.
Unit measure is a way of identifying an axiom that is described as Kolmogorov's 2nd axiom in the current edition of Wikipedia. The phrase "unit measure" appears in one of the papers that I referenced in my original submission. I don't know whether there is a conventional way of referencing this axiom or, if so, what that way is. I have no information on the frequency with which violations of this axiom have appeared in scientific studies or the usage of statistical models. I can tell you that it has been observed in biomedical research (see the paper by Christensen and Reichert) and in the field of defect detection testing. The latter is where I encountered it more than 20 years ago, while serving in a role in which I directed much of the world's research on the safety inspections of the tubes of nuclear reactor steam generators. The methods of inspection violated unit measure, but the scientific literature assumed unit measure.
Violations of this axiom are ubiquitous in the literature of defect detection testing and they are buried by misusage of terms that imply the preservation of probability theory; this has been true for a period of more than 30 years. If a person you care about lives near a nuclear power reactor, refinery or chemical plant, flies on airplanes, crosses bridges, or relies on the reliability of any other kind of structure that functions under mechanical stress, this person's expectation of a healthy and prosperous life is diminished by this type of misuse of statistics.
My claim that what I have said in the above paragraph is true is based, in part, on four peer-reviewed publications. The contrary claims of the United States Nuclear Regulatory Commission (whose studies are statistically flawed, and which has subjected the people of the United States to unnecessary risk through its own incompetence, if my claim is true) were rejected and mine accepted by a peer-review panel, for a highly reputed engineering society, that included an academic statistician. By the way, the co-author of one of my papers is a theoretical physicist who has worked as a statistician for more than 35 years; he has published 7 books and a number of articles on theoretical and applied statistics. To my knowledge, nobody has refuted or limited my claims in the 16 years since I began to publish them, or the 21 years since I made them orally at an engineering conference. This is true even though: a) a publication with a circulation of 1 million copies, Business Week magazine, published an article featuring one of my papers a decade ago; and b) two of my papers have been published in a Web-based journal with an international circulation of 80,000 readers and discussed in an online forum for more than 6 years.
In light of the above, it seems to me that caveats in Wikipedia's article on statistics are apropos.
Terry Oldberg 06:36, 16 December 2005 (UTC)

Mr. Oldberg, could I ask you to conform to Wikipedia conventions (see Wikipedia:Manual of Style)? You've created some articles with gratuitous capitals in their titles. I moved Unit Measure to unit measure (with a lower-case initial m). You've started articles with dictionary-style definitions rather than complete sentences, and neglected to bold the title phrase at its first appearance. Sometimes you omit all links. To see what I have in mind, look at my edits to the articles you've worked on. Michael Hardy 19:56, 1 December 2005 (UTC)

Mr. Hardy: Thank you for alerting me to this. Terry Oldberg
Perhaps someone could expand unit measure? Presently it is very short. Punkmorten 11:58, 14 January 2006 (UTC)

In the comments above, Terry Oldberg asserts that "like any other theory, probability theory can be incorrect". Can we agree that probability theory, as a mathematical theory (distinct from a scientific theory -- see theory), cannot be incorrect (ignoring certain philosophical issues)? On the other hand, it can certainly be applied incorrectly to situations that do not satisfy the assumptions of the theory, and this is worth noting in the article. In my opinion, this warning belongs in the conceptual overview, not the intro, for two reasons:

  • I want to keep the intro short. It is an intro to the article, not an intro to statistics. The article itself is an intro to statistics.
  • Putting a "problem" in the intro might cause a casual reader to infer that statistics is unsound. I would like readers to gain a healthy skepticism for statistics, but not an unhealthy one. (In my mind this is analogous to the healthy skepticism that schoolkids should have of evolution or any other overwhelmingly validated scientific theory.)

For these reasons I removed the last of Mr. Oldberg's warnings today, but replaced them with a paragraph about misuse of statistics and statistical literacy. Respectfully, Joshuardavis 20:50, 20 February 2006 (UTC)

Joshuardavis asks: "Can we agree that probability theory, as a mathematical theory (distinct from a scientific theory -- see theory), cannot be incorrect (ignoring certain philosophical issues)?" If his proposition is that probability theory follows from its premises, it would be impossible to disagree with him. However, whether his proposition is true is off the pertinent topic. The topic is whether the Statistics article should warn readers that: a) mathematical statistics assumes probability theory and b) an axiom of probability theory can be and has frequently been empirically invalidated in scientific studies that assumed statistics; these studies reached necessarily false conclusions. Should the Statistics article fail to warn of the possibility that mathematical statistics doesn't work when large numbers of the world's people are exposed to unnecessary risks from explosions of nuclear reactors and downings of aircraft from false assumptions of mathematical statistics? Comments?Terry Oldberg 05:23, 25 February 2006 (UTC)
I do not believe that we need to warn readers about the issue discussed here. There are certainly many assumptions underlying statistical theory and practice, and some of these are widely recognized as being vital to much statistical work. For instance, many statistical procedures assume normality, and much has been written about failures of this assumption and how to detect and address them (e.g. the articles in Wiley's Encyclopedia of Statistical Sciences that I mentioned above). Failures of the "unit measure" assumption that Mr Oldberg raises do not seem to concern many statisticians, and we should not include warnings about them in our article unless wide concern about them can be demonstrated. -- Avenue 12:32, 25 February 2006 (UTC)
Inspired by Terry Oldberg, I have expanded the "subtle but serious" discussion to emphasize how important/dangerous such errors in application of statistics can be. My feeling is that discussion of particular kinds of errors should be left entirely to the misuse of statistics article. I am not qualified to evaluate Avenue's assertion that the normality assumption is more worrisome than the unit measure assumption, but if it is then I hope someone will write about it for misuse of statistics; it's not there. Lastly, I suggest that language like "empirical invalidation of probability theory" should be replaced with language like "misapplication of statistics/probability (due to incorrect assumptions)", which places the blame where it's due. Joshuardavis 03:04, 27 February 2006 (UTC)

RFC

Hi. I was wondering if any of you guys can help us out over at Talk:Intelligent design. There's a line in the current ID article that says:

A Newsweek article reported The Discovery Institute's petition being signed by about 350 scientists, while the AAAS (the largest association of scientists in the U.S.) has 120,000 members, indicating that around 0.3 % of U.S. scientists give some support to ID. The international percentage is likely to be much smaller.

Some people (including me) there think there is some sort of sampling bias (I was thinking it is selection bias), but not being statisticians we're not really sure exactly what it is. Others are fairly adamant that it's just a simple matter of math. I was hoping someone could help us sort it out one way or the other. Thanks. You can just post here, I'll check back, or if you want leave a message on my talk page or post on Talk:Intelligent design under the heading "Support among scientists – this is bogus." --Ben 02:33, 23 December 2005 (UTC)
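For what it's worth, the arithmetic behind the quoted 0.3% figure is trivial; the statistical problem is where the numbers come from. A quick sketch (numbers taken from the quote above):

```python
# Reproduce the quoted arithmetic and flag the sampling problem.
petition_signers = 350      # self-selected signers of one petition
aaas_members = 120_000      # membership of one organization

naive_pct = 100 * petition_signers / aaas_members
print(f"naive estimate: {naive_pct:.2f}% of 'scientists'")  # prints 0.29%

# The numerator and denominator describe different groups: petition
# signers need not be AAAS members, and AAAS members are not all U.S.
# scientists. Neither group is a random sample of "U.S. scientists",
# so this is a selection-bias problem, not a sampling estimate.
```

So the math itself is fine, but treating the result as the fraction of U.S. scientists supporting ID assumes both groups represent the same well-defined population, which they don't.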