Talk:Misuse of p-values


Dividing into sections

An excellent start! The article could use some section headings and reorganization to improve the flow however. Perhaps the following outline would work:

  • Origins (Laplace/Pearson/Fisher)
  • Misinterpretations (Goodman S (2008). "A dirty dozen: twelve p-value misconceptions" (PDF). Seminars in Hematology. 45 (3): 135–40. doi:10.1053/j.seminhematol.2008.04.003. PMID 18582619.)
  • Alternatives (Bayesian)

Cheers. Boghog (talk) 07:56, 21 February 2016 (UTC)[reply]

I've taken a stab at it. -Bryanrutherford0 (talk) 13:42, 21 February 2016 (UTC)[reply]
Either version looks good to me. Is there perhaps a better section name for "Alternatives"? It feels like that might imply "Alternatives to using p-values" which might not fit in the scope of an article about the fallacy. Also, we might want to move things around so that everything fits into the individual sections more neatly - I really just put the content I had into paragraphs. :-) Sunrise (talk) 15:02, 21 February 2016 (UTC)[reply]

Worked example

The jargon and wiki-linked phrases that many will need to follow may make this a little abstract for the lay reader. I don't know if there's any place for it, but I've found a good teaching aid in this article:

It does point up the pitfalls of small sample size and the inevitability of false positives when you look at enough measurements. Even if it's not used, it's worth a read. --RexxS (talk) 00:51, 22 February 2016 (UTC)[reply]
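As a rough illustration of the "enough measurements" point: with 20 independent comparisons of a true null effect, the chance of at least one p < 0.05 is about 1 - 0.95^20 ≈ 64%. Below is a minimal simulation sketch in Python; the group sizes, the 20-comparison setup and the use of numpy/scipy are illustrative assumptions, not anything taken from a source.

    # Minimal sketch: false positives become near-inevitable when many null effects are tested.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_experiments = 2000   # simulated studies
    n_tests = 20           # independent comparisons per study, none with a real effect
    n_per_group = 10       # small sample size per group

    hits = 0
    for _ in range(n_experiments):
        pvals = [stats.ttest_ind(rng.normal(0, 1, n_per_group),
                                 rng.normal(0, 1, n_per_group)).pvalue
                 for _ in range(n_tests)]
        hits += min(pvals) < 0.05

    print(hits / n_experiments)   # roughly 1 - 0.95**20, i.e. about 0.64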

Bayesian statistics and p-values

I'm having a hard time understanding what this article is trying to communicate. Let's start at the beginning. The first sentence is, "The p-value fallacy is the binary classification of experimental results as true or false based on whether or not they are statistically significant." This is excellent. I understand what it's saying and agree with it entirely. But moving ahead to the second paragraph, we are told that "Dividing data into significant and nonsignificant effects ... is generally inferior to the use of Bayes factors". Is it? From a Bayesian perspective, certainly. But a Bayesian would tell you that you shouldn't be doing hypothesis testing in the first place. Whereas a frequentist would tell you that a Bayes factor doesn't answer the question you set out to ask and that you need a p-value to do that.

In a pure Bayesian approach, you begin with a prior probability distribution (ideally one which is either weakly informative or made from good expert judgment) and use your experimental results to create a posterior distribution. The experimenter draws conclusions from the posterior, but the manner in which he or she draws conclusions is unspecified. Bayesian statistics does not do hypothesis testing, so it cannot reject a null hypothesis, ever. It does not even have a null hypothesis. At most you might produce confidence intervals, but these confidence intervals are not guaranteed to have good frequentist coverage properties; a 95% Bayesian confidence region says only that, under the prior distribution and given the observed data, there is a 95% chance that the true value of the parameter lies in the given region. It says nothing about the false positive rate under the null hypothesis because in Bayesian statistics there is no such thing as a null hypothesis or a false positive rate.

Let's move on to the "Misinterpretation" section. It begins, "In the p-value fallacy, a single number is used to represent both the false positive rate under the null hypothesis H0 and also the strength of the evidence against H0." I'm not sure what the latter half of this sentence means. The p-value is, by definition, the probability, under the null hypothesis, of observing a result at least as extreme as the test statistic, that is, the probability of a false positive. That's one part of what the article defines as the p-value fallacy. But what about "the strength of the evidence against H0"? What's that? Perhaps it's intended to be a Bayes factor; if so, then you need a prior probability distribution; but hypothesis testing can easily be carried out without any prior distribution whatsoever. Given that I don't understand the first sentence, I guess it's no surprise that I don't understand the rest of the paragraph, either. What trade-off is being discussed, exactly?

The paragraph concludes that something "is not a contradiction between frequentist and Bayesian reasoning, but a basic property of p-values that applies in both cases." This is not true. There is no such thing as a Bayesian p-value. The next paragraph says, "The correct use of p-values is to guide behavior, not to classify results; that is, to inform a researcher's choice of which hypothesis to accept, not provide an inference about which hypothesis is true." Again, this is not true. A p-value is simply a definition made in the theory of hypothesis testing. There are conventions about null hypotheses and statistical significance, but the article makes a judgmental claim about the rightness of certain uses of p-values which is not supported by their statistical meaning. Moreover, p-values are used in frequentist statistical inference.

The last paragraph claims, "p-values do not address the probability of the null hypothesis being true or false, which can only be done with the Bayes factor". The first part of the sentence is correct. Yes, p-values do not address the probability of the null hypothesis being true or false. But the latter part is not. Bayes factors do not tell you whether the null hypothesis is true or false either. They tell you about odds. Your experiment can still defy the odds, and if you run enough experiments, eventually you will defy the odds. This is true regardless of how you analyze your data; the only solution is to collect (carefully) more data.

It sounds to me like some of the references in the article are opposed to hypothesis testing. (I have not attempted to look at them, though.) I am generally skeptical of most attempts to overthrow hypothesis testing, since it is mathematically sound even if it's frequently misused and misunderstood. I think it would be much more effective for everyone to agree that 95% confidence is too low and that a minimum of 99.5% is necessary for statistical significance. Ozob (talk) 03:59, 22 February 2016 (UTC)[reply]
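A toy calculation may make the "Bayes factors speak in odds" point concrete. This is only a sketch under invented assumptions: a single observation x from a unit-variance normal, simple hypotheses H0: mean = 0 versus H1: mean = 1, and 1:1 prior odds.

    # Sketch: a Bayes factor is a likelihood ratio; with prior odds it gives posterior odds,
    # not a true/false verdict on H0.
    from scipy.stats import norm

    x = 1.8                                            # hypothetical observed value
    bf_10 = norm.pdf(x, loc=1) / norm.pdf(x, loc=0)    # evidence for H1 over H0, about 3.7
    posterior_odds = bf_10 * 1.0                       # assuming 1:1 prior odds
    prob_h0 = 1 / (1 + posterior_odds)                 # posterior probability of H0, about 0.21

    p_value = norm.sf(x)                               # one-sided p-value under H0, about 0.036
    print(bf_10, prob_h0, p_value)

Under these made-up numbers the result is "significant" at p ≈ 0.036, yet the posterior probability of H0 is still about 21%, which is the kind of gap being discussed here.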

I am not an expert in statistics, but I am familiar with parts of it, and I can't understand this article. Introductory textbooks teach that a p-value expresses the probability of seeing the sample, or a sample more "extreme" than it, given the null hypothesis. Is the p-value fallacy a particular misinterpretation of this conditional probability, such as the probability of the null hypothesis given the data? Try to explain where exactly the fallacious reasoning becomes fallacious. Mgnbar (talk) 04:20, 22 February 2016 (UTC)[reply]

I agree with the above comments. Bayesians will say a Bayesian approach is better, but frequentists disagree. Bondegezou (talk) 11:14, 22 February 2016 (UTC)[reply]
There are also notable criticisms of the Bayesian approach: Objections to Bayesian statistics 2001:56A:75B7:9B00:61C6:619F:B3AB:5001 (talk) 04:10, 23 February 2016 (UTC)[reply]
I'm somewhat surprised by the comments here, because it seems like the article is being interpreted in the context of the frequentism vs Bayesianism debate, which I have no interest in. The "inferior to Bayes factors" comment is relatively unimportant to this article, and I'd be fine with removing or qualifying it.
For the statements which are claimed to be incorrect, all I can say is that they're sourced from the references, unless I've misread or misunderstood them (which is entirely possible).
  • The "single number" statement is drawn from the following passages in Goodman: "...the mistaken idea that a single number can capture both the long-run outcomes of an experiment and the evidential meaning of a single result." and "the question is whether we can use a single number, a probability, to represent both the strength of the evidence against the null hypothesis and the frequency of false-positive error under the null hypothesis...it is not logically possible." The trade-off is that you can't have both of these at the same time.
  • The "basic property" statement is sourced from the comment in the source that "we are not discussing a conflict between frequentist and Bayesian reasoning, but are exhibiting a fundamental property of p values that is apparent from any perspective." I think my wording may have been unclear since it could be interpreted as referring to the preceding sentence and not the fallacy in general, or perhaps I shouldn't have converted "any perspective" to "these specific perspectives" (even though the latter is logically contained in the former). Part of the point of including this statement was to reinforce the point that the fallacy is not about frequentism vs Bayesianism. It's definitely not intended to claim that there are Bayesian p-values, and proposed alternative wordings would be appreciated. :-)
  • The third statement seems to be based on a misreading of the article, because there isn't any claim that Bayes factors tell you about the truth of H0, only about probabilities. This is an equivalent statement to the "correction" statement provided here (since, of course, odds and probabilities are interconvertible). Again, AFAICT the source supports the statement. It also adds "in some cases the classical P-value comes close to bringing such a message," which I left out since it didn't seem directly relevant, but I'd be fine with adding that.
In response to Mgnbar, the fallacious reasoning (as defined in the first sentence) is "result X is statistically significant; therefore result X is true" or "result X is not statistically significant; therefore result X is false." Sunrise (talk) 10:34, 23 February 2016 (UTC)[reply]
Thank you for your response. My response is below. Mgnbar (talk) 14:16, 23 February 2016 (UTC)[reply]

Merger proposal

Continuing on from the above, I've had another read-through of this article and, the more I read, the more flawed it appears to me. It is inconsistent as to precisely what the "p-value fallacy" is, and more broadly it reads as simply a Bayesian critique of p-values. There is useful material here on challenges with p-values, common mistakes, and that Bayesian critique, but all that would be better integrated into the statistical significance article, not separated off here (WP:CFORK). I question notability for an article under the present name, and propose a merger into statistical significance. Bondegezou (talk) 15:09, 22 February 2016 (UTC)[reply]

Change that: merger proposed to p-value. Bondegezou (talk) 15:14, 22 February 2016 (UTC)[reply]
I disagree with the merger. This is a topic too important to be restricted to a subsection of the p-value article. It needs a section in p-value summarizing the fallacy, and with a {{Main}} template pointing to here. Headbomb {talk / contribs / physics / books} 15:47, 22 February 2016 (UTC)[reply]
The (subjective) importance of a topic is not, and never has been, a reason for creating a fork or spinoff article. Please review Wikipedia:Content forking. If having a section in p-value discussing the fallacy would make that article too long or would give undue weight to this topic, then a spinoff article with a summary section would be appropriate, but there are no other reasons laid out in policy for a separate article. --RexxS (talk) 16:46, 22 February 2016 (UTC)[reply]
There is plenty to be said on the limitations of p-values, but that material should be upfront in the p-value article. If it's important, then you want it in the prominent article, not sidelined to a separate article. As RexxS says, if the p-value article gets too long, then we need to take a look at the article as a whole and consider how best to split it up. Bondegezou (talk) 16:52, 22 February 2016 (UTC)[reply]
I disagree with the merger for quite a different reason. Because I don't know what the "p-value fallacy" is (despite reading this article several times now), I'm not sure that it hasn't already been covered at P-value#Misunderstandings. A clear description of the p-value fallacy, in terms of probability theory concepts and their practical interpretations, as I requested in the preceding section of this talk page, would go a long way toward convincing me one way or the other. Mgnbar (talk) 17:14, 22 February 2016 (UTC)[reply]
Also, this article could really use a concrete example or case study (whether or not the content gets moved somewhere). Mgnbar (talk) 18:13, 22 February 2016 (UTC)[reply]
I'm currently on the "no merge" side, for editorial reasons. It will probably be easier to include this in List of fallacies and related lists and categories if it's in a single article. WhatamIdoing (talk) 22:57, 22 February 2016 (UTC)[reply]
Looking at List of fallacies, it seems to me that this article does not fit there and does not describe a fallacy in the same sense. This is an article about a misunderstanding of or problem with the p-value, not a type of fallacy. Bondegezou (talk) 00:07, 23 February 2016 (UTC)[reply]
I considered making this a section of the p-value article, but eventually concluded that this was best as a separate topic, due to e.g. being named as such in the cited sources. It's not intended as a critique of p-values (Bayesian or otherwise), although ironically I think it's moving in that direction with the recent additions to the article. (If I'd wanted to create an article on misinterpretations of p-values, I would have used the content in the p-value article as a template!)
For concrete examples, there are several in the cited papers. Dixon provides one from which I sourced the statement "analysis of nearly identical datasets can result in p-values that differ greatly in significance" - he presents two datasets, one of which has a significant main effect and nonsignificant interaction effect, while the other has a significant interaction effect but a nonsignificant main effect.
I don't have a strong opinion on whether this should be named a fallacy - I used the term because it's what the sources use. I don't think it seems out of place among the entries in Category:Probability fallacies though. Sunrise (talk) 10:53, 23 February 2016 (UTC)[reply]
I think it needs to be part of the article; one should know this central point about p-values while studying the concept. Limit-theorem (talk) 11:37, 16 March 2016 (UTC)[reply]

Odds of a false positive

Headbomb has added a section on the odds of a false positive statistically significant result. Plenty of good material there, but this is a separate issue to the purported p-value fallacy that this article is about. Why not move that section to p-value or Type I and type II errors? This looks like classic content forkery to me. This article should not be a dumping ground for all flaws with or misunderstandings of the p-value. Bondegezou (talk) 16:39, 22 February 2016 (UTC)[reply]

The p-value fallacy is, according to the lead, "The p-value fallacy is the binary classification of experimental results as true or false based on whether or not they are statistically significant. It derives from the assumption that a p-value can be used to summarize an experiment's results, rather than being a heuristic that is not always useful." The odds of a false positive occurring seem pretty relevant to this article. That the material could also be added to other articles has little bearing on whether we should include it here. The xkcd portrayal is significant commentary on the p-value fallacy, and is considered a good/great enough example of it by statisticians to use it themselves. I don't see what is gained by removing the section. Headbomb {talk / contribs / physics / books} 17:04, 22 February 2016 (UTC)[reply]
The problem of false positives increasing with multiple testing is a different issue to the "p-value fallacy" as currently described in the article. You have not provided any citations that link the problem you've written about with the term "p-value fallacy". I am thus suggesting moving your section, not removing it, to somewhere more relevant and, indeed, more high profile! Bondegezou (talk) 17:13, 22 February 2016 (UTC)[reply]
But isn't this exactly what this example is about? You get a false positive, but consider it a true deviation from the null hypothesis because it passed the 'p-value test' for significance. That's the p-value fallacy: "the binary classification of experimental results as true or false based on whether or not they are statistically significant". Headbomb {talk / contribs / physics / books} 19:11, 22 February 2016 (UTC)[reply]
As others have said, it isn't that clear what this article is about, but as far as I can make out, the problem of multiple testing (which is a very real problem) occurs whether or not one accepts the criticism of the p-value fallacy. The p-value fallacy critique already applies to a single p-value test, long before you get into multiple testing problems. The problem you're writing about is a specific problem about multiple testing, whereas the purported p-value fallacy is about every single test. Thus, while both are criticisms of p-values, they are distinct from each other.
The proof of the pudding, so to speak, is in the reliable source citations. Do you have any that connect what you are writing about to the term "p-value fallacy"? If it's an example of the p-value fallacy, such should be easy to find. Bondegezou (talk) 22:44, 22 February 2016 (UTC)[reply]

I note there are 4 citations given in this section. I have looked through them all (7, 8, 9, 10): none of them use the term "p-value fallacy". Indeed, I note that other citations given in the article (2, 6, 11) do not use the term "p-value fallacy". If no citations in this section make any reference to "p-value fallacy", then I suggest again that this section is not about the p-value fallacy and constitutes WP:OR in this context. There is good material here that I am happy to see moved elsewhere, but this is not relevant here. Would others than me and Headbomb care to chime in? Can anyone firmly link this section to the p-value fallacy? Bondegezou (talk) 10:27, 23 February 2016 (UTC)[reply]

I would agree that multiple testing isn't part of the p-value fallacy - this is useful content but I think this section probably fits better elsewhere. As a point of clarification, though, the fallacy is not about having used p-values at all; it's about a specific conclusion that is drawn from their use (as described above). Sunrise (talk) 10:51, 23 February 2016 (UTC)[reply]
Multiple testing isn't the p-value fallacy; it's related to it, in that you will have more false positives in the case of multiple testing. And then, ignoring the broader context (and the expected rates of false positives), declaring that you have discovered something (or someone else declaring that someone has discovered something) because there's only a 5% chance it could be wrong, that is the p-value fallacy in action. [1]. Headbomb {talk / contribs / physics / books} 13:49, 24 February 2016 (UTC)[reply]
Headbomb, you are the only person who is defending having this section here. You have provided no citations showing a connection between this section and the term "p-value fallacy". I propose we remove this section. Bondegezou (talk) 09:41, 3 March 2016 (UTC)[reply]
I agree. This article is not about all fallacies involving p-values; it is about a specific fallacy which has been called the "p-value fallacy". Multiple testing is a different subject. Ozob (talk) 13:28, 3 March 2016 (UTC)[reply]
A different, but related, subject with relevance to the p-value fallacy, e.g. [2]/[3]/[4]. To quote from the third link: "There are serious problems with interpreting individual P-values as evidence for the truth of the null hypothesis (Goodman, 1999). It is also well established that reporting measures of scientific uncertainty such as confidence intervals are critical for appropriate interpretation of published results (Altman, 2005). However, there are well established and statistically sound methods for estimating the rate of false discoveries among an aggregated set of tested hypotheses using P-values (Kendziorski and others, 2003; Newton and others, 2001; Efron and Tibshirani, 2002)." Goodman 1999 being the article where the 'P-value fallacy' term was first mentioned, but Goodman did by no means invent or discover a new type of fallacy here. Headbomb {talk / contribs / physics / books} 14:56, 3 March 2016 (UTC)[reply]
To the contrary, you're the only one arguing for its removal. To have an article on the p-value fallacy without a discussion of false positive rates is like having an article on weight loss without mentioning calorie intake. Headbomb {talk / contribs / physics / books} 13:28, 3 March 2016 (UTC)[reply]
You still haven't linked the term "p-value fallacy" to this section.
User:Sunrise and User:Ozob appear to be in agreement with me here. No-one else has posted in this section. Bondegezou (talk) 17:52, 3 March 2016 (UTC)[reply]
Yeah, based on the sources we have it seems that the p-value fallacy only refers to one out of the many different misuses of p-values. It's good content and I wouldn't want to lose it, so one option might be to start a new article expanding on p-value#Misunderstandings, called "Misunderstandings of p-values" or something similar (Shock Brigade Harvester Boris also suggested "Fallacies in null hypothesis significance testing," though I'd prefer "misunderstandings" over "fallacies.") Multiple testing and the p-value fallacy would be separate sections in that article. The main potential issue is POV-forking, but I think a neutral article could still be written. Sunrise (talk) 22:11, 3 March 2016 (UTC)[reply]

Not explained clearly enough

I think it would be possible to explain the fallacy better. Here is my attempt: When an experiment is conducted, the question of interest is whether a certain hypothesis is true. However, p values don't actually answer that question. A p value actually measures the probability of the data (or data more extreme), assuming that the hypothesis is correct. The fallacy consists in getting this backward, by believing that the p value measures the probability of the hypothesis, given the data. Or to put it a bit differently, the fallacy consists in believing that a p value answers the question one is interested in, rather than answering a related but different question. Looie496 (talk) 18:22, 22 February 2016 (UTC)[reply]
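To make the two conditional probabilities tangible, here is a minimal simulation sketch; the 10% prevalence of real effects and the 0.5 SD effect size are assumptions chosen purely for illustration.

    # Sketch: the p-value is P(data this extreme | H0), not P(H0 | data).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_studies, n = 10_000, 30
    real = rng.random(n_studies) < 0.10                # 10% of tested effects are real (assumed)
    null_and_sig, sig = 0, 0
    for is_real in real:
        mu = 0.5 if is_real else 0.0                   # real effects are 0.5 SD (assumed)
        p = stats.ttest_ind(rng.normal(0, 1, n), rng.normal(mu, 1, n)).pvalue
        if p < 0.05:
            sig += 1
            null_and_sig += not is_real

    print(null_and_sig / sig)   # fraction of "significant" results where H0 was true: far above 0.05

Every individual test here used the conventional p < 0.05 cutoff, yet under these assumptions roughly half of the "discoveries" come from true nulls.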

You hit the nail on the head. I strongly support including this explanation in the lead. Cheers. Neodop (talk) 18:56, 22 February 2016 (UTC)[reply]
Likewise, I like it. We need a dead simple explanation and the above fits the bill. Boghog (talk) 19:32, 22 February 2016 (UTC)[reply]
I agree as well. Headbomb {talk / contribs / physics / books} 20:23, 22 February 2016 (UTC)[reply]
So, as I conjectured earlier on this talk page (but never got a response), the fallacy consists of confusing two conditional probabilities? Then isn't it essentially the first misunderstanding already listed at P-value#Misunderstandings? Mgnbar (talk) 20:31, 22 February 2016 (UTC)[reply]
Agree with Mgnbar. Bondegezou (talk) 22:46, 22 February 2016 (UTC)[reply]
I also agree. I now believe that this page should be merged with p-value. Ozob (talk) 01:22, 23 February 2016 (UTC)[reply]

Why don't you guys read the references? Sellke et al make it quite clear in their paper. They indicate three misinterpretations of the p-value, the first of which corresponds to the p-value fallacy sensu stricto as formulated by Goodman in 1999. The second one is what Looie496 explained above. All three misinterpretations are interrelated to some extent and merit being explained in the article, although the first one should be the one more extensively described per most of the literature.

"the focus of this article is on what could be termed the "p value fallacy," by which we mean the misinterpretation of a p value as either a direct frequentist error rate, the probability that the hypothesis is true in light of the data, or a measure of odds of H0 to H1. [The term "p value fallacy" was used, in the first of these senses, in the excellent articles Goodman (1999a,b).]" (Sellke et al., 2001)

Given the large number of publications on the topic and the fact that it has a consistently used name, I find it a terrible idea to merge into the p-value article. Neodop (talk) 01:42, 23 February 2016 (UTC)[reply]

I haven't read the reference because it is behind an inconvenient paywall. Please explain what the phrase "direct frequentist error rate" means. If the phrase "p-value fallacy" describes all three misinterpretations that you quoted, and the second misinterpretation is explainable in basic probability theory concepts, then presumably the first misinterpretation is also explainable. So please explain it. Regards. Mgnbar (talk) 02:54, 23 February 2016 (UTC)[reply]
The phrase "what could be termed the "p value fallacy,"" suggests this is not an established piece of terminology and the article could be falling foul of WP:NEO. That aside, there are important and much-discussed critiques of p-values: what I don't understand, User:Neodop, is why you want to bury those in a fork article and not have them up–front in the p-value article. Each click required loses readers: if you want people to read this material, put it in the main article. Bondegezou (talk) 08:36, 23 February 2016 (UTC)[reply]
To follow-up on that: hits yesterday for p-value fallacy were 231, while hits for p-value were 5732. If something is an important critique of p-values, it should be in the main article. Bondegezou (talk) 10:12, 23 February 2016 (UTC)[reply]
The term was coined in 1999 and that paper was published in 2001, so at that time it may have been a "neologism" as you imply, but now it's 2016 and papers keep being published using the term extensively (an example). Regardless of the title, which is fine, the article clearly refers to a self-contained, widely discussed topic that has been the focus of numerous publications since the early 1990s. Goodman started writing about this in 1992 and responses to his early papers were still being published in 2002! So clearly this is a relevant topic within the field and merits a good Wikipedia article. Cheers. Neodop (talk) 11:05, 23 February 2016 (UTC)[reply]
The phrase "direct frequentist error rate" means assuming that p=0.05 indicates a false discovery rate of 5%, when in reality the associated FDR is much higher (36% in this paper's example). This is the main practical implication of the p-value fallacy and one of the reasons why most published research findings are false, to quote Ioannidis. Neodop (talk) 11:05, 23 February 2016 (UTC)[reply]
I'm now concerned that there is not a single, identifiable "p-value fallacy". The term seems to refer to any of several fallacies around p-values. I say this because the first two misinterpretations here are mathematically distinct, right?
  1. "...the misinterpretation of a p value as either a direct frequentist error rate,
  2. the probability that the hypothesis is true in light of the data,
  3. or a measure of odds of H0 to H1."
Is the third misinterpretation distinct from the other two? Even if not, are there two "p-value fallacies"? Are there others? Mgnbar (talk) 14:16, 23 February 2016 (UTC)[reply]
Yes, what Mgnbar said. What User:Neodop says above and what it says at P-value_fallacy#Misinterpretation_of_the_p-value looks to me distinct from what User:Sunrise says further above and what it says in the article lede...? Bondegezou (talk) 14:20, 23 February 2016 (UTC)[reply]
What I explained is the same as what Sunrise has written above, which in turn is what Goodman, Sellke and Colquhoun argue (they explain it in different ways but it is the same). To be more clear, what Sunrise means is that testing the significance of a result (p-value) and calculating the false discovery rate of said experiment cannot be done at once, and there is no easy way of connecting one to the other (they are very different concepts). The fallacy consists in thinking that the p-value somehow is or implies the FDR. I hope this helps. I think the initial problem of the article is that it was written based on the Goodman paper, which uses a convoluted and highly technical language. PS: It is also worth noting that the multiple comparison testing part included by Headbomb (the xkcd strip), is related but not essential to this matter since the FDR can effectively be calculated to "interpret a single p-value", as done by Colquhoun in his paper. It would be useful to divide the explanation on the FDR into two parts: one core section on the interpretation of single p-values, and another section dealing with multiple comparisons, their weaknesses and corrections (e.g. Bonferroni). Neodop (talk) 14:51, 23 February 2016 (UTC)[reply]

So let me take a stab at explaining this, based on the Colquhoun paper.

  • Let C be the event that a randomly chosen person has the medical condition in question. Let T be the event that the person tests positive for the condition. The false discovery rate is defined to be P(T, -C | T).
  • The medical test in question is a form of hypothesis test. The null hypothesis is H0 = -C. The alternative is C. Let D be the data. Then the p-value is p = P(D | -C). The connection to T above is that T is the event "p < 0.05".
  • Therefore the p-value fallacy, in its first and primary sense, is the mistaken identification of these two probabilities:
  1. P("p < 0.05", -C | "p < 0.05"),
  2. p.
  • In particular, the p-value fallacy is NOT this other common mistake, of identifying
  1. P(D | -C),
  2. P(-C | D).

I'm not confident that this explanation is correct or even sensible. I'm just trying to convey the kind of mathematical detail that would help me understand how this particular misinterpretation of probabilities differs from another. Also, Wikipedia's audience is diverse. The article should offer an explanation in plain English. But it should also offer an explanation in mathematical notation. Mgnbar (talk) 16:10, 23 February 2016 (UTC)[reply]

Responding to myself: Appendix A.1 of Colquhoun's paper suggests that the p-value fallacy is nothing more or less than P(-C | T) = P(T | -C). Mgnbar (talk) 16:38, 23 February 2016 (UTC)[reply]
User:Sunrise gave the following explanation above: "the fallacious reasoning (as defined in the first sentence) is "result X is statistically significant; therefore result X is true" or "result X is not statistically significant; therefore result X is false."" That formulation, it seems to me, is a simpler misunderstanding. Sunrise's fallacy is not going from P(D | -C) to P(-C | D), but going from P(D | -C) to merely P(C) = 1 or P(C) = 0 depending on the arbitrary 5% cutoff. That's a common misperception about p-values, but I would say a different one to Mgnbar's descriptions, which are indeed other misperceptions.
If Colquhoun means just P(-C | T) = P(T | -C), that is indeed another common misperception, but I still don't see why that isn't covered better in the p-value article. Bondegezou (talk) 16:42, 23 February 2016 (UTC)[reply]
The Dixon paper says, "However, this strategy is prone to the “p-value fallacy” in which effects and interactions are classified as either “noise” or “real” based on whether the associated p value is greater or less than .05. This dichotomous classification can lead to dramatic misconstruals of the evidence provided by an experiment." So, as User:Sunrise described. Bondegezou (talk) 18:21, 23 February 2016 (UTC)[reply]
Goodman, however, who first appears to have used the term "p-value fallacy" is making a more complicated point and is more critical of the whole premise of p-values. Goodman argues that Fisherian p-values and Neyman/Pearson hypothesis testing are not coherent positions. Thus: "The idea that the P value can play both of these roles is based on a fallacy: that an event can be viewed simultaneously both from a long-run and a short-run perspective. In the long-run perspective, which is error-based and deductive, we group the observed result together with other outcomes that might have occurred in hypothetical repetitions of the experiment. In the “short run” perspective, which is evidential and inductive, we try to evaluate the meaning of the observed result from a single experiment. If we could combine these perspectives, it would mean that inductive ends (drawing scientific conclusions) could be served with purely deductive methods (objective probability calculations)." Goodman basically then argues for a Bayesian approach.
Notably Dixon does not cite Goodman's paper, so I don't think it's clear that Dixon and Goodman do mean the same thing by the phrase "p-value fallacy". Bondegezou (talk) 18:34, 23 February 2016 (UTC)[reply]
That's important. The term "p-value fallacy" might show up a lot in the literature, without there being any single standardized meaning for it.
So Dixon's point is basically that the function f : [0, 1] -> {TRUE, FALSE} defined by f(p) = "p < 0.05" is discontinuous?
I still don't understand Goodman's point. Can it be rephrased in probability theory notation, as I tried to do above?
By the way, Colquhoun does not ever use the word "fallacy". So don't take that source to be an authority on what "p-value fallacy" might mean. Mgnbar (talk) 19:46, 23 February 2016 (UTC)[reply]
I interpret Goodman and Dixon as referring to the same phenomenon in different terms. From Goodman, later in the article he rephrases the trade-off as "we could not both control long-term error rates and judge whether conclusions from individual experiments were true." So it's just a more complicated description that incorporates a rationale for why the reasoning fails. The only case where I see alternative definitions is Sellke, and in that case the options still seem to refer to the same essential error, which is the use of p-values as direct evidence about the truth of the hypothesis. As noted above, that paper was also published soon after the term was coined, so it's also possible the other versions just didn't catch on. (If I'm misunderstanding this then we could provide multiple definitions, or merge it somewhere, though I think it would be WP:UNDUE to have all of this at the main p-value article.) Sunrise (talk) 08:27, 25 February 2016 (UTC)[reply]

Bayes factors

The defense of Bayesian analysis should not be as prominent in the article, and definitely not included in the lead (it is highly misleading, no pun intended :)). Most of the authors exposing the misuse of p-values do defend Fisherian hypothesis testing and are highly critical of "subjective" Bayesian factors, etc. particularly Colquhoun. Neodop (talk) 14:57, 23 February 2016 (UTC)[reply]

We seem to have some consensus on that, so I'll try trimming the article accordingly. Revert me if I get it wrong! Bondegezou (talk) 16:43, 23 February 2016 (UTC)[reply]
The Bayes factor comparisons turned out to be a distraction, so I'm happy with those changes. :-) Sunrise (talk) 08:31, 25 February 2016 (UTC)[reply]

False discovery rates

I just put some {{dubious}} tags on the false discovery rate section. My objection is as follows: Suppose that we run many hypothesis tests; we get some number of rejections of the null hypothesis ("discoveries"). The FDR is defined as the expected proportion of those discoveries which are erroneous, i.e., false positives. There are statistical procedures that effectively control the FDR (great!). But if we get 100 rejections using a procedure that produces an FDR of 5%, that means that we expect 5 false positives, not that we have 5 false positives. We might have 1 false positive. Or 10. Because of this, the probability that a given rejection is a false positive is not 5%, even though the FDR is 5%! All we know is that if we were to repeat our experiment over and over, we would expect 5 false positives on average, i.e., the probability of a discovery being a false positive is 5% on average (but perhaps not in the particular experiment we ran). The article does not seem to make this distinction. Ozob (talk) 16:29, 23 February 2016 (UTC)[reply]

@Ozob: "All we know is that if we were to repeat our experiment over and over, we would expect 5 false positives on average" --> that is generally what a probability is. It's only in relation to a similar set of experiments (or a similar imagined set of experiments, with the same conditioning information) that one can talk about probability at all, rather a binary 1/0 of success or failure of a particular trial.
"5% on average (but perhaps not in the particular experiment we ran)" --> if we can identify particular relevant conditioning information, that would distinguish a particular trial from others, then we should bring it to the table, and perhaps fit a more complicated model. Then one can offer different probabilities, conditioned on the particular model they are being conditioned on. But in general, to be able to talk about probability at all, you need to have some notion of what kind of trials you consider to be similar.
It's entirely possible to beat a fair bookie (Bayesian or otherwise) -- if you know something relevant the bookie doesn't. Jheald (talk) 16:32, 24 February 2016 (UTC)[reply]
My point is, however, that there are two probabilities at play. One is the probability that, in a particular experiment, a discovery is a false positive. The other is the probability that, in an average experiment, a discovery is a false positive. These two are not the same; the latter is an expectation over all possible experimental outcomes. The article does not seem to distinguish them. Ozob (talk) 02:57, 25 February 2016 (UTC)[reply]
In a particular experiment there is no probability, just success or failure. Probability only comes in if we can consider the experiment as one of a set of "similar" experiments (whether actual or hypothetical). Different notions of "similar", ex ante, will lead to different probabilities. But that is in the nature of probability. Jheald (talk) 11:41, 25 February 2016 (UTC)[reply]
Sorry, I think we're talking past each other. Perhaps because I'm not a statistician, I'm misusing terminology; so let me try to explain by way of example. Suppose I run a drug trial. That trial measures many outcomes. For instance, I might test for changes in patients' weight, blood pressure, cholesterol, white blood cell count, blood glucose, iron, and so on. Altogether I find 1000 things to test (gotta use that grant money somehow). I perform a hypothesis test on each of these outcomes to see whether administering the drug is statistically significant. Because I'm aware of issues involving multiple testing, I decide to set my significance level to control the false discovery rate of my hypothesis tests. I'm aggressive, so I choose 50% (can't publish if you don't get any positive results). When I calculate my results, I get 4 rejections of the null hypothesis. I am elated: With a false discovery rate of 50%, two of those rejections are true positives. The drug is a success; I will soon be a multibillionaire, and then I can run for president, and... anyway. Ever the fastidious scientist, I want to know which of my four rejections were true positives and which were false positives. I run a larger trial (more grant money!) to test each of the four hypotheses with greater power. This time I get zero rejections of the null hypothesis.
In my first clinical trial, I expected two false positives but got four. Is there a bug in my R code? Perhaps I should have used SPSS. But no, my code is correct. The bug is in thinking that I actually had two false positives. Suppose that I test many drugs using the same protocol. Each time I get some rejections of the null hypothesis, and each time I follow up with a larger trial to determine which rejections were true positives and which were false positives. I find that, on average, half of the rejections are true positives and half are false positives. But in any given clinical trial, sometimes 75% of the rejections are true positives, sometimes 66% are true positives, sometimes 25% are true positives, etc. If I get really unlucky, 0% are true positives.
This is why I say that the false discovery rate is an average over many experiments. In an average experiment, the probability that a rejection of the null hypothesis is a false positive is 0.5. But in any one experiment that may not be so. In the initial experiment outlined above, there were no true positives, so the probability that a rejection is a false positive is 1.0. Ozob (talk) 00:37, 26 February 2016 (UTC)[reply]
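Here is a minimal simulation sketch of the expectation-versus-realization distinction drawn above; the trial sizes, effect sizes and 2% prevalence are invented, and the Benjamini-Hochberg procedure stands in for "a procedure that controls the FDR".

    # Sketch: an FDR-controlling procedure fixes the average false-discovery proportion,
    # but the proportion realized in any single experiment scatters around that average.
    import numpy as np
    from scipy import stats

    def benjamini_hochberg(pvals, q):
        # Boolean mask of rejections under the Benjamini-Hochberg procedure at level q.
        pvals = np.asarray(pvals)
        m = len(pvals)
        order = np.argsort(pvals)
        passed = np.nonzero(pvals[order] <= q * np.arange(1, m + 1) / m)[0]
        reject = np.zeros(m, dtype=bool)
        if passed.size:
            reject[order[:passed[-1] + 1]] = True
        return reject

    rng = np.random.default_rng(2)
    q, m, n = 0.5, 1000, 20            # FDR target, outcomes per trial, patients per arm (invented)
    real = rng.random(m) < 0.02        # only 2% of outcomes have a true drug effect (invented)
    fdps = []
    for _ in range(100):               # rerun the whole trial many times
        pvals = [stats.ttest_ind(rng.normal(0, 1, n),
                                 rng.normal(0.8 if r else 0.0, 1, n)).pvalue
                 for r in real]
        rej = benjamini_hochberg(pvals, q)
        if rej.any():
            fdps.append((rej & ~real).sum() / rej.sum())

    print(np.mean(fdps), min(fdps), max(fdps))   # mean near q, but single trials range widely

The long-run mean of the realized false-discovery proportion sits near the nominal level (more precisely, near the level times the null fraction), while individual trials can land well above or below it, which is exactly the distinction being made in this thread.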

Technical

I have a rudimentary grasp of statistics - this article gets into the weeds way too fast. Can the folks working on this please be sure that the first section describes this more plainly, per WP:TECHNICAL? Thanks. Jytdog (talk) 18:45, 23 February 2016 (UTC)[reply]

Based on the discussion we've been having, I'd say no, we can't describe it more plainly. We can't describe it at all, or even agree what it is. Come back later; we might know what we're talking about by then. Ozob (talk) 03:23, 24 February 2016 (UTC)[reply]
:) Jytdog (talk) 08:31, 24 February 2016 (UTC)[reply]

Some suggestions to reduce confusion

I'm stimulated to write since someone on twitter said that after reading this article he was more confused than ever.

I have a couple of suggestions which might make it clearer.

(1) There needs to be a clearer distinction between (a) false discoveries that result from multiple comparisons and (b) false discoveries in single tests. I may be partially responsible for this confusion because I used the term false discovery rate for the latter, though that term had already been used for the former. I now prefer the term "false positive rate" when referring to the interpretation of a single test. Thus, I claim that if you observe P = 0.047 in a single test and claim that you've made a discovery, there is at least a 30% chance that you're wrong, i.e. it's a false positive. See http://rsos.royalsocietypublishing.org/content/1/3/140216

(2) It would be useful to include a discussion of the extent to which the argument used to reach that conclusion is Bayesian, in any contentious sense of that term. I maintain that my conclusion can be reached without the need to invoke subjective probabilities. The only assumption that's needed is that it's not legitimate to assume any prior probability greater than 0.5 (to do so would be tantamount to claiming you'd made a discovery and that your evidence was based on the assumption that you are probably right). Of course, if a lower prior probability were appropriate then the false positive rate would be much higher than 30%.

(3) I don't like the title of the page at all. "The P value fallacy" has no defined meaning - there are lots of fallacies. There is nothing fallacious about a P value. It does what's claimed for it. The problem arises because what the P value does is not what experimenters want to know, namely the false positive rate (though only too often these are confused).

David Colquhoun (talk) 19:49, 23 February 2016 (UTC)[reply]
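One way to see where figures like "at least 30%" come from, without invoking anything more subjective than 1:1 prior odds, is the calibration bound from the Sellke, Bayarri and Berger paper cited earlier on this page: the Bayes factor in favour of H0 is at least -e*p*ln(p) for p < 1/e. A sketch, with p = 0.047 chosen purely for illustration:

    # Sketch: Sellke/Bayarri/Berger lower bound on the Bayes factor in favour of H0,
    # converted to a minimum posterior probability of H0 assuming 1:1 prior odds.
    import math

    p = 0.047                              # a "just significant" p-value, chosen for illustration
    bf_0 = -math.e * p * math.log(p)       # about 0.39
    min_prob_h0 = bf_0 / (1 + bf_0)        # about 0.28
    print(bf_0, min_prob_h0)

So even on assumptions favourable to the alternative, a result at p = 0.047 leaves the null hypothesis with roughly a 28% (or higher) chance of being true, in the same ballpark as the 30% figure above.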

Thank you. We have been spinning our wheels, trying to help the editors clarify what is going on here. Based on what I've seen, I believe you that there is no single agreed-upon "p-value fallacy".
The editors need to decide what this article is about. If it is about confusing false positive rate with p-value, then maybe it is better treated as part of False positive rate. Mgnbar (talk) 20:07, 23 February 2016 (UTC)[reply]

I notice that the entry on False Positive Rate discusses only the multiple comparison problem, so this also needs enlargement. David Colquhoun (talk) 11:34, 24 February 2016 (UTC)[reply]

From Abracadabra to Zombies: p-value fallacy

Speaking as an average non-sciences reader/editor who was recently (and not out of personal interest) introduced to the p-value concept via Bonferroni correction during a content discussion/debate, I find this explanation clear and straightforward (so hopefully it is accurate as well): the Skeptic's Dictionary: From Abracadabra to Zombies: p-value fallacy.

Also, this article's title seems practical, as p-value fallacy appears to be common enough - as in, for example, randomly, "The fallacy underlying the widespread use of the P value as a tool for inference in critical appraisal is well known, still little is done in the education system to correct it" - therefore, it is the search term I'd hope to find an explanation filed under, as a broad and notable enough subject that merits standalone coverage. --Tsavage (talk) 21:17, 23 February 2016 (UTC)[reply]

"The" p-value fallacy[edit]

Maybe I'm missing something, but there does not yet seem to be consensus that there is a single, identifiable concept called "the p-value fallacy". For example:

  • In #Not explained clearly enough there is a passage from Sellke et al. (2001) that uses the term to denote three distinct fallacies.
  • In #Some suggestions to reduce confusion, David Colquhoun (whose work is cited elsewhere on this talk page as authoritative) explicitly states that there is no single identifiable p-value fallacy, but rather several.

Because there are multiple fallacies around p-values, Google searches for "p-value fallacy" might generate many hits, even if there is no single, identifiable fallacy.

So is this article supposed to be about all fallacies around p-values (like p-value#Misunderstandings)? If not, then is this particular p-value fallacy especially notable? Does it deserve more treatment in Wikipedia than other fallacies about the p-value? Or is the long-term plan to have a detailed article about each fallacy? Mgnbar (talk) 15:25, 3 March 2016 (UTC)[reply]

We should have one single article covering pretty much everything related to p-value fallacies. They're not independently notable and are all interlinked anyway. Headbomb {talk / contribs / physics / books} 17:29, 3 March 2016 (UTC)[reply]
As per WP:CFORK, problems with p-values should be in the p-value article. That the term "p-value fallacy" is used to mean different things shows that it isn't a specific thing warranting an article, it's just a phrase that different people have used to mean different things.
There are significant issues with p-values, so let's put those clearly in the article that people read, which is p-value, and not here in this mess of WP:OR and WP:NEO. Bondegezou (talk) 17:49, 3 March 2016 (UTC)[reply]
I agree. If you think that Wikipedia needs more detailed treatment of fallacies around p-values, then put it into p-value#Misunderstandings, until that section becomes so big that the article forks naturally. Mgnbar (talk) 19:12, 3 March 2016 (UTC)[reply]

Based on the discussion above, I'd like to propose turning the present article into a redirect to p-value#Misunderstandings. There is no one "p-value fallacy" in the literature, so the present article title will always be inappropriate. Moreover, p-value#Misunderstandings is quite short, especially given the breadth of its content. The content currently in this article would be better placed there where it's easier to find and will receive more attention. Ozob (talk) 00:12, 4 March 2016 (UTC)[reply]

Does this article offer anything of value? I mean, should it be merged into p-value or significance testing? Or is there nothing to be saved from here? (I honestly don't have an opinion.) Mgnbar (talk) 00:28, 4 March 2016 (UTC)[reply]
My analysis of the sources is that the term is generally used to refer to a specific independent concept, but I've already explained myself above. Do we have a recent source (not within a few years of the term being coined) which identifies "the p-value fallacy" as something different from that described here? I'd be fine with Headbomb's suggestion that we move the entire article to a location where other misunderstandings/fallacies can also be discussed. But I don't think this material belongs in the p-value article because it would be WP:UNDUE. Sunrise (talk) 01:05, 4 March 2016 (UTC)[reply]
I don't understand your UNDUE argument, Sunrise? If the material is UNDUE in the p-value article, then it's UNDUE in its own article. Bondegezou (talk) 14:00, 4 March 2016 (UTC)[reply]
Whether or not content is undue is affected by the scope of the article. Information on the Flat Earth theory is undue in the Earth article, but not in the Flat Earth article, and so forth. I think this would be excessive detail for the article focused on p-values in general, but not for an article with a narrower focus.
By the way, you're partly misinterpreting WP:CFORK as well. A content fork is not inherently against policy, and is part of the usual WP:Summary style organization of Wikipedia (per the first paragraph on that page). It's WP:POVFORKs that have to be avoided, and that depends on how the article is written. There isn't any consensus on whether an article titled "Criticism of X" is always a POV fork, but nobody is proposing that, and similar types of articles (e.g. Objections to evolution and Creation-evolution controversy come to mind) aren't POV forks as long as they're written neutrally. Sunrise (talk) 02:32, 6 March 2016 (UTC)[reply]

Comment: As I understand it, largely from reading in and around this discussion, what is usually considered the p-value fallacy is, in simple English, misusing p-value testing as a way to determine whether any one experimental result actually proves something.

There are a number of different mistaken beliefs about p-values that can each lead to misuse that results in the same fallacious application, i.e. to establish proof. So there appears to be one distinct overall p-value fallacy that is referred to in multiple sources, regardless of any other p-value misuses that may fall entirely outside of this definition.

This would mean that we have a notable topic (per WP:GNG), and perhaps some clarification to do within that topic. Have I oversimplified or plain got it wrong? (I'm participating as a guinea pig non-technical reader/editor - this article should be clear to me! :)--Tsavage (talk) 20:34, 4 March 2016 (UTC)[reply]

I don't think the literature supports that. What we have is a very large and established literature that discusses various issues with and misunderstandings of p-values, and then a small number of primary sources that use the phrase "p-value fallacy", some independently of each other and meaning different things. The or a "p-value fallacy" is not something you see discussed in good secondary/tertiary sources. Bondegezou (talk) 23:36, 4 March 2016 (UTC)[reply]
Thanks. A question: used as a popular term (e.g. as casual shorthand in blog posts), does p-value fallacy fairly represent the general idea that p-values are widely misused as proof? If so, perhaps "P-value fallacies" instead, with a redirect from "P-value fallacy"? I'm thinking about general readers who might hear the term, or be specifically interested in the "problem with p-values" as encountered elsewhere, and perhaps be best served by an article that directly covers that concern. We wouldn't want to imply that the term is formally established if it isn't, on the other hand, an aspect of p-values that could call into question a vast number of studies seems worthy of in-depth, article-length coverage, discussing the criticisms, possible remedies, and so forth. That's the input I have. If there's an answer to my initial question, I'll read that with appreciation, but I have no further point to argue. :) --Tsavage (talk) 00:50, 5 March 2016 (UTC)[reply]
As I see it, the phrase "p-value fallacy" is unhelpful. There are criticisms of the p-value that argue the whole idea is flawed, and then there are common misinterpretations of p-values. Lumping these together misses that important distinction. Bondegezou (talk) 08:09, 5 March 2016 (UTC)[reply]
Bondegezou, I don't think you're reading the sources correctly. For instance, most of the information cited here appears to be secondary, per the usual definition at WP:PSTS; I wouldn't have used it otherwise. I don't want to repeat myself too much, so I'll just ask again: do we have a recent source (not within a few years of the term being coined) which identifies "the p-value fallacy" as something different from that described here? I'm certainly open to changing my mind if clear evidence is presented. And what do you think of the compromise of moving this to a broader article that addresses misconceptions about p-values in general? Sunrise (talk) 02:32, 6 March 2016 (UTC)[reply]
I for one am open to an article on misconceptions about p-values generally. I think there should be enough content for such an article, and it makes little difference to me whether it grows out of p-value#Misunderstandings or out of the present article. Ozob (talk) 03:16, 6 March 2016 (UTC)[reply]

The main sources used for the substantive content of this article are the papers by Goodman and Dixon. They coined and promulgated the term "p-value fallacy", so they can be considered primary sources. At least, that's how something like WP:MEDMOS would describe them. Bondegezou (talk) 08:21, 6 March 2016 (UTC)[reply]

I think you mean WP:MEDRS, since that's where primary and secondary sources in medicine are defined. But in either case, you'll need to clarify how you think they apply, e.g. medical statistics doesn't fit anywhere under WP:MEDSECTIONS. Again, could you please answer my questions? Sunrise (talk) 06:05, 8 March 2016 (UTC)[reply]
Sunrise, I for one am sorry not to have addressed your questions. But it's because I don't really know how. For example, one of your questions depends on the first occurrence of "p-value fallacy" in the literature. But when was the first occurrence? And has the meaning stabilized since then? How do we know?
To be honest, I feel that we should stop arguing here. We should instead put our time into crafting text about misinterpretations of p-values. Once material is written, it can be put wherever it seems best: P-value fallacy, p-value, or wherever. Mgnbar (talk) 21:39, 8 March 2016 (UTC)[reply]
My apologies for being unclear - I'm assuming that it was coined in 1999, from the statement in the article. It actually isn't supported by Sellke (it looks like the editor who added that got it wrong), but I'm pretty sure I read that in one of the other sources I read, so I haven't addressed that yet.
The other question is just asking for support for moving this page and broadening its scope to include other misinterpretations/fallacies as well. At this point I think we're close to a consensus in favor. Without that article, I don't really think there's a good place to put additional content yet, but I strongly support the idea of trying to focus on writing new material. Sunrise (talk) 06:59, 9 March 2016 (UTC)[reply]


Statement from the American Statistical Association

Sources for future reference:

Manul ~ talk 20:02, 8 March 2016 (UTC)[reply]

Quoting the ASA's principles may help give the Wikipedia article some focus:

  1. P-values can indicate how incompatible the data are with a specified statistical model.
  2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
  4. Proper inference requires full reporting and transparency.
  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

Manul ~ talk 20:11, 8 March 2016 (UTC)[reply]

Thank you. But these are (almost?) all covered at p-value#Misunderstandings. This article, on the other hand, is (currently) concerned with yet another misinterpretation: that a p-value is the same thing as a false positive rate. Mgnbar (talk) 21:32, 8 March 2016 (UTC)[reply]
It is not. Headbomb {talk / contribs / physics / books} 22:38, 8 March 2016 (UTC)[reply]
I could easily be wrong. Apparently, after weeks of following this article, I still don't know what it's about. Unfortunately, your response is too terse to educate me. Mgnbar (talk) 22:46, 8 March 2016 (UTC)[reply]
@Manul, it's great to see you on WP again. :-) I think the ASA is a good source - while the issues are mostly at the p-value page already, it looks like there's a lot of additional information here that we could use for a general article on misinterpretations of p-values.
@Headbomb: well, I'd say that currently it isn't, because of the information on multiple testing! But perhaps the issue will be nullified if we move the article to a broader title.
--Sunrise (talk) 06:59, 9 March 2016 (UTC)[reply]
I agree that the article should be enlarged to include pretty much every misconception about p-values. The core misconception is that a p-value test can be used to produce a yes/no answer to the question of whether or not there exists a 'true effect' vs 'statistical noise'. Nearly all other misconceptions are related to this one, and are always compounded by multiple testing. Statistics Done Wrong is a great resource on this. Headbomb {talk / contribs / physics / books} 12:02, 9 March 2016 (UTC)[reply]
Then I think we should drop "p-value fallacy" as a title and engage the Talk page of the p-value article. Bondegezou (talk) 12:06, 9 March 2016 (UTC)[reply]
Developing the article here will be much more productive than using a different talk page to do so. Headbomb {talk / contribs / physics / books} 12:07, 9 March 2016 (UTC)[reply]
What you are now proposing is a big fork from p-value. You should not do that without engaging editors who work on that article. Bondegezou (talk) 13:23, 9 March 2016 (UTC)[reply]
Everyone's free to participate here. This talk page is, after all, where the notice on p-value is pointing. See also WP:BUREAUCRACY. Headbomb {talk / contribs / physics / books} 15:03, 9 March 2016 (UTC)[reply]
I'm not being WP:BUREAUCRACY-ish. I do not understand why you are so keen to separate everything from an existing article and editing community that covers exactly this topic. We will get better outcomes by involving more editors. Bondegezou (talk) 18:04, 9 March 2016 (UTC)[reply]
There's no separation required for this. The easiest way is just to leave a notice on the relevant talk page, which I now see that you've already done. :-) Sunrise (talk) 00:57, 11 March 2016 (UTC)[reply]

My assumption was that this article would eventually become the expansion of p-value#Misunderstandings, with that section of the p-value article being a summary of this one. I haven't yet delved into the recent discussions, but from an outsider's perspective I can say that that's what readers expect. It would seem super confusing for this article to cover some (or one) misunderstanding but not others. Manul ~ talk 12:45, 10 March 2016 (UTC)[reply]

Move to Misunderstandings of p-values[edit]

I've moved the article, since it seems like we have enough support for it. If the fork becomes unnecessary, we can just merge this into the main article. Since Bondegezou still prefers that option, I left the merge tags open and pointed it to this page instead. In the meantime, help with building additional content is appreciated! Sunrise (talk) 00:56, 11 March 2016 (UTC)[reply]

Okay, I've now made some changes aimed at the different scope, and imported a lot of content from other articles. Explanations for specific edits are in the edit summaries. I used the ASA statement a couple of times, but there's a lot more information that could be added. One important thing is to finish dividing everything into appropriate sections, especially the list of misunderstandings from p-value#Misunderstandings, which is currently in the lead. That will probably need careful reading to figure out which parts are supported by which sources. Once that's done it will be a lot easier to work on expanding the individual sections. Sunrise (talk) 01:56, 11 March 2016 (UTC)[reply]

Equipped with its new mission, the article is rapidly becoming much better. I am glad that we seem to be out of the swamp. Mgnbar (talk) 20:59, 11 March 2016 (UTC)[reply]

The narrow sense of Goodman (1999)[edit]

I finally downloaded the Goodman (1999) paper that apparently coins the term "p-value fallacy" in its narrow sense. It's very different from the statistics that I usually read, so I'm hoping for some clarification here. For starters, Goodman criticizes another source for including this passage:

The statement "P < 0.01" indicates that the discrepancy between the sample mean and the null hypothesis mean is significant even if such a conservative significance level as 1 percent is adopted. The statement "P = 0.006" indicates that the result is significant at any level up to 0.6 percent.

My interpretation of this passage, which is admittedly out of context, is:

The statement "P < 0.01" indicates that the discrepancy between the data and the null hypothesis would be significant, if the threshold α for significance had been set to 0.01 (i.e. 99% confidence). The statement "P = 0.006" indicates that the discrepancy would be significant, if α had been set to any number greater than 0.6% (i.e. any confidence less than 99.4%).

So my questions are:

  1. Is my version of the passage factually correct? Would Goodman agree that it is correct?
  2. Is the original version of the passage incorrect? If so, then what crucial difference between the two versions is Goodman criticizing? Is it an issue of setting α after seeing the data? Is it that the phrase "up to 0.6 percent" must be interpreted "backward"?
  3. Or if the original passage is correct, is Goodman's point merely that it is commonly misinterpreted or misapplied? What is the common misinterpretation? Mgnbar (talk) 20:59, 11 March 2016 (UTC)[reply]
As I understand it, the idea is as follows:
  • The problem is the "dual evidence/error-rate interpretation." In this case, the first sentence is describing an error rate, while the second sentence is describing a measure of evidence (it implies that a lower p-value gives stronger evidence, i.e. "measure[s] how severely the null hypothesis was contradicted by the data"). Both interpretations are valid, but they're incompatible with each other, so they can't both be used at the same time.
  • He would agree with your version, because you converted the second sentence from a claim addressing levels of evidence to a claim addressing (hypothetical) preset error rates. I think he'd add that as long as we adopted the error rate interpretation by setting thresholds ahead of time, then the (true) statement about hypothetical error rates can't be used to make inferences about results. If we did that, one way to interpret it would be as an issue of setting α after seeing the data, as you suggest.
  • On a more speculative level: I think he's also criticizing people who conflate α and p, although I'm not sure how this fits with the rest of it. The point might just be that interpreting p as a significance level is essentially the same as using p to measure evidence. A related point might be that even under the error rate interpretation, the p-value only applies if the null hypothesis is true; if it's false, then we won't get the right value because the mean of the distribution will be different.
Let me know your thoughts. Sunrise (talk) 06:32, 12 March 2016 (UTC)[reply]
Thank you for your response, Sunrise. When I read the original passage, I "auto-correct" it in my mind and thus can't see the problem. But now I get how someone could misinterpret the original passage, even into something as grotesque as p = P(H0 | data). So maybe now I can make some progress on understanding his deeper points, including the "can't be used to make inferences" part. Mgnbar (talk) 12:50, 12 March 2016 (UTC)[reply]

New paper[edit]

This paper seems very useful here. Bondegezou (talk) 09:40, 27 May 2016 (UTC)[reply]

This is also a good summary from Science. Bondegezou (talk) 10:33, 3 June 2016 (UTC)[reply]
Agreed, and thanks. :-) I also have this article by the same author (Goodman) that I've been planning to read through, as well as this one. Sunrise (talk) 23:56, 3 June 2016 (UTC)[reply]

List of Misunderstandings is Problematic and Not Supported by the Cited Sources[edit]

The p-value is not the probability that the null hypothesis is true or the probability that the alternative hypothesis is false. It is not connected to either. - The first sentence is unequivocally accurate. It essentially states that P(A|B) is not generally the same as P(B|A), where A is the event that the null is true, B is the observed data, and the p-value is P(B|A). However, the second sentence seems unnecessary and overly strong in saying the p-value, P(B|A), is "not connected" to the posterior probability of the null given the data, P(A|B). In fact, the two probabilities are, at least in some sense, "connected" by Bayes rule: P(A|B)=P(B|A)P(A)/P(B)
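A tiny numeric sketch may make the distinction concrete. The probabilities below are invented purely for illustration, and the p-value is treated loosely as P(B|A); the point is only that P(B|A) and P(A|B) differ while still being linked by Bayes' rule.

# Minimal sketch with made-up numbers: P(B|A) and P(A|B) differ, yet are
# linked by Bayes' rule P(A|B) = P(B|A) * P(A) / P(B).
p_A = 0.5              # assumed prior probability that the null hypothesis A is true
p_B_given_A = 0.03     # assumed probability of data at least this extreme if A is true
p_B_given_notA = 0.40  # assumed probability of such data if A is false

p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)  # total probability of the data B
p_A_given_B = p_B_given_A * p_A / p_B                  # posterior probability of the null

print(p_B_given_A)            # 0.03 (the p-value-like quantity)
print(round(p_A_given_B, 3))  # about 0.07 -- related by Bayes' rule, but not the same number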

The p-value is not the probability that a finding is "merely a fluke." - I couldn't find the word "fluke" in any of the 3 sources cited for the section, so it is not clear (1) that this misunderstanding is indeed "common," and (2) what the word "fluke" means exactly in this context. If "merely a fluke" means that the null is true (i.e. the observed effect is spurious), then there seems to be no distinction between this allegedly common misunderstanding and the previous misunderstanding. That is, both misunderstandings are the confusion of P(A|B) with P(B|A), where A is the event that the null is true, B is the observed data, and the p-value is P(B|A).

The p-value is not the probability of falsely rejecting the null hypothesis. That error is a version of the so-called prosecutor's fallacy. - Here again, it is not clear exactly what this means, where in the cited sources this allegedly common misunderstanding comes from, and whether or not this "prosecutor's fallacy" is a distinct misunderstanding from the first one. The wiki article on prosecutor's fallacy suggests that there is no distinction--i.e. both misunderstandings confuse P(A|B) with P(B|A), where A is the event that the null is true, B is the observed data, and the p-value is P(B|A).

The significance level, such as 0.05, is not determined by the p-value. - Here again, is this really a common misunderstanding? Where is this allegedly common misunderstanding listed in the cited sources?

It should also be noted that the next section ("Representing probabilities of hypotheses") AGAIN seems to restate the first "common misunderstanding." The section also contains the following weirdly vague statement: "it does not apply to the hypothesis" (referring to the p-value). What is "the hypothesis?" — Preceding unsigned comment added by 50.185.206.130 (talk) 08:46, 6 July 2016 (UTC)[reply]

You keep wanting to use the event A that the null is true. That means you are working in a Bayesian framework. But p-values are valid even outside the Bayesian framework. In frequentist statistics, either the null is true or the null is false, full stop. P(B|A) is well-defined, because it is the probability of the data under the null, but there is no such thing as P(A). This explains the first point: The p-value is not connected to things which, in frequentist statistics, do not exist.
As to the second point, we're doing frequentist statistical inference. Assume for the moment that the null hypothesis is true. If we ran the experiment many times, then, since we're working in a frequentist framework, we would expect to observe events as likely as or less likely than the data with probability equal to the p-value. So we might fairly interpret the p-value as the probability that our initial data was "unlucky", i.e., a "fluke". The problem is that the null hypothesis might actually be false. In that case, the p-value is well-defined but says nothing about reality. Since p-values are used to do frequentist statistical inference, we cannot know a priori which situation we are in, and hence the interpretation of the p-value as the probability of the data being a fluke is generally invalid.
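A small simulation sketch of this point, using SciPy's one-sample t-test (the test, sample size, and seed are arbitrary choices): when the null hypothesis is true, the chance of a "fluke" at level 0.05 is indeed about 0.05, but the same calculation is silent when the null is false.

import numpy as np
from scipy import stats

# Simulate many experiments in which H0 (population mean = 0) is true.
rng = np.random.default_rng(0)
n_experiments, n = 10_000, 30
p_values = np.array([
    stats.ttest_1samp(rng.normal(loc=0.0, size=n), popmean=0.0).pvalue
    for _ in range(n_experiments)
])

# Under a true H0, p-values are (approximately) uniform, so this prints roughly 0.05.
print((p_values < 0.05).mean())
# If H0 were false (loc != 0), this frequency would say nothing about "flukes".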
I believe my description above should make it clear why the third point is not the same as the first one.
The fourth point is, in my experience, extremely common. People assume that if the p-value is, say, 0.035, then the result is more reliable than if the p-value is 0.04. This is not how frequentist hypothesis testing works. You set a significance level in advance of analyzing any data, and you draw your conclusion solely on the basis of whether the p-value is larger or smaller than the significance level. But people want to pretend that they have a prior distribution on the null and alternative hypotheses.
I continue to believe that this section of the article is correct and useful, so I've restored it. Ozob (talk) 16:17, 10 July 2016 (UTC)[reply]

Regarding your response to the first point, sure, the null is either true or false. But if someone doesn't know whether it's true or false, I don't see a problem with that person speaking in terms of probabilities based on the limited information they have access to. By analogy, if you don't know what card I've randomly drawn from a deck, you could speak of the probability of it being a red suit or a face card or the Ace of Spades, even though from an omniscient perspective there is no actual uncertainty--the card simply is what it is. I'm aware that there are different philosophical perspectives on this issue, but they are just that--perspectives. And if you're uncomfortable using the term "probability" for events in a frequentist framework, you can simply substitute "long-term frequency." In any case, I don't see how including the vague and potentially controversial statement that "it is not connected to either" is at all necessary or adds anything useful to the article section; the immediately preceding sentence is sufficient and straightforward.

Your response to the second point isn't clear to me. So "a fluke" means "unlucky?" And what is the "finding" in the phrase "finding is merely a fluke?" The data? So there is a common misunderstanding that the p-value is the probability of the data being unlucky? It's hard to see how that is even a coherent concept. Perhaps the misunderstanding just needs to be explained more clearly and with different vocabulary. Indeed, as I noted previously, the word "fluke" does not appear in any of the cited sources.

You didn't really respond to the third point, except to say that your response to the second point should apply. It seems we agree that the first misunderstanding is p = P(A|B) (even though you've noted that it's debatable whether P(A|B) is coherent in a frequentist framework). Isn't the "prosecutor's fallacy" also p = P(A|B)? In fact, the wiki article on prosecutor's fallacy appears to describe it precisely that way (except using I and E instead of A and B). Maybe part of the problem is the seemingly contradictory way the alleged misunderstanding is phrased: first it's described as thinking the p-value is the probability of falsely rejecting the null hypothesis (which appears to mean confusing the p-value with the alpha level), and then it's described as "a version of prosecutor's fallacy" (which appears to be something else entirely).

Your response to the fourth point seems to be POV. The functional difference between p=.04 and p=.035 may be relatively trivial in most cases, but p=.0000001 need not be treated as equivalent to p=.049999999 just because both are below some arbitrarily selected alpha level. Here again, there may be different perspectives on the issue, but we are supposedly talking about definitive misunderstandings, not potential controversies.

You didn't respond to my last point, regarding the "Representing probabilities of hypotheses" section. — Preceding unsigned comment added by 23.242.207.48 (talk) 18:30, 10 July 2016 (UTC)[reply]

If someone is speaking of the truth or falsity of the null hypothesis in terms of probabilities, then they are adopting a Bayesian viewpoint. A Bayesian approach is incompatible with frequentist hypothesis testing. Consider the card example. You draw a card, and I try to predict whether it's the ace of spades or not. Let's assume that the deck has been well-shuffled. Then I can model your draw as being a uniform random variable on the deck. From a frequentist perspective, if we shuffle and draw from the deck repeatedly, we will observe each card with equal probability; from a Bayesian perspective, my belief that the deck was well-shuffled leads me to adopt the prior that puts equal probability on each of the cards.
So far there is no hypothesis to test. Let's say that there are two possibilities: One, the deck is a standard 52-card deck. Two, the deck is a trick deck in which every card is the ace of spades. Only one of these possibilities can hold, of course. Let's say that you draw a card. I observe the card and attempt to determine whether the deck is a standard deck or a trick deck. In a frequentist approach, I would choose a null hypothesis, say that the deck is standard, and set a significance level α, say 0.05. Under the null hypothesis, the probability that you drew the ace of spades is 1/52 ≈ 0.02. So if I observe the ace of spades, then I will reject the null hypothesis. Now let's suppose that we repeat the experiment twice. On the first draw, you draw the ace of spades. On the second draw, you draw the seven of diamonds. Under the null hypothesis, the probability of observing at least one ace of spades is 1 minus the probability of observing no aces of spades, that is, it's 1 − (51/52)² ≈ 0.038, which is less than α = 0.05. Therefore I reject the null hypothesis and conclude that the deck is a trick deck. This is a ridiculous conclusion, but the logic is impeccable. Notice that none of my computations involved the alternative hypothesis. Notice also that I didn't attempt to assign probabilities to the deck being standard or a trick deck. This is, depending upon the situation and your viewpoint, either a feature or a bug of frequentist hypothesis testing. I think we would both agree that in this example, it's a bug.
In a Bayesian approach, I select a prior P on the two possibilities. Perhaps I believed that you decided to use a standard deck or a trick deck based on a fair coin flip, so I assign a prior probability of 0.5 to each possibility. After observing each draw, I update my prior. If the first draw is an ace of spades, I update my prior to P(standard deck|ace of spades) = P(ace of spades|standard deck)P(standard deck)/P(ace of spades) = (1/52)(1/2)/(1/52 ⋅ 1/2 + 1 ⋅ 1/2) = 1/53 and P(trick deck|ace of spades) = P(ace of spades|trick deck)P(trick deck)/P(ace of spades) = (1)(1/2)/(1/52 ⋅ 1/2 + 1 ⋅ 1/2) = 52/53. If the second draw is the seven of diamonds, I update my prior again to P(standard deck|ace of spades, seven of diamonds) = P(seven of diamonds|standard deck, ace of spades)P(standard deck|ace of spades) / P(seven of diamonds|ace of spades) = (51/52)(1/53)/((1/53) ⋅ (51/52) + (52/53) ⋅ 0) = 1 and P(trick deck|ace of spades, seven of diamonds) = P(seven of diamonds|trick deck, ace of spades)P(trick deck|ace of spades) / P(seven of diamonds|ace of spades) = (0)(52/53)/((1/53) ⋅ (51/52) + (52/53) ⋅ 0) = 0. Usually, of course, one doesn't end up with absolute certainty, so it's more common in Bayesian statistics to report the Bayes factor, the ratio of posterior odds to prior odds. If there were still some chance that it was a trick deck (perhaps 51 of the cards were aces of spades while the remaining card was the seven of diamonds), I could make further draws. Notice that in the Bayesian framework, we can talk about the probability of the null hypothesis being true.
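The same update can be checked mechanically. The sketch below assumes each draw comes from a freshly shuffled deck (so the likelihood used for the second draw is 1/52 rather than 51/52), but the posteriors come out the same as above.

from fractions import Fraction

prior = {"standard": Fraction(1, 2), "trick": Fraction(1, 2)}  # fair-coin prior on the two decks

def likelihood(card, deck):
    # Probability of drawing `card` from a freshly shuffled `deck`.
    if deck == "standard":
        return Fraction(1, 52)
    return Fraction(1) if card == "ace of spades" else Fraction(0)  # trick deck: all aces of spades

def update(prior, card):
    evidence = sum(likelihood(card, d) * p for d, p in prior.items())
    return {d: likelihood(card, d) * p / evidence for d, p in prior.items()}

posterior = update(prior, "ace of spades")
print(posterior)   # standard: 1/53, trick: 52/53
posterior = update(posterior, "seven of diamonds")
print(posterior)   # standard: 1, trick: 0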
So when you said earlier, "In fact, the two probabilities are, at least in some sense, "connected" by Bayes rule: P(A|B)=P(B|A)P(A)/P(B)", well, that's well-defined in a Bayesian framework. But p-values are a frequentist concept, and there, P(A) and P(A|B) aren't well-defined concepts. This invalidates your first point. In response to your third point: Suppose one adopts a quasi-Bayesian framework and claims that P(A|B) is well-defined; many people do this without even realizing it. Then it becomes possible to assert the prosecutor's fallacy, which is false even if one believes that P(A|B) is well-defined. So this is a distinct problem from the first point.
As regards the second point, I don't understand the point you're trying to make. It seems to me that you're willfully misunderstanding plain English. See definition 3 here.
The fourth point is not POV; it is simply a consequence of the assumptions of frequentist hypothesis testing. One can say a posteriori that, if we observed p=.0000001, then we could have taken a much smaller value of α and still seen a significant result. But choosing the level of significance after observing the data is a statistical fallacy.
As to your final point, the antecedent of "the hypothesis" is the "null hypothesis". The point the section is making is that p-values are a property of data, not of a hypothesis. I don't think that point is made elsewhere. Ozob (talk) 00:14, 11 July 2016 (UTC)[reply]

Your entire response to the first point is a red herring. The American Statistical Association's official statement on p-values (which is cited in this article; http://www.tandfonline.com/doi/full/10.1080/00031305.2016.1154108) notes that p-values can be used for "providing evidence against the null hypothesis"--directly contradicting the claim that p-values are "not connected" to the probability of the null hypothesis. If you insist that someone has to be called "Bayesian" to make that connection, fine--it is a connection nonetheless (and it is the connection that p-values' usefulness depends on). Furthermore, none of your response substantively speaks to the issue at hand: whether the statement "it is not connected to either" should be included in the article. Even if we accept your view that P(A|B) is meaningless, the disputed statement in the article does not communicate that premise. The article does not say, "The p-value is not the probability that the null hypothesis is true or the probability that the alternative hypothesis is false. Those probabilities are not conceptually valid in the frequentist framework." Instead, the article says, "The p-value is not the probability that the null hypothesis is true or the probability that the alternative hypothesis is false. It is not connected to either." Thus, even if we accept your premise, the statement is not helpful and should be removed. In fact, saying P(A|B) is "not connected" to P(B|A) might be taken to imply that the two probabilities orthogonally coexist--which would directly contradict your view. Given that there is no apparent reason for you to be attached to the disputed sentence even if all your premises are granted, I hope you will not object that I have removed it.

Regarding the second point, you defined "fluke" as "unlucky." I responded that "the probability that the finding was unlucky" (1) is an unclear concept and (2) does not obviously relate to any passages in the cited sources (neither "fluke" nor "unlucky" appear therein). Hence, with regard to your ad hominem, I do understand English--that doesn't make all combinations of English words intelligible or sensible. I repeat my suggestion that if there is an important point to be made, better vocabulary should be used to make it. Perhaps the business about "flukes" comes from the ASA's statement that "P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone" (http://www.tandfonline.com/doi/full/10.1080/00031305.2016.1154108). Note that the statement combines the fallacy regarding the null being true and the fallacy regarding the data being produced by random chance alone into a single point. Why not use a similar approach, and similar language, in the wiki article? I hope you will not object that I have made such an adjustment.

Regarding the third point (regarding "prosecutor's fallacy"), you don't have a response adequately demonstrating that (1) the proposed misunderstanding is consistent with how prosecutor's fallacy is defined (note that the article equates prosecutor's fallacy with thinking the p-value is the probability of false rejection), (2) the proposed misunderstanding is non-redundant (i.e. prosecutor's fallacy should be distinct from the first misunderstanding), and (3) the proposed misunderstanding is listed in the cited sources (note that "prosecutor's fallacy" is not contained therein). In fact, your description of prosecutor's fallacy is EXACTLY misunderstanding #1--whether you're a "quasi-Bayesian" or a "Bayesian," the fallacy is exactly the same: P(A|B)=P(B|A). What framework is used to derive or refute that fallacy doesn't change the fallacy itself.

Regarding the fourth issue, if the point is that the alpha level must be designated a priori rather than as convenient for the obtained p-value, then we are certainly in agreement. I have not removed the item. But if this point is indeed commonly misunderstood, how about providing a citation?

Regarding the final issue, I see the point you are making. I hope you will not object that I have slightly adjusted the language to more closely match the ASA's statement that the p-value is "a statement about data in relation to a specified hypothetical explanation, and is not a statement about the explanation itself." — Preceding unsigned comment added by 23.242.207.48 (talk) 14:36, 11 July 2016 (UTC) 23.242.207.48 (talk) 14:45, 11 July 2016 (UTC)[reply]

I think we may be closing in on common ground. I believe that we mostly agree on the underlying concepts and ideas and are now trying to agree on phrasing.
You object to the phrase "not connected". I think that's a fair objection, because I agree that p-values can provide evidence for or against the null hypothesis. (Indeed, we reject or fail to reject the null hypothesis precisely on the basis of this evidence.) This comes with the caveat that this evidence is not of a probabilistic nature; it is still improper, in a frequentist setting, to discuss the probability of a hypothesis being true or false. But I think it's fine to delete the "not connected" clause.
I would prefer not to merge the first two bullets, so I've split them apart. I'm still mystified as to why you dislike "fluke", but I'm happy with the wording you chose.
I believe I have more than replied to your arguments about the prosecutor's fallacy (contrary to your edit summary), but let me expand further. I believe that point #1 is about the non-existence, in frequentist statistics, of P(null) and P(alternative). Indeed, the article says as much when it says frequentist statistics "does not and cannot attach probabilities to hypotheses". Whether P(null|data) even exists, let alone its meaning, is not addressed. Consequently it is impossible for point #1 to express the prosecutor's fallacy. The ASA's statement, however, expresses a version of this when it says, "P-values do not measure the probability that the studied hypothesis is true". The P value is of course P(data|null), and one way to measure the probability that the studied hypothesis is true would be P(null|data) (assuming this is well-defined). Asserting that p-values measure the probability that the studied hypothesis is true is therefore the prosecutor's fallacy.
The citation you asked for is pretty well covered by reference 2, which says, "Thus, in the Neyman-Pearson approach we decide on a decision rule for interpreting the results of our experiment in advance, and the result of our analysis is simply the rejection or acceptance of the null hypothesis. ... we make no attempt to interpret the P value to assess the strength of evidence against the null hypothesis in an individual study."
Finally, I'd like to say that I respect you. I don't think that's always shown through in this discussion, but I do think you know what you're talking about, and I think Wikipedia is better for your efforts. Thank you! Ozob (talk) 03:21, 12 July 2016 (UTC)[reply]

I am pleased that we are finding common ground. You describe prosecutor's fallacy as "asserting that p-values measure the probability that the studied hypothesis is true" (direct quote). In the article, misconception #1 is described as thinking the p-value is "the probability that the null hypothesis is true" (direct quote). That makes misconception #1 essentially a word-for-word match for your definition of prosecutor's fallacy. It's hard to see how one could justify saying those are two different misconceptions when they are identically defined. It seems that you are making a distinction between two versions of an objection to the fallacy rather than between two different fallacies; perhaps the reference to prosecutor's fallacy should be moved to misconception #1.

Note also that your definition of prosecutor's fallacy doesn't match the way it's described in the bold text of misconception #3. Indeed, "the probability of falsely rejecting the null hypothesis" (article's words) is certainly not the same thing as "the probability that the null hypothesis is true" (your words). Thus, there is another reason the reference to prosecutor's fallacy does not seem to belong where it appears. — Preceding unsigned comment added by 23.242.207.48 (talk) 10:22, 12 July 2016 (UTC)[reply]

Ah, this is a good point. I think what I want to do is distinguish the misconception "p = P(null)" from the misconception "p = P(null|data)". I think what you quoted above (which is actually from the ASA, not me) could be construed either way. I propose moving the bullet point about the prosecutor's fallacy to be immediately after the first bullet, and changing the wording to: "The p-value is not the conditional probability that the null hypothesis is true given the data." What would you think of that? Ozob (talk) 00:49, 13 July 2016 (UTC)[reply]

I can't say I'm convinced. I'm also wary that expanding on the ASA's descriptions would violate wikipedia standards on original research, unless there are other reputable sources that explicitly identify two distinct common misunderstandings as you propose.

I also find misunderstanding #4 a bit peculiar: "The p-value is not the probability that replicating the experiment would yield the same conclusion." Are there really people who think a very low p-value means the results aren't likely to be replicable? It's hard to imagine someone saying, "p =.0001, so we almost certainly won't get significance if we repeat the experiment." I doubt many people think super-low p-values indicate less reliable conclusions. I also couldn't find this misunderstanding listed in any of the cited sources. Which paper and passage did it come from? 23.242.207.48 (talk) 00:07, 14 July 2016 (UTC)[reply]

I agree that there is a possible problem with original research here. And, now that I look at it, misunderstanding #4 looks backwards: I suspect it's intended to say, "The p-value is not the probability that replicating the experiment would yield a different conclusion." However, I'm not the one who originally wrote this material, so I can't say for sure. I think for a more detailed response, we should consult User:Sunrise, who originally wrote this list. I presume he would know where he got it from. Ozob (talk) 01:33, 14 July 2016 (UTC)[reply]
Thanks for the ping! I'm not the writer either though - I only transferred it from the p-value article, where it had already existed for some time. I'm glad to see that editors have been checking it over. Sunrise (talk) 20:09, 14 July 2016 (UTC)[reply]

That rewriting still seems weird to me. So, many people think that a very high p-value (e.g. p=.8) means they will probably get significance if they repeat the experiment? I've never heard that. I'm removing misunderstanding #4 pending a sourced explanation. 23.242.207.48 (talk) 11:08, 14 July 2016 (UTC)[reply]

Ah, good point. I don't see what the intended meaning must have been, so I support your removal. Ozob (talk) 12:23, 14 July 2016 (UTC)[reply]

What should we do with the list?[edit]

Based on the discussion so far, it seems like the quality of the list of misunderstandings is doubtful. I feel like we need to clean up the list: Each misunderstanding should come with an inline citation, and the language should be carefully checked to ensure that it is correct and reflects what is in the source. Would anyone like to volunteer? Or propose a different solution? Ozob (talk) 23:16, 14 July 2016 (UTC)[reply]

I don't think anyone should be opposed to better citation practices. :-) I definitely don't think it would be wasted effort, and there's also a better selection of sources available now. I'd also note that with the current section heading, misunderstandings are being divided into "common" and "uncommon" by implication, which itself needs to be supported in the sources. A structure chosen with that in mind, maybe focusing on one or two main sources like the ASA statement, would be an improvement.
Rewriting as prose is probably a good option - I think having a section in list format doesn't fit with the rest of the article, and leads to a lot of overlap. Some of the information could be moved to the "Representing probabilities" section, for example. Maybe part of it could also be repurposed for a general introduction to the article, although that might fit better in the lead if it isn't too long. Sunrise (talk) 06:40, 15 July 2016 (UTC)[reply]

This is what I have so far:

Proposal

The following list addresses several common misconceptions regarding the interpretation of p-values:

  1. The p-value is not the probability that the null hypothesis is true, or the probability that the alternative hypothesis is false.[1] A p-value can indicate the degree of compatibility between a dataset and a particular hypothetical explanation (such as a null hypothesis). However, it is not a statement about the null or alternative hypotheses.[1] In fact, frequentist statistics does not and cannot attach probabilities to hypotheses; a p-value can be very close to zero when the posterior probability of the null is very close to unity (Lindley's paradox).[citation needed] Similarly, the p-value is not the probability of falsely rejecting the null hypothesis.[citation needed]
  2. The p-value is not the probability that the observed effects were produced by random chance alone.[1] The p-value is computed under the assumption that a certain model, usually the null hypothesis, is true. This means that it is a statement about the relation of the data to a proposed hypothetical explanation, not a statement about the explanation itself. For example, other hypotheses could explain the observed data just as well or better than the one being analyzed.[1]
  3. The division of results into significant and non-significant is arbitrary.[2] The significance level is decided by the person conducting the experiment (with the value 0.05 widely used by the scientific community) before the data are viewed, and it is compared against the calculated p-value after the test has been performed.[citation needed] (However, when reporting results, giving the precise p-value is more useful than simply saying that the results were or were not significant at a given level.[2])
  4. The p-value does not indicate the size or importance of the observed effect.[1] That is, a small p-value can still be observed for an effect size which is not meaningful or not important. However, the larger the effect, the smaller the sample size that will be required to get a significant p-value (see effect size).[citation needed]
  5. In the absence of other evidence, the information provided by a p-value is limited. A p-value near 0.05 is usually weak evidence.[1][2]
  6. p-values do not account for the effects of confounding and bias.[2]

References:

Improvements would be appreciated. Have I interpreted everything correctly? Did I miss anything? Can we find citations for the unsourced parts? (at least a couple of them should be easy) There's also a comment in Sterne that directly addresses prevalence of a misconception, specifically that the most common one is (quote)"that the P value is the probability that the null hypothesis is true, so that a significant result means that the null hypothesis is very unlikely to be true," but I wasn't sure about how to best include that. Perhaps that (or other parts of the section) could be useful for the main p-value article. Sunrise (talk) 08:11, 17 July 2016 (UTC)[reply]

I've replaced the list in the article with the one above. Ozob (talk) 23:34, 18 July 2016 (UTC)[reply]
"In the absence of other evidence, the information provided by a p-value is limited. A p-value near 0.05 is usually weak evidence.[1][2]"" What? In the absence of other evidence, the information provided by the p value is indeed limited! That's not a misconception! Unless a much, much higher threshold of significance is chosen (e.g. 0.001). Likewise for "The division of results into significant and non-significant is arbitrary." Since the significance threshold is chosen by the experiment, this is quite arbitrary indeed! Headbomb {talk / contribs / physics / books} 02:03, 19 July 2016 (UTC)[reply]
The bolded statements in the list are intended to be true, not misconceptions. And, even in the presence of a very small threshold like 10⁻⁶, in the absence of other evidence the information provided by the p-value is still very limited. I might be able to definitively reject the null hypothesis while not feeling confident in the alternative hypothesis I chose to test. Ozob (talk) 03:02, 19 July 2016 (UTC)[reply]
Then the header should be updated, because it says these are common misconceptions. I'd much prefer rephrasing in terms of what the misconceptions are, however. Headbomb {talk / contribs / physics / books} 03:46, 19 July 2016 (UTC)[reply]
I feel that the article should state truths and explain why they're true, rather than state falsehoods and explain why they're false. I've edited the header. If you can think of a better header I would welcome it. Ozob (talk) 12:46, 19 July 2016 (UTC)[reply]
I've undone a couple of the changes made by the IP, with brief reasoning given in my edit summaries. Could we come to agreement here before making changes? The header seems to be one of the key points of disagreement. Sunrise (talk) 00:54, 21 July 2016 (UTC)[reply]

The "False Discovery Rate" section should be removed[edit]

Reasons:

1. The FDR section appears to refer to a misinterpretation of alpha levels--not a misinterpretation of p-values (note that p0 in the formula is the alpha level, not the p-value). Thus, the section is irrelevant to the article.

2. The statement that FDR increases when the number of tests increases is false. In fact, the FDR can either increase, decrease, or stay the same when the number of tests increases.

3. The given definition of the FDR appears to be incorrect ("the odds of incorrectly rejecting the null hypothesis"). The FDR is conventionally defined as the expected proportion of rejections that are incorrect (and defined as 0 when there are no rejections). — Preceding unsigned comment added by 2601:644:100:74B7:494A:EDB1:8541:281E (talk) 06:59, 7 July 2016 (UTC)[reply]

The FDR is related to misunderstandings of p-values, so the section should remain. For the rest, please provide references. Headbomb {talk / contribs / physics / books} 19:49, 8 July 2016 (UTC)[reply]
-Just saying it's related doesn't make it so. For the definition of the FDR, see the wiki article on the false discovery rate or see the original paper defining it (Benjamini & Hochberg, 1995). Using the correct definition, it's obvious that statements such as "FDR increases when the number of tests increases" are absurd--the FDR is an expected PROPORTION of tests, not a NUMBER of tests. Hence, if more tests are added, the FDR will only go up if the proportion of nulls that are true is higher among the added tests than in the original set of tests. On the other hand, if the proportion of nulls that are true is lower among the added tests than in the original set, then the FDR will go down. And if the proportion of true nulls remains constant as the number of tests increases (e.g., if we assume all nulls are true or we assume some ratio of nulls are true regardless of the number of hypotheses), then the FDR will remain constant. Again, see the Benjamini & Hochberg (1995) paper, which is the definitive reference for the FDR. 23.242.207.48 (talk) 21:47, 8 July 2016 (UTC)[reply]
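A quick simulation sketch of that claim (the effect size, sample size, significance level, and test counts below are all arbitrary assumptions): the realized false discovery proportion, averaged over many simulated studies, tracks the share of true nulls among the hypotheses rather than the raw number of tests.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulated_fdr(n_tests, frac_true_nulls, effect=1.0, n=25, alpha=0.05, reps=500):
    # Average false discovery proportion over `reps` simulated studies.
    fdps = []
    for _ in range(reps):
        n_null = int(n_tests * frac_true_nulls)
        means = np.array([0.0] * n_null + [effect] * (n_tests - n_null))
        data = rng.normal(loc=means, size=(n, n_tests))     # one column per hypothesis
        p = stats.ttest_1samp(data, popmean=0.0).pvalue     # per-column p-values
        rejected = p < alpha
        false_rejections = rejected[:n_null].sum()           # first n_null columns are true nulls
        fdps.append(false_rejections / rejected.sum() if rejected.any() else 0.0)
    return np.mean(fdps)

print(simulated_fdr(20, 0.5), simulated_fdr(200, 0.5))  # same null fraction: similar FDR despite more tests
print(simulated_fdr(200, 0.1))                          # more tests but fewer true nulls: FDR goes down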
It clearly is related. The FDR depends on the p_0 value (which you call alpha), which is the significance threshold that p must not exceed according to certain statistical tests. I could quote from doi:10.1098/rsos.140216, but really the whole thing is about the relation between the FDR and the p-value. Headbomb {talk / contribs / physics / books} 00:42, 9 July 2016 (UTC)[reply]

As I noted in my previous comment, the section does not even correctly define the FDR itself--let alone the "relation between the FDR and the p-value." 23.242.207.48 (talk) 01:08, 9 July 2016 (UTC)[reply]

Who said the FDR gave a number of tests? It's clearly written "In general, for n independent hypotheses to test with the criteria p < p0, the probability of obtaining a false positive is given as..." (emphasis mine). All variables are defined and explained, as is the relation between the FDR and the p-value significance threshold chosen. Headbomb {talk / contribs / physics / books} 02:09, 9 July 2016 (UTC)[reply]

- That is simply incorrect. The FDR is not, as you have claimed, the probability of obtaining a false positive. The FDR is the expected proportion of significant tests (i.e., "positives") that are false positives. You are perhaps confusing the FDR with the familywise Type I error rate--just as the FDR section of this article does. Look at the formula given in the section--it is the formula for the familywise error rate, NOT FOR THE FDR! I again encourage you to actually read the wiki article on the FDR or, better yet, read the original Benjamini & Hochberg article that introduced the quantity in the first place. 23.242.207.48 (talk) 03:42, 9 July 2016 (UTC)[reply]

I agree. The false discovery rate article is quite clear, and the present article was wrong. I've deleted the offending section. Ozob (talk) 14:23, 9 July 2016 (UTC)[reply]
The passage is supported by a WP:RS that is a dedicated paper on the connection between the FDR and misunderstandings of p-values, so it needs to stay. Unless you have a source that says there is no connection between the FDR and misunderstandings of p-values. If the objection is that the formula here is the FWER and not the FDR as commonly understood, then using the better term is a much better approach than deleting the section outright. Headbomb {talk / contribs / physics / books} 15:09, 9 July 2016 (UTC)[reply]
My objection is that what was written in the article is wrong. That's all. I haven't looked in the references, but I imagine they're saying something other than what used to be in the article. If you can write something correct based on those references, then perhaps it would fit well in the article. Ozob (talk) 18:18, 9 July 2016 (UTC)[reply]

Should the paragraph about the jellybean comic strip be removed? (in the multiple comparisons section)[edit]

As clever as the comic strip may be, it doesn't seem very encyclopedic to spend a paragraph summarizing it in this article. Similarly, it wouldn't make sense to dedicate a paragraph to summarizing the film Jaws in an article about great white sharks (though the film might be briefly mentioned in such an article).

The paragraph is also somewhat confusingly written (e.g. what does "to p > .05" mean?, what does "threshold that the results are due to statistical effects" mean?, and shouldn't "criteria of p > 0.05" be "criteria of p < 0.05?").

Another concern is that the punchline "Only 5% chance of coincidence!" is potentially confusing, because "5% chance of coincidence" is not an accurate framing of p < .05 even when there is only a single comparison.

If the jellybean example is informative enough to merit inclusion, I suggest either rewriting the summary more clearly and concisely (and without verbatim transcriptions such as "5% chance of coincidence"), or simply removing the summary and linking to the comic strip in the further reading section. 23.242.207.48 (talk) 17:51, 12 July 2016 (UTC)[reply]

It's a very well supported example, extremely useful to illustrate p-hacking / the multiple comparison issue, and used by several expert sources, including the people at Minitab, and in Statistics Done Wrong. That the example originated in a comic strip is inconsequential. Headbomb {talk / contribs / physics / books} 19:55, 12 July 2016 (UTC)[reply]
Sorry about the removal. That was accidental.
But, I'm not fond of the comic strip. It's meant to make people laugh, not to give people a deep understanding of the underlying statistical issues. For instance, here is a way in which the comic strip is just wrong: Assuming that the null hypothesis is true and that the p-values under the null hypothesis are being computed correctly, then the expected number of false positives at a significance level of α = 0.05 is one. The probability of having at least one false positive is 1 − (1 − 0.05)²⁰ ≈ 0.64. So I can't think of an interpretation of the phrase "5% chance of coincidence" that makes sense and is correct. Perhaps it's meant ironically (since it appears in the newspaper), but if that's true, then I think that point is lost on most readers. Ozob (talk) 23:45, 12 July 2016 (UTC)[reply]
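For concreteness, the two quantities being contrasted here (assuming 20 independent tests of true null hypotheses at α = 0.05) are easy to compute:

alpha, n_tests = 0.05, 20
expected_false_positives = n_tests * alpha        # 1.0, the expected count of false positives
prob_at_least_one = 1 - (1 - alpha) ** n_tests    # about 0.64, not 0.05
print(expected_false_positives, round(prob_at_least_one, 2))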
That is exactly the point of the comic. The media is claiming this is an astonishing discovery, while this falls completely within the expectations for the null hypothesis (one expected false positive if your criterion for significance is p ≤ 0.05 and you test 20 different kinds of jellybeans). Headbomb {talk / contribs / physics / books} 23:56, 12 July 2016 (UTC)[reply]
You seem to be saying that the point of the comic is that some people confuse probability with expected value. If that's true, then the comic has nothing to do with p-values, so it's irrelevant to the current article. Ozob (talk) 01:58, 13 July 2016 (UTC)[reply]
The point is that people don't understand that p-values cannot be used this way. If that is not a misunderstanding of p-values, nothing qualifies as a misunderstanding. Headbomb {talk / contribs / physics / books} 02:18, 13 July 2016 (UTC)[reply]
It's not just p-values that can't be used in this way. Nothing can be used in this way (without committing an error). So this misunderstanding seems to be more about the nature of probability than about p-values. Accordingly I've removed the comic from the article. Ozob (talk) 03:13, 13 July 2016 (UTC)[reply]
This is a pretty clear-cut case of a misunderstanding of p-values, and features in at least three different reliable publications on the explicit topic of p-values and their misunderstandings. I've restored the material given that you offer no sensible objection to it beyond your personal dislike. If it's good enough for these sources, it's good enough for wikipedia. I'll go further, and add that without examples, this article is downright useless to anyone but statisticians. Headbomb {talk / contribs / physics / books} 03:38, 13 July 2016 (UTC)[reply]
I love xkcd and I think the particular comic under discussion is great. I note other people have used this comic as an illustration. (I've used xkcd to illustrate issues in research methods in my own work as a scientist, although not this particular comic.) However, the presentation of this comic in this article seems wrong to me. As an encyclopaedia, we should explain the issues around p-values in clear terms. Explaining someone else's comic (in both senses) explanation overly complicates the situation. I agree with others that we should drop it. Bondegezou (talk) 09:45, 13 July 2016 (UTC)[reply]

I've cleaned up the example so it references the comic without doing a frame-by-frame summary and without the confusing language. Perhaps this is a reasonable compromise? I'm still of the mind that the reference to the comic should probably be removed altogether, but it should at the very least be grammatically and scientifically correct in the meantime. 23.242.207.48 (talk) 10:19, 13 July 2016 (UTC)[reply]

With this new text I'm willing to let the comic stay. Ozob (talk) 12:13, 13 July 2016 (UTC)[reply]
I can live with this, yes. I've added the general formula however, and tweaked some of the wording. Hopefully this is acceptable? Headbomb {talk / contribs / physics / books} 12:26, 13 July 2016 (UTC)[reply]
That is better, but I still think the whole paragraph can go. We have a main article tag to Multiple comparisons problem and see alsos to p-hacking and Type I error. We don't need much text here when we have those articles elsewhere. Bondegezou (talk) 08:32, 14 July 2016 (UTC)[reply]

I'm inclined to agree with Bondegezou that detailed repetition of information available in other articles is unnecessary. By the same token, this whole article is arguably unnecessary and would be better as a short section in the p-value article than as a sprawling article all to itself, without much unique material (but it seems that issue has been previously discussed and consensus is to keep it). — Preceding unsigned comment added by 23.242.207.48 (talk) 11:00, 14 July 2016 (UTC)[reply]

I think it would be fine to condense those paragraphs down to a single sentence, "The webcomic xkcd satirized misunderstandings of p-values by portraying scientists investigating the claim that eating different colors of jellybeans causes acne." Everything else in those paragraphs replicates material that should be elsewhere. Ozob (talk) 12:26, 14 July 2016 (UTC)[reply]
And where should this material be, exactly? Reducing the text in one article to a bare minimum because something is covered in another article is counterproductive and sends readers all over the place. Concrete examples of misuse (with accompanying numbers) are sorely needed, and this is one of the better examples you can have, as it is both engaging and often used by reliable sources to illustrate possibly one of the most common and dangerous misuses of p-values. All those "scientists find a link between <item you love/hate> and <reduced/increased> risk of cancer" articles in the press? Oftentimes claiming one item causes cancer one week, then the next week saying it reduces cancer? Half the time, that's pretty much exactly what the comic is about (with the other half being small N studies). Headbomb {talk / contribs / physics / books} 13:37, 14 July 2016 (UTC)[reply]
The multiple comparisons problem article. This is simply not the right place for a case study. Ozob (talk) 23:17, 14 July 2016 (UTC)[reply]
I removed the section as per the weight of argument here, but an IP editor has just re-added. Bondegezou (talk) 12:24, 21 July 2016 (UTC)[reply]

xkcd comic was a good example![edit]

Please keep it! This makes the article more understandable than just a bunch of math when we can see just how ridiculous these situations are when you ignore the implications! I urge editors to keep this example and add more to other sections, because right now it seems to be in danger of becoming like all the other math pages: useless unless you already know the topic or are a mathematician. You talk about null or alternative hypotheses, but never give any example! Who exactly do you think can understand this? You think someone who sees a health claim in a nutrition blog that checks a paper with a conclusion that prune juice cures cancer p < 0.05 knows that the null hypothesis means prune juice doesn't cure cancer? Or that an alternative hypothesis is that strawberries cure cancer? EXPLAIN THINGS IN WAYS PEOPLE WHO DON'T HAVE A PHD IN MATH CAN UNDERSTAND!

I am an educator at École Léandre LeGresley in Grande-Anse, NB, Canada and I agree to release my contributions under CC-BY-SA and GFDL. — Preceding unsigned comment added by 2607:FEA8:CC60:1FA:9863:1984:B360:4013 (talk) 12:30, 21 July 2016 (UTC)[reply]

"p-values do not account for the effects of confounding and bias" (in the list of misunderstandings)[edit]

It's not clear what the statement "p-values do not account for the effects of confounding and bias" is supposed to mean. For example, what kind of "bias" is being referenced? Publication bias? Poor randomization? The experimenter's confirmation bias? Even the cited source (an opinion piece in a non-statistical journal) doesn't make this clear, which is probably why the statement in this article's misunderstandings list is the only one not accompanied by an explanation. Furthermore, the cited source doesn't even explicitly suggest that there's a common misunderstanding about the issue. So are people really under the impression that p-values account for "confounding and bias?" Those are general problems in research, not some failing of p-values in particular. I'm removing the statement pending an explanation and a better source. 23.242.207.48 (talk) 02:07, 23 July 2016 (UTC)[reply]

As I've said pretty much since its article's creation, I think it's problematic. There is an important point that the p-value and equally a confidence interval or Bayesian equivalent only account for uncertainty due to sampling error and the calculations presume that the study was appropriately carried out. However, I agree that that does not simply fit into a list of "misunderstandings". Bondegezou (talk) 07:49, 23 July 2016 (UTC)[reply]
Just for the record, I agree with that point. I see it as important information to include in the article, so I restored it in the same place pending discussion, but I'd prefer it to be described elsewhere in the article as well. Sunrise (talk) 01:44, 31 July 2016 (UTC)[reply]

Semi-protected edit request on 4 September 2018[edit]

References #4 and #5 are identical. Please edit the citation to reference #5 that follows the sentence "The p-value fallacy is a common misinterpretation of the p-value whereby a binary classification of hypotheses as true or false is made, based on whether or not the corresponding p-values are statistically significant." to refer to reference #4 instead. Amoriarty21 (talk) 23:11, 4 September 2018 (UTC)[reply]

 Done, thank you. Gulumeemee (talk) 04:11, 5 September 2018 (UTC)[reply]

P-value Fallacy Section Should Be Removed[edit]

This issue was raised years ago, and it appears that the conclusion on this talk page was that "p value fallacy" is not a standard, consistently defined term. Apparently, a single user has been fighting for its inclusion in this article, but it seems to me that is not enough. Certainly giving "p value fallacy" an entire section in the article amounts to undue weight for a term that is hardly ever actually used in science or statistics--making the term's inclusion here misleading regarding what terminology is commonly used. Moreover, as was pointed out in an earlier discussion on this talk page, in the rare cases when the term "p value fallacy" is actually used, it isn't used consistently. Thus, including a section on the "p value fallacy" is not only unnecessary for understanding the topic of the article, but is also potentially confusing.164.67.15.175 (talk) 21:12, 24 September 2018 (UTC)[reply]

The top hit on GScholar - https://scholar.google.com/scholar?hl=en&as_sdt=0%2C39&q=p-value+fallacy - has over 1k citations, which I think makes the definition used there, at least, worth inclusion. — Charles Stewart (talk) 08:45, 25 September 2018 (UTC)[reply]

The "top hit on Google scholar" that you're referring to (which is actually an opinion piece) defines the p-value fallacy as "the mistaken idea that a single number can capture both the long-run outcomes of an experiment and the evidential meaning of a single result." That is NOT the definition given in this wiki article: "The p-value fallacy is a common misinterpretation of the p-value whereby a binary classification of hypotheses as true or false is made." Thus, the "top hit on Google scholar" actually illustrates the point that "p value fallacy" is an inconsistently defined and potentially confusing term. Furthermore, the "p value fallacy" (as defined in the "top hit on Google scholar") isn't even demonstrably a fallacy, though the authors may consider it so. Thus, including it in this wiki article amounts to POV, which is inappropriate. This is supposed to be an article about objective MISUNDERSTANDINGS, not about controversial opinions.23.242.198.189 (talk) 01:57, 26 September 2018 (UTC)[reply]

I agree that we should not be saying that there is a particular inference that is the p-value fallacy, but the fact that this term has some currency justifies a section by that name. What should go in that section is another matter. — Charles Stewart (talk) 07:16, 27 September 2018 (UTC)[reply]

That seems rather backward to me. It doesn't make sense to include a section just because we like the name of the section, without consideration for whether the content of the section is actually relevant to the topic of the article. Note also that the fact that a term or phrase has "some currency" is not enough to make that term merit a section in the article. People have come up with all sorts of terms, many of which have "some currency." That doesn't mean they all belong in an article on misunderstanding p-values.164.67.15.175 (talk) 00:04, 29 September 2018 (UTC)[reply]

Just checking in, I see that still no one has provided any counterargument in favor of keeping the content of the "p value fallacy" section. Please remove it.23.242.198.189 (talk) 01:23, 12 October 2018 (UTC)[reply]

The section belongs here. See refs in the section. Headbomb {t · c · p · b} 01:35, 12 October 2018 (UTC)[reply]

"See the refs" is not a legitimate argument--especially given that the refs were obviously already "seen" because they were addressed in this discussion.23.242.198.189 (talk) 07:22, 16 October 2018 (UTC)[reply]

Let's settle this once and for all, now that the inappropriately applied "semi-protected status" has been lifted. We can go through the section sentence-by-sentence and see that it is not valid.

Sentence 1: The p-value fallacy is a common misinterpretation of the p-value whereby a binary classification of hypotheses as true or false is made, based on whether or not the corresponding p-values are statistically significant.

The cited source for that sentence defining the p-value fallacy is A PAPER THAT DOES NOT EVEN CONTAIN THE TERM "P-VALUE FALLACY." So right off the bat, we can see there is something very wrong here.

Sentence 2: The term 'p-value fallacy' was coined in 1999 by Steven N. Goodman.

The "p-value fallacy" defined by Goodman in the cited article is NOT what is described in the preceding sentence (the "binary classification of hypotheses as true or false"). Instead, Goodman defines "p-value fallacy" as "the mistaken idea that a single number can capture both the long-run outcomes of an experiment andthe evidential meaning of a single result." In other words, Goodman is making a Bayesian critique of p-values. In fact, Goodman's paper is an OPINION PIECE that criticizes the use of "frequentist statistics" altogether! Goodman's opinion that using p-values in conventional frequentist null hypothesis testing is based on "fallacy" is just that--an opinion. It would be relevant in an article on controversies or debates about p-values, but this wiki article is supposed to be about MISUSES of p-values, SO including POV HERE directly contradictS wiki policy.

Sentence 3: This fallacy is contrary to the intent of the statisticians who originally supported the use of p-values in research.

This is more POV, again citing the same Goodman article. Curiously, this sentence also cites a Sterne and Smith article (another opinion piece), which DOES NOT EVEN CONTAIN THE TERM "P-VALUE FALLACY."

Sentence 4: As described by Sterne and Smith, "An arbitrary division of results, into 'significant' or 'non-significant' according to the P value, was not the intention of the founders of statistical inference."

That may or may not be true. It doesn't actually matter, because again, that Sterne and Smith opinion piece DOES NOT EVEN CONTAIN THE TERM "P-VALUE FALLACY," and what Sterne and Smith are describing here does not even appear to be equivalent to what Goodman defined as the p-value fallacy.

Sentence 5: In contrast, common interpretations of p-values discourage the ability to distinguish statistical results from scientific conclusions, and discourage the consideration of background knowledge such as previous experimental results.

This is POV again, once more citing the opinion piece by Goodman.

Sentence 6: It has been argued that the correct use of p-values is to guide behavior, not to classify results, that is, to inform a researcher's choice of which hypothesis to accept, not to provide an inference about which hypothesis is true.

This is POV yet again, and it yet again cites the opinion piece by Goodman. At least here, the wording includes the phrase "It has been argued that..." to acknowledge the POV. It should be noted that in addition to citing the Goodman piece, the sentence also cites another article (one by Dixon). Dixon's article, in contrast to Goodman's, does in fact define the p-value fallacy similarly to how it is defined in Sentence 1. However, the fact is that the term SIMPLY HAS NOT CAUGHT ON. A Google scholar search shows that even the handful of articles that have cited some aspect or another of the Dixon paper have rarely (if ever) used the term "p-value fallacy." The same goes for articles that have cited the Goodman paper. In fact, if you search Google scholar for articles containing the phrase "p-value fallacy," in nearly every hit the phrase only appears in the reference section of the article (as part of a citation of the Goodman paper).

In summary, the "p-value fallacy" is: (a) not a term that is in common enough use to merit mention, (b) is a term that, even when it is used, is not used consistently, as this very wiki article illustrates, and (c) when used as the person who originally "coined" the term intended, is not even really a definitive fallacy and thus does not belong in this wiki article because it constitutes partisan Bayesian POV. It should also be noted that the problems with "p-value fallacy" section have been mentioned numerous times before in the past, going back years (search this talk page to see). It's time to put this silliness to bed once and for all. The section is unnecessary (because the term is fairly obscure), inappropriate (because it contains POV), and confusing (because it can't even agree with itself about the definition of the term it's talking about).

A final note: The main advocate for keeping the section has been the editor Headbomb, who showed similar resistance to removing the COMPLETELY INCORRECT section on the false discovery rate a while back (as shown in this talk page). When challenged to present an argument for keeping the "p-value fallacy" section (scroll up a few paragraphs), Headbomb said simply the following: "The section belongs here. See refs in the section." I hope that I have sufficiently demonstrated here that, after "seeing the refs," it is clearer than ever that the section does NOT belong here. — Preceding unsigned comment added by 23.242.198.189 (talk) 04:47, 8 September 2019 (UTC)[reply]

No link back to the p-value article[edit]

Shouldn’t there be at least one? — Preceding unsigned comment added by 194.5.225.252 (talk) 16:01, 2 December 2019 (UTC)[reply]

There is a link at the start of the second sentence. Mgnbar (talk) 16:21, 2 December 2019 (UTC)[reply]
As Mgnbar noted, there IS in fact a link to the p-value article. Moreover, even if there weren't, that's the type of minor, noncontroversial edit that should simply be performed, without opening a new section of the talk page. 130.182.24.154 (talk) 18:51, 4 December 2019 (UTC)[reply]

Opposite sides of 0.05[edit]

@23.242.198.189: You reverted my addition

  1. "Studies with p-values on opposite sides of 0.05 are not in conflict." "Studies statistically conflict only when the difference between their results is unlikely to have occurred by chance, corresponding to when their confidence intervals show little or no overlap".[1]

with the comment "Revered good faith edit. It isn't clear what "in conflict" means. This seems like a subjective thing, not an objective misconception." In this case "in conflict" means to disagree about the underlying reality or to contradict each other. As far as I understand it, it is not subjective at all, but maybe we can find a wording that is better. Do you (or anyone else) have an idea to more clearly express the misconception?Nuretok (talk) 14:24, 3 March 2021 (UTC)[reply]

The author of the linked article proposes that studies "statistically conflict" if the confidence intervals have "little or no overlap." But "statistically conflict" is not a standard term, and it isn't clear conceptually what it means. Moreover, the author's criteria for "statistical conflict" are subjective. For example, why "little or no overlap" and not simply "no overlap"? One could argue that if there is any overlap at all between two independent confidence intervals for the same parameter, then the two intervals are compatible. Or one could argue that every pair of independent confidence intervals is in conflict, even if they heavily overlap, because they do not have exactly the same upper and lower bounds. Or one could argue that two confidence intervals conflict only if they disagree about the direction of the effect. Or one could argue that the very idea of conflict between two confidence intervals is nonsensical, because if you conducted two independent studies to estimate the same parameter, then the correct thing to do would be to compute a single confidence interval using the data pooled from the two studies. In any case, I think the misconception the author is actually getting at is already listed in the wiki article: the misconception that "there is generally a scientific reason to consider results on opposite sides of [.05] as qualitatively different." I see no reason to add a separate misconception to the list that says basically the same thing while introducing the questionable and potentially confusing concept of "statistical conflict." 23.242.198.189 (talk) 10:16, 12 March 2021 (UTC)[reply]
Thank you for your explanation. I agree that it basically is a rewording of the sentence which is already on the list.Nuretok (talk) 19:28, 12 March 2021 (UTC)[reply]
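
For anyone reading this thread later, here is a minimal sketch (in Python) of the point under discussion. The study names, the summarize helper, and all of the numbers are invented purely for illustration; they are not taken from Goodman's article or from the reverted edit. It shows two hypothetical studies whose p-values fall on opposite sides of 0.05, yet whose 95% confidence intervals overlap heavily, whose difference is nowhere near significant, and whose inverse-variance pooled estimate (one common way of combining independent studies, as mentioned above) is itself significant.

    # Illustrative sketch only: made-up estimates and standard errors for two
    # hypothetical studies of the same effect.
    from scipy.stats import norm

    def summarize(name, estimate, se):
        z = estimate / se
        p = 2 * norm.sf(abs(z))          # two-sided p-value
        half = norm.ppf(0.975) * se      # half-width of the 95% confidence interval
        print(f"{name}: estimate={estimate:.2f}, p={p:.3f}, "
              f"95% CI=({estimate - half:.2f}, {estimate + half:.2f})")
        return estimate, se

    a_est, a_se = summarize("Study A", 2.0, 0.95)   # p ~ 0.035 (significant)
    b_est, b_se = summarize("Study B", 1.5, 1.00)   # p ~ 0.134 (not significant)

    # Test whether the two estimates differ from each other.
    diff = a_est - b_est
    se_diff = (a_se**2 + b_se**2) ** 0.5
    p_diff = 2 * norm.sf(abs(diff / se_diff))
    print(f"Difference between studies: {diff:.2f}, p={p_diff:.2f}")  # p ~ 0.72

    # Inverse-variance (fixed-effect) pooling of the two studies.
    w_a, w_b = 1 / a_se**2, 1 / b_se**2
    pooled = (w_a * a_est + w_b * b_est) / (w_a + w_b)
    se_pooled = (w_a + w_b) ** -0.5
    p_pooled = 2 * norm.sf(abs(pooled / se_pooled))
    print(f"Pooled estimate: {pooled:.2f}, p={p_pooled:.3f}")  # p ~ 0.011

With these made-up numbers, the two studies land on opposite sides of 0.05, the p-value for the difference between them is about 0.72, and the pooled estimate is significant at about 0.011; this is only meant as a concrete illustration of why "opposite sides of 0.05" does not by itself indicate disagreement between studies.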

References

  1. ^ Goodman, Steven (2008-07-01). "A Dirty Dozen: Twelve P-Value Misconceptions". Seminars in Hematology. Interpretation of Quantitative Research. 45 (3): 135–140. doi:10.1053/j.seminhematol.2008.04.003. ISSN 0037-1963.

P-hunting listed at Redirects for discussion[edit]

The redirect P-hunting has been listed at redirects for discussion to determine whether its use and function meets the redirect guidelines. Readers of this page are welcome to comment on this redirect at Wikipedia:Redirects for discussion/Log/2024 April 21 § P-hunting until a consensus is reached. Utopes (talk / cont) 17:35, 21 April 2024 (UTC)[reply]