Good afternoon, everyone. On behalf of the National Institute of Statistical Sciences, I would like to welcome viewers from all around the world to our event. My name is Daniel Jeske, and I'm coming to you from the Department of Statistics at the University of California, Riverside. I'm delighted to be the moderator for what promises to be a very interesting statistics debate. During the first 90 minutes of our debate, we will witness a lively discussion around many of the issues surrounding the use of p values. Our debate will be a refreshing alternative to political debates: our participants will demonstrate respect for facts and for each other, and they will actually answer questions. And after the debate is over, no commentator will spin a narrative that fits a personal agenda. Three years ago, the ASA released a statement on p values that addressed their context and purpose. Since that time, many statisticians have been thinking about and writing about alternatives to the traditional p value. There have been many webinars, podcasts, conference sessions, and special issues in journals that have supported the dialogue on these issues. Our event today is a little bit different: three distinguished thinkers on p values will debate different sides of the issues, in an event that I think will help us further understand where we stand. Joining the debate today is Deborah G. Mayo, a professor emerita in the Department of Philosophy at Virginia Tech. Her book Error and the Growth of Experimental Knowledge won the 1998 Lakatos Prize in philosophy of science, and her most recent book is Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. She is a research associate at the London School of Economics, in the Centre for Philosophy of Natural and Social Science. David Trafimow is Distinguished Achievement Professor at New Mexico State University and a fellow of the Association for Psychological Science. He is also the editor in chief of Basic and Applied Social Psychology, where he instituted the p value ban in 2015, and he has published recent articles expanding his alternative to significance testing, termed the a priori procedure. And Jim Berger is the Arts and Sciences Distinguished Professor Emeritus of Statistics at Duke University. He has long been interested in Bayesian statistics, frequentist statistics, and their relationship to p values and testing. The format of our debate today is as follows. For each of seven questions that I will pose, we will use multiple cycles through our debate participants to get their answers and their comments on what they heard from the other participants. Debate on each question will last 12 minutes. If you see the yellow light above my shoulder come on, that's our signal to the debate participants that the time allocated to their current speaking slot is running out. Viewers can ask questions throughout the debate using the Q&A feature in the Zoom window, and we will present those questions to the participants during the last half hour of our event. We will take a five minute break prior to the Q&A time, during which responses to a viewer poll will be collected using the polling feature in Zoom, and we'll discuss those polling results when we come back from the break. So now, on with the debate. I'm going to start with our first question to Deborah.
[QUESTION 1] Given the issues surrounding the misuse and abuse of p values, do you think they should continue to be used or not? Why or why not?
Thank you so much, and thank you for inviting me; I'm very pleased to be here. Yes, I say we should continue to use p values and statistical significance tests. Uses of p values are really just a piece in a rich set of tools intended to assess and control the probabilities of misleading interpretations of data, i.e., error probabilities. They're the first line of defense against being fooled by randomness, as Yoav Benjamini puts it. If even larger, or more extreme, effects than you observed are frequently brought about by chance variability alone, i.e., the p value is not small, clearly you don't have evidence of incompatibility with the mere chance hypothesis. It's very straightforward reasoning. Even those who criticize p values, you'll notice, will employ them, at least if they care to check the assumptions of their models. And this includes well-known Bayesians such as George Box, Andrew Gelman, and Jim Berger. Critics of p values often allege it's too easy to obtain small p values. But notice the whole replication crisis is about how difficult it is to get small p values with preregistered hypotheses. This shows the problem isn't p values, but those selection effects and data dredging. However, the same data-dredged hypothesis can occur in other methods, likelihood ratios, Bayes factors, Bayesian updating, except that now we lose the direct grounds to criticize inferences for flouting error statistical control. The introduction of prior probabilities, which may also be data dependent, offers further researcher flexibility. Those who reject p values are saying we should reject the method because it can be used badly, and that's a bad argument. We should reject misuses of p values. But there's a danger of blindly substituting alternative tools that throw out the error control baby with the bad statistics bathwater.
Thank you, Deborah. Jim, would you like to comment on Deborah's remarks and offer your own?
Okay, yes. Well, I certainly agree with much of what Deborah said; after all, a p value is simply a statistic. And it's an interesting statistic that does have many legitimate uses when properly calibrated. Deborah mentioned one such case, model checking, where Bayesians freely use some version of p values. On the other hand, if one interprets this question as, should they continue to be used in the same way that they're used today, then my answer would be somewhat different. I think p values are commonly misinterpreted today, especially when they're used to test a sharp null hypothesis. For instance, a p value of .05 is commonly interpreted by many as indicating the evidence is 20 to one in favor of the alternative hypothesis. And that just isn't true. You can show, for instance, that if I'm testing a normal mean of zero versus nonzero, the odds of the alternative hypothesis to the null hypothesis can at most be seven to one. That's just a probabilistic fact; it doesn't involve priors or anything. It's a bound that holds over all possible priors. So if a p value of .05 is interpreted as 20 to one, it's just being interpreted wrongly, and the wrong conclusions are being reached.
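A minimal sketch of the arithmetic behind a roughly seven-to-one figure (my illustration, not Jim's own calculation), assuming the bound comes from letting the alternative's prior concentrate entirely at the observed mean, which maximizes the Bayes factor against the null:

```python
# Sketch: for a two-sided z-test of a normal mean of zero, the Bayes factor in
# favor of the alternative is maximized by a prior concentrated at the observed
# mean, giving BF <= exp(z^2 / 2). At p = .05 this is about 6.8, i.e. ~7 to 1.
import math
from scipy.stats import norm

p = 0.05
z = norm.isf(p / 2)                  # two-sided critical value, about 1.96
max_odds = math.exp(z ** 2 / 2)      # largest possible odds of alternative to null
print(f"z = {z:.2f}, maximum odds of alternative to null ~ {max_odds:.1f} to 1")
```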
I'm reminded of an interesting paper that was published some time ago now, reporting on a survey designed to determine whether clinical practitioners understood what a p value was. The results of the survey were not surprising: most clinical practitioners interpreted a p value of .05 as something like 20 to one odds against the null hypothesis, which, again, is incorrect. The fascinating aspect of the paper is that the authors also got it wrong. Deborah pointed out that the p value is the probability, under the null hypothesis, of the data or something more extreme. The authors stated that the correct answer was that the p value is the probability of the data under the null hypothesis; they forgot the "or something more extreme." So, I love this article, because the scientists who set out to show that their colleagues did not understand the meaning of p values themselves did not understand the meaning of p values.
David?
Okay. Like Deborah and Jim, I'm delighted to be here; thanks for the invitation. I partly agree with what both Deborah and Jim said: it's certainly true that people misuse p values, so I agree with that. However, I think p values are more problematic than the other speakers have mentioned. And here's the problem for me. We keep talking about p values relative to hypotheses, but that's not really true. P values are relative to hypotheses plus additional assumptions. So, if we use the term model to describe the null hypothesis plus the additional assumptions, then p values are based on models, not on hypotheses, or only partly on hypotheses. Now, here's the thing. What are these other assumptions? An example would be random selection from the population, an assumption that is not true in any one of the thousands of papers I've read in psychology. And there are other assumptions, a lack of systematic error, linearity, and we can go on and on; people have even published taxonomies of the assumptions because there are so many of them. So it's tantamount to impossible that the model is correct, which means that the model is wrong. And what you're in essence doing, then, is using the p value to index evidence against a model that is already known to be wrong. Even the part about indexing evidence is questionable, but I'll go with it for the moment. The point is the model is wrong, and so there's no point in indexing evidence against it. So given that, I don't really see that there's any use for them. P values don't tell you how close the model is to being right. P values don't tell you how valuable the model is. P values pretty much don't tell you anything that researchers might want to know, unless you misuse them. Anytime you draw a conclusion from a p value, you are guilty of misuse. So, I think the misuse problem is much more subtle than is perhaps obvious at first. That's really all I have to say at the moment.
Thank you. Jim, would you like to follow up?
Yes, I certainly agree that assumptions are often made that are wrong. I won't say that that's always the case. I know many scientific disciplines where I think they do a pretty good job. I work with high energy physicists, and they do an excellent job of checking their assumptions. And they use p values. It's something to watch out for, but any statistical analysis can run into this problem: if the assumptions are wrong, it's going to be wrong.
Deborah...
Okay. Well, Jim thinks that we should evaluate the p value by looking at the Bayes factor, and when he does, he finds that p values exaggerate the evidence. But we really shouldn't expect agreement on numbers from methods that are evaluating different things. This is like supposing that if we switch from a height standard to a weight standard, and we had been requiring six feet for height, we should now require six stone, to use an example from Stephen Senn. As for David, I think he's wrong to worry about the assumptions behind p values, since they have the fewest assumptions of any method, which is why even Bayesians will say we need to apply them when we need to test our assumptions. And it's something that we can do, especially with randomized controlled trials, to get the assumptions to hold. The idea that we have to misinterpret p values for them to be relevant only rests on supposing that we need something other than what the p value provides.
David, would you like to give some final thoughts on this question?
Sure. As for Jim's point, and Deborah's point, that we can do things to make the assumptions less wrong: the problem is that the model is either wrong or it isn't. Now, if the model is close, that doesn't justify the p value, because the p value doesn't give the closeness of the model. And that's the problem. We're not using, for example, a sample mean to estimate a population mean, in which case, yeah, you wouldn't expect the sample mean to be exactly right, and if it's close, it's still useful. The problem is that p values aren't being used to estimate anything. So, if you're not estimating anything, then you're stuck with either correct or incorrect, and the answer is always incorrect. This is especially true in psychology, but I suspect it might even be true in physics. I'm not the physicist that Jim is, so I can't say that for sure.
Jim, would you like to offer final thoughts?
Let me comment on Deborah's remark that Bayes factors are just a different scale of measurement. My point was that people seem to invariably think of p values as something like odds, or a probability of the null hypothesis. If that's the way they're thinking, because that's the way their minds reason, I believe we should provide them with odds. And so, I try to convert p values into odds or Bayes factors, because I think that's much more readily understandable by people.
Deborah, you have the final word on this question.
I do think that we need a proper philosophy of statistics to interpret p values. But I also think that what's missing in the reject-p-values movement is this: a major reason for calling on statistics in science is to give us tools to inquire whether an observed phenomenon is a real effect or just noise in the data, and p values have intrinsic properties for this task, if used properly, that other methods don't. To reject them is to jeopardize this important role. As Fisher emphasized, we need randomized controlled trials precisely to ensure the validity of statistical significance tests. To reject them because they don't give us posterior probabilities is illicit. In fact, I think that those who claim we want such posteriors need to show, for any way we can actually get them, why we would want them.
[QUESTION 2] Thank you very much. I'd like to move to the second question. And I'm going to start this round with Jim. Jim, should practitioners avoid the use of thresholds in interpreting data? And if so, does this preclude testing?
So, I think the answer to this question has to do with who is doing the statistical analysis. If you were to ask this question of statisticians, they almost invariably will say: when I do an analysis, I don't want to use a strict threshold; I want to take into account all the complexities of the problem; each problem is unique, and one size does not fit all. However, if you ask this of non-statisticians who routinely have to deal with a type of difficult statistical problem, the response will usually be: I welcome the threshold, because it spares me from having to attempt a detailed statistical analysis; I need to focus on my science, not on doing the statistics. Often a particular scientific community will do a serious job of considering its problems and setting a threshold. I mentioned the high energy physicists, who long ago settled on a five sigma threshold, which corresponds to a p value of three times 10 to the minus seven. Their primary concern is that they simply cannot stand getting false positives; they hate it. And they found that if they use the five sigma threshold, they avoid false positives, and so that's what they've been doing for a long time. A second example is the genome wide association studies [GWAS] community, where they try to find genes associated with various diseases. They have a severe multiple testing problem, and they have long had the threshold set at something like five times 10 to the minus eight for a p value before they'll claim that a disease is associated with a given gene. And this cutoff is chosen based on very principled methods, considering type one and type two errors and the importance of each. So, I have no problem at all when a community looks at its kind of problem and says: for our problem, careful consideration suggests this threshold.
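For readers who want to check the two thresholds Jim cites, here is a small sketch of the arithmetic (my illustration; the one-sided tail convention for "five sigma" and the million-test Bonferroni reading of the GWAS cutoff are assumptions, though both are the usual ones):

```python
# Five sigma as a one-sided normal tail probability, and the GWAS cutoff read as
# a Bonferroni-style correction of .05 for about a million independent tests.
from scipy.stats import norm

print(f"five sigma, one-sided p ~ {norm.sf(5):.1e}")    # about 3e-07
print(f"0.05 / 1,000,000 = {0.05 / 1_000_000:.0e}")      # 5e-08, the GWAS threshold
```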
David, your response and comments.
Yeah. The problem for me is that, since I don't believe that p values should be used in the first place, I certainly don't believe in p value thresholds. And I think thresholds add an additional layer of problems to what we already have, and I'll explain that. In order to meet a threshold, there's, of course, the sample size, but there's also the sample effect size. And the sample effect size is largely a matter of which one you happen to get; there's a distribution of them. If you get lucky in your particular experiment and get a larger effect size, then you might meet whatever the threshold is for the p value, because the larger the effect size, the smaller the p value. And so, what you end up with is a literature full of inflated effect sizes. In psychology, there was the replication project, where they replicated about 100 papers from top journals. And what happened? Well, they found that the average effect size in the replication cohort was less than half that in the original cohort. So, you have massive effect size inflation. And I think that causes incalculable harm to the field, because you can't trust the published effect sizes. I'll make a more general comment, which is that since the model is wrong, in the sense of not being exactly correct, whenever you reject it, you haven't learned anything, and in the case where you fail to reject it, you've made a mistake. So the best possible case is that you haven't learned anything, and the worst possible case is that you're wrong. So the expected utility is negative. And when you count that regression problem I mentioned, which causes inflated effect sizes, I would say that the expected utility is quite negative. And I'll stop there.
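A toy simulation may help make the selection effect David is describing concrete. This is my sketch (not his analysis), with an assumed small true effect and an assumed publication filter of p < .05:

```python
# Selecting results that cross a p < .05 threshold inflates the average reported
# effect size relative to the true effect (a winner's-curse-style selection effect).
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
true_effect, n, reps = 0.2, 30, 20_000        # small true standardized effect
all_effects, selected_effects = [], []
for _ in range(reps):
    x = rng.normal(true_effect, 1.0, size=n)
    d = x.mean() / x.std(ddof=1)              # observed standardized effect size
    all_effects.append(d)
    if ttest_1samp(x, 0.0).pvalue < 0.05:     # keep only "significant" studies
        selected_effects.append(d)

print(f"true effect:                   {true_effect}")
print(f"mean over all studies:         {np.mean(all_effects):.2f}")
print(f"mean over 'published' studies: {np.mean(selected_effects):.2f}")  # inflated
```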
Deborah?
There's a lot of confusion about thresholds. What people oppose are dichotomous accept/reject routines, and we should move away from them, as well as from unthinking uses of thresholds with 95% confidence levels, or Bayes factors, or others. But we shouldn't confuse fixing a threshold to use habitually with prespecifying a threshold beyond which there's evidence of inconsistency with a test hypothesis. And I will often call the test hypothesis a null in this debate, just for abbreviation. Some think that banishing thresholds would actually diminish p-hacking and data dredging, but it's actually the opposite. In a world without thresholds, it would be harder to criticize those who failed to meet a small p value because they engaged in data dredging and multiple testing, and at most have given us a nominally small p value. Yet that's the upshot of declaring that predesignated p value thresholds shouldn't be used at all in interpreting data. If an account can't say in advance about any outcome that it won't count as evidence for a claim, then there's just no test of that claim. Giving up on tests means foregoing statistical falsification. What's the point of insisting on replications if at no stage you can say the effect has failed to replicate? Now, you may favor a philosophy of statistics that favors something other than statistical falsification and say we should never do tests, but it won't do to declare by fiat that science should reject the testing view. So, my answer is no and yes: don't abandon thresholds; to do so would be to abandon tests. I agree with Jim that at times p value thresholds are made very small, and insisting on small thresholds, say in physics, is often to accommodate the searching and data dredging. But one can also take the route of adjusting the p value, instead of just blindly trying for a small p value. Now, David thinks that if a model is wrong, there's really no usefulness in testing it, and that's just false. We know our models aren't exactly correct; that's why they're called models. In fact, if they were true, we couldn't use them to learn about the world. We can find out true things with approximate models. I would agree that in some cases, as he says, the state of statistical inference is too uncertain. That might be so in cases where we have poor error control, convenience samples, and questionable measurements, and in psychology, some areas might be like this. But that's not an argument against using proper tests in domains that work very hard for error control.
David, would you like to follow up?
Yeah, I disagree on two counts. First, according to Deborah, p values give inconsistencies of the data with hypotheses. But I don't think that's true. At best they give inconsistencies of data with models. So, there's an extra inferential step that is being conveniently glossed over. The second issue: I don't agree that you have to prespecify a cutoff value ahead of time. People were falsifying hypotheses long before inferential statistics was invented. And furthermore, I can give you an everyday example: if I think I'm going to have steak for dinner, and fish appears on my plate, then I'm wrong. I don't need to prespecify anything.
Thank you, David. ...Jim?
Jim, you might be muted.
Sorry. David mentioned the problem of publication bias, which can lead to inflated effect sizes, and it's very real if thresholds are used. But that's the publication bias question. It may be a side product of the threshold issue, but I'm not sure the threshold is the cause of the problem. Going back to the GWAS situation: the community decided that when they declare a disease gene association, they want it to be 10 times more likely to be a true association than a false association. And that, they said, is what we need to do to move the science forward. They will then take the truest associations and do replication studies and further study. So, they're just using the threshold as a screening basis to find things which need further study, and there I think a threshold is perfectly appropriate.
Deborah, would you like to give some final thoughts on this question?
Yes, a common fallacy in discussing thresholds is to suppose that because we have a continuum, we can't distinguish points at the extremes; this is sometimes called the fallacy of the beard. But we can distinguish results readily produced by random variability from cases where there's evidence of statistical incompatibility with the chance hypothesis, and we use thresholds throughout science, whether it's diabetes or something else, unproblematically. When p values are banned altogether, as David recommends in his journal, his eager researchers don't claim "I'm simply describing"; they invariably go on to claim evidence for a substantive psychological theory, but on results that would be blocked if they required a reasonably small p value threshold. And of course, in cases of deductive falsification, you don't appeal to statistics, but we're talking about the uses of statistics in science.
David?
Yeah. I think we might be at loggerheads because, again, the issue is the model, not the hypothesis. I just don't think that you can draw a conclusion about the hypothesis when you factor in that you're really talking about the model. Now, as far as the fallacy of the beard, that's not what I'm claiming, so I don't see that. My point is that by having a threshold, you set in motion this regression effect that causes inflated effect sizes. I mentioned the replication study, and we can't trust those effect sizes because of that inflation. And that's a really big harm. And again, I still don't see why you need to have p values to make a judgment. You can look at the effect size, and you can look at the sample size to see how much you can trust the effect size. I think confounding the two into a p value is counterproductive. It's better to keep the sample effect size separate from the sample size, so you can make independent judgments.
Jim, the final word on this topic is yours.
I think my response to David on this is exactly what the next question is about. So, I suggest we move on to the next question.
[QUESTION 3] Okay. All right. Very good. So moving on to the next question. I'm going to start now with David. Is there a role for sharp null hypotheses? Or should we be thinking about interval nulls?
I think you can probably guess my answer from what I've said before: I would favor no nulls whatsoever, with one exception, which I'll get into in a minute. Since I don't believe in p values, I don't believe in p value thresholds, and I don't believe in what we might call a null hypothesis of statistical convenience. So, I'd rather get rid of all of them. The exception is that there might be a good substantive reason for the null hypothesis. A theory might actively predict no effect. Or it's easy to imagine situations, for example in medicine, where you're interested in showing that a side effect doesn't occur; there you have an applied reason for a null effect. So, I'm not against null effects if there's a substantive reason, whether theoretical or applied, but I'm very much against null hypotheses whose only reason for being tested is so you can obtain a p value.
Deborah, your thoughts please?
Okay, well, David speaks about nulls coming from substantive theories, and that can be the case, for example, the equivalence principle in physics. But the point is, in order to evaluate whether there's evidence for a discrepancy from the null, you need procedures that have high capability of detecting a discrepancy if one exists. I know he wants to reject all such tests. But the bottom line is that when people look at his journal, as Ron Fricker and Dan Lakens did, they find that the readers are, first of all, kept in the dark as to whether p values had been used and whether the p values were small enough. And the reason is that he says the authors should remove all signs of p values, even if they were used in order to infer the results. So, I think this is an issue, and he's not really getting away from them, because the authors aren't saying "I'm just describing"; they do make inferences to theories. But that said, I'd agree with those who regard testing a point null hypothesis as problematic. And notice, though, that the arguments purporting to show p values exaggerate evidence are always based on the point null, plus a spiked or lump prior on that point null. And we know that by giving a spiked prior to the null, the nil null as it's often called, it's easy to find the nil more likely than the alternative. This is the Jeffreys-Lindley paradox. The p value can differ from the posterior probability on the null. But the posterior can also equal the p value; it can range from p to 1-p. In other words, Bayesians differ among themselves, because with diffuse priors, the p value can equal the posterior on the null hypothesis. My own work reformulates the results of statistical significance tests. We move away from the point null; instead, we make inferences about discrepancies from the null that are well or poorly tested. A small p value indicates a discrepancy from the null value, because with high probability, 1-p, the test would have produced a larger p value in a world adequately described by the null, and all we need is that adequate or approximate description--not precision. Since the null hypothesis would very probably have survived if it's adequate, then, when it doesn't survive, that indicates inconsistency with it: that's statistical falsification.
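A numerical sketch of the Jeffreys-Lindley point may make the disagreement concrete. This is my illustration, assuming a spiked prior on the point null and a N(0, tau^2) prior under the alternative: hold the p value fixed at .05 and let the sample size grow, and the Bayes factor swings toward the null.

```python
# With z fixed at 1.96 (p = .05) and x-bar ~ N(mu, sigma^2/n), the Bayes factor
# of the point null against a N(0, tau^2) alternative is
#   BF01 = sqrt(1 + n*tau^2/sigma^2) * exp(-z^2/2 * tau^2 / (tau^2 + sigma^2/n)),
# which grows without bound as n increases.
import numpy as np

z, sigma, tau = 1.96, 1.0, 1.0
for n in (10, 100, 1_000, 100_000):
    se2 = sigma**2 / n
    bf01 = np.sqrt(1 + tau**2 / se2) * np.exp(-0.5 * z**2 * tau**2 / (tau**2 + se2))
    print(f"n = {n:>6}: Bayes factor in favor of the null ~ {bf01:.2f}")
```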
Thank you. ...Jim?
So, I certainly agree with David that when I have a sharp null, it's because of a substantive reason. It's not a null you stick in there just so you can get a p value and publish a paper. But I think the cases of such sharp nulls are rather extensive. In psychology, for instance, there's the famous, or infamous, Bem precognition null; that's a substantive null, that there is no precognition. I know in psychology there's quite a big debate about whether sharp nulls are prevalent or not. I've watched that debate, but I haven't participated. I've given other examples today where the nulls were obviously sharp. The high energy physicists recently discovered the Higgs boson; that was a sharp null, whether the Higgs boson exists or not. The genome wide association studies we've been talking about are millions of sharp nulls: namely, is there an association between this gene and this disease? We're currently facing a whole bunch of sharp nulls; we have thousands of COVID-19 vaccines being tested. Vaccines are a prime area where you want to consider a sharp null, because many things simply will not trigger any kind of immune response. There's a sharp null there. This issue is very important, because Bayesian answers and what I think are optimal frequentist answers can differ enormously depending on whether the null is sharp or not. Deborah referred to this, although I'd phrase it differently: it's true that if I have, say, a one sided testing problem, then the p values and the Bayesian answers are more or less the same. If I have a sharp null hypothesis, then the p value and the Bayesian answers are very different. But that's just because these are two different types of problems, and the Bayesian analysis has to reflect the type of problem. In between the two are interval nulls; part of the question asked about interval nulls. Interval nulls can behave either like a sharp null or like a one sided hypothesis testing problem, depending on a lot of things, but the conditions are known under which you can treat an interval null like a sharp null. This is, to me, the most fundamental distinction in what we're talking about, because the answers vary so much. And my last thought on that is that if you do have a sharp null, you can't ascertain the evidence against it simply by looking at the effect size and the sample size. You have to do another kind of computation, which indicates the evidence for the sharp null.
Deborah, would you like to follow-up?
Yes. Jim mentions cases where we use sharp nulls in science, and it really is unproblematic in, say, a Neyman-Pearson statistical significance test. But it's very different when you look at the Bayes factor, and Jim's whole worry is that there's disagreement in that sharp null case between the Bayesian and the frequentist significance test; but there's disagreement also among Bayesians. Now, Jim himself has said that p values are reasonably good measures of evidence when there's no prior concentration of belief about the null hypothesis. This is something noticed by Casella and Roger Berger. And so, if we reject that high concentration of belief in the null, which I think we should, then we could conclude with Jim that p values are reasonable measures of evidence, even though I don't recommend they be used as posteriors.
Yeah, I think we have to move on. I'm sorry. David?
Okay. Yeah, I think I agree with Jim's examples about substantive reasons for null hypotheses, and I don't think we disagree about that. Now, Deborah again made the point that you need procedures for testing discrepancies from the null hypothesis, but I will repeat that p values don't give you that. P values are about discrepancies from the model, and you already know the model is wrong. So, I think I'm going to just stick with that. Now, about the complaint about the journal I edit, two things. First, the article that was cited did not provide a quantitative analysis; it had a bunch of cherry picked examples. And second, I'd like to point out that if you look at the impact factor and the rejection rate, both have gone up dramatically since banning p values, not down. So the argument that you have to have p values and cutoffs in order to do science is just not true.
Thank you, David. Jim, do you want to give some final thoughts on this question?
Well, I want to emphasize what I finished with last time. In this question, we're not just talking about p values; we're talking about how you assess the evidence for, or against, a sharp hypothesis, such as whether this new COVID-19 vaccine is going to be effective or not. And I believe that you have to use some kind of Bayesian odds analysis or frequentist odds analysis to assess that evidence. And so, it's important to recognize when we have a substantive, sharp null, and it's important to use a method that can deal with it. Deborah mentioned that, yes, sometimes p values do agree with Bayes factors and odds. And I agree, sometimes they do, but very often they don't. And so, I just want to differentiate between the two cases.
Deborah, your final thoughts, please?
Yeah, I just want to mention that the whole move to redefine statistical significance, to use .005 or something smaller, rests upon this high lump prior probability given to the sharp null, as well as on evaluating p values using Bayes factors. And this is problematic in that it's actually giving a biased assessment in favor of the null. What happens is the redefiners are prepared to say there's no evidence against the null hypothesis, or even evidence for it, even though that point null is entirely excluded from the corresponding 95% confidence interval. And this would often erroneously fail to uncover discrepancies. So, there's disagreement not just with the significance tester, but with the confidence interval person. Whether to use a lower p value threshold is one thing. The only problem is arguing that we should do so based on a measure that's assessing something completely different, like a Bayes factor.
The final word on this question goes to you, David.
Okay, yeah. I think I've already said what I need to say. A null hypothesis that's statistical rather than substantive doesn't make sense to me. I still think that you have this problem of inflated effect sizes that's caused by having a cutoff point. Now, as far as having some way of comparing probabilities or not, I don't object to probabilities. But sometimes I think we make this more complicated than we have to. If one perspective predicts one effect size and another one predicts another effect size, you can look at the effect size that was obtained and see which prediction it's closer to. So, I'm not rejecting complex statistics. But I think sometimes we use them when there are simpler ways to go.
[QUESTION 4] Thank you very much. Our next question moves us into the realm of pedagogy, and I'm going to start with Deborah: should we still be teaching hypothesis testing, or should we be focusing on point estimation and interval estimation?
I would say absolutely, we should be teaching hypothesis testing. The best way to understand confidence interval estimation, and to fix its shortcomings, is to understand its duality with tests. The same person who developed confidence intervals also developed tests in the 1930s: Jerzy Neyman. The intervals he developed were inversions of tests. A 95% confidence interval contains the parameter values that are not statistically significantly different from the data at the 5% level. While I agree that p values should be accompanied by confidence intervals, my own preferred reconstruction of tests blends tests and intervals by reporting the discrepancies that are well or poorly indicated, the discrepancies being from some reference point. But I would give different levels--not just one level like 95%. This improves on current confidence intervals. For example, what's the justification standardly given for inferring a particular confidence interval estimate? It's that it came from a method that has high probability of covering the true parameter value. This is what I call a performance justification, not an inferential one. The testing perspective on confidence intervals that I favor gives an inferential justification. So, for example, I would justify inferring that there's evidence the parameter exceeds the confidence interval's lower bound by saying: if the parameter were smaller than the lower bound of the confidence interval, then with high probability we would have observed a smaller value of the test statistic than we did. Amazingly, the last president of the American Statistical Association, Karen Kafadar, had to appoint a new task force on statistical significance tests in order to affirm that statistical hypothesis testing is indeed part of good statistical practice.
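The test/interval duality can be checked numerically; here is a minimal sketch (my illustration, using a simple z-test with the usual normal approximation): the 95% interval collects exactly the null values a two-sided 5% test would not reject.

```python
# Invert a two-sided 5% z-test over a grid of null values mu0 and compare the
# set of non-rejected values with the usual 95% confidence interval.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(10.0, 2.0, size=50)
xbar, se = x.mean(), x.std(ddof=1) / np.sqrt(len(x))

grid = np.linspace(xbar - 4 * se, xbar + 4 * se, 100_001)
pvals = 2 * norm.sf(np.abs((xbar - grid) / se))      # two-sided p value at each mu0
kept = grid[pvals > 0.05]                            # null values not rejected at 5%
print("inverted test:", round(kept.min(), 3), "to", round(kept.max(), 3))

z = norm.isf(0.025)
print("usual 95% CI: ", round(xbar - z * se, 3), "to", round(xbar + z * se, 3))
```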
Jim, would you like to comment?
Yes. So, the duality of tests and confidence intervals that Deborah mentioned is indeed part of the classical frequentist perspective. But there is not a duality like that on the Bayesian side, and this is what I mentioned earlier about sharp null hypotheses. It's not at all uncommon for a Bayesian analysis of a sharp null hypothesis to produce a report that says something like: the posterior probability that the null hypothesis is true is .4, and the confidence interval for the effect size, if the null hypothesis is not true, is 1 to 3. It's entirely possible for the Bayesian analysis to give substantial support to the sharp null while the effect size seems to be somewhere else. There's nothing wrong with that; that's just probability theory, that's the way it works out. As an example of this, I got involved a while ago with a reanalysis of a large HIV vaccine trial that was conducted in Thailand. The 95% confidence interval for the difference between the infection rates of the treatment group and the control group was quite far from zero. So, if all you did was look at the confidence set for the treatment effect size, you would say: "Oh, yeah, it's quite different from zero." But on the other hand, when you do the kind of odds analysis that I believe in, the odds analysis said: "Well, okay, there is evidence for the treatment being effective, but the odds are only four to one in favor of the treatment being effective." Now, four to one means this is an interesting treatment, but it's hardly compelling. And what happened was, it was later found that this was just an aberration, it was just noise; the vaccine did not work. And this is why it's so important, when you have sharp nulls, to do the right kind of odds analysis, to avoid being overeager to reject them.
Thank you, Jim. David, your thoughts?
Sure. So, for me, estimation is a much better use of inferential statistics than hypothesis testing, mainly because I don't think you can test hypotheses using inferential statistical methods, whereas I think you can use them for estimation, and I'll explain. Let me use the device of Laplace's demon, who knows everything. Suppose Laplace's demon tells you that your sample statistics have nothing to do with their corresponding population parameters. Would we care about our sample statistics in that case? We wouldn't, right? Because we take it as an assumption, or maybe a hope, that our sample statistics do have something to do with the corresponding population parameters. So, the point I want to make is that yes, we should do estimation; that's very important. And this is one case where model wrongness isn't fatal for me, because again, if the model is wrong but reasonably decent, then our estimate will be reasonably decent, and we're in good shape. Model wrongness is fatal only when you're making a dichotomous decision.
Okay. Jim, do you want to offer some follow up?
Um, no, not at the moment.
Okay. Then I'll go to Deborah.
Okay, so Jim agrees that in the case of the sharp null, p values can conflict with the confidence interval, and he thinks it's good to use the Bayes factor to check on an interval. But when we have a confidence interval that entirely excludes the null value, and he's saying there's actually evidence for the null, that's problematic, and it has a high probability of erroneously failing to find effects. Yes, David says it's okay to estimate, but really, this estimation is tightly connected to tests. And I'd really like to know how he endorses something that he calls an a priori approach that allows him to give a degree of, I guess, posterior probability or degree of confidence to a chosen discrepancy from the true parameter value, without using any of the questionable assumptions that he worries about in tests and intervals.
David, would you like to respond?
The difference is that model wrongness is fatal for dichotomous decisions, but it's not fatal for estimation. And so, in the a priori procedure, sure, we're using models. In fact, one of the things I advocate is using multiple models and seeing if they give you different answers, or how different they are. And if the model is wrong but reasonably good, then you'll end up with a good estimate.
Okay, thank you. Jim, your final thoughts on this question?
Well, I guess in response to David's last comment, I'll just repeat that in the vaccine situation I would not know, simply by looking at effect sizes, how to ascertain whether the sharp null of the vaccine being ineffective is still possibly correct. I have to go ahead and do a test, a Bayesian test. I, of course, look at the effect size too, but I need to do both before I can feel that I understand the problem.
The final word on this question is for you, Deborah.
Okay, we don't dichotomously infer; we infer discrepancies that are well or poorly indicated at a certain level, and we would give several confidence levels. Understanding the duality between tests and confidence intervals really is the key for getting around David's worries about invalidity, and so I think it also makes no sense for the new statisticians to reject tests. It's also important to see that this testing interpretation of confidence intervals scotches criticisms of examples where it can happen that, let's say, a 95% confidence interval contains all possible parameter values. As David Cox remarks, and this is why he says we have to have tests as well as intervals, an inference like that could be said to be trivially true, but it's scarcely vacuous. The fact that all parameter values are consistent with the data, at this level, is an informative statement about the limitations of the data in this case.
[QUESTION 5] Thank you very much for that. Moving forward to the next question, which we'll start with Jim: What are your reasons for or against the use of Bayes factors?
Before addressing that, I'm going to break protocol a bit, because I can't resist addressing Deborah's last comment. I once had a little debate with David Cox about that precise example. I said, you know, it's disturbing to have a confidence interval that's the whole real line. And he responded, well, no, and then he said just what Deborah quoted him as saying: that it just reflects the fact that you really haven't learned anything about the parameter. And I said, no, no, that's not what's concerning me. What concerns me is that I hate to say that I'm 95% confident in the whole real line. I'm 100% confident, not 95%, but I can't say 100% in the frequentist school. Anyway, back to the question: am I for or against the use of Bayes factors? Certainly, I'm a Bayesian, and I'm a big fan of Bayes factors. Why? Many reasons. One, these odds are simple to understand: when I say the odds of the alternative to the null are 10 to 1, that's simple; everybody can easily understand that. They don't involve prior probabilities of hypotheses; well, they can involve priors, but they're mainly data driven. Prior probabilities of hypotheses can be problematic, and so it's nice that they don't. Again, in that HIV vaccine trial, the p value was .02, and the Bayes factor was four to one, four to one odds. So, I feel that there's an important distinction and an important role for Bayes factors. Bayes factors also have a conditional frequentist interpretation; I don't think I have time to go through that, so I'll skip it. Bayes factors are central in Bayesian analysis. For instance, a very powerful method for prediction is called Bayesian model averaging, where you consider all the possible models for a problem and average them, weighting them according to their posterior probabilities. This has proven to be an immensely powerful tool in statistics and machine learning for prediction, and Bayes factors are a central component of that tool; they're not the only part of it, but they're central. The main disadvantage of Bayes factors is that they can be difficult to compute, and they nominally depend on a prior. But there are things you can do about that. You can find bounds that hold over all priors. In fact, my favorite bound, which drives Deborah crazy, is minus e p log(p), where e is just 2.718, p is the p value, and log is the natural logarithm. If you take a p value and compute minus e p log(p), the Bayes factor in favor of the null hypothesis can never be smaller than that. So if, say, p is .01, when I plug that into minus e p log(p), I get one eighth. That means the Bayes factor in favor of the alternative could be as high as eight to one, but no higher. So that's a very useful thing to say, and it's a bound that applies over all priors.
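A quick check of the arithmetic behind that bound, as a sketch (my code, not Jim's); the bound -e*p*log(p) applies for p below 1/e:

```python
# For p < 1/e, the Bayes factor in favor of the null is at least -e * p * log(p),
# so the odds in favor of the alternative are at most 1 / (-e * p * log(p)).
import math

for p in (0.05, 0.01, 0.005):
    bf0_lower = -math.e * p * math.log(p)     # smallest possible BF for the null
    max_odds_alt = 1.0 / bf0_lower            # largest possible odds for the alternative
    print(f"p = {p}: BF for null >= {bf0_lower:.3f}, "
          f"odds for alternative <= {max_odds_alt:.1f} to 1")
```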
Thank you for that, Jim. David, would you like to comment on this question: Bayes factors?
Sure. In one sense, I like Bayes factors, because if they come out with an opposing implication from the p value, that's good to know, and maybe it would make people a little bit less confident in their p values. However, I also think there is an important problem with Bayes factors, and this is going to sound familiar to all of you: Bayes factors are based on models, not hypotheses. And so, to say that the data are, I don't know, eight times more likely under one wrong model than under another wrong model is less informative than it seems, if you think about it in terms of hypotheses. So as a result, I'm not a big fan of Bayes factors. I don't think Bayes factors actually tell you which hypothesis you should favor and by how much, which is how they're often interpreted. So again, no, I don't favor those either.
Deborah?
Okay. Jim, of course, is a leading advocate of Bayes factors and also of a non-subjective interpretation of the Bayesian prior probabilities to be used. Bayes factors require priors, and, as Jim has convincingly argued, eliciting subjective priors is too difficult: experts' prior beliefs almost never even overlap, and scientists are reluctant to let subjective beliefs overshadow the data. Now these default priors, reference or non-subjective priors, are supposed to prevent prior beliefs from influencing the posteriors--they're data dominant in some sense. But there's a variety of incompatible ways to go about the job: maximum entropy, invariance, and several others. And Jim tends to dismiss this as, well, unproblematic and easy to understand, but really, it's very difficult to understand. As David Cox points out, the default priors are simply, we are told, formal devices to obtain default posteriors. So, what do they mean? The priors are not supposed to be considered expressions of uncertainty, ignorance, or degree of belief. They may not even be probabilities, being improper, and so it's not at all unproblematic how to interpret them. Also, the prior probabilities are supposed to let us bring background information in, we always hear, but this pulls in the opposite direction from the goal of the default prior. As for David, this time I do agree with him that Bayes factors aren't tests. By just giving us a comparative appraisal, they don't tell us whether a hypothesis is warranted or not.
David, would you like to follow up?
Um, no, I think I've said what I need to say.
Okay. Jim, any follow up?
Actually, yes, let me follow up on Deborah's last comment by saying that, yes, most of the time I operate as sort of what's called an objective Bayesian. But on the other hand, one often needs to bring in subjective knowledge. In this vaccine trial I talked about in Thailand, many scientists were highly skeptical of the treatment, because it was a combination of two treatments that had both been shown to be utterly ineffective. So, a lot of scientists had a high prior probability that the vaccine wasn't going to work, and incorporating that into the analysis was important for them. In the GWAS studies, they directly assess the prior probability that any gene is associated with a specific disease, and they assess that to be roughly one in a million, based on biological and genetic considerations. So that's a powerful piece of subjective input, but it's fine--what's subjective about it? It's just scientific knowledge. It's their best judgment as scientists that the prior probability is one in a million, and that's central to their analysis doing a good job of detecting these associations.
It's closing time on Bayes factors. Final comments, Deborah?
Yes, well, aside from these points, we should remember the use of Bayes factors to criticize p values, as we saw before, which is problematic because it rests on this high spiked prior on the point null. And I think it's worth noting that even default Bayesian statisticians like Jose Bernardo hold that the difference we've been seeing between the p value and the Bayes factor, the Jeffreys-Lindley paradox, is actually an indictment of the Bayes factor, because it finds evidence in favor of a null hypothesis even when the alternative is much more likely. Other Bayesians object to the default priors because they can lead to improper posteriors, and this leads statisticians like Dennis Lindley back to subjective degree-of-belief priors. I've heard Bayesians admit that Bayesian testing is a work in progress, and my feeling is we shouldn't kill a well worked out theory of testing for one that is admitted to be a "work in progress."
David, do you have some follow up?
Yeah, I think I do. Deborah is obviously a frequentist and Jim is a Bayesian, and so there are important philosophical issues there. I'm going to pass on those philosophical issues, because I haven't decided myself whether I like frequentist philosophy better or Bayesian philosophy better. I'll have to get back to you when I've figured it out. But what I will say is, whichever philosophy one is going to use, I'd prefer that it be used for estimation, not for testing.
Well, Jim, you have the final thought on Bayes factors.
Reacting to David's last comment: I'm both a Bayesian and a frequentist. I am not a Bayesian to the exclusion of being a frequentist. In fact, I'm only really happy with an answer to a problem when I perceive it as an answer from both the Bayesian and the frequentist side. Now, I'm a different kind of frequentist than Deborah is, and so, when I talk about my Bayesian odds, I can reproduce those as conditional frequentist odds using a frequentist argument. So, I'm both.
[QUESTION 6] Okay, that was great. Thank you. So, moving on to question number six, and I'll start with David on this one. With so much examination of if and why the usual type one error rate of .05 is appropriate, should there be similar questions about the usual nominal type two error rate of .2?
Yeah. Again, I'm going to go back to the point that these are based on models, not on hypotheses. So, take the .05 type one error rate: I don't think you can commit a type one error at the model level, because the model is wrong. Now, as far as the type two error rate, I would say that whenever you fail to reject the model, you've made a mistake. So you should always reject the model, whether you do a test or not, because the model is wrong. And again, what you conclude about the model doesn't transfer to the hypothesis, because the hypothesis is embedded in a larger model with additional assumptions. So for me, .05 for type one or .2 for type two, they don't really matter very much, because the whole idea is wrong.
Deborah, your comments, please.
Well, my answer is no, there shouldn't be a similar examination of type two error bounds as there has been for alpha. Rigid bounds for either error should be avoided; Neyman and Pearson themselves urged that specifications always be used with discretion and understanding. You know, it occurs to me that if an examination is warranted, it should be done by the new ASA Task Force on Significance Tests and Replicability. Its members aren't out to argue for rejecting statistical significance tests, but rather to show that tests are part of proper statistical practice. Power, which of course is the complement of the type two error probability, is, I often say, one of the most abused notions. But notice it's only defined in terms of the threshold. Critics of statistical significance tests, I'm afraid to say, often fallaciously take a just statistically significant difference at level alpha as a better indication of a discrepancy from the null if the test's power to detect that discrepancy is high rather than low. This boils down to saying it's a better indication of a discrepancy of at least 10 than of at least 1, whatever the parameter is, and I call this the "mountains out of molehills" fallacy. It results from trying to use power and alpha as ingredients for a Bayes factor, and from viewing non-Bayesian methods through a Bayesian lens. We set high power to detect population effects of interest. But when we find a statistically significant difference, it doesn't warrant saying we have evidence for those effects. Of course, the significance tester isn't inferring point values, but inequalities, such as a discrepancy of at least such and such. Now, David doesn't understand how in science we can learn from approximate models, and I say that we can only learn from approximate models; we learn true things from models that are deliberately false. And my other question to him is how he has this a priori way of assigning 95% probability, for example, to the sample mean being within k standard deviations of the population mean, and how he thinks that this confidence can be assigned without any of the invalidity threats that he finds in confidence intervals and power, and so on. What he's actually doing is exactly equivalent to either using confidence intervals or power analysis, and if we had time to talk about it, I could show him that.
Well, let's first go to Jim for his comments and questions.
So, with my frequentist hat on, I certainly think that it's important to consider power. Let me just remind everybody of the extreme example where the power of the test equals the type one error: then rejection of the null hypothesis means nothing. The rejection region could equally well have happened under the null or the alternative, so you've learned nothing. That's a very extreme example, but it shows that trying to infer something from the type one error alone, without considering power, can be grossly misleading. I really liked the genome wide association article, which Deborah just referred to and doesn't like, so let me go through that explanation. They asked: how do we combine power and type one error in our analysis? They said: let's look at the ratio of the power to the type one error, because that is very related to the pre-experimental odds that a rejection is a true positive versus a false positive. You can view power over type one error as indicating how likely a rejection is to be a true positive versus a false positive. Multiplying that by the prior odds would give the overall odds, but just seeing what the experiment suggests about the chance of a true to a false positive is interesting all by itself. Now, Deborah points out that that's problematical, and you can get some contradictory results there; let me actually give you one, and this is why, instead of using power over alpha, I prefer to use the Bayes factor. In this genome wide association paper in 2007, they not only set up this nice machinery, they claimed 21 gene disease associations. And they set this up so that the pre-experimental odds of a true to a false positive were supposed to be 10 to 1. So, they claimed most of these 21 gene disease associations are probably correct. In actual fact, 20 of them were replicated later in the scientific literature; only one failed to replicate. Remember, they set this up so that the pre-experimental odds of a true to a false positive were 10 to 1. It turned out, when you computed the Bayesian odds of a true to a false positive, they were 10 to 1 or bigger for 20 of those associations. One of them was a scenario of the kind Deborah was talking about: the Bayesian odds went the other way. The Bayesian odds said, for this particular gene disease association, conditional on the actual data observed, it's 1 to 10 in favor of no association. So it completely reversed: pre-experimentally it was supposed to be 10 to 1, but you can get data, and this happened, where it comes out the other way around, 1 to 10. So, whereas I did like their idea of using power over alpha, I agree with Deborah that it's not the ultimate answer. But I think the Bayes factor is.
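A back-of-the-envelope sketch of the pre-experimental odds calculation Jim describes may help (my illustration; the power value is an assumption, while the one-in-a-million prior odds and the 5e-8 threshold are the figures mentioned in the discussion):

```python
# Pre-experimental odds that a rejection is a true rather than a false positive:
# (power / alpha) multiplied by the prior odds of a real association.
alpha = 5e-8          # genome-wide significance threshold
power = 0.5           # assumed power to detect a real association
prior_odds = 1e-6     # roughly one-in-a-million prior odds for any given gene

rejection_ratio = power / alpha
pre_experimental_odds = rejection_ratio * prior_odds
print(f"power/alpha = {rejection_ratio:.0e}")
print(f"pre-experimental odds of true to false positive ~ {pre_experimental_odds:.0f} to 1")
```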
Deborah, some follow-up?
So, Jim does put forward this pre-data rejection ratio, this power over alpha, and he even says that it quantifies the strength of evidence about the alternative hypothesis relative to the null that would be conveyed by a result that was statistically significant. My questioning of this is that a result that's just statistically significant at the alpha level would be poor evidence of a discrepancy against which the test has high power. And it's true, he sort of agrees with that, but we haven't finished our discussion. Nevertheless, presenting the Bayes factor as if it's just a post-data follow-up of this pre-data rejection ratio, I think, is misleading. But I'll agree that in the case of screening, when it's an 'on-off' and we can look at it as binary, we might want to look at that ratio. But that's very different from evaluating hypotheses.
David, some follow up.
Yeah, so I think there might be a misunderstanding here. I'm not arguing that models that are wrong, but close, are not useful. I'm just claiming they are not useful for hypothesis testing; I think they're fine for estimation. Now, about the issue of power analysis. One of the interesting things I found is that if you use the a priori procedure, you'll often find that a power analysis suggests you only need a certain number of participants, and then when you put that into an a priori analysis, what you find is that your estimate of the population parameter is really terrible. Which I think is yet another reason why that whole significance testing way of thinking isn't a good idea.
Thank you, David. So, final thoughts, Jim, on type one and type two errors?
Um, I don't know. I think I said everything I wanted to say.
Okay, Deborah?
Okay. Something that hasn't been mentioned, and is important, is this: a legitimate criticism of p values is that they don't give population effect sizes, but Neyman developed power analysis just for this purpose, in addition to comparing tests pre-data. Yet critics of tests typically keep to Fisherian tests that don't have an explicit alternative or power. And Neyman was very keen to avoid misinterpreting non-statistically significant results as evidence for a null hypothesis, which he thought Fisher was occasionally guilty of, and he used power analysis post-data, as Jacob Cohen does much later, to set an upper bound on the discrepancy from the null value. The reasoning is this: if the test has high power to detect a population discrepancy but doesn't do so, that's evidence the discrepancy is absent, qualified, as always, by the level. My preference is to use what I call attained power, but it's still the same reasoning. These days you'll hear people talking about post hoc power as a bad thing, even as sinister, but they're actually referring to something that's not proper power analysis: they're referring to using the observed effect as the parameter value in the power computation.
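A minimal sketch of that post-data power reasoning, reusing the same assumed setup as the earlier sketch (one-sided z-test, sigma = 1, n = 100, alpha = 0.025).

```python
# Sketch of Neyman/Cohen-style post-data power analysis (assumed setup):
# a non-significant result plus high power against a discrepancy d supports
# inferring that the true discrepancy is less than d, to the stated level.
from scipy.stats import norm

n, sigma, alpha = 100, 1.0, 0.025
se = sigma / n ** 0.5
cutoff = norm.ppf(1 - alpha) * se          # rejection cutoff for the sample mean

def power(d):
    """Probability the test rejects when the true discrepancy is d."""
    return 1 - norm.cdf((cutoff - d) / se)

# Suppose the test did NOT reject.  For which d is that good evidence the
# discrepancy is below d?  Roughly, those d for which power(d) is high.
for d in (0.1, 0.3, 0.5):
    print(d, round(power(d), 3))
# power(0.5) ~ 0.999: failing to reject is strong grounds that the discrepancy < 0.5
# power(0.1) ~ 0.17:  failing to reject says little about discrepancies as small as 0.1
```

Attained power, as Deborah describes it, replaces the cutoff with the actually observed mean, but the direction of the reasoning is the same.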
David, your final thought on this?
Yeah, I think I've pretty much said what I needed to say. So, I think I'll just pass my time.
[QUESTION 7] Okay. Very good. Thank you all for that. We've come to our final question. And I will return to Deborah to start this round off. What are the problems that lead to the reproducibility crisis? And what are the most important things we should do to address it?
Irreplication is due to many factors, from data generation and modeling to problems of measurement and linking statistics to substantive science. Here I'm just going to focus on p values. The key problem is that, in many fields, latitude in collecting and interpreting data makes it too easy to dredge up impressive looking findings even when they're spurious. But the fact that it becomes difficult to replicate effects when features of the test are tied down shows the problem isn't p values, but exploiting researcher flexibility and multiple testing. The same flexibility can occur when the p-hacked hypothesis enters methods being promoted as alternatives to statistical significance tests: likelihood ratios, Bayes factors, Bayesian updating. But the direct grounds to criticize inferences as flouting error statistical control are lost, at least not without adding non-standard stipulations. That's because these other methods condition on the actual outcome; as a result, they don't consider outcomes other than the one observed. This is embodied in something called the likelihood principle. Admittedly, some people think error control is only of concern to ensure low error rates in the long run, the performance justification. But I argue that, instead, what really bothers us about the p-hacker and the data dredger is that they've done a poor job in the case at hand: their method very probably would have found some such effect or other, even if due to noise alone. Probability here, notice, is used to assess how well tested claims are, which is very different from how comparatively believable they are. Claims can be true but terribly tested. I think there's room for both types of assessments in different contexts in statistics, but they are very different: how plausible is very different from how well tested. It seems to me that to really address replication problems, statistical reforms have to be developed together with the philosophy of statistics that properly underwrites them.
Thank you. Jim, your comments, please.
So, first, there are many issues with what I would call the current scientific enterprise that lead to reproducibility problems, such as the publication bias issue that David essentially referred to earlier and the overestimation of effect sizes. From the viewpoint of a statistician, we can warn about such issues, but it's up to the scientists to really do something about them. So let me turn to the statistical side, where I think the main problems related to reproducibility are p-hacking, which Deborah has mentioned extensively; misinterpretation of p values, which I've already talked about; multiple testing, which Deborah mentioned; and simply doing bad statistics. Which is the most important cause, I think, depends on the discipline. In the social sciences, for instance, it seems to me that p-hacking and misinterpretation of p values are probably the biggest issues. I loved the 2011 article in Psychological Science by Simmons, Nelson and Simonsohn; they showed that there is significant evidence that listening to the song "When I'm 64" by The Beatles can reduce listeners' actual physical age by 1.5 years. This remarkable finding was obtained through a brilliant amount of p-hacking, where you can go through every step of the p-hacking argument and not see anything that was egregiously wrong, and yet they ended up with an absurd conclusion; of course, they were trying to. And in many of the hard sciences, like GWAS and other things like that, I think controlling for multiplicity is the big problem, because they routinely do millions of tests simultaneously, or they deal with millions of types of multiplicities. In the physics world they're called cuts, and they have to deal with millions of them. So dealing with multiplicity control is crucial. You know, pretty soon we're going to be seeing some COVID vaccines passing their trials with only modest efficacy. With the thousands of different development efforts going on now, if one of these trials says, aha, the vaccine works, it passed the phase three clinical trial, and it has a modest effect, I'm not going to believe it. I think we're facing a multiplicity crisis in terms of vaccine testing.
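As a rough illustration of the p-hacking mechanism, here is a small simulation sketch with assumed parameters (five outcome measures, 20 subjects per group); it is not a reconstruction of the Simmons, Nelson and Simonsohn analysis, just the generic effect of testing many outcomes and reporting the best one.

```python
# Small simulation sketch (assumed parameters): even when every individual test
# is valid, testing several outcomes and reporting the smallest p value
# inflates the false positive rate well beyond the nominal 5%.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_sims, n_per_group, n_outcomes = 2000, 20, 5   # assumptions for illustration

false_positives = 0
for _ in range(n_sims):
    # The null is true: treatment and control are drawn from the same distribution.
    treat = rng.normal(size=(n_outcomes, n_per_group))
    ctrl = rng.normal(size=(n_outcomes, n_per_group))
    # The "hack": test every outcome and keep only the smallest p value.
    p_min = min(ttest_ind(treat[k], ctrl[k]).pvalue for k in range(n_outcomes))
    false_positives += p_min < 0.05

print(false_positives / n_sims)   # roughly 0.2, far above the nominal 0.05
```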
Sounds worrisome! David?
Um, yeah. Let me start by agreeing with Deborah and Jim about the various ways of cheating, what we might call researcher sins, or whatever your favorite word is. I think these are an important problem for reproducibility. However, I think there would still be an important reproducibility problem even if everybody were perfectly honest, and let me explain. In a previous answer, I talked about inflated effect sizes, right? Because to pass the threshold, you need to get lucky with your sample effect size. Well, let's suppose someone does that, and then let's suppose that a replication study is performed, such as with that Open Science Foundation project I mentioned earlier. What's going to happen? If the reason the original study got published is that you were lucky, then it's probably not going to reproduce, right? So you have a problem even if people are perfectly honest. And that's a specific example of what I think is a more general philosophical problem, which is that we tend to define reproducibility with respect to getting a significant p value in the first study and in the replication study. I think a better way would be to define a replication as meaning that you get similar effects in both studies, or even that you get effects in both studies that are close to the population effect, or something like that. Our insistence on significance testing thinking has lured us into a really silly way of thinking about replication, and that's why, or at least a large part of why, we're having an irreproducibility crisis.
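A small simulation sketch of the effect size inflation David describes, with assumed numbers (a true mean difference of 0.2 and 30 subjects per group; illustrative choices, not figures from any study mentioned).

```python
# Simulation sketch (assumed numbers): if only significant studies get published,
# the published effect sizes systematically overestimate the true effect,
# so honest replications tend to come in smaller or non-significant.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
true_effect, n_per_group, n_studies = 0.2, 30, 5000   # assumptions for illustration

published_effects = []
for _ in range(n_studies):
    treat = rng.normal(true_effect, 1.0, n_per_group)
    ctrl = rng.normal(0.0, 1.0, n_per_group)
    if ttest_ind(treat, ctrl).pvalue < 0.05:                  # only "lucky" studies pass
        published_effects.append(treat.mean() - ctrl.mean())  # the estimate that gets published

print(np.mean(published_effects))   # typically around 0.6, about triple the true effect
```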
Thanks, David, yes. Jim, what are your thoughts on what you heard David say?
I certainly agree with David that there are problems even when things are being done honestly. I'm reminded of the fact that in phase three clinical trials for new drugs, the success rates have been plummeting for the last 20 years. Now, here's a case where, when you get to phase two and phase three trials, it's mandatory to control for multiple testing, so they do. But the trouble is that in the early experimental stage of drug development, they think of themselves as just exploring things, just trying to find interesting candidates for study. By not appropriately taking care of the multiple testing or multiplicity issues in those early developmental phases, things get through to the final stages, and enough gets through that just doesn't work that it causes a problem.
Deborah, some follow-up?
Yes, we shouldn't try to excuse failed replication by saying it's based on exploration and that this is how science is done. To do away with the very tools that reveal these problems is a huge mistake. I'm very interested to hear Jim emphasizing multiple testing here, because I think it's at odds with the idea of using Bayes factors. What he says is that Bayes factors can be used in the absence of a sampling plan, which suggests you don't need to know about the selection effects that alter the sampling distribution and the error probabilities, including things like multiple testing. But other things too: outcome switching, optional stopping. All of these require you to look at what would have occurred under other outcomes, and those, I think, are the main cause of failed replications. So I'm interested to hear that he's even moving toward that, or that it's also something he agrees with. And of course, he wrote the famous book, The Likelihood Principle. On David: what people find, Fricker et al., along with Dan Lakens, looking at his journal, is worrying. They're very worried that banning p values makes replication problems worse because, for starters, the journal violates the American Psychological Association's Task Force insistence on reporting multiple testing and data dredging. And in this sense it also violates principle four of the 2016 ASA statement on p values. Okay.
David, would you like to offer some final thoughts?
Um, sure. Again, if the fear is that getting rid of significance testing causes more bad things to pass, I'll repeat that the rejection rate at BASP [Basic and Applied Social Psychology] has gone up, not down, which contradicts that. And I will also point out that the impact factor has gone up dramatically. So the complaints about the journal, I don't think, really work. And as far as whether the BASP policy violates what X said or what Y said, I don't really care. That's an appeal to authority, which I don't think is a valid form of argument. So no, I'm going to stick with what I said. I still think that our insistence on doing significance testing has caused us to have a poor philosophy of what reproducibility means.
Thank you, David. Jim, your final thoughts?
So, multiple testing: certainly, Bayesians deal with it all the time. What's interesting is that the adjustment for multiple testing in the Bayesian world occurs through the prior probabilities of hypotheses and models. So it's different from Bayes factors; it's the other component of Bayesian analysis. Like I mentioned in the GWAS example, they assessed the prior probability of a particular alternative hypothesis being true as one in a million. And they didn't have to specify that; they could have left it as unknown, to be estimated by the data. But multiple testing is dealt with by the other side of Bayesian analysis, not the Bayes factor side. And then one final thought: we should also look for low hanging fruit in the area of reproducibility. We've mentioned multiplicity control and trying to take care of that. Another one, which Deborah mentioned in passing, is optional stopping. We should insist that optional stopping is dealt with appropriately. In many scenarios I know of, people just keep sampling until they get a p value less than .05 without admitting they did that, and things like that should just be eliminated.
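A back-of-the-envelope sketch of how that prior-probability adjustment works; the Bayes factor below is a made-up number, and only the one-in-a-million prior echoes the GWAS example.

```python
# Sketch of Bayesian multiplicity control (assumed numbers): the penalty for
# searching a million hypotheses enters through the prior odds, not the Bayes factor.
prior_prob = 1e-6                        # "one in a million" prior that a given association is real
prior_odds = prior_prob / (1 - prior_prob)

bayes_factor = 3e7                       # hypothetical strength of the observed data for H1 over H0
posterior_odds = bayes_factor * prior_odds
print(posterior_odds)                    # ~30 to 1 in favor of a real association

# The same Bayes factor with 50-50 prior odds would give 3e7 to 1; the
# million-fold prior penalty is what accounts for the million tests performed.
```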
So, your final thought on this question, Deborah?
Yes, I just want to say that they need to be transparent about how many hypotheses were tested. And pointing to things like cherry picking is not an appeal to authority; there's good reason that these task forces have put these things forward as requirements. On what Jim said, I'm interested to hear him also endorsing the concern with optional stopping, since he has always been my famous source, being so influential in telling us that we can ignore the stopping rule. In fact, he may have coined the term Stopping Rule Principle in his book, or maybe somebody else did, and what he admits is that, since the Bayesian is bound to ignore the stopping rule, she can arrive at, let's say, a 95% credible interval that can never contain the true value. So I'm not sure if this reflects a change in his position, which is fine. And, David, stop thinking that estimation is somehow so different from testing. To me, any time you need a warrant, and you have to show that your method had a chance of finding the claim false if it is false, then that's doing a test. That's being probative, whether it's an estimate or an inference about a discrepancy in a parameter value.
Thank you very much for that. So, we've come to the end of our round of questioning. And I want to say I admire you all for your commitment to our discipline and your particular passion for this topic of p values and all the issues surrounding them. I think all of our viewers will agree that this was a fantastic exchange. And the good news is we're not finished; we still have a little bit of time left. But we're heading into a five minute break, during which time I'm going to hand over the microphone to Jim, who's going to lead us through a viewer poll. So, for five minutes, the viewers can respond to that poll, and Jim will bring us back from the break and talk about the results. And then we'll have a little bit of time for some q&a. So please stay tuned. And Jim...
I do. Thank you, Dan, and Jim, Deborah and David, for this very civil, thoughtful and informative discussion, or debate, as we call it. Just to give the viewers a chance to weigh in, we have three very simple questions that we'll provide as a poll, and I'll give the viewers about 30 seconds to answer before we show the results. So, you should be able to see the first question on your screen, and you can just click your answer. And we have over 150 responses now; let's wait a little longer. I'll give you 15 more seconds. Okay, it's slowing down now, but people are still clicking; when we get to 300 responses, I'll quit. Okay, let me end that poll and share the results. So, you see, sorry, Jim [Berger], traditional p values are still pretty popular.
So, let's go to the next poll. I'll save these in case we want to talk about them later. I should also say there are more than 50 or 60 questions in the q&a, so we hope to get to a bunch of those. All right, the next question is about bright line thresholds; let's see what your views are on that. They're voting on the bright line thresholds now. Maybe this one bears on your point of view, David, but anyway. So, more than half of the viewers have responded. Give it five more seconds. Okay. Okay, the results from that are mostly at the top.
So, let me go to the third poll, which is the more difficult question, having to do with results blind review. Okay, five more seconds. We have over 290 responses already; I'll wait just another few seconds. Okay, and the results: a toss up, but strongly agree and agree are predominant. So, let's see, is everyone back, Dan? I think I will stop sharing, and you have access to the questions, more than 60 of them. I'm not sure how you're going to decide. But...
I'm not either. I was trying to go through them, and I've pulled out a few. I thought I would just open them up to all of the participants in the debate here, maybe keying off that last poll question about results blind publication. One of the questions was: please comment on publication bias. And maybe I could just add to that: what do you think is holding up the use of results blind publication procedures? Locascio contributed a paper on that to the 2019 special issue of The American Statistician, and I just thought that was a great idea that would solve a lot of the problems that go along with the misuse and abuse of p values and so forth. But what's holding it up? What's holding up the journals from doing that? Anyone like to jump in on that?
David? Yes.
Yeah, actually, Locascio published an earlier version of that argument in Basic and Applied Social Psychology. So, I'm certainly in favor of moving in that direction; however, for the sake of full disclosure, I feel I should note that there are arguments against it as well. Some of the arguments I've heard are that if you have results blind publishing, then what you're in essence doing is not factoring the obtained effect size into your decision, and there might be some situations where an effect of one size is valuable whereas an effect of some other size is not. So that's an argument to the contrary. Note, despite the argument I just made, I still favor moving in that direction; I just wanted to give you an opposing point of view as well.
It seems, especially with the chance to use online publication formats, that we don't have to worry so much about space, right? We could publish more papers, even if they didn't have the valuable kind of effects. Another question from the viewers was: if experts can't agree on what we should do in analyzing data, what should we be teaching our students?
Anybody jump in?
Yeah, anybody.
I mean, I think the fact that there's disagreement is a very strong reason that we shouldn't be putting forward just one view, and the ASA, I would hope, would see itself as a kind of forum for discussing the different approaches used by its members. And I think the most important thing in teaching would be to recognize one's own biases in putting forward one favored view versus another, and it probably would require a team teaching effort, where somebody supporting the other philosophy would give their account of the value of a certain view.
In principle, I agree with what Deborah just said, but unfortunately, in the statistics community, there's now this huge problem that statistics programs are having to incorporate all sorts of elements of data science that were not there before. This means that teaching programs are becoming more and more full of these other things, and so there's less and less room to teach the different philosophies of statistics. So this is, I think, a problem that every statistics program is currently wrestling with, and it's a real dilemma.
We'll come to you, Deborah. David?
Yeah. I'd also like to advocate for what might be a strange answer to that question. Let's suppose we got to the point where one particular perspective was deemed wrong; let's even say everybody agreed on it. I don't see that happening in the near future, but just let's pretend. I'd still advocate teaching the wrong method, in the sense that in order to understand the progress of science, students need to understand where we went wrong.
Did you want to comment, Deborah?
I was only going to say to Jim that it's true that it's hard to find time for all of these things. But given how much time we've spent in the past five years agonizing over lots of misinterpretations, which are now getting even more confused than ever, I think it's worthwhile. Also, these data science aspects that are now filling up the time are themselves open to these foundational issues, which mix with philosophy and also with ethics. So possibly one could justify it that way.
Kind of related to that previous question is a thought that some might have about the Big Data era, you know, machine learning, algorithmic methods, the focus on prediction, and so forth, and maybe validation using cross validation methods, where it's not so much the details of the model that matter, but whether it gives good predictions, good, useful answers. Does that sort of trend maybe diminish the importance of formal statistical inference tools, like p values for example? Do you see a shift towards things that are empirically validated, where the proof is in the pudding, so to speak? What do you think about that way of thinking?
I mean, my view is that those machine learning tools are primarily addressing a different set of problems than the traditional statistics problems. So, I worry about this next vaccine that's going to be declared to work for COVID-19, and I worry about assessing whether it really works or not. That's not a machine learning exercise at all; that's a traditional statistics exercise. So I think there are plenty of traditional statistics problems that are of enormous importance in the world. There's just this other huge class of predictive problems that are also becoming important.
There was a question from the viewers: to what extent is the appropriate use of p values dependent on the content area? You know, does the type of discipline that people are working in matter, might it sideline the use of p values or kind of underline their importance, or is that not relevant to this discussion?
I mean, I definitely think that the context is important. If you are doing clinical trials, and you have some control of error probabilities, then it's appropriate to use them. And fields that are unable to do so, if they have convenience samples, if their measurements are questionable, if they can only use students who are pretty familiar with the problem already, then maybe David is right that they shouldn't be using them, but then they should be pursuing other methods to warrant their inferences, not sort of claiming that they've got statistics behind them. And I also think it's very problematic for these other fields to try to ban them from the rest of science, which is striving for and capable of getting the error control.
Yeah. Again, the question is, are we talking about error control at the level of hypotheses, or at the level of models? And I would claim that error control at the level of hypotheses is exactly what we're not getting. So I just don't believe the claim that we're getting error control. Now, error control at the level of models, I'll buy that one. But at the level of models, there is no point in it, because we already know that the model is wrong.
Let me just say, I like the question, because it addresses what happens in different disciplines. I already referred to the high energy physicists as one discipline, and they primarily use p values. They're very comfortable using p values; they think they know how to interpret them and deal with them, right? And so there hasn't been too much motion there. On the other hand, cosmologists have entirely switched over to using Bayes factors; basically, they don't use p values at all anymore. So that's why it's interesting that different disciplines do make these decisions.
Can I just say that error control is about inferences, and any account that says that because models aren't true we face no threat of error, since the null is always correctly rejected, doesn't understand how this kind of statistical inference works.
So, speaking of the concern about models being true or not, or useful or not, and I guess this is one of your points, David, so I'll ask you: don't you think that there's some insurance in some of the nonparametric methods, some of the efforts to do randomization tests and so forth, that sort of reduce the dependence on a model being correct, and yet we can still derive p values and such from those types of analyses?
Yeah, the question I would ask in return is: what are you trying to estimate with the p value?
Just that the two distributions for control and treatment group are the same. That's what we want to know. F equals G.
Right? Um, the problem is, I just don't think p values give you that, for the reasons that I've stated. So no, I think I'm just going to have to stand on that.
So, you don't think the p value from, say, a Wilcoxon rank sum test would be informative for that sort of situation?
Well, here's the thing. I'm certainly in favor of the notion of trying to have fewer wrong assumptions, so that's good. And in an estimation context, I could see where that could be quite helpful. But in a testing situation, you're still wrong. And so, again, it's just what I said earlier.
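For readers wondering what the Wilcoxon and randomization options raised just above look like in practice, here is a minimal sketch on made-up data; the sample sizes and the use of SciPy's mannwhitneyu and permutation_test are illustrative assumptions.

```python
# Minimal sketch (made-up data): two model-light tests of H0: F = G,
# i.e., that the treatment and control distributions are the same.
import numpy as np
from scipy.stats import mannwhitneyu, permutation_test

rng = np.random.default_rng(2)
control = rng.normal(0.0, 1.0, 25)       # illustrative samples
treatment = rng.normal(0.5, 1.0, 25)

# Wilcoxon rank-sum (Mann-Whitney) test: rank-based, no normality assumption.
print(mannwhitneyu(treatment, control, alternative="two-sided").pvalue)

# Randomization (permutation) test on the difference in means: the p value
# comes from re-randomizing group labels rather than from a parametric model.
result = permutation_test(
    (treatment, control),
    lambda x, y: np.mean(x) - np.mean(y),
    permutation_type="independent",
    n_resamples=9999,
)
print(result.pvalue)
```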
Okay. Well, one way we could maybe start to think about ending our time here is with a question that sums up some of the back and forth: what aspects of this dialogue about p values and significance testing, in recent years and not just today, have been the most constructive, and which have been the least constructive? At the outset, we did mention how much attention has been put into this in the various outlets. Are we moving the needle, making an impact with this kind of dialogue? And what cautions should we take when we start to try to gain consensus, or to talk about differences?
I'd be glad to jump in. The discussion has constructively brought forward some welcome reforms, like preregistration, testing by replication, and calls against cookbook uses of statistics, and has really given us reason to show why error control and controlled trials are so important. But some of the reforms are radical and even obstruct practices known to promote replication. The least constructive are the strawperson arguments used to reject statistical significance tests. To me, looking at it as an outsider, as a philosopher of science, what used to be well known fallacious definitions of p values have in some circles appeared as their actual or purported definitions, and I think we're in serious danger of losing a critical standpoint that is vital to science. A large part of the blame for lack of replication, it's agreed, can be traced to biases encouraged by the reward structure, and it seems to me that the mindset we have makes for a highly susceptible group; when those with professional power use questionable arguments, it only reinforces any existing tendencies practitioners have to use questionable methods in their own work.
Jim, David, what's been constructive and what maybe hasn't been so constructive about these kinds of conversations?
My own view is that the whole discussion has been very constructive. And the part I like, that I think is most constructive, is that people are really trying to understand what things mean now, whereas before, say, a p value would just be used automatically, without any thought as to what it was. Now people are actually trying to understand what it means, and the same thing with Bayesian methods: people are trying to understand what they mean. I view that attempt to understand as a wonderful improvement, much better than it was, say, 5 or 10 years ago.
So, I guess it's my turn. I'll make it short. I agree with what Jim said. I think that having the different sides and letting the arguments come forth is a good thing.
Well, I think each of you has done a great job representing your position on this topic, and I believe it's going to help us think through where we are, where we might want to go, and how it might change our practice. I want to thank you so much for that; it's been a lot of fun, and I appreciate your participation. Thanks to all the viewers out there, and I want to wish all of you a good afternoon. Thank you so much.