Replication crisis was nominated as a Social sciences and society good article, but it did not meet the good article criteria at the time (March 28, 2022). There are suggestions on the review page for improving the article. If you can improve it, please do; it may then be renominated.
In August 2015 the Open Science Collaboration (based in the Center for Open Science) published a paper in Science [1] (the paper appears to be open access), in which they report the outcomes of 100 replications of different experiments from top cognitive and social psychology journals. Depending on how they assessed replicability (e.g. independent p values, aggregate (meta-analytic) data, or subjective judgement), they report replicability of social psychology studies between 23% (JPSP, p values) and 58% (PsychSci, meta-analytic), and between 48% (JEP, p values) and 92% (PsychSci, meta-analytic) for cognitive studies. The paper is (to my judgement) very carefully constructed and very thorough. These percentages are not easy to interpret, by the way, as there is hardly any data from other fields about replication success rates. The only indications come from cell biology (see the Science paper), where they are talking about percentages as low as 11% to 25% (probably based on p values alone). If this is indicative of all sciences (though I would not hazard that extrapolation), it appears that psychology is neither much worse nor much better than most. But that would be my own original interpretation and hence not useful for Wikipedia.
I think we should construct a brief section on the outcomes of this programme / paper for this article. I will think about it - but it may take some time (busy) and should be done with due attention to nuance, anyone else is welcome to start it. Arnoutf ( talk) 14:27, 30 August 2015 (UTC)
I'm working on updating this page; draft edits can be found in my sandbox. Pucla ( talk) 17:41, 13 November 2015 (UTC)
We define QRPs as "while not intentionally fraudulent, involve capitalizing on the gray area of acceptable scientific practices or exploiting flexibility in data collection, analysis, and reporting" and then we list a bunch of gray areas...and "falsifying data".
I would argue that many researchers don't/didn't even realize that, e.g., selective stopping brought in significant bias (the simulation sketch after this comment illustrates the inflation), and I'm assuming it hasn't been traditionally prohibited. Falsifying data seems like a different category altogether. Unlike the others, it's not a gray area. This is noted in the discussion section of the cited article [2] (currently ref 5 in the main article) from which we draw our conclusion: "Although falsifying data (Item 10 in our study) is never justified, the same cannot be said for all of the items on our survey"
It's also worth noting that the statement that "A survey of over 2,000 psychologists indicated that nearly all respondents admitted to using at least one QRP" does not appear to be in the study cited. The highest rate for any individual category is 66.5% ("In a paper, failing to report all of a study's dependent measures"). It certainly seems likely, but I don't see any statement of that.
We can safely say "a majority" while leaving off the "falsifying data" category, as both the "In a paper, failing to report all of a study’s dependent measures" and "Deciding whether to collect more data after looking to see whether the results were significant" categories exceed 50%. (Table/fig 1 in the article).
As such, I'm going to be bold and remove the falsifying data category and change the claim to "a majority". The above is documentation of why and provides a starting point for refutation if someone disagrees with my assessment.
General Wesc ( talk) 16:10, 13 February 2016 (UTC)
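(For what it's worth, the bias from optional stopping is easy to demonstrate. Below is a minimal simulation sketch; the batch size, sample ceiling, and other numbers are illustrative choices, not figures from the John et al. survey.)

```python
# Illustrative simulation: a one-sample t-test "peeked at" after every
# batch of 10 observations, stopping as soon as p < .05. The null is
# true throughout, so every rejection is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, max_n, step = 2000, 100, 10
false_positives = 0
for _ in range(n_sims):
    data = rng.normal(0.0, 1.0, max_n)  # true mean is exactly 0
    for n in range(step, max_n + 1, step):
        if stats.ttest_1samp(data[:n], 0.0).pvalue < 0.05:
            false_positives += 1
            break
print(f"false-positive rate with optional stopping: {false_positives / n_sims:.1%}")
# a fixed-n test would sit near the nominal 5%; peeking pushes it well above
```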
I've removed the "quotes" subsection from the page, as it doesn't really fit within the article as-is: see below for the text I removed. -- Markshale ( talk) 15:15, 6 March 2016 (UTC)
Begin removed text:
End of removed text.
We've got a statement about the prevalence of the problem, followed by a bulleted list of disciplines each followed by a percentage, and then another percentage in brackets, all of which appear to have been either rounded to the nearest 10% or drawn from a conveniently sized sample. How am I supposed to interpret this information? Am I being told that 90% of chemistry papers are not reproducible or that 90% of them are? Is 60% some sort of confidence interval? The whole thing is just nonsensical. -- 81.151.18.242 ( talk) 08:21, 1 August 2016 (UTC)
Please explain how the inability to reproduce the same result in a study comparing subjects wearing body cameras to subjects not wearing body cameras doesn't relate to the research replication crisis? The article cited even explains why they may have been unable to get a supporting result when they did the same experiments (police officers wearing body cameras compared to officers in the same department not wearing cameras). Did you read the articles? Natureium ( talk) 18:04, 21 October 2017 (UTC)
These figures are not suggestive of a reproducibility crisis and appear to have been misunderstood so should not be highlighted in this page.
The usual significance cut-offs (p values), e.g. 0.001 or 0.05, mean that it is completely normal not to be able to reproduce an experiment, which is what the figures refer to.
The p value is usually set at 0.05, which means that a researcher would only have needed to repeat 20 studies in their career to expect an irreproducible result.
The percentage of scientists who fail to reproduce an experiment, without knowing how many experiments they have tried to reproduce, has no meaning; it may just indicate a tendency to re-run studies in different fields. (A short sketch after this comment makes the arithmetic concrete.)
Since the figures are meaningless without context, they give a misleading impression of the failure to reproduce, e.g. that 87% of chemistry experiments are irreproducible.
The source article is fine, and elaborates on the above figures to highlight other issues, for instance how often these failures to reproduce are published. If I had time I would add the entirety of the findings to this page, but the data currently chosen should be removed as they are not indicative of the page topic.
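(A minimal sketch of the arithmetic above; the 80% power figure is an illustrative assumption, not a number from the source article. Even when every studied effect is real, individual replication attempts fail by chance, so over a career nearly everyone will have failed to reproduce something.)

```python
# Assumed 80% power per replication attempt (illustrative): the chance of
# at least one failed replication over a career approaches certainty even
# when every studied effect is real.
power = 0.80
for attempts in (5, 20, 50):
    p_any_failure = 1 - power ** attempts
    print(f"{attempts:>2} attempts -> P(at least one failure) = {p_any_failure:.1%}")
# 5 -> 67.2%, 20 -> 98.8%, 50 -> ~100.0%
```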
Here's an additional reference from behavioral neuroscience: Kafkafi, N; Agassi, J; Chesler, EJ; Crabbe, JC; Crusio, WE; Eilam, D; Gerlai, R; Golani, I; Gomez-Marin, A; Heller, R; Iraqi, F; Jaljuli, I; Karp, NA; Morgan, H; Nicholson, G; Pfaff, DW; Richter, SH; Stark, PB; Stiedl, O; Stodden, V; Tarantino, LM; Tucci, V; Valdar, W; Williams, RW; Würbel, H; Benjamini, Y (April 2018). "Reproducibility and replicability of rodent phenotyping in preclinical studies". Neuroscience and Biobehavioral Reviews. 87: 218–232. doi: 10.1016/j.neubiorev.2018.01.003. PMID 29357292. -- Randykitty ( talk) 15:38, 23 January 2019 (UTC)
The article misses a section about reproducibility crisis in machine learning. -- JakobVoss ( talk) 17:24, 13 September 2019 (UTC)
The paragraph, "Glenn Begley and John Ioannidis proposed these causes for the increase in the chase for significance:
Generation of new data/publications at an unprecedented rate. Majority of these discoveries will not stand the test of time. Failure to adhere to good scientific practice and the desperation to publish or perish. Multiple varied stakeholders.
They conclude that no party is solely responsible, and no single solution will suffice."
has these issues:
It lacks a precise reference. I have found it: Begley, C. Glenn; Ioannidis, John P.A. "Reproducibility in Science: Improving the Standard for Basic and Preclinical Research". Circulation Research. But I have only found the third point in the abstract.
The second point is not a cause but an effect. It remains unclear why the fourth point should be a cause.
Therefore, I suggest dropping this paragraph, and I will do so unless somebody asks to keep it. Werner A. Stahel en ( talk) 13:04, 14 October 2020 (UTC)
This section is difficult to understand and may need further clarification. It is not clear why experimental studies on more complex, nonlinear systems should be less reproducible than studies on simple, linear systems. Appropriately large sample sizes and correct statistical analysis should yield reproducible findings when these actually exist. Biological systems are typically complex and nonlinear, yet they are successfully studied. The two references cited (107 and 108) discuss the complexities and instabilities in psychological research but do not convincingly make the case that such research should not be reproducible. — Preceding unsigned comment added by TailHook ( talk • contribs) 06:56, 16 December 2020 (UTC)
The article currently states that the "replication crisis most severely affects the social sciences and medicine". To make this claim, it is not sufficient to show that replicability is low in the mentioned fields. Instead, a source would have to state that this problem is more prevalent in social sciences and medicine than in others. While two sources are provided, neither of them supports this claim in my opinion. I would therefore argue that this claim should be removed unless more substantial sources can be given.
Jochen Harung ( talk) 18:24, 25 April 2021 (UTC)
In disciplines such as medicine, psychology, genetics and biology, researchers have been confronted with results that are not as robust as they originally seemed. [1]
Some have blamed the reliance on p-values for the replication crises now afflicting many scientific fields. In psychology, in medicine, and in some fields of economics, large and systematic searches are discovering that many findings in the literature are spurious. [2]
...that I recently added. Please compare the hat note to that of Reproducibility. Thanks CapnZapp ( talk) 11:27, 26 October 2021 (UTC)
Reviewer: Tommyren ( talk · contribs) 00:19, 31 December 2021 (UTC)
I am excited to review this article!
Before evaluating the article based on the good article standards, I'll just be listing a few thoughts that struck me as I read through the article. Some, or perhaps all of them, may go beyond the good article standards.
1. "The replication crisis most severely affects the social and medical sciences." This statement is not supported by the inline citation that follows it, as the source only says that replication is a problem in social and medical sciences, but does not necessarily say that it is most severe in social and medical sciences. A similar issue has been discussed in the section "Sources for most impacted fields." However, from citation #10, the Fanelli article, it seems that we just might argue that the medical sciences are most severely affected. From citation #79, the Duvendack et al. article, it seems that we are argue that the economic sciences are more severely affected than others.
2. Why does Ioannidis's (2005) paper deserve to be screenshotted but not other people's papers? I'm not saying that we should delete the image. In fact, to me the paper's title is quite effective at piquing my interest for the whole wikipedia article. But I just want to make this comment here in case others come up with a better image. — Preceding unsigned comment added by Tommyren ( talk • contribs) 14:34, 31 January 2022 (UTC)
I've done about a third of the background section now. The plan is to have an explanation of replication/reproducibility and its importance (done), an explanation of what the replicability crisis is and how it fits into the scientific process (not done), and potentially some explanation of significance and effect size testing (not done, still figuring out how necessary it would be). -- Xurizuri ( talk) 10:27, 1 February 2022 (UTC)
1. Last paragraph needs citations. — Preceding unsigned comment added by Tommyren ( talk • contribs) 03:04, 18 February 2022 (UTC)
2. The difference between systematic and conceptual replication could be made a bit clearer. — Preceding unsigned comment added by Tommyren ( talk • contribs) 00:32, 6 March 2022 (UTC)
1. This section contains a lot of information on causes of the replication crisis, including QRPs and the disciplinary social dilemma. Information of a similar vein is given later in the "Causes" section. I wonder if it would be a good idea to take information about causes of the crisis from this section and move it to the "Causes" section. I think this may make the article feel less repetitive. Similarly, this section also contains a lot of information on potential remedies for the replication crisis, such as the discussion on whether to invite the original author into replication efforts and on result-blind peer review. This should go into the "Remedies" section. Also, perhaps the "methodological terrorism" controversy can go under the "Consequences" section.
2. "Several factors have combined to put psychology at the center of the controversy." This sentence is misleading to me because it seems to say that the replication crisis is most severe in the field of psychology. However, if you go into the source for this sentence, it actually says largely the opposite--other fields could have replication crises just as severe. So far, it seems to me that psychology is at the center of the controversy largely because it has received the most scholarly and media attention, not necessarily because the crisis is particularly bad in this field.
3. "Social priming." Maybe we should explain what social priming is. Also, I am not 100% sure that The Chronicle of Higher Education is a reliable source.
4. I don't think we should mention the "Psycho-babble" report by the Independent. "Psycho-babble" is such a colloquial word and can mean different things to different people. Is the Independent invalidating all aspects of non-replicable research? Would that be fair?
5. "Early analysis of result-blind peer review, which is less affected by publication bias, has estimated that 61 percent of result-blind studies have led to null results, in contrast to an estimated 5 to 20 percent in earlier research." This sentence appears twice in the article. The same information is also presented under the Remedies section. I feel that we can just keep the latter.
6. "First open empirical study." I saw nothing in the source suggesting that this is the first of such studies.
7. "Replications appear particularly difficult when research trials are pre-registered and conducted by research groups not highly invested in the theory under questioning." I do not see how this sentence fits in the paragraph. The first sentence of the paragraph seems to indicate that replication is an issue in psychology partly because some of the theories tested may not be tenable, whereas pre-registration and researcher investment seem more related to the issues of QRPs.
8. "p-hacking." It may be helpful to give a brief definition of what p-hacking is.
9. What exactly is BWAS? Tommyren ( talk) 14:27, 15 May 2022 (UTC)
1. The part on commonalities of unreplicable papers could go under the "Causes" section.
2. "A survey on cancer researchers found that half of them had been unable to reproduce a published result." For reasons described above, this is to be expected and does not necessarily show that a replication crisis exists.
3. "Flaws." What flaws is the word referring to?
4. What is the purpose of the block quote? — Preceding unsigned comment added by Tommyren ( talk • contribs) 19:51, 31 January 2022 (UTC)
5. Does cancer research merit its own tiny little section?
1. The part on globalization seems to me to be a "Cause" of the crisis and does not belong in the "Scope" section.
1. The part about the fragility of econometrics seems to me to be a "Cause" of the crisis and does not belong in the "Scope" section.
1. As explained by others in this talk page (see comments by Lavateraguy, also see section on "Failure to reproduce figures in 'Outline' section misleading"), the fact that many scholars have encountered unreplicable studies is in fact expected and not necessarily problematic. I personally do not see it as necessary to include the first sentence of this section.
2. "Only a minority." Do we have a concrete percentage for this?
3. "The authors also put forward possible explanations for this state of affairs." Would it be possible to elaborate on what these explanations are, especially in terms of why unreplicable studies are cited more?
1. "Generation of new data/publications at an unprecedented rate." According to the original source, it does not seem to be a "trigger" of a crisis but seems more like something that makes things worse.
2. "A success and a failure." I understand it to mean "a successful and a failed attempt at finding evidence in support of the alternative hypothesis." Is this correct? Maybe we can clarify this.
1. What is scientometrics?
2. I would suggest changing the title of this section to just "Historical Roots," and we should move the arguments by Mirowski, "a group of STS scholars," and Smith to before this section, together with other recently published sources, because these works were published rather recently.
3. "Attention." What exactly is this word referring to?
4. Should the five numbered points go under "Historical and sociological roots?"
1. It would be nice if there could be a proper citation for Ravetz's book.
2. I have a general feeling that this section could be more clearly structured and written. While the opening paragraphs connect nicely to the more theoretical discussions in the last section, they do prevent readers from getting straight away what the publish or perish culture really is.
1. "They consist of applying different methods of data screening, outlier rejection, subgroup selection, data transformations, models, concomittant variables, and alternative estimation and testing methods, and finally reporting the variety that produces the most significant result." This sentence is becoming really confusing. It also seems a little repetitive considering that the following sentence comes in the next paragraph: "Examples of QRPs include selective reporting or partial publication of data (reporting only some of the study conditions or collected dependent measures in a publication), optional stopping (choosing when to stop data collection, often based on statistical significance of tests), post-hoc storytelling (framing exploratory analyses as confirmatory analyses), and manipulation of outliers (either removing outliers or leaving outliers in a dataset to cause a statistical test to be significant)"
2. "However, most scholars acknowledge that fraud is, perhaps, the lesser contribution to replication crises." There is a "who" tag that needs addressing.
3. "Serious." I know the original author also used "serious," but a reader might wonder in what sense is fraud "serious." Is it the most morally reprehensible? Does it lead to the most wrong study results?
4. "Positive and negative controls" Would it be clearer to just say control here? How important is it for readers to know the difference between positive and negative controls? If it is important, should we explain what the two terms mean?
5. What is confirmation bias?
6."Some examples of QRPs..." This sentence may suffer from overciting. Also, the jargons would preferrably have in-text explanations.
7. "A 2012 survey of over 2,000 psychologists..." Given the critique on survey methods, I am leaning towards not including this source in this article at all. Tommyren ( talk) 03:19, 7 June 2022 (UTC)
1. "According to a 2018 survey of 200 meta-analyses, 'psychological research is, on average, afflicted with low statistical power'."Are we using British or American English in this article? Sometimes I see periods/commas within quotation marks, and sometimes I see them outside quotation marks.
1. I suspect the methodological terrorism incident is not very representative of the entire scholarly community. Can we add more information to this section?
1. "Amgen Oncology's cancer researchers were only able to replicate 11 percent of 53 innovative studies they selected to pursue over a 10-year period; a 2011 analysis by researchers with pharmaceutical company Bayer found that the company's in-house findings agreed with the original results only a quarter of the time, at the most. The analysis also revealed that, when Bayer scientists were able to reproduce a result in a direct replication experiment, it tended to translate well into clinical applications; meaning that reproducibility is a useful marker of clinical potential." Maybe this information should go under the "In Medicine" section?
1. I wonder if the second paragraph is needed. There seem to be no sources for it, and most of the information occurs elsewhere. What are the CONSORT and EQUATOR guidelines, anyway?
1. From what is currently in the article, it's a little hard to tell the difference between result-blind peer review and pre-registration.
1. What do the Bayesian methods refer to? Bayes is mentioned three times in this section. Are they referring to the same methods?
2. What logical problems is "The Problem with p-values" referring to?
1. "Unless software used in research is open source, reproducing results with different software and hardware configurations is impossible." This doesn't sound right.
2. Do we need the CERN example?
1. "The null hypothesis (the hypothesis that the results are not reflecting a true pattern) is rejected when the probability of the null hypothesis being true is less than 5%" This is not true. The null hypothesis is either 100% true or 100% false.
Tommyren ( talk) 00:19, 31 December 2021 (UTC)
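(A minimal sketch of the point in item 1 above; the prior and power values are illustrative assumptions. A p-value threshold fixes P(significant | H0 true) at 5%; it says nothing directly about P(H0 true | significant).)

```python
alpha = 0.05   # P(significant | H0 true), fixed by the test
power = 0.80   # P(significant | H0 false), an assumed value
for prior_h0 in (0.50, 0.90):  # assumed share of tested hypotheses where H0 holds
    p_sig = alpha * prior_h0 + power * (1 - prior_h0)  # total P(significant)
    p_h0_given_sig = alpha * prior_h0 / p_sig          # Bayes' rule
    print(f"prior P(H0) = {prior_h0:.0%} -> P(H0 | significant) = {p_h0_given_sig:.1%}")
# prior 50% -> ~5.9%; prior 90% -> ~36.0%. The 5% cut-off was never
# "the probability that the null hypothesis is true".
```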
Tommyren, Marisauna, where does this nomination stand? As far as I can tell, Maurisauna has never edited the article at all, much less to address any of the issues raised in this nomination. Xurizuri has made a few edits to address one or more issues above, and Tommyren has made various edits over the past four weeks since the article was nominated. If there isn't anyone who is able to address the issues raised last month in a timely manner, then the nomination should be closed. Thank you. BlueMoonset ( talk) 01:35, 29 January 2022 (UTC)
I'm noticing that a fair amount of the article is based on blogs and opinion pieces. Those aren't inherently not-RS, because they're written generally by experts and the statements are being attributed (now). But I am concerned about the amount of the article that is based on them, especially when the article is also under-using reviews and meta-analyses published in reliable journals (Begley & Ioannidis, Shrout & Rodgers, and Stanley, Carter & Doucouliagos spring to mind). I also have assorted other concerns, some of which I am planning to address, including:
And then also the ones you've mentioned. I could address all of these given a few days, but this is starting to give me the vibes of doing a uni assignment at the last minute. And I'm not currently a uni student for a reason. Some of this may be above the needs of a good article, I'm really not sure, but that's why I'm not reviewing :) I'm going to add comments under yours above explaining what my plans are for it. That way, you can make a proper decision about how much work there is left, and this can be a complete record of recommendations. -- Xurizuri ( talk) 14:15, 2 February 2022 (UTC)
Tommyren, since you established nearly two months ago that the nomination will not pass without significant additional work, it's time to fail the nomination. You are certainly welcome to continue working on the article and posting here as to issues that will need to be addressed prior to any subsequent nomination, but the end result is clear. Thank you for your detailed work. BlueMoonset ( talk) 17:35, 27 March 2022 (UTC)
In the section "Historical and philosophical roots", sentence in the last paragraph: "This theory holds that each "system" such as economy, science, religion or media communicates using its own code: true/false for science, profit/loss for the economy, new/no-news for the media, and so on." (new->news?) -- S.POROY ( talk) 12:53, 30 January 2022 (UTC)
I've noticed that some people are adding content to the Prevalence section that is not about quantitative measures of replicability and QRPs, likely because other sections are not subdivided by field. – LaundryPizza03 ( d c̄) 02:17, 24 March 2022 (UTC)
In the field of metascience, a growing number of recent publications are concerned with how theoretical as opposed to methodological or statistical shortcomings might be the cause of low replication rates, at least in psychological science. Some of these even talk about a "Theory-crisis" in psychology. I was wondering if it would make sense to create a separate subsection under "Causes" to report on the considerations that have been made in this area of study concerning the replication crisis. Examples of these publications are:
Fiedler, K. (2017). What constitutes strong psychological science? The (neglected) role of diagnosticity and a priori theorizing. Perspectives on Psychological Science, 12(1), 46-61. https://doi.org/10.1177/1745691616654458
Oberauer, K., & Lewandowsky, S. (2019). Addressing the theory crisis in psychology. Psychonomic Bulletin & Review, 26(5), 1596-1618. https://doi.org/10.3758/s13423-019-01645-2
Oude Maatman, F. (preprint). Psychology's Theory Crisis, and Why Formal Modelling Cannot Solve It. PsyArxiv. https://doi.org/10.31234/osf.io/puqvs
Szollosi, A., & Donkin, C. (2021). Arrested theory development: The misguided distinction between exploratory and confirmatory research. Perspectives on Psychological Science, 16(4), 717-724. https://doi.org/10.1177/1745691620966796
ProgressiveProblemshift ( talk) 16:43, 15 May 2023 (UTC)
In the section "Background", the explanation of how NHST works ends by saying "Although p-value testing is the most commonly used method, it is not the only method.". This sentence is missing a reference, but on top of that I would argue it is not very clear. It raises the question: "The only method for what?". Given the content of that paragraph one could say it most likely means "not the only method to test significance", but since the page is on replication, I'd say that it would make more sense if it referred to methods to establish whether findings were successfully replicated in general. In such a case, I have a good reference in Nosek et al. (2022) where a small section at the beginning is dedicated to defining when we can say that original findings were replicated (i.e. "How do we decide whether the same occured again?", p. 722), where the authors describe different methodologies and criteria by which replications are defined as successful. I'd love to do it myself, but I would need a confirmation that this edit makes sense!
Nosek, B. A., Hardwicke, T. E., Moshontz, H., Allard, A., Corker, K. S., Dreber, A., Fidler, F., Hilgard, J., Struhl, M. K., Nuijten, M. B., Rohrer, J. M., Romero, F., Scheel, A. M., Scherer, L. D., Schönbrodt, F. D., & Vazire, S. (2022). Replicability, robustness, and reproducibility in psychological science. Annual Review of Psychology, 73, 719-748. https://doi.org/10.1146/annurev-psych-020821-114157
ProgressiveProblemshift ( talk) 13:04, 22 May 2023 (UTC)
In the subsection "Result-blind peer review", it is reported that "more than 140 psychology journals have adopted result-blind peer review". I believe this statement is wrong. If one checks the website article that's cited, the author says that 140 journals at the time were using registered reports (which imply result-blind peer review). She says "journals" without mentioning specific disciplines. If one checks the source she cites, it's the COS web page on registered reports, so I assume that to come up with the 140 number she probably checked the COS's page on TOP scores for journals. Consulting that page, presently only 46 psychology journals adopt some form of registered reports. The statement should be changed or just deleted since, I believe, it's misleading and incorrect. Here one can see the stats of psychology journals when it comes to adopting registered reports: https://topfactor.org/journals?factor=Registered+Reports+%26+Publication+Bias&disciplines=Psychology&page=3 ProgressiveProblemshift ( talk) 14:14, 14 July 2023 (UTC)
The paragraph containing the sentence 'It is a mathematical impossibility.' is nonsense. It assumes that a study has a fixed power, namely 30%, and then tries to use this to make a deduction. But a study only has a power level relative to an effect size. If the real effect size is much larger than the one used to calculate the power level, "regardless of which hypothesis is true, the study will reject the null hypothesis with probability 30%" is just false. (The power calculation sketched after this comment illustrates the dependence.)
This paragraph is also not backed up by the given citation at all.
Not a regular wiki editor but I am killing this para! 51.148.179.136 ( talk) 08:00, 31 July 2024 (UTC)
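(A minimal sketch of the point above, using statsmodels; the sample size and alpha are illustrative assumptions. The same study design has very different power against different true effect sizes, so no fixed rejection probability holds "regardless of which hypothesis is true".)

```python
# Power of a two-sample t-test with n = 30 per group at alpha = .05
# (an illustrative design), computed against three different true effects.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.2, 0.5, 0.8):  # Cohen's d: small, medium, large
    power = analysis.power(effect_size=d, nobs1=30, alpha=0.05)
    print(f"d = {d}: power = {power:.0%}")
# roughly 12%, 47%, 86%: "30% power" names no property of the study alone,
# only of the study paired with an assumed effect size.
```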