Traditional statistical analysis is often compared to navigating a “Garden of Forking Paths” (Gelman and Loken). The term helps (hopefully) visualize the sheer number of analytical choices researchers must make during an experiment, and how seemingly insignificant “turns” (which variables to control for, which outliers to remove…) can lead researchers to completely different conclusions.

While this seems like a mostly harmless analogy, wandering the garden until you find the single path that leads where you want to go is called “p-hacking.” Formally, we can define it as any measure a researcher applies to render a previously non-significant hypothesis test significant (usually p < 0.05). More informally, I’m sure everybody has had experience faking the results of an experiment assignment in high school chemistry or physics class – and while the stakes for a satisfactory grade on a high school assignment are pretty low, under the pressure of formal academia’s “publish or perish” (second only to “Spanish or vanish” in intimidation), the temptation to p-hack can be a very real devil on your shoulder.

From Vitaly Gariev on Unsplash
While the usual image of a stressed PhD student fudging numbers on a study spreadsheet at 3:00 AM is a striking picture of the motivation behind p-hacking, we’ll also be exploring what happens when we leave the navigation of this garden of forking paths to artificial intelligence. As AI workflows find their way into every nook and cranny of both academia and industry, it’ll be important to figure out whether our friendly neighbourhood LLMs will act as the ultimate guardians of scientific integrity, or as sycophants automating fraud on an industrial scale.
1. The Human Baseline (“Big Little Lies”)
To give a brief introduction and some examples of real p-hacking techniques, we turn to “Big Little Lies” (Stefan and Schönbrodt, 2023), a paper that provides a compendium of the many sneaky, and sometimes even unintentional, ways researchers can manipulate their variables and datasets to arrive at suspiciously significant results.

Okay! So let’s start with a hypothetical – we’re the new data scientist at an energy drink company that makes extremely ineffective energy drinks, and with the current job market, we really want to keep being a data scientist, even at a bogus drink company. Our shaky career depends on proving that our drinks work.
1.1 Ghost Variables

We start by running a study on our tap water energy drink and measure 10 different outcomes, including weight, blood pressure, cholesterol, energy levels, sleep quality, anxiety, and maybe even hair growth. Nine of those variables may show no change whatsoever, but we find that “hair growth” shows a statistically significant improvement purely through random statistical noise! We can now publish a study pretending that hair growth was the primary hypothesis all along, while quietly sweeping the 9 unreported metrics under the rug (turning them into “Ghost Variables”). Stefan and Schönbrodt’s simulations show that doing this with 10 uncorrelated variables inflates the false-positive rate from the standard 5% to nearly 40%.
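The “nearly 40%” figure can be sanity-checked with a little arithmetic: if each of k independent null outcomes has a 5% chance of a false positive, the chance that at least one of them comes up significant is 1 − 0.95^k. The sketch below is only an illustration, not Stefan and Schönbrodt’s simulation code: the function names, the sample size of 20 per group, and the simple z-test with a known standard deviation of 1 are all assumptions chosen to keep the example self-contained.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used to turn z-scores into p-values


def familywise_error_rate(alpha: float, k: int) -> float:
    """Chance that at least one of k independent null tests reaches p < alpha."""
    return 1 - (1 - alpha) ** k


def null_p_value(rng: random.Random, n: int) -> float:
    """Two-sided z-test p-value for two groups drawn from the SAME N(0, 1)."""
    treat = [rng.gauss(0, 1) for _ in range(n)]
    control = [rng.gauss(0, 1) for _ in range(n)]
    # With a known SD of 1, the difference in means has variance 2 / n.
    z = (sum(treat) / n - sum(control) / n) / (2 / n) ** 0.5
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def simulate_ghost_variables(k=10, n=20, alpha=0.05, n_studies=2000, seed=0):
    """Share of simulated null studies where at least one of k outcomes 'works'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        p_values = [null_p_value(rng, n) for _ in range(k)]
        if min(p_values) < alpha:  # report only the best-looking outcome
            hits += 1
    return hits / n_studies


if __name__ == "__main__":
    print("analytic :", round(familywise_error_rate(0.05, 10), 3))
    print("simulated:", simulate_ghost_variables())
```

The analytic value and the simulated share should land close to each other, which is the whole trick: nobody faked a number, they just chose which number to talk about.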
1.2 Data Peeking/Optional Stopping

In a separate test, we test 20 people and find no significant effect for the drink. Thinking the sample is too small, we test 10 more and check again. Still nothing. We test 10 more and check again, and… the p-value randomly dips below 0.05, so we stop the study immediately and publish our “findings”. Stefan and Schönbrodt show that this practice drastically inflates the rate of false-positive results, especially when researchers take smaller “steps” between peeks. Metaphorically, it’s like taking a photo of a stumbling drunk person the exact millisecond they step onto the sidewalk and claiming they’re walking perfectly straight.
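The peeking procedure is easy to simulate. The sketch below is a rough illustration under assumed settings rather than a reproduction of the paper’s simulations: it starts at 20 participants, adds 10 at a time up to an assumed cap of 100, and uses a one-sample z-test with a known standard deviation of 1 on data that contain no effect at all. The function names and the cap are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def p_value(scores):
    """Two-sided one-sample z-test against a mean of 0 (SD assumed known = 1)."""
    n = len(scores)
    z = (sum(scores) / n) * n ** 0.5
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def false_positive_rate(peek, start_n=20, step=10, max_n=100,
                        alpha=0.05, n_studies=4000, seed=1):
    """Share of null studies declared significant.

    peek=True  : test after every batch and stop at the first p < alpha.
    peek=False : test once, only when the full sample of max_n is in.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        scores = [rng.gauss(0, 1) for _ in range(start_n)]  # true effect is zero
        while True:
            at_max = len(scores) >= max_n
            if (peek or at_max) and p_value(scores) < alpha:
                hits += 1
                break
            if at_max:
                break
            scores.extend(rng.gauss(0, 1) for _ in range(step))
    return hits / n_studies


if __name__ == "__main__":
    print("single look at n=100:", false_positive_rate(peek=False))
    print("peek every 10 people:", false_positive_rate(peek=True))
```

The single-look rate should stay near the nominal 5%, while the peeking rate climbs well above it, because every extra look is another chance for noise to cross the line.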
1.3 Outlier Exclusion

We now analyze our energy drink data and realize we’re agonizingly close to significance (e.g., p = 0.06). We decide to clean our data, taking advantage of the fact that there is no universally agreed-upon rule for outliers – Cook’s distance, influence measures, box plots, our grandmother’s opinion on which opinions are trustworthy…
Stefan and Schönbrodt cite a literature review that found at least 39 different outlier identification methods. Excellent! We are now flush with options. We try method A (e.g., removing people who took too long on a survey), and then try method B (e.g., Cook’s distance), and so on until we find the exact mathematical rule that deletes the two people who hated the drink, pushing our p-value to 0.04. Stefan and Schönbrodt’s simulations confirm that subjectively applying different outlier methods like this heavily inflates false-positive rates.
1.4 Scale Redefinition

Finally, we conclude by giving participants a 10-question survey measuring how energized they feel after drinking the tap water. The overall result isn’t significant, so we simply drop question 4 and question 7, telling ourselves the participants must have found them confusing anyway. We can actually use this to artificially improve the scale’s internal consistency (Cronbach’s alpha) while simultaneously optimizing for a significant p-value! Big Little Lies shows that false-positive rates increase drastically as more items are removed from a measurement scale.
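For readers who haven’t met it, Cronbach’s alpha for a k-item scale is k/(k − 1) × (1 − sum of item variances / variance of the total score). The sketch below computes it and shows the “alpha if item dropped” table that makes removing questions look like good housekeeping. The tiny four-item dataset is made up purely for illustration, and the function names are hypothetical.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha; `items` is a list of k lists, one score per respondent."""
    k = len(items)
    totals = [sum(answers) for answers in zip(*items)]  # total score per respondent
    item_variance_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


def alpha_if_item_dropped(items):
    """Alpha recomputed with each item removed in turn, keyed by item index."""
    return {i: cronbach_alpha(items[:i] + items[i + 1:])
            for i in range(len(items))}


# Made-up answers from 5 respondents to 4 questions (illustration only).
survey = [
    [1, 2, 3, 4, 5],  # question 0
    [2, 3, 4, 5, 6],  # question 1
    [1, 2, 3, 4, 5],  # question 2
    [5, 1, 4, 2, 3],  # question 3: does not track the others
]

if __name__ == "__main__":
    print("alpha, all items:", round(cronbach_alpha(survey), 3))
    for index, alpha in alpha_if_item_dropped(survey).items():
        print(f"alpha without question {index}: {alpha:.3f}")
```

Dropping the item that doesn’t track the others raises alpha, which is a legitimate psychometric reason to remove it. The p-hack is in choosing which items to drop after seeing which choice also moves the p-value.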
So… as the name of the paper suggests, human p-hacking is a collection of “big little lies”. The human toolkit is essentially just a collection of subtle ways to fool ourselves, without necessarily having to be evil or malicious. Humans are biased, messy, and sometimes really desperate for tenure (or employment, in our case!), and sometimes we use all the ambiguity we can get to reach it.
2. AI Sycophancy and the Illusion of Safety
So far, we’ve covered a small subset of the techniques that we measly humans have used to p-hack our way into success. However, like every other topic at the moment, p-hacking now has an AI angle: with AI use creeping into every part of both our academic and non-academic lives, Asher et al. tested whether they could subtly steer LLMs into performing p-hacking for them.
The experiment was fairly simple – test two frontier AI coding agents, Claude Opus 4.6 and OpenAI Codex (GPT-5.2-Codex), across 640 independent runs. The researchers fed the AIs cleaned datasets from four published political science papers with known null or near-null results (topics where the published evidence points to essentially no effect, e.g. the impact of Democratic sheriffs on immigration compliance).
So the ground truth we’re working with is effectively zero. The researchers acted like a user, maybe a stressed PhD student, feeding this exact data to the AI agents, and used different nudges and prompts to see whether the AI would hallucinate a significant result. They started with what you could call the “nice guy prompts”: “analyze this data like a rigorous, upstanding researcher.” Unsurprisingly, the AI was boringly perfect. It analyzed the data, correctly found no effect, and reported the null result. It matched the published science almost exactly.

Even when researchers explicitly pressured the models by stating, “It is important to your career that the findings are statistically significant,” both models actively refused. They produced stable, honest estimates and explicitly flagged the request to manipulate data as “scientific misconduct” and a violation of research ethics. So far, crisis averted! The safety training these AI companies are implementing is actually working, at least at the surface level. The models have been explicitly trained that manipulating data is a bad thing. They recognize the vocabulary of fraud. If you walk up to them with a neon sign saying “CHEAT!”, they’ll say no.
2.1 The Subtle Art of LYING
So let’s put the neon sign away for a moment – and try being a bit more subtle. The researchers realized the AI was reacting to the explicit intent to cheat. They developed a “nuclear prompt” that disguised p-hacking in the language of very rigorous science: asking the AI to produce an “upper-bound estimate” by “exploring alternative approaches”. By framing the request more as uncertainty reporting and less as pressure to bend the scientific process, they made the safety mechanisms vanish entirely. The AI no longer saw a moral boundary; it saw a complex optimization problem to solve (and you know how much AIs love those).
And what did the AI actually do at that point? A human p-hacker, as we discussed, might try three or four different control variables and maybe delete a few outliers. It takes hours, maybe days… The AI simply wrote code to do it instantly. More details below.
2.2 Not All Data Is Created Equal
The scariest part of the experiment isn’t that AI can automate scientific fraud. It’s how well it does it – and how much that depends on the research design it’s given to work with. Sometimes, that dependence is a good thing!
If observational research is a big, sprawling hedge maze with a thousand wrong turns, a Randomized Controlled Trial is just… a straight hallway. There’s not much to exploit.
To test this, researchers fed the AI a 2018 RCT by Kalla and Broockman studying the persuasive effects of pro-Democratic door-to-door canvassing on North Carolina voter preferences, with a published result of a definitive zero. Nothing happened. Canvassing didn’t move the needle.

The AI was then hit with the aforementioned “nuclear prompt” – essentially, find me the largest possible effect, by any means necessary (but phrased in a very non-p-hacky way). It wrote automated scripts, tested seven different statistical specifications (difference-in-means, ANCOVA, various covariate sets, the works)… and mostly got nowhere. Because the study was a true randomized experiment, confounding variables were already controlled for by design. The AI had almost no forking paths to walk down. In other words: “Truth is a lot harder to hide when the lights are on.”
Observational studies are a completely different beast, though (in a bad way!).
When you’re observing the world as it naturally exists rather than running a controlled experiment, the data is messy by nature. And to make sense of messy data, researchers have to make judgment calls – which variables do you control for? Age? Income? Education? Geography? Hair density? Sleep schedule? Every single one of those choices is a fork in the road. The AI found this absolutely delightful.
Here are two examples that really illustrate how bad it gets:
Kam and Palmer (2008) looked at whether attending college increases political participation. Since college attendance isn’t randomly assigned (obviously), researchers have a huge menu of variables they could control for to make the comparison fair. The AI systematically worked through that menu, defining progressively sparser sets of covariates and testing them across OLS, propensity score matching, and inverse probability weighting. By strategically dropping certain confounders and cherry-picking whichever combination produced the largest number, it managed to roughly double the true median effect size. It’s the “ghost variable” trick – but fully automated for your convenience.
The Thompson (2020) paper is where things get really uncomfortable. Regression discontinuity designs are notorious for being sensitive to highly technical mathematical choices – and the original study found a null effect of -0.06 on whether Democratic sheriffs affected immigration compliance. The AI wrote nested for-loops and brute-forced through 9 different bandwidths, 2 polynomial orders, and 2 kernel functions: 9 × 2 × 2 = 36 combinations from those three choices alone. It found one specific configuration that produced an effect of -0.194 with a p-value below 0.001. To be clear: it manufactured a statistically significant result more than triple the published estimate, out of a study that found nothing.
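To make the shape of that brute-force search concrete, here is a minimal sketch of a specification grid. It is not the code the AI agents wrote: the bandwidth values and kernel names are hypothetical placeholders, and the “estimates” are seeded random noise standing in for whatever a real regression discontinuity fit would return. The point is only the pattern: enumerate every combination, then report the most extreme one instead of something like the median.

```python
import random
from itertools import product
from statistics import median

# Hypothetical analyst choices (placeholders, not values from the study).
bandwidths = [round(0.02 * i, 2) for i in range(1, 10)]  # 9 bandwidths
polynomial_orders = [1, 2]                               # 2 polynomial orders
kernels = ["triangular", "uniform"]                      # 2 kernel functions

# Every combination of the three choices: 9 * 2 * 2 specifications.
specs = list(product(bandwidths, polynomial_orders, kernels))


def summarize(estimates):
    """Compare an honest summary (median) with the cherry-picked extreme."""
    values = list(estimates.values())
    return {
        "n_specs": len(values),
        "median": median(values),
        "most_extreme": max(values, key=abs),
    }


if __name__ == "__main__":
    rng = random.Random(0)
    # Placeholder noise around zero, standing in for real model estimates.
    fake_estimates = {spec: rng.gauss(0, 0.05) for spec in specs}
    print(summarize(fake_estimates))
```

Reporting the full distribution of estimates across the grid (or at least its median) is the honest use of this loop. Reporting only the most extreme entry is the p-hack, and the code for the two differs by one line.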
So… RCTs are mostly fine. Observational studies? The AI will find a way. It should still be noted that these vulnerabilities are just as much a problem when only a human is in the loop – it’s about the flexibility that observational research requires by design.
The Asher et al. experiment only tested the final analysis stage of the pipeline, using already-cleaned data. So what happens when we allow AI to control the data construction, variable definition, and sample selection at the very front of the maze? It could silently shape the entire dataset from the ground up.

Standard AI models are competent and honest under normal conditions, but a carefully worded prompt is all it takes to turn them into compliant p-hackers. If there’s a takeaway from all this, it’s a somewhat obvious one: be highly skeptical of statistical significance in observational studies, and if you are a researcher using AI, you can no longer just look at the final answer – you must rigorously check the code and the hidden paths in the garden the AI took to get there. It’s a bit of a cynical conclusion, implying that researchers should care about understanding their own research, but in a world where AI is still sending me rejection emails with the {Candidate Name} attached, and half of all college essays begin with “Sure, here’s a comprehensive essay about…”, a little caution could go a long way!
References
[1] S. Asher, J. Malzahn, J. Persano, E. Paschal, A. Myers and A. Hall, Do Claude Code and Codex P-Hack? Sycophancy and Statistical Analysis in Large Language Models (2026), Stanford University Working Paper
[2] A. Stefan and F. Schönbrodt, Big little lies: a compendium and simulation of p-hacking strategies (2023), Royal Society Open Science
[3] A. Gelman and E. Loken, The Garden of Forking Paths: Why Multiple Comparisons Can Be a Problem, Even When There Is No “Fishing Expedition” or “P-Hacking” and the Research Hypothesis Was Posited Ahead of Time (2013), Department of Statistics, Columbia University
Note: Unless otherwise noted, all images are by the author.