is no longer whether AI can write code, but whether we can trust the code it writes?
Over the past few years, ChatGPT and other large language models have become increasingly common in the daily workflow of students, analysts, researchers, and data scientists. Many of us have already used AI tools to generate a Python function, debug an error message, automate a repetitive task, or quickly translate code from one language to another.
But there is a major difference between asking ChatGPT to write a small helper function and asking it to implement a complex econometric method.
Can ChatGPT correctly code a Difference-in-Differences model? Can it implement Inverse Probability Treatment Weighting? Can it reproduce a Regression Discontinuity analysis? Can it do this not only in Python, but also in R and Stata?
That is why the article “Can AI write your code? A case study of ChatGPT’s statistical coding capabilities for quantitative research” by Winberg et al. immediately caught my attention. The paper was published online on January 22, 2026, in Health Economics Review. The authors evaluate ChatGPT-4.0 Pro’s ability to generate code for causal inference tasks in Python, R, and Stata, using benchmark solutions from Causal Inference: The Mixtape by Scott Cunningham.
Most articles I had previously read on this topic focused on relatively simple programming tasks: small automations, descriptive statistics, data cleaning, basic data analysis, or code generation in languages such as Python, R, and SAS. This study goes further. It asks whether ChatGPT can support quantitative research in more demanding settings, where the code is not just technical but also methodological.
The authors focus on three widely used causal inference methods:
- Difference-in-Differences, also called Diff-in-Diff;
- Inverse Probability Treatment Weighting, or IPTW;
- Regression Discontinuity, or RD.
In this article, I will walk through the study in a structured way. First, we will present what makes this study different for quantitative researchers. Second, we will review the methodology used by the authors. Third, we will look at how ChatGPT’s performance was evaluated. Finally, we will discuss how the rise of LLMs has changed my own way of working.
What Makes This Study Different?
Many previous studies have evaluated ChatGPT’s coding ability using subjective assessment. In other words, researchers looked at the generated code and judged whether it seemed correct.
That approach is useful, but it has a limitation: it depends heavily on the evaluator’s judgment.
Winberg et al. take a more structured approach. They compare ChatGPT-generated code against standardized reference code and benchmark outputs from Causal Inference: The Mixtape. This allows them to evaluate the code not only based on appearance, but also based on whether it reproduces expected results.
Another important contribution is that the study includes Stata.
This matters because many empirical researchers, especially in economics, public policy, and health economics, still use Stata extensively. However, discussions about AI coding assistants often focus primarily on Python and R. By including Stata, the authors evaluate ChatGPT in a language that is highly relevant for applied econometric research but less frequently analyzed in AI coding studies.
The Methodology Used in the Study
The authors evaluate ChatGPT-4.0 Pro, the paid version of ChatGPT available at the time of the study. Their goal is to measure how well it performs when asked to code causal inference analyses in Python, R, and Stata.
They use publicly available data and problem sets from Causal Inference: The Mixtape. This textbook is widely known in applied econometrics and provides examples with code in R, Stata, and Python. According to the study, the reference environments were R 3.6.0, Stata 18, and Python 3.13.
The authors focus on three causal inference methods:
- Difference-in-Differences;
- Inverse Probability Treatment Weighting;
- Regression Discontinuity.
These methods were chosen because they are commonly used in empirical research and require more than simple syntax generation. They require proper data preparation, model specification, and interpretation of outputs.
The study follows a three-step process.
Prompting ChatGPT With Econometric Problem Sets
The first step is to give ChatGPT problem sets and ask it to generate code for the relevant econometric analyses.
For example, one of the problem sets focuses on Difference-in-Differences. The context is the legalization of abortion in five U.S. states before the nationwide legalization following Roe v. Wade in 1973. The task is to estimate whether early abortion legalization affected gonorrhea incidence among adolescent females aged 15–19.
Instead of using only a simple post-treatment indicator, the prompt asks ChatGPT to use year-by-treatment interactions to capture dynamic treatment effects over time.
This type of prompt is more complex than asking for a basic regression. It requires the model to understand the policy context, identify the treatment indicator, structure the interaction terms, and generate appropriate code.
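To make the idea of year-by-treatment interactions concrete, here is a minimal Python sketch. It does not use the Mixtape data or the authors' code: the numbers are made up, and the names `event_study_did`, `rows`, and `base_year` are hypothetical. With one treated group, one control group, and a full set of year dummies, the coefficient on each treatment-by-year interaction equals the treated-minus-control gap in that year minus the same gap in the omitted base year, so the sketch computes those gaps directly from group means.

```python
from collections import defaultdict
from statistics import mean


def event_study_did(rows, base_year):
    """Return {year: effect} relative to base_year.

    rows: iterable of (year, treated, outcome), with treated equal to 0 or 1.
    Each effect is (treated mean - control mean) in that year minus the
    same gap in base_year, which is what the treatment-by-year interaction
    coefficients estimate in the saturated two-group model.
    """
    cells = defaultdict(list)
    for year, treated, outcome in rows:
        cells[(year, treated)].append(outcome)

    years = sorted({year for year, _ in cells})
    gap = {y: mean(cells[(y, 1)]) - mean(cells[(y, 0)]) for y in years}
    return {y: gap[y] - gap[base_year] for y in years if y != base_year}


# Made-up data: treatment starts in 1971 and lowers the outcome by 2.
rows = [
    (1969, 0, 10.0), (1969, 0, 12.0), (1969, 1, 13.0), (1969, 1, 15.0),
    (1970, 0, 11.0), (1970, 0, 13.0), (1970, 1, 14.0), (1970, 1, 16.0),
    (1971, 0, 12.0), (1971, 0, 14.0), (1971, 1, 13.0), (1971, 1, 15.0),
]
effects = event_study_did(rows, base_year=1970)
print(effects)  # {1969: 0.0, 1971: -2.0}
```

A pre-treatment effect close to zero (1969 here) is the usual informal check on parallel trends. A real analysis would estimate the interactions in a regression so that standard errors are available.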
The authors define similar problem sets for IPTW and RD.
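The paper's IPTW and RD prompts are not reproduced here. As a rough illustration of what an IPTW estimate involves, the sketch below estimates the propensity score as the share of treated units within each level of a single discrete covariate, then weights treated units by 1/p and control units by 1/(1-p). The data and the names `iptw_ate` and `units` are invented, and a real application would typically model the propensity score with a logistic regression instead.

```python
from collections import defaultdict


def iptw_ate(units):
    """Estimate the average treatment effect by inverse probability weighting.

    units: iterable of (stratum, treated, outcome), with treated equal to 0 or 1.
    The propensity score is the treated share within each stratum.
    """
    units = list(units)
    counts = defaultdict(lambda: [0, 0])  # stratum -> [n_units, n_treated]
    for stratum, treated, _ in units:
        counts[stratum][0] += 1
        counts[stratum][1] += treated
    propensity = {s: n_treated / n for s, (n, n_treated) in counts.items()}

    # Normalized (Hajek-style) weighted mean of the outcome in each arm.
    num = {0: 0.0, 1: 0.0}
    den = {0: 0.0, 1: 0.0}
    for stratum, treated, outcome in units:
        p = propensity[stratum]
        if not 0.0 < p < 1.0:
            raise ValueError(f"no overlap in stratum {stratum!r}")
        weight = 1.0 / p if treated else 1.0 / (1.0 - p)
        num[treated] += weight * outcome
        den[treated] += weight
    return num[1] / den[1] - num[0] / den[0]


# Made-up data: the true effect is 2 in both strata, but older units
# are treated more often and also have higher outcomes (confounding).
units = [
    ("young", 1, 5.0), ("young", 0, 3.0), ("young", 0, 3.0), ("young", 0, 3.0),
    ("old", 1, 9.0), ("old", 1, 9.0), ("old", 1, 9.0), ("old", 0, 7.0),
]
ate = iptw_ate(units)
print(round(ate, 6))
```

In this toy data the naive difference in means is 4.0 (treated mean 8.0, control mean 4.0), while the weighted estimate recovers a value of about 2.0.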
Asking for Full Coding Workflows
In the second step, the authors provide more comprehensive prompts. These prompts ask ChatGPT to reproduce fuller coding tasks from The Mixtape, including data management, econometric analysis, and figure generation.
This is important because real research workflows are rarely limited to one model command. A researcher usually has to import data, clean variables, create indicators, estimate models, generate tables, produce plots, and compare results.
By testing full workflows, the authors evaluate whether ChatGPT can handle the practical complexity of applied quantitative work.
Running the Code and Comparing Outputs
In the third step, the generated code is executed in the corresponding programming environment: Python, R, or Stata.
The authors then compare the outputs produced by ChatGPT-generated code with the benchmark outputs from The Mixtape.
How the Prompts Were Generated
One of the most interesting aspects of the study is the way the prompts were designed.
The authors recruited four researchers with advanced expertise in econometric methods. Two held PhDs, and two were PhD candidates. Three researchers were assigned to work with one language each: Python, R, or Stata. The fourth researcher replicated the full process across all three languages to validate the results and assess consistency.
This design is useful because it reflects how researchers might use ChatGPT in practice. Each researcher interacts with the model, generates code, runs it, observes errors, and gives feedback.
However, this also creates a risk. If each researcher writes prompts independently, the results may reflect differences in prompting style rather than differences in ChatGPT’s coding ability.
To reduce this bias, the authors standardized the prompts. They collaboratively developed prompts that were clear, structured, and general enough to apply across tasks. The goal was to provide ChatGPT with enough information to solve the problem without overfitting the prompt to one specific task.
The quality of the output depends heavily on the quality of the prompt. If the prompt is vague, the model may produce generic or incorrect code. If the prompt is too specific, it may perform well on one task but fail to generalize.
A good prompt should provide context, specify the expected method, define the relevant variables, describe the desired output, and clarify any assumptions.
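One way to keep prompts consistent across tasks is to fill a fixed template. The sketch below only illustrates the five elements listed above. The class `AnalysisPrompt`, its field names, the column names, and the example wording are all hypothetical and are not the prompts used by Winberg et al.

```python
from dataclasses import dataclass, field


@dataclass
class AnalysisPrompt:
    """Hypothetical container for the five elements of a structured prompt."""

    context: str
    method: str
    variables: dict          # column name -> short description
    desired_output: str
    assumptions: list = field(default_factory=list)
    language: str = "Python"

    def render(self):
        """Assemble the prompt text in a fixed order."""
        var_lines = "\n".join(f"- {name}: {desc}" for name, desc in self.variables.items())
        assumption_lines = "\n".join(f"- {item}" for item in self.assumptions) or "- none stated"
        return (
            f"Context: {self.context}\n"
            f"Method: {self.method}\n"
            f"Variables:\n{var_lines}\n"
            f"Desired output: {self.desired_output}\n"
            f"Assumptions:\n{assumption_lines}\n"
            f"Write the code in {self.language}."
        )


# Example wording and column names are invented for illustration.
prompt = AnalysisPrompt(
    context="Early abortion legalization in five U.S. states and gonorrhea incidence among females aged 15-19.",
    method="Difference-in-Differences with year-by-treatment interactions",
    variables={
        "incidence": "outcome, cases per 100,000",
        "early_legal": "1 if the state legalized early, else 0",
        "year": "calendar year",
    },
    desired_output="a coefficient table and an event-study plot",
    assumptions=["parallel trends before legalization"],
    language="Stata",
)
print(prompt.render())
```

Because only the `language` field changes between runs, the same task description can be sent for Python, R, and Stata, which limits differences caused by prompting style.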
The Five Performance Indicators
The authors evaluate ChatGPT’s performance using five main outcomes: accuracy, efficiency, error output, editing, and consistency.
Accuracy is measured by comparing the results generated by the ChatGPT-written code with the benchmark outputs from The Mixtape.
The evaluation is binary: if the result matches the benchmark, it is considered accurate. If it does not, it is considered inaccurate.
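A binary accuracy check of this kind can be sketched as a comparison of each estimate with its benchmark value. The numerical tolerance is my own assumption (the study is described only in terms of match or no match), and the function name `matches_benchmark` and all numbers below are made up for illustration.

```python
import math


def matches_benchmark(estimates, benchmark, rel_tol=1e-6, abs_tol=1e-8):
    """Return True only if every benchmark quantity is reproduced.

    estimates, benchmark: dicts mapping a result name to a number.
    A result missing from estimates counts as a mismatch.
    """
    for name, expected in benchmark.items():
        if name not in estimates:
            return False
        if not math.isclose(estimates[name], expected, rel_tol=rel_tol, abs_tol=abs_tol):
            return False
    return True


benchmark = {"coef_treat": -1.2345, "std_err": 0.4321}  # made-up reference values
generated_ok = {"coef_treat": -1.2345000001, "std_err": 0.4321}
generated_bad = {"coef_treat": -1.19, "std_err": 0.4321}

print(matches_benchmark(generated_ok, benchmark))   # True
print(matches_benchmark(generated_bad, benchmark))  # False
```

A tolerance is useful in practice because the same model can differ in the last decimals across Python, R, and Stata without being wrong.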
Efficiency is measured by comparing the number of commands used in the ChatGPT-generated code with the number of commands in the standard reference code.
This is not a perfect measure of efficiency, but it gives a useful approximation.
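As a rough proxy for this measure, one can count the non-blank, non-comment lines in each script and take the ratio. How the authors counted commands is not detailed here, so the counting rule, the function names `count_commands` and `command_ratio`, and the two toy scripts are assumptions.

```python
def count_commands(code, comment_prefix="#"):
    """Count lines that are neither blank nor pure comments."""
    return sum(
        1
        for line in code.splitlines()
        if line.strip() and not line.strip().startswith(comment_prefix)
    )


def command_ratio(generated, reference, comment_prefix="#"):
    """A ratio above 1 means the generated script uses more commands."""
    return count_commands(generated, comment_prefix) / count_commands(reference, comment_prefix)


# Toy scripts, stored as text and never executed.
reference = """
# load and model
df = load()
fit = model(df)
"""
generated = """
df = load()
df = df.dropna()
# fit the model
fit = model(df)
print(fit)
"""
print(command_ratio(generated, reference))  # 2.0
```

This rule counts a statement that spans several lines as several commands, so it is only an approximation. For Stata scripts, `comment_prefix="*"` would skip lines that start with a star comment.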
For error output, the authors document whether the ChatGPT-generated code produces execution errors.
This is one of the most practical indicators. When code fails to run, the user must debug it. If the user does not understand the method or the programming language, this can become a major problem.
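For Python, recording whether a generated snippet raises an error can be sketched with `exec` inside a `try` block. This is an illustration only: the helper `run_snippet` is hypothetical, the study ran each script in its own language environment, and untrusted code should be executed in a sandbox, not in the current process.

```python
def run_snippet(code):
    """Execute a code string and report whether it raised.

    Returns (ok, error_text); error_text is None when ok is True.
    Syntax errors are caught too, because compile() raises SyntaxError.
    """
    namespace = {}
    try:
        exec(compile(code, "<generated>", "exec"), namespace)
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, None


print(run_snippet("total = sum([1, 2, 3])"))       # (True, None)
print(run_snippet("result = undefined_name + 1"))  # (False, "NameError: ...")
```

Keeping the error text matters because it is what a user would paste back into the conversation when asking the model for a fix.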
Editing refers to cases where the code does not produce an execution error but still requires clarification, additional context, or manual adjustment to obtain the correct output.
This is particularly important because not all errors are visible. A code block can run without crashing but still produce an incorrect model, a wrong variable transformation, or a misleading figure.
Consistency is assessed through replication. A fourth researcher repeats the tasks using the same prompts across Python, R, and Stata, with a new ChatGPT account and no prior conversation history.
The goal is to determine whether ChatGPT produces similar logic and structure when different users submit the same prompts.
This matters because reproducibility is central to research. If the same prompt produces very different code across sessions, researchers need to document and validate outputs carefully.
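The study assesses consistency by replication, not by a numeric score. As one possible complement for your own checks, the sketch below measures how similar two generated scripts are with `difflib.SequenceMatcher` from the Python standard library. The function names `normalize` and `script_similarity`, the normalization rule, and the toy scripts are my own assumptions, not the authors' procedure.

```python
from difflib import SequenceMatcher


def normalize(code):
    """Drop blank lines and surrounding whitespace so layout does not count."""
    return [line.strip() for line in code.splitlines() if line.strip()]


def script_similarity(code_a, code_b):
    """Line-level similarity in [0, 1]; 1.0 means identical after normalization."""
    return SequenceMatcher(None, normalize(code_a), normalize(code_b)).ratio()


# Toy scripts from three imagined sessions, stored as text and never executed.
session_1 = """
df = load()
fit = model(df)
print(fit)
"""
session_2 = """
df = load()

fit = model(df)

print(fit)
"""
session_3 = """
data = read()
fit = model(data)
print(fit)
"""
print(script_similarity(session_1, session_2))  # 1.0
print(round(script_similarity(session_1, session_3), 2))
```

A textual score like this only flags scripts that diverge. Two scripts with different wording can still implement the same model, so the outputs still have to be compared.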
What Did the Study Find?
The overall conclusion is balanced. Here is a table that summarizes the results.

Based on the study, ChatGPT performed better in Python and R than in Stata. The authors state that ChatGPT generated accurate code and results in R and Python for most tasks, while Stata was less reliable.
This result is not entirely surprising.
Python and R are widely used in data science, statistics, and machine learning. They also have large online communities, extensive documentation, and many publicly available code examples. Since large language models learn from large-scale text and code data, it is reasonable to expect them to perform better in languages with more abundant public examples.
That said, this interpretation should be treated carefully. The study is not a large-scale benchmark across thousands of tasks. It is a case study based on selected econometric problem sets. Therefore, we should not conclude that ChatGPT is universally better at Python or R than Stata in all contexts.
A more cautious conclusion is this:
For the causal inference tasks tested in this study, ChatGPT appeared more reliable in Python and R than in Stata.
What the Rise of LLMs Has Changed in My Own Way of Working
What makes this study particularly interesting to me is that it does not address only a theoretical question. It directly connects with what I observe in my own work, both at home and in a professional setting. We used ChatGPT Pro 4.0 in the past, and today we use ChatGPT Pro 5.5. In this section, I want to explain how the adoption of these models has changed the way I work.
In the past, when I had to conduct a quantitative study or develop a statistical method, a large part of the work was spent on literature review. I had to identify the right scientific papers, understand the methods used, compare different approaches, and then decide how to apply them to our own data.
Today, with ChatGPT, this exploratory phase is much faster. It does not replace the critical reading of scientific papers, but it helps structure the initial research, identify key concepts more quickly, and formulate methodological questions more clearly.
The change has been even more visible in the workplace, especially in the way we use programming languages.
Previously, we mainly used SAS for data extraction, preparation, and processing. SAS remains a very efficient tool for handling large volumes of data in a professional environment. However, for statistical modeling, we often relied on R, which was more convenient for estimation, visualization, and methodological experimentation.
With the rise of LLMs, we gradually decided to move a significant part of our work to Python. This decision was not only driven by the fact that Python is simple and widely used. It also came from a very practical observation: in our experience, tools like ChatGPT often provide better answers in Python, with fewer errors and more reusable examples.
We did not conduct a scientific study as structured as the one by Winberg et al., but we reached this conclusion through the feedback of the modelers in our team and as part of a long-term strategic choice. In practice, AI has influenced not only the way we write code but also the infrastructure we use. We moved from an environment centered on SAS Studio and RStudio to a workflow more oriented toward VS Code, because it integrates more easily with tools such as ChatGPT, Claude, and GitHub Copilot.
This shift may look technical, but it is actually quite deep. AI does not only improve productivity. It also influences the languages we choose, the tools we use, and the way we organize our workflows.
Another concrete example is the collection of external data. In our work, we sometimes need publicly available datasets: INSEE data, climate data, IPCC data, NGFS scenarios for climate stress testing, or other datasets used in ESG risk modeling.
In the past, this type of task could take several days, sometimes even several weeks. We had to find the right source, understand the structure of the data, download the data, clean it, reformat it, and make it usable for our models. Today, with LLMs, this process can be significantly accelerated.
Recently, for example, I wanted to retrieve NAF codes from the INSEE website, together with their labels, in a format that could be used directly. In the past, this task would probably have taken me several hours. With a few well-structured prompts, I quickly obtained a script that retrieved the data, cleaned the codes, removed the dots, and produced an Excel file ready to use. This is not only a time gain. It also reduces the friction between an idea and its execution.
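The cleaning part of such a script can be sketched as follows. This is not the script described above: the download step is omitted, the rows are typed in by hand for illustration and should be checked against the official INSEE list, the function names are hypothetical, and the output is written as CSV with the standard library instead of as an Excel file.

```python
import csv
import io


def clean_naf_code(raw):
    """Remove dots and surrounding spaces, e.g. ' 62.01z ' -> '6201Z'."""
    return raw.strip().replace(".", "").upper()


def to_csv(rows):
    """Write (code, label) pairs as CSV text with cleaned codes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["naf_code", "label"])
    for code, label in rows:
        writer.writerow([clean_naf_code(code), label.strip()])
    return buffer.getvalue()


# Illustrative rows typed in by hand, not fetched from INSEE.
sample = [
    (" 62.01z ", "Programmation informatique"),
    ("64.19Z", "Autres intermédiations monétaires "),
]
print(to_csv(sample))
```

Writing a real Excel file would need a third-party package, which is why this sketch stops at CSV. The cleaning logic would stay the same.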
In my view, this is one of the most important contributions of LLMs for statisticians and quantitative analysts. They are very useful for data processing, statistical modeling, mathematical programming, reporting, and formatting results.
They have also become valuable for producing deliverables: structuring documents, improving explanations, formatting tables, describing figures, and interpreting results. Earlier versions of ChatGPT still made many errors in these tasks, especially in technical reasoning and references. Recent models are much better, although they still require careful validation.
In my work, I see them more as very fast research assistants than as autonomous experts. They can do in a few hours what we might previously have assigned to a research assistant for several days: explore a method, propose code, generate a first version of a chart, rewrite an interpretation, or automate part of a report.
But this speed comes with one condition: human supervision and validation remain essential.
The risk of hallucination is not theoretical. A recent example made this very clear: according to the Financial Times, EY Canada withdrew a study used to promote its cybersecurity services after it was found to contain fabricated data, misattributed citations, and even a reference to a McKinsey report that did not exist.
This is exactly why I find the study by Winberg et al. interesting. It does not merely ask whether ChatGPT can write code. It points to a more important question: under what conditions can we trust AI-generated code?
For me, the answer is clear. We can use LLMs to work faster, but not to remove the responsibility of the researcher. The researcher still needs to check the assumptions, validate the data, test the code, compare the results with benchmarks, and make sure the interpretation is correct.
In other words, AI is deeply changing the way we work, but it does not remove the need for expertise. In fact, it makes expertise even more important. The more powerful the tool becomes, the more necessary it is to know when to trust it and when not to.
Finally, the adoption of AI tools will continue to transform the way we work. Some processes will become more efficient, others will disappear, and more sophisticated workflows will emerge. To remain competitive, we need to keep learning, keep working, and be ready to integrate these tools into our professional lives.
At the same time, AI will also change the way knowledge is produced and shared. Because these tools increase productivity, an article that once required a month of work can now sometimes be completed in a week. This is a good thing in many ways: it lowers the barrier to writing, helps more people share ideas, and accelerates the circulation of knowledge.
But it also creates a new challenge. If everyone can produce more content faster, the internet will become even more crowded. The reach of each article may not be the same as before. Some writers may feel discouraged, especially if their work receives less visibility despite the effort behind it.
In my view, this will create a new form of inequality between those who know how to use AI effectively and those who do not, but also between those who write only to produce content and those who write because they truly care about the subject.
In the long run, I believe the people who remain will be those who are genuinely passionate, those who want to learn, think deeply, and share knowledge with others. AI can make writing faster, but it will not replace curiosity, discipline, and the desire to contribute something meaningful.
References
Winberg, D., Tsai, E., Tang, T., Xuan, D., Marchi, N., & Shi, L. (2026). Can AI write your code? A case study of ChatGPT’s statistical coding capabilities for quantitative research. Health Economics Review.
















