Welcome to my series on Causal AI, where we will explore the integration of causal reasoning into machine learning models. Expect to explore a number of practical applications across different business contexts.
In the last article we covered safeguarding demand forecasting with causal graphs. Today, we turn our attention to powering experiments using CUPED and double machine learning.
If you missed the last article on safeguarding demand forecasting, check it out here:
In this article, we evaluate whether CUPED and double machine learning can enhance the effectiveness of your experiments. We will use a case study to explore the following areas:
- The building blocks of experimentation: hypothesis testing, power analysis, bootstrapping.
- What is CUPED and how can it help power experiments?
- What are the conceptual similarities between CUPED and double machine learning?
- When should we use double machine learning rather than CUPED?
The full notebook can be found here:
Background
You’ve recently joined the experimentation team at a leading online retailer known for its vast product catalog and dynamic user base. The data science team has deployed an advanced recommender system designed to enhance user experience and drive sales. This system integrates in real time with the retailer’s platform and involves significant infrastructure and engineering costs.
The finance team is eager to understand the system’s financial impact, specifically how much additional revenue it generates compared to a baseline scenario without recommendations. To evaluate the recommender system’s effectiveness, you plan to conduct a randomized controlled experiment.
Data-generating process: Pre-experiment
We start by creating some pre-experiment data. The data-generating process we use has the following characteristics:
- 3 observed covariates related to the recency (x_recency), frequency (x_frequency) and value (x_value) of previous sales.
- 1 unobserved covariate, the user’s monthly income (u_income).
- A complex relationship between covariates is used to estimate our target metric, sales value:
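Read off the data-generating code used in this section, the relationship can be written as below. Each covariate is drawn uniformly from [0, 1] and ε is standard normal noise; every variable is then multiplied by 1,000 for interpretability. The symbols are chosen here for illustration and are an assumption of this sketch.

```latex
y = \max\!\bigl(1.5\,u_{\text{income}} + 2.5\,x_{\text{recency}} + x_{\text{frequency}}^{3} + x_{\text{value}}^{2} + x_{\text{recency}}\,x_{\text{frequency}} + \varepsilon,\; 0\bigr),
\qquad \varepsilon \sim \mathcal{N}(0, 1)
```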
The Python code below is used to create the pre-experiment data:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

np.random.seed(123)

n = 10000 # Set number of observations
p = 4 # Set number of pre-experiment covariates
# Create pre-experiment covariates
X = np.random.uniform(size=n * p).reshape((n, -1))
# Nuisance parameters
b = (
    1.5 * X[:, 0] +
    2.5 * X[:, 1] +
    X[:, 2] ** 3 +
    X[:, 3] ** 2 +
    X[:, 1] * X[:, 2]
)
# Create some noise
noise = np.random.normal(size=n)
# Calculate outcome (floored at zero, as sales value cannot be negative)
y = np.maximum(b + noise, 0)
# Scale variables for interpretation
df_pre = pd.DataFrame({"noise": noise * 1000,
                       "u_income": X[:, 0] * 1000,
                       "x_recency": X[:, 1] * 1000,
                       "x_frequency": X[:, 2] * 1000,
                       "x_value": X[:, 3] * 1000,
                       "y_value": y * 1000
                       })
# Visualise target metric
sns.histplot(df_pre['y_value'], bins=30, kde=False)
plt.xlabel('Sales Value')
plt.ylabel('Frequency')
plt.title('Sales Value')
plt.show()
Before we get onto CUPED, I thought it would be worthwhile covering some foundational knowledge on experimentation.
Hypothesis testing
Hypothesis testing helps determine if observed differences in an experiment are statistically significant or just random noise. In our experiment, we divide users into two groups:
- Control Group: Receives no recommendations.
- Treatment Group: Receives personalised recommendations from the system.
We define our hypotheses as follows:
- Null Hypothesis (H₀): The recommender system does not affect revenue. Any observed differences are due to chance.
- Alternative Hypothesis (Hₐ): The recommender system increases revenue. Users receiving recommendations generate significantly more revenue compared to those who do not.
To assess the hypotheses you will be comparing the mean revenue in the control and treatment group. However, there are a few things to be aware of:
- Type I error (false positive): The experiment concludes that the recommender system significantly increases revenue when in reality it has no effect.
- Type II error (Beta, false negative): The experiment finds no significant increase in revenue from the recommender system when in reality it does lead to a meaningful increase.
- Significance Level (Alpha): If you set the significance level to 0.05, you are accepting a 5% chance of incorrectly concluding that the recommender system improves revenue when it does not (false positive).
- Power (1 - Beta): Achieving a power of 0.80 means you have an 80% chance of detecting a significant increase in revenue due to the recommender system if it truly has an effect. A higher power reduces the risk of false negatives.
As you start to think about designing the experiment, you set some initial goals:
- You want to reliably detect the effect: making sure you balance the risk of detecting a non-existent effect against the risk of not detecting a real effect.
- As quickly as possible: Finance are on your case!
- Keeping the sample size as cost efficient as possible: the business case from the data science team suggests the system is going to drive a large increase in revenue, so they don’t want the control group being too big.
But how can you meet these goals? Let’s delve into power analysis next!
Power analysis
When we talk about powering experiments, we are usually referring to the process of determining the minimum sample size needed to detect an effect of a certain size with a given confidence. There are 3 components to power analysis:
- Effect size: The difference between the mean value of H₀ and Hₐ. We generally need to make sensible assumptions around this based on understanding what matters to the business/industry we are operating within.
- Significance level: The probability of incorrectly concluding there is an effect when there isn’t, typically set at 0.05.
- Power: The probability of correctly detecting an effect when there is one, typically set at 0.80.
I found the intuition behind these quite hard to grasp at first, but visualising it can really help. So let’s give it a try! The key areas are where H₀ and Hₐ cross over. See if it helps you tie together the components discussed above…
A larger sample size leads to a smaller standard error. With a smaller standard error, the sampling distributions of H₀ and Hₐ become narrower and overlap less. This decreased overlap makes it easier to detect a difference, leading to higher power.
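To make the link between sample size, standard error and power concrete, here is a minimal sketch using only the Python standard library. It assumes a two-sided, two-sample z-test with equal group sizes (a normal approximation, so the numbers will differ slightly from a t-test based calculation), and the function name `approx_power` and the effect size of 0.1 are illustrative choices rather than values from the case study.

```python
from statistics import NormalDist


def approx_power(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample z-test with equal group sizes.

    effect_size is the standardised difference in means (Cohen's d).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Standard error of the standardised difference in means shrinks as n grows
    se = (2 / n_per_group) ** 0.5
    shift = effect_size / se
    # Probability that the test statistic falls beyond either critical value
    # when the true standardised effect is effect_size
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


# Power rises with sample size for a fixed (illustrative) effect size of 0.1
for n in (100, 500, 1000, 2000):
    print(n, round(approx_power(0.1, n), 3))
```

Holding the effect size and significance level fixed, the only thing changing across the loop is the standard error, which is what drives the increase in power.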
The function below shows how we can use the statsmodels Python package to carry out a power analysis:
from typing import Union
import pandas as pd
import numpy as np
import statsmodels.stats.power as smp

def power_analysis(metric: Union[np.ndarray, pd.Series], exp_perc_change: float, alpha: float = 0.05, power: float = 0.80) -> int:
    '''
    Perform a power analysis to determine the minimum sample size required for a given metric.
    Args:
        metric (np.ndarray or pd.Series): Array or Series containing the metric values for the control group.
        exp_perc_change (float): The expected percentage change in the metric for the test group, as a fraction (e.g. 0.05 for 5%).
        alpha (float, optional): The significance level for the test. Defaults to 0.05.
        power (float, optional): The desired power of the test. Defaults to 0.80.
    Returns:
        int: The minimum sample size required for each group to detect the expected percentage change with the specified power and significance level.
    Raises:
        ValueError: If `metric` is not a NumPy array or pandas Series.
    '''
    # Validate input types
    if not isinstance(metric, (np.ndarray, pd.Series)):
        raise ValueError("metric should be a NumPy array or pandas Series.")
    # Calculate statistics
    control_mean = metric.mean()
    control_std = np.std(metric, ddof=1) # Use ddof=1 for sample standard deviation
    test_mean = control_mean * (1 + exp_perc_change)
    test_std = control_std # Assume the test group has the same standard deviation as the control group
    # Calculate (Cohen's d) effect size
    mean_diff = control_mean - test_mean
    pooled_std = np.sqrt((control_std**2 + test_std**2) / 2)
    effect_size = abs(mean_diff / pooled_std) # Cohen's d should be positive
    # Run power analysis (solves for the number of observations per group)
    analysis = smp.TTestIndPower()
    sample_size = round(analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power))
    print(f"Control mean: {round(control_mean, 3)}")
    print(f"Control std: {round(control_std, 3)}")
    print(f"Min sample size: {sample_size}")
    return sample_size
So let’s try it out with our pre-experiment data!
exp_perc_change = 0.05 # Set the expected percentage change in the chosen metric caused by the treatment
min_sample_size = power_analysis(df_pre["y_value"], exp_perc_change)
We can see that given the distribution of our target metric, we would need a sample size of 1,645 to detect an increase of 5%.
Data-generating process: Experimental data
Rather than rush into setting up the experiment, you decide to take the pre-experiment data and simulate the experiment.
The following function randomly selects users to be treated and applies a treatment effect. At the end of the function we report the mean difference before and after the treatment was applied as well as the true ATE (average treatment effect):
def exp_data_generator(t_perc_change, t_samples):
    # Create copy of pre-experiment data ready to manipulate into experiment data
    df_exp = df_pre.reset_index(drop=True)
    # Calculate the initial treatment effect
    treatment_effect = round((df_exp["y_value"] * (t_perc_change)).mean(), 2)
    # Create treatment column
    treated_indices = np.random.choice(df_exp.index, size=t_samples, replace=False)
    df_exp["treatment"] = 0
    df_exp.loc[treated_indices, "treatment"] = 1
    # Treatment effect (float column so the effect is not truncated)
    df_exp["treatment_effect"] = 0.0
    df_exp.loc[df_exp["treatment"] == 1, "treatment_effect"] = treatment_effect
    # Apply treatment effect
    df_exp["y_value_exp"] = df_exp["y_value"]
    df_exp.loc[df_exp["treatment"] == 1, "y_value_exp"] = df_exp["y_value"] + df_exp["treatment_effect"]
    # Calculate mean diff before treatment
    mean_t0_pre = df_exp[df_exp["treatment"] == 0]["y_value"].mean()
    mean_t1_pre = df_exp[df_exp["treatment"] == 1]["y_value"].mean()
    mean_diff_pre = round(mean_t1_pre - mean_t0_pre)
    # Calculate mean diff after treatment
    mean_t0_post = df_exp[df_exp["treatment"] == 0]["y_value_exp"].mean()
    mean_t1_post = df_exp[df_exp["treatment"] == 1]["y_value_exp"].mean()
    mean_diff_post = round(mean_t1_post - mean_t0_post)
    # Calculate ATE
    treatment_effect = round(df_exp[df_exp["treatment"]==1]["treatment_effect"].mean())
    print(f"Diff-in-means before treatment: {mean_diff_pre}")
    print(f"Diff-in-means after treatment: {mean_diff_post}")
    print(f"ATE: {treatment_effect}")
    return df_exp
We can feed through the minimum sample size we previously calculated:
np.random.seed(123)
df_exp_1 = exp_data_generator(exp_perc_change, min_sample_size)
Let’s start by inspecting the data we created for treated users to help you understand what the function is doing:
Next let’s take a look at the results which the function prints:
Interesting: we see that after we select users to be treated, but before we treat them, there is already a difference in means. This difference is due to chance. It means that when we look at the difference after users are treated we don’t correctly estimate the ATE (average treatment effect). We will come back to this point when we cover CUPED.
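To see how large such chance differences can be, here is a small simulation sketch using only the standard library. It repeatedly splits untreated users at random and records the difference in means; the sales values, group sizes and the helper name `chance_diff` are hypothetical and not taken from the case study data.

```python
import random
from statistics import fmean, stdev

rng = random.Random(3)
# Hypothetical pre-experiment sales values for 2,000 users (no treatment applied)
sales = [max(rng.gauss(3000, 1200), 0) for _ in range(2000)]


def chance_diff(values, n_treated, rng):
    """Diff-in-means for one random split into 'treated' and 'control' users."""
    shuffled = rng.sample(values, k=len(values))
    return fmean(shuffled[:n_treated]) - fmean(shuffled[n_treated:])


# Repeat the random assignment many times; nobody is actually treated
diffs = [chance_diff(sales, 500, rng) for _ in range(1000)]
print(f"Average chance difference: {fmean(diffs):.1f}")
print(f"Typical size (std dev) of chance difference: {stdev(diffs):.1f}")
```

On average the difference is close to zero, but any single random split can be off by a noticeable amount, which is what a single experiment experiences.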
Next let’s explore a more sophisticated way of making an inference than just taking the difference in means…
Bootstrapping
Bootstrapping is a powerful statistical technique that involves resampling data with replacement. These resampled datasets, called bootstrap samples, help us estimate the variability of a statistic (like the mean or median) from our original data. This is particularly attractive when it comes to experimentation as it enables us to calculate confidence intervals. Let’s walk through it step by step using a simple example…
You have run an experiment with a control and treatment group each made up of 1k users.
- Create bootstrap samples: randomly select (with replacement) 1k users from the control group and then from the treatment group. This gives us one bootstrap sample for control and one for treatment.
- Repeat this process n times (e.g. 10k times).
- For each pair of bootstrap samples calculate the mean difference between control and treatment.
- We now have a distribution (made up of the mean differences from the 10k pairs of bootstrap samples) which we can use to calculate confidence intervals.
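The steps above can be sketched by hand with the standard library. This is an illustration of the percentile bootstrap on toy data, with smaller group sizes and fewer resamples than in the example so it runs quickly; the function names and revenue figures are hypothetical.

```python
import random
from statistics import fmean


def bootstrap_mean_diff(control, treatment, n_resamples=2000, seed=0):
    """Return the bootstrap distribution of mean(treatment) - mean(control)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping the original group sizes
        c = rng.choices(control, k=len(control))
        t = rng.choices(treatment, k=len(treatment))
        diffs.append(fmean(t) - fmean(c))
    return diffs


def percentile_ci(values, level=0.95):
    """Percentile confidence interval from a list of bootstrap statistics."""
    ordered = sorted(values)
    lo_idx = int((1 - level) / 2 * (len(ordered) - 1))
    hi_idx = int((1 + level) / 2 * (len(ordered) - 1))
    return ordered[lo_idx], ordered[hi_idx]


# Toy data: hypothetical revenue per user, with a true uplift of 5 for treated users
rng = random.Random(42)
control = [rng.gauss(100, 20) for _ in range(300)]
treatment = [rng.gauss(105, 20) for _ in range(300)]

diffs = bootstrap_mean_diff(control, treatment)
ci_low, ci_high = percentile_ci(diffs)
print(f"Bootstrap mean diff: {fmean(diffs):.2f}")
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f})")
```

Note that the bootstrap distribution is centred on the observed difference in means, so it quantifies uncertainty but does not by itself correct a chance imbalance between the groups.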
Applying it to our case study
Let’s use our case study to illustrate how it works. Below we use the SciPy stats Python package to help calculate bootstrap confidence intervals:
from typing import Union
import pandas as pd
import numpy as np
from scipy import stats

def mean_diff(group_a: Union[np.ndarray, pd.Series], group_b: Union[np.ndarray, pd.Series]) -> float:
    '''
    Calculate the difference in means between two groups.
    Args:
        group_a (Union[np.ndarray, pd.Series]): The first group of data points.
        group_b (Union[np.ndarray, pd.Series]): The second group of data points.
    Returns:
        float: The difference between the mean of group_a and the mean of group_b.
    '''
    return np.mean(group_a) - np.mean(group_b)

def bootstrapping(df: pd.DataFrame, adjusted_metric: str, n_resamples: int = 10000) -> np.ndarray:
    '''
    Perform bootstrap resampling on the adjusted metric of two groups in the dataframe to estimate the mean difference and confidence intervals.
    Args:
        df (pd.DataFrame): The dataframe containing the data. Must include a 'treatment' column indicating group membership.
        adjusted_metric (str): The name of the column in the dataframe representing the metric to be resampled.
        n_resamples (int, optional): The number of bootstrap resamples to perform. Defaults to 10000.
    Returns:
        np.ndarray: The array of bootstrap resampled mean differences.
    '''
    # Separate the data into two groups based on the 'treatment' column
    group_a = df[df["treatment"] == 1][adjusted_metric]
    group_b = df[df["treatment"] == 0][adjusted_metric]
    # Perform bootstrap resampling
    res = stats.bootstrap((group_a, group_b), statistic=mean_diff, n_resamples=n_resamples, method='percentile')
    ci = res.confidence_interval
    # Extract the bootstrap distribution and confidence intervals
    bootstrap_means = res.bootstrap_distribution
    bootstrap_ci_lb = round(ci.low)
    bootstrap_ci_ub = round(ci.high)
    bootstrap_mean = round(np.mean(bootstrap_means))
    print(f"Bootstrap confidence interval lower bound: {bootstrap_ci_lb}")
    print(f"Bootstrap confidence interval upper bound: {bootstrap_ci_ub}")
    print(f"Bootstrap mean diff: {bootstrap_mean}")
    return bootstrap_means
When we run it for our case study data we can see that we now have some confidence intervals:
bootstrap_og_1 = bootstrapping(df_exp_1, "y_value_exp")
Our ground truth ATE is 143 (the actual treatment effect from our experiment data generator function), which falls within our confidence intervals. However, it’s worth noting that the mean difference hasn’t changed (it’s still 93, as before when we simply calculated the mean difference of control and treatment), and the pre-treatment difference is still there.
So what if we wanted to come up with narrower confidence intervals? And is there any way we can deal with the pre-treatment differences? This leads us nicely into CUPED…
Background
CUPED (controlled experiments using pre-experiment data) is a powerful technique, developed by researchers at Microsoft, for improving the accuracy of experiments. The original paper is an insightful read for anyone interested in experimentation:
https://ai.stanford.edu/~ronnyk/2009controlledExperimentsOnTheWebSurvey.pdf
The core idea of CUPED is to use data collected before your experiment starts to reduce the variance in your target metric. By doing so, you can make your experiment more sensitive, which has two major benefits:
- You can detect smaller effects with the same sample size.
- You can detect the same effect with a smaller sample size.
Think of it like removing the “background noise” so you can see the “signal” more clearly.
Variance, standard deviation, standard error
When you read about CUPED you may hear people talk about it reducing the variance, standard deviation or standard error. If you are anything like me, you might find yourself forgetting how these are related, so before we go any further let’s recap on this!
- Variance: Variance measures the average squared deviation of each data point from the mean, reflecting the overall spread or dispersion within a dataset.
- Standard Deviation: Standard deviation is the square root of variance. It expresses the typical distance of data points from the mean in the same units as the data, providing a more interpretable measure of spread.
- Standard Error: Standard error quantifies the precision of the sample mean as an estimate of the population mean, calculated as the standard deviation divided by the square root of the sample size.
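As a quick reminder of how the three quantities relate, here is a short sketch using the standard library on a handful of made-up sales values (the numbers are purely illustrative).

```python
from statistics import fmean, stdev, variance

# Hypothetical sales values for a handful of users
sales = [120.0, 95.0, 130.0, 110.0, 105.0, 140.0, 90.0, 115.0]

var = variance(sales)         # sample variance (divides by n - 1)
sd = stdev(sales)             # standard deviation = square root of the variance
se = sd / len(sales) ** 0.5   # standard error of the mean = sd / sqrt(n)

print(f"mean={fmean(sales):.2f} variance={var:.2f} sd={sd:.2f} se={se:.2f}")
```

Reducing the variance of the metric therefore reduces the standard deviation and, for a fixed sample size, the standard error.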
How does CUPED work?
To understand how CUPED works, let’s break it down…
Pre-experiment covariate: In the simplest implementation of CUPED, the pre-experiment covariate would be the target metric measured in a time period before the experiment. So if your target metric was sales value, your covariate could be each customer’s sales value in the 4 weeks prior to the experiment.
It’s important that your covariate is correlated with your target metric and that it’s unaffected by the treatment. This is why we would typically use pre-treatment data from the control group.
Regression adjustment: Linear regression is used to model the relationship between the covariate (measured before the experiment) and the target metric (measured during the experiment period). We can then calculate the CUPED adjusted target metric by removing the influence of the covariate:
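In its standard single-covariate form, the adjustment can be written as below, where $Y$ is the target metric measured during the experiment, $X$ is the pre-experiment covariate, $\bar{X}$ is its mean and $\theta$ is the slope from regressing $Y$ on $X$. The notation is an assumption of this sketch.

```latex
Y_{\text{cuped}} = Y - \theta\,(X - \bar{X}),
\qquad
\theta = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}
```

With this choice of $\theta$, the variance of the adjusted metric is $\operatorname{Var}(Y)\,(1 - \rho^{2})$, where $\rho$ is the correlation between $X$ and $Y$, so a more strongly correlated covariate gives a larger variance reduction.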
It’s worth noting that the mean of the covariate is subtracted so that the adjusted metric stays centred on the same mean as the original target metric, which keeps the two directly comparable.
Variance reduction: After the regression adjustment the variance in our target metric has reduced. Lower variance means that the differences between the control and treatment group are easier to detect, thus increasing the statistical power of the experiment.
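A minimal single-covariate sketch of this adjustment, using only the standard library, is shown below. It fits the slope on all of the toy data (there is no treatment group here), and the data, the function name `cuped_adjust` and the noise levels are hypothetical.

```python
import random
from statistics import fmean, variance


def cuped_adjust(y, x):
    """Return y - theta * (x - mean(x)), with theta = cov(x, y) / var(x)."""
    x_bar, y_bar = fmean(x), fmean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]


# Toy data: the in-experiment metric is the pre-experiment metric plus fresh noise
rng = random.Random(7)
x_pre = [rng.gauss(100, 20) for _ in range(5000)]
y_exp = [xi + rng.gauss(0, 10) for xi in x_pre]

y_cuped = cuped_adjust(y_exp, x_pre)
print(f"Variance before: {variance(y_exp):.1f}")
print(f"Variance after:  {variance(y_cuped):.1f}")
print(f"Mean before/after: {fmean(y_exp):.2f} / {fmean(y_cuped):.2f}")
```

The mean of the metric is unchanged while its variance drops, because the part of the metric that the covariate can predict has been removed.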
Applying it to our case study
Let’s use our case study to illustrate how it works. Below we code CUPED up in a function:
from typing import Union
import pandas as pd
import numpy as np
import statsmodels.api as sm

def cuped(df: pd.DataFrame, pre_covariates: Union[str, list], target_metric: str) -> pd.Series:
    '''
    Implements the CUPED (Controlled Experiments Using Pre-Experiment Data) technique to adjust the target metric
    by removing predictable variation using pre-experiment covariates. This reduces the variance of the metric and
    increases the statistical power of the experiment.
    Args:
        df (pd.DataFrame): The input DataFrame containing both the pre-experiment covariates and the target metric.
        pre_covariates (Union[str, list]): The column name(s) in the DataFrame corresponding to the pre-experiment covariates used for the adjustment.
        target_metric (str): The column name in the DataFrame representing the metric to be adjusted.
    Returns:
        pd.Series: A pandas Series containing the CUPED-adjusted target metric.
    '''
    # Accept a single column name as well as a list of names
    if isinstance(pre_covariates, str):
        pre_covariates = [pre_covariates]
    # Fit control model using pre-experiment covariates
    control_group = df[df['treatment'] == 0]
    X_control = control_group[pre_covariates]
    X_control = sm.add_constant(X_control)
    y_control = control_group[target_metric]
    model_control = sm.OLS(y_control, X_control).fit()
    # Compute residuals and adjust target metric
    X_all = df[pre_covariates]
    X_all = sm.add_constant(X_all)
    residuals = df[target_metric].to_numpy().flatten() - model_control.predict(X_all)
    # Add back the prediction at the covariate means so the metric keeps its original scale
    adjustment_term = model_control.params['const'] + sum(model_control.params[covariate] * df[pre_covariates].mean()[covariate] for covariate in pre_covariates)
    adjusted_target = residuals + adjustment_term
    return adjusted_target
When we apply it to our case study data and compare the adjusted target metric to the original target metric, we see that the variance has reduced:
# Apply CUPED
pre_covariates = ["x_recency", "x_frequency", "x_value"]
target_metric = ["y_value_exp"]
df_exp_1["adjusted_target"] = cuped(df_exp_1, pre_covariates, target_metric)
# Plot results
plt.figure(figsize=(10, 6))
sns.kdeplot(data=df_exp_1[df_exp_1['treatment'] == 0], x="adjusted_target", hue="treatment", fill=True, palette="Set1", label="Adjusted Value")
sns.kdeplot(data=df_exp_1[df_exp_1['treatment'] == 0], x="y_value_exp", hue="treatment", fill=True, palette="Set2", label="Original Value")
plt.title("Distribution of Value by Original vs CUPED")
plt.xlabel("Value")
plt.ylabel("Density")
plt.legend(title="Distribution")
Does it reduce the standard error?
Now we have applied CUPED and reduced the variance, let’s run our bootstrapping function to see what impact it has:
bootstrap_cuped_1 = bootstrapping(df_exp_1, "adjusted_target")
If you compare this to our previous result using the original target metric you see that the confidence intervals are narrower:
bootstrap_1 = pd.DataFrame({
    'original': bootstrap_og_1,
    'cuped': bootstrap_cuped_1
})
# Plot the KDE plots
plt.figure(figsize=(10, 6))
sns.kdeplot(bootstrap_1['original'], fill=True, label='Original', color='blue')
sns.kdeplot(bootstrap_1['cuped'], fill=True, label='CUPED', color='orange')
# Add mean lines
plt.axvline(bootstrap_1['original'].mean(), color='blue', linestyle='--', linewidth=1)
plt.axvline(bootstrap_1['cuped'].mean(), color='orange', linestyle='--', linewidth=1)
plt.axvline(round(df_exp_1[df_exp_1["treatment"]==1]["treatment_effect"].mean(), 3), color='green', linestyle='--', linewidth=1, label='Treatment effect')
# Customise the plot
plt.title('Distribution of Value by Original vs CUPED')
plt.xlabel('Value')
plt.ylabel('Density')
plt.legend()
# Show the plot
plt.show()
The bootstrap difference in means also moves closer to the ground truth treatment effect. This is because CUPED is also very effective at dealing with pre-existing differences between the control and treatment group.
Does it reduce the minimum sample size?
The next question is: does it reduce the minimum sample size we need? Well, let’s find out!
treatment_effect_1 = round(df_exp_1[df_exp_1["treatment"]==1]["treatment_effect"].mean(), 2)
cuped_sample_size = power_analysis(df_exp_1[df_exp_1['treatment'] == 0]['adjusted_target'], treatment_effect_1 / df_exp_1[df_exp_1['treatment'] == 0]['adjusted_target'].mean())
The minimum sample size needed has reduced from 1,645 to 901. Both Finance and the Data Science team are going to be pleased as we can run the experiment for a shorter time period with a smaller control sample!
Background
When I first read about CUPED, I thought of double machine learning and the similarities. If you aren’t familiar with double machine learning, check out my article from earlier in the series:
Pay attention to the first stage outcome model in double machine learning:
- Outcome model (de-noising): Machine learning model used to estimate the outcome using just the control features. The outcome model residuals are then calculated.
This is conceptually very similar to what we are doing with CUPED!
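The residual-based idea can be sketched with plain linear regression standing in for the machine learning models. This illustrates the partialling-out logic only: it assumes a single covariate and a constant treatment effect, and it omits the cross-fitting that double machine learning uses. All names and numbers are hypothetical.

```python
import random
from statistics import fmean


def ols_residuals(target, feature):
    """Residuals from a simple linear regression of target on one feature."""
    f_bar, t_bar = fmean(feature), fmean(target)
    slope = (sum((f - f_bar) * (t - t_bar) for f, t in zip(feature, target))
             / sum((f - f_bar) ** 2 for f in feature))
    intercept = t_bar - slope * f_bar
    return [t - (intercept + slope * f) for f, t in zip(feature, target)]


rng = random.Random(11)
n = 4000
x = [rng.gauss(100, 20) for _ in range(n)]                     # pre-experiment covariate
t = [rng.randint(0, 1) for _ in range(n)]                      # randomised treatment flag
y = [xi + 5 * ti + rng.gauss(0, 10) for xi, ti in zip(x, t)]   # true effect = 5

# Stage 1: de-noise the outcome and the treatment using the covariate
y_res = ols_residuals(y, x)
t_res = ols_residuals(t, x)

# Stage 2: regress outcome residuals on treatment residuals (no intercept needed)
effect = sum(a * b for a, b in zip(t_res, y_res)) / sum(b * b for b in t_res)
naive = (fmean(yi for yi, ti in zip(y, t) if ti == 1)
         - fmean(yi for yi, ti in zip(y, t) if ti == 0))
print(f"Naive diff-in-means: {naive:.2f}")
print(f"Partialled-out estimate: {effect:.2f}")
```

The outcome residuals play the same role as the CUPED-adjusted metric: the variation explained by the covariate has been stripped out before the treatment effect is estimated.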
How does it compare to CUPED?
Let’s feed through our case study data and see if we get a similar result:
from econml.dml import LinearDML

# Train DML model
dml = LinearDML(discrete_treatment=False)
dml.fit(df_exp_1[target_metric].to_numpy().ravel(), T=df_exp_1['treatment'].to_numpy().ravel(), X=df_exp_1[pre_covariates], W=None)
ate_dml = round(dml.ate(df_exp_1[pre_covariates]))
ate_dml_lb = round(dml.ate_interval(df_exp_1[pre_covariates])[0])
ate_dml_ub = round(dml.ate_interval(df_exp_1[pre_covariates])[1])
print(f'DML confidence interval lower bound: {ate_dml_lb}')
print(f'DML confidence interval upper bound: {ate_dml_ub}')
print(f'DML ate: {ate_dml}')
We get an almost identical result!
When we plot the residuals we can see that the variance is reduced like in CUPED (although we don’t add the mean to scale for interpretation):
# Fit outcome model using pre-experiment covariates
X_all = df_exp_1[pre_covariates]
X_all = sm.add_constant(X_all) # Use the observed covariates only, not the raw X array (which includes unobserved income)
y_all = df_exp_1[target_metric]
outcome_model = sm.OLS(y_all, X_all).fit()
# Compute residuals and adjust target metric
df_exp_1['outcome_residuals'] = df_exp_1[target_metric].to_numpy().flatten() - outcome_model.predict(X_all)
# Plot results
plt.figure(figsize=(10, 6))
sns.kdeplot(data=df_exp_1[df_exp_1['treatment'] == 0], x="outcome_residuals", hue="treatment", fill=True, palette="Set1", label="Adjusted Target")
sns.kdeplot(data=df_exp_1[df_exp_1['treatment'] == 0], x="y_value_exp", hue="treatment", fill=True, palette="Set2", label="Original Value")
plt.title("Distribution of Value by Original vs DML")
plt.xlabel("Value")
plt.ylabel("Density")
plt.legend(title="Distribution")
plt.show()
“So what?” I hear you ask!
Firstly, I think it’s an interesting observation for anyone using double machine learning: the first stage outcome model helps reduce the variance, and therefore we should get similar benefits to CUPED.
Secondly, it raises the question of when each method is appropriate. Let’s close things off by covering this question…
There are several reasons why it may make sense to tend towards CUPED:
- It’s easier to understand.
- It’s simpler to implement.
- It’s one model rather than three, meaning you have fewer challenges with overfitting.
However, there are a couple of exceptions where double machine learning outperforms CUPED:
- Biased treatment assignment: When the treatment assignment is biased, for example when you are using observational data, double machine learning can deal with this. My article from earlier in the series builds on this:
- Heterogeneous treatment effects: When you want to understand effects at an individual level, for example finding out who it’s worth sending discounts to, double machine learning can help with this. There is a good case study which illustrates this in my earlier article on optimising treatment strategies:
Today we did a whistle-stop tour of experimentation, covering hypothesis testing, power analysis and bootstrapping. We then explored how CUPED can reduce the standard error and increase the power of our experiments. Finally, we touched on its similarities to double machine learning and discussed when each method should be used. There are a few additional key points which are worth mentioning in terms of CUPED:
- We don’t have to use linear regression. If we have multiple covariates, maybe some with non-linear relationships, we could use a machine learning technique like boosting.
- If we do go down the route of using a machine learning technique, we need to make sure not to overfit the data.
- Some careful thought should go into when to run CUPED. Are you going to run it before you start your experiment and then run a power analysis to determine your reduced sample size? Or are you just going to run it after your experiment to reduce the standard error?