To get the most out of this tutorial, you should already have a solid understanding of how linear regression works and the assumptions behind it. You should also be aware that, in practice, multicollinearity is addressed using the Variance Inflation Factor (VIF). In addition, you should understand what prediction risk means, and be familiar with the basics of Python as well as its core functions.
At the end of this article, you will find the code for the stepwise selection procedure used here. The implementation follows two key principles: orthogonality and Don't Repeat Yourself (DRY), ensuring clean, modular, and reusable code.
Reducing the number of variables in a regression model is not only a technical exercise; it is a strategic choice that must be guided by the goals of the analysis. In a previous work, we demonstrated how simple tools, such as correlation analysis and the Variance Inflation Factor (VIF), can already shrink a dataset with hundreds of predictors into a far more compact model. Yet, even after this initial reduction, models often still contain too many variables to work effectively. A smaller model with fewer predictors offers several advantages: it may yield better predictions than a larger model, it is more parsimonious, hence easier to interpret, and it often generalizes better. As more variables are added, the model's bias decreases but its variance increases. This is the essence of the bias–variance trade-off: too few variables lead to high bias (underfitting), while too many lead to high variance (overfitting). Good predictive performance requires a balance between the two.
This raises a fundamental question for anyone working with regression models: how do we decide which variables should be included in the model? In other words, how can we reduce the dimensionality of our data without losing essential information?
The challenge depends on the goal of the analysis. Should the model provide precise estimates of the coefficients? Should it identify which predictors are important? Or should it maximize predictive accuracy? Each of these goals calls for a different approach to model selection, and ignoring this distinction can lead to misleading conclusions.
In this article, we address the problem of model selection in regression. We begin by outlining the general framework of linear regression (readers already familiar with it may skip this section). We then review the main scoring criteria used to evaluate competing models, followed by a discussion of the procedures that allow us to explore subsets of the possible model space. Finally, we illustrate these methods with a Python application using the Communities and Crime dataset.
1. Framework of Linear Regression
In this section, we provide a brief overview of the linear regression model. We begin by describing the dataset, including the number of observations and the number of covariates. We then introduce the model itself and outline the assumptions made about the data.
We assume that we have a dataset with n observations and p covariates. The response variable, denoted by Y, is continuous, and the covariates are denoted by X1, …, Xp. We assume that the relationship between the response variable and the covariates is linear, that is:

𝑌ᵢ = β₀ + ∑ⱼ₌₁ᵖ βⱼ 𝑋ᵢⱼ + εᵢ

for i = 1, …, 𝑛, where β₀ is the intercept, βⱼ is the coefficient of the 𝑗-th covariate, and εᵢ is the error term. We assume that the error terms are independent and identically distributed (i.i.d.) with mean zero and variance σ².
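As a quick illustration, the following minimal sketch fits such a model with statsmodels on simulated data (the sample size, covariates, and coefficients are purely illustrative, not the dataset used later):

import numpy as np
import statsmodels.api as sm

# Simulate n observations and p covariates (illustrative values only)
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 0.3 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

# Least squares fit of Y = beta0 + beta1*X1 + ... + betap*Xp + eps
model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.params)     # estimated coefficients (intercept first)
print(model.mse_resid)  # unbiased estimate of the error variance sigma^2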
With the regression framework in place, the next step is to face the central challenge of model selection: how do we compare different subsets of variables?
2. Scoring Criteria for Evaluating Competing Models
In model selection, the first challenge is to assign a score to each model, where a model is defined by a particular subset of covariates. This section explains how models can be scored.
Let us first discuss the problem of scoring models. Let S ⊂ {1, …, p} and let 𝓧S = {Xⱼ : j ∈ S} denote a subset of the covariates. Let βS denote the coefficients of the corresponding set of covariates and let β̂S denote the least squares estimate of βS. Also, let XS denote the design matrix for this subset of covariates and define r̂S(x) to be the estimated regression function. The predicted values from model S are denoted by Ŷᵢ(S) = r̂S(Xᵢ). The prediction risk is defined to be

𝑅(𝑆) = ∑ᵢ₌₁ⁿ 𝔼[(Ŷᵢ(𝑆) − 𝑌ᵢ*)²]
where 𝑌ᵢ* is a future observation of 𝑌ᵢ at the covariate value 𝑋ᵢ.
The goal of model selection is to find the subset S that minimizes the prediction risk R(S).
In practice, we cannot compute the prediction risk R(S) directly from the data. Instead, we generally use an estimate based on the available data, and this estimate of the prediction risk serves as our scoring criterion.
The naive estimate of the prediction risk is the training error, defined as:

𝑅̂ₜᵣ(𝑆) = ∑ᵢ₌₁ⁿ (Ŷᵢ(𝑆) − 𝑌ᵢ)²
where 𝑌ᵢ is the observed value of the response variable for the i-th observation.
However, the training error is very biased as an estimate of the prediction risk: it is always smaller than the prediction risk. In fact,

𝔼[𝑅̂ₜᵣ(𝑆)] = 𝑅(𝑆) − 2 ∑ᵢ₌₁ⁿ 𝐶𝑜𝑣(Ŷᵢ(𝑆), 𝑌ᵢ)

This bias arises because the data are used twice: once to fit the model and once to compute the training error. When we fit a complex model with many parameters, the covariance 𝐶𝑜𝑣(Ŷᵢ(S), 𝑌ᵢ) will be large and the bias of the training error worsens. This is why we need a more reliable estimate of the prediction risk.
2.1 Mallow’s Cp statistic
The Mallow’s Cp statistic is a well-liked technique for mannequin choice. It’s outlined as:

Here, |𝑆| is the number of terms in 𝑆, and σ̂² is the estimate of σ², the variance of the error term, obtained from the full model containing all 𝑘 variables. This value is the training error plus a bias correction. The first term measures how well the model fits the data, while the second term measures the model's complexity. The more complex the model, the larger the second term, and consequently the larger the Mallows' 𝐶ₚ statistic.
Mallows' 𝐶ₚ statistic therefore represents a trade-off between model fit and complexity, and finding a good model requires balancing these two aspects. The goal is to identify the model that minimizes the Mallows' 𝐶ₚ statistic.
Past Mallows’ CP, mannequin choice standards may also be derived from likelihood-based estimation with penalty phrases. This leads us to the following household of strategies.”
2.2 Probability and penalization
The strategy beneath to estimate the prediction threat relies on the utmost chance estimation of the parameters.
Under the hypothesis that the error term is normally distributed, the likelihood function is given by:

𝐿(β, σ²) = ∏ᵢ₌₁ⁿ (1/√(2πσ²)) exp(−(𝑌ᵢ − 𝑋ᵢᵀβ)² / (2σ²))
If you compute the maximum likelihood estimates of the parameters β and σ² for the model 𝑆, which has |𝑆| variables, you obtain respectively:
β̂(𝑆)ML = β̂(𝑆)OLS and σ̂²(𝑆)ML = (1/𝑛) ∑ᵢ₌₁ⁿ (𝑌ᵢ − Ŷᵢ(𝑆))², that is, the maximum likelihood estimate of β coincides with the ordinary least squares estimate.
The log-likelihood of the model 𝑆, which has |𝑆| variables, is then given by:

𝓁(𝑆) = −(𝑛/2) log(2π) − (𝑛/2) log(σ̂²(𝑆)) − 𝑛/2
Choosing the model that maximizes the log-likelihood is equivalent to choosing the model with the smallest residual sum of squares (RSS), that is:

RSS(𝑆) = ∑ᵢ₌₁ⁿ (𝑌ᵢ − Ŷᵢ(𝑆))²
In order to minimize a criterion, we work with the negative log-likelihood. The criterion is generally defined as:
−2𝓁(𝑆) + |𝑆|·𝑓(𝑛)
where 𝑓(𝑛) is a penalty function that depends on the sample size 𝑛.
This formulation allows us to define the AIC and BIC criteria, given as follows:
2.2.1 Akaike Information Criterion (AIC)
AIC(𝑆) = −2𝓁ₛ + 2|𝑆|
where 𝓁ₛ is the log-likelihood of model 𝑆 evaluated at the maximum likelihood estimates of its parameters. Here, 𝑓(𝑛) = 2.
This criterion can be viewed as a combination of goodness of fit and model complexity.
When comparing two models, the one with the lower AIC value is preferred.
2.2.2 Bayesian Information Criterion (BIC)
The Bayesian Information Criterion (BIC) is another method for model selection. It is similar to the AIC and is defined as:
BIC(𝑆) = −2𝓁ₛ + |𝑆|·log(𝑛)
where 𝓁ₛ is the log-likelihood of model 𝑆 evaluated at the maximum likelihood estimates of its parameters.
It is called the Bayesian Information Criterion because it can be derived from a Bayesian perspective. Let 𝒮 = {𝑆₁, …, 𝑆ₘ} denote a set of models. If we assign each model 𝑆ᵢ a prior probability π(𝑆ᵢ) = 1/𝑚, then the posterior probability of model 𝑆ᵢ given the data is proportional to its marginal likelihood, which can be approximated using the BIC. This leads to the following expression:

ℙ(𝑆ᵢ | data) ≈ exp(−BIC(𝑆ᵢ)/2) / ∑ⱼ₌₁ᵐ exp(−BIC(𝑆ⱼ)/2)
Thus, choosing the model that minimizes the BIC is equivalent to choosing the model with the highest posterior probability given the data.
BIC also has an interpretation in terms of minimum description length: it balances model fit against complexity. Because its penalty term is 𝑓(𝑛) = log(𝑛), the BIC applies a stronger penalty than the AIC (for which 𝑓(𝑛) = 2) as soon as 𝑛 > 7. As a result, BIC typically selects more parsimonious models than AIC, especially as the sample size grows.
As with AIC, when comparing two models, the preferred model is the one with the lower BIC.
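In statsmodels, both criteria are available directly on a fitted OLS model. A minimal sketch, assuming y and two candidate covariate matrices X_small and X_large are already defined:

import statsmodels.api as sm

m1 = sm.OLS(y, sm.add_constant(X_small)).fit()
m2 = sm.OLS(y, sm.add_constant(X_large)).fit()

# Lower is better for both criteria; BIC penalizes extra variables more heavily when n > 7
print("AIC:", m1.aic, m2.aic)
print("BIC:", m1.bic, m2.bic)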
2.3 Leave-One-Out Cross-Validation (LOOCV) and k-Fold Cross-Validation
Another widely used method for model selection is leave-one-out cross-validation (LOOCV). In this approach, the risk estimator is defined as:

𝑅̂CV(𝑆) = ∑ᵢ₌₁ⁿ (𝑌ᵢ − Ŷ₋ᵢ(𝑆))²

where Ŷ₋ᵢ(𝑆) is the prediction for 𝑌ᵢ using model 𝑆 fitted on all observations except the i-th one, and 𝑌ᵢ is the actual response for the i-th observation.
It can be shown that:

𝑅̂CV(𝑆) = ∑ᵢ₌₁ⁿ [(𝑌ᵢ − Ŷᵢ(𝑆)) / (1 − hᵢᵢ(𝑆))]²

where hᵢᵢ(𝑆) is the i-th diagonal element of the hat matrix HS = XS (XSᵀ XS)⁻¹ XSᵀ for the model S.
This formula shows that it is unnecessary to refit the model repeatedly, leaving out one observation at a time. Instead, LOOCV can be computed directly from the fitted values and the hat matrix.
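A minimal sketch of this shortcut with statsmodels, assuming a response y and a covariate matrix X_S for the candidate subset (both placeholders):

import statsmodels.api as sm

res = sm.OLS(y, sm.add_constant(X_S)).fit()
h = res.get_influence().hat_matrix_diag                  # diagonal of the hat matrix
loocv = (((y - res.fittedvalues) / (1 - h)) ** 2).sum()
print("LOOCV estimate of the prediction risk:", loocv)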
A natural extension of LOOCV is k-fold cross-validation, where the data are partitioned into k folds; the model is trained on k − 1 folds and validated on the remaining fold. This process is repeated across all folds, and the results are averaged to estimate the prediction error.
k-Fold Cross-Validation
In this approach, the data are divided into 𝑘 groups, or folds (commonly 𝑘 = 5 or 𝑘 = 10). One fold is left out, and the model is fitted on the remaining 𝑘 − 1 folds. The fitted model is then used to predict the responses in the omitted fold. The risk for that fold is estimated as:

𝑅̂⁽ᵏ⁾(𝑆) = ∑ᵢ ∈ fold 𝑘 (𝑌ᵢ − Ŷᵢ⁽⁻ᵏ⁾(𝑆))²

where the sum is taken over all observations in the omitted fold and Ŷᵢ⁽⁻ᵏ⁾(𝑆) is the prediction from the model fitted without that fold. This procedure is repeated for each of the 𝑘 folds, and the overall risk estimate is obtained by averaging the 𝑘 individual risk values.
This method is particularly suitable when the primary goal of regression is prediction. In this setting, alternative performance measures [such as the Mean Absolute Error (MAE) or the Root Mean Squared Error (RMSE)] can also be used to evaluate predictive accuracy.
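As a sketch, the k-fold estimate for a candidate subset can be obtained with scikit-learn (an assumption; the rest of the article uses statsmodels), with X_S and y as placeholders:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X_S, y,
                         scoring="neg_mean_squared_error", cv=cv)
print("10-fold CV estimate of the MSE:", -scores.mean())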
2.4 Other Criteria
In the literature, in addition to the criteria discussed above, several other measures are commonly used for model selection. One widely used option is the adjusted coefficient of determination, defined as:

𝑅²adj = 1 − (1 − 𝑅²)·(𝑛 − 1) / (𝑛 − |𝑆| − 1)
Another approach is to use nested model tests, such as the F-test. The F-test compares two nested models: a smaller model 𝑆₁, whose covariates form a subset of those in a larger model 𝑆₂. The null hypothesis states that the additional variables in 𝑆₂ do not significantly improve the fit relative to 𝑆₁.
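A minimal sketch of such a nested comparison with statsmodels, where X_small and X_large are hypothetical design matrices and the columns of X_small are a subset of those of X_large:

import statsmodels.api as sm

m_small = sm.OLS(y, sm.add_constant(X_small)).fit()
m_large = sm.OLS(y, sm.add_constant(X_large)).fit()

# H0: the additional coefficients in the larger model are all zero
f_stat, p_value, df_diff = m_large.compare_f_test(m_small)
print(f"F = {f_stat:.3f}, p-value = {p_value:.4f}")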
Overall, the methods presented above primarily address the two central goals of linear regression: parameter estimation and variable selection.
Having defined several ways to score a model, the remaining question is how to search the set of candidate models to find the one with the best score.
3. Selection Procedure
Once models can be scored, the next step is to search either the entire space of possible models or a particular subset of it to identify the one with the best score. With 𝑘 covariates, there are 2ᵏ − 1 possible models, a number that quickly becomes impractical for large 𝑘 (for instance, more than one million models when 𝑘 = 20). In such cases, exhaustive search is computationally infeasible, and heuristic methods are preferred. Broadly, model selection strategies fall into two categories: exhaustive search and stepwise search.
3.1 Exhaustive Search
This approach evaluates every possible model and selects the one with the best score. It is feasible only when 𝑘 is small, since the computational burden becomes prohibitive with a large number of covariates.
3.2 Stepwise Search
Stepwise methods aim to identify a local optimum, that is, a model that performs better than its immediate neighbors. These methods are generally recommended only when exhaustive search is not feasible (e.g., when both 𝑛 and 𝑝 are large).
3.2.1 Forward Stepwise Selection
- Choose a scoring criterion (e.g., AIC, BIC, Mallows' 𝐶ₚ).
- Start with an empty model.
- At each step, add the variable that provides the greatest improvement in the criterion.
- Continue until no variable improves the score or all variables are included in the model.
3.2.2 Backward Stepwise Selection
- Choose a scoring criterion (e.g., AIC, BIC, Mallows' 𝐶ₚ).
- Start with the full model containing all variables.
- At each step, remove the variable whose elimination yields the greatest improvement in the criterion.
- Continue until no further improvement is possible or only the essential variables remain.
3.2.3 Stepwise Selection (Mixed Method)
- Choose a scoring criterion (e.g., AIC, BIC, Mallows' 𝐶ₚ).
- Start with an empty model and add variables one at a time, as in forward selection, until no variable further improves the score.
- Then proceed as in backward selection, removing variables one at a time if doing so improves the criterion.
- Stop when no more improvement can be achieved or when all variables are included.
The next section shows how to apply these procedures to real data.
4. Application
In practice, before applying model selection methods, it is essential to ensure that the covariates are not highly correlated. The following procedure can be applied:
- Preliminary filtering: Remove covariates that are clearly irrelevant to the response variable (based on expert judgment, treatment of missing values, etc.).
- Correlation with the response variable: Define a threshold for the correlation between each covariate and the response variable (e.g., 0.6). Covariates below this threshold may be excluded. (Here, we will not apply this filter, in order to retain a sufficient number of covariates for selection.)
- Correlation among covariates: Define a threshold for pairwise correlations between covariates (e.g., 0.7). Compute the correlation matrix; if two covariates exceed the threshold, keep the one with the strongest correlation with the response variable, or the one with better interpretability from a domain perspective.
- Variance Inflation Factor (VIF): Compute the VIF for all remaining covariates. If a covariate's VIF exceeds 5 or 10, it is considered highly collinear with the others and should be removed (a sketch of this computation follows this list).
- Model selection: Apply the chosen model selection methods. In this case, we will use Mallows' 𝐶ₚ as the scoring criterion and backward stepwise selection as the variable selection method.
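As an illustration of the VIF step above, here is a minimal sketch using statsmodels, where df_covariates is a hypothetical DataFrame containing only the candidate covariates:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(df_covariates)
# VIF of each covariate (the constant in column 0 is skipped)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=df_covariates.columns,
)
print(vif[vif > 5].sort_values(ascending=False))  # covariates flagged as highly collinear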
Lastly, we’ll implement a stepwise choice process that may incorporate all the standards mentioned above (AIC, BIC, Mallows’ 𝐶ₚ, and so on.) underneath both ahead or backward methods. This unified strategy will permit us to match fashions and choose the one which greatest balances goodness of match and complexity.
For instance the process, allow us to now current the dataset that can be used for the evaluation.
4.1 Presentation of the Dataset
We use the Communities and Crime dataset from the UCI Machine Learning Repository, which contains socio-economic and demographic information about U.S. communities. The dataset includes more than 100 variables. The response variable is the number of violent crimes per population (violentCrimesPerPop). Our goal is to apply the model selection methods discussed above to identify the covariates most strongly associated with this response.
4.2 Handling Missing Values
For this analysis, we remove all rows containing missing values.
An alternative strategy would be to:
- Drop variables with a high proportion of missingness (e.g., >10%), and
- Assess whether the remaining missing values are Missing At Random (MAR), Missing Completely At Random (MCAR), or Missing Not At Random (MNAR), applying an appropriate imputation method if necessary.
Here, however, we adopt the simpler approach of discarding all incomplete rows. After this step, the dataset contains no missing values and includes 103 variables in total: the response variable violentCrimesPerPop plus the covariates.
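A minimal sketch of this step with pandas; the file name and the '?' missing-value marker used when reading the UCI file are assumptions:

import pandas as pd

df = pd.read_csv("communities_and_crime.csv", na_values="?")
df_complete = df.dropna()   # keep only the rows with no missing values
print(df_complete.shape)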
4.3 Selection of Relevant Variables Using Expert Judgment
We then apply expert judgment to assess the relevance of each variable and determine whether its correlation with the response is meaningful. This requires consultation between the statistician and domain experts to understand the context and significance of each covariate.
For this dataset, we remove:
- communityname (a categorical variable with many levels), and
- fold (a technical variable used only for cross-validation).
After this filtering step, we retain 101 variables: the response violentCrimesPerPop and 100 covariates.
4.4 Reducing Covariates Using a Correlation Threshold
To further reduce dimensionality, we compute the correlation matrix of the covariates and the response. When several covariates are highly correlated with one another (correlation > 0.6), we retain only the one with the strongest correlation to the response. This procedure reduces redundancy while mitigating multicollinearity.
After applying this filtering and computing the Variance Inflation Factor (VIF), we retain a final set of 19 covariates, all with VIF values below 5.
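Continuing the sketch above, a simplified greedy pass implementing this correlation filter could look as follows (df_complete and the response name come from the previous steps; the exact rule used to arrive at the article's 19 covariates may differ):

target = "violentCrimesPerPop"
corr = df_complete.corr(numeric_only=True)
corr_with_y = corr[target].drop(target).abs()

kept = []
# Scan covariates from the most to the least correlated with the response
for var in corr_with_y.sort_values(ascending=False).index:
    # Keep a covariate only if it is not too correlated with one already kept
    if all(abs(corr.loc[var, k]) <= 0.6 for k in kept):
        kept.append(var)
print(len(kept), "covariates retained")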

These preprocessing steps are explained in greater detail in my article Feature Selection. Now, let us apply our selection procedure to identify the most relevant variables.
4.5 Model Selection with Stepwise Selection
With 19 variables, the total number of possible models is 2¹⁹ − 1 = 524,287, which can be computationally infeasible for many systems. To reduce the search space, we use a stepwise selection procedure. We implement a function, stepwise_selection, that identifies the most relevant variables based on a chosen selection criterion and method (forward, backward, or mixed). In this example, we use Mallows' 𝐶ₚ as the selection criterion and apply both forward and backward stepwise selection.
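For reference, a typical call to this function (defined in the Code section at the end of the article) might look as follows, where df_selected is a hypothetical DataFrame containing the 19 retained covariates plus the response:

selected_vars, best_model, history = stepwise_selection(
    df_selected,
    target="violentCrimesPerPop",
    method="backward",
    metric="Cp",
    verbose=True,
)
print(best_model.summary())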
4.5.1 Backward Stepwise Selection Using Mallows' 𝐶ₚ
Applying backward selection with Mallows' 𝐶ₚ, we proceed as follows:
- Step 1: Remove pctWFarmSelf. Its exclusion reduces the criterion to 𝐶ₚ = 41.74, lower than that of the full model.
- Step 2: Remove PctWOFullPlumb. This further decreases 𝐶ₚ to 41.69895.
- Step 3: Remove indianPerCap. The criterion is reduced again, to 𝐶ₚ = 41.66073.
In total, three variables are removed, yielding the final model.
4.5.2 Forward Stepwise Selection Using Mallows' 𝐶ₚ
Forward stepwise selection is generally recommended when the number of variables is large, since it is less computationally demanding than backward selection. Starting from an empty model, variables are added sequentially, one at a time, according to the improvement in the criterion.
In this example, forward selection identifies the same set of variables as backward selection. Figure 1 below illustrates the sequence of variables added to the model, together with their corresponding 𝐶ₚ values. The process begins with PctKids2Par, followed by PctWorkMom, LandArea, and continues until the final model is reached, achieving a criterion value of 𝐶ₚ = 41.66.

Warning! This doesn’t but deal with the query of which variables are causes of the unbiased variable.
Conclusion
In this article, we addressed the question of model selection. The core principle of the procedure is to assign a score to each model in order to measure its quality, and then to search through the set of possible models to identify the one with the best score. This score is defined by balancing the quality of fit against the complexity of the model.
Among the available procedures, we presented the forward and backward stepwise methods, which we implemented in Python. We applied them using different evaluation criteria: AIC, BIC, and Mallows' Cp.
These methods, however, have a limitation: they explore only a subset of all possible models. As a result, the selected models may sometimes represent oversimplifications of reality. Nonetheless, they remain very useful when the number of variables is large and exhaustive approaches become computationally too expensive.
Finally, when using regression for predictive purposes, it is essential to split the dataset into two parts: a training set and a test set. Variable selection must be carried out only on the training set, and never on the test set, in order to ensure an honest evaluation of the model's predictive performance.
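As a sketch of that workflow, using scikit-learn's splitter (an assumption; any splitting utility would do) and the hypothetical df_selected from above:

from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df_selected, test_size=0.2, random_state=42)
# Run the stepwise selection on the training set only...
selected_vars, best_model, _ = stepwise_selection(train_df, target="violentCrimesPerPop",
                                                  method="backward", metric="Cp")
# ...and evaluate the chosen model on test_df afterwards.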
Image Credit
All images and visualizations in this article were created by the author using Python (pandas, matplotlib, seaborn, and plotly) and Excel, unless otherwise stated.
References
Wasserman, L. (2013). All of Statistics: A Concise Course in Statistical Inference. Springer Science & Business Media.
Redmond, M. (2002). Communities and Crime [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C53W3X.
Cornillon, P. A., Hengartner, N., Matzner-Løber, E., & Rouvière, L. (2023). Régression avec R (3ème édition). EDP Sciences.
Data & Licensing
The dataset used in this article is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This license allows anyone to share and adapt the dataset for any purpose, including commercial use, provided that proper attribution is given to the source.
For more details, see the official license text: CC BY 4.0.
Disclaimer
Any remaining errors or inaccuracies are the author's responsibility. Feedback and corrections are welcome.
Code
import numpy as np
import statsmodels.api as sm
def compute_score(y, X, vars_to_test, metric, full_model_mse=None):
    """Fit an OLS model on the given subset of covariates and return its score for the chosen metric."""
    X_train = sm.add_constant(X[vars_to_test])
    model = sm.OLS(y, X_train).fit()
    p = len(vars_to_test) + 1  # +1 for the constant
    if metric == 'AIC':
        return model.aic
    elif metric == 'BIC':
        return model.bic
    elif metric == 'Cp':
        if full_model_mse is None:
            raise ValueError("full_model_mse must be provided to compute Mallows' Cp.")
        rss = sum(model.resid ** 2)
        return rss + 2 * p * full_model_mse
    elif metric == 'R2_adj':
        return -model.rsquared_adj  # negated so that lower is always better
    else:
        raise ValueError("Unknown metric. Use 'AIC', 'BIC', 'Cp' or 'R2_adj'.")
def get_best_candidate(y, X, selected, candidates, metric, method, full_model_mse=None):
    """Score every candidate addition (forward) or removal (backward) and return the best one."""
    scores_with_candidates = []
    for candidate in candidates:
        if method == 'forward':
            vars_to_test = selected + [candidate]
        else:
            vars_to_test = [var for var in selected if var != candidate]
        score = compute_score(y, X, vars_to_test, metric, full_model_mse)
        scores_with_candidates.append((score, candidate, vars_to_test))
    scores_with_candidates.sort()
    print("Candidates tested:", [(v, round(s, 2)) for s, v, _ in scores_with_candidates])
    return scores_with_candidates[0] if scores_with_candidates else (None, None, None)
def stepwise_selection(df, target, method='forward', metric='AIC', verbose=True):
    """Forward or backward stepwise selection of covariates according to the chosen metric."""
    if df.isnull().values.any():
        raise ValueError("The DataFrame contains missing values.")
    X = df.drop(columns=[target])
    y = df[target]
    variables = list(X.columns)
    selected = [] if method == 'forward' else variables.copy()
    remaining = variables.copy() if method == 'forward' else []
    # Pre-compute the full-model MSE, needed for Mallows' Cp
    if metric == 'Cp':
        X_full = sm.add_constant(X)
        full_model = sm.OLS(y, X_full).fit()
        full_model_mse = sum(full_model.resid ** 2) / (len(y) - len(variables) - 1)
    else:
        full_model_mse = None
    current_score = np.inf
    history = []
    step = 0
    while True:
        step += 1
        candidates = remaining if method == 'forward' else selected
        best_score, best_candidate, vars_to_test = get_best_candidate(
            y, X, selected, candidates, metric, method, full_model_mse)
        if best_candidate is None:
            if verbose:
                print("No candidate available.")
            break
        if verbose:
            action = "add" if method == 'forward' else "remove"
            print(f"\nStep {step}: best variable to {action}: {best_candidate} (score={round(best_score, 5)})")
        improvement = best_score < current_score - 1e-6
        if improvement:
            if method == 'forward':
                selected.append(best_candidate)
                remaining.remove(best_candidate)
            else:
                selected.remove(best_candidate)
            current_score = best_score
            history.append({
                'step': step,
                'selected': selected.copy(),
                'score': current_score,
                'modified': best_candidate
            })
        else:
            if verbose:
                print("No further improvement of the score.")
            break
    X_final = sm.add_constant(X[selected])
    best_model = sm.OLS(y, X_final).fit()
    if verbose:
        print("\nSelected variables:", selected)
        final_score = best_model.aic if metric == 'AIC' else best_model.bic
        if metric == 'Cp':
            final_score = compute_score(y, X, selected, metric, full_model_mse)
        elif metric == 'R2_adj':
            final_score = -compute_score(y, X, selected, metric)
        print(f"Final score ({metric}): {round(final_score, 5)}")
    return selected, best_model, history
import matplotlib.pyplot as plt

def plot_stepwise_crosses(history, all_vars, metric="AIC", title=None):
    """
    Plot the stepwise search as a heatmap-like cross chart:
    - X axis: covariates modified at least once (in order of appearance)
    - Y axis: score (AIC/BIC/Cp) at each step (taken from the history)
    - Black cross: variable modified at each step
    - Grey line: evolution of the score
    """
    n_steps = len(history)
    scores = [h['score'] for h in history]
    # Ordered list of the variables that were actually modified
    modified_vars = []
    for h in history:
        var = h['modified']
        if var not in modified_vars and var is not None:
            modified_vars.append(var)
    n_mod_vars = len(modified_vars)
    # X positions of the crosses (according to modified_vars)
    mod_pos = [modified_vars.index(h['modified']) if h['modified'] in modified_vars else None
               for h in history]
    fig, ax = plt.subplots(figsize=(min(1.3 * n_mod_vars, 8), 6))
    # Place a black cross at each step
    for i, x in enumerate(mod_pos):
        if x is not None:
            ax.scatter(x, scores[i], color='black', marker='x', s=100, zorder=3)
    # Draw the score curve
    ax.plot(range(n_steps), scores, color='grey', alpha=0.7, linewidth=2, zorder=1)
    # X axis: vertical labels, smaller font (modified variables only)
    ax.set_xticks(range(n_mod_vars))
    ax.set_xticklabels(modified_vars, rotation=90, fontsize=10)
    ax.set_xlabel("Modified variables")
    ax.set_ylabel(metric)
    ax.set_title(title or f"Stepwise ({metric}): variable modified at each step")
    ax.grid(True, axis='y', alpha=0.2)
    plt.tight_layout()
    plt.show()