A Quick Look at Stepwise Regression: Explanation, Code Fix, and Opinion

amir daley
Aug 31, 2021

Okay, so during this Data Science curriculum we came across Stepwise Regression, a super easy and straightforward feature selection method for modeling linear regressions. While I do not feel fluid enough with ML to walk through, let's say, a naïve Bayes classifier and publish it to the world, this is simple enough.

The example in our course did not have the backward step functioning, and it appeared to be similar to code posted on StackExchange. I will go over a quick explanation of the method, the code, and my opinion on the use of the strategy.

Stepwise regression

So for the Python users, take a look at the code below; it is the example posted on StackExchange. Glance through and we will go over how it is used. The link is below as well.

import pandas as pd
import statsmodels.api as sm

def stepwise_selection(X, y,
                       initial_list=[],
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    included = list(initial_list)
    while True:
        changed = False
        # forward step
        excluded = list(set(X.columns) - set(included))
        # explicit dtype avoids the empty-Series dtype warning in newer pandas
        new_pval = pd.Series(index=excluded, dtype=float)
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included + [new_column]]))).fit()
            new_pval[new_column] = model.pvalues[new_column]
        best_pval = new_pval.min()
        if best_pval < threshold_in:
            best_feature = new_pval.idxmin()
            included.append(best_feature)
            changed = True
            if verbose:
                print('Add {:30} with p-value {:.6}'.format(best_feature, best_pval))

        # backward step
        model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included]))).fit()
        # use all coefs except intercept
        pvalues = model.pvalues.iloc[1:]
        worst_pval = pvalues.max()  # NaN if pvalues is empty
        if worst_pval > threshold_out:
            changed = True
            worst_feature = pvalues.idxmax()
            included.remove(worst_feature)
            if verbose:
                print('Drop {:30} with p-value {:.6}'.format(worst_feature, worst_pval))
        if not changed:
            break
    return included

result = stepwise_selection(X, y)

print('resulting features:')
print(result)
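Note that X and y are assumed to already exist as a pandas DataFrame of predictors and a Series target. If you just want to smoke-test the function, here is one hypothetical setup (the dataset choice is mine, not from the course or StackExchange); with it defined, the call above runs as written.

# Hypothetical setup: any numeric DataFrame/Series pair works here.
from sklearn.datasets import load_diabetes

data = load_diabetes(as_frame=True)   # 10 numeric predictors, continuous target
X, y = data.data, data.target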

So the methodology is that you take your inputs (predictors and target variable), a p-value threshold for the forward step, and a p-value threshold for the backward step. You start from a list of included features (empty by default), and on each pass the forward step fits a linear model for every excluded feature added on top of that list. The candidate with the lowest p-value is checked against the threshold, and as long as it clears it, that feature is added to the list of valuable features. By the end of the forward passes you should have a list of variables that improve your model.

In this code the statistic used is each coefficient's p-value. The rationale is that once an independent variable's p-value drops below 0.05, we reject the null hypothesis that its coefficient is zero. So in this example, for any feature with a p-value below 0.05, we can say that the feature is statistically significant to the model.
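If you want to see where those p-values come from, here is a minimal sketch; the column names are illustrative (they happen to exist in the hypothetical diabetes setup above), but model.pvalues is the actual statsmodels attribute the selection loop reads.

import pandas as pd
import statsmodels.api as sm

# Assuming X (DataFrame) and y (Series) from above; the columns are illustrative.
model = sm.OLS(y, sm.add_constant(X[['bmi', 'bp']])).fit()
print(model.pvalues)   # one p-value per term, including the intercept ('const')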

Every time the loop runs it adds at most one feature to the list, and it drops the worst of any previously added variables whose p-value has risen above the exit threshold with the addition of another variable.

What is pretty cool is that you could technically swap the threshold statistic for other model validators, like adjusted R² or the F-statistic, and simply test for an improvement of the value at each step.
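As a sketch of that idea (my own variation, not from the course or the StackExchange answer, and the function name is made up), here is a forward step that greedily adds whichever feature most improves adjusted R² and stops when no candidate improves it:

import pandas as pd
import statsmodels.api as sm

def forward_by_adj_r2(X, y):
    """Greedy forward selection on adjusted R-squared (illustrative only)."""
    included, best_score = [], -float('inf')
    while True:
        scores = {}
        for col in set(X.columns) - set(included):
            fit = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            scores[col] = fit.rsquared_adj
        if not scores:
            break
        best_col = max(scores, key=scores.get)
        if scores[best_col] <= best_score:   # no candidate improves the fit
            break
        best_score = scores[best_col]
        included.append(best_col)
    return included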

Once you have the current list of variables from the forward step, the model runs through the backward step, which refits on everything included, checks the same statistic for each variable, and removes the worst one if it is above the exit threshold, since a feature that looked significant earlier can stop being significant once others are added.
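For comparison, here is backward elimination on its own, a minimal sketch of the same idea without the forward pass (my own restatement, not code from the course):

import pandas as pd
import statsmodels.api as sm

def backward_elimination(X, y, threshold_out=0.05):
    """Drop the least significant feature until all p-values clear the threshold."""
    included = list(X.columns)
    while included:
        pvalues = sm.OLS(y, sm.add_constant(X[included])).fit().pvalues.iloc[1:]
        if pvalues.max() <= threshold_out:
            break
        included.remove(pvalues.idxmax())   # drop the worst feature and refit
    return included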

But this code does not work

So I soon found out, while implementing this code from my lecture, that the backward step did not work for me. If the backward step adjusted the list once, then the next time the loop cycled back through, it reported the list as empty. My understanding at the time was that you cannot modify a list while a loop is still iterating over it, so I fixed it by adjusting this line of code so it indexes the list directly:

included.remove(worst_feature)
to
included.remove(included[worst_feature])

I just did not like that I could not figure it out right away while I was working on my project.

Opinion

We learned this as a tertiary way to determine whether the variables you have are significant. I personally saw a good use for this method after the first model I ran off of my base model. Not because the variables it picked were necessarily the most important, but because it helped me understand the data: I could see which common-sense variables I would choose and think are important, and relate them to the list of features this method chose from my first model's p-values. For example, in my project on WHO Life Expectancy Data, many of the variables that were removed had some type of multicollinearity with another variable, so it was very easy to understand why they were dropped.
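If you want to check that multicollinearity hunch yourself, variance inflation factors are one quick way; a minimal sketch, assuming the same numeric predictor DataFrame X as above:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# A VIF above roughly 5-10 is a common rule of thumb for problematic multicollinearity.
Xc = add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(Xc.values, i) for i in range(Xc.shape[1])],
    index=Xc.columns,
)
print(vif.drop('const').sort_values(ascending=False))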

Overall, I would not put heavy significance on this method; there are other ways to determine best fit, but it is definitely useful. I think one way to improve it would be to try different combinations and orderings of the terms, because the result can depend on the order the variables are entered: if a variable near the end of the list is removed for multicollinearity with a feature next to it, we would not know whether the same holds for all of the other variables.
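One cheap way to probe that order-dependence (my own experiment, not from the course) is to rerun the selection on shuffled column orders and see whether the chosen feature set is stable:

import numpy as np

# Hypothetical robustness check: shuffle the column order a few times
# and compare the selected feature sets.
rng = np.random.default_rng(0)
runs = []
for _ in range(5):
    cols = list(rng.permutation(X.columns))
    runs.append(frozenset(stepwise_selection(X[cols], y, verbose=False)))

print(len(set(runs)), 'distinct feature sets across 5 shuffles')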
