Multiple Regression Project:
Forecasting Sales for Proposed New
Sites of Pam and Susan’s Stores
Pam and Susan’s is a discount department store that currently has 250 stores, most of which are located throughout the southern United States. As the company has grown, it has become increasingly more important to identify profitable locations. Using census and existing store data, a multiple regression equation will be used to forecast potential sales, and therefore which proposed new site location will be more profitable.
The data set has 37 independent variables. This includes 7 categorical variables for competitive type and 30 numerical categories. There are 250 stores, meaning the sample size is 250. As the sales are given in $1,000’s of dollars it is best to remember that a unit change in x will correspond to that coefficient of x multiplied by 1,000.
III. Results and Discussion
Building a multiple regression model requires a step-by-step approach. Failure to follow such methodology could ultimately lead to incorrect and inaccurate forecasting for the dependent variable of interest. Below I will outline the process and findings used to obtain a multiple regression equation to forecast potential sales at newly proposed site of Pam and Susan’s discount department stores.
The initial step in building a multiple regression model is to look for outliers and non-linear relationships between your dependent (predicated sales) and independent variables. In order for multiple regression to be an accurate forecasting tool, each x-variable should have a slightly linear relationship with the y-variable. Below in Table (i) is a list of the 10 quantitative x-variables that have the highest correlation with sales. These 10 variables will be used to obtain the final multiple regression equation. Additionally, the problem of multicollinearity can result if the correlation coefficients are almost perfectly correlated with each other. As none of the correlation coefficients are significantly higher than 0.9, it is safe to assume no multicollinearity problems exist in our model.
Next, you should convert all categorical variables into dummy variables. The purpose of this is to allow for non-numerical information to be used in building the multiple regression equation. In Pam and Susan’s case competitive type (comtype) was a categorical value. A plot of sales versus comtype, below in Graph (i), shows that comtype 3 thru 6 are likely statistically insignificant variables as they are all in similar ranges of sales; therefore, only comtype 1, 2, and 7 will be used for this model.
After determining the significant independent variables for use in the multiple regression analysis, you can use one of two different procedures to obtain your equation: stepwise or backwards. For this particular model the backwards procedure was used. To begin, the 10 quantitative x-variables having the highest correlations with sales and comtypes 1, 2, and 7, were placed into a blank worksheet along with sales at the 250 current locations. Utilizing the regression add-in under Data Analysis for Excel, each scenario was tested for significance. Significance was defined as having a t-value less than (-2) and more than +2. Ultimately, the variable whose t-value was closest to zero was eliminated and a new regression equation was computed. This process was repeated until only statistically significant variables remained. Removal of insignificant variables must be done one at a time due to how the coefficients and standard error change each time a variable is removed. Furthermore, removing insignificant variables reinforces the idea of parsimony, or explaining as much as possible with the fewest number of variables.
Subsequently, you must check the technical assumptions you have made. This requires two separate...