Assignment 3 – CELL2CELL Case

1. What are the Business Objective(s) and Data Mining Objective(s) for the case?

Business Objectives

To develop a proactive retention program (including incentive plans) to reduce customer churn.

Data Mining Objectives

1. To predict churn accurately

2. To identify key factors that drive customer churn

2. Based on initial data understanding (using the MultiPlot/StatExplore node), what are some initial obvious results?

Data Understanding

1. Using StatExplore to find out the relative importance of variables

From StatExplore, we found that EQPDAYS is the most important variable, as it has the highest Worth(Sum) and ChiSquare Sum.
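As a rough illustration of what a chi-square score measures, the sketch below computes a Pearson chi-square statistic between a binned input and the churn flag. The bin labels and the toy counts are made up for illustration, and the exact computation inside StatExplore may differ.

```python
from collections import Counter


def chi_square(pairs):
    """Pearson chi-square statistic for a list of (category, target) pairs."""
    n = len(pairs)
    row_totals = Counter(category for category, _ in pairs)
    col_totals = Counter(target for _, target in pairs)
    observed = Counter(pairs)
    stat = 0.0
    for category in row_totals:
        for target in col_totals:
            # Expected count if the input and the target were independent.
            expected = row_totals[category] * col_totals[target] / n
            stat += (observed[(category, target)] - expected) ** 2 / expected
    return stat


# Hypothetical toy data: (EQPDAYS bin, churn flag). Not taken from the case data.
related = ([("old", 1)] * 30 + [("old", 0)] * 20
           + [("new", 1)] * 20 + [("new", 0)] * 30)
unrelated = ([("old", 1)] * 25 + [("old", 0)] * 25
             + [("new", 1)] * 25 + [("new", 0)] * 25)

print(chi_square(related))    # larger value: churn rate differs by bin
print(chi_square(unrelated))  # zero: churn rate is the same in both bins
```

A higher statistic means the churn rate varies more across the bins of the input, which is why a variable with a high chi-square score is a natural candidate to keep.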

2. Choosing which variables to reject

The “csa” variable is already rejected in the data set because it is a string (text) variable, and “churndep” is rejected because it has missing values for the validation set. Other variables must first be analyzed in the MultiPlot and the statistical model before rejecting them. As a general rule, a good input variable is one that supports the data model and also makes business sense.

3. Using MultiPlot to observe the distribution of various variables against the Target

From these plots and the table above, we can see that the variables EQPDAYS, MONTHS, MOU, CHANGEM, DROPBLK, DROPVCE, RECCHRGE and REVENUE explain “Churn” better than the other variables.

4. Using StatExplore to search for variables having missing values

As we can see, there are a few missing values for the variables AGE1, AGE2, CHANGEM, CHANGER, DIRECTAS, MOU, OVERAGE, RECCHRGE, REVENUE and ROAM.
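The same kind of missing-value check can be sketched outside the tool. The example below counts missing entries per variable, assuming each customer is a dictionary and a missing value is stored as None. The three rows are invented for illustration.

```python
def missing_counts(rows):
    """Count missing (None) entries per variable across all rows."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            counts.setdefault(name, 0)
            if value is None:
                counts[name] += 1
    return counts


# Hypothetical toy rows using three of the variable names listed above.
toy_rows = [
    {"AGE1": 34, "MOU": 120.5, "REVENUE": None},
    {"AGE1": None, "MOU": None, "REVENUE": 45.0},
    {"AGE1": 51, "MOU": None, "REVENUE": 30.0},
]

print(missing_counts(toy_rows))
```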

3. Run at least 6 models on SAS: Decision Trees (binary and three-way tree), Logistic Regression, Logistic Regression with Transform Variables, Neural Networks, Neural Networks (after selection of variables/transform variables).

Initial Data Preparation

1. Partitioning the data

The data needs to be partitioned into a training set and a validation set. To enable this, the data set has a binary “Calibrate” variable that has the same value for 41,000 data points and the opposite value for the remaining data points. We have to set Calibrate to “segment” in the data set.
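The idea of splitting on a binary flag can be sketched as follows. The key name CALIBRATE and the choice of 1 as the training value are assumptions made for this example, not details confirmed by the case data.

```python
def partition(rows, flag="CALIBRATE", train_value=1):
    """Split rows into (training, validation) using a binary flag column."""
    training = [row for row in rows if row[flag] == train_value]
    validation = [row for row in rows if row[flag] != train_value]
    return training, validation


# Hypothetical toy rows; the CUSTOMER ids are invented for illustration.
toy_rows = [
    {"CUSTOMER": 1, "CALIBRATE": 1},
    {"CUSTOMER": 2, "CALIBRATE": 0},
    {"CUSTOMER": 3, "CALIBRATE": 1},
]

training, validation = partition(toy_rows)
print(len(training), len(validation))
```

Because the split follows a flag that already exists in the data, every run produces the same training and validation sets, unlike a random partition.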

The Partitioning result

2. Imputing the data-set

The data set provided for this case has few missing values and inconsistencies, but it can still be imputed to get better results.
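A minimal sketch of imputation is shown below, assuming mean imputation for a numeric variable. The report does not state which imputation method was used, so the method and the toy values here are illustrative only.

```python
from statistics import mean


def impute_mean(rows, column):
    """Return copies of rows with missing (None) values in column replaced by the mean."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = mean(observed)
    return [
        dict(row, **{column: fill if row[column] is None else row[column]})
        for row in rows
    ]


# Hypothetical toy values for MOU; not taken from the case data.
toy_rows = [{"MOU": 100.0}, {"MOU": None}, {"MOU": 300.0}]
imputed = impute_mean(toy_rows, "MOU")
print(imputed)
```

The function returns new dictionaries and leaves the input rows unchanged, so the original missing-value pattern stays available for comparison.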

Running different Predictive models

After the data partition step, we can apply different predictive models, such as decision trees, logistic regression and neural networks, to predict churn. Logistic regression and neural networks may require transformation of variables for good modeling.

a. Using the decision trees – With 2 branches

b. Using the decision trees – With 3 branches

c. Using the Logistic Regression

According to the logistic regression, RETCALL appears to be the most significant variable.

d. Using the Neural Networks

Transformations Used

Along with imputing the variables, we performed transformations here. Various transformation methods were tried to check which one offers the most lift. Some of the transformations we tried:

* Only optimal binning and normal of EQPDAYS, MONTHS, RETCALL and SETPRC

* Optimal binning; normal of EQPDAYS, MONTHS, RETCALL and SETPRC. Rejected DROPBLK, since variables for dropped and blocked calls are already present; INCMISS and MARRYUN are unknown.
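Since the transformations are compared by lift, a small sketch of one common definition may help: the churn rate among the top-scored fraction of customers divided by the overall churn rate. The scores and labels below are invented, and the lift reported by the tool may be computed at a different depth or in a different way.

```python
def lift_at(scores, labels, fraction=0.1):
    """Churn rate in the top-scored fraction divided by the overall churn rate."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, int(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:top_n]) / top_n
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate


# Hypothetical model scores and churn flags for ten customers.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

print(lift_at(scores, labels, fraction=0.2))
```

A lift of 1 means the model ranks customers no better than chance at that depth. Larger values mean the top-scored group contains proportionally more churners.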

e. Using Logistic regression with transformation

(i) Using only optimal binning

(ii) Using Optimal binning and dropping some variables

f. Using Neural Network with transformation

(i) Using only optimal binning

(ii) Using Optimal binning and dropping some variables

4. Interpret and explain the output of the various models. What should be the assessment criteria for the models? What are the transformations of variables...