Online Purchase Intent
In this blog we will explore a classification problem: judging whether an online shopper will end up buying an item based on their browsing activity. The dataset can be found here. There are 17 features, of which 10 are continuous and 7 categorical. There is one binary target variable, True or False depending on whether any revenue was generated.
Exploring the Dataset
Let us begin by exploring the dataset. We have the following features and their descriptions.
- Administrative: Administrative value
- Administrative_Duration: Duration on administrative pages
- Informational: Informational value
- Informational_Duration: Duration on informational pages
- ProductRelated: Product related value
- ProductRelated_Duration: Duration on product related pages
- BounceRates: Bounce rate of a web page
- ExitRates: Exit rate of a web page
- PageValues: Page value of each web page
- SpecialDay: Special days like Valentine's Day, etc.
- Month: Month of the year
- OperatingSystems: Operating system used
- Browser: Browser used
- Region: Region of the user
- TrafficType: Traffic type
- VisitorType: Type of visitor
- Weekend: Weekend or not
For the purposes of visualization and for the Naive Bayesian analysis, I treat features with fewer than 30 unique values as categorical variables and the rest as continuous. I also log transform the continuous variables to make them less skewed. Below I have plotted the distributions of the continuous variables; as can be seen, there are significant differences between their distributions.
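As a rough sketch of this preprocessing (the filename, the df_log name, and the clipping of the -1 sentinel in Administrative_Duration are my assumptions, not from the original post):

import numpy as np
import pandas as pd

# Assumed filename for the dataset described above.
df = pd.read_csv("online_shoppers_intention.csv")

# Features with fewer than 30 unique values are treated as categorical;
# the rest are continuous. "Revenue" is the boolean target.
categorical_cols = [c for c in df.columns
                    if c != "Revenue" and df[c].nunique() < 30]
continuous_cols = [c for c in df.columns
                   if c != "Revenue" and c not in categorical_cols]

# Log transform the continuous features to reduce skew. log1p handles
# zeros, and clipping handles the -1 sentinel in Administrative_Duration.
df_log = df.copy()
df_log[continuous_cols] = np.log1p(df_log[continuous_cols].clip(lower=0))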
We can also visualize the distribution of the target variable across the categorical variables by plotting the relative fractions in a bar plot, as done below.
We also need to keep in mind that the target variable has class imbalance, i.e. ~85% of the values are False and only ~15% are True. So even if we predict all entries as False we will get an accuracy score of ~85%; any model we build will need to improve on this accuracy score. We will begin by doing a naive Bayesian analysis.
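A quick sanity check of that baseline (assuming the DataFrame from the sketch above, with the boolean target in the Revenue column):

# Majority-class baseline: always predict False.
baseline_accuracy = (df["Revenue"] == False).mean()
print(f"Baseline accuracy: {baseline_accuracy:.3f}")  # roughly 0.85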
Probability distributions for the continuous features. The continuous variables have been log transformed
Target variable as a function of the categorical variables
Naive Bayesian Analysis
Naive Bayesian analysis assumes that all the features are independent of each other given the target class. Under this assumption we can write

P(y | x1, x2, …, xn) ∝ P(y) P(x1 | y) P(x2 | y) ⋯ P(xn | y)

where y is the target class (True or False) and the xi are the features.
In order to do a Naive Bayesian analysis we need to find the likelihoods, i.e. P(xi | y) for y = True, False. We treat continuous and categorical variables separately. To find the likelihood of a continuous variable we bin the data for each target value and count the entries per bin, which gives us our likelihood probabilities. For a categorical variable we bin it by its categories. The priors P(True) and P(False) are about 0.15 and 0.85 respectively, since those are the frequencies of the target values in the underlying dataset. We split the data into train and test sets, and classify a test data point by whether its posterior probability is larger for True or for False.
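A minimal single-feature sketch of this binning scheme (the helper names are hypothetical; x and y are NumPy arrays of a continuous feature and the boolean target):

import numpy as np

def fit_binned_likelihoods(x, y, bins=20):
    """Estimate P(x | y) for one continuous feature by histogramming
    the feature separately within each target class."""
    edges = np.histogram_bin_edges(x, bins=bins)
    likelihoods = {}
    for cls in (True, False):
        counts, _ = np.histogram(x[y == cls], bins=edges)
        # Add-one smoothing so an empty bin cannot zero out a posterior.
        likelihoods[cls] = (counts + 1) / (counts + 1).sum()
    return edges, likelihoods

def predict_single_feature(x, edges, likelihoods, prior_true=0.15):
    """Classify by comparing P(True) P(x|True) against P(False) P(x|False)."""
    idx = np.clip(np.digitize(x, edges) - 1, 0, len(edges) - 2)
    posterior_true = prior_true * likelihoods[True][idx]
    posterior_false = (1 - prior_true) * likelihoods[False][idx]
    return posterior_true > posterior_false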
We first train the naive Bayes classifier on the continuous features. Below we have plotted the precision, recall, accuracy, and F1 score for each individual feature. As we can see, Page Values is the best predictor, whereas the other features barely affect the predictions individually. Now we combine all the continuous features into a single Bayesian classifier. As can be seen, the predictive power of the model in fact goes down, most likely because our independence assumption is not correct. Training on the categorical variables, we find that none of them are predictive by themselves, so we ignore them. Since Page Values alone is a better predictor than the combined classifier, we use naive Bayes on Page Values for our baseline accuracy score; any model we use should do better than that.
Scores for Naive Bayes with continuous features. As we can see, Page Values classifies better than the combined features
One might expect that doing PCA on the continuous features and rerunning the Naive Bayesian analysis would give better results, because the transformed features are now uncorrelated. However, that turns out not to be the case: the predictions are no better than predicting all False.
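A sketch of that experiment, reusing df_log and continuous_cols from the earlier sketch (scikit-learn's GaussianNB stands in here for the binned classifier above; the split parameters are illustrative):

from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

X_cont = df_log[continuous_cols].values
y = df["Revenue"].values

X_train, X_test, y_train, y_test = train_test_split(
    X_cont, y, test_size=0.25, random_state=0, stratify=y)

# Rotate the features onto uncorrelated principal components,
# then fit a Gaussian naive Bayes model on the rotated data.
model = make_pipeline(PCA(), GaussianNB())
model.fit(X_train, y_train)
print(f"Accuracy after PCA: {model.score(X_test, y_test):.3f}")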
Logistic Regression
One of the simplest classifier models is Logistic Regression. For Logistic Regression we treat the last 7 features as categorical. One-hot encoding them with get_dummies gives us a total of 73 features. We also add another boolean feature which is True when Administrative_Duration = -1.
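A sketch of the encoding step (the name of the extra flag column is hypothetical):

import pandas as pd

# One-hot encode the last 7, categorical, features.
last7 = ['Month', 'OperatingSystems', 'Browser', 'Region',
         'TrafficType', 'VisitorType', 'Weekend']
X = pd.get_dummies(df, columns=last7)

# Boolean flag for the -1 sentinel in Administrative_Duration.
X["AdminDuration_is_minus1"] = df["Administrative_Duration"] == -1
y = X.pop("Revenue")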
We run a GridSearchCV on Logistic Regression with the following parameter grid:
param_grid = {'fit_intercept': [True, False], 'solver': ['liblinear'], 'C': np.logspace(0, 4, 5),
              'penalty': ['l2'],
              'class_weight': [{0: 1, 1: 1}, {0: 1.2, 1: 1}, {0: 3, 1: 1}, {0: 6, 1: 1}]}
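The search itself might look like this (the cv and scoring settings are my assumptions; X and y come from the encoding step above):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print('Optimal Parameters:', grid.best_params_)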
The best parameters are
Optimal Parameters: {'C': 1000.0, 'class_weight': {0: 3, 1: 1}, 'fit_intercept': False, 'penalty': 'l2', 'solver': 'liblinear'}
Below we have plotted the ROC curve. The test and train AUC are extremely close to each other, indicating that we are not overfitting. The accuracy we get with Logistic Regression is 0.897.
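A sketch of the ROC comparison (RocCurveDisplay requires a reasonably recent scikit-learn; grid is the fitted search from above):

import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

# Overlay the train and test ROC curves for the tuned model;
# near-equal AUCs indicate little overfitting.
best_lr = grid.best_estimator_
disp = RocCurveDisplay.from_estimator(best_lr, X_train, y_train, name='train')
RocCurveDisplay.from_estimator(best_lr, X_test, y_test, name='test', ax=disp.ax_)
plt.show()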
Random Forests
The second classifier we use is Random Forests. A Random Forest is an ensemble of Decision Trees. We also use GridSearchCV for our Random Forest classifier. The parameters we used for GridSearchCV are
{'n_estimators': [10, 30, 100, 300], 'criterion': ['gini', 'entropy'], 'max_depth': [None, 2, 6, 10],
 'min_samples_split': [5, 10, 15], 'min_samples_leaf': [1, 3, 6]}
The optimal parameters of our search are
Optimal Parameters: {'criterion': 'entropy', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 15, 'n_estimators': 300}
with an accuracy of 91.4%.
We have plotted below the top 10 most important features in our Random Forest classification. As expected, PageValues is the most important feature.
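A sketch of how such a ranking can be produced (refitting with the optimal parameters above; the plot details are illustrative):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=300, criterion='entropy',
                            max_depth=None, min_samples_split=15,
                            min_samples_leaf=1, random_state=0)
rf.fit(X_train, y_train)

# Rank the one-hot-encoded features by impurity-based importance.
top10 = pd.Series(rf.feature_importances_, index=X_train.columns).nlargest(10)
top10.sort_values().plot.barh(title='Top 10 feature importances')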
Stacking and Conclusion
We tried 3 different classification methods, and our Random Forest achieved the maximum accuracy of 91.4%. We can now stack all three predictors together and see if we get a better result; the idea is that each predictor fits different aspects of the data better. We stack our predictors by giving one vote to each predictor. As can be seen from the summary of the final results below, the stacked predictor does not do better than our best predictor, the Random Forest.
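A sketch of that one-vote-each scheme using hard majority voting (scikit-learn's GaussianNB stands in for the binned naive Bayes; the hyperparameters are the ones found above):

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

stacked = VotingClassifier(
    estimators=[
        ('nb', GaussianNB()),
        ('lr', LogisticRegression(C=1000.0, class_weight={0: 3, 1: 1},
                                  fit_intercept=False, solver='liblinear')),
        ('rf', RandomForestClassifier(n_estimators=300, criterion='entropy',
                                      min_samples_split=15, random_state=0)),
    ],
    voting='hard')  # each model casts one vote per sample
stacked.fit(X_train, y_train)
print(f"Stacked accuracy: {stacked.score(X_test, y_test):.3f}")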