Bike Sharing Demand Prediction

Regression to Predict Bike Sharing Demand

Introduction

In this blog post we use hourly data from a bike sharing program to predict subsequent demand. The dataset can be found here on Kaggle. The data was collected by a bike share program in the city of Washington D.C. The training data covers every month up to the 20th for two years (2011-2012); our task is to predict the hourly usage over the remaining days of each month. Our analysis can be found here.

The dataframe looks as follows

There are no null values and no rows need to be removed or values imputed.

Feature Engineering and Data visualization

The only feature engineering we perform is to transform the datetime stamp into [time of day (hour), day of week, month, year]. The transformed dataframe looks as follows
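A minimal sketch of this transformation, assuming the raw timestamp sits in a pandas column called `datetime` (as in the Kaggle data); the two example timestamps are made up:

```python
import pandas as pd

# Toy frame with the raw timestamp column.
df = pd.DataFrame({"datetime": pd.to_datetime(
    ["2011-01-01 00:00:00", "2011-01-01 01:00:00"])})

# Split the timestamp into hour, day of week, month and year.
df["hour"] = df["datetime"].dt.hour
df["dayofwk"] = df["datetime"].dt.dayofweek  # Monday = 0
df["Month"] = df["datetime"].dt.month
df["year"] = df["datetime"].dt.year
df = df.drop(columns=["datetime"])
```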


Note that the training data has three target variables: registered users, casual users, and their sum (count), the total users. For the final submission to Kaggle we only have to predict the total number of users; however, all three are useful for training. We take our continuous features to be [temp, atemp, humidity, windspeed]. The discrete features are [season, holiday, workingday, weather, Month, dayofwk, hour, year]. For the discrete variables we perform OneHotEncoding, resulting in a total of 60 features
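One way to get this encoding is pandas' `get_dummies` (a sketch on made-up values; scikit-learn's `OneHotEncoder` works equally well):

```python
import pandas as pd

# Continuous columns are left as-is; discrete ones are one-hot encoded.
df = pd.DataFrame({
    "temp": [9.84, 14.76],   # continuous
    "season": [1, 2],        # discrete
    "holiday": [0, 0],       # discrete
})
encoded = pd.get_dummies(df, columns=["season", "holiday"])
```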

Let us now visualize the data. First we plot the counts of users as a function of the discrete variables



We also plot the correlation matrix of the continuous variables with the total count

Next we plot the distributions of casual, registered and count. It's possible that the very low user counts in a given hour at one end of the distribution are anomalous; however, we keep them for our final predictions.

Models

There is no good reason to expect linear regression to work here. For example, both too low and too high a temperature would be bad for bike sharing numbers, and there is no reason to expect the other continuous variables to be linearly related to demand either. We therefore choose Random Forest and XGBoost as our two regression models.

The metric used to evaluate the model is the Root Mean Squared Log Error (RMSLE), i.e. the root mean squared error of log(1 + prediction) against log(1 + actual). This makes predicting low and high counts equally important.

Since we are using RMSLE, our regression target is always log(1 + count), where the count is registered, casual or total, depending on the model we are training.
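As a concrete reference, the metric can be written as a small helper (not from the original post):

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Log Error: RMSE computed on log(1 + y)."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
```

Because the error is taken on the log scale, being off by a factor of two costs the same whether the true count is 10 or 1000.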

We will begin by considering the total count as our target

Random Forest Regression

We use sklearn's RandomForestRegressor as our regression model, and GridSearchCV to scan over a grid of parameters for the best values. We do an 80/20 split of the training data into training and "test" sets, and use 5-fold cross-validation to avoid overfitting. The GridSearch parameters are:

'n_estimators': [100],
'criterion': ['mse'],
'max_depth': [None, 2, 3, 5, 10],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 3, 5],
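A minimal sketch of this search on toy stand-in data (the real features and target come from the encoded dataframe; the grid is trimmed here so it runs quickly, and 'criterion' is omitted because 'mse' was renamed 'squared_error' in scikit-learn 1.0):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy stand-ins for the encoded features and the log(1 + count) target.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = np.log1p(rng.integers(0, 500, size=200).astype(float))

# The 80/20 train/"test" split described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# A trimmed version of the parameter grid above.
param_grid = {
    "n_estimators": [100],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(RandomForestRegressor(random_state=0),
                      param_grid, cv=5)  # 5-fold cross-validation
search.fit(X_train, y_train)
```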

The training finishes in under a minute. We get
test RMSLE = 0.34
submission RMSLE = 0.42

There is a big discrepancy between the test data and the blind (output unknown to us) submission data. It doesn't seem that our model is overfitting, since the training and test errors are similar. More likely, since the "blind" submission data covers the later dates of every month (the 20th and beyond), user behavior at the end of the month differs from that at the beginning. We leave addressing this for future work. Note that the best submission RMSLE on the leaderboard is 0.33, so we are still quite far from having a great model.

XGBoost Regression

Next we fit an XGBoost regression model to our data. Everything else remains the same; the GridSearchCV parameters are:

objective = 'reg:squarederror'
'max_depth': range(2, 10, 2),
'n_estimators': range(20, 120, 20),
'learning_rate': [0.001, 0.01, 0.1]

The training is again done in about a minute. We get

test RMSLE = 0.32
submission RMSLE = 0.402

Basic "Stacked" Model

We take our best-fit models from Random Forest and XGBoost and linearly combine them to get our final prediction. The linear fit gives y_test = 0.33 * y_RF + 0.67 * y_xgb. We get

submission RMSLE = 0.399

which is a slight improvement over our XGBoost-only model.
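The blend weights can be found by a linear fit of the two models' validation-set predictions against the true targets. A sketch with made-up numbers (`fit_intercept=False` keeps the combination a pure weighted sum):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up validation targets and per-model predictions,
# all in log(1 + count) space.
y_val = np.array([4.1, 5.2, 3.3, 6.0])
preds = np.column_stack([
    [4.2, 5.1, 3.5, 5.8],   # Random Forest predictions
    [4.0, 5.3, 3.2, 6.1],   # XGBoost predictions
])

# Fit y_val ~ w_rf * y_RF + w_xgb * y_xgb.
blend = LinearRegression(fit_intercept=False).fit(preds, y_val)
y_stacked = preds @ blend.coef_
```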

XGB separate fit

Since XGBoost performs reasonably well on our data, we now use it to train two separate models. Since we have both casual and registered user counts, it makes sense to fit a separate model to each category, as the two types of user may behave differently.

All the hyperparameter tuning remains the same. For our registered user model we get

test RMSLE = 0.31

For our casual user model we get

test RMSLE = 0.51

As we can see, the casual counts are not fit well at all and have high variance. Combining the results we get

test RMSLE = 0.31
submission RMSLE = 0.400

which is only slightly better than our total XGBoost model.
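One subtlety in this combination: each model predicts in log(1 + count) space, so the sum has to be taken in count space before re-applying the log. A sketch with made-up predictions:

```python
import numpy as np

# Hypothetical log(1 + count) predictions from the two models.
log_registered = np.array([5.0, 4.2])
log_casual = np.array([2.1, 1.3])

# Undo the log transform, add the counts, then re-log for RMSLE.
total = np.expm1(log_registered) + np.expm1(log_casual)
log_total = np.log1p(total)
```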

We also try using the registered prediction as a feature for the casual-usage model. Though the casual test RMSLE improves slightly, the impact on the final total RMSLE is minimal, since there are far fewer casual users than registered users.

Conclusion and Future Work

Random Forest and XGBoost are powerful regressors: they don't require linearity, and they need minimal feature engineering to give reasonable results. While our best score is far from the top of the Kaggle leaderboard, we can attempt the following in the future
1) Train more regressors to do a better stacked analysis
2) Reduce the error by accounting for the fact that the submission data falls in the last third of each month. A time series analysis over each month might help predict the end-of-month trend.
