Posts

Bike Sharing Demand Prediction

Regression to Predict Bike Sharing Demand Introduction In this blog post we will use hourly data from a bike sharing program to predict subsequent demand. The dataset can be found here on Kaggle. The data was collected by a bike share program in the city of Washington, D.C. The training data covers the first 20 days of every month for two years (2011-2012), and our task is to predict hourly usage over the remaining days of each month. Our analysis can be found here. The dataframe looks as follows. There are no null values, so no rows need to be removed or values imputed. Feature Engineering and Data Visualization The only feature engineering we perform is to transform the datetime stamp into [time of day (hour), day of week, month, year]. The transformed dataframe looks as follows. Note that the training data contains three target variables: registered users, casual users, and their sum (count), the total users. For the final submission to Kaggle we on...
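The datetime transformation described above can be sketched with pandas accessors; the derived column names here are illustrative, not necessarily the ones used in the original notebook:

```python
import pandas as pd

# A few sample timestamps in the style of the Kaggle bike sharing data.
df = pd.DataFrame({"datetime": pd.to_datetime(
    ["2011-01-01 00:00:00", "2011-01-01 01:00:00", "2012-06-15 13:00:00"])})

# Expand the stamp into [hour, day of week, month, year].
df["hour"] = df["datetime"].dt.hour
df["dayofweek"] = df["datetime"].dt.dayofweek  # Monday=0 .. Sunday=6
df["month"] = df["datetime"].dt.month
df["year"] = df["datetime"].dt.year
print(df[["hour", "dayofweek", "month", "year"]])
```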

MNIST Digit Recognition

MNIST Digit Recognition without Neural Networks Introduction The MNIST dataset is a set of 70,000 handwritten digits. Of these, 42,000 are used for training while 28,000 are used for testing our trained models. The dataset can be found here on Kaggle. MNIST is an excellent standard benchmark dataset and can be used as a playground to test and learn more about standard classification algorithms. Using neural networks it's reasonably easy to achieve an accuracy greater than 98%, as evidenced by the submitted notebooks on Kaggle. In this blog we will try to see what's the best accuracy we can get without using deep neural networks. We will use Random Forest, Gradient Boosting, SVM, stacking, and combinations of the above to perform our analysis. We will try different tricks and feature engineering, not all of which will give us better accuracy compared to out-of-the-box classification methods. However, since the goal of this blog is to explore and learn, we will report. The anal...
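A minimal sketch of the stacking idea mentioned above, using scikit-learn's `StackingClassifier` with a Random Forest and an SVM as base learners; the small built-in digits dataset stands in for the full MNIST data, and the particular hyperparameters are illustrative:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 8x8 digits stand in for 28x28 MNIST images.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base learners' predictions feed a logistic regression meta-learner.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_train, y_train)
print(round(stack.score(X_test, y_test), 3))
```

Stacking often edges out the individual base models, though as the post notes, not every trick beats the out-of-the-box classifiers.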

ETF Predictions

Introduction An exchange traded fund (ETF) is a fund that owns multiple underlying securities. A popular example is the SPY ETF, which tracks the S&P 500 index, which itself tracks the performance of the 500 largest companies in the United States. In the long term the stock market has historically gone up, and investing in ETFs such as SPY is a lucrative passive investment that leverages this fact. Diversification is an important tool for an investor to minimize the risks of owning individual stocks, and ETFs accomplish this by owning multiple underlying securities. In fact, over a 30 year period, ETFs such as SPY and VTSAX outperform 99% of portfolio managers. Thus predicting how ETFs will change over a given time horizon could improve our returns, possibly far beyond the average 6%-10% annual return generated by the popular ETFs. However, that's easier said than done. The efficient market hypothesis (EMH) says that it is impossible to consistently generate profits (Fam...

Online Purchase Intent

Online Purchase Intent In this blog we will explore a classification problem: judging whether an online shopper will end up buying an item based on their online activity. The dataset can be found here. There are 17 features, of which 10 are continuous and 7 categorical. There is one target variable with a binary classification of True or False, depending on whether any revenue was generated. Exploring the Dataset Let us begin by exploring the dataset. We have the following features and their descriptions. # Administrative: Administrative Value # Administrative_Duration: Duration on Administrative Page # Informational: Informational Value # Informational_Duration: Duration on Informational Page # ProductRelated: Product Related Value # ProductRelated_Duration: Duration on Product Related Page # BounceRates: Bounce Rates of a web page # ExitRates: Exit rate of a web page # PageValues: Page values of each web page # SpecialDay...
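One common way to handle a mix of continuous and categorical features like the ones listed above is a scikit-learn pipeline that one-hot encodes the categoricals and passes the numeric columns through; the tiny frame and column values below are illustrative, not the real dataset:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic rows mimicking a few of the dataset's columns.
df = pd.DataFrame({
    "ProductRelated_Duration": [0.0, 120.5, 30.2, 600.0],
    "BounceRates": [0.2, 0.01, 0.05, 0.0],
    "VisitorType": ["New_Visitor", "Returning_Visitor",
                    "Returning_Visitor", "New_Visitor"],
    "Revenue": [False, True, False, True],  # binary target
})
X = df.drop(columns="Revenue")
y = df["Revenue"]

# One-hot encode the categorical column, pass numerics through.
pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["VisitorType"])],
    remainder="passthrough",
)
clf = Pipeline([("pre", pre),
                ("rf", RandomForestClassifier(random_state=0))])
clf.fit(X, y)
print(clf.predict(X))
```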

Zillow Data Time Series Analysis

Zillow Data Time Series Analysis In this blog post we will deal with the time series analysis of the Zillow dataset, which details the average house prices in every zipcode from April 1996 to April 2018. Our analysis can be found in this github repo. Of course, not all zipcodes have data starting from April 1996. There are 14,723 zipcodes overall, which is a lot of data. The question that we address in this project is the following. Imagine a real estate company has contracted us to analyze the data. The question that they want answered is: What are the 5 best zipcodes to invest in? Of course, the important question is what one means by best. That is, of course, why we are being paid the big bucks. We will have several definitions of best and analyze the data accordingly. Let us begin by describing the dataset and gaining some intuition from it. The dataset looks like the following when stored in the 'wide' format, with one column for the average price for ea...
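The 'wide' layout described above (one column per month) is usually melted into a 'long' format before time series modeling. A minimal pandas sketch, with made-up column names and prices standing in for the Zillow export:

```python
import pandas as pd

# 'Wide' frame: one price column per month, one row per zipcode.
wide = pd.DataFrame({
    "ZipCode": ["60657", "75070"],
    "1996-04": [334200.0, 235700.0],
    "1996-05": [335400.0, 236900.0],
})

# Melt to 'long' format: one row per (zipcode, month) pair, the
# shape most time series tooling expects.
long = wide.melt(id_vars="ZipCode", var_name="Month", value_name="Price")
long["Month"] = pd.to_datetime(long["Month"])
print(long.shape)  # 2 zipcodes x 2 months = 4 rows
```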