Posts

Bike Sharing Demand Prediction

Regression to Predict Bike Sharing Demand Introduction In this blog post we will use hourly data from a bike sharing program to predict subsequent demand. The dataset can be found here on Kaggle. The data was collected by a bike share program in the city of Washington, D.C. The training data covers the first 20 days of every month for two years (2011-2012), and our task is to predict hourly usage over the remaining days of each month. Our analysis can be found here. The dataframe looks as follows. There are no null values, so no rows need to be removed or values imputed. Feature Engineering and Data Visualization The only feature engineering we perform is to transform the datetime stamp into [time of day (hour), day of week, month, year]. The transformed dataframe looks as follows. Note that the training data contains three target variables: registered users, casual users, and their sum (count), the total users. For the final submission to Kaggle we on...
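The datetime transformation described above can be sketched with pandas accessors; the derived column names here are illustrative, not necessarily the ones used in the original notebook:

```python
import pandas as pd

# A few sample timestamps in the style of the Kaggle bike sharing data.
df = pd.DataFrame({"datetime": pd.to_datetime(
    ["2011-01-01 00:00:00", "2011-01-01 01:00:00", "2012-06-15 13:00:00"])})

# Expand the stamp into [hour, day of week, month, year].
df["hour"] = df["datetime"].dt.hour
df["dayofweek"] = df["datetime"].dt.dayofweek  # Monday=0 .. Sunday=6
df["month"] = df["datetime"].dt.month
df["year"] = df["datetime"].dt.year
print(df[["hour", "dayofweek", "month", "year"]])
```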

MNIST Digit Recognition

MNIST Digit Recognition without Neural Networks Introduction The MNIST dataset is a set of 70,000 handwritten digits. Of these, 42,000 are used for training while 28,000 are used for testing our trained models. The dataset can be found here on Kaggle. MNIST is an excellent standard benchmark dataset and can be used as a playground to test and learn more about standard classification algorithms. Using neural networks it's reasonably easy to achieve an accuracy greater than 98%, as evidenced by the submitted notebooks on Kaggle. In this blog we will try to see what's the best accuracy we can get without using deep neural networks. We will use Random Forest, Gradient Boosting, SVM, stacking, and combinations of the above to perform our analysis. We will try different tricks and feature engineering, not all of which will give us better accuracy compared to out-of-the-box classification methods. However, since the goal of this blog is to explore and learn, we will report. The anal...
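A minimal sketch of the stacking idea mentioned above, using scikit-learn's `StackingClassifier` with a Random Forest and an SVM as base learners; the small built-in digits dataset stands in for the full MNIST data, and the particular hyperparameters are illustrative:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 8x8 digits stand in for 28x28 MNIST images.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base learners' predictions feed a logistic regression meta-learner.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_train, y_train)
print(round(stack.score(X_test, y_test), 3))
```

Stacking often edges out the individual base models, though as the post notes, not every trick beats the out-of-the-box classifiers.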

ETF Predictions

Introduction An exchange traded fund (ETF) is a fund that owns multiple underlying securities. A popular example is the SPY ETF, which tracks the S&P 500 index, which itself tracks the performance of the 500 largest companies in the United States. In the long term the stock market has historically gone up, and investing in ETFs such as SPY is a lucrative passive investment that leverages this fact. Diversification is an important tool for an investor to minimize the risks of owning individual stocks, and ETFs accomplish this by owning multiple underlying securities. In fact, over a 30 year period, ETFs such as SPY and VTSAX outperform 99% of portfolio managers. Thus predicting how ETFs will change over a given time horizon could improve our returns, possibly far beyond the average 6%-10% annual return generated by the popular ETFs. However, that's easier said than done. The efficient market hypothesis (EMH) says that it is impossible to consistently generate profits (Fam...

Online Purchase Intent

Online Purchase Intent In this blog we will explore a classification problem: judging whether an online shopper will end up buying an item based on their online activity. The dataset can be found here. There are 17 features, of which 10 are continuous and 7 categorical. There is one target variable with a binary classification of True or False, depending on whether any revenue was generated. Exploring the Dataset Let us begin by exploring the dataset. We have the following features and their descriptions. # Administrative: Administrative Value # Administrative_Duration: Duration on Administrative Page # Informational: Informational Value # Informational_Duration: Duration on Informational Page # ProductRelated: Product Related Value # ProductRelated_Duration: Duration on Product Related Page # BounceRates: Bounce Rates of a web page # ExitRates: Exit rate of a web page # PageValues: Page values of each web page # SpecialDay...
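One common way to handle a mix of continuous and categorical features like the ones listed above is a scikit-learn pipeline that one-hot encodes the categoricals and passes the numeric columns through; the tiny frame and column values below are illustrative, not the real dataset:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic rows mimicking a few of the dataset's columns.
df = pd.DataFrame({
    "ProductRelated_Duration": [0.0, 120.5, 30.2, 600.0],
    "BounceRates": [0.2, 0.01, 0.05, 0.0],
    "VisitorType": ["New_Visitor", "Returning_Visitor",
                    "Returning_Visitor", "New_Visitor"],
    "Revenue": [False, True, False, True],  # binary target
})
X = df.drop(columns="Revenue")
y = df["Revenue"]

# One-hot encode the categorical column, pass numerics through.
pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["VisitorType"])],
    remainder="passthrough",
)
clf = Pipeline([("pre", pre),
                ("rf", RandomForestClassifier(random_state=0))])
clf.fit(X, y)
print(clf.predict(X))
```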

Zillow Data Time Series Analysis

Zillow Data Time Series Analysis In this blog post we will deal with the time series analysis of the Zillow dataset, which details the average house prices in every zipcode from April 1996 to April 2018. Our analysis can be found in this github repo. Of course, not all zipcodes have data starting from April 1996. There are 14,723 zipcodes overall, which is a lot of data. The question that we address in this project is the following. Imagine a real estate company has contracted us to analyze the data. The question that they want answered is: What are the 5 best zipcodes to invest in? Of course, the important question is what one means by best. That is, of course, why we are being paid the big bucks. We will have several definitions of best and analyze the data accordingly. Let us begin by describing the dataset and gaining some intuition from it. The dataset looks like the following when stored in the 'wide' format, with one column for the average price for ea...
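The 'wide' layout described above (one column per month) is usually melted into a 'long' format before time series modeling. A minimal pandas sketch, with made-up column names and prices standing in for the Zillow export:

```python
import pandas as pd

# 'Wide' frame: one price column per month, one row per zipcode.
wide = pd.DataFrame({
    "ZipCode": ["60657", "75070"],
    "1996-04": [334200.0, 235700.0],
    "1996-05": [335400.0, 236900.0],
})

# Melt to 'long' format: one row per (zipcode, month) pair, the
# shape most time series tooling expects.
long = wide.melt(id_vars="ZipCode", var_name="Month", value_name="Price")
long["Month"] = pd.to_datetime(long["Month"])
print(long.shape)  # 2 zipcodes x 2 months = 4 rows
```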