Northwind Dataset - Statistical Analysis
Northwind Dataset - Statistical Analysis
Data description
The data comes form a fictional "Northwind trading" company that seems to sell food products and beverages. The data is stored in the form of a SQL database in multiple tables. The schema of the SQL database is shown below. We will perform statistical tests to investigate various questions of business interest. The code for the analysis can be found here.
Questions investigated
- Does discount amount have a statistically significant effect on the quantity of a product in an order? If so, at what level(s) of discount?
- Do sales get better year on year? Does the month when the products were ordered matter?
- Is there a standout employee in terms of the distribution from which quantities sold are drawn ? Are there employees who are shipping their orders significantly late?
- Are there regions that are growing or decreasing in overall sales ? Are there product categories that are growing or decreasing in overall sales
1) Does discount amount have a statistically significant effect on the quantity of a product in an order? If so, at what level(s) of discount?
We can query our database OrderDetail Table to get access to the quantities of a product ordered and the discount offered. We want to investigate whether the discount offered has a significant effect on the quantity ordered. We will go through a systematic analysis of the effect of discount prices on sales, but first we will begin by stating the precise hypothesis that we are trying to investigate.
Hypothesis - The null hypothesis H0 states that the quantities ordered are drawn from come from the same distribution regardless of the discount level offered. We will choose our alpha = 0.05. The alternative hypothesis states that the distributions of no discount vs discounted products come from different distributions with different mean.
This is a two-tailed test. Typically if we want to test whether two Gaussian distributions are the same, we use the Welch's t-test. This is a better version of the Student's t-test, in that the distributions being compared don't have to have the same variance. However the underlying assumption of Welch's t-test is that the true underlying distribution is Gaussian. So we need to confirm that before implementing the Welch's t-test. We can divide our dataset into two categories, i.e. no discount and discount and test their normality visually by a QQ plot.
QQ plot of quantities ordered without discount (left) and with discount (right). Clearly they are not normally distributed. |
Visually we can already see the difference between the two distributions at the tails |
Conducting a two sample KS test we get a p-value of 5*10-7
This is clear indication that the two samples come from different distributions since the p-value massively rejects our null hypothesis.
Our next order of business is to figure out which if any levels of discount matter. Our hypothesis remains the same except we will apply the hypothesis for each discount level. There discount levels [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.1, 0.15, 0.20, 0.25]. However only [0.05, 0.1, 0.15, 0.2, 0.25] have enough data points (>3) to make a meaningful decision and we will use these. Below we plot the distributions for each discount level including no discount.
We can do a statistical test on all of the discount levels and compare it to the no discount level. Along with the Welch's t-test which might be unreliable because the underlying distributions are not normal, we also perform a Monte Carlo test and a KS test. For the Monte Carlo test we take the distributions we are to compare, merge them, then split the merged distributions randomly into sizes of the original dataframe. We subtract the means of the randomly generated distributions. Repeating this process a lot of times generates a distribution of mean differences that is normally distributed. We then count the percentage of mean differences that were greater than the mean difference of our original data sets which is our p value. The power of the Monte Carlo approach is that it is a non-parametric test and no assumptions are made about the underlying distribution. The results are summarized in the table below. As we can see the Welch's t-test results don't exactly match with the more reliable Monte Carlo results. KS test is even more conservative in it's p-values than the others
As we can see all the discount levels except 0.1 reject the Null Hypothesis after taking into account the multiple comparisons problem. So discount levels of 5% also matter and the power (related to the Type II errors) is also above 0.8 . Its probably true that discount level of 0.1 also results in increased sales. We can investigate how many data points we need to investigate whether we can reject the Null Hypothesis with 0.8 power. Assuming that the 0.1 discount data is drawn from the same distribution with the same mean as 0.05 data, we get that we would need 462 data points to conclusively rule out the Null Hypothesis.
It is also interesting to note that there is no statistical difference in the distributions of quantities for 0.05, 0.15,0.2 and 0.25 levels i.e. statistically they seem to be drawn from the same distribution. Hence it might not be worthwhile to give discounts above 5%. This is however not a robust conclusion since high discount levels might be needed to keep up the sales.
We can investigate further whether individual products also exhibit the trend that higher discount levels correspond to higher prices. We can perform Welch's t-test and KS test for different discount levels for individual products. However to begin with we will split each product into two categories, one with discount and the other with no discount. Comparing the distributions of these two categories for each product we get that 3 products display a difference between discount vs no discount data using Welch's test and 2 products display a difference doing a KS test. Taking into account the multiple comparison problem we expect for 77 products, (77*0.05 ~ 4) products to have a difference assuming Null hypothesis is true. Thus among individual products there is no evidence that the discount level matters.
Plotted below are the p-values using a KS test (green) and Welch test (blue) for all products and all relevant discount levels. As can be seen very few points have a p-value lower than 0.05 (red line). In fact if we take into account the multiple comparison problem there is no evidence that discount values matter for individual products.
- Considering all products discounted items are drawn from a different distribution compared to undiscounted items at the 95% confidence level
- Considering individual discount levels, discounted products at 0.05, 0.15, 0.2, and 0.25 levels are drawn from a a different distribution compared to the undiscounted quantities and have higher means. There is no concrete distinction in the distributions between 0.05, 0.15, 0.2 and 0.25 levels. 0.1 discount level needs to have more data points for us to potentially reject the null hypothesis
- At the individual product level there is no evidence that discount levels matter, However that is most likely due to low statistics per product and per discount level.
2) Do sales get better year on year? Does the month when the products were ordered matter?
Hypothesis - Throughout the analysis we need to specify what our hypothesis is. We will set our tolerance value $\alpha=0.05$. We will be performing the two tailed tests. Our Null hypothesis H0 is that the the quantities ordered at various years (or months) have the same mean and variance as that of other years (months). Our alternative hypothesis is that the means and variance of the distributions at different years (months) don't have the same mean and variance. Thus this is a two tailed hypothesis test. Since the variance of the two distributions can be different and the sample sizes are not the same we need to do a Welch's t-test. We will also perform a Kolmogorov-Smirnov test.
Let's begin by investigating whether the distribution of quantities sold change from year to year. This will happen if the products in our ensemble are on an average growing or waning in popularity. However if the products are already established, we don't expect a big change in their demand. Let's begin by conducting our analysis regardless of product per year. We have data from July 2012 to May 2014 so we can divide the data into 3 years [2012, 2013, 2014]. Plotting a box-whisker plot for quantities ordered in the year we can see that there is no noticeable difference. In fact performing the Welch t-test we can see that there is no statistical difference between the distributions and the p-values are much larger than 0.05
Boxplot of distribution of quantities over years |
p-values for Welch t-test comparing years |
p-values for products that reject the Null hypothesis. |
Letf: Heatmap of Welch's t-test that reject the Null hypothesis. Right: Heatmap of KS test that reject the Null hypothesis. |
3) Is there a standout employee in terms of the distribution from which quantities sold are drawn ? Are there employees who are shipping their orders significantly late?
Hypothesis - Throughout the analysis we need to specify what our hypothesis is. We will set our tolerance value $\alpha=0.05$. We will be performing the two tailed tests. Our Null hypothesis is that the the quantities sold by various employees have the same mean and variance. Our alternative hypothesis is that the means and variance of all employess dont have the same mean and variance. Thus this is a two tailed hypothesis test. Since the variance of the two distributions can be different and the sample sizes are not the same we need to do a Welch's t-test. We will also perform a Kolmogorov-Smirnov test.
Conducting an ANOVA analysis on employees we get that the F-value is ~ 0.11. This is not significant enough to consider the quantities sold to be a function of the employee. This is borne out if we do a comparison for all employees doing a Welch's t-test and a KS test. Plotted below are the heaptmaps of both the test, showing the comparison that reject the Null hypothesis. Although the Welch's test suggests that Employee 6 is performing wore than others, the KS test is inconclusive. Hence we conclude that the distribution from which quantity sold is drawn do not depend on the employee
It has also been noticed that a fraction of the orders have been delivered late. We want to see if there are particular employees that are outliers in this respect, i.e. they tend do deliver their products late with a higher probability than others. Since there is a fixed probability associated with delivering a shipment late, the number of late deliveries will depend on the total shipments handled by each employee and will follow a binomial distribution. Thus we can set up our hypothesis as follows
Calculating the probabilities we find that Employee 5 has the best on time delivery record. Keeping Employee 5 as our standard and comparing other employees with Employee 5 in a z-test we find that Employees [4,2,7,9] need to pick up their pace. See code in Github repo.
4)Are there regions that are growing or decreasing in overall sales ? Are there product categories that are growing or decreasing in overall sales
Here we will investigate whether any regions and [product categories exhibit an increase in sales from 2012 to 2014. Note that this is different than Q2 where we investigated the whether the distribution which determined the product quantities changed over time. In this question we will investigate whether sales were higher in regions and product categories between 2012 to 2014. For this we will define an "effective probability" for a quantity to be sold per minute in 2012 vs 2014 and then perform a z-test to see if the two binomial distributions.
Hypothesis - We will be performing the two tailed tests. Our Null hypothesis is that the the probability of a product being sold in a given region (or from a given category) does not depend on the year (2012 vs 2014). Our alternative hypothesis is that the the probability of product is different between 2012 and 2014. Thus this is a two tailed hypothesis test. We will be performing the z-test to compare the probabilities of every employee. For this 2-tailed z-test alpha = 0.05 corresponds to z > 1.96
We reach the same conclusion for product categories in that each category shows growth between 2012 and 2014. The results have been summarized in the table below
Conclusions
Q1 Does discount amount have a statistically significant effect on the quantity of a product in an order? If so, at what level(s) of discount?
- Considering all products discounted items are drawn from a different distribution compared to undiscounted items at the 95% confidence level
- Considering individual discount levels, discounted products at 0.05, 0.15, 0.2, and 0.25 levels are drawn from a a different distribution compared to the undiscounted quantities and have higher means. There is no concrete distinction in the distributions between 0.05, 0.15, 0.2 and 0.25 levels. 0.1 discount level needs to have more data points for us to potentially reject the null hypothesis
- At the individual product level there is no evidence that discount levels matter, However that is most likely due to low statistics per product and per discount level.
Q2 Do sales get drawn from a different distribution every year? Does the month when the products were ordered matter?
- The sales do not seem to depend on year and the quantities seem to be drawn from the same underlying distribution. Individual products also don't suggest a preferred year
- For overall sales the month of May shows a definite decrease in quantities purchased
Q3 Is there a standout employee in terms of the distribution from which quantities sold are drawn ? Are there employees who are shipping their orders significantly late?
- There is insufficient evidence that the quantities sold between employees come from different distributions
- Employees 2,4,7,9 statistically have a bad record of shipping on time compared to Employee 5
Q4 Are there regions that are growing or decreasing in overall sales ? Are there products that are growing or decreasing in overall sales
- All regions show a growth comparing year 2012 and 2014
- All products show a growth comparing year 2012 and 2014
Comments
Post a Comment