Northwind Dataset - Statistical Analysis

Data description

The data comes form a fictional "Northwind trading" company that seems to sell food products and beverages. The data is stored in the form of a SQL database in multiple tables. The schema of the SQL database is shown below. We will perform statistical tests to investigate various questions of business interest. The code for the analysis can be found here.

Questions investigated

Does discount amount have a statistically significant effect on the quantity of a product in an order? If so, at what level(s) of discount?
Do sales get better year on year? Does the month when the products were ordered matter?
Is there a standout employee in terms of the distribution from which quantities sold are drawn ? Are there employees who are shipping their orders significantly late?
Are there regions that are growing or decreasing in overall sales ? Are there product categories that are growing or decreasing in overall sales

1) Does discount amount have a statistically significant effect on the quantity of a product in an order? If so, at what level(s) of discount?

We can query our database OrderDetail Table to get access to the quantities of a product ordered and the discount offered. We want to investigate whether the discount offered has a significant effect on the quantity ordered. We will go through a systematic analysis of the effect of discount prices on sales, but first we will begin by stating the precise hypothesis that we are trying to investigate.

Hypothesis - The null hypothesis H0 states that the quantities ordered are drawn from come from the same distribution regardless of the discount level offered. We will choose our alpha = 0.05. The alternative hypothesis states that the distributions of no discount vs discounted products come from different distributions with different mean.

This is a two-tailed test. Typically if we want to test whether two Gaussian distributions are the same, we use the Welch's t-test. This is a better version of the Student's t-test, in that the distributions being compared don't have to have the same variance. However the underlying assumption of Welch's t-test is that the true underlying distribution is Gaussian. So we need to confirm that before implementing the Welch's t-test. We can divide our dataset into two categories, i.e. no discount and discount and test their normality visually by a QQ plot.

QQ plot of quantities ordered without discount (left) and with discount (right). Clearly they are not normally distributed.

Even though our quantities are not normally distributed, we can still use the Welch's t-test since we have a lot of data points and in some cases the distributions can be "normal enough" to use the Welch's t-test safely. However just to be on the safe side in our analysis we will use a Monte Carlo test and we will also use a two sample Kolmogorov-Smirnov (KS) test which does not care if the distributions are Gaussian. However empirically there is only a minimal difference between the Welch's t-test on the actual data and the log-transformed data. The actual binned data along with their Kernel Density Estimates are plotted below along with their means. Visually we can already tell the difference between the two distributions especially in the tails.

Visually we can already see the difference between the two distributions at the tails

Conducting Welch's t-test on the data we get a p-value of 5*10^-10
Conducting a two sample KS test we get a p-value of 5*10^-7
This is clear indication that the two samples come from different distributions since the p-value massively rejects our null hypothesis.

Our next order of business is to figure out which if any levels of discount matter. Our hypothesis remains the same except we will apply the hypothesis for each discount level. There discount levels [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.1, 0.15, 0.20, 0.25]. However only [0.05, 0.1, 0.15, 0.2, 0.25] have enough data points (>3) to make a meaningful decision and we will use these. Below we plot the distributions for each discount level including no discount.

We can do a statistical test on all of the discount levels and compare it to the no discount level. Along with the Welch's t-test which might be unreliable because the underlying distributions are not normal, we also perform a Monte Carlo test and a KS test. For the Monte Carlo test we take the distributions we are to compare, merge them, then split the merged distributions randomly into sizes of the original dataframe. We subtract the means of the randomly generated distributions. Repeating this process a lot of times generates a distribution of mean differences that is normally distributed. We then count the percentage of mean differences that were greater than the mean difference of our original data sets which is our p value. The power of the Monte Carlo approach is that it is a non-parametric test and no assumptions are made about the underlying distribution. The results are summarized in the table below. As we can see the Welch's t-test results don't exactly match with the more reliable Monte Carlo results. KS test is even more conservative in it's p-values than the others

As we can see all the discount levels except 0.1 reject the Null Hypothesis after taking into account the multiple comparisons problem. So discount levels of 5% also matter and the power (related to the Type II errors) is also above 0.8 . Its probably true that discount level of 0.1 also results in increased sales. We can investigate how many data points we need to investigate whether we can reject the Null Hypothesis with 0.8 power. Assuming that the 0.1 discount data is drawn from the same distribution with the same mean as 0.05 data, we get that we would need 462 data points to conclusively rule out the Null Hypothesis.

It is also interesting to note that there is no statistical difference in the distributions of quantities for 0.05, 0.15,0.2 and 0.25 levels i.e. statistically they seem to be drawn from the same distribution. Hence it might not be worthwhile to give discounts above 5%. This is however not a robust conclusion since high discount levels might be needed to keep up the sales.

We can investigate further whether individual products also exhibit the trend that higher discount levels correspond to higher prices. We can perform Welch's t-test and KS test for different discount levels for individual products. However to begin with we will split each product into two categories, one with discount and the other with no discount. Comparing the distributions of these two categories for each product we get that 3 products display a difference between discount vs no discount data using Welch's test and 2 products display a difference doing a KS test. Taking into account the multiple comparison problem we expect for 77 products, (77*0.05 ~ 4) products to have a difference assuming Null hypothesis is true. Thus among individual products there is no evidence that the discount level matters.

Plotted below are the p-values using a KS test (green) and Welch test (blue) for all products and all relevant discount levels. As can be seen very few points have a p-value lower than 0.05 (red line). In fact if we take into account the multiple comparison problem there is no evidence that discount values matter for individual products.

p-values for different discount levels for individual products. Green dots are with KS test and blue are using Welch test. Red line denotes alpha=0.05 . As can be seen most of the points don't reject the Null hypothesis. Taking into account the fact that we are doing multiple comparisons per product there is no evidence of any discount level affecting the quantity of sales for any product

Conclusions for Q1:

Considering all products discounted items are drawn from a different distribution compared to undiscounted items at the 95% confidence level
Considering individual discount levels, discounted products at 0.05, 0.15, 0.2, and 0.25 levels are drawn from a a different distribution compared to the undiscounted quantities and have higher means. There is no concrete distinction in the distributions between 0.05, 0.15, 0.2 and 0.25 levels. 0.1 discount level needs to have more data points for us to potentially reject the null hypothesis
At the individual product level there is no evidence that discount levels matter, However that is most likely due to low statistics per product and per discount level.

2) Do sales get better year on year? Does the month when the products were ordered matter?

Hypothesis - Throughout the analysis we need to specify what our hypothesis is. We will set our tolerance value $\alpha=0.05$. We will be performing the two tailed tests. Our Null hypothesis H0 is that the the quantities ordered at various years (or months) have the same mean and variance as that of other years (months). Our alternative hypothesis is that the means and variance of the distributions at different years (months) don't have the same mean and variance. Thus this is a two tailed hypothesis test. Since the variance of the two distributions can be different and the sample sizes are not the same we need to do a Welch's t-test. We will also perform a Kolmogorov-Smirnov test.

Let's begin by investigating whether the distribution of quantities sold change from year to year. This will happen if the products in our ensemble are on an average growing or waning in popularity. However if the products are already established, we don't expect a big change in their demand. Let's begin by conducting our analysis regardless of product per year. We have data from July 2012 to May 2014 so we can divide the data into 3 years [2012, 2013, 2014]. Plotting a box-whisker plot for quantities ordered in the year we can see that there is no noticeable difference. In fact performing the Welch t-test we can see that there is no statistical difference between the distributions and the p-values are much larger than 0.05

Boxplot of distribution of quantities over years

p-values for Welch t-test comparing years

Doing the analysis using KS test for individual products we find that only 5 out 77 products reject the null hypothesis. Taking into account multiple comparison problem we expect 4 products to reject the Null hypothesis (Type I error) assuming Null hypothesis is true. So we cannot say with any certainty that the underlying distribution for quantities of any product changed over time. Below we have plotted the p-values for the products [21, 42, 56, 68, 77] that reject the Null Hypothesis comparing year 2013, 2014 to year 2012.

p-values for products that reject the Null hypothesis.

The next step is to see whether the month has any effect on the quantities ordered, i.e. whether sales in any given month are affected on the month. Since there are 12 different months we will have a huge multiple comparison problem if we compare each month with the rest. So we will conduct an ANOVA test to start with by treating the month as a categorical variable (see code.) We get that the ANOVA F-value for month is ~ 0.04 making Quantities ordered mildly dependent on the month. We can dig deeper by performing an explicit comparison between all months and plotting the heatmap of comparisons that reject the null hypothesis. Using the Welch's t-test and KS test we can see that May is an outlier month compared to others. Welch's t-test also indicates that August is also an outlier month.

Letf: Heatmap of Welch's t-test that reject the Null hypothesis. Right: Heatmap of KS test that reject the Null hypothesis.

This is borne out visually in the box-whisker plot where we can see a lower mean for the month of May as shown above So the month of May is a distinctly bad one for sales.

3) Is there a standout employee in terms of the distribution from which quantities sold are drawn ? Are there employees who are shipping their orders significantly late?

Hypothesis - Throughout the analysis we need to specify what our hypothesis is. We will set our tolerance value $\alpha=0.05$. We will be performing the two tailed tests. Our Null hypothesis is that the the quantities sold by various employees have the same mean and variance. Our alternative hypothesis is that the means and variance of all employess dont have the same mean and variance. Thus this is a two tailed hypothesis test. Since the variance of the two distributions can be different and the sample sizes are not the same we need to do a Welch's t-test. We will also perform a Kolmogorov-Smirnov test.

Conducting an ANOVA analysis on employees we get that the F-value is ~ 0.11. This is not significant enough to consider the quantities sold to be a function of the employee. This is borne out if we do a comparison for all employees doing a Welch's t-test and a KS test. Plotted below are the heaptmaps of both the test, showing the comparison that reject the Null hypothesis. Although the Welch's test suggests that Employee 6 is performing wore than others, the KS test is inconclusive. Hence we conclude that the distribution from which quantity sold is drawn do not depend on the employee

It has also been noticed that a fraction of the orders have been delivered late. We want to see if there are particular employees that are outliers in this respect, i.e. they tend do deliver their products late with a higher probability than others. Since there is a fixed probability associated with delivering a shipment late, the number of late deliveries will depend on the total shipments handled by each employee and will follow a binomial distribution. Thus we can set up our hypothesis as follows

Hypothesis - We will be performing the two tailed tests. Our Null hypothesis is that the the probability of late deliveries does not depend on the employee. Our alternative hypothesis is that the the probability of late delivery of all employees is not the same. Thus this is a two tailed hypothesis test. We will be performing the z-test to compare the probabilities of every employee. For this 2-tailed z-test alpha = 0.05 corresponds to z > 1.96

Calculating the probabilities we find that Employee 5 has the best on time delivery record. Keeping Employee 5 as our standard and comparing other employees with Employee 5 in a z-test we find that Employees [4,2,7,9] need to pick up their pace. See code in Github repo.

4)Are there regions that are growing or decreasing in overall sales ? Are there product categories that are growing or decreasing in overall sales

Here we will investigate whether any regions and [product categories exhibit an increase in sales from 2012 to 2014. Note that this is different than Q2 where we investigated the whether the distribution which determined the product quantities changed over time. In this question we will investigate whether sales were higher in regions and product categories between 2012 to 2014. For this we will define an "effective probability" for a quantity to be sold per minute in 2012 vs 2014 and then perform a z-test to see if the two binomial distributions.

Hypothesis - We will be performing the two tailed tests. Our Null hypothesis is that the the probability of a product being sold in a given region (or from a given category) does not depend on the year (2012 vs 2014). Our alternative hypothesis is that the the probability of product is different between 2012 and 2014. Thus this is a two tailed hypothesis test. We will be performing the z-test to compare the probabilities of every employee. For this 2-tailed z-test alpha = 0.05 corresponds to z > 1.96

In order to do this we first divide the data by region and then calculate the probability per minute any product had of being sold in 2012 vs 2014. The total quantities sold should then follow a binomial distribution. We do the two-tailed z-test on the 2012 vs 2014 data. All our regions have z > 1.96 with the probability being sold in 2014 higher than that of 2102. Thus all regions exhibit growth between 2012 and 2014. The effect might be a seasonal one since 2014 only captures data till May and 2012 only captures data after July but that is unlikely since all the products are staple foods.

We reach the same conclusion for product categories in that each category shows growth between 2012 and 2014. The results have been summarized in the table below

The tables show the probability of products sold per minute in 2012 vs 2014 by splitting them using Region or Category. Sice z > 1.96 for all regionss and categories we see induspitable growth in products being sold in 2012 vs 2014

Conclusions

Q1 Does discount amount have a statistically significant effect on the quantity of a product in an order? If so, at what level(s) of discount?

Considering all products discounted items are drawn from a different distribution compared to undiscounted items at the 95% confidence level
Considering individual discount levels, discounted products at 0.05, 0.15, 0.2, and 0.25 levels are drawn from a a different distribution compared to the undiscounted quantities and have higher means. There is no concrete distinction in the distributions between 0.05, 0.15, 0.2 and 0.25 levels. 0.1 discount level needs to have more data points for us to potentially reject the null hypothesis
At the individual product level there is no evidence that discount levels matter, However that is most likely due to low statistics per product and per discount level.

Q2 Do sales get drawn from a different distribution every year? Does the month when the products were ordered matter?

The sales do not seem to depend on year and the quantities seem to be drawn from the same underlying distribution. Individual products also don't suggest a preferred year
For overall sales the month of May shows a definite decrease in quantities purchased

Q3 Is there a standout employee in terms of the distribution from which quantities sold are drawn ? Are there employees who are shipping their orders significantly late?

There is insufficient evidence that the quantities sold between employees come from different distributions
Employees 2,4,7,9 statistically have a bad record of shipping on time compared to Employee 5

Q4 Are there regions that are growing or decreasing in overall sales ? Are there products that are growing or decreasing in overall sales

All regions show a growth comparing year 2012 and 2014
All products show a growth comparing year 2012 and 2014

Search This Blog

Data Science projects