Statistical Analysis and Visualization

Overview

Teaching: 40 min
Exercises: 0 min

Questions

How can I perform statistical analysis in Python?

How to visualize my data

Objectives

Learn to perform statistical analysis with Scipy.stats module

Learn to visualize data with matplotlib

Statistical Analysis with Scipy

Similar as the Dada Analytics Tools in Excel, python can perform many statistic analysis as well. scipy module has many useful statistic tools. (more than Excel).

t-test

Assume you are interested in selling either energy drink or cola, and you want to know if the two categories has significant difference in average bottle sales. We can perform a t-test assuming different variance.
Firstly we need to import the stats module.

from scipy import stats

Then we need to use aggregate function to find the total bottle sold for each soda in each category. Remember the groupby function? Try to do it by yourself.

ct = inv_soda.groupby(['Item_Description','Category'], as_index=False).agg({"Bottles_Sold":"sum"})

Then, you get the total bottles sold for the two categories.

ct_eng = ct[ct["Category"] == "Energy Drink"]
ct_cola = ct[ct["Category"] == "Cola"]

Finally, you perform the t-test.

stats.ttest_ind(ct_eng['Bottles_Sold'],ct_cola['Bottles_Sold'], equal_var=False)

Then you get

Ttest_indResult(statistic=1.9035249229837687, pvalue=0.057480903679786859)

Well, the p value is not low enough to reject the null.

Takeaways

Scipy has a lot of statistical tools. You can perform anything in the Excel Data Analytics Tool with Scipy, yet it is much more powerful. For example, you can do chi-squares, ANOVA, k-stat, etc. You can click the link to see all tools in Scipy.stats.
As you can observe above, the most difficult part is to preprocess the data. As soon as the data is prepaired, you can just feed it into the model, very easy.
There are many modules that can perform stats tasks in python. For example, sklearn, Statsmodles are also great. You can play with those if you have time.

Visualization

Basics

Many of you are experts in Excel graphing. You can do similar visualization with python. matplotlib is a good starter module for visualization in python. Firstly, let’s import the module.

%matplotlib inline
# In majority of cases, you will just use pyplot. So you can just import that.   
import matplotlib.pyplot as plt

Similar as Excel, you just need to feed in the values in x-axis and y-axis. Let’s start with a very basic example:

# create three points for the first line
x = [1,2,3]
y = [2,3,5]
# create three points for the second line (optional, I just want to show you that you can plot multiple lines in one graph)
x2 = [1.5,2.5,3.5]
y2 = [2.7,3.7,5.7]
# change size (set size before plotting)
plt.figure(figsize=(10,7))
# plot the graph  
plt.plot(x,y, label = "first line")
plt.plot(x2,y2, label = "second line")
# Set x, y limits  
plt.xlim(0, 5)
plt.ylim(0, 7)
# Add x, y label  
plt.xlabel("independent variable")
plt.ylabel("dependent variable")
# Add title
plt.title("example")
# Add legend  
plt.legend()
# show the graph
plt.show()

You will get something like this:
graph

Without a graphic interface like Excel, it is a little less intuitive. Carefully review the code, and try to understand what each line is doing. Note that pyplot.plot (you can click to read the documentation) has a lot of parameters. Basically, you can customize everything such as color, marker, alpha, etc.

Histogram

Let’s go back to our data.
Assume you want to see the price distribution of the soda that cost less than $5, you can draw a histogram with pyplot.hist.

# set size
plt.figure(figsize=(10,7))
# feed in the data, set range, plot the graph
plt.hist(soda['Bottle_Cost'], range=(0,5))
# set x, y label
plt.xlabel("price")
plt.ylabel("number of soda")
# Add title
plt.title("Price distribution")
plt.show()

hist

This is a very basic example. You can customize a lot more parameters such as colors, bins, etc.

Scatter plot

A scatter plot is a two-dimensional data visualization that uses dots to represent the values obtained for two different variables.

# set figure size  
fig, ax = plt.subplots(figsize=(10,7))
# Plot data
ax.scatter(soda['Bottle_Cost'], soda['Bottle_Retail_Price'])
plt.xlabel("Bottle_Cost")
plt.ylabel("Bottle_Retail_Price")
# Set title  
plt.title("Bottle_Retail_Price VS Bottle_Cost")
# Set x, y limit  
plt.xlim(1, 5)
plt.ylim(1, 10)
plt.legend()
plt.show()

Takeaways

You can do many other kinds of visualizations with matplotlib. For example, pie chart, bar chart, etc.
Anything you can draw in Excel, you can do it with matplotlib. Similar as statistical analysis, the most difficult part for visualization is preprocessing. After that, you can just feed your data into the pre-written graphing tool.
If you are going to big companies, you will probablly learn Tableau later. Tableau can easily create nice graphs but it is a little annoying in preprocessing. We can preprocess with python and feed the data to Tableau.
Last but not the least, there are many other cool visualization modules avaliable. If you have time, you can consider these modules: plotly, graph-tool. They can draw cool and interactive graphs.

Challenge

Assume you just opened a new convenience store. You want to see what categories of soda did other stores spend the most money on in their inventory.
What are the top three categories in terms of total cost? Visualize it with pie chart. Read the documentation yourself and try to figure out how to draw it. Hint: 1. Calculate total cost for each soda (arithmetic)

Calculate total cost for each category (aggregation)

Sort the aggregated value

draw the graph (what is the x? what is the label?)

set autopct parameter to ‘%.0f%%’
Solution
inv_soda["Total_Cost"] = inv_soda["Bottle_Cost"] * inv_soda["Bottles_Sold"]  
ct = inv_soda.groupby(['Category'])["Total_Cost"].agg(["sum"]).reset_index()
ct = temp.sort_values("sum", ascending = False)
plt.figure(figsize=(15,15))
plt.pie(ct["sum"], labels=ct["Category"], autopct='%.0f%%')
plt.show()

Key Points

Use scipy.stats.ttest_ind for t-test

There are many other statistical tools in scipy, you can read the documentation for more details

You can customize many parameters in your graph, you can read the documentation for more details

For both Scipy and matplotlib, the most difficult part is to preprocess your data. After that, you can just find the right function and feed your data into it.

previous episode

Python for Business

next episode

Statistical Analysis and Visualization

Overview

Statistical Analysis with Scipy

t-test

Takeaways

Visualization

Basics

Histogram

Scatter plot

Takeaways

Challenge

Solution

Key Points

previous episode

next episode