Assessing Liquor Store sales impact if 3.2 Licences are converted to full strength

This analysis looks at what bill SB16-197 and HB16-1439 would have on Liquor Store Owners. Currently, there are two different types of Liquor Licenses in the state of Colorado. A “Half License” meaning a 3.2% ABV limit and a “Full License” which refers to Full Strength Beer, Wine, and Spirits. This would allow grocery stores, Convince Stores and other such establishments to sell Wine and Spirits. Currently, if a customer wants to obtain these beverages they must go to a Liquor Store.

http://denver.cbslocal.com/2018/02/02/new-bill-would-prevent-employed-minors-from-selling-beer-at-stores/

Section 1 Header

In [1]:
#Standard Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#Configuring Notebook
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from IPython.core.display import HTML

#Machine Learning Imports
import sklearn
from sklearn import linear_model

#Linear Regression Imports
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.sandbox.regression.predstd import wls_prediction_std
In [2]:
df = pd.read_csv('Iowa_Liquor_Sales.csv')

Section 2 Clean Sales Data

In [3]:
# To Dollars Function
def to_dollars(x):
    while True:
        try:
            y = float(x.replace('$',''))
            return y
        
        except ValueError:
            return float(x)
        
        except AttributeError:
            return float(x)
In [4]:
# Lower Case Function
def low(x):
    while True:
        try:
            y = str(x.lower())
            return y
        except AttributeError:
            return 'nan'
In [5]:
#Applying Dollar Conversion Function
df['Sale (Dollars)'] = df['Sale (Dollars)'].apply(lambda x: to_dollars(x))
In [6]:
#Applying Lower Case Function
df['County'] = df['County'].apply(lambda x : low(x))
In [7]:
#Cleaning County Data, there are misspellings and misrepresentations throughout the data set
df['County'].replace('buena vist','buena vista',inplace=True)
df['County'].replace('cerro gord','cerro gordo',inplace=True)
df['County'].replace("o'brien",'obrien',inplace=True)
df['County'].replace('pottawatta','pottawattamie',inplace=True)
In [8]:
df_county = pd.DataFrame(df['County'].unique())

Section 3 Initial Inspection and Reduction of Iowa Sales Data

Section 3.1 Inspection

In [9]:
df.head(3)
Out[9]:
Invoice/Item Number Date Store Number Store Name Address City Zip Code Store Location County Number County ... Item Number Item Description Pack Bottle Volume (ml) State Bottle Cost State Bottle Retail Bottles Sold Sale (Dollars) Volume Sold (Liters) Volume Sold (Gallons)
0 S29198800001 11/20/2015 2191 Keokuk Spirits 1013 MAIN KEOKUK 52632 1013 MAIN\nKEOKUK 52632\n(40.39978, -91.387531) 56.0 lee ... 297 Templeton Rye w/Flask 6 750 $18.09 $27.14 6 162.84 4.5 1.19
1 S29195400002 11/21/2015 2205 Ding's Honk And Holler 900 E WASHINGTON CLARINDA 51632 900 E WASHINGTON\nCLARINDA 51632\n(40.739238, ... 73.0 page ... 297 Templeton Rye w/Flask 6 750 $18.09 $27.14 12 325.68 9.0 2.38
2 S29050300001 11/16/2015 3549 Quicker Liquor Store 1414 48TH ST FORT MADISON 52627 1414 48TH ST\nFORT MADISON 52627\n(40.624226, ... 56.0 lee ... 249 Disaronno Amaretto Cavalli Mignon 3-50ml Pack 20 150 $6.40 $9.60 2 19.20 0.3 0.08

3 rows × 24 columns

In [10]:
df['Sale (Dollars)'].describe()
Out[10]:
count    1.259107e+07
mean     1.165918e+02
std      3.843290e+02
min      0.000000e+00
25%      2.544000e+01
50%      6.714000e+01
75%      1.327200e+02
max      2.541000e+05
Name: Sale (Dollars), dtype: float64
In [11]:
df_plot = df[df['Sale (Dollars)'] < 250]
sns.boxplot(y= 'Sale (Dollars)', x='County', data=df_plot)
plt.xticks(rotation=90)
plt.figure(figsize=(45,30))
Out[11]:
<Figure size 3240x2160 with 0 Axes>
<Figure size 3240x2160 with 0 Axes>
In [12]:
df.corr()
Out[12]:
Store Number County Number Category Vendor Number Item Number Pack Bottle Volume (ml) Bottles Sold Sale (Dollars) Volume Sold (Liters) Volume Sold (Gallons)
Store Number 1.000000 0.005435 -0.013788 -0.003244 -0.023936 0.096067 -0.045227 -0.011310 -0.024924 -0.030714 -0.030737
County Number 0.005435 1.000000 -0.003929 0.000766 0.009563 0.018526 -0.018756 0.011851 0.012542 0.003912 0.003916
Category -0.013788 -0.003929 1.000000 0.090873 0.220785 -0.041507 -0.011167 0.001992 0.009174 -0.003244 -0.003243
Vendor Number -0.003244 0.000766 0.090873 1.000000 0.134647 -0.048423 0.010193 -0.002552 -0.009191 -0.004603 -0.004605
Item Number -0.023936 0.009563 0.220785 0.134647 1.000000 -0.013792 -0.016774 0.001644 0.009800 -0.005205 -0.005200
Pack 0.096067 0.018526 -0.041507 -0.048423 -0.013792 1.000000 -0.417754 0.080785 -0.043844 -0.060130 -0.060063
Bottle Volume (ml) -0.045227 -0.018756 -0.011167 0.010193 -0.016774 -0.417754 1.000000 -0.004470 0.061164 0.107614 0.107539
Bottles Sold -0.011310 0.011851 0.001992 -0.002552 0.001644 0.080785 -0.004470 1.000000 0.849588 0.910107 0.910147
Sale (Dollars) -0.024924 0.012542 0.009174 -0.009191 0.009800 -0.043844 0.061164 0.849588 1.000000 0.836248 0.836242
Volume Sold (Liters) -0.030714 0.003912 -0.003244 -0.004603 -0.005205 -0.060130 0.107614 0.910107 0.836248 1.000000 1.000000
Volume Sold (Gallons) -0.030737 0.003916 -0.003243 -0.004605 -0.005200 -0.060063 0.107539 0.910147 0.836242 1.000000 1.000000
In [13]:
sns.heatmap(df.corr(),cmap='RdBu_r')
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2de62cf090>
In [14]:
sns.distplot(df[df['Sale (Dollars)'] <250]['Sale (Dollars)'])
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2de2f32690>

Section 3.2 Reduction

Looking at this distribution, most of the sales are > 250 dollars

In [15]:
df['Sale (Dollars)'].describe()
Out[15]:
count    1.259107e+07
mean     1.165918e+02
std      3.843290e+02
min      0.000000e+00
25%      2.544000e+01
50%      6.714000e+01
75%      1.327200e+02
max      2.541000e+05
Name: Sale (Dollars), dtype: float64
In [16]:
df['Sale (Dollars)'].mean()+5*df['Sale (Dollars)'].std()
Out[16]:
2038.2366844032017
In [17]:
df[df['Sale (Dollars)'] < 250]['Sale (Dollars)'].describe()
Out[17]:
count    1.160744e+07
mean     7.496687e+01
std      5.971670e+01
min      0.000000e+00
25%      2.310000e+01
50%      6.296000e+01
75%      1.132800e+02
max      2.498400e+02
Name: Sale (Dollars), dtype: float64
In [18]:
# To assess the behavior of customers only sales less than 2500 will be assesed
df_250 = df[df['Sale (Dollars)'] < 250]
In [19]:
# Building Resulting Database
df_result = pd.DataFrame(df['County'].unique())
df_result.columns = ['County']

Section 4 Aggergate Data

Section 4.1 - Number of Sales

In [20]:
# Count the number of sales per county
def counter(x):
    y = df[df['County'] == x].count()
    return y
In [21]:
# Applying counter function
df_result['Count'] = df_result['County']
df_result['Count'] = df_result['Count'].apply(lambda x : counter(x))

Section 4.2 - Volume of Sales by County

In [22]:
# Total amount of sales per county
def sales(x):
    y = df[df['County'] == x]['Sale (Dollars)'].sum()
    return y
In [23]:
# Applying counter function
df_result['Sales'] = df_result['County']
df_result['Sales'] = df_result['Sales'].apply(lambda x : sales(x))

Section 4.3 - Avg. Sales

In [24]:
# Average sales per county
def avg(x):
    y = df[df['County'] == x]['Sale (Dollars)'].mean()
    return y
In [25]:
# Applying avg function
df_result['Avg'] = df_result['County']
df_result['Avg'] = df_result['Avg'].apply(lambda x : avg(x))

Section 4.4 - Number of Stores

In [190]:
# Average sales per county
def stores(x):
    y = df[df['County'] == x]['Store Number'].nunique()
    return y
In [191]:
# Applying stores function
df_result['Num_Stores'] = df_result['County']
df_result['Num_Stores'] = df_result['Num_Stores'].apply(lambda x : stores(x))

Section 4.5 - Distance to Next Store

In [28]:
# Future Project

Section 5 Importing Demographic Data

In [29]:
#Injesting aggrigated data
df_agg = pd.read_csv('augmented_data.csv')

Section 6 Joining Data

In [30]:
# Joining Data
df_result = df_result.sort_values('County')
df_result = df_result.reset_index(drop=True)
In [31]:
#'Population Estimates 2016'
df_result['Population'] = df_agg['Population Estimates 2016']
In [32]:
#'Female persons, percent'
df_result['% Female'] = df_agg['Female persons, percent']
In [33]:
#'White alone, percent(a)'
df_result['% White'] = df_agg['White alone, percent(a)']
In [34]:
#'High school graduate or higher, percent of persons age 25 years+, 2012-2016'
df_result['% HS Grad'] = df_agg['High school graduate or higher, percent of persons age 25 years+, 2012-2016']
In [35]:
#"Bachelor's degree or higher, percent of persons age 25 years+, 2012-2016"
df_result['% BS Degree'] = df_agg["Bachelor's degree or higher, percent of persons age 25 years+, 2012-2016"]
In [36]:
#'Median household income (in 2016 dollars), 2012-2016'
df_result['Household Income'] = df_agg['Median household income (in 2016 dollars), 2012-2016']
In [37]:
#'Per capita income in past 12 months (in 2016 dollars), 2012-2016'
df_result['Per Capita Income'] = df_agg['Per capita income in past 12 months (in 2016 dollars), 2012-2016']
In [38]:
#'Owner-occupied housing unit rate, 2012-2016'
df_result['Owner Occupied'] = df_agg['Owner-occupied housing unit rate, 2012-2016']
In [39]:
#'Land area in square miles, 2010'
df_result['Area'] = df_agg['Land area in square miles, 2010']
In [40]:
# Dropping all nan values
df_result = df_result.dropna(how='any')

Section 7 - Analysis of Resulting Data Frame

In [193]:
df_result.describe()
Out[193]:
Count Sales Avg Num_Stores Population % Female % White % HS Grad % BS Degree Household Income Per Capita Income Owner Occupied Area
count 9.500000e+01 9.500000e+01 95.000000 95.000000 95.000000 95.000000 95.000000 95.000000 95.000000 95.000000 95.000000 95.000000 95.000000
mean 1.294672e+05 1.517070e+07 95.611388 19.421053 32169.157895 0.500537 0.952589 0.911032 0.209211 51705.821053 27433.947368 0.745242 564.304000
std 2.793764e+05 3.828533e+07 27.936071 32.826941 58624.154287 0.009175 0.036248 0.031271 0.070654 6580.910611 2797.747465 0.050336 118.400133
min 1.243000e+03 1.399594e+05 40.038321 1.000000 3693.000000 0.451000 0.833000 0.782000 0.117000 38467.000000 20808.000000 0.529000 380.610000
25% 3.026700e+04 2.494410e+06 76.647548 7.000000 9861.000000 0.497500 0.946500 0.899500 0.166500 47371.000000 25926.500000 0.719500 479.020000
50% 5.301000e+04 4.507082e+06 92.826056 10.000000 15391.000000 0.502000 0.967000 0.917000 0.196000 49957.000000 27037.000000 0.749000 569.570000
75% 9.653300e+04 9.464419e+06 114.272948 17.000000 24999.000000 0.507000 0.976000 0.929500 0.219500 55676.000000 28708.500000 0.783000 587.055000
max 2.279555e+06 3.200346e+08 225.065501 259.000000 474045.000000 0.513000 0.983000 0.963000 0.524000 78918.000000 40468.000000 0.842000 972.720000
In [42]:
df_result.corr()
Out[42]:
Count Sales Avg Num_Stores Population % Female % White % HS Grad % BS Degree Household Income Per Capita Income Owner Occupied Area
Count 1.000000 0.996502 0.361562 0.996502 0.988957 0.156438 -0.589499 0.107972 0.500105 0.239792 0.268400 -0.403086 0.138251
Sales 0.996502 1.000000 0.387256 1.000000 0.990404 0.156693 -0.575419 0.107588 0.500578 0.251289 0.286602 -0.385607 0.129197
Avg 0.361562 0.387256 1.000000 0.387256 0.426179 0.045098 -0.384503 0.151327 0.536378 0.383492 0.426398 -0.267533 0.178714
Num_Stores 0.996502 1.000000 0.387256 1.000000 0.990404 0.156693 -0.575419 0.107588 0.500578 0.251289 0.286602 -0.385607 0.129197
Population 0.988957 0.990404 0.426179 0.990404 1.000000 0.149016 -0.608472 0.121018 0.558905 0.313319 0.311795 -0.401267 0.171215
% Female 0.156438 0.156693 0.045098 0.156693 0.149016 1.000000 0.224349 0.152401 -0.053542 0.142761 0.210939 0.140592 0.008652
% White -0.589499 -0.575419 -0.384503 -0.575419 -0.608472 0.224349 1.000000 0.144318 -0.565351 -0.043388 -0.004913 0.660540 -0.104697
% HS Grad 0.107972 0.107588 0.151327 0.107588 0.121018 0.152401 0.144318 1.000000 0.486877 0.381085 0.517877 0.024008 -0.004012
% BS Degree 0.500105 0.500578 0.536378 0.500578 0.558905 -0.053542 -0.565351 0.486877 1.000000 0.521689 0.532184 -0.512612 0.089030
Household Income 0.239792 0.251289 0.383492 0.251289 0.313319 0.142761 -0.043388 0.381085 0.521689 1.000000 0.787575 0.236848 0.199675
Per Capita Income 0.268400 0.286602 0.426398 0.286602 0.311795 0.210939 -0.004913 0.517877 0.532184 0.787575 1.000000 0.146323 0.093891
Owner Occupied -0.403086 -0.385607 -0.267533 -0.385607 -0.401267 0.140592 0.660540 0.024008 -0.512612 0.236848 0.146323 1.000000 -0.045865
Area 0.138251 0.129197 0.178714 0.129197 0.171215 0.008652 -0.104697 -0.004012 0.089030 0.199675 0.093891 -0.045865 1.000000
In [194]:
sns.heatmap(df_result.corr(),cmap='RdBu_r')
Out[194]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2d2cce6e90>
In [195]:
df_result[['Count','Sales','Num_Stores','Population']].corr()
Out[195]:
Count Sales Num_Stores Population
Count 1.000000 0.996502 0.989062 0.988957
Sales 0.996502 1.000000 0.982609 0.990404
Num_Stores 0.989062 0.982609 1.000000 0.981482
Population 0.988957 0.990404 0.981482 1.000000
In [196]:
sns.heatmap(df_result[['Count','Sales','Num_Stores','Population']].corr(),cmap='Blues',annot=True,linewidths=.5)
Out[196]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2d2cdf8910>
In [46]:
sns.pairplot(df_result)
Out[46]:
<seaborn.axisgrid.PairGrid at 0x7f2de2184290>
In [47]:
sns.lmplot(x='Population',y='Count', data=df_result)
plt.title('# of Transactions vs. Population')
Out[47]:
Text(0.5,1,'# of Transactions vs. Population')