Getting started

Once you've chosen your scenario, download the data from the Iowa website in csv format. Start by loading the data with pandas. You may need to parse the date columns appropriately.

Scenario 1: State tax board

You are a data scientist in residence at the Iowa State tax board. The Iowa State legislature is considering changes in the liquor tax rates and wants a report of current liquor sales by county and projections for the rest of the year to inform their decision.

Goal for Scenario #1: Your task is as follows:

  • Calculate the yearly liquor sales for each store using the provided data. You can add up the transactions for each year, and store sales in 2015 specifically will be used later as your target variable.
  • Use the data from 2015 to make a linear model using as many variables as you find useful to predict the yearly sales of all stores. You must use the sales from January to March as one of your variables.
  • Use your model for 2015 to estimate total sales in 2016, extrapolating from the sales so far for January to March of 2016.
  • Report your findings, including any projected increase or decrease in total sales (over the entire state) for the tax committee of the Iowa legislature.
  • Use cross-validation to check how your model predicts to held out data compared to the model metrics on the full dataset.
  • Fit your model(s) using one or both of the regularization tactics covered. Explain whether the regularized or the non-regularized model performed better and what the selected regression(s) are doing.

Class E liquor licenses: For grocery, liquor and convenience stores, etc. Allows for the sale of alcoholic liquor for off-premises consumption in original unopened containers. No sales by the drink. Sunday sales are included. Also allows wholesale sales to on-premises Class A, B, C and D liquor licensees but must have a TTB Federal Wholesale Basic Permit.

<font color = "red">Problem Statement: We are interested in projecting the yearly sales for all stores in the state of Iowa for the purposes of evaluating liquor tax rates.

Iowa is a particular case in that it is the 6th highest tax rate on liquor in the United States. In Iowa, liquor distribution is particularly regulated, with some prohibition era laws still in place. While beer and wine regulations have been relaxed, liquor remains a sticking point for these prohibition style laws.

</font> Sources:
http://www.desmoinesregister.com/story/news/politics/2015/06/27/booze-bureaucracy/29371105/
http://qctimes.com/news/local/liquor-sales-up-in-iowa/article_aa9155fc-9de3-11df-bdf8-001cc4c03286.html
http://www.thegazette.com/subject/news/government/iowa-posts-record-year-for-liquor-wine-sales-20151201

In [1]:
import pandas as pd
from datetime import datetime
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## Load the data into a DataFrame
iowa = pd.read_csv("iowa_liquor_sales_proj_2.csv", parse_dates="Date")
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/IPython/core/interactiveshell.py:2717: DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
In [2]:
len(iowa)
Out[2]:
2709552
In [3]:
iowa.head()
Out[3]:
Invoice/Item Number Date Store Number Store Name Address City Zip Code Store Location County Number County ... Item Number Item Description Pack Bottle Volume (ml) State Bottle Cost State Bottle Retail Bottles Sold Sale (Dollars) Volume Sold (Liters) Volume Sold (Gallons)
0 S29198800001 11/20/2015 2191 Keokuk Spirits 1013 MAIN KEOKUK 52632 1013 MAIN\nKEOKUK 52632\n(40.39978, -91.387531) 56 Lee ... 297 Templeton Rye w/Flask 6 750 $18.09 $27.14 6 $162.84 4.50 1.19
1 S29195400002 11/21/2015 2205 Ding's Honk And Holler 900 E WASHINGTON CLARINDA 51632 900 E WASHINGTON\nCLARINDA 51632\n(40.739238, ... 73 Page ... 297 Templeton Rye w/Flask 6 750 $18.09 $27.14 12 $325.68 9.00 2.38
2 S29050300001 11/16/2015 3549 Quicker Liquor Store 1414 48TH ST FORT MADISON 52627 1414 48TH ST\nFORT MADISON 52627\n(40.624226, ... 56 Lee ... 249 Disaronno Amaretto Cavalli Mignon 3-50ml Pack 20 150 $6.40 $9.60 2 $19.20 0.30 0.08
3 S28867700001 11/04/2015 2513 Hy-Vee Food Store #2 / Iowa City 812 S 1ST AVE IOWA CITY 52240 812 S 1ST AVE\nIOWA CITY 52240\n 52 Johnson ... 237 Knob Creek w/ Crystal Decanter 3 1750 $35.55 $53.34 3 $160.02 5.25 1.39
4 S29050800001 11/17/2015 3942 Twin Town Liquor 104 HIGHWAY 30 WEST TOLEDO 52342 104 HIGHWAY 30 WEST\nTOLEDO 52342\n(41.985887,... 86 Tama ... 249 Disaronno Amaretto Cavalli Mignon 3-50ml Pack 20 150 $6.40 $9.60 2 $19.20 0.30 0.08

5 rows × 24 columns

In [4]:
iowa.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2709552 entries, 0 to 2709551
Data columns (total 24 columns):
Invoice/Item Number      object
Date                     object
Store Number             int64
Store Name               object
Address                  object
City                     object
Zip Code                 object
Store Location           object
County Number            float64
County                   object
Category                 float64
Category Name            object
Vendor Number            int64
Vendor Name              object
Item Number              int64
Item Description         object
Pack                     int64
Bottle Volume (ml)       int64
State Bottle Cost        object
State Bottle Retail      object
Bottles Sold             int64
Sale (Dollars)           object
Volume Sold (Liters)     float64
Volume Sold (Gallons)    float64
dtypes: float64(4), int64(6), object(14)
memory usage: 516.8+ MB
In [5]:
columns_keep = ['Date', 'Store Number', 'City', 'Zip Code', 'County Number', 'County', 'Bottle Volume (ml)', 
                'State Bottle Cost', 'State Bottle Retail', 'Bottles Sold', 'Sale (Dollars)', 'Volume Sold (Liters)', 
                'Volume Sold (Gallons)']
iowa = iowa[columns_keep]
In [6]:
iowa["Date"] = pd.to_datetime(iowa["Date"], format="%m/%d/%Y") #capital y because year is 4 digits
In [7]:
#Remove dollar signs and convert to float
iowa["State Bottle Cost"] = iowa["State Bottle Cost"].str.replace('$', '').astype('float')
iowa["State Bottle Retail"] = iowa["State Bottle Retail"].str.replace('$', '').astype('float')
iowa["Sale (Dollars)"] = iowa["Sale (Dollars)"].str.replace('$', '').astype('float')
In [8]:
iowa.columns
Out[8]:
Index(['Date', 'Store Number', 'City', 'Zip Code', 'County Number', 'County',
       'Bottle Volume (ml)', 'State Bottle Cost', 'State Bottle Retail',
       'Bottles Sold', 'Sale (Dollars)', 'Volume Sold (Liters)',
       'Volume Sold (Gallons)'],
      dtype='object')
In [9]:
#We're going to need to check zipcode/city/county for reasonability
#source: https://www.zip-codes.com/state/ia.asp
zips = pd.read_csv('iowazips.csv')
In [10]:
zips["City"] = zips['City'].str.upper() # make all cities capitalized so it matches our iowa data
In [11]:
zips.head()
Out[11]:
Zip Code City County
0 50001 ACKWORTH Warren
1 50002 ADAIR Adair
2 50003 ADEL Dallas
3 50005 ALBION Marshall
4 50006 ALDEN Hardin
In [12]:
#convert all zips to strings, as they are categorical
zips["Zip Code"] = zips["Zip Code"].astype('str')
In [13]:
zips.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1060 entries, 0 to 1059
Data columns (total 3 columns):
Zip Code    1060 non-null object
City        1060 non-null object
County      1060 non-null object
dtypes: object(3)
memory usage: 33.1+ KB
In [14]:
#We merge our zipcode data with iowa in df a.
a = pd.merge(iowa,zips, on = "Zip Code",how="left")
In [15]:
a["do cities match"] = a["City_x"] == a["City_y"]
In [16]:
a["do cities match"].value_counts()
Out[16]:
True     2514567
False     194985
Name: do cities match, dtype: int64
In [17]:
# conducted a few spot checks, source y (iowazips.csv) is correct when it does not have a null value. 
# Will merge data accordingly and replace iowa cities with source from df (a) when available.
# This will help with visualizing data and analyzing outliers, which I suspect may be big college towns.
# We will test this suspicion later
a[["City_x", "City_y","Zip Code","do cities match"]].sort_values("do cities match", ascending = True)
Out[17]:
City_x City_y Zip Code do cities match
2709551 DES MOINES NaN 50317 False
2206013 LEMARS LE MARS 51031 False
852584 DUNLAP NaN 712-2 False
374336 DUNLAP NaN 712-2 False
1278321 PLEASANT HILL DES MOINES 50317 False
1483725 OTTUWMA OTTUMWA 52501 False
1765977 WATERLOO EVANSDALE 50707 False
374332 DUMONT ALLISON 50602 False
374331 PLEASANT HILL DES MOINES 50327 False
1891994 WINDSOR HEIGHTS DES MOINES 50311 False
1765982 Des Moines DES MOINES 50313 False
1278308 ARNOLD'S PARK ARNOLDS PARK 51331 False
374324 LEMARS LE MARS 51031 False
1278326 DEWITT DE WITT 52742 False
525110 ARNOLD'S PARK ARNOLDS PARK 51331 False
171446 PLEASANT HILL DES MOINES 50317 False
1122946 ST ANSGAR SAINT ANSGAR 50472 False
525114 MARION JEFFERSON 50129 False
374310 LEMARS LE MARS 51031 False
525121 LEMARS LE MARS 51031 False
2205990 MANCHESTER NaN 52087 False
1122960 CEDAR RAPIDS NaN 52303 False
1632654 WINDSOR HEIGHTS DES MOINES 50311 False
1122965 MANCHESTER NaN 52087 False
1483765 Des Moines DES MOINES 50313 False
1278278 NEWTON WATERLOO 50702 False
374296 NEWTON WATERLOO 50702 False
1122944 WINDSOR HEIGHTS URBANDALE 50322 False
525133 WINDSOR HEIGHTS DES MOINES 50311 False
1765974 CEDAR RAPIDS NaN 52303 False
... ... ... ... ...
922001 SPIRIT LAKE SPIRIT LAKE 51360 True
922002 WAVERLY WAVERLY 50677 True
922003 SHELDON SHELDON 51201 True
922004 CEDAR RAPIDS CEDAR RAPIDS 52405 True
922005 DENISON DENISON 51442 True
922006 WATERLOO WATERLOO 50701 True
922007 ORANGE CITY ORANGE CITY 51041 True
922000 CEDAR RAPIDS CEDAR RAPIDS 52403 True
921990 JOHNSTON JOHNSTON 50131 True
921989 CEDAR RAPIDS CEDAR RAPIDS 52411 True
921988 BONDURANT BONDURANT 50035 True
921971 CEDAR RAPIDS CEDAR RAPIDS 52402 True
921972 STORM LAKE STORM LAKE 50588 True
921973 CEDAR RAPIDS CEDAR RAPIDS 52404 True
921974 LE CLAIRE LE CLAIRE 52753 True
921975 CHARITON CHARITON 50049 True
921976 BOONE BOONE 50036 True
921977 SIOUX CITY SIOUX CITY 51103 True
921978 COUNCIL BLUFFS COUNCIL BLUFFS 51501 True
921979 ATLANTIC ATLANTIC 50022 True
921980 DECORAH DECORAH 52101 True
921981 MISSOURI VALLEY MISSOURI VALLEY 51555 True
921982 FLOYD FLOYD 50435 True
921983 MUSCATINE MUSCATINE 52761 True
921984 AMES AMES 50014 True
921985 WESLEY WESLEY 50483 True
921986 DES MOINES DES MOINES 50314 True
921987 KEOKUK KEOKUK 52632 True
922009 SAC CITY SAC CITY 50583 True
1354775 DES MOINES DES MOINES 50317 True

2709552 rows × 4 columns

In [18]:
a[a["do cities match"] == False]["City_x"].value_counts()
Out[18]:
WINDSOR HEIGHTS     27679
DES MOINES          12322
PLEASANT HILL        9705
LEMARS               9344
OTTUWMA              9253
CEDAR RAPIDS         8900
DAVENPORT            7594
CEDAR FALLS          7136
WATERLOO             4810
MANCHESTER           4184
SIOUX CITY           3486
LECLAIRE             3242
IOWA CITY            3127
AMES                 2980
COUNCIL BLUFFS       2616
ARNOLD'S PARK        2593
MARION               2489
NEWTON               2259
DEWITT               2015
WEST DES MOINES      2007
DUBUQUE              2003
DUNLAP               2000
CORNING              1998
ANKENY               1900
MT PLEASANT          1839
BURLINGTON           1830
ST ANSGAR            1825
MASON CITY           1622
CLEAR LAKE           1610
URBANDALE            1608
                    ...  
BELLEVUE               12
HAMBURG                12
COLO                   12
MALVERN                12
EXIRA                  11
STRAWBERRY POINT       11
LOGAN                  11
PALO                   11
MONTEZUMA              10
WINTHROP               10
MARTENSDALE            10
Carroll                 9
MECHANICSVILLE          9
DELMAR                  9
DOWS                    9
MAXWELL                 8
LAWLER                  8
EDDYVILLE               8
CORWITH                 7
MERRILL                 7
SHEFFIELD               7
VICTOR                  7
AFTON                   6
HUDSON                  6
ANTHON                  5
HARPERS FERRY           5
KEOKUK                  3
WELLMAN                 3
CARTER LAKE             3
Davenport               2
Name: City_x, dtype: int64
In [19]:
#trust that zip code is accurate, back out city and county names.
for i, row in zips.iterrows():
    iowa.loc[iowa["Zip Code"] == row["Zip Code"], "City"] = row["City"]
    iowa.loc[iowa["Zip Code"] == row["Zip Code"], "County"] = row["County"]
In [20]:
#for any zip codes not in the zips data, match on city and back out correct zip code.
for i, row in zips.iterrows():
    iowa.loc[iowa["City"] == row["City"], "Zip Code"] = row["Zip Code"]
    iowa.loc[iowa["City"] == row["City"], "County"] = row["County"]
In [21]:
# We have a few null values in "County Number", "Category" and "Category Name"
# I'm going to choose to prioritize the accuracy of county number. Will come back and ammend Category and Category name
# should they impact our model. 
In [22]:
iowa["countynull"] = iowa["County Number"].isnull()
In [23]:
iowa[iowa["countynull"]==True]
Out[23]:
Date Store Number City Zip Code County Number County Bottle Volume (ml) State Bottle Cost State Bottle Retail Bottles Sold Sale (Dollars) Volume Sold (Liters) Volume Sold (Gallons) countynull
469 2015-11-03 4247 BELMOND 50421 NaN Wright 1750 15.75 23.63 6 141.78 10.50 2.77 True
1036 2015-10-13 4247 BELMOND 50421 NaN Wright 750 11.49 17.24 1 17.24 0.75 0.20 True
1851 2015-05-12 4247 BELMOND 50421 NaN Wright 750 13.73 20.60 2 41.20 1.50 0.40 True
4023 2015-07-28 4247 BELMOND 50421 NaN Wright 750 6.20 9.30 1 9.30 0.75 0.20 True
5758 2015-11-03 4247 BELMOND 50421 NaN Wright 1750 12.59 18.89 6 113.34 10.50 2.77 True
7332 2015-07-28 4247 BELMOND 50421 NaN Wright 750 10.49 15.74 1 15.74 0.75 0.20 True
8049 2015-06-02 4247 BELMOND 50421 NaN Wright 1000 9.25 13.88 2 27.76 2.00 0.53 True
10232 2015-04-21 4247 BELMOND 50421 NaN Wright 1750 7.20 10.80 6 64.80 10.50 2.77 True
12253 2015-11-03 4247 BELMOND 50421 NaN Wright 750 4.26 6.39 4 25.56 3.00 0.79 True
14001 2015-03-03 4247 BELMOND 50421 NaN Wright 750 10.00 15.00 12 180.00 9.00 2.38 True
15973 2015-05-12 4247 BELMOND 50421 NaN Wright 1000 11.00 16.50 1 16.50 1.00 0.26 True
16139 2015-08-04 4247 BELMOND 50421 NaN Wright 1750 8.74 13.36 6 80.16 10.50 2.77 True
17853 2015-06-29 4247 BELMOND 50421 NaN Wright 1750 10.45 15.68 72 1128.96 126.00 33.29 True
21775 2015-05-12 4247 BELMOND 50421 NaN Wright 1750 12.59 18.89 6 113.34 10.50 2.77 True
21879 2015-08-25 4247 BELMOND 50421 NaN Wright 750 10.49 15.74 3 47.22 2.25 0.59 True
25008 2015-01-20 4247 BELMOND 50421 NaN Wright 750 8.98 13.47 12 161.64 9.00 2.38 True
25841 2015-01-06 4247 BELMOND 50421 NaN Wright 750 9.85 14.78 12 177.36 9.00 2.38 True
27265 2015-03-24 4247 BELMOND 50421 NaN Wright 750 8.25 12.38 12 148.56 9.00 2.38 True
30017 2015-06-02 4247 BELMOND 50421 NaN Wright 1750 24.99 37.49 6 224.94 10.50 2.77 True
31914 2015-02-17 4247 BELMOND 50421 NaN Wright 750 8.25 12.38 3 37.14 2.25 0.59 True
33084 2015-10-27 4247 BELMOND 50421 NaN Wright 750 6.90 10.35 3 31.05 2.25 0.59 True
33170 2015-01-13 4247 BELMOND 50421 NaN Wright 1750 14.66 21.99 1 21.99 1.75 0.46 True
35742 2015-05-12 4247 BELMOND 50421 NaN Wright 1750 29.67 44.51 6 267.06 10.50 2.77 True
37000 2015-05-26 4247 BELMOND 50421 NaN Wright 750 6.83 10.25 2 20.50 1.50 0.40 True
39680 2015-05-19 4247 BELMOND 50421 NaN Wright 1750 7.20 10.80 6 64.80 10.50 2.77 True
40449 2015-06-29 4247 BELMOND 50421 NaN Wright 750 6.83 10.25 1 10.25 0.75 0.20 True
41807 2015-01-27 4247 BELMOND 50421 NaN Wright 750 8.25 12.38 1 12.38 0.75 0.20 True
45215 2015-04-07 4247 BELMOND 50421 NaN Wright 1750 8.20 12.30 6 73.80 10.50 2.77 True
45656 2015-06-23 4247 BELMOND 50421 NaN Wright 1750 7.20 10.80 6 64.80 10.50 2.77 True
45990 2015-01-20 4247 BELMOND 50421 NaN Wright 1750 22.49 33.74 2 67.48 3.50 0.92 True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2706431 2016-03-31 5247 ROCKWELL 50469 NaN Cerro Gordo 750 8.67 13.01 2 26.02 1.50 0.40 True
2706432 2016-03-31 5247 ROCKWELL 50469 NaN Cerro Gordo 750 5.78 8.67 2 17.34 1.50 0.40 True
2706433 2016-03-31 5247 ROCKWELL 50469 NaN Cerro Gordo 750 12.50 18.75 1 18.75 0.75 0.20 True
2706434 2016-03-31 5247 ROCKWELL 50469 NaN Cerro Gordo 1750 18.00 27.00 2 54.00 3.50 0.92 True
2706435 2016-03-31 5247 ROCKWELL 50469 NaN Cerro Gordo 1000 12.59 18.89 2 37.78 2.00 0.53 True
2706436 2016-03-31 5247 ROCKWELL 50469 NaN Cerro Gordo 1000 18.49 27.74 1 27.74 1.00 0.26 True
2706437 2016-03-31 5247 ROCKWELL 50469 NaN Cerro Gordo 1000 7.87 11.81 2 23.62 2.00 0.53 True
2706438 2016-03-31 5247 ROCKWELL 50469 NaN Cerro Gordo 750 5.38 8.07 1 8.07 0.75 0.20 True
2706439 2016-03-31 5247 ROCKWELL 50469 NaN Cerro Gordo 750 6.75 10.13 3 30.39 2.25 0.59 True
2706440 2016-03-31 5247 ROCKWELL 50469 NaN Cerro Gordo 1750 6.92 10.38 2 20.76 3.50 0.92 True
2706441 2016-03-31 5247 ROCKWELL 50469 NaN Cerro Gordo 1000 3.97 5.96 2 11.92 2.00 0.53 True
2706442 2016-03-31 5247 ROCKWELL 50469 NaN Cerro Gordo 375 3.55 5.33 2 10.66 0.75 0.20 True
2706443 2016-03-31 5247 ROCKWELL 50469 NaN Cerro Gordo 1000 22.04 33.06 1 33.06 1.00 0.26 True
2706444 2016-03-31 5247 ROCKWELL 50469 NaN Cerro Gordo 1000 4.42 6.63 2 13.26 2.00 0.53 True
2706445 2016-03-31 5247 ROCKWELL 50469 NaN Cerro Gordo 750 8.32 12.48 1 12.48 0.75 0.20 True
2706446 2016-03-31 5247 ROCKWELL 50469 NaN Cerro Gordo 1000 12.97 19.46 2 38.92 2.00 0.53 True
2706447 2016-03-31 5247 ROCKWELL 50469 NaN Cerro Gordo 750 6.83 10.25 2 20.50 1.50 0.40 True
2706881 2016-03-31 5229 NORWALK 50211 NaN Warren 300 7.35 11.03 2 22.06 0.60 0.16 True
2706882 2016-03-31 5229 NORWALK 50211 NaN Warren 375 6.25 9.38 3 28.14 1.12 0.30 True
2706883 2016-03-31 5229 NORWALK 50211 NaN Warren 750 11.49 17.24 3 51.72 2.25 0.59 True
2706884 2016-03-31 5229 NORWALK 50211 NaN Warren 750 11.49 17.24 3 51.72 2.25 0.59 True
2706885 2016-03-31 5229 NORWALK 50211 NaN Warren 750 3.31 4.97 3 14.91 2.25 0.59 True
2706886 2016-03-31 5229 NORWALK 50211 NaN Warren 200 3.51 5.27 3 15.81 0.60 0.16 True
2706887 2016-03-31 5229 NORWALK 50211 NaN Warren 375 7.42 11.13 3 33.39 1.12 0.30 True
2706888 2016-03-31 5229 NORWALK 50211 NaN Warren 750 10.00 15.00 3 45.00 2.25 0.59 True
2706889 2016-03-31 5229 NORWALK 50211 NaN Warren 750 8.25 12.38 3 37.14 2.25 0.59 True
2706890 2016-03-31 5229 NORWALK 50211 NaN Warren 750 8.25 12.38 3 37.14 2.25 0.59 True
2706891 2016-03-31 5229 NORWALK 50211 NaN Warren 500 11.50 17.25 10 172.50 5.00 1.32 True
2709500 2016-03-31 5222 CEDAR RAPIDS 52499 NaN Linn 200 5.70 8.55 120 1026.00 24.00 6.34 True
2709501 2016-03-31 5222 CEDAR RAPIDS 52499 NaN Linn 375 10.66 15.99 60 959.40 22.50 5.94 True

10913 rows × 14 columns

In [24]:
len(iowa[iowa["countynull"]==True])
Out[24]:
10913
In [25]:
a = pd.DataFrame(iowa.groupby("County")["County Number"].value_counts())
In [26]:
a.columns= ["Counts"]
In [27]:
a.reset_index(inplace = True)
In [28]:
a.head(15)
Out[28]:
County County Number Counts
0 Adair 1 2795
1 Adams 2 2240
2 Adams 22 51
3 Allamakee 3 10716
4 Appanoose 4 10475
5 Audubon 5 2453
6 Benton 6 9609
7 Benton 57 194
8 Black Hawk 7 146278
9 Black Hawk 94 2246
10 Boone 8 17035
11 Bremer 9 22104
12 Bremer 7 4668
13 Buchanan 10 12560
14 Buena Vista 11 27516
In [29]:
# We have all counties filled in, sometimes mapped to more than one county number, and sometimes mapped to 
# a null county number. So we will make a dictionary w keys = counties and values = county numbers and fill in blank 
# county numbers in our iowa df from it. In cases where the county is mapped to more than one county number, this for 
# loop favors the county number with the most counts in it because we assume the majority of transactions are 
# categorized correctly. County number becomes relevant in a later data visualization.
a_dict = {}
for county in list(iowa["County"].unique()):
    x = a[a["County"]==county]
    if len(x)>1:
        maxi = max(list(x["Counts"]))
        y = list(x.loc[x["Counts"] == maxi, "County Number"])[0]
        a_dict[county] = y
    else:
        a_dict[county] = list(x["County Number"])[0]
In [30]:
countynum = pd.Series(a_dict, index = a_dict.keys())
In [31]:
countynum = pd.DataFrame(countynum, columns=["County Number"])
In [32]:
countynum["County Number"] = countynum["County Number"].astype(int)
In [33]:
countynum = countynum.reset_index()
In [34]:
countynum.columns=["County", "County Number"]
In [35]:
for i, row in countynum.iterrows():
    iowa.loc[iowa["County"] == row["County"], "County Number"] = row["County Number"]
In [36]:
iowa[iowa["County"]=="Benton"]
Out[36]:
Date Store Number City Zip Code County Number County Bottle Volume (ml) State Bottle Cost State Bottle Retail Bottles Sold Sale (Dollars) Volume Sold (Liters) Volume Sold (Gallons) countynull
186 2015-06-23 4228 VINTON 52349 6 Benton 750 13.86 20.79 12 249.48 9.00 2.38 False
612 2015-03-17 4025 BELLE PLAINE 52208 6 Benton 750 14.49 21.74 2 43.48 1.50 0.40 False
1631 2015-09-22 4228 VINTON 52349 6 Benton 1750 10.45 15.68 120 1881.60 210.00 55.48 False
1698 2015-10-13 4228 VINTON 52349 6 Benton 750 3.40 5.10 12 61.20 9.00 2.38 False
1777 2015-11-17 4025 BELLE PLAINE 52208 6 Benton 750 11.49 17.24 2 34.48 1.50 0.40 False
1794 2015-03-24 4228 VINTON 52349 6 Benton 1750 6.92 10.38 6 62.28 10.50 2.77 False
1845 2015-04-21 4551 URBANA 52345 6 Benton 375 3.50 5.25 6 31.50 2.25 0.59 False
1919 2015-03-03 4346 SHELLSBURG 52332 6 Benton 750 6.30 9.45 1 9.45 0.75 0.20 False
2195 2015-06-09 4070 BELLE PLAINE 52208 6 Benton 750 13.86 20.79 12 249.48 9.00 2.38 False
2321 2015-05-26 4228 VINTON 52349 6 Benton 1750 11.25 16.88 6 101.28 10.50 2.77 False
2520 2015-09-15 4070 BELLE PLAINE 52208 6 Benton 1000 18.49 27.74 4 110.96 4.00 1.06 False
3318 2015-09-01 4346 SHELLSBURG 52332 6 Benton 1750 8.50 12.75 1 12.75 1.75 0.46 False
3333 2015-02-03 4025 BELLE PLAINE 52208 6 Benton 1750 7.47 11.21 12 134.52 21.00 5.55 False
3589 2015-09-08 4228 VINTON 52349 6 Benton 1750 9.00 13.50 6 81.00 10.50 2.77 False
3820 2015-11-10 4228 VINTON 52349 6 Benton 1750 11.99 17.99 6 107.94 10.50 2.77 False
4099 2015-08-18 4070 BELLE PLAINE 52208 6 Benton 750 5.23 7.85 12 94.20 9.00 2.38 False
4119 2015-06-09 4025 BELLE PLAINE 52208 6 Benton 750 6.30 9.45 3 28.35 2.25 0.59 False
4168 2015-09-08 4346 SHELLSBURG 52332 6 Benton 1000 11.00 16.50 5 82.50 5.00 1.32 False
5807 2015-09-29 4228 VINTON 52349 6 Benton 1750 10.99 16.49 6 98.94 10.50 2.77 False
6040 2015-09-22 4228 VINTON 52349 6 Benton 1750 8.20 12.30 6 73.80 10.50 2.77 False
6369 2015-07-07 4346 SHELLSBURG 52332 6 Benton 1750 6.00 9.25 2 18.50 3.50 0.92 False
6869 2015-10-20 4070 BELLE PLAINE 52208 6 Benton 750 4.34 6.51 6 39.06 4.50 1.19 False
7329 2015-05-05 4346 SHELLSBURG 52332 6 Benton 1000 12.50 18.75 2 37.50 2.00 0.53 False
7330 2015-01-06 4070 BELLE PLAINE 52208 6 Benton 750 13.86 20.79 12 249.48 9.00 2.38 False
7543 2015-02-03 4346 SHELLSBURG 52332 6 Benton 1750 7.17 10.76 12 129.12 21.00 5.55 False
8045 2015-02-17 4346 SHELLSBURG 52332 6 Benton 750 15.00 22.50 3 67.50 2.25 0.59 False
8140 2015-09-15 4346 SHELLSBURG 52332 6 Benton 200 1.75 2.63 2 5.26 0.40 0.11 False
8390 2015-06-02 4346 SHELLSBURG 52332 6 Benton 1000 6.92 10.38 1 10.38 1.00 0.26 False
8523 2015-11-21 4070 BELLE PLAINE 52208 6 Benton 750 15.00 22.50 12 270.00 9.00 2.38 False
8539 2015-10-20 4228 VINTON 52349 6 Benton 750 6.88 10.82 12 129.84 9.00 2.38 False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2691601 2016-03-29 4346 SHELLSBURG 52332 6 Benton 1750 7.62 11.43 1 11.43 1.75 0.46 False
2691602 2016-03-29 4346 SHELLSBURG 52332 6 Benton 200 1.75 2.63 5 13.15 1.00 0.26 False
2691603 2016-03-29 4346 SHELLSBURG 52332 6 Benton 375 3.55 5.33 2 10.66 0.75 0.20 False
2691604 2016-03-29 4346 SHELLSBURG 52332 6 Benton 1750 7.84 11.76 1 11.76 1.75 0.46 False
2691605 2016-03-29 4346 SHELLSBURG 52332 6 Benton 1000 6.63 9.95 4 39.80 4.00 1.06 False
2691606 2016-03-29 4346 SHELLSBURG 52332 6 Benton 1000 18.38 27.57 5 137.85 5.00 1.32 False
2691607 2016-03-29 4346 SHELLSBURG 52332 6 Benton 750 27.00 40.50 1 40.50 0.75 0.20 False
2691608 2016-03-29 4346 SHELLSBURG 52332 6 Benton 1000 7.50 11.25 2 22.50 2.00 0.53 False
2691609 2016-03-29 4346 SHELLSBURG 52332 6 Benton 1000 9.50 14.25 1 14.25 1.00 0.26 False
2691610 2016-03-29 4346 SHELLSBURG 52332 6 Benton 1000 4.05 6.08 1 6.08 1.00 0.26 False
2691611 2016-03-29 4346 SHELLSBURG 52332 6 Benton 1000 11.50 17.25 1 17.25 1.00 0.26 False
2691612 2016-03-29 4346 SHELLSBURG 52332 6 Benton 1000 7.87 11.81 4 47.24 4.00 1.06 False
2691613 2016-03-29 4346 SHELLSBURG 52332 6 Benton 1000 7.87 11.81 1 11.81 1.00 0.26 False
2691614 2016-03-29 4346 SHELLSBURG 52332 6 Benton 1000 7.87 11.81 1 11.81 1.00 0.26 False
2691615 2016-03-29 4346 SHELLSBURG 52332 6 Benton 750 4.75 7.13 1 7.13 0.75 0.20 False
2691616 2016-03-29 4346 SHELLSBURG 52332 6 Benton 1000 11.00 16.50 2 33.00 2.00 0.53 False
2691617 2016-03-29 4346 SHELLSBURG 52332 6 Benton 1000 11.34 17.01 2 34.02 2.00 0.53 False
2691618 2016-03-29 4346 SHELLSBURG 52332 6 Benton 1000 12.97 19.46 1 19.46 1.00 0.26 False
2691619 2016-03-29 4346 SHELLSBURG 52332 6 Benton 1000 6.14 9.21 2 18.42 2.00 0.53 False
2691620 2016-03-29 4346 SHELLSBURG 52332 6 Benton 1000 18.75 28.13 2 56.26 2.00 0.53 False
2691621 2016-03-29 4346 SHELLSBURG 52332 6 Benton 750 9.45 14.18 2 28.36 1.50 0.40 False
2699523 2016-03-30 5190 WALFORD 52351 6 Benton 750 8.25 12.38 12 148.56 9.00 2.38 False
2699524 2016-03-30 5190 WALFORD 52351 6 Benton 750 4.00 6.00 12 72.00 9.00 2.38 False
2699525 2016-03-30 5190 WALFORD 52351 6 Benton 750 12.05 18.08 3 54.24 2.25 0.59 False
2699526 2016-03-30 5190 WALFORD 52351 6 Benton 750 15.07 22.61 12 271.32 9.00 2.38 False
2699527 2016-03-30 5190 WALFORD 52351 6 Benton 750 5.23 7.85 12 94.20 9.00 2.38 False
2699528 2016-03-30 5190 WALFORD 52351 6 Benton 1750 10.45 15.68 6 94.08 10.50 2.77 False
2699529 2016-03-30 5190 WALFORD 52351 6 Benton 1750 7.20 10.80 6 64.80 10.50 2.77 False
2699530 2016-03-30 5190 WALFORD 52351 6 Benton 750 6.49 9.74 3 29.22 2.25 0.59 False
2699531 2016-03-30 5190 WALFORD 52351 6 Benton 750 7.88 11.82 5 59.10 3.75 0.99 False

9803 rows × 14 columns

In [37]:
# great! no more duplicate county numbers! (see spotcheck on Benton, which used to have two county numbers) 
# No nulls in county numbers!
In [38]:
# our only null values are in Category and Category name, which we expect.
iowa.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2709552 entries, 0 to 2709551
Data columns (total 14 columns):
Date                     datetime64[ns]
Store Number             int64
City                     object
Zip Code                 object
County Number            float64
County                   object
Bottle Volume (ml)       int64
State Bottle Cost        float64
State Bottle Retail      float64
Bottles Sold             int64
Sale (Dollars)           float64
Volume Sold (Liters)     float64
Volume Sold (Gallons)    float64
countynull               bool
dtypes: bool(1), datetime64[ns](1), float64(6), int64(3), object(3)
memory usage: 292.0+ MB
In [39]:
import numpy as np
In [40]:
#Are numbers reasonable?
iowa[["Bottle Volume (ml)","State Bottle Cost", "State Bottle Retail", "Bottles Sold", "Sale (Dollars)", 
      "Volume Sold (Liters)","Volume Sold (Gallons)"]].describe()
Out[40]:
Bottle Volume (ml) State Bottle Cost State Bottle Retail Bottles Sold Sale (Dollars) Volume Sold (Liters) Volume Sold (Gallons)
count 2709552.000000 2709552.000000 2709552.000000 2709552.000000 2709552.000000 2709552.000000 2709552.000000
mean 924.211126 9.816209 14.740118 9.838390 129.119100 8.921504 2.357017
std 546.485776 14.906556 22.359532 23.498628 399.461537 28.244350 7.461356
min 0.000000 0.890000 1.340000 1.000000 1.340000 0.000000 0.000000
25% 750.000000 5.510000 8.270000 2.000000 30.480000 1.500000 0.400000
50% 750.000000 8.070000 12.300000 6.000000 70.560000 5.250000 1.390000
75% 1000.000000 11.960000 17.940000 12.000000 135.000000 10.500000 2.770000
max 225000.000000 6468.000000 9702.000000 3960.000000 106326.000000 3960.000000 1046.120000
In [41]:
for i in ["Bottle Volume (ml)","State Bottle Cost", "State Bottle Retail", "Bottles Sold", "Sale (Dollars)", 
          "Volume Sold (Liters)","Volume Sold (Gallons)"]:
    print(iowa[i].value_counts())
750       1227979
1750       541448
1000       367592
375        272113
500        121004
200         99426
600         32835
3000        16294
300         13719
100          8553
800          2866
1200         2832
2400         1004
400           335
850           293
603           158
4800          152
50            150
3600          146
6000          130
2250           91
250            71
950            68
150            60
1125           54
1500           51
900            50
2550           42
4500           27
502             3
180000          2
0               2
189000          1
225000          1
Name: Bottle Volume (ml), dtype: int64
8.25       61549
6.50       48647
9.00       43436
10.00      42490
10.49      40135
15.00      37499
7.17       29630
7.47       28936
7.00       27601
6.30       27472
5.00       27238
5.23       25954
11.49      25728
6.90       25473
7.49       23089
3.34       22910
7.62       22510
6.92       21978
8.98       21958
3.37       21275
12.50      21228
8.20       21029
4.75       20268
11.00      20130
18.49      20098
7.20       19797
5.50       19223
3.50       19151
18.00      18827
10.50      17184
           ...  
13.35          1
102.85         1
245.00         1
120.22         1
14.88          1
407.02         1
48.00          1
20.12          1
22.05          1
27.33          1
15.80          1
8.69           1
14.94          1
2.72           1
71.80          1
11.17          1
14.36          1
6468.00        1
5500.00        1
8.12           1
87.75          1
40.25          1
25.50          1
7.55           1
33.00          1
22.25          1
35.80          1
483.00         1
31.08          1
51.38          1
Name: State Bottle Cost, dtype: int64
12.38      61549
9.75       49084
13.50      43942
15.00      42567
15.74      38370
22.50      37583
10.76      29630
11.21      28936
10.50      27669
9.45       27472
7.50       27238
7.85       25954
17.24      25696
10.35      25473
11.24      23091
5.01       22910
11.43      22510
10.38      21978
13.47      21841
18.75      21484
5.06       21275
12.30      21029
7.13       20267
16.50      20130
27.74      20098
10.80      19797
8.25       19249
5.25       19151
27.00      18882
15.75      17219
           ...  
623.40         1
18.90          1
31.19          1
20.18          1
16.76          1
36.25          1
367.50         1
724.50         1
28.80          1
51.51          1
72.00          1
70.50          1
30.62          1
11.60          1
57.75          1
9.43           1
12.93          1
26.22          1
26.75          1
41.00          1
33.38          1
9.20           1
22.80          1
18.35          1
504.99         1
45.75          1
21.84          1
1011.24        1
77.07          1
15.23          1
Name: State Bottle Retail, dtype: int64
12      729679
6       523228
2       376696
1       313147
3       280903
4       149473
24      148590
48       37388
5        31965
36       21104
18       17129
10       13816
8        12311
60       10904
30        6249
72        5061
120       3850
7         3673
96        3319
84        2023
9         1601
144       1401
15        1138
42        1069
240       1066
180       1037
300        967
90         908
150        805
20         753
         ...  
618          1
615          1
157          1
678          1
92           1
53           1
522          1
159          1
55           1
494          1
59           1
195          1
61           1
64           1
750          1
67           1
69           1
744          1
91           1
74           1
75           1
77           1
726          1
525          1
81           1
178          1
85           1
176          1
684          1
1980         1
Name: Bottles Sold, dtype: int64
162.00     34673
148.56     25146
64.80      20728
94.20      19274
70.56      18967
90.00      18014
60.12      16813
117.00     16558
62.28      16239
73.80      16025
188.88     15928
60.72      15782
64.56      15497
180.00     15449
135.00     15189
126.00     14705
45.00      14481
30.00      13281
161.64     13111
270.00     12724
99.00      12525
72.00      12510
124.20     12375
81.00      12338
132.78     10597
58.50      10338
24.76      10062
31.50       9779
225.00      9745
19.50       9715
           ...  
220.43         1
230.93         1
86.13          1
232.68         1
68.30          1
317.28         1
180.18         1
674.91         1
72.87          1
671.58         1
106.70         1
170.18         1
632.16         1
626.16         1
3119.76        1
620.16         1
606.08         1
605.16         1
74.62          1
535.08         1
78.37          1
173.18         1
568.08         1
83.37          1
558.66         1
177.43         1
554.58         1
548.16         1
542.16         1
215.76         1
Name: Sale (Dollars), dtype: int64
9.00       533259
10.50      357856
1.50       247102
2.25       199464
12.00      173414
0.75       172934
3.00       135887
4.50       122337
1.00        60702
0.50        56816
21.00       52590
2.00        49652
3.50        40584
1.12        37781
18.00       36961
6.00        35611
0.60        30790
1.75        29865
24.00       28755
4.80        21159
1.20        21110
9.60        20249
0.38        19828
5.25        19043
4.00        18020
3.75        16446
31.50       14942
36.00       13845
0.80         9834
0.40         9719
            ...  
1837.50         1
204.75          1
1932.00         1
14.25           1
1980.00         1
124.25          1
4.75            1
1186.50         1
1176.00         1
1165.50         1
96.25           1
229.25          1
51.75           1
99.75           1
913.50          1
24.75           1
217.50          1
211.50          1
972.00          1
206.50          1
117.75          1
1081.50         1
121.50          1
1092.00         1
1102.50         1
1116.00         1
1128.00         1
186.00          1
185.50          1
333.60          1
Name: Volume Sold (Liters), dtype: int64
2.38      533259
2.77      357856
0.40      247102
0.59      199464
3.17      173414
0.20      172934
0.79      135887
1.19      122337
0.26       60702
0.13       56816
5.55       52590
0.53       49652
0.92       40584
0.30       37781
4.76       36961
1.59       35623
0.16       30790
0.46       29865
6.34       28755
1.27       21160
0.32       21123
2.54       20249
0.10       19828
1.39       19043
1.06       18020
0.99       16446
8.32       14942
9.51       13845
0.21        9834
0.11        9719
           ...  
716.43         1
807.18         1
5.81           1
0.77           1
174.62         1
104.02         1
1.64           1
13.47          1
624.11         1
6.97           1
162.47         1
73.97          1
19.22          1
96.69          1
180.69         1
49.14          1
18.39          1
313.44         1
6.14           1
832.14         1
307.89         1
212.39         1
14.58          1
128.39         1
326.52         1
443.81         1
468.77         1
871.77         1
60.56          1
485.42         1
Name: Volume Sold (Gallons), dtype: int64
In [42]:
# next we check that (1/100)*"Bottle Volume (ml)" * "Bottles Sold" == "Volume Sold (Liters)"
In [43]:
a = iowa.copy()
In [44]:
a.columns
Out[44]:
Index(['Date', 'Store Number', 'City', 'Zip Code', 'County Number', 'County',
       'Bottle Volume (ml)', 'State Bottle Cost', 'State Bottle Retail',
       'Bottles Sold', 'Sale (Dollars)', 'Volume Sold (Liters)',
       'Volume Sold (Gallons)', 'countynull'],
      dtype='object')
In [45]:
a["volume sold calc"] = np.round(((1/1000)*a["Bottle Volume (ml)"])*a["Bottles Sold"],2)
a["volume sold check"] = a["volume sold calc"] == a["Volume Sold (Liters)"]
a["volume sold check"].value_counts()
Out[45]:
True    2709552
Name: volume sold check, dtype: int64
In [46]:
a["gallon calc"] = np.round(a["volume sold calc"]*0.264172, 2)
a["gallon sold check"] = a["gallon calc"] - a["Volume Sold (Gallons)"]
a["gallon sold check"].value_counts()
Out[46]:
 0.00    2709203
-0.01        345
 0.01          4
Name: gallon sold check, dtype: int64
In [47]:
# looks good! volume sold (ml AND Gallons) are documented correctly!
In [48]:
#Now we'd like to check that all stores of a specific number in one city only.
In [49]:
a = pd.DataFrame(iowa.groupby(["Store Number"])["City"].value_counts())
In [50]:
a.columns = ["Transaction Counts"]
In [51]:
a.reset_index(inplace=True)
In [52]:
a["duplicate"] = a["Store Number"].duplicated()
In [53]:
a["duplicate"].value_counts()
Out[53]:
False    1403
True       32
Name: duplicate, dtype: int64
In [54]:
a[a['duplicate']==True]
Out[54]:
Store Number City Transaction Counts duplicate
52 2536 PLEASANT HILL 334 True
56 2543 OTTUWMA 394 True
105 2604 LEMARS 181 True
118 2620 WINDSOR HEIGHTS 503 True
146 2656 CORNING 51 True
227 3542 ST ANSGAR 37 True
255 3645 WINDSOR HEIGHTS 72 True
302 3742 LEMARS 36 True
310 3762 WINDSOR HEIGHTS 218 True
364 3877 DES MOINES 25 True
381 3909 MELCHER-DALLAS 67 True
392 3930 ST LUCAS 34 True
626 4351 CLEAR LAKE 14 True
694 4433 PLEASANT HILL 38 True
785 4533 ST ANSGAR 18 True
830 4584 DUMONT 46 True
884 4647 WATERLOO 45 True
1057 4839 Northwood 27 True
1059 4840 WATERLOO 50 True
1066 4847 LECLAIRE 124 True
1096 4882 ST CHARLES 26 True
1132 4925 URBANDALE 21 True
1134 4926 DES MOINES 55 True
1152 4947 CEDAR FALLS 227 True
1171 4971 PLEASANT HILL 47 True
1252 5058 Urbandale 18 True
1301 5106 DEWITT 85 True
1303 5107 WINDSOR HEIGHTS 72 True
1363 5169 Des Moines 83 True
1427 9002 LECLAIRE 14 True
1430 9013 Cumming 7 True
1434 9023 Carroll 1 True
In [55]:
#this seems minor given the scale of the data, so we're going to set it aside for now. but may be worth further 
#investigation
In [56]:
# Assumption 1: We've sufficiently ammended for erroneous zip and city info from an external source.
# Assumption 2: We were able to fill in missing or erroneous county number values by favoring other data records where
#      it was filled in. We assume that if there was a county with two county numbers in the data, that the county where 
#      the majority of the records were was the correct one.
# Risk 1: Misinformation in Store code - there is no source to back into these values
# Risk 2: Category and Category Name are left as-is, with the understanding that they will be amended should they 
#      be relevant to our model predicting liquor sales.
# Risk 3: There is no way to determine store profit or anything in excess of what is taxed.
In [57]:
#Let's eliminate extra columns and rename our favorite features.
In [58]:
iowa.head(2)
Out[58]:
Date Store Number City Zip Code County Number County Bottle Volume (ml) State Bottle Cost State Bottle Retail Bottles Sold Sale (Dollars) Volume Sold (Liters) Volume Sold (Gallons) countynull
0 2015-11-20 2191 KEOKUK 52632 56 Lee 750 18.09 27.14 6 162.84 4.5 1.19 False
1 2015-11-21 2205 CLARINDA 51632 73 Page 750 18.09 27.14 12 325.68 9.0 2.38 False
In [59]:
iowacols = ["date", "store", "city", "zip", "cnty-num", "cnty-name", 'vol-ml', 'state-cost', 'state-ret', 'qty-sold', 
                 'sale-price', 'vol-sold-l', 'vol-sold-g']
In [60]:
len(iowacols)
Out[60]:
13
In [61]:
len(iowa.columns)
Out[61]:
14
In [62]:
iowa = iowa[list(iowa.columns)[0:13]]
iowa.head(2)
Out[62]:
Date Store Number City Zip Code County Number County Bottle Volume (ml) State Bottle Cost State Bottle Retail Bottles Sold Sale (Dollars) Volume Sold (Liters) Volume Sold (Gallons)
0 2015-11-20 2191 KEOKUK 52632 56 Lee 750 18.09 27.14 6 162.84 4.5 1.19
1 2015-11-21 2205 CLARINDA 51632 73 Page 750 18.09 27.14 12 325.68 9.0 2.38
In [63]:
iowa.columns = iowacols
In [87]:
#Finally, we add Cost total, Sales Total, Sales Tax, Excise Tax, and Revenue
In [64]:
import math
In [66]:
iowa["cost"] = np.round(iowa["state-cost"]*iowa['qty-sold'],2)
iowa['sales-tax'] = np.round(.06*iowa["sale-price"],2)
iowa["excise"] = np.round(iowa["sale-price"]-iowa['cost'],2)
In [67]:
iowa.head()
Out[67]:
date store city zip cnty-num cnty-name vol-ml state-cost state-ret qty-sold sale-price vol-sold-l vol-sold-g cost sales-tax excise
0 2015-11-20 2191 KEOKUK 52632 56 Lee 750 18.09 27.14 6 162.84 4.50 1.19 108.54 9.77 54.30
1 2015-11-21 2205 CLARINDA 51632 73 Page 750 18.09 27.14 12 325.68 9.00 2.38 217.08 19.54 108.60
2 2015-11-16 3549 FORT MADISON 52627 56 Lee 150 6.40 9.60 2 19.20 0.30 0.08 12.80 1.15 6.40
3 2015-11-04 2513 IOWA CITY 52246 52 Johnson 1750 35.55 53.34 3 160.02 5.25 1.39 106.65 9.60 53.37
4 2015-11-17 3942 TOLEDO 52342 86 Tama 150 6.40 9.60 2 19.20 0.30 0.08 12.80 1.15 6.40
In [68]:
iowa.describe()
Out[68]:
store cnty-num vol-ml state-cost state-ret qty-sold sale-price vol-sold-l vol-sold-g cost sales-tax excise
count 2709552.000000 2709552.000000 2709552.000000 2709552.000000 2709552.000000 2709552.000000 2709552.000000 2709552.000000 2709552.000000 2709552.000000 2709552.000000 2709552.000000
mean 3589.070592 57.561262 924.211126 9.816209 14.740118 9.838390 129.119100 8.921504 2.357017 85.955665 7.747086 43.163435
std 949.108341 27.229228 546.485776 14.906556 22.359532 23.498628 399.461537 28.244350 7.461356 265.623098 23.967692 133.868440
min 2106.000000 1.000000 0.000000 0.890000 1.340000 1.000000 1.340000 0.000000 0.000000 0.890000 0.080000 0.000000
25% 2604.000000 32.000000 750.000000 5.510000 8.270000 2.000000 30.480000 1.500000 0.400000 20.300000 1.830000 10.200000
50% 3721.000000 63.000000 750.000000 8.070000 12.300000 6.000000 70.560000 5.250000 1.390000 46.560000 4.230000 23.520000
75% 4382.000000 77.000000 1000.000000 11.960000 17.940000 12.000000 135.000000 10.500000 2.770000 90.000000 8.100000 45.000000
max 9023.000000 99.000000 225000.000000 6468.000000 9702.000000 3960.000000 106326.000000 3960.000000 1046.120000 70884.000000 6379.560000 35442.000000
In [92]:
#wait, why do we have a zero dollar minimum excise?
In [69]:
iowa.loc[iowa["excise"] ==0,:]
Out[69]:
date store city zip cnty-num cnty-name vol-ml state-cost state-ret qty-sold sale-price vol-sold-l vol-sold-g cost sales-tax excise
15657 2015-07-30 2515 MASON CITY 50402 17 Cerro Gordo 600 8.4 8.4 2 16.8 1.2 0.32 16.8 1.01 0
16253 2015-05-13 2543 OTTUMWA 52501 90 Wapello 600 8.4 8.4 3 25.2 1.8 0.48 25.2 1.51 0
22540 2015-06-16 2327 CORNING 50841 2 Adams 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
25348 2015-05-18 2613 COUNCIL BLUFFS 51503 78 Pottawattamie 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
67203 2015-05-12 2518 RED OAK 51591 69 Montgomery 600 8.4 8.4 2 16.8 1.2 0.32 16.8 1.01 0
71157 2015-05-27 4509 AMES 50014 85 Story 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
83087 2015-06-15 2521 WEST DES MOINES 50398 77 Polk 600 8.4 8.4 10 84.0 6.0 1.59 84.0 5.04 0
124267 2015-05-22 2613 COUNCIL BLUFFS 51503 78 Pottawattamie 600 8.4 8.4 10 84.0 6.0 1.59 84.0 5.04 0
132073 2015-07-15 3650 HOLSTEIN 51025 47 Ida 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
137397 2015-06-30 4295 DES MOINES 50981 77 Polk 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
178194 2015-05-06 3650 HOLSTEIN 51025 47 Ida 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
199968 2015-06-04 5116 WATERLOO 50706 7 Black Hawk 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
210546 2015-07-06 5116 WATERLOO 50706 7 Black Hawk 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
219423 2015-05-27 2505 BOONE 50037 8 Boone 600 8.4 8.4 3 25.2 1.8 0.48 25.2 1.51 0
224467 2015-05-04 2653 WASHINGTON 52353 92 Washington 600 8.4 8.4 2 16.8 1.2 0.32 16.8 1.01 0
253040 2015-05-05 3514 FORT DODGE 50501 94 Webster 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
264124 2015-05-21 2515 MASON CITY 50402 17 Cerro Gordo 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
280133 2015-06-30 2623 SIOUX CITY 51111 97 Woodbury 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
372627 2015-06-08 2662 MUSCATINE 52761 70 Muscatine 600 8.4 8.4 2 16.8 1.2 0.32 16.8 1.01 0
379674 2015-06-15 2528 DES MOINES 50981 77 Polk 600 8.4 8.4 2 16.8 1.2 0.32 16.8 1.01 0
387539 2015-04-23 3678 DES MOINES 50981 77 Polk 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
393906 2015-05-26 2624 DUBUQUE 52099 31 Dubuque 600 8.4 8.4 10 84.0 6.0 1.59 84.0 5.04 0
408340 2015-06-17 2623 SIOUX CITY 51111 97 Woodbury 600 8.4 8.4 2 16.8 1.2 0.32 16.8 1.01 0
435385 2015-05-05 2665 WAUKEE 50263 25 Dallas 600 8.4 8.4 2 16.8 1.2 0.32 16.8 1.01 0
498252 2015-05-28 2626 DES MOINES 50981 77 Polk 600 8.4 8.4 10 84.0 6.0 1.59 84.0 5.04 0
503396 2015-05-06 3650 HOLSTEIN 51025 47 Ida 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
512747 2015-05-12 2327 CORNING 50841 2 Adams 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
515002 2015-06-09 4277 LE MARS 51031 75 Plymouth 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
560989 2015-07-08 2623 SIOUX CITY 51111 97 Woodbury 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
634813 2015-05-04 4307 DUNLAP 51529 43 Harrison 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1380986 2015-05-11 2662 MUSCATINE 52761 70 Muscatine 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
1385788 2015-05-27 4509 AMES 50014 85 Story 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
1406868 2015-05-05 2629 COUNCIL BLUFFS 51503 78 Pottawattamie 600 8.4 8.4 2 16.8 1.2 0.32 16.8 1.01 0
1412153 2015-05-19 3942 TOLEDO 52342 86 Tama 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
1540193 2015-05-27 2505 BOONE 50037 8 Boone 600 8.4 8.4 3 25.2 1.8 0.48 25.2 1.51 0
1565248 2015-07-27 2506 BURLINGTON 52601 29 Des Moines 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
1588435 2015-04-22 3842 BANCROFT 50517 55 Kossuth 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
1609173 2015-06-04 5116 WATERLOO 50706 7 Black Hawk 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
1615042 2015-06-10 2623 SIOUX CITY 51111 97 Woodbury 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
1620985 2015-06-01 2643 WATERLOO 50706 7 Black Hawk 600 8.4 8.4 2 16.8 1.2 0.32 16.8 1.01 0
1624137 2015-06-16 2228 WINTERSET 50273 61 Madison 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
1628697 2015-06-30 2623 SIOUX CITY 51111 97 Woodbury 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
1651168 2015-06-02 2666 ANKENY 50023 77 Polk 600 8.4 8.4 10 84.0 6.0 1.59 84.0 5.04 0
1657188 2015-07-08 2623 SIOUX CITY 51111 97 Woodbury 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
1669960 2015-06-02 2666 ANKENY 50023 77 Polk 600 8.4 8.4 10 84.0 6.0 1.59 84.0 5.04 0
1677299 2015-05-05 2518 RED OAK 51591 69 Montgomery 600 8.4 8.4 2 16.8 1.2 0.32 16.8 1.01 0
1680377 2015-05-14 2582 MASON CITY 50402 17 Cerro Gordo 600 8.4 8.4 2 16.8 1.2 0.32 16.8 1.01 0
1706248 2015-06-01 4504 INDIANOLA 50125 91 Warren 600 8.4 8.4 10 84.0 6.0 1.59 84.0 5.04 0
1706510 2015-06-01 2506 BURLINGTON 52601 29 Des Moines 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
1708279 2015-04-29 2614 DAVENPORT 52809 82 Scott 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
1738444 2015-05-11 2538 WATERLOO 50706 7 Black Hawk 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
1745839 2015-04-29 2614 DAVENPORT 52809 82 Scott 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
1758180 2015-07-20 4114 ATLANTIC 50022 15 Cass 600 8.4 8.4 3 25.2 1.8 0.48 25.2 1.51 0
1825455 2015-05-19 3942 TOLEDO 52342 86 Tama 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
1851707 2015-06-09 4277 LE MARS 51031 75 Plymouth 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0
1853646 2015-05-06 3932 MAPLETON 51034 67 Monona 600 8.4 8.4 2 16.8 1.2 0.32 16.8 1.01 0
1888279 2015-05-27 2603 BETTENDORF 52722 82 Scott 600 8.4 8.4 10 84.0 6.0 1.59 84.0 5.04 0
1900960 2015-07-29 2623 SIOUX CITY 51111 97 Woodbury 600 8.4 8.4 2 16.8 1.2 0.32 16.8 1.01 0
1914323 2015-07-20 4114 ATLANTIC 50022 15 Cass 600 8.4 8.4 3 25.2 1.8 0.48 25.2 1.51 0
1949719 2015-07-27 2506 BURLINGTON 52601 29 Des Moines 600 8.4 8.4 1 8.4 0.6 0.16 8.4 0.50 0

108 rows × 16 columns

In [70]:
#these don't look right. Let's see what the markup percentage should be
iowa["markup-pct"] = np.round(iowa["state-ret"]/iowa['state-cost'],2)
In [71]:
# It's 1.5! Our research is consistent with this. Good.
iowa["markup-pct"].describe()
Out[71]:
count    2709552.000000
mean           1.501768
std            0.011987
min            1.000000
25%            1.500000
50%            1.500000
75%            1.500000
max            3.500000
Name: markup-pct, dtype: float64
In [96]:
# markup percentage of 3.5! what a can of worms we have here. Let's first fix the 0 excise one. 
# We can assume these were documented incorrectly because it is actually illegal to sell these at no tax.
# Iowa collects a 50% markup of any liquor. We can assume that excise in addition to 50% of the markup is erroneous.
# source: http://statelaws.findlaw.com/iowa-law/iowa-consumer-tax-laws.html
In [72]:
iowa.loc[iowa["excise"] ==0,"state-ret"]=np.round(1.5*iowa["state-cost"],2)
In [73]:
#now to correct the markup percentage of 3.5 which is way high!
In [74]:
iowa.loc[iowa["markup-pct"] ==3.5,:]
Out[74]:
date store city zip cnty-num cnty-name vol-ml state-cost state-ret qty-sold sale-price vol-sold-l vol-sold-g cost sales-tax excise markup-pct
325986 2015-10-08 2633 DES MOINES 50981 77 Polk 750 2.2 7.7 1 7.7 0.75 0.20 2.2 0.46 5.5 3.5
489183 2015-10-22 2633 DES MOINES 50981 77 Polk 750 2.2 7.7 24 184.8 18.00 4.76 52.8 11.09 132.0 3.5
863878 2015-10-01 5102 MOUNT VERNON 52314 57 Linn 750 2.2 7.7 6 46.2 4.50 1.19 13.2 2.77 33.0 3.5
954408 2015-10-12 2633 DES MOINES 50981 77 Polk 750 2.2 7.7 1 7.7 0.75 0.20 2.2 0.46 5.5 3.5
1015894 2015-10-05 2633 DES MOINES 50981 77 Polk 750 2.2 7.7 3 23.1 2.25 0.59 6.6 1.39 16.5 3.5
1135118 2015-10-13 2578 CHARLES CITY 50616 34 Floyd 750 2.2 7.7 6 46.2 4.50 1.19 13.2 2.77 33.0 3.5
1149858 2015-10-01 2633 DES MOINES 50981 77 Polk 750 2.2 7.7 1 7.7 0.75 0.20 2.2 0.46 5.5 3.5
1171131 2015-10-21 5180 OELWEIN 50662 33 Fayette 750 2.2 7.7 2 15.4 1.50 0.40 4.4 0.92 11.0 3.5
1798342 2015-10-26 2633 DES MOINES 50981 77 Polk 750 2.2 7.7 30 231.0 22.50 5.94 66.0 13.86 165.0 3.5
In [75]:
iowa.loc[iowa["markup-pct"] != 1.5,"state-ret"]=np.round(1.5*iowa["state-cost"],2)
In [101]:
# Phew, would have completely lost those if we didn't check out state legislation and the calculated columns!
In [93]:
# write data to csv for continued use
iowa.to_csv('iowa_clean.csv', index=False)

Explore the data

Perform some exploratory statistical analysis and make some plots, such as histograms of transaction totals, bottles sold, etc.

In [1]:
import pandas as pd
from datetime import datetime
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
iowa = pd.read_csv('iowa_clean.csv')
In [76]:
# source = http://www.collegecalc.org/colleges/iowa/four-year/
# Chose only schools with >1000 f/t students and with a ranking, as most schools without a ranking consist of students who 
# are either close to home, living at home, or working in addition to classes. There is less of a drinking culture. 
# The largest university to be excluded on the grounds of no ranking is Kaplan University Davenport Campus with 
# 9231 students.
uni = pd.read_csv("iowa-uni.csv")
uni =uni[0:23]
In [77]:
schools = uni["School"]
In [81]:
from time import sleep

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

<font color = "red">You do not have to run this if you also collected file "iowa-uni-coords.csv" from Natalie!</font>

In [82]:
driver = webdriver.Firefox()
wait = WebDriverWait(driver, 30)
coords = []
driver.get('https://www.google.com/maps')
for school in schools:
    searchbox = wait.until(EC.presence_of_element_located((By.ID, 'searchboxinput')))
    searchbox.clear()
    searchbox.send_keys(school + ' iowa')
    driver.find_element_by_id('searchbox-searchbutton').click()
    sleep(6)
    url = driver.current_url
    tries = 0
    while '@' not in url:
        tries += 1
        sleep(1)
        url = driver.current_url
        if tries == 5: #try five times to give the url the chance to resolve
            break
    try:
        long_lat = url.split('@')[1].split(',')[:2]
        coords.append((school, long_lat[0], long_lat[1]))
    except:
        coords.append((school, None, None)) 
driver.quit()  
In [83]:
coords
Out[83]:
[('Buena Vista University', '42.6410826', '-95.2120009'),
 ('Central College', '41.400595', '-92.9218557'),
 ('Clarke University', '42.5092662', '-90.6935688'),
 ('Coe College', '41.9884283', '-91.6613174'),
 ('Cornell College', '41.9269546', '-91.4275389'),
 ('Dordt College', '43.0824829', '-96.1685495'),
 ('Drake University', '41.6029128', '-93.6574871'),
 ('Graceland University Lamoni', '40.616346', '-93.9278267'),
 ('Grand View University', '41.6207345', '-93.6024867'),
 ('Grinnell College', '41.7490571', '-92.7223189'),
 ('Iowa State University', '40.3680026', '-89.819982'),
 ('Loras College', '42.502738', '-90.6835877'),
 ('Luther College', '43.311063', '-91.808442'),
 ('Morningside College', '42.4733866', '-96.3607242'),
 ('Mount Mercy University', '42.0027013', '-91.6544266'),
 ('Northwestern College', '42.9979484', '-96.0604485'),
 ('Saint Ambrose University', '41.5404254', '-90.5829498'),
 ('Simpson College', '41.364988', '-93.5658147'),
 ('University of Dubuque', '42.4952453', '-90.6978244'),
 ('University of Iowa', '41.6613397', '-91.5536702'),
 ('University of Northern Iowa', '42.5115523', '-92.4602541'),
 ('Upper Iowa Uniersity', '42.8387392', '-91.799972'),
 ('Wartburg College', '42.7285775', '-92.4843476')]
In [84]:
coords = pd.DataFrame(coords)
In [85]:
coords.to_csv("iowa-uni-coords.csv", index = False)

<font color ="red">Start here if you have csv file "iowa-uni-coords" collected from NMO.</font>

In [3]:
coords=pd.read_csv("iowa-uni-coords.csv")
In [4]:
coords.columns =[["School", "Latitude", "Longitude"]]
In [5]:
#iowa state university is mapped wrong! how did that happen? I guess always double check your webscraper.
# 42.0266187,-93.6486541
In [6]:
coords.head(20)
Out[6]:
School Latitude Longitude
0 Buena Vista University 42.641083 -95.212001
1 Central College 41.400595 -92.921856
2 Clarke University 42.509266 -90.693569
3 Coe College 41.988428 -91.661317
4 Cornell College 41.926955 -91.427539
5 Dordt College 43.082483 -96.168549
6 Drake University 41.602913 -93.657487
7 Graceland University Lamoni 40.616346 -93.927827
8 Grand View University 41.620734 -93.602487
9 Grinnell College 41.749057 -92.722319
10 Iowa State University 40.368003 -89.819982
11 Loras College 42.502738 -90.683588
12 Luther College 43.311063 -91.808442
13 Morningside College 42.473387 -96.360724
14 Mount Mercy University 42.002701 -91.654427
15 Northwestern College 42.997948 -96.060448
16 Saint Ambrose University 41.540425 -90.582950
17 Simpson College 41.364988 -93.565815
18 University of Dubuque 42.495245 -90.697824
19 University of Iowa 41.661340 -91.553670
In [7]:
coords.loc[coords["School"]=="Iowa State University", "Latitude"] = 42.0266187
coords.loc[coords["School"]=="Iowa State University", "Longitude"] = -93.6486541
In [8]:
#let's import our visualization packages.
In [9]:
import seaborn as sns
import matplotlib.pyplot as plt
from bokeh.io import output_notebook, show
output_notebook()
Loading BokehJS ...
In [10]:
a = pd.DataFrame(iowa.groupby("cnty-num")["excise"].sum())
In [11]:
a.reset_index(inplace=True)
In [12]:
a.head()
Out[12]:
cnty-num excise
0 1.0 99248.06
1 2.0 41836.02
2 3.0 336293.58
3 4.0 338571.27
4 5.0 71761.10
In [13]:
ia_liq ={}
for i in range(0,len(a)):
    key = int(a["cnty-num"][i])
    value = float(a["excise"][i])
    ia_liq[key] = value
In [14]:
from bokeh.io import show
from bokeh.models import (
    ColumnDataSource,
    HoverTool,
    LogColorMapper
)
from bokeh.palettes import Viridis6 as palette
from bokeh.plotting import figure

from bokeh.sampledata.us_counties import data as counties
#from bokeh.sampledata.unemployment import data as unemployment

palette.reverse()

counties = {
    code: county for code, county in counties.items() if county["state"] == "ia"
}

county_xs = [county["lons"] for county in counties.values()]
county_ys = [county["lats"] for county in counties.values()]

county_names = [county['name'] for county in counties.values()]
county_rates = [ia_liq[county_id[1]] if county_id[1] in ia_liq else None for county_id in counties]
color_mapper = LogColorMapper(palette=palette)

source = ColumnDataSource(data=dict(
    x=county_xs,
    y=county_ys,
    name=county_names,
    rate=county_rates,
))

town_xs = list(coords["Longitude"])
town_ys = list(coords["Latitude"])
source2 = ColumnDataSource(data=dict(x=town_xs, y=town_ys))

TOOLS = "pan,wheel_zoom,reset,hover,save"

p = figure(
    title="Liquor Tax Revenue in Iowa", tools=TOOLS,
    x_axis_location=None, y_axis_location=None
)
p.grid.grid_line_color = None

p.patches('x', 'y', source=source,
          fill_color={'field': 'rate', 'transform': color_mapper},
          fill_alpha=0.7, line_color="white", line_width=0.5)

p.circle('x','y', source = source2)

hover = p.select_one(HoverTool)
hover.point_policy = "follow_mouse"
hover.tooltips = [
    ("Name", "@name"),
    ("Total Excise Tax Revenue", "$@rate{1.11}"),
    ("(Long, Lat)", "($x, $y)"),
]

show(p)
In [15]:
# not super impressed with the gradient of colors in bokeh. That's ok. I mostly wanted to see if my college 
# town theory was true. So below I pull in college town information and plot the lat/long on the map. It appears not
# to make that much a difference. Perhaps population is the better indicator; Iowan natives hold their own with any 
# major college towns. College Town indicator does not need to be a variable in our model.
In [16]:
# I'm super interested in the distribution of the excise tax. Here are histograms at varying x-scales.
# We see that among all transactions, there is a wide spread all the way out to ~125000!! wow who got married?
In [17]:
plt.subplot(221)
plt.hist(iowa["excise"])

plt.subplot(222)
plt.hist(iowa["excise"], range = [0,500])

plt.subplot(223)
plt.hist(iowa["excise"], range = [0,200])

plt.subplot(224)
plt.hist(iowa["excise"], range = [0,70])

plt.show()
In [18]:
# Ok let's see the same thing grouped by county. 
a=iowa.groupby(["cnty-name"])["excise"].sum()
In [19]:
a=pd.DataFrame(a)
In [20]:
plt.subplot(221)
plt.hist(a["excise"])
#plt.axis([40, 160, 0, 0.03])

plt.subplot(222)
plt.hist(a["excise"], range = [0,100000])

plt.subplot(223)
plt.hist(a["excise"], range = [0,50000])

plt.subplot(224)
plt.hist(a["excise"], range = [0,20000])

plt.show()
In [21]:
# interesting. skewed right, but when we zoom in on the bulk of the data it is less so
In [22]:
# Let's try to understand those outliers a little better. They skew so far!! We will do so in the "Refine Data" step.
In [23]:
import statsmodels.api as sm
from sklearn import linear_model
/Users/nmolivo/anaconda/lib/python3.5/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools
In [24]:
c=iowa.groupby("date")["excise"].sum()
In [25]:
c =pd.DataFrame(c)
In [26]:
c.index = pd.to_datetime(c.index)
In [27]:
c.head()
Out[27]:
excise
date
2015-01-05 541543.17
2015-01-06 384096.17
2015-01-07 490866.86
2015-01-08 341018.87
2015-01-12 383348.67
In [28]:
c["mo-date"] = c.index.month
In [29]:
c["yr-date"] = c.index.year
In [30]:
c.head()
Out[30]:
excise mo-date yr-date
date
2015-01-05 541543.17 1 2015
2015-01-06 384096.17 1 2015
2015-01-07 490866.86 1 2015
2015-01-08 341018.87 1 2015
2015-01-12 383348.67 1 2015
In [31]:
c["shortdate"] = c.apply(lambda x: (str(int(x["mo-date"]))+"-"+str(int(x["yr-date"]))), axis=1)
In [32]:
c = pd.DataFrame(c.groupby(["shortdate"])["excise"].sum())
In [33]:
c.index = pd.to_datetime(c.index)
In [34]:
c.sort_values(["excise"])
Out[34]:
excise
shortdate
2015-01-01 6181739.35
2016-01-01 6346672.56
2015-02-01 7112305.30
2015-08-01 7386557.28
2015-07-01 7490484.40
2015-05-01 7515179.79
2015-04-01 7633332.48
2016-02-01 7635352.07
2015-03-01 7769270.87
2016-03-01 7906418.06
2015-11-01 7913577.60
2015-09-01 8043362.97
2015-10-01 8696331.15
2015-06-01 9061761.84
2015-12-01 10261225.59
In [35]:
plt.figure(figsize = (10,10))
plt.scatter(c.index, c.excise)
Out[35]:
<matplotlib.collections.PathCollection at 0x1163680f0>
In [36]:
# When grouped by month, this appears to reflect seasonality. The two lowest datapoints are both January, 
# there's spikes with growth up to a plateau range [~725000,~825000], in the spring. Spikes in June and December.
# Month is clearly influential, we will be using it as a dummy variable. Will probably exclude March, as it is pretty
# representative of the plateu liquor sales have.
In [37]:
# We're definitely going to want to look at population. source:
In [38]:
# Ok let's check out a scatter matrix for good measure.
In [ ]:
# this takes forever to run on the full data... holding off for now
from pandas.plotting import scatter_matrix
df = iowa[["vol-ml", 'state-cost', 'state-ret', 'qty-sold', 'sale-price', 'cost', 'excise']]
scatter_matrix(df, alpha=0.2, figsize=(15, 15), diagonal='kde')
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-39-75f4858bbfc0> in <module>()
      1 from pandas.plotting import scatter_matrix
      2 df = iowa[["vol-ml", 'state-cost', 'state-ret', 'qty-sold', 'sale-price', 'cost', 'excise']]
----> 3 scatter_matrix(df, alpha=0.2, figsize=(15, 15), diagonal='kde')

/Users/nmolivo/anaconda/lib/python3.5/site-packages/pandas/plotting/_misc.py in scatter_matrix(frame, alpha, figsize, ax, grid, diagonal, marker, density_kwds, hist_kwds, range_padding, **kwds)
    104 
    105                 ax.scatter(df[b][common], df[a][common],
--> 106                            marker=marker, alpha=alpha, **kwds)
    107 
    108                 ax.set_xlim(boundaries_list[j])

/Users/nmolivo/anaconda/lib/python3.5/site-packages/matplotlib/__init__.py in inner(ax, *args, **kwargs)
   1896                     warnings.warn(msg % (label_namer, func.__name__),
   1897                                   RuntimeWarning, stacklevel=2)
-> 1898             return func(ax, *args, **kwargs)
   1899         pre_doc = inner.__doc__
   1900         if pre_doc is None:

/Users/nmolivo/anaconda/lib/python3.5/site-packages/matplotlib/axes/_axes.py in scatter(self, x, y, s, c, marker, cmap, norm, vmin, vmax, alpha, linewidths, verts, edgecolors, **kwargs)
   4061                 self.set_ymargin(0.05)
   4062 
-> 4063         self.add_collection(collection)
   4064         self.autoscale_view()
   4065 

/Users/nmolivo/anaconda/lib/python3.5/site-packages/matplotlib/axes/_base.py in add_collection(self, collection, autolim)
   1760 
   1761         if autolim:
-> 1762             self.update_datalim(collection.get_datalim(self.transData))
   1763 
   1764         collection._remove_method = lambda h: self.collections.remove(h)

/Users/nmolivo/anaconda/lib/python3.5/site-packages/matplotlib/collections.py in get_datalim(self, transData)
    227             result = mpath.get_path_collection_extents(
    228                 transform.frozen(), paths, self.get_transforms(),
--> 229                 offsets, transOffset.frozen())
    230             result = result.inverse_transformed(transData)
    231         else:

/Users/nmolivo/anaconda/lib/python3.5/site-packages/matplotlib/path.py in get_path_collection_extents(master_transform, paths, transforms, offsets, offset_transform)
   1008     return Bbox.from_extents(*_path.get_path_collection_extents(
   1009         master_transform, paths, np.atleast_3d(transforms),
-> 1010         offsets, offset_transform))
   1011 
   1012 

/Users/nmolivo/anaconda/lib/python3.5/site-packages/matplotlib/path.py in simplify_threshold(self)
    252         self._update_values()
    253 
--> 254     @property
    255     def simplify_threshold(self):
    256         """

KeyboardInterrupt: 
In [39]:
pop = pd.read_csv("../county_pop_data.csv", sep = "\t")
In [40]:
pop = pop[["Geographic name", "7/1/2015", "7/1/2016"]]
In [41]:
pop.columns =[["cnty-name", "2015-pop", "2016-pop"]]
In [42]:
pop["cnty-name"]=[" ".join(x.split()[:-1]) for x in pop["cnty-name"]]
In [43]:
iowa = iowa.merge(pop, on= "cnty-name", how = "left")
In [44]:
import numpy as np
In [45]:
plt.scatter(np.log(iowa["2015-pop"]), iowa["excise"])
Out[45]:
<matplotlib.collections.PathCollection at 0x114df0eb8>
In [46]:
#Looks like a correlation! We're going to need to scale our data, for sure. 

Record your findings

Be sure to write out anything observations from your exploratory analysis.

Per the prompt, we are interested in projecting sales for the following year. In my opinion, the tax board would be interested in the excise. By excluding this, the prompt implies everyone in the room can do the excise tax calculation in their heads and therefore would like everything in terms of sales rather than excise tax.

Since cost is perfectly colinear with sales, we will not include it in the model.
Month is definitely worth including in the model. The prompt specifically asks for Sales from Jan to March per store, likely for this reason.
Population is definitely worth including in the model, though performs better after regularization.

One avenue the Tax Board my take to increase tax revenue is doing a by the gallon or liter tax rather than a 50% markup from the cost.</font>

Mine the data

Now you are ready to compute the variables you will use for your regression from the data. For example, you may want to compute total sales per store from Jan to March of 2015, mean price per bottle, etc. Refer to the readme for more ideas appropriate to your scenario.

Pandas is your friend for this task. Take a look at the operations here for ideas on how to make the best use of pandas and feel free to search for blog and Stack Overflow posts to help you group data by certain variables and compute sums, means, etc. You may find it useful to create a new data frame to house this summary data.

In [47]:
iowa.head()
Out[47]:
date store city zip cnty-num cnty-name vol-ml state-cost state-ret qty-sold sale-price vol-sold-l vol-sold-g cost sales-tax excise markup-pct 2015-pop 2016-pop
0 2015-11-20 2191 KEOKUK 52632 56.0 Lee 750 18.09 27.14 6 162.84 4.50 1.19 108.54 9.77 54.30 1.5 35071.0 34615.0
1 2015-11-21 2205 CLARINDA 51632 73.0 Page 750 18.09 27.14 12 325.68 9.00 2.38 217.08 19.54 108.60 1.5 15521.0 15391.0
2 2015-11-16 3549 FORT MADISON 52627 56.0 Lee 150 6.40 9.60 2 19.20 0.30 0.08 12.80 1.15 6.40 1.5 35071.0 34615.0
3 2015-11-04 2513 IOWA CITY 52246 52.0 Johnson 1750 35.55 53.34 3 160.02 5.25 1.39 106.65 9.60 53.37 1.5 144567.0 146547.0
4 2015-11-17 3942 TOLEDO 52342 86.0 Tama 150 6.40 9.60 2 19.20 0.30 0.08 12.80 1.15 6.40 1.5 17341.0 17319.0
In [48]:
iowa["date"]=pd.to_datetime(iowa["date"])
In [49]:
iowa["year"]= iowa.apply(lambda x: x["date"].year, axis =1)
In [50]:
# iowa2 will be the data grouped by store and year. 
# Keep in mind that qty-sold will now be qty-sold per that year, for that store. 
In [51]:
iowa2 = iowa.groupby(["store","year"])[["qty-sold", 'vol-sold-l', 'sale-price', 'excise', "2015-pop", "2016-pop"]].sum()
In [52]:
iowa2.head()
Out[52]:
qty-sold vol-sold-l sale-price excise 2015-pop 2016-pop
store year
2106 2015 99998 93986.47 1434369.85 478695.22 681512395.0 678474920.0
2016 23605 22277.95 337804.05 112662.94 164070271.0 163339016.0
2113 2015 6483 6500.83 85763.42 28693.18 50346112.0 50042609.0
2016 1703 1608.07 21736.63 7260.10 14537856.0 14450217.0
2130 2015 72562 65562.88 1108184.99 369675.74 527187551.0 524837896.0
In [53]:
iowa2 =iowa2.reset_index()
In [54]:
iowa2["pricepbottle"] = iowa2["sale-price"]/iowa2['qty-sold']
In [55]:
iowa2.head()
Out[55]:
store year qty-sold vol-sold-l sale-price excise 2015-pop 2016-pop pricepbottle
0 2106 2015 99998 93986.47 1434369.85 478695.22 681512395.0 678474920.0 14.343985
1 2106 2016 23605 22277.95 337804.05 112662.94 164070271.0 163339016.0 14.310699
2 2113 2015 6483 6500.83 85763.42 28693.18 50346112.0 50042609.0 13.228971
3 2113 2016 1703 1608.07 21736.63 7260.10 14537856.0 14450217.0 12.763729
4 2130 2015 72562 65562.88 1108184.99 369675.74 527187551.0 524837896.0 15.272250
In [56]:
iowa2.columns=[["store","year","qty-sold-pyr","vol-sold-l-pyr", "sales-pyr","excise-pyr","2015-pop","2016-pop", 
                "pricepbottle-pyr"]]
In [57]:
plt.hist(iowa2["pricepbottle-pyr"])
Out[57]:
(array([  2.67800000e+03,   1.60000000e+01,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          1.00000000e+00]),
 array([   3.52204482,   22.66984034,   41.81763585,   60.96543137,
          80.11322689,   99.26102241,  118.40881793,  137.55661345,
         156.70440896,  175.85220448,  195.        ]),
 <a list of 10 Patch objects>)
In [58]:
jtmsales = iowa.set_index(["date"])
In [59]:
len(jtmsales)
Out[59]:
2709552
In [60]:
jtmsales = jtmsales.ix["2015-01-01":"2015-03-31"]
/Users/nmolivo/anaconda/lib/python3.5/site-packages/ipykernel/__main__.py:1: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  if __name__ == '__main__':
In [61]:
jtmsales.head(5)
Out[61]:
store city zip cnty-num cnty-name vol-ml state-cost state-ret qty-sold sale-price vol-sold-l vol-sold-g cost sales-tax excise markup-pct 2015-pop 2016-pop year
date
2015-01-26 4742 COUNCIL BLUFFS 51503 78.0 Pottawattamie 1750 15.00 22.50 12 270.00 21.00 5.55 180.00 16.20 90.00 1.5 93534.0 93582.0 2015
2015-03-12 4847 LE CLAIRE 52753 82.0 Scott 500 6.83 10.25 1 10.25 0.50 0.13 6.83 0.62 3.42 1.5 172170.0 172474.0 2015
2015-01-15 4969 CLEAR LAKE 50428 17.0 Cerro Gordo 750 30.07 45.11 3 135.33 2.25 0.59 90.21 8.12 45.12 1.5 42988.0 43070.0 2015
2015-02-12 4269 TIPTON 52772 16.0 Cedar 750 7.88 11.82 5 59.10 3.75 0.99 39.40 3.55 19.70 1.5 18362.0 18454.0 2015
2015-01-12 4659 CONRAD 50621 38.0 Grundy 500 7.47 11.21 1 11.21 0.50 0.13 7.47 0.67 3.74 1.5 12413.0 12313.0 2015
In [62]:
jtmsales = pd.DataFrame(jtmsales.groupby(["store"])['sale-price','vol-sold-l','excise','qty-sold'].sum())
In [63]:
jtmsales = pd.DataFrame(jtmsales)
In [64]:
jtmsales.columns = ["Q1-2015-sales", "Q1-vol-sold-l","Q1-excise","Q1-qty-sold"]
In [65]:
jtmsales["Q1-avg-ppl"] = jtmsales["Q1-2015-sales"]/jtmsales["Q1-vol-sold-l"]
jtmsales["Q1-avg-ppbottle"] = jtmsales["Q1-2015-sales"]/jtmsales['Q1-qty-sold']
In [66]:
jtmsales.reset_index(inplace =True)

<font color ="red">Calculate the yearly liquor sales for each store using the provided data. You can add up the transactions for each year, and store sales in 2015 specifically will be used later as your target variable.</font>

<font color ="red">Use the data from 2015 to make a linear model using as many variables as you find useful to predict the yearly sales of all stores. You must use the sales from January to March as one of your variables.</font>

In [67]:
data2015 = iowa2[iowa2["year"]==2015].merge(jtmsales, on="store", how="left")
In [68]:
data2016 = iowa2[iowa2["year"]==2016]

keep in mind this is Q1 of 2016!!

Refine the data

Look for any statistical relationships, correlations, or other relevant properties of the dataset.

Now that we have the data in a format where total sales are grouped by store and year, we can check out statistical relationships

In [69]:
data2015.sort_values(["sales-pyr"], ascending=False).head(5)
Out[69]:
store year qty-sold-pyr vol-sold-l-pyr sales-pyr excise-pyr 2015-pop 2016-pop pricepbottle-pyr Q1-2015-sales Q1-vol-sold-l Q1-excise Q1-qty-sold Q1-avg-ppl Q1-avg-ppbottle
125 2633 2015 595205 582860.61 9839393.08 3282140.27 9.104150e+09 9.247670e+09 16.531100 2337324.42 137622.25 779717.83 139970.0 16.983623 16.698753
1033 4829 2015 520100 502306.92 8742779.31 2916236.61 8.499322e+09 8.633308e+09 16.809804 2097466.57 120327.92 699788.17 123797.0 17.431254 16.942790
36 2512 2015 282804 267489.84 4155665.47 1387461.81 2.063405e+09 2.091665e+09 14.694507 926706.76 58602.42 309449.18 62904.0 15.813455 14.732080
194 3385 2015 245312 273289.20 3947176.01 1318267.65 1.077640e+09 1.086804e+09 16.090432 1082136.01 72225.25 361323.62 67609.0 14.982794 16.005798
200 3420 2015 192664 237374.50 3422351.55 1143907.91 1.453733e+09 1.476650e+09 17.763316 977772.62 64351.25 326853.10 54266.0 15.194307 18.018144
In [70]:
# Highest gross sales is in stores 2633 and 4829, in DeMoines, IA. Polk cty.
# Next highest gross sales is in stores 3385, in Cedar Rapids, IA. Linn cty.
# Next highest gross sales is in store 2512, in Iowa City, IA. Johnson cty.

# Metropolitan areas dominate in liquor sales.
In [71]:
plt.hist(data2015["2015-pop"].dropna())
Out[71]:
(array([ 1280.,    54.,    13.,     7.,     2.,     3.,     4.,     0.,
            0.,     2.]),
 array([  1.43549000e+05,   9.10544144e+08,   1.82094474e+09,
          2.73134534e+09,   3.64174593e+09,   4.55214653e+09,
          5.46254712e+09,   6.37294772e+09,   7.28334831e+09,
          8.19374891e+09,   9.10414950e+09]),
 <a list of 10 Patch objects>)
In [72]:
#wow, it appears a lot of our variables have a right skew.

Build your models

Using scikit-learn or statsmodels, build the necessary models for your scenario. Evaluate model fit.

Use the data from 2015 to make a linear model using as many variables as you find useful to predict the yearly sales of all stores. You must use the sales from January to March as one of your variables.

In [73]:
# There appears to be some n/a values in Q1-2015-sales. This could be due to new stores opened since Q1 2015 OR due 
# to the fact we are working with 10% of the data
In [74]:
# By definition given by the prompt, we want a predictive model that can predict total year sales from the first 
# three month's of a year's data.
In [75]:
data_2015=data2015.dropna()
In [76]:
features = data_2015[["Q1-vol-sold-l","Q1-avg-ppbottle","Q1-2015-sales", "Q1-excise","2015-pop"]]
target = data_2015[["sales-pyr"]]
In [77]:
import statsmodels.api as sm

X = features
y = target
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model

model = sm.OLS(y, X).fit()
predictions = model.predict(X)

# Print out the statistics
model.summary()
Out[77]:
OLS Regression Results
Dep. Variable: sales-pyr R-squared: 0.983
Model: OLS Adj. R-squared: 0.983
Method: Least Squares F-statistic: 1.468e+04
Date: Mon, 23 Oct 2017 Prob (F-statistic): 0.00
Time: 07:11:35 Log-Likelihood: -15965.
No. Observations: 1273 AIC: 3.194e+04
Df Residuals: 1267 BIC: 3.197e+04
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -2.697e+04 7658.075 -3.522 0.000 -4.2e+04 -1.19e+04
Q1-vol-sold-l -15.8716 2.810 -5.649 0.000 -21.384 -10.359
Q1-avg-ppbottle 2974.7736 650.436 4.574 0.000 1698.724 4250.823
Q1-2015-sales -72.3797 11.799 -6.134 0.000 -95.527 -49.232
Q1-excise 232.0787 35.671 6.506 0.000 162.099 302.059
2015-pop 2.594e-05 4.1e-06 6.328 0.000 1.79e-05 3.4e-05
Omnibus: 788.323 Durbin-Watson: 1.969
Prob(Omnibus): 0.000 Jarque-Bera (JB): 117207.412
Skew: 1.856 Prob(JB): 0.00
Kurtosis: 49.861 Cond. No. 2.88e+09
In [78]:
# Q1 Quantity Sold was not statistically significant (p=.76) and thus was dropped from the model
In [79]:
plt.scatter(predictions, y, s=30, c='r', marker='+', zorder=10)
plt.xlabel("Predicted Values Annual Sales")
plt.ylabel("Actual Values Annual Sales")
plt.show()
print("MSE:", model.mse_model) ## mean squared error
MSE: 6.75048467764e+13
In [80]:
# Wow, an R^2 of .983 that is uncomfortably high. We'll see how this model validates. 
# MSE is pretty uninterpretable, as it seems on the line of the magnitude of most of our variables. Let's see if we 
# can reduce it. I presume this is a very variant model and can be improved.
In [81]:
#wow we're going to need to Regularize this data. Regularizing is a technique used to reduce over fitting.
In [82]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=1)
In [85]:
from sklearn.linear_model import Lasso 
lassoreg = Lasso(alpha=0.5, normalize=True, max_iter=10000)
lassoreg.fit(X_train, y_train)
print(lassoreg.coef_)
[ -6.41837047e+00   4.88259789e+03   4.35932363e+00   2.65599861e-01
   4.16695616e-05]
In [87]:
from sklearn.linear_model import LassoCV
In [89]:
# select the best alpha with LassoCV
lassoregcv = LassoCV(n_alphas=10, normalize=True, random_state=1) #n_alphas = 10 means it will generate 10 random values for alpha
lassoregcv.fit(X_train, y_train)
lassoregcv.alpha_
/Users/nmolivo/anaconda/lib/python3.5/site-packages/sklearn/linear_model/coordinate_descent.py:1094: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
Out[89]:
181.77409787129952
In [90]:
from sklearn.linear_model import Lasso 
lassoreg = Lasso(alpha=181.774, normalize=True)
lassoreg.fit(X_train, y_train)
print(lassoreg.coef_)
[  0.00000000e+00   3.22011861e+03   4.03002091e+00   2.99873999e-02
   3.73088093e-05]
In [91]:
from sklearn import metrics
y_pred = lassoreg.predict(X_test)
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print(metrics.r2_score(y_test,y_pred))
47260.2933702
0.980557382093
In [92]:
#this is a better MSE, still an outrageous R^2.
In [93]:
# The bias/variance trade-off:
# Based on our high R^2, and similarly high MSE, I think our model is extremely variant, meaning it is overfit 
# to the training data. We've made our model slightly more bias by using Lasso Regression, but we'll see how it 
# performs on 2016 data.
In [94]:
data2016.columns=[["store","year","qty-sold-q1","vol-sold-l-q1","sales-q1", "excise-q1","2015-pop","2016-pop",
                   "pricepbottle-q1"]]
In [95]:
feat_2016 = data2016[["vol-sold-l-q1","pricepbottle-q1","sales-q1","excise-q1","2016-pop"]].dropna()
In [96]:
y_pred = lassoreg.predict(feat_2016)
In [98]:
sum(y_pred)
Out[98]:
280759663.87898874

According to the Iowa Division of Alcoholic Beverages Annual Report 2016: "In Fiscal Year 2016, the Division saw another increase in liquor sales to more than $288.9 million."

not bad for this order of magnitude.

<font color = "red">Use your model for 2015 to estimate total sales in 2016, extrapolating from the sales so far for January to March of 2016.</font>

In [99]:
beta_2016 = data2016[["store","vol-sold-l-q1","pricepbottle-q1","sales-q1","excise-q1","2016-pop"]].dropna(subset=["vol-sold-l-q1","pricepbottle-q1","sales-q1","excise-q1","2016-pop"])
In [100]:
beta_2016["predicted-2016-sales"] = y_pred
In [101]:
beta_2016=beta_2016.merge(data_2015[["store","sales-pyr","Q1-2015-sales"]], on="store", how = "left")
In [102]:
beta_2016.head()
Out[102]:
store vol-sold-l-q1 pricepbottle-q1 sales-q1 excise-q1 2016-pop predicted-2016-sales sales-pyr Q1-2015-sales
0 2106 22277.95 14.310699 337804.05 112662.94 163339016.0 1.389715e+06 1434369.85 337166.53
1 2113 1608.07 12.763729 21736.63 7260.10 14450217.0 1.022599e+05 85763.42 22351.86
2 2130 18172.75 15.477121 306942.27 102365.79 138087256.0 1.267847e+06 1108184.99 277764.46
3 2152 1143.52 12.710018 13752.24 4604.93 13222490.0 6.978425e+04 72080.36 16805.11
4 2178 4606.64 12.985217 58939.90 19699.31 8316516.0 2.530472e+05 277987.96 54411.42
In [103]:
beta_2016.columns=[["store","Q1-2016-vol-sold-l","Q1-2016-ppbottle", "Q1-2016-sales-pyr","Q1-2016-excise",
                    "2016-pop","predicted-2016-sales","2015-sales-pyr","Q1-2015-sales"]]
In [104]:
beta_2016.head()
Out[104]:
store Q1-2016-vol-sold-l Q1-2016-ppbottle Q1-2016-sales-pyr Q1-2016-excise 2016-pop predicted-2016-sales 2015-sales-pyr Q1-2015-sales
0 2106 22277.95 14.310699 337804.05 112662.94 163339016.0 1.389715e+06 1434369.85 337166.53
1 2113 1608.07 12.763729 21736.63 7260.10 14450217.0 1.022599e+05 85763.42 22351.86
2 2130 18172.75 15.477121 306942.27 102365.79 138087256.0 1.267847e+06 1108184.99 277764.46
3 2152 1143.52 12.710018 13752.24 4604.93 13222490.0 6.978425e+04 72080.36 16805.11
4 2178 4606.64 12.985217 58939.90 19699.31 8316516.0 2.530472e+05 277987.96 54411.42

Wow, it appears by lasso regression, we predict higher sales for 2016 than in 2015. I brought in Q1 2015 sales to check. For the most part, differences in Q1 sales do correspond with annual sales.

Let's add county to our DF So we can plot sales by county later

In [105]:
county_sales=beta_2016.merge(iowa[["store","cnty-name"]], on = "store", how="left").drop_duplicates(subset="store")
In [106]:
county_sales.head()
Out[106]:
store Q1-2016-vol-sold-l Q1-2016-ppbottle Q1-2016-sales-pyr Q1-2016-excise 2016-pop predicted-2016-sales 2015-sales-pyr Q1-2015-sales cnty-name
0 2106 22277.95 14.310699 337804.05 112662.94 163339016.0 1.389715e+06 1434369.85 337166.53 Black Hawk
6334 2113 1608.07 12.763729 21736.63 7260.10 14450217.0 1.022599e+05 85763.42 22351.86 Webster
8088 2130 18172.75 15.477121 306942.27 102365.79 138087256.0 1.267847e+06 1108184.99 277764.46 Black Hawk
13076 2152 1143.52 12.710018 13752.24 4604.93 13222490.0 6.978425e+04 72080.36 16805.11 Cerro Gordo
14818 2178 4606.64 12.985217 58939.90 19699.31 8316516.0 2.530472e+05 277987.96 54411.42 Allamakee
In [107]:
county_sales=county_sales.groupby(["cnty-name"])["predicted-2016-sales"].sum()
In [108]:
county_sales = pd.DataFrame(county_sales)
In [109]:
county_sales.reset_index(inplace=True)
In [110]:
county_sales
Out[110]:
cnty-name predicted-2016-sales
0 Adair 3.017471e+05
1 Adams 1.017885e+05
2 Allamakee 8.096903e+05
3 Appanoose 7.797760e+05
4 Audubon 1.869861e+05
5 Benton 8.394049e+05
6 Black Hawk 1.608054e+07
7 Boone 1.488115e+06
8 Bremer 2.063157e+06
9 Buchanan 1.058477e+06
10 Buena Vista 1.704833e+06
11 Butler 3.708437e+05
12 Calhoun 4.341080e+05
13 Carroll 2.315461e+06
14 Cass 1.224245e+06
15 Cedar 5.839835e+05
16 Cerro Gordo 5.475566e+06
17 Cherokee 7.004633e+05
18 Chickasaw 4.202307e+05
19 Clarke 6.050323e+05
20 Clay 1.445532e+06
21 Clayton 6.869749e+05
22 Clinton 3.281362e+06
23 Crawford 1.213031e+06
24 Dallas 2.622205e+06
25 Davis 9.685968e+04
26 Decatur 2.428749e+05
27 Delaware 9.306534e+05
28 Des Moines 3.613035e+06
29 Dickinson 2.829081e+06
... ... ...
68 Montgomery 6.328656e+05
69 Muscatine 2.998875e+06
70 Osceola 2.514052e+05
71 Page 1.215185e+06
72 Palo Alto 8.625692e+05
73 Plymouth 1.473923e+06
74 Pocahontas 4.146384e+05
75 Polk 6.552570e+07
76 Pottawattamie 9.969459e+06
77 Poweshiek 1.369151e+06
78 Ringgold 1.734031e+05
79 Sac 5.740611e+05
80 Scott 2.006159e+07
81 Shelby 7.307334e+05
82 Sioux 1.502713e+06
83 Story 9.384399e+06
84 Tama 7.124299e+05
85 Taylor 1.482449e+05
86 Union 1.024780e+06
87 Van Buren 2.250904e+05
88 Wapello 2.504271e+06
89 Warren 2.292395e+06
90 Washington 1.624576e+06
91 Wayne 1.474964e+05
92 Webster 3.810195e+06
93 Winnebago 7.386695e+05
94 Winneshiek 1.270644e+06
95 Woodbury 1.009473e+07
96 Worth 3.134641e+05
97 Wright 8.074824e+05

98 rows × 2 columns

Plot your results

Again make sure that you record any valuable information. For example, in the tax scenario, did you find the sales from the first three months of the year to be a good predictor of the total sales for the year? Plot the predictions versus the true values and discuss the successes and limitations of your models

  • Q1 Sales is a strong positive predictor of sales for the entire year.

Plot using 2015 data of Q1 sales vs total sales

In [111]:
plt.scatter(data_2015["Q1-2015-sales"], data_2015["qty-sold-pyr"], s=30, c='r', marker='+', zorder=10)
plt.xlabel("2015 Sales: Q1")
plt.ylabel("Total 2015 Sales")
plt.show()
  • Model predicts actual values well
In [112]:
plt.scatter(predictions, y, s=30, c='r', marker='+', zorder=10)
plt.xlabel("Predicted Values Annual Sales")
plt.ylabel("Actual Values Annual Sales")
plt.show()
print("MSE:", model.mse_model) ## mean squared error
MSE: 6.75048467764e+13

feat_2016 = data2016[["vol-sold-l-q1","pricepbottle-q1","sales-q1","excise-q1","2016-pop"]].dropna()

Present the Results

Present your conclusions and results. If you have more than one interesting model feel free to include more than one along with a discussion. Use your work in this notebook to prepare your write-up.

In 2015, the following first quarter variables, by store, served as predictors in our model:

  • volume sold (liters)
  • avg. store price per bottle
  • sales
  • excise

As well as 2015 population were pretty good predictors for overall 2015 sales. While some of them were related, they indicate that a strong showing for the first three months of the year are highly indicative of liquor sale performance throughout the rest of the year. However, we also see that the model based on 2015 sales underpredictes reported 2016. This is unsurprising based on the model, as the first quarter sales of 2016 were lower than first quarter sales of 2015.

The difference between predicted and actual values for 2016 suggests that we may have overfit the model for 2015 and should examine additional factors. Based on exploritory analysis, one particular direction of exploration that I feel has merit is to examine types of liquor stores, and training the model to better handle spikes in sales later in the year like in June and December.

The possibility of overfitting is common in all models and is part of the bias/viarance trade off. While we can be confident first quarter sales matter, we should be cautious against considering it the only important predictor. In the case of tax legislation, it is preferrable to under predict than over predict liquor sales because it is a direct line to state revenue.

In [113]:
county_sales.sort_values(["predicted-2016-sales"], ascending=False,inplace=True)
In [114]:
county_sales[0:10].plot(x="cnty-name",y="predicted-2016-sales",kind="bar",figsize=(15,15),fontsize=15)
Out[114]:
<matplotlib.axes._subplots.AxesSubplot at 0x1120862e8>
In [ ]: