You are a data scientist in residence at the Iowa State tax board. The Iowa State legislature is considering changes in the liquor tax rates and wants a report of current liquor sales by county and projections for the rest of the year to inform their decision.
Goal for Scenario #1: Your task is as follows:
Class E liquor licenses: for grocery, liquor and convenience stores, etc. Allows the sale of alcoholic liquor for off-premises consumption in original unopened containers. No sales by the drink. Sunday sales are included. Also allows wholesale sales to on-premises Class A, B, C and D liquor licensees, but the holder must have a TTB Federal Wholesale Basic Permit.
Once you've chosen your scenario, download the data from the Iowa website in csv format, then load it with pandas. You may need to parse the date columns appropriately.
<font color="red">Problem Statement: We are interested in projecting the yearly sales for all stores in the state of Iowa for the purposes of evaluating liquor tax rates.
Iowa is a particular case in that it has the 6th highest tax rate on liquor in the United States. Liquor distribution in Iowa is heavily regulated, with some Prohibition-era laws still in place. While beer and wine regulations have been relaxed, liquor remains a sticking point for these Prohibition-style laws.</font>
Sources:
http://www.desmoinesregister.com/story/news/politics/2015/06/27/booze-bureaucracy/29371105/
http://qctimes.com/news/local/liquor-sales-up-in-iowa/article_aa9155fc-9de3-11df-bdf8-001cc4c03286.html
http://www.thegazette.com/subject/news/government/iowa-posts-record-year-for-liquor-wine-sales-20151201
import pandas as pd
from datetime import datetime
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
## Load the data into a DataFrame
iowa = pd.read_csv("iowa_liquor_sales_proj_2.csv", parse_dates=["Date"])  # parse_dates expects a list of column names
len(iowa)
iowa.head()
iowa.info()
columns_keep = ['Date', 'Store Number', 'City', 'Zip Code', 'County Number', 'County', 'Bottle Volume (ml)',
'State Bottle Cost', 'State Bottle Retail', 'Bottles Sold', 'Sale (Dollars)', 'Volume Sold (Liters)',
'Volume Sold (Gallons)']
iowa = iowa[columns_keep]
iowa["Date"] = pd.to_datetime(iowa["Date"], format="%m/%d/%Y") #capital y because year is 4 digits
#Remove dollar signs and convert to float
iowa["State Bottle Cost"] = iowa["State Bottle Cost"].str.replace('$', '').astype('float')
iowa["State Bottle Retail"] = iowa["State Bottle Retail"].str.replace('$', '').astype('float')
iowa["Sale (Dollars)"] = iowa["Sale (Dollars)"].str.replace('$', '').astype('float')
iowa.columns
#We're going to need to check zipcode/city/county for reasonability
#source: https://www.zip-codes.com/state/ia.asp
zips = pd.read_csv('iowazips.csv')
zips["City"] = zips['City'].str.upper() # make all cities capitalized so it matches our iowa data
zips.head()
#convert all zips to strings, as they are categorical
zips["Zip Code"] = zips["Zip Code"].astype('str')
zips.info()
#We merge our zipcode data with iowa in df a.
a = pd.merge(iowa,zips, on = "Zip Code",how="left")
a["do cities match"] = a["City_x"] == a["City_y"]
a["do cities match"].value_counts()
# Conducted a few spot checks; source y (iowazips.csv) is correct wherever it is non-null.
# We will merge the data accordingly and replace iowa cities with the source from df a when available.
# This will help with visualizing the data and analyzing outliers, which I suspect may be big college towns.
# We will test this suspicion later.
a[["City_x", "City_y","Zip Code","do cities match"]].sort_values("do cities match", ascending = True)
a[a["do cities match"] == False]["City_x"].value_counts()
# Trust that the zip code is accurate; back out city and county names from it.
for i, row in zips.iterrows():
    iowa.loc[iowa["Zip Code"] == row["Zip Code"], "City"] = row["City"]
    iowa.loc[iowa["Zip Code"] == row["Zip Code"], "County"] = row["County"]
# For any zip codes not in the zips data, match on city and back out the correct zip code.
for i, row in zips.iterrows():
    iowa.loc[iowa["City"] == row["City"], "Zip Code"] = row["Zip Code"]
    iowa.loc[iowa["City"] == row["City"], "County"] = row["County"]
# We have a few null values in "County Number", "Category" and "Category Name".
# I'm going to prioritize the accuracy of County Number, and will come back and amend Category and Category Name
# should they impact our model.
iowa["countynull"] = iowa["County Number"].isnull()
iowa[iowa["countynull"]==True]
len(iowa[iowa["countynull"]==True])
a = pd.DataFrame(iowa.groupby("County")["County Number"].value_counts())
a.columns= ["Counts"]
a.reset_index(inplace = True)
a.head(15)
# Every county is filled in, but some map to more than one county number and some map to
# a null county number. So we build a dictionary with keys = counties and values = county numbers and use it to
# fill in blank county numbers in our iowa df. Where a county maps to more than one county number, the loop below
# favors the county number with the most counts, on the assumption that the majority of transactions are
# categorized correctly. County number becomes relevant in a later data visualization.
a_dict = {}
for county in list(iowa["County"].unique()):
    x = a[a["County"] == county]
    if len(x) > 1:
        # More than one county number: keep the one with the most transactions
        maxi = max(list(x["Counts"]))
        y = list(x.loc[x["Counts"] == maxi, "County Number"])[0]
        a_dict[county] = y
    else:
        a_dict[county] = list(x["County Number"])[0]
countynum = pd.Series(a_dict, index = a_dict.keys())
countynum = pd.DataFrame(countynum, columns=["County Number"])
countynum["County Number"] = countynum["County Number"].astype(int)
countynum = countynum.reset_index()
countynum.columns=["County", "County Number"]
for i, row in countynum.iterrows():
    iowa.loc[iowa["County"] == row["County"], "County Number"] = row["County Number"]
iowa[iowa["County"]=="Benton"]
# great! no more duplicate county numbers! (see spotcheck on Benton, which used to have two county numbers)
# No nulls in county numbers!
# our only null values are in Category and Category name, which we expect.
iowa.info()
import numpy as np
#Are numbers reasonable?
iowa[["Bottle Volume (ml)","State Bottle Cost", "State Bottle Retail", "Bottles Sold", "Sale (Dollars)",
"Volume Sold (Liters)","Volume Sold (Gallons)"]].describe()
for i in ["Bottle Volume (ml)","State Bottle Cost", "State Bottle Retail", "Bottles Sold", "Sale (Dollars)",
"Volume Sold (Liters)","Volume Sold (Gallons)"]:
print(iowa[i].value_counts())
# Next we check that (1/1000)*"Bottle Volume (ml)" * "Bottles Sold" == "Volume Sold (Liters)"
a = iowa.copy()
a.columns
a["volume sold calc"] = np.round(((1/1000)*a["Bottle Volume (ml)"])*a["Bottles Sold"],2)
a["volume sold check"] = a["volume sold calc"] == a["Volume Sold (Liters)"]
a["volume sold check"].value_counts()
a["gallon calc"] = np.round(a["volume sold calc"]*0.264172, 2)
a["gallon sold check"] = a["gallon calc"] - a["Volume Sold (Gallons)"]
a["gallon sold check"].value_counts()
# Looks good! Volume sold (liters AND gallons) is documented correctly!
# Now we'd like to check that each store number maps to only one city.
a = pd.DataFrame(iowa.groupby(["Store Number"])["City"].value_counts())
a.columns = ["Transaction Counts"]
a.reset_index(inplace=True)
a["duplicate"] = a["Store Number"].duplicated()
a["duplicate"].value_counts()
a[a['duplicate']==True]
# This seems minor given the scale of the data, so we're going to set it aside for now, but it may be worth
# further investigation.
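If those duplicated store numbers ever do need resolving, the same majority-vote idea used for county numbers would apply; a hypothetical sketch, not run here:
# Assign each store the city it transacts in most often
store_city = iowa.groupby("Store Number")["City"].agg(lambda s: s.mode().iloc[0])
iowa["City"] = iowa["Store Number"].map(store_city)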
# Assumption 1: We've sufficiently amended erroneous zip and city info using an external source.
# Assumption 2: We were able to fill in missing or erroneous county number values by favoring records where
# the value was filled in. We assume that if a county had two county numbers in the data, the number carried by
# the majority of the records was the correct one.
# Risk 1: Misinformation in store number - there is no external source to validate these values.
# Risk 2: Category and Category Name are left as-is, with the understanding that they will be amended should they
# be relevant to our model predicting liquor sales.
# Risk 3: There is no way to determine store profit or anything in excess of what is taxed.
#Let's eliminate extra columns and rename our favorite features.
iowa.head(2)
iowacols = ["date", "store", "city", "zip", "cnty-num", "cnty-name", 'vol-ml', 'state-cost', 'state-ret', 'qty-sold',
'sale-price', 'vol-sold-l', 'vol-sold-g']
len(iowacols)
len(iowa.columns)
iowa = iowa[list(iowa.columns)[0:13]]
iowa.head(2)
iowa.columns = iowacols
# Finally, we add total cost, sales tax, and excise tax columns.
iowa["cost"] = np.round(iowa["state-cost"]*iowa['qty-sold'],2)
iowa['sales-tax'] = np.round(.06*iowa["sale-price"],2)
iowa["excise"] = np.round(iowa["sale-price"]-iowa['cost'],2)
iowa.head()
iowa.describe()
# Wait, why do we have a zero-dollar minimum excise?
iowa.loc[iowa["excise"] == 0, :]
# These don't look right. Let's see what the markup percentage should be.
iowa["markup-pct"] = np.round(iowa["state-ret"]/iowa['state-cost'], 2)
# The typical markup is 1.5, consistent with our research. Good.
iowa["markup-pct"].describe()
# But there's a markup percentage of 3.5 - what a can of worms we have here! Let's first fix the zero-excise rows.
# We can assume these were documented incorrectly, because it is actually illegal to sell liquor untaxed.
# Iowa collects a 50% markup on any liquor, so any markup other than 50% over cost can be assumed erroneous.
# source: http://statelaws.findlaw.com/iowa-law/iowa-consumer-tax-laws.html
iowa.loc[iowa["excise"] == 0, "state-ret"] = np.round(1.5*iowa.loc[iowa["excise"] == 0, "state-cost"], 2)
# Now to correct the markup percentage of 3.5, which is way too high!
iowa.loc[iowa["markup-pct"] == 3.5, :]
iowa.loc[iowa["markup-pct"] != 1.5, "state-ret"] = np.round(1.5*iowa.loc[iowa["markup-pct"] != 1.5, "state-cost"], 2)
# Keep the derived columns consistent with the corrected retail prices:
# sale-price is retail times quantity, and excise is the markup portion of the sale.
iowa["sale-price"] = np.round(iowa["state-ret"]*iowa["qty-sold"], 2)
iowa["excise"] = np.round(iowa["sale-price"]-iowa["cost"], 2)
# Phew - we would have completely missed those if we hadn't checked state legislation and the calculated columns!
# write data to csv for continued use
iowa.to_csv('iowa_clean.csv', index=False)
Perform some exploratory statistical analysis and make some plots, such as histograms of transaction totals, bottles sold, etc.
import pandas as pd
from datetime import datetime
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
iowa = pd.read_csv('iowa_clean.csv')
# source: http://www.collegecalc.org/colleges/iowa/four-year/
# I chose only schools with >1000 full-time students and with a ranking, as most schools without a ranking consist of
# students who are close to home, living at home, or working in addition to classes - there is less of a drinking culture.
# The largest university excluded on the grounds of no ranking is Kaplan University Davenport Campus, with
# 9231 students.
uni = pd.read_csv("iowa-uni.csv")
uni =uni[0:23]
schools = uni["School"]
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
<font color = "red">You do not have to run this if you also collected file "iowa-uni-coords.csv" from Natalie!</font>
driver = webdriver.Firefox()
wait = WebDriverWait(driver, 30)
coords = []
driver.get('https://www.google.com/maps')
for school in schools:
    searchbox = wait.until(EC.presence_of_element_located((By.ID, 'searchboxinput')))
    searchbox.clear()
    searchbox.send_keys(school + ' iowa')
    driver.find_element_by_id('searchbox-searchbutton').click()
    sleep(6)
    url = driver.current_url
    tries = 0
    while '@' not in url:
        tries += 1
        sleep(1)
        url = driver.current_url
        if tries == 5:  # try five times to give the url the chance to resolve
            break
    try:
        # The Maps url is ".../@lat,long,zoom", so the first element is latitude
        lat_long = url.split('@')[1].split(',')[:2]
        coords.append((school, lat_long[0], lat_long[1]))
    except IndexError:  # no '@' ever appeared; record the school with missing coordinates
        coords.append((school, None, None))
driver.quit()
coords
coords = pd.DataFrame(coords)
coords.to_csv("iowa-uni-coords.csv", index = False)
<font color ="red">Start here if you have csv file "iowa-uni-coords" collected from NMO.</font>
coords=pd.read_csv("iowa-uni-coords.csv")
coords.columns = ["School", "Latitude", "Longitude"]  # a flat list; nested brackets would create a MultiIndex
# Iowa State University is mapped wrong! How did that happen? Always double-check your web scraper.
# Correct coordinates: 42.0266187, -93.6486541
coords.head(20)
coords.loc[coords["School"]=="Iowa State University", "Latitude"] = 42.0266187
coords.loc[coords["School"]=="Iowa State University", "Longitude"] = -93.6486541
#let's import our visualization packages.
import seaborn as sns
import matplotlib.pyplot as plt
from bokeh.io import output_notebook, show
output_notebook()
a = pd.DataFrame(iowa.groupby("cnty-num")["excise"].sum())
a.reset_index(inplace=True)
a.head()
ia_liq = {}
for i in range(0, len(a)):
    key = int(a["cnty-num"][i])
    value = float(a["excise"][i])
    ia_liq[key] = value
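For what it's worth, the same lookup can be built without an explicit loop:
# Equivalent one-liner: zip the two columns into a dict
ia_liq = dict(zip(a["cnty-num"].astype(int), a["excise"].astype(float)))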
from bokeh.io import show
from bokeh.models import (
ColumnDataSource,
HoverTool,
LogColorMapper
)
from bokeh.palettes import Viridis6 as palette
from bokeh.plotting import figure
from bokeh.sampledata.us_counties import data as counties
#from bokeh.sampledata.unemployment import data as unemployment
palette.reverse()
counties = {
    code: county for code, county in counties.items() if county["state"] == "ia"
}
county_xs = [county["lons"] for county in counties.values()]
county_ys = [county["lats"] for county in counties.values()]
county_names = [county['name'] for county in counties.values()]
county_rates = [ia_liq[county_id[1]] if county_id[1] in ia_liq else None for county_id in counties]
color_mapper = LogColorMapper(palette=palette)
source = ColumnDataSource(data=dict(
x=county_xs,
y=county_ys,
name=county_names,
rate=county_rates,
))
town_xs = list(coords["Longitude"])
town_ys = list(coords["Latitude"])
source2 = ColumnDataSource(data=dict(x=town_xs, y=town_ys))
TOOLS = "pan,wheel_zoom,reset,hover,save"
p = figure(
title="Liquor Tax Revenue in Iowa", tools=TOOLS,
x_axis_location=None, y_axis_location=None
)
p.grid.grid_line_color = None
p.patches('x', 'y', source=source,
fill_color={'field': 'rate', 'transform': color_mapper},
fill_alpha=0.7, line_color="white", line_width=0.5)
p.circle('x','y', source = source2)
hover = p.select_one(HoverTool)
hover.point_policy = "follow_mouse"
hover.tooltips = [
("Name", "@name"),
("Total Excise Tax Revenue", "$@rate{1.11}"),
("(Long, Lat)", "($x, $y)"),
]
show(p)
# Not super impressed with the gradient of colors in bokeh, but that's ok. I mostly wanted to see if my college
# town theory was true, so I pulled in college town locations and plotted the lat/long on the map above. It appears
# not to make much of a difference. Perhaps population is the better indicator; Iowa natives hold their own with any
# major college town. A college-town indicator does not need to be a variable in our model.
# I'm very interested in the distribution of the excise tax. Here are histograms at varying x-scales.
# We see that among all transactions, there is a wide spread, all the way out to ~$125,000!! Wow, who got married?
plt.subplot(221)
plt.hist(iowa["excise"])
plt.subplot(222)
plt.hist(iowa["excise"], range = [0,500])
plt.subplot(223)
plt.hist(iowa["excise"], range = [0,200])
plt.subplot(224)
plt.hist(iowa["excise"], range = [0,70])
plt.show()
# Ok let's see the same thing grouped by county.
a=iowa.groupby(["cnty-name"])["excise"].sum()
a=pd.DataFrame(a)
plt.subplot(221)
plt.hist(a["excise"])
#plt.axis([40, 160, 0, 0.03])
plt.subplot(222)
plt.hist(a["excise"], range = [0,100000])
plt.subplot(223)
plt.hist(a["excise"], range = [0,50000])
plt.subplot(224)
plt.hist(a["excise"], range = [0,20000])
plt.show()
# interesting. skewed right, but when we zoom in on the bulk of the data it is less so
# Let's try to understand those outliers a little better. They skew so far!! We will do so in the "Refine Data" step.
import statsmodels.api as sm
from sklearn import linear_model
c=iowa.groupby("date")["excise"].sum()
c =pd.DataFrame(c)
c.index = pd.to_datetime(c.index)
c.head()
c["mo-date"] = c.index.month
c["yr-date"] = c.index.year
c.head()
c["shortdate"] = c.apply(lambda x: (str(int(x["mo-date"]))+"-"+str(int(x["yr-date"]))), axis=1)
c = pd.DataFrame(c.groupby(["shortdate"])["excise"].sum())
c.index = pd.to_datetime(c.index)
c.sort_values(["excise"])
plt.figure(figsize = (10,10))
plt.scatter(c.index, c.excise)
# When grouped by month, this appears to reflect seasonality. The two lowest datapoints are both January;
# there are spikes with growth up to a plateau range [~725000, ~825000] in the spring, plus spikes in June and December.
# Month is clearly influential, so we will use it as a dummy variable (see the sketch below). We will probably exclude
# March, as it is fairly representative of the plateau liquor sales settle into.
# We're definitely going to want to look at population. source:
# Ok, let's check out a scatter matrix for good measure.
# This takes forever to run on the full data... holding off for now.
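As flagged in the notes above, here is a minimal sketch of the month-dummy encoding (dropping March as the baseline "plateau" month; applied to the monthly frame c):
# One-hot encode month, with March as the omitted baseline category
month_dummies = pd.get_dummies(c.index.month, prefix="mo").drop(columns=["mo_3"])
month_dummies.head()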
from pandas.plotting import scatter_matrix
df = iowa[["vol-ml", 'state-cost', 'state-ret', 'qty-sold', 'sale-price', 'cost', 'excise']]
scatter_matrix(df, alpha=0.2, figsize=(15, 15), diagonal='kde')
pop = pd.read_csv("../county_pop_data.csv", sep = "\t")
pop = pop[["Geographic name", "7/1/2015", "7/1/2016"]]
pop.columns = ["cnty-name", "2015-pop", "2016-pop"]
pop["cnty-name"]=[" ".join(x.split()[:-1]) for x in pop["cnty-name"]]
iowa = iowa.merge(pop, on= "cnty-name", how = "left")
import numpy as np
plt.scatter(np.log(iowa["2015-pop"]), iowa["excise"])
#Looks like a correlation! We're going to need to scale our data, for sure.
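A minimal sketch of the scaling step flagged here, using scikit-learn's StandardScaler (shown for illustration; any regularized fit later would want this applied to its feature matrix):
from sklearn.preprocessing import StandardScaler
# Standardize the skewed numeric columns to zero mean / unit variance
scaler = StandardScaler()
scaled = scaler.fit_transform(iowa[["2015-pop", "excise"]].dropna())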
Be sure to write out any observations from your exploratory analysis.
Per the prompt, we are interested in projecting sales for the following year. In my opinion, the tax board would really be interested in the excise tax; by asking for sales instead, the prompt implies that everyone in the room can do the excise tax calculation in their heads and would therefore like everything in terms of sales rather than excise tax.
Since cost is perfectly collinear with sales, we will not include it in the model (a quick check below).
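A quick check of that collinearity claim (with a fixed 1.5 markup, sale-price is exactly 1.5x cost):
iowa[["sale-price", "cost"]].corr()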
Month is definitely worth including in the model. The prompt specifically asks for sales from January to March per store, likely for this reason.
Population is definitely worth including in the model, though it performs better after regularization.
One avenue the Tax Board may take to increase tax revenue is a by-the-gallon or by-the-liter tax rather than a 50% markup on cost; a back-of-the-envelope version follows.
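To make that alternative concrete, here is a rough calculation of the per-liter rate that would have matched current excise revenue (illustrative only):
# Revenue-neutral rate under a volume-based tax
per_liter_rate = iowa["excise"].sum() / iowa["vol-sold-l"].sum()
print("Revenue-equivalent rate: $%.2f per liter" % per_liter_rate)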
Now you are ready to compute the variables you will use for your regression from the data. For example, you may want to compute total sales per store from Jan to March of 2015, mean price per bottle, etc. Refer to the readme for more ideas appropriate to your scenario.
Pandas is your friend for this task. Take a look at the operations here for ideas on how to make the best use of pandas and feel free to search for blog and Stack Overflow posts to help you group data by certain variables and compute sums, means, etc. You may find it useful to create a new data frame to house this summary data.
iowa.head()
iowa["date"]=pd.to_datetime(iowa["date"])
iowa["year"]= iowa.apply(lambda x: x["date"].year, axis =1)
# iowa2 will be the data grouped by store and year.
# Keep in mind that qty-sold will now be qty-sold per that year, for that store.
# Note that 2015-pop and 2016-pop are summed too, so they become county population times the store's
# transaction count - effectively a store-activity proxy rather than raw population.
iowa2 = iowa.groupby(["store","year"])[["qty-sold", 'vol-sold-l', 'sale-price', 'excise', "2015-pop", "2016-pop"]].sum()
iowa2.head()
iowa2 =iowa2.reset_index()
iowa2["pricepbottle"] = iowa2["sale-price"]/iowa2['qty-sold']
iowa2.head()
iowa2.columns = ["store", "year", "qty-sold-pyr", "vol-sold-l-pyr", "sales-pyr", "excise-pyr", "2015-pop", "2016-pop",
                 "pricepbottle-pyr"]
plt.hist(iowa2["pricepbottle-pyr"])
jtmsales = iowa.set_index(["date"])
len(jtmsales)
jtmsales = jtmsales.loc["2015-01-01":"2015-03-31"]  # .ix is deprecated; .loc slices the DatetimeIndex
jtmsales.head(5)
jtmsales = jtmsales.groupby(["store"])[['sale-price', 'vol-sold-l', 'excise', 'qty-sold']].sum()
jtmsales.columns = ["Q1-2015-sales", "Q1-vol-sold-l","Q1-excise","Q1-qty-sold"]
jtmsales["Q1-avg-ppl"] = jtmsales["Q1-2015-sales"]/jtmsales["Q1-vol-sold-l"]
jtmsales["Q1-avg-ppbottle"] = jtmsales["Q1-2015-sales"]/jtmsales['Q1-qty-sold']
jtmsales.reset_index(inplace =True)
<font color ="red">Calculate the yearly liquor sales for each store using the provided data. You can add up the transactions for each year, and store sales in 2015 specifically will be used later as your target variable.</font>
<font color ="red">Use the data from 2015 to make a linear model using as many variables as you find useful to predict the yearly sales of all stores. You must use the sales from January to March as one of your variables.</font>
data2015 = iowa2[iowa2["year"]==2015].merge(jtmsales, on="store", how="left")
data2016 = iowa2[iowa2["year"]==2016]
Keep in mind that the 2016 data only runs through March, so these "yearly" 2016 totals are really Q1 2016 totals!!
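A quick way to confirm that coverage is to check the span of 2016 dates present in the data:
# The 2016 records should stop at the end of March
iowa.loc[iowa["year"] == 2016, "date"].agg(["min", "max"])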
Look for any statistical relationships, correlations, or other relevant properties of the dataset.
Now that we have the data in a format where total sales are grouped by store and year, we can check out statistical relationships
data2015.sort_values(["sales-pyr"], ascending=False).head(5)
# Highest gross sales are in stores 2633 and 4829, in Des Moines, IA (Polk county).
# Next highest gross sales are in store 3385, in Cedar Rapids, IA (Linn county).
# Next highest gross sales are in store 2512, in Iowa City, IA (Johnson county).
# Metropolitan areas dominate liquor sales.
plt.hist(data2015["2015-pop"].dropna())
#wow, it appears a lot of our variables have a right skew.
Using scikit-learn or statsmodels, build the necessary models for your scenario. Evaluate model fit.
Use the data from 2015 to make a linear model using as many variables as you find useful to predict the yearly sales of all stores. You must use the sales from January to March as one of your variables.
# There appear to be some n/a values in Q1-2015-sales. This could be due to new stores opened since Q1 2015 OR due
# to the fact that we are working with 10% of the data.
# By the definition given in the prompt, we want a predictive model that can predict total yearly sales from the
# first three months of a year's data.
data_2015=data2015.dropna()
features = data_2015[["Q1-vol-sold-l","Q1-avg-ppbottle","Q1-2015-sales", "Q1-excise","2015-pop"]]
target = data_2015[["sales-pyr"]]
import statsmodels.api as sm
X = features
y = target
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
# Print out the statistics
model.summary()
# Q1 Quantity Sold was not statistically significant (p=.76) and thus was dropped from the model
plt.scatter(predictions, y, s=30, c='r', marker='+', zorder=10)
plt.xlabel("Predicted Values Annual Sales")
plt.ylabel("Actual Values Annual Sales")
plt.show()
print("MSE:", model.mse_model) ## mean squared error
# Wow, an R^2 of .983 - that is uncomfortably high. We'll see how this model validates.
# The MSE is pretty uninterpretable, as it is on the order of magnitude of most of our variables. Let's see if we
# can reduce it. I presume this is a high-variance model and can be improved.
# We're going to need to regularize this data. Regularization is a technique used to reduce overfitting.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=1)
from sklearn.linear_model import Lasso
lassoreg = Lasso(alpha=0.5, normalize=True, max_iter=10000)
lassoreg.fit(X_train, y_train)
print(lassoreg.coef_)
from sklearn.linear_model import LassoCV
# select the best alpha with LassoCV
lassoregcv = LassoCV(n_alphas=10, normalize=True, random_state=1)  # n_alphas=10 means it will try 10 alphas along the regularization path (not random values)
lassoregcv.fit(X_train, y_train)
lassoregcv.alpha_
lassoreg = Lasso(alpha=181.774, normalize=True)  # alpha taken from lassoregcv.alpha_ above; Lasso already imported
lassoreg.fit(X_train, y_train)
print(lassoreg.coef_)
from sklearn import metrics
y_pred = lassoreg.predict(X_test)
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print(metrics.r2_score(y_test,y_pred))
# This is a better MSE, but still an outrageously high R^2.
# The bias/variance trade-off:
# Based on our high R^2 and similarly high MSE, I think our model has very high variance, meaning it is overfit
# to the training data. We've made the model slightly more biased by using Lasso regression, but we'll see how it
# performs on 2016 data.
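One way to put a number on that variance concern is k-fold cross-validation on the 2015 data; a sketch:
from sklearn.model_selection import cross_val_score
# R^2 across 5 folds; a wide spread would support the high-variance diagnosis
scores = cross_val_score(lassoreg, features, target.values.ravel(), cv=5, scoring="r2")
print(scores.mean(), scores.std())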
data2016.columns = ["store", "year", "qty-sold-q1", "vol-sold-l-q1", "sales-q1", "excise-q1", "2015-pop", "2016-pop",
                    "pricepbottle-q1"]
feat_2016 = data2016[["vol-sold-l-q1","pricepbottle-q1","sales-q1","excise-q1","2016-pop"]].dropna()
y_pred = lassoreg.predict(feat_2016)
sum(y_pred)
According to the Iowa Division of Alcoholic Beverages Annual Report 2016:
"In Fiscal Year 2016, the Division saw another increase in liquor sales to more than $288.9 million."
not bad for this order of magnitude.
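For a rough sense of scale (remembering we are working from roughly a 10% sample of transactions), the ratio of our predicted total to the reported figure:
# Compare the model's summed 2016 prediction to the reported $288.9M
print(y_pred.sum() / 288.9e6)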
<font color = "red">Use your model for 2015 to estimate total sales in 2016, extrapolating from the sales so far for January to March of 2016.</font>
beta_2016 = data2016[["store","vol-sold-l-q1","pricepbottle-q1","sales-q1","excise-q1","2016-pop"]].dropna(subset=["vol-sold-l-q1","pricepbottle-q1","sales-q1","excise-q1","2016-pop"])
beta_2016["predicted-2016-sales"] = y_pred
beta_2016=beta_2016.merge(data_2015[["store","sales-pyr","Q1-2015-sales"]], on="store", how = "left")
beta_2016.head()
beta_2016.columns = ["store", "Q1-2016-vol-sold-l", "Q1-2016-ppbottle", "Q1-2016-sales-pyr", "Q1-2016-excise",
                     "2016-pop", "predicted-2016-sales", "2015-sales-pyr", "Q1-2015-sales"]
beta_2016.head()
Wow - by Lasso regression, we predict higher sales for 2016 than in 2015. I brought in Q1 2015 sales to check: for the most part, differences in Q1 sales do correspond with differences in annual sales.
Let's add county to our DataFrame so we can plot sales by county later.
county_sales=beta_2016.merge(iowa[["store","cnty-name"]], on = "store", how="left").drop_duplicates(subset="store")
county_sales.head()
county_sales=county_sales.groupby(["cnty-name"])["predicted-2016-sales"].sum()
county_sales = pd.DataFrame(county_sales)
county_sales.reset_index(inplace=True)
county_sales
Again, make sure that you record any valuable information. For example, in the tax scenario, did you find the sales from the first three months of the year to be a good predictor of the total sales for the year? Plot the predictions versus the true values and discuss the successes and limitations of your models.
Plot, using 2015 data, of Q1 sales vs. total annual sales:
plt.scatter(data_2015["Q1-2015-sales"], data_2015["sales-pyr"], s=30, c='r', marker='+', zorder=10)
plt.xlabel("2015 Sales: Q1")
plt.ylabel("Total 2015 Sales")
plt.show()
plt.scatter(predictions, y, s=30, c='r', marker='+', zorder=10)
plt.xlabel("Predicted Values Annual Sales")
plt.ylabel("Actual Values Annual Sales")
plt.show()
print("MSE:", model.mse_model) ## mean squared error
Present your conclusions and results. If you have more than one interesting model feel free to include more than one along with a discussion. Use your work in this notebook to prepare your write-up.
In 2015, the following first-quarter variables, by store, served as predictors in our model: Q1 volume sold (liters), Q1 average price per bottle, Q1 sales, and Q1 excise tax.
These, along with 2015 population, were pretty good predictors of overall 2015 sales. While some of them are related to one another, they indicate that a strong showing in the first three months of the year is highly indicative of liquor sale performance throughout the rest of the year. However, we also see that the model based on 2015 underpredicts reported 2016 sales. This is unsurprising given the model, as the first-quarter sales of 2016 were lower than the first-quarter sales of 2015.
The difference between predicted and actual values for 2016 suggests that we may have overfit the model to 2015 and should examine additional factors. Based on the exploratory analysis, one direction I feel has merit is to examine types of liquor stores, and to train the model to better handle spikes in sales later in the year, like those in June and December.
The possibility of overfitting is common to all models and is part of the bias/variance trade-off. While we can be confident that first-quarter sales matter, we should be cautious about treating them as the only important predictor. In the case of tax legislation, it is preferable to underpredict than to overpredict liquor sales, because they are a direct line to state revenue.
county_sales.sort_values(["predicted-2016-sales"], ascending=False,inplace=True)
county_sales[0:10].plot(x="cnty-name",y="predicted-2016-sales",kind="bar",figsize=(15,15),fontsize=15)