You are a data scientist in residence at the Iowa State tax board. The Iowa State legislature is considering changes in the liquor tax rates and wants a report of current liquor sales by county and projections for the rest of the year to inform their decision.
Goal for Scenario #1: Your task is as follows:
Class E liquor licenses: for grocery, liquor and convenience stores, etc. Allows the sale of alcoholic liquor for off-premises consumption in original unopened containers. No sales by the drink. Sunday sales are included. Also allows wholesale sales to on-premises Class A, B, C and D liquor licensees, but the holder must have a TTB Federal Wholesale Basic Permit.
Once you've chosen your scenario, download the data from the Iowa website in csv format, then load it with pandas. You may need to parse the date columns appropriately.
<font color="red">Problem Statement: We are interested in projecting the yearly sales for all stores in the state of Iowa for the purposes of evaluating liquor tax rates.
Iowa is a particular case in that it has the 6th highest tax rate on liquor in the United States. Liquor distribution in Iowa is heavily regulated, with some Prohibition-era laws still in place. While beer and wine regulations have been relaxed, liquor remains a sticking point for these Prohibition-style laws.</font>
Sources:
http://www.desmoinesregister.com/story/news/politics/2015/06/27/booze-bureaucracy/29371105/
http://qctimes.com/news/local/liquor-sales-up-in-iowa/article_aa9155fc-9de3-11df-bdf8-001cc4c03286.html
http://www.thegazette.com/subject/news/government/iowa-posts-record-year-for-liquor-wine-sales-20151201
import pandas as pd
from datetime import datetime
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
## Load the data into a DataFrame
iowa = pd.read_csv("iowa_liquor_sales_proj_2.csv", parse_dates=["Date"])  # parse_dates expects a list of column names
len(iowa)
iowa.head()
iowa.info()
columns_keep = ['Date', 'Store Number', 'City', 'Zip Code', 'County Number', 'County', 'Bottle Volume (ml)',
'State Bottle Cost', 'State Bottle Retail', 'Bottles Sold', 'Sale (Dollars)', 'Volume Sold (Liters)',
'Volume Sold (Gallons)']
iowa = iowa[columns_keep]
iowa["Date"] = pd.to_datetime(iowa["Date"], format="%m/%d/%Y") #capital y because year is 4 digits
#Remove dollar signs and convert to float
iowa["State Bottle Cost"] = iowa["State Bottle Cost"].str.replace('$', '').astype('float')
iowa["State Bottle Retail"] = iowa["State Bottle Retail"].str.replace('$', '').astype('float')
iowa["Sale (Dollars)"] = iowa["Sale (Dollars)"].str.replace('$', '').astype('float')
iowa.columns
#We're going to need to check zipcode/city/county for reasonability
#source: https://www.zip-codes.com/state/ia.asp
zips = pd.read_csv('iowazips.csv')
zips["City"] = zips['City'].str.upper() # make all cities capitalized so it matches our iowa data
zips.head()
#convert all zips to strings, as they are categorical
zips["Zip Code"] = zips["Zip Code"].astype('str')
zips.info()
#We merge our zipcode data with iowa in df a.
a = pd.merge(iowa,zips, on = "Zip Code",how="left")
a["do cities match"] = a["City_x"] == a["City_y"]
a["do cities match"].value_counts()
# Conducted a few spot checks; source y (iowazips.csv) is correct wherever it is non-null.
# We will merge the data accordingly and replace iowa cities with the source from df a when available.
# This will help with visualizing the data and analyzing outliers, which I suspect may be big college towns.
# We will test this suspicion later.
a[["City_x", "City_y","Zip Code","do cities match"]].sort_values("do cities match", ascending = True)
a[a["do cities match"] == False]["City_x"].value_counts()
# Trust that the zip code is accurate; back out city and county names from it.
for i, row in zips.iterrows():
    iowa.loc[iowa["Zip Code"] == row["Zip Code"], "City"] = row["City"]
    iowa.loc[iowa["Zip Code"] == row["Zip Code"], "County"] = row["County"]
# For any zip codes not in the zips data, match on city and back out the correct zip code.
for i, row in zips.iterrows():
    iowa.loc[iowa["City"] == row["City"], "Zip Code"] = row["Zip Code"]
    iowa.loc[iowa["City"] == row["City"], "County"] = row["County"]
# We have a few null values in "County Number", "Category" and "Category Name".
# I'm going to prioritize the accuracy of County Number, and will come back and amend Category and Category Name
# should they impact our model.
iowa["countynull"] = iowa["County Number"].isnull()
iowa[iowa["countynull"]==True]
len(iowa[iowa["countynull"]==True])
a = pd.DataFrame(iowa.groupby("County")["County Number"].value_counts())
a.columns= ["Counts"]
a.reset_index(inplace = True)
a.head(15)
# Every county is filled in, but some map to more than one county number and some map to
# a null county number. So we build a dictionary with keys = counties and values = county numbers and use it to
# fill in blank county numbers in our iowa df. Where a county maps to more than one county number, the loop below
# favors the county number with the most counts, on the assumption that the majority of transactions are
# categorized correctly. County number becomes relevant in a later data visualization.
a_dict = {}
for county in list(iowa["County"].unique()):
    x = a[a["County"] == county]
    if len(x) > 1:
        # More than one county number: keep the one with the most transactions
        maxi = max(list(x["Counts"]))
        y = list(x.loc[x["Counts"] == maxi, "County Number"])[0]
        a_dict[county] = y
    else:
        a_dict[county] = list(x["County Number"])[0]
countynum = pd.Series(a_dict, index = a_dict.keys())
countynum = pd.DataFrame(countynum, columns=["County Number"])
countynum["County Number"] = countynum["County Number"].astype(int)
countynum = countynum.reset_index()
countynum.columns=["County", "County Number"]
for i, row in countynum.iterrows():
    iowa.loc[iowa["County"] == row["County"], "County Number"] = row["County Number"]
iowa[iowa["County"]=="Benton"]
# great! no more duplicate county numbers! (see spotcheck on Benton, which used to have two county numbers)
# No nulls in county numbers!
# our only null values are in Category and Category name, which we expect.
iowa.info()
import numpy as np
#Are numbers reasonable?
iowa[["Bottle Volume (ml)","State Bottle Cost", "State Bottle Retail", "Bottles Sold", "Sale (Dollars)",
"Volume Sold (Liters)","Volume Sold (Gallons)"]].describe()
for i in ["Bottle Volume (ml)","State Bottle Cost", "State Bottle Retail", "Bottles Sold", "Sale (Dollars)",
"Volume Sold (Liters)","Volume Sold (Gallons)"]:
print(iowa[i].value_counts())
# Next we check that (1/1000)*"Bottle Volume (ml)" * "Bottles Sold" == "Volume Sold (Liters)"
a = iowa.copy()
a.columns
a["volume sold calc"] = np.round(((1/1000)*a["Bottle Volume (ml)"])*a["Bottles Sold"],2)
a["volume sold check"] = a["volume sold calc"] == a["Volume Sold (Liters)"]
a["volume sold check"].value_counts()
a["gallon calc"] = np.round(a["volume sold calc"]*0.264172, 2)
a["gallon sold check"] = a["gallon calc"] - a["Volume Sold (Gallons)"]
a["gallon sold check"].value_counts()
# Looks good! Volume sold (liters AND gallons) is documented correctly!
# Now we'd like to check that each store number maps to only one city.
a = pd.DataFrame(iowa.groupby(["Store Number"])["City"].value_counts())
a.columns = ["Transaction Counts"]
a.reset_index(inplace=True)
a["duplicate"] = a["Store Number"].duplicated()
a["duplicate"].value_counts()
a[a['duplicate']==True]
# This seems minor given the scale of the data, so we're going to set it aside for now, but it may be worth
# further investigation.
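If those duplicated store numbers ever do need resolving, the same majority-vote idea used for county numbers would apply; a hypothetical sketch, not run here:
# Assign each store the city it transacts in most often
store_city = iowa.groupby("Store Number")["City"].agg(lambda s: s.mode().iloc[0])
iowa["City"] = iowa["Store Number"].map(store_city)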
# Assumption 1: We've sufficiently amended erroneous zip and city info using an external source.
# Assumption 2: We were able to fill in missing or erroneous county number values by favoring records where
# the value was filled in. We assume that if a county had two county numbers in the data, the number carried by
# the majority of the records was the correct one.
# Risk 1: Misinformation in store number - there is no external source to validate these values.
# Risk 2: Category and Category Name are left as-is, with the understanding that they will be amended should they
# be relevant to our model predicting liquor sales.
# Risk 3: There is no way to determine store profit or anything in excess of what is taxed.
#Let's eliminate extra columns and rename our favorite features.
iowa.head(2)
iowacols = ["date", "store", "city", "zip", "cnty-num", "cnty-name", 'vol-ml', 'state-cost', 'state-ret', 'qty-sold',
'sale-price', 'vol-sold-l', 'vol-sold-g']
len(iowacols)
len(iowa.columns)
iowa = iowa[list(iowa.columns)[0:13]]
iowa.head(2)
iowa.columns = iowacols
# Finally, we add total cost, sales tax, and excise tax columns.
iowa["cost"] = np.round(iowa["state-cost"]*iowa['qty-sold'],2)
iowa['sales-tax'] = np.round(.06*iowa["sale-price"],2)
iowa["excise"] = np.round(iowa["sale-price"]-iowa['cost'],2)
iowa.head()
iowa.describe()
# Wait, why do we have a zero-dollar minimum excise?
iowa.loc[iowa["excise"] == 0, :]
# These don't look right. Let's see what the markup percentage should be.
iowa["markup-pct"] = np.round(iowa["state-ret"]/iowa['state-cost'], 2)
# The typical markup is 1.5, consistent with our research. Good.
iowa["markup-pct"].describe()
# But there's a markup percentage of 3.5 - what a can of worms we have here! Let's first fix the zero-excise rows.
# We can assume these were documented incorrectly, because it is actually illegal to sell liquor untaxed.
# Iowa collects a 50% markup on any liquor, so any markup other than 50% over cost can be assumed erroneous.
# source: http://statelaws.findlaw.com/iowa-law/iowa-consumer-tax-laws.html
iowa.loc[iowa["excise"] == 0, "state-ret"] = np.round(1.5*iowa.loc[iowa["excise"] == 0, "state-cost"], 2)
# Now to correct the markup percentage of 3.5, which is way too high!
iowa.loc[iowa["markup-pct"] == 3.5, :]
iowa.loc[iowa["markup-pct"] != 1.5, "state-ret"] = np.round(1.5*iowa.loc[iowa["markup-pct"] != 1.5, "state-cost"], 2)
# Keep the derived columns consistent with the corrected retail prices:
# sale-price is retail times quantity, and excise is the markup portion of the sale.
iowa["sale-price"] = np.round(iowa["state-ret"]*iowa["qty-sold"], 2)
iowa["excise"] = np.round(iowa["sale-price"]-iowa["cost"], 2)
# Phew - we would have completely missed those if we hadn't checked state legislation and the calculated columns!
# write data to csv for continued use
iowa.to_csv('iowa_clean.csv', index=False)
Perform some exploratory statistical analysis and make some plots, such as histograms of transaction totals, bottles sold, etc.
import pandas as pd
from datetime import datetime
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
iowa = pd.read_csv('iowa_clean.csv')
# source: http://www.collegecalc.org/colleges/iowa/four-year/
# I chose only schools with >1000 full-time students and with a ranking, as most schools without a ranking consist of
# students who are close to home, living at home, or working in addition to classes - there is less of a drinking culture.
# The largest university excluded on the grounds of no ranking is Kaplan University Davenport Campus, with
# 9231 students.
uni = pd.read_csv("iowa-uni.csv")
uni =uni[0:23]
schools = uni["School"]
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
<font color = "red">You do not have to run this if you also collected file "iowa-uni-coords.csv" from Natalie!</font>
driver = webdriver.Firefox()
wait = WebDriverWait(driver, 30)
coords = []
driver.get('https://www.google.com/maps')
for school in schools:
    searchbox = wait.until(EC.presence_of_element_located((By.ID, 'searchboxinput')))
    searchbox.clear()
    searchbox.send_keys(school + ' iowa')
    driver.find_element_by_id('searchbox-searchbutton').click()
    sleep(6)
    url = driver.current_url
    tries = 0
    while '@' not in url:
        tries += 1
        sleep(1)
        url = driver.current_url
        if tries == 5:  # try five times to give the url the chance to resolve
            break
    try:
        # The Maps url is ".../@lat,long,zoom", so the first element is latitude
        lat_long = url.split('@')[1].split(',')[:2]
        coords.append((school, lat_long[0], lat_long[1]))
    except IndexError:  # no '@' ever appeared; record the school with missing coordinates
        coords.append((school, None, None))
driver.quit()
coords
coords = pd.DataFrame(coords)
coords.to_csv("iowa-uni-coords.csv", index = False)
<font color ="red">Start here if you have csv file "iowa-uni-coords" collected from NMO.</font>
coords=pd.read_csv("iowa-uni-coords.csv")
coords.columns = ["School", "Latitude", "Longitude"]  # a flat list; nested brackets would create a MultiIndex
# Iowa State University is mapped wrong! How did that happen? Always double-check your web scraper.
# Correct coordinates: 42.0266187, -93.6486541
coords.head(20)
coords.loc[coords["School"]=="Iowa State University", "Latitude"] = 42.0266187
coords.loc[coords["School"]=="Iowa State University", "Longitude"] = -93.6486541
#let's import our visualization packages.
import seaborn as sns
import matplotlib.pyplot as plt
from bokeh.io import output_notebook, show
output_notebook()
a = pd.DataFrame(iowa.groupby("cnty-num")["excise"].sum())
a.reset_index(inplace=True)
a.head()
ia_liq = {}
for i in range(0, len(a)):
    key = int(a["cnty-num"][i])
    value = float(a["excise"][i])
    ia_liq[key] = value
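For what it's worth, the same lookup can be built without an explicit loop:
# Equivalent one-liner: zip the two columns into a dict
ia_liq = dict(zip(a["cnty-num"].astype(int), a["excise"].astype(float)))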
from bokeh.io import show
from bokeh.models import (
ColumnDataSource,
HoverTool,
LogColorMapper
)
from bokeh.palettes import Viridis6 as palette
from bokeh.plotting import figure
from bokeh.sampledata.us_counties import data as counties
#from bokeh.sampledata.unemployment import data as unemployment
palette.reverse()
counties = {
    code: county for code, county in counties.items() if county["state"] == "ia"
}
county_xs = [county["lons"] for county in counties.values()]
county_ys = [county["lats"] for county in counties.values()]
county_names = [county['name'] for county in counties.values()]
county_rates = [ia_liq[county_id[1]] if county_id[1] in ia_liq else None for county_id in counties]
color_mapper = LogColorMapper(palette=palette)
source = ColumnDataSource(data=dict(
x=county_xs,
y=county_ys,
name=county_names,
rate=county_rates,
))
town_xs = list(coords["Longitude"])
town_ys = list(coords["Latitude"])
source2 = ColumnDataSource(data=dict(x=town_xs, y=town_ys))
TOOLS = "pan,wheel_zoom,reset,hover,save"
p = figure(
title="Liquor Tax Revenue in Iowa", tools=TOOLS,
x_axis_location=None, y_axis_location=None
)
p.grid.grid_line_color = None
p.patches('x', 'y', source=source,
fill_color={'field': 'rate', 'transform': color_mapper},
fill_alpha=0.7, line_color="white", line_width=0.5)
p.circle('x','y', source = source2)
hover = p.select_one(HoverTool)
hover.point_policy = "follow_mouse"
hover.tooltips = [
("Name", "@name"),
("Total Excise Tax Revenue", "$@rate{1.11}"),
("(Long, Lat)", "($x, $y)"),
]
show(p)
# Not super impressed with the gradient of colors in bokeh, but that's ok. I mostly wanted to see if my college
# town theory was true, so I pulled in college town locations and plotted the lat/long on the map above. It appears
# not to make much of a difference. Perhaps population is the better indicator; Iowa natives hold their own with any
# major college town. A college-town indicator does not need to be a variable in our model.
# I'm very interested in the distribution of the excise tax. Here are histograms at varying x-scales.
# We see that among all transactions, there is a wide spread, all the way out to ~$125,000!! Wow, who got married?
plt.subplot(221)
plt.hist(iowa["excise"])
plt.subplot(222)
plt.hist(iowa["excise"], range = [0,500])
plt.subplot(223)
plt.hist(iowa["excise"], range = [0,200])
plt.subplot(224)
plt.hist(iowa["excise"], range = [0,70])
plt.show()
# Ok let's see the same thing grouped by county.
a=iowa.groupby(["cnty-name"])["excise"].sum()
a=pd.DataFrame(a)
plt.subplot(221)
plt.hist(a["excise"])
#plt.axis([40, 160, 0, 0.03])
plt.subplot(222)
plt.hist(a["excise"], range = [0,100000])
plt.subplot(223)
plt.hist(a["excise"], range = [0,50000])
plt.subplot(224)
plt.hist(a["excise"], range = [0,20000])
plt.show()
# interesting. skewed right, but when we zoom in on the bulk of the data it is less so
# Let's try to understand those outliers a little better. They skew so far!! We will do so in the "Refine Data" step.
import statsmodels.api as sm
from sklearn import linear_model
c=iowa.groupby("date")["excise"].sum()
c =pd.DataFrame(c)
c.index = pd.to_datetime(c.index)
c.head()
c["mo-date"] = c.index.month
c["yr-date"] = c.index.year
c.head()
c["shortdate"] = c.apply(lambda x: (str(int(x["mo-date"]))+"-"+str(int(x["yr-date"]))), axis=1)
c = pd.DataFrame(c.groupby(["shortdate"])["excise"].sum())
c.index = pd.to_datetime(c.index)
c.sort_values(["excise"])
plt.figure(figsize = (10,10))
plt.scatter(c.index, c.excise)
# When grouped by month, this appears to reflect seasonality. The two lowest datapoints are both January;
# there are spikes with growth up to a plateau range [~725000, ~825000] in the spring, plus spikes in June and December.
# Month is clearly influential, so we will use it as a dummy variable (see the sketch below). We will probably exclude
# March, as it is fairly representative of the plateau liquor sales settle into.
# We're definitely going to want to look at population. source:
# Ok, let's check out a scatter matrix for good measure.
# This takes forever to run on the full data... holding off for now.
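As flagged in the notes above, here is a minimal sketch of the month-dummy encoding (dropping March as the baseline "plateau" month; applied to the monthly frame c):
# One-hot encode month, with March as the omitted baseline category
month_dummies = pd.get_dummies(c.index.month, prefix="mo").drop(columns=["mo_3"])
month_dummies.head()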
from pandas.plotting import scatter_matrix
df = iowa[["vol-ml", 'state-cost', 'state-ret', 'qty-sold', 'sale-price', 'cost', 'excise']]
scatter_matrix(df, alpha=0.2, figsize=(15, 15), diagonal='kde')
pop = pd.read_csv("../county_pop_data.csv", sep = "\t")
pop = pop[["Geographic name", "7/1/2015", "7/1/2016"]]
pop.columns = ["cnty-name", "2015-pop", "2016-pop"]
pop["cnty-name"]=[" ".join(x.split()[:-1]) for x in pop["cnty-name"]]
iowa = iowa.merge(pop, on= "cnty-name", how = "left")
import numpy as np
plt.scatter(np.log(iowa["2015-pop"]), iowa["excise"])
#Looks like a correlation! We're going to need to scale our data, for sure.
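A minimal sketch of the scaling step flagged here, using scikit-learn's StandardScaler (shown for illustration; any regularized fit later would want this applied to its feature matrix):
from sklearn.preprocessing import StandardScaler
# Standardize the skewed numeric columns to zero mean / unit variance
scaler = StandardScaler()
scaled = scaler.fit_transform(iowa[["2015-pop", "excise"]].dropna())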
Be sure to write out any observations from your exploratory analysis.
Per the prompt, we are interested in projecting sales for the following year. In my opinion, the tax board would really be interested in the excise tax; by asking for sales instead, the prompt implies that everyone in the room can do the excise tax calculation in their heads and would therefore like everything in terms of sales rather than excise tax.
Since cost is perfectly collinear with sales, we will not include it in the model (a quick check below).
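A quick check of that collinearity claim (with a fixed 1.5 markup, sale-price is exactly 1.5x cost):
iowa[["sale-price", "cost"]].corr()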
Month is definitely worth including in the model. The prompt specifically asks for sales from January to March per store, likely for this reason.
Population is definitely worth including in the model, though it performs better after regularization.
One avenue the Tax Board may take to increase tax revenue is a by-the-gallon or by-the-liter tax rather than a 50% markup on cost; a back-of-the-envelope version follows.
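To make that alternative concrete, here is a rough calculation of the per-liter rate that would have matched current excise revenue (illustrative only):
# Revenue-neutral rate under a volume-based tax
per_liter_rate = iowa["excise"].sum() / iowa["vol-sold-l"].sum()
print("Revenue-equivalent rate: $%.2f per liter" % per_liter_rate)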
Now you are ready to compute the variables you will use for your regression from the data. For example, you may want to compute total sales per store from Jan to March of 2015, mean price per bottle, etc. Refer to the readme for more ideas appropriate to your scenario.
Pandas is your friend for this task. Take a look at the operations here for ideas on how to make the best use of pandas and feel free to search for blog and Stack Overflow posts to help you group data by certain variables and compute sums, means, etc. You may find it useful to create a new data frame to house this summary data.
iowa.head()
iowa["date"]=pd.to_datetime(iowa["date"])
iowa["year"]= iowa.apply(lambda x: x["date"].year, axis =1)
# iowa2 will be the data grouped by store and year.
# Keep in mind that qty-sold will now be qty-sold per that year, for that store.
# Note that 2015-pop and 2016-pop are summed too, so they become county population times the store's
# transaction count - effectively a store-activity proxy rather than raw population.
iowa2 = iowa.groupby(["store","year"])[["qty-sold", 'vol-sold-l', 'sale-price', 'excise', "2015-pop", "2016-pop"]].sum()
iowa2.head()
iowa2 =iowa2.reset_index()
iowa2["pricepbottle"] = iowa2["sale-price"]/iowa2['qty-sold']
iowa2.head()
iowa2.columns = ["store", "year", "qty-sold-pyr", "vol-sold-l-pyr", "sales-pyr", "excise-pyr", "2015-pop", "2016-pop",
                 "pricepbottle-pyr"]
plt.hist(iowa2["pricepbottle-pyr"])
jtmsales = iowa.set_index(["date"])
len(jtmsales)
jtmsales = jtmsales.loc["2015-01-01":"2015-03-31"]  # .ix is deprecated; .loc slices the DatetimeIndex
jtmsales.head(5)
jtmsales = jtmsales.groupby(["store"])[['sale-price', 'vol-sold-l', 'excise', 'qty-sold']].sum()
jtmsales.columns = ["Q1-2015-sales", "Q1-vol-sold-l","Q1-excise","Q1-qty-sold"]
jtmsales["Q1-avg-ppl"] = jtmsales["Q1-2015-sales"]/jtmsales["Q1-vol-sold-l"]
jtmsales["Q1-avg-ppbottle"] = jtmsales["Q1-2015-sales"]/jtmsales['Q1-qty-sold']
jtmsales.reset_index(inplace =True)
<font color ="red">Calculate the yearly liquor sales for each store using the provided data. You can add up the transactions for each year, and store sales in 2015 specifically will be used later as your target variable.</font>
<font color ="red">Use the data from 2015 to make a linear model using as many variables as you find useful to predict the yearly sales of all stores. You must use the sales from January to March as one of your variables.</font>
data2015 = iowa2[iowa2["year"]==2015].merge(jtmsales, on="store", how="left")
data2016 = iowa2[iowa2["year"]==2016]
Keep in mind that the 2016 data only runs through March, so these "yearly" 2016 totals are really Q1 2016 totals!!
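A quick way to confirm that coverage is to check the span of 2016 dates present in the data:
# The 2016 records should stop at the end of March
iowa.loc[iowa["year"] == 2016, "date"].agg(["min", "max"])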
Look for any statistical relationships, correlations, or other relevant properties of the dataset.
Now that we have the data in a format where total sales are grouped by store and year, we can check out statistical relationships
data2015.sort_values(["sales-pyr"], ascending=False).head(5)
# Highest gross sales are in stores 2633 and 4829, in Des Moines, IA (Polk county).
# Next highest gross sales are in store 3385, in Cedar Rapids, IA (Linn county).
# Next highest gross sales are in store 2512, in Iowa City, IA (Johnson county).
# Metropolitan areas dominate liquor sales.
plt.hist(data2015["2015-pop"].dropna())
#wow, it appears a lot of our variables have a right skew.
Using scikit-learn or statsmodels, build the necessary models for your scenario. Evaluate model fit.
Use the data from 2015 to make a linear model using as many variables as you find useful to predict the yearly sales of all stores. You must use the sales from January to March as one of your variables.
# There appear to be some n/a values in Q1-2015-sales. This could be due to new stores opened since Q1 2015 OR due
# to the fact that we are working with 10% of the data.
# By the definition given in the prompt, we want a predictive model that can predict total yearly sales from the
# first three months of a year's data.
data_2015=data2015.dropna()
features = data_2015[["Q1-vol-sold-l","Q1-avg-ppbottle","Q1-2015-sales", "Q1-excise","2015-pop"]]
target = data_2015[["sales-pyr"]]
import statsmodels.api as sm
X = features
y = target
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
# Print out the statistics
model.summary()
# Q1 Quantity Sold was not statistically significant (p=.76) and thus was dropped from the model
plt.scatter(predictions, y, s=30, c='r', marker='+', zorder=10)
plt.xlabel("Predicted Values Annual Sales")
plt.ylabel("Actual Values Annual Sales")
plt.show()
print("MSE:", model.mse_model) ## mean squared error
# Wow, an R^2 of .983 - that is uncomfortably high. We'll see how this model validates.
# The MSE is pretty uninterpretable, as it is on the order of magnitude of most of our variables. Let's see if we
# can reduce it. I presume this is a high-variance model and can be improved.
# We're going to need to regularize this data. Regularization is a technique used to reduce overfitting.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=1)
from sklearn.linear_model import Lasso
lassoreg = Lasso(alpha=0.5, normalize=True, max_iter=10000)
lassoreg.fit(X_train, y_train)
print(lassoreg.coef_)
from sklearn.linear_model import LassoCV
# select the best alpha with LassoCV
lassoregcv = LassoCV(n_alphas=10, normalize=True, random_state=1)  # n_alphas=10 means it will try 10 alphas along the regularization path (not random values)
lassoregcv.fit(X_train, y_train)
lassoregcv.alpha_
lassoreg = Lasso(alpha=181.774, normalize=True)  # alpha taken from lassoregcv.alpha_ above; Lasso already imported
lassoreg.fit(X_train, y_train)
print(lassoreg.coef_)
from sklearn import metrics
y_pred = lassoreg.predict(X_test)
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print(metrics.r2_score(y_test,y_pred))
# This is a better MSE, but still an outrageously high R^2.
# The bias/variance trade-off:
# Based on our high R^2 and similarly high MSE, I think our model has very high variance, meaning it is overfit
# to the training data. We've made the model slightly more biased by using Lasso regression, but we'll see how it
# performs on 2016 data.
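One way to put a number on that variance concern is k-fold cross-validation on the 2015 data; a sketch:
from sklearn.model_selection import cross_val_score
# R^2 across 5 folds; a wide spread would support the high-variance diagnosis
scores = cross_val_score(lassoreg, features, target.values.ravel(), cv=5, scoring="r2")
print(scores.mean(), scores.std())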
data2016.columns = ["store", "year", "qty-sold-q1", "vol-sold-l-q1", "sales-q1", "excise-q1", "2015-pop", "2016-pop",
                    "pricepbottle-q1"]
feat_2016 = data2016[["vol-sold-l-q1","pricepbottle-q1","sales-q1","excise-q1","2016-pop"]].dropna()
y_pred = lassoreg.predict(feat_2016)
sum(y_pred)
According to the Iowa Division of Alcoholic Beverages Annual Report 2016:
"In Fiscal Year 2016, the Division saw another increase in liquor sales to more than $288.9 million."
not bad for this order of magnitude.
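For a rough sense of scale (remembering we are working from roughly a 10% sample of transactions), the ratio of our predicted total to the reported figure:
# Compare the model's summed 2016 prediction to the reported $288.9M
print(y_pred.sum() / 288.9e6)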
<font color = "red">Use your model for 2015 to estimate total sales in 2016, extrapolating from the sales so far for January to March of 2016.</font>
beta_2016 = data2016[["store","vol-sold-l-q1","pricepbottle-q1","sales-q1","excise-q1","2016-pop"]].dropna(subset=["vol-sold-l-q1","pricepbottle-q1","sales-q1","excise-q1","2016-pop"])
beta_2016["predicted-2016-sales"] = y_pred
beta_2016=beta_2016.merge(data_2015[["store","sales-pyr","Q1-2015-sales"]], on="store", how = "left")
beta_2016.head()
beta_2016.columns = ["store", "Q1-2016-vol-sold-l", "Q1-2016-ppbottle", "Q1-2016-sales-pyr", "Q1-2016-excise",
                     "2016-pop", "predicted-2016-sales", "2015-sales-pyr", "Q1-2015-sales"]
beta_2016.head()
Wow - by Lasso regression, we predict higher sales for 2016 than in 2015. I brought in Q1 2015 sales to check: for the most part, differences in Q1 sales do correspond with differences in annual sales.
Let's add county to our DataFrame so we can plot sales by county later.
county_sales=beta_2016.merge(iowa[["store","cnty-name"]], on = "store", how="left").drop_duplicates(subset="store")
county_sales.head()
county_sales=county_sales.groupby(["cnty-name"])["predicted-2016-sales"].sum()
county_sales = pd.DataFrame(county_sales)
county_sales.reset_index(inplace=True)
county_sales
Again, make sure that you record any valuable information. For example, in the tax scenario, did you find the sales from the first three months of the year to be a good predictor of the total sales for the year? Plot the predictions versus the true values and discuss the successes and limitations of your models.
Plot, using 2015 data, of Q1 sales vs. total annual sales:
plt.scatter(data_2015["Q1-2015-sales"], data_2015["sales-pyr"], s=30, c='r', marker='+', zorder=10)
plt.xlabel("2015 Sales: Q1")
plt.ylabel("Total 2015 Sales")
plt.show()
plt.scatter(predictions, y, s=30, c='r', marker='+', zorder=10)
plt.xlabel("Predicted Values Annual Sales")
plt.ylabel("Actual Values Annual Sales")
plt.show()
print("MSE:", model.mse_model) ## mean squared error
Present your conclusions and results. If you have more than one interesting model feel free to include more than one along with a discussion. Use your work in this notebook to prepare your write-up.
In 2015, the following first-quarter variables, by store, served as predictors in our model: Q1 volume sold (liters), Q1 average price per bottle, Q1 sales, and Q1 excise tax.
These, along with 2015 population, were pretty good predictors of overall 2015 sales. While some of them are related to one another, they indicate that a strong showing in the first three months of the year is highly indicative of liquor sale performance throughout the rest of the year. However, we also see that the model based on 2015 underpredicts reported 2016 sales. This is unsurprising given the model, as the first-quarter sales of 2016 were lower than the first-quarter sales of 2015.
The difference between predicted and actual values for 2016 suggests that we may have overfit the model to 2015 and should examine additional factors. Based on the exploratory analysis, one direction I feel has merit is to examine types of liquor stores, and to train the model to better handle spikes in sales later in the year, like those in June and December.
The possibility of overfitting is common to all models and is part of the bias/variance trade-off. While we can be confident that first-quarter sales matter, we should be cautious about treating them as the only important predictor. In the case of tax legislation, it is preferable to underpredict than to overpredict liquor sales, because they are a direct line to state revenue.
county_sales.sort_values(["predicted-2016-sales"], ascending=False,inplace=True)
county_sales[0:10].plot(x="cnty-name",y="predicted-2016-sales",kind="bar",figsize=(15,15),fontsize=15)