import pandas as pd
import sqlite3
conn = sqlite3.connect("Pathname")
The Craft Beers dataset
In this project, I utilized an intriguing dataset of craft beers, originally compiled by Jean-Nicholas Hould. This dataset, which I sourced from a detailed web scraping tutorial available on Hould's website, encompasses a diverse range of craft beers, providing detailed information about each, including their names, types, brewing details, and other relevant attributes. My objective in this project is to delve deep into this dataset, applying my data analysis skills to uncover trends, patterns, and insights that reveal the nuances and dynamics of the craft beer market. This project not only serves as a testament to my analytical capabilities but also highlights the potential of skillfully curated data in driving meaningful analysis.
Data source: here.
The project began with connecting to a craft beer SQLite database and identifying 'beers' and 'breweries' as the key data tables:
query = """
SELECT name
FROM sqlite_master
WHERE type='table';
"""
df = pd.read_sql_query(query, conn)
df
name | |
---|---|
0 | beers |
1 | breweries |
The schemas of 'beers' and 'breweries' tables were examined to understand their composition, essential for informed data analysis:
for table in ['beers','breweries']:
query = f"""
PRAGMA table_info({table});
"""
df = pd.read_sql_query(query, conn)
print(df[['name','type']])
print('='*30)
name type 0 abv REAL 1 ibu REAL 2 id INTEGER 3 name TEXT 4 style TEXT 5 brewery_id INTEGER 6 ounces REAL ============================== name type 0 brewery_id INTEGER 1 name TEXT 2 city TEXT 3 state TEXT ==============================
Comprehensive metadata from the 'sqlite_master' table was retrieved to gain insight into the overall database structure, guiding the analytical approach:
metadata_query = """
SELECT *
FROM sqlite_master;
"""
metadata_result = pd.read_sql_query(metadata_query, conn)
metadata_result
type | name | tbl_name | rootpage | sql | |
---|---|---|---|---|---|
0 | table | beers | beers | 2 | CREATE TABLE "beers" (\n"abv" REAL,\n "ibu" R... |
1 | table | breweries | breweries | 40 | CREATE TABLE "breweries" (\n"brewery_id" INTEG... |
The first five records from the 'beers' table were analyzed, focusing on fundamental attributes like alcohol content and beer style, setting a direction for deeper exploration:
five_rows_beers_query = """
SELECT abv, ibu, name, style
FROM beers
LIMIT 5
"""
five_rows_beers_result = pd.read_sql_query(five_rows_beers_query, conn)
five_rows_beers_result
abv | ibu | name | style | |
---|---|---|---|---|
0 | 0.050 | None | Pub Beer | American Pale Lager |
1 | 0.066 | None | Devil's Cup | American Pale Ale (APA) |
2 | 0.071 | None | Rise of the Phoenix | American IPA |
3 | 0.090 | None | Sinister | American Double / Imperial IPA |
4 | 0.075 | None | Sex and Candy | American IPA |
The initial exploration of the 'breweries' table involved examining its first five entries, focusing on brewery identification and location, critical for contextual brewery analysis:
five_rows_breweries_query = """
SELECT brewery_id, name, city, state
FROM breweries
LIMIT 5;
"""
five_rows_breweries_result = pd.read_sql_query(five_rows_breweries_query, conn)
five_rows_breweries_result
brewery_id | name | city | state | |
---|---|---|---|---|
0 | 0 | NorthGate Brewing | Minneapolis | MN |
1 | 1 | Against the Grain Brewery | Louisville | KY |
2 | 2 | Jack's Abby Craft Lagers | Framingham | MA |
3 | 3 | Mike Hess Brewing Company | San Diego | CA |
4 | 4 | Fort Point Beer Company | San Francisco | CA |
The total number of entries in the 'beers' table was calculated to determine the dataset's breadth, informing the strategy for subsequent analyses:
beers_total_rows_query = """
SELECT COUNT(*) AS row_count
FROM beers;
"""
beers_total_rows_result = pd.read_sql_query(beers_total_rows_query, conn)
beers_total_rows_result
row_count | |
---|---|
0 | 2410 |
Unique beer styles in the 'beers' table were counted to assess the diversity of styles represented in the dataset, reflecting the craft beer market's variety:
distinct_styles_query = """
SELECT COUNT (DISTINCT(style)) AS distinct_styles
FROM beers;
"""
distinct_styles_result = pd.read_sql_query(distinct_styles_query, conn)
distinct_styles_result
distinct_styles | |
---|---|
0 | 99 |
The range of alcoholic strength in the dataset was analyzed by calculating the minimum, average, and maximum ABV in the 'beers' table, identifying brewing trends and preferences:
alcoholic_content_query = """
SELECT MIN(abv) AS min, AVG(abv) AS avg, MAX(abv) AS max
FROM beers;
"""
alcoholic_content_result = pd.read_sql_query(alcoholic_content_query, conn)
alcoholic_content_result
min | avg | max | |
---|---|---|---|
0 | 0.001 | 0.059773 | 0.128 |
Alcohol by Volume (ABV) data from the 'beers' table was extracted for a detailed analysis of alcohol content distribution, laying the groundwork for statistical examination and visualization:
abv_distribution_query = """
SELECT abv
FROM beers;
"""
abv_distribution_result = pd.read_sql_query(abv_distribution_query, conn)
abv_distribution_result
abv | |
---|---|
0 | 0.050 |
1 | 0.066 |
2 | 0.071 |
3 | 0.090 |
4 | 0.075 |
... | ... |
2405 | 0.067 |
2406 | 0.052 |
2407 | 0.055 |
2408 | 0.055 |
2409 | 0.052 |
2410 rows × 1 columns
The distribution of ABV was visualized using a histogram with 30 bins, providing detailed insights into the variation and typical ranges of alcohol content:
import matplotlib.pyplot as plt
ax = abv_distribution_result.hist(bins=30)[0][0]
ax.set_title("Alcohol by Volume")
plt.show()
Distinct IPA styles from the 'beers' table were isolated and listed to explore the diversity within this popular beer category, informing a targeted analysis of IPAs:
ipa_styles_query = """
SELECT DISTINCT(style) AS ipa_styles
FROM beers
WHERE style LIKE "%IPA%"
ORDER BY style ASC;
"""
ipa_styles_result = pd.read_sql_query(ipa_styles_query, conn)
ipa_styles_result
ipa_styles | |
---|---|
0 | American Double / Imperial IPA |
1 | American IPA |
2 | American White IPA |
3 | Belgian IPA |
4 | English India Pale Ale (IPA) |
Average bitterness levels for lagers and ales, including IPAs and Pale Ales, were analyzed to compare flavor profiles and discern trends across beer types:
ale_bitterness_query = """
SELECT AVG(ibu)
FROM beers
WHERE style LIKE '%lager%';
"""
ale_bitterness_result = pd.read_sql_query(ale_bitterness_query, conn)
print(ale_bitterness_result)
lager_bitterness_query = """
SELECT AVG(ibu)
FROM beers
WHERE style LIKE "%IPA%" OR "%APA%" OR "%Pale Ale%";
"""
lager_bitterness_result = pd.read_sql_query(lager_bitterness_query, conn)
print(lager_bitterness_result)
AVG(ibu) 0 22.863636 AVG(ibu) 0 71.94898
The most common beer styles were identified using a window function to rank styles based on occurrence frequency, highlighting the predominant styles in the craft beer market:
style_count_query = """
SELECT DISTINCT(style),
COUNT(name) OVER(PARTITION BY style) AS number
FROM beers
ORDER BY number DESC
LIMIT 10;
"""
style_count_result = pd.read_sql_query(style_count_query, conn)
style_count_result
style | number | |
---|---|---|
0 | American IPA | 424 |
1 | American Pale Ale (APA) | 245 |
2 | American Amber / Red Ale | 133 |
3 | American Blonde Ale | 108 |
4 | American Double / Imperial IPA | 105 |
5 | American Pale Wheat Ale | 97 |
6 | American Brown Ale | 70 |
7 | American Porter | 68 |
8 | Saison / Farmhouse Ale | 52 |
9 | Witbier | 51 |
Beers and their corresponding breweries in San Francisco were listed, employing an inner join to provide a structured overview of the local craft beer offerings:
beers_and_brewery_query = """
SELECT breweries.name AS brewery, beers.name AS beer
FROM beers
INNER JOIN breweries
ON beers.brewery_id = breweries.brewery_id
WHERE breweries.city = 'San Francisco'
ORDER BY brewery ASC, beer ASC;
"""
beers_and_brewery_result = pd.read_sql_query(beers_and_brewery_query, conn)
beers_and_brewery_result
brewery | beer | |
---|---|---|
0 | 21st Amendment Brewery | 21st Amendment IPA (2006) |
1 | 21st Amendment Brewery | 21st Amendment Watermelon Wheat Beer (2006) |
2 | 21st Amendment Brewery | Back in Black |
3 | 21st Amendment Brewery | Bitter American |
4 | 21st Amendment Brewery | Bitter American (2011) |
5 | 21st Amendment Brewery | Brew Free! or Die IPA |
6 | 21st Amendment Brewery | Brew Free! or Die IPA (2008) |
7 | 21st Amendment Brewery | Brew Free! or Die IPA (2009) |
8 | 21st Amendment Brewery | Fireside Chat |
9 | 21st Amendment Brewery | Fireside Chat (2010) |
10 | 21st Amendment Brewery | He Said Baltic-Style Porter |
11 | 21st Amendment Brewery | He Said Belgian-Style Tripel |
12 | 21st Amendment Brewery | Hell or High Watermelon Wheat |
13 | 21st Amendment Brewery | Hell or High Watermelon Wheat (2009) |
14 | 21st Amendment Brewery | Hell or High Watermelon Wheat (2009) |
15 | 21st Amendment Brewery | Hop Crisis |
16 | 21st Amendment Brewery | Lower De Boom |
17 | 21st Amendment Brewery | Marooned On Hog Island |
18 | 21st Amendment Brewery | Monk's Blood |
19 | 21st Amendment Brewery | Special Edition: Allies Win The War! |
20 | Anchor Brewing Company | Brotherhood Steam |
21 | Anchor Brewing Company | California Lager |
22 | Anchor Brewing Company | IPA |
23 | Anchor Brewing Company | Liberty Ale |
24 | Anchor Brewing Company | Summer Wheat |
25 | Fort Point Beer Company | KSA |
26 | Fort Point Beer Company | Park |
27 | Fort Point Beer Company | Villager |
28 | Fort Point Beer Company | Westfalia |
29 | Pine Street Brewery | Atom Splitter Pale Ale |
30 | Pine Street Brewery | Black Bay Milk Stout |
31 | Speakasy Ales & Lagers | Baby Daddy Session IPA |
The ten breweries with the most extensive range of beers were identified, showcasing the breweries with the most significant variety in their offerings:
largest_breweries_query = """
SELECT breweries.name AS name, COUNT(*) AS "number of beers"
FROM beers
INNER JOIN breweries
ON beers.brewery_id = breweries.brewery_id
GROUP BY breweries.brewery_id, breweries.name
ORDER BY "number of beers" DESC, name ASC
LIMIT 10;
"""
largest_breweries_result = pd.read_sql_query(largest_breweries_query, conn)
largest_breweries_result
name | number of beers | |
---|---|---|
0 | Brewery Vivant | 62 |
1 | Sun King Brewing Company | 38 |
2 | Oskar Blues Brewery | 33 |
3 | Cigar City Brewing Company | 25 |
4 | Sixpoint Craft Ales | 24 |
5 | Hopworks Urban Brewery | 23 |
6 | Stevens Point Brewery | 22 |
7 | 21st Amendment Brewery | 20 |
8 | Great Crescent Brewery | 20 |
9 | Bonfire Brewing Company | 19 |
The top 10 states by the number of breweries and total beer production were analyzed, combining subqueries and aggregation to rank states based on craft beer industry activity:
top_10_states_query = """
SELECT state, count(*) AS "breweries", sum(b.beers) "beers" FROM breweries
JOIN
(SELECT brewery_id, count(*) AS beers
FROM beers
GROUP BY brewery_id) b
ON b.brewery_id = breweries.brewery_id
GROUP BY state
ORDER BY breweries DESC, state ASC
LIMIT 10
"""
top_10_states_result = pd.read_sql_query(top_10_states_query, conn)
top_10_states_result
state | breweries | beers | |
---|---|---|---|
0 | CO | 47 | 265 |
1 | CA | 39 | 183 |
2 | MI | 32 | 162 |
3 | OR | 29 | 125 |
4 | TX | 28 | 130 |
5 | PA | 25 | 100 |
6 | MA | 23 | 82 |
7 | WA | 23 | 68 |
8 | IN | 22 | 139 |
9 | WI | 20 | 87 |
Analyzing the relationship between ABV and IBU. The correlation heatmap shows a moderate positive correlation of 0.67 between ABV (Alcohol by Volume) and IBU (International Bitterness Units), suggesting beers with higher ABV tend to be more bitter, but the relationship is not strong enough to predict bitterness solely based on alcohol content:
import seaborn as sns
abv_ibu_query = """
SELECT abv AS ABV, ibu AS IBU
FROM beers;
"""
abv_ibu_df = pd.read_sql_query(abv_ibu_query, conn)
correlation_data = abv_ibu_df.dropna()
sns.heatmap(correlation_data.corr(), annot=True, cmap='coolwarm')
plt.title("Correlation between ABV and IBU")
plt.show()
The bar chart displays the average Alcohol by Volume (ABV) for craft beers across various states, revealing Nevada as having the highest average ABV and Utah as having the lowest. The data suggest a standardization in ABV among most states, with Nevada and Utah marking the upper and lower bounds of this range. The results could be indicative of state-specific regulations, consumer tastes, or brewing traditions that influence the alcohol content in craft beer production:
import sqlite3
import matplotlib.pyplot as plt
avg_abv_by_state_query = """
SELECT state, AVG(abv) as average_abv
FROM beers
JOIN breweries ON beers.brewery_id = breweries.brewery_id
GROUP BY state
ORDER BY average_abv DESC;
"""
avg_abv_by_state = pd.read_sql_query(avg_abv_by_state_query, conn)
avg_abv_by_state.plot(kind='bar', x='state', y='average_abv', figsize=(10, 6))
plt.title("Average ABV by State")
plt.ylabel("Average ABV")
plt.xlabel('State')
plt.show()
This section of the analysis prepares the craft beer dataset for a geographical deep dive. We start by standardizing state names in the avg_abv_by_state dataset, ensuring uniformity for accurate mapping. A mapping dictionary is utilized to convert state abbreviations to their full names. This step is pivotal for aligning our beer data with the geographical data, setting the foundation for an insightful spatial analysis:
import geopandas as gpd
avg_abv_by_state['state'] = avg_abv_by_state['state'].str.upper().str.strip()
state_abbrev_to_full = {
'AL': 'Alabama', 'AK': 'Alaska', 'AZ': 'Arizona', 'AR': 'Arkansas', 'CA': 'California',
'CO': 'Colorado', 'CT': 'Connecticut', 'DE': 'Delaware', 'FL': 'Florida', 'GA': 'Georgia',
'HI': 'Hawaii', 'ID': 'Idaho', 'IL': 'Illinois', 'IN': 'Indiana', 'IA': 'Iowa',
'KS': 'Kansas', 'KY': 'Kentucky', 'LA': 'Louisiana', 'ME': 'Maine', 'MD': 'Maryland',
'MA': 'Massachusetts', 'MI': 'Michigan', 'MN': 'Minnesota', 'MS': 'Mississippi', 'MO': 'Missouri',
'MT': 'Montana', 'NE': 'Nebraska', 'NV': 'Nevada', 'NH': 'New Hampshire', 'NJ': 'New Jersey',
'NM': 'New Mexico', 'NY': 'New York', 'NC': 'North Carolina', 'ND': 'North Dakota', 'OH': 'Ohio',
'OK': 'Oklahoma', 'OR': 'Oregon', 'PA': 'Pennsylvania', 'RI': 'Rhode Island', 'SC': 'South Carolina',
'SD': 'South Dakota', 'TN': 'Tennessee', 'TX': 'Texas', 'UT': 'Utah', 'VT': 'Vermont',
'VA': 'Virginia', 'WA': 'Washington', 'WV': 'West Virginia', 'WI': 'Wisconsin', 'WY': 'Wyoming',
'DC': 'District of Columbia'
}
avg_abv_by_state['state'] = avg_abv_by_state['state'].map(state_abbrev_to_full)
gdf = gpd.read_file('Pathname')
gdf = gdf.merge(avg_abv_by_state, left_on='NAME', right_on='state')
gdf.set_index(gdf.state, inplace= True)
gdf.drop(['GEO_ID', 'STATE', 'LSAD', 'NAME'], inplace= True, axis= 1)
gdf.head()
CENSUSAREA | geometry | state | average_abv | |
---|---|---|---|---|
state | ||||
Maine | 30842.923 | MULTIPOLYGON (((-67.61976 44.51975, -67.61541 ... | Maine | 0.057815 |
Massachusetts | 7800.058 | MULTIPOLYGON (((-70.83204 41.60650, -70.82373 ... | Massachusetts | 0.055683 |
Michigan | 56538.901 | MULTIPOLYGON (((-88.68443 48.11579, -88.67563 ... | Michigan | 0.063377 |
Montana | 145545.801 | POLYGON ((-104.05770 44.99743, -104.25015 44.9... | Montana | 0.056487 |
Nevada | 109781.180 | POLYGON ((-114.05060 37.00040, -114.04999 36.9... | Nevada | 0.066900 |
Here, we integrate the prepared craft beer data with geospatial information. Using GeoPandas, we load US state boundary data from a GeoJSON file and merge it with our craft beer dataset. By aligning the state names from both datasets and setting up the data frame for geographical plotting, we pave the way for a compelling visualization that will showcase ABV trends across the US:
from bokeh.io import show, output_notebook
from bokeh.models import GeoJSONDataSource, LinearColorMapper, ColorBar, HoverTool
from bokeh.palettes import brewer
from bokeh.plotting import figure
# Add a formatted ABV field for hover display
gdf['formatted_abv'] = gdf['average_abv'].apply(lambda x: f'{x:.3f}')
# Convert GeoDataFrame to GeoJSON format
geo_json = gdf.to_json()
# Create a GeoJSONDataSource
geo_source = GeoJSONDataSource(geojson=geo_json)
# Define a color mapper: palette can be adjusted
palette = brewer['YlGnBu'][8]
color_mapper = LinearColorMapper(palette=palette, low=gdf.average_abv.min(), high=gdf.average_abv.max())
# Create figure object with adjusted dimensions
p = figure(title='Average Alcohol by Volume (ABV) across States',
height=600, width=950,
toolbar_location='below',
tools="pan, wheel_zoom, reset")
# Set the x and y range bounds
p.x_range.bounds = (-130, -65)
p.y_range.bounds = (24, 50)
# Remove axis labels and grid lines
p.axis.visible = False
p.xgrid.grid_line_color = None
p.ygrid.grid_line_color = None
# Add patch renderer to figure
p.patches('xs', 'ys', source=geo_source,
fill_color={'field': 'average_abv', 'transform': color_mapper},
line_color='black', line_width=0.5, fill_alpha=1)
# Configure hover tool
hover = HoverTool()
hover.tooltips = [
("State", "@state"),
("Average ABV", "@formatted_abv")
]
p.add_tools(hover)
# Add color bar
color_bar = ColorBar(color_mapper=color_mapper, label_standoff=12, width=500, height=20,
location=(0,0), orientation='horizontal')
p.add_layout(color_bar, 'below')
output_notebook()
show(p)
The resulting map, depicting the average ABV by state, reveals intriguing regional trends in the craft beer market. States like Nevada and California might show higher average ABV, indicating a preference for stronger beers, whereas states like Utah could exhibit lower ABV averages, possibly due to local regulations or cultural preferences. This geographical representation not only highlights regional variations in beer strength but also provides valuable insights into the diverse landscape of American craft beer preferences.
The bar chart indicates that "ale" is the most frequently occurring word in beer names, suggesting it is a very popular type of beer. On the opposite end, "wheat" appears as the least common among the listed words, which could indicate it's a less common flavor profile or style in the selection of beers analyzed:
from collections import Counter
import re
query_beers = """
SELECT *
FROM beers;
"""
beers = pd.read_sql_query(query_beers, conn)
words = Counter(re.findall(r'\w+', ' '.join(beers['name']).lower()))
most_common_words = words.most_common(10)
words_df = pd.DataFrame(most_common_words, columns=['Word', 'Frequency'])
words_df.plot(kind='bar', x='Word', y='Frequency')
plt.title("Most Common Words in Beer Names")
plt.show()