In this notebook, I am analyzing Starbucks nutrition data I scraped from https://www.starbucks.com/menu on September 2, 2020. This data only includes items from the Starbucks Drinks menu.
Starbucks provides a nutrition analysis of its menu items to help you balance your Starbucks order with other foods you eat. Their goal is to provide you with the information you need to make sensible decisions about balance, variety, and moderation in your diet.
The data file sbux_nutrition.csv contains the drink nutrition data for this analysis. It contains the following variables:
All drinks from the Starbucks online main menu (collected in Fall 2020) are included, with the exception of Clover® Brewed Coffees, Coffee Travelers, Iced Clover® Brewed Coffees, Bottled Teas, Milk, Sparkling Water, and Water. There are 11 columns and 525 rows.
For the purpose of this comparison analysis, I am filtering the dataset to only include drinks in grande size. Therefore, each row is a unique drink with a unique drink name. I am also omitting drinks for which grande size nutrition data was not provided on the Starbucks website menu.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv("data/sbux_nutrition.csv")
data.head()
# How many NAs?
data.isnull().sum()
# Clean up
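data = data.dropna()  # drop rows with missing nutrition values
data = data[data['size'] == 'Grande'].copy()  # keep grande only; copy avoids SettingWithCopyWarning
data['type'] = data['type'].replace({'Frappuccino® Blended Beverages':'Frappuccinos'})
# Shorten Frappuccino names (literal match, not a regex)
data['drink_name'] = data['drink_name'].str.replace('Frappuccino® Blended Beverage', 'Frappuccino', regex=False)
data['drink_name'] = data['drink_name'].str.replace('Frappuccino®', 'Frappuccino', regex=False)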
data.head()
# New number of rows/columns
data.shape
After data cleaning, we will be analyzing a total of 139 rows (unique drinks).
Summary table of each column (nutrition value). Some things to note from the table below:
data.describe()
What are the average nutrition values for each drink category (type)?
data.groupby('type').mean(numeric_only=True)  # average numeric columns only; text columns cannot be averaged
What percentage of Starbucks drinks contain caffeine?
caf_perc = round(len(data[data.caffeine > 0]) / len(data) * 100, 2)
print('Percentage of menu drinks that contain caffeine: ', caf_perc, '%', sep='')
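An equivalent way to get this share is to average a boolean mask, since the mean of True/False values is the fraction that are True. Here is a minimal sketch on a small made-up table; the `toy` frame and its values are assumptions for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data for illustration only; not taken from sbux_nutrition.csv
toy = pd.DataFrame({
    "drink_name": ["A", "B", "C", "D"],
    "caffeine": [0, 150, 95, 0],
})

# The mean of a boolean Series is the fraction of True values
caf_share = round((toy["caffeine"] > 0).mean() * 100, 2)
print(f"Percentage of toy drinks that contain caffeine: {caf_share}%")
```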
How many varieties of each category does Starbucks offer?
data['type'].value_counts().plot(kind='bar', figsize=(8, 6), rot=0, color='green')
plt.title('Variety in Starbucks Drink Categories', fontsize=16)
plt.xlabel("Drink Category")
plt.ylabel("Number of Drinks")
plt.show()
plt.figure(figsize=(10, 8), dpi=80)
corr = data.select_dtypes('number').corr()  # correlate numeric columns only
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, cmap='Greens', center=0, annot=True)
# Decorations
plt.title('Correlogram of Starbucks Nutrition Types', fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()
This correlogram shows the correlation between each pair of nutrition types in Starbucks drinks. Darker green encodes a higher (more positive) correlation. Every pair of nutrition types is positively correlated, with the exception of caffeine, whose correlation with the other nutrition types is slightly negative or close to zero. This may suggest that different levels of caffeine can be present in both sugary and non-sugary drinks. The strongest positive correlation is between fat and cholesterol, which makes sense because cholesterol is one of many types of lipids. Similarly, carbohydrates and sugar have the second strongest correlation, as sugar is a carbohydrate.
Other very strong positive relationships include: calories~carbohydrates, calories~fat, calories~cholesterol, and calories~sugar. While these correlations don't point out any unique observations in Starbucks nutrition, it's good to verify that the correlations make sense overall.
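To read the strongest relationships off a correlation matrix without scanning the heatmap by eye, the pairs can be ranked directly. This is a rough sketch, not part of the original analysis: `top_correlations` is a hypothetical helper, and the `toy` values are made up for illustration.

```python
import numpy as np
import pandas as pd

def top_correlations(df, n=3):
    """Return the n strongest pairwise correlations among numeric columns."""
    corr = df.select_dtypes("number").corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Sort by absolute strength, strongest first
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Toy data for illustration only; not taken from the dataset
toy = pd.DataFrame({
    "calories": [0, 100, 200, 300],
    "sugar": [0, 10, 20, 30],
    "caffeine": [300, 150, 200, 0],
})
result = top_correlations(toy, n=1)
print(result)
```

On the real data, calling `top_correlations(data)` should surface the same pairs that stand out in the heatmap.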
Why are we looking mostly at calories and sugar?
According to NPD’s Health Aspirations and Behavioral Tracking Service, the top two items consumers look for on nutrition labels are sugars and calories. I also want to discuss caffeine as it is in over 85% of items on the drink menu, and it is a hot topic in today's nutrition and health discussions.
data.nlargest(5, 'calories')[['drink_name', 'calories']]
fig = plt.figure(figsize = (6, 4))
plt.hist(data.calories, bins = 8, rwidth= 0.85, color='green')
plt.title('Distribution of Starbucks Calories')
plt.xlabel("Calories")
plt.ylabel("Number of Drinks")
plt.show()
data.nlargest(5, 'sugar')[['drink_name', 'sugar']]
fig = plt.figure(figsize = (6, 4))
plt.hist(data.sugar, bins = 8, rwidth= 0.85, color='green')
plt.title('Distribution of Starbucks Sugar')
plt.xlabel("Sugar (g)")
plt.ylabel("Number of Drinks")
plt.show()
data.nlargest(5, 'caffeine')[['drink_name', 'caffeine']]
fig = plt.figure(figsize = (6, 4))
plt.hist(data.caffeine, bins = 8, rwidth= 0.85, color='green')
plt.title('Distribution of Starbucks Caffeine')
plt.xlabel("Caffeine (mg)")
plt.ylabel("Number of Drinks")
plt.show()
The distributions show that Starbucks drinks are slightly left-skewed for calories, sugar, and caffeine. This makes sense as Starbucks does offer a number of zero-calorie and zero-sugar drinks, mostly falling in the tea category. It also offers a good amount of non-coffee and decaf drinks.
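Skew direction can also be checked numerically instead of by eye. The sketch below uses pandas' `Series.skew()` on made-up values (an assumption for illustration, not the dataset); a positive result indicates a longer tail toward high values, and a negative result a longer tail toward low values.

```python
import pandas as pd

# Toy values for illustration only; not taken from the dataset
toy_calories = pd.Series([0, 0, 5, 10, 120, 250, 400])

# Series.skew() returns the sample skewness:
# positive -> longer right tail, negative -> longer left tail
skewness = toy_calories.skew()
print(f"Skewness: {skewness:.2f}")
```

Running `data[['calories', 'sugar', 'caffeine']].skew()` on the cleaned data would give the sign for each of the three histograms.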
data.nsmallest(10, 'calories')[['drink_name', 'calories']]
fig = plt.figure(figsize = (8, 6))
plt.scatter(data.calories, data.sugar, c=data.caffeine, cmap='Greens')
cbar = plt.colorbar()
cbar.set_label('Caffeine (mg)', rotation=270)
# Draw a second-degree polynomial trend line
x = np.unique(data.calories)
trend = np.poly1d(np.polyfit(data.calories, data.sugar, 2))
plt.plot(x, trend(x), color='black')
plt.title('Calories vs Sugar in Starbucks Drinks', fontsize=16)
plt.xlabel("Calories")
plt.ylabel("Sugar (g)")
plt.show()
The scatterplot above, with a second-degree polynomial trend line, shows the relationship between calories and sugar across all Starbucks drinks. Caffeine is denoted by color, where a darker color means more caffeine. As expected, there is a clear and consistent positive relationship between calories and sugar. Caffeine has no clear relationship with the other two variables, with one exception: drinks with little to no calories and sugar contain a high amount of caffeine. This makes sense: brewed coffee roasts, which have no milk or sugar added, have a very high amount of caffeine relative to other drinks.
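To put numbers on the trend rather than reading it off the plot, a straight-line fit and a Pearson correlation can be computed alongside the degree-2 fit used for the plotted line. This is a minimal sketch on made-up points; the `calories` and `sugar` arrays here are assumptions for illustration, not the dataset.

```python
import numpy as np

# Toy points for illustration only; not taken from the dataset
calories = np.array([0, 100, 200, 300, 400], dtype=float)
sugar = np.array([0, 12, 25, 38, 50], dtype=float)

# Degree-1 fit: sugar ≈ slope * calories + intercept
slope, intercept = np.polyfit(calories, sugar, 1)

# Degree-2 fit, the same degree passed to np.polyfit for the plotted line
quad = np.poly1d(np.polyfit(calories, sugar, 2))

# Pearson correlation between the two variables
r = np.corrcoef(calories, sugar)[0, 1]
print(f"slope={slope:.3f} g per calorie, intercept={intercept:.2f}, r={r:.3f}")
```

If the two fits nearly coincide over the observed range, the relationship is close to linear and the slope can be read as grams of sugar per additional calorie.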
Let's look into drink categories to see if we can uncover deeper patterns in caffeine.
sns.set(palette="muted")
sns.catplot(x="type", y="caffeine", hue="type",
kind="swarm", data=data, aspect=1.5);
plt.title("Caffeine by Drink Type", fontsize=16)
plt.show()
sns.catplot(x="type", y="calories",
kind="swarm", data=data, aspect=1.5);
plt.title("Calories by Drink Type", fontsize=16)
plt.show()
sns.catplot(x="type", y="sugar",
kind="swarm", data=data, aspect=1.5);
plt.title("Sugar by Drink Type", fontsize=16)
plt.show()
Caffeine is highest in Hot Coffee and Cold Coffee, with most points falling at or above 150 mg.
The Frappuccino category is easily identified as the category with the most calories and sugar. The majority of points for this category fall above 350 calories and 40 g of sugar.
sns.lmplot(x="calories", y="sugar", hue="type",
data=data, aspect=1.5);
plt.title("Calories & Sugar in Starbucks Drink Types", fontsize=16)
plt.show()
We can use the per-category linear regressions above to compare calories to sugar in Starbucks drinks. For every drink category, we can see that as the number of calories increases, sugar increases. Using color to show the relationships by drink category, we can see that data points in the Frappuccino category tend to sit higher in calories and sugar relative to the other categories. Data points in the Hot Tea category tend to be in the lower ranges of calories and sugar. Finally, the Hot Coffee and Cold Coffee categories have a very wide distribution in this plot. This shows that a drink's nutrition in these categories does not depend on the category, but instead on the other variables that make up the drink.
Looking at raw gram and milligram values for all of the nutrition types doesn't tell us much, because the nutrition types are measured on different scales and their amounts cannot be compared directly. To standardize this, I want to know how the nutrition values for each type compare to the daily suggested intake.
As Specified by the FDA Based on a 2,000 Calorie Intake for Adults and Children 4 or More Years of Age.
Source: https://www.fda.gov/media/135301/download (updated March 2020)
For caffeine, it is generally recommended that healthy adults not go over 400 mg a day, an amount not associated with negative effects (FDA).
After this transformation, I wanted to see how nutrition types compare to each other now. What is the average daily intake percentage of each nutrition type for Starbucks drink items?
dv = data.copy(deep=True)
dv.calories = dv.calories / 2000 * 100
dv.fat = dv.fat / 78 * 100
dv.cholesterol = dv.cholesterol / 300 * 100
dv.sodium = dv.sodium / 2300 * 100
dv.carb = dv.carb / 275 * 100
dv = dv.drop(columns=['sugar']) # no daily level for total sugars
dv.protein = dv.protein / 50 * 100
dv.caffeine = dv.caffeine / 400 * 100
dv.head()
# Average daily intake percentage
dv.mean(numeric_only=True)
# Max daily intake percentage
dv.max()[3:]
No drinks seem to go above 1/3 of recommended daily value for any nutrition type, with the exception of caffeine. Let's be happy that no grande drinks have a nutrition value intake equivalent to one of our 3 daily meals!
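The one-third check can be made explicit in code. The sketch below is only an illustration: `DAILY_VALUES` collects the reference amounts used in the conversion, while `to_percent_dv`, `over_threshold`, and the `toy` rows are hypothetical names and made-up values, not part of the original notebook.

```python
import pandas as pd

# Daily reference amounts used in this notebook's conversion
DAILY_VALUES = {
    "calories": 2000, "fat": 78, "cholesterol": 300, "sodium": 2300,
    "carb": 275, "protein": 50, "caffeine": 400,
}

def to_percent_dv(df, daily_values=DAILY_VALUES):
    """Return a copy with each listed column expressed as % of its daily value."""
    out = df.copy()
    for col, amount in daily_values.items():
        if col in out.columns:
            out[col] = out[col] / amount * 100
    return out

def over_threshold(df_pct, threshold, columns):
    """List the columns whose maximum exceeds the threshold (in % DV)."""
    maxima = df_pct[columns].max()
    return maxima[maxima > threshold].index.tolist()

# Toy rows for illustration only; not taken from the dataset
toy = pd.DataFrame({
    "drink_name": ["X", "Y"],
    "calories": [5, 400],
    "carb": [0, 60],
    "caffeine": [360, 100],
})
toy_pct = to_percent_dv(toy)
flagged = over_threshold(toy_pct, 100 / 3, ["calories", "carb", "caffeine"])
print(flagged)
```

Keeping the reference amounts in one dictionary makes it easy to update a single value if the guidance changes, without touching the conversion logic.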
# Highest daily intake % of caffeine
dv.nlargest(10, 'caffeine')[['drink_name', 'caffeine']]
Blonde Roast takes the #1 spot, taking up 90% of the recommended caffeine daily intake! You best not be planning on another coffee later in the day. Unsurprisingly, we find many forms of Nitro Cold Brew in the top 10 most caffeinated drinks.
# Highest daily intake % of calories
dv.nlargest(10, 'calories')[['drink_name', 'calories']]
# Highest daily intake % of carbs (includes sugar)
dv.nlargest(10, 'carb')[['drink_name', 'carb']]
heatmap = dv.groupby('type').mean(numeric_only=True)  # average % DV per category, numeric columns only
plt.figure(figsize=(8,6))
plt.title('Average Nutrition Daily Intake % of Starbucks Items by Category', fontsize=16)
sns.heatmap(heatmap, cmap="Greens", annot=True)
A high level of caffeine intake is apparent in this heatmap for the categories Cold Coffee and Hot Coffee. This further supports the observations that caffeine is most apparent in the variant Nitro Cold Brew drinks (Cold Coffee) and the variant "roast" drinks (Hot Coffee).
On a positive note, it is very good to see little darkness in the other nutrition types. With the other nutrition types, we need to worry about how they add up once we start considering our daily meal intake as well. Having a lighter value is good so that we can save the bulk of our calorie intake for the meals that fuel us. On the other hand, we don't have to worry about caffeine intake for the rest of the day... unless you are someone who has multiple cups of coffee a day.
With caffeine greater than 60% DV, and with calories and carbohydrates greater than 20% DV, here are your worst drinks!
In terms of caffeine...
dv[(dv['caffeine'] > 60 )].sort_values(by=['caffeine'], ascending=False)
In terms of calories and carbs...
dv[(dv['calories'] >= 20) & (dv['carb'] > 20)].sort_values(by=['calories', 'carb'], ascending=False)
Very unfortunate that one of my absolute favorites is the Salted Caramel Mocha.