TastyData

Logo

This is Exploratory Data Analysis Project for the Winter 2023 offering of DSC 80 at UC San Diego.

View the Project on GitHub Cao1224/recipeRatings_EDA

TastyData

by Yuancheng Cao (yuc094@ucsd.edu), Grace Chen

Credit: UC San Diego DSC 80 Winter 2023 Course Project Instruction

Introduction

The recipes dataset contains 83782 unique recipes. The interactions dataset contains reviews and ratings for recipes in the recipes dataset. We observed from the recipes dataframe that submission year of recipes ranged from 2008 to 2018, so we are interested into studying interesting trends of changes to the recipes submitted over time. This information can be useful to food providers such as restaurant owners, cookbook author, or even the hdh team at UCSD!.

We will perform EDA on a broad range of columns and determine which exact type of change we want to investigate. The columns we will try to investigate are ‘submitted’, ‘minutes’, ‘nutrition’, ‘n_steps’, ‘description’, ‘recipe_id’ from the recipes dataset and ‘rating’ from the interactions dataset:

We decided to work with the merged dataframe, with 234,429 rows. We also realized that some recipes are represented in multiple rows, but we believe this overrepresentation will not significantly affect our data since the large dataset is relatively robust. Therefore, we plan on investigate trends of changes in all above columns againest the ‘submitted’ column. After EDA, we decided to investigate if recipes submitted contains more protein over time. Therefore, our research question is: Is there a significant change in the average protein content of recipes over time?

Note: PDV - stands for “Percentage of Daily Value”

Cleaning and EDA

Data Cleaning

  1. Load the Datasets
    • Use pd.read_csv() to read ‘recipes’ dataset with os.path.join() to get recipes dataset.
    • Use pd.read_csv() to read ‘interactions’ dataset with os.path.join() to get interactions dataset.
  2. Merge DataFrames
    • Left merge the recipes and interactions datasets together.
    • Fill all ratings of 0 with np.nan in ‘rating’ column. Following Question: Why we need to fill all ratings of 0 with np.nan.
    • The reason why we need to fill all ratings of 0 with np.nan is to handle missing or incomplete data. In some cases, a rating of 0 may indicate that the recipe is not rated at all and does not necessarily reflect the quality of the recipe. By replacing these 0 values with np.nan, we can avoid the bias that might be introduced by using 0 ass a proxy for missing values. In addition, when we are computing statistics, np.nan is automatically ignored.
name id minutes submitted n_steps description n_ingredients rating avg_rating calories total_fat sugar sodium protein saturated_fat carbohydrates spring summer fall winter low-protein high-protein meatloaf meat meatballs beef-organ-meats beef ground-beef beef-ribs roast-beef
1 brownies in the world best ever 333281 40 2008 10 these are the most; chocolatey, moist, rich, dense, fudgy, delicious brownies that you’ll ever make…..sereiously! there’s no doubt that these will be your fav brownies ever for you can add things to them or make them plain…..either way they’re pure heaven! 9 4 4 138.4 10 50 3 3 19 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 in canada chocolate chip cookies 453467 45 2011 12 this is the recipe that we use at my school cafeteria for chocolate chip cookies. they must be the best chocolate chip cookies i have ever had! if you don’t have margarine or don’t like it, then just use butter (softened) instead. 11 5 5 595.1 46 211 22 13 51 26 0 0 0 0 0 0 0 0 0 0 0 0 0 0
412 broccoli casserole 306168 40 2008 6 since there are already 411 recipes for broccoli casserole posted to “zaar” ,i decided to call this one #412 broccoli casserole.i don’t think there are any like this one in the database. i based this one on the famous “green bean casserole” from campbell’s soup. but i think mine is better since i don’t like cream of mushroom soup.submitted to “zaar” on may 28th,2008 9 5 5 194.8 20 6 32 22 36 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0
412 broccoli casserole 306168 40 2008 6 since there are already 411 recipes for broccoli casserole posted to “zaar” ,i decided to call this one #412 broccoli casserole.i don’t think there are any like this one in the database. i based this one on the famous “green bean casserole” from campbell’s soup. but i think mine is better since i don’t like cream of mushroom soup.submitted to “zaar” on may 28th,2008 9 5 5 194.8 20 6 32 22 36 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0
412 broccoli casserole 306168 40 2008 6 since there are already 411 recipes for broccoli casserole posted to “zaar” ,i decided to call this one #412 broccoli casserole.i don’t think there are any like this one in the database. i based this one on the famous “green bean casserole” from campbell’s soup. but i think mine is better since i don’t like cream of mushroom soup.submitted to “zaar” on may 28th,2008 9 5 5 194.8 20 6 32 22 36 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Univariate Analysis

This histogram shows distribution of the ‘submitted’ column. Specifically, because we will be grouping submission date by years, this histogram also bins data from the ‘submitted’ column by years. The x axis is labeled by each submitted year, the y axis is labeled by count of recipes submitted in the corresponding year. We can see that the shape of the graph is very left skewed. This indicates that much more recipes were submitted before 2010, the earlier years that we are interested in, while much fewer recipes were submitted from 2014 or later, the late years that we are interested in. However, given that the whole dataframe is large enough, the skewed distribution should will not significantly influence our test results.

These two histograms show distribution of the ‘minute’ column. The x axes are labeled by minutes, the y axes are labeled by count of recipes submitted in the corresponding year. The top histogram shows the distribution of the ‘minutes’ column after removing outliers > 2880 minutes, which is 0.25% of the data. We can see that the shape of the graph is extremely left skewed. Most recipes have suggested preparing time under 120 minutes.

So we again removed extreme values > 120 minutes, which is 10.36% of the data and plot the bottom histogram. In the bottom histogram, we observed spikes at integer points ending in 5s and 0s, such as 30 minutes, 20 minutes, or 15 minutes.

Bivariate Analysis

The histogram shows the mean protein content for each season. The graph shows that the fall and winter recipes have a higher mean protein content compared to the spring and summer recipes. Because there are many holidays in the fall and winter, such as Labor Day, Veteran’s Day, Thanksgiving, Christmas, New Year, etc. People often have meals with family and friends. Meat and other high-protein foods are cooked in different wayss for their meals. This may result in higher overall protein content in these recipes. On the other hand, spring and summer seasons tend to have more light and fresh meals, with a focus on vegetables and fruits, whcih generally have lower protein content.

The box plot allows us to explore the relationship between protein content and meat/beef related tags in dataset. First of all, the box plots of the five groupss have outliers. In contrast to the other, the ditrbution of boxplotss with 2 and 4 meat/beef realted taps iss not very diffuse. When the boxplot of the number of meat/beef related label is 0, it shows that only meat and beef have high protein, but also legumes or other foods without these tags still have high protein.

Interesting Aggregates

  1. This groupby DataFrame shows the count and mean of the protein column for each unique recipe ID in the full_recipes DataFrame. The resulting DataFrame is sorted in asscending order based on the mean protein value. This DataFrame can help identify which recipes have the highest and lowest protein content, and frequency of each recipe in the full_recipes DataFrame. Below DataFrame shows first and last 5 rows of DataFrame.

    id count mean
    461235 1 0
    447587 3 0
    308738 2 0
    308774 1 0
    335660 1 0
    446886 2 2281
    449670 1 2637
    365630 1 2929
    338356 5 3605
    392286 1 4356
  2. The pivot table is grouping the data in the full_recipes DataFrame by the recipe id (id column) and calculating the sum of values in the low-protein and high-protein columns for each unique recipe. This can be useful for the understanding how many recipes have each tags (either low-protein or high-protein). The resulting DataFrame can be used to compare the prevalence of the two tags and identify any patterns or trendss in the data. For example, after pivoting the DataFrame, I found that the recipe id is 519068 has 3 high-protein tags and 0 low-protein tag. Below DataFrame shows first 5 rows of DataFrame.

    id high-protein low-protein
    537459 0 0
    537485 0 0
    537543 0 0
    537671 0 0
    537716 0 0
  3. I think the above two DataFrames are too big and need to be shrunk further. So, the following groupby table shows the count and mean of protein for recipes with low-protein and high-protein tags. The categories are defined based on the presence or absence of the respective tags in the recipes. The DataFrame helps to compare the protein content of recipes with and without these tags, and to identify any significant differences in protein content based on these tags. It can also help in creating recipes with a specific protein content range and provide insights into the popularity and nutrient information of recipes with and without these tags.

      low-protein high-protein count mean
    0 0 0 192286 35.0086
    1 0 1 8229 93.2573
    2 1 0 33914 8.19741

Assesssment of Missingness

NMAR Analysis

Out of the three columns with missing values, we believe that the column ‘rating’ from the interactions dataset might be NMAR. People are more likely to rate a recipe if they have extreme opinions towards the recipe. On the other hand, people are more likely not rate a recipe or simple forget to rate a reipe if they do not have strong opinions toward the recipe. Therefore, we believe that the missing values in the column ‘rating’ from the interactions are likely in the middle range of all possible ratings, such as 2s, 3s, and 4s. And the missingness of the column might be NMAR.

From the example recipe given directly from the website, food.com, I saw that viewers have the option to SAVE, DOWNLOAD, PRINT, or SHARE the recipe. If we can obtain information on whether each user corresponding to the user id did any of these four actions, as 4 boolean columns, we might be able to explain the missingness of ‘rating’ by performing test. This is because people who save, download print or share the recipes likely have strong opinions about the recipe, whether positive or negative, and this could relate to their willingness of rating the recipe as well.

Missingness Dependency

If Missingness of column description is Dependent on column avg_rating

The plot on the top shows distribution of the ‘description’ column when the ‘avg_rating’ column is missing and distribution of the ‘description’ column when the ‘avg_rating’ column is not missing. The red line represents distribution of the ‘description’ column when the ‘avg_rating’ column is missing and the blue line represents distribution of the ‘description’ column when the ‘avg_rating’ column is not missing. We observed that the shape of the two distributions are not alike, especially toward the right tail. However, the we center of the two distributions are roughly the same. So we decided to use Kolmogorov-Smirnov statistic as test statistic. We then performed permutation tests to analyze the dependency of the missingness of the ‘description’ column on the ‘avg_rating’ column. From the 1000 repetitions, we found a p_value of 0.017, so we reject the null hypothesis at significance level 0.05, and conclude that the Missingness of column ‘description’ is indeed dependent on column ‘avg_rating’. However, there is no way for use to conclude a causation relationship between the two columns. The graph in the bottom shows the empirical distribution of the test statistic, with the red line representing the observed statistics. This visualization confirms our conclusion from the calculation.

If Missingness of column description is Dependent on column protein

The plot shows the distribution of the ‘description’ column when the ‘protein’ column is missing and distribution of the ‘description’ column when the ‘protein’ column is not missing. The red line represents distribution of the ‘description’ column when the ‘protein’ column is missing and the blue line represents distribution of the ‘description’ column when the ‘protein’ column is not missing.We observed that the shape of the two distributions are similar, except the distribution of the ‘description’ column when the ‘protein’ column is not missing has a longer right tail due to the larger data group. However, center of the red line seemed slightly greater than center of the blue line, so we decided to use difference in group means as test statistic. We then performed permutation tests to analyze the dependency of the missingness of the ‘description’ column on the ‘protein’ columns. From the 1000 repetitions, we found a p_value of 0.306, so we fail to reject the null hypothesis at significance level 0.05, and conclude that the Missingness of column ‘description’ is not dependent on column ‘protein’. Intuitively, this is totally reasonable, because there does not seem to exist a direct relationship between possibly to not providing a description and the amount of proteins contained in the recipes.

Hypothesis

Null and Alternative Hypothesis

Choice of Test Statistic and Significance Level

Simulate the distribution of mean differences under the null hypothesis and p-value

Finally, to get p-value by using how many mean differences greater than observed mean difference to divided by N.

Conclusion

Since the p-value is 0.0, which is less than significance level of 0.05, we can reject the null hypothesis. This means that there is no significant change in the average protein content of recipes between two periods (2008-2013 and 2014-2018). At the same time, from the Empirical Distribution of the Mean Difference of Protein Content Between Two Periods, we rarely see a differece of about 5 proteins. Therefore, we reject the null hypothesis that a significant change in the average protein content of recipes between two periods (2008-2013 and 2014-2018).

In addition, I think there could be several confounding factors that could affect the average protein content of recipes between two periods (2008-2013 and 2014-2018).

  1. Changes in dietary trends could affect the average potein content of the recipes. First, if you are a bodybuilder or a person who exercises a lot, you need to consume more protein, this way the average protein content of the recipe goes up. Second, if more people switched to sustainable foods, vegetarian or plant-based diets, the average protein content of recipes would drop.
  2. Changes in cooking techniques that have influenced the protein content of the recipes. Depending on time and cooking tool and methods (pressure cooker vs. slow cooker) affectss the amount of protein in a recipe.
  3. The availability of ingredients could have changed between the two periods, leading to differences in the protein content of the recipes.
  4. Changes in portion sizes of recipe could affect the amount of protein.