TDM 20200: Project 11 — 2023

Motivation: Data wrangling is the process of gathering, cleaning, structuring, and transforming data. Data wrangling is a big part in any data driven project, and sometimes can take a great deal of time. tidyverse is a great, but opinionated, suite of integrated packages to wrangle, tidy and visualize data. It is useful to gain some familiarity with this collection of packages, in case you run into a situation where these packages are needed — you may even find that you enjoy using them!

Context: We have covered a few topics on the tidyverse packages, but there is a lot more to learn! We will continue our strong focus on the tidyverse (including ggplot) and data wrangling tasks. This is the second in a series of 5 projects focused around using tidyverse packages to solve data-driven problems.

Scope: R, tidyverse, ggplot

Learning Objectives
  • Use mutate, pivot, unite, filter, and arrange to wrangle data and solve data-driven problems.

  • Combine different data using joins (left_join, right_join, semi_join, anti_join), and bind_rows.

  • Group data and calculate aggregated statistics using group_by, mutate, summarize, and transform functions.

  • Demonstrate the ability to create basic graphs with default settings, in ggplot.

  • Demonstrate the ability to modify axes labels and titles.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/beer/beers.csv

  • /anvil/projects/tdm/data/beer/reviews_sample.csv

Questions

Question 1

Let’s pick up where we left in the previous project. Copy and paste your commands from questions 1 to 3 that result in our beers_reviews dataset.

Using the pipelines (remember, the %>%), combine the necessary parts of questions 2 and 3, removing the need to have an intermediate reviews_summary dataset. This is a great way to practice and get a better understanding of tidyverse.

Your code should read the datasets, summarize the reviews data similarly to what you did in question 2, and combine the summarized dataset with the beers dataset. This should all be accomplished from a single chunk of "piped-together" code.

Feel free to remove the reviews dataset after we have the beers_reviews dataset.

rm(reviews)

If you want to update how you calculated your beer_goodness_indicator from the previous project, this would be a great time to do so!

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

Are there any differences in terms of abv between beers that are available in specific seasons?

ABV refers to the alcohol by volume of a beer. The higher the ABV, the more alcohol is in the beer.

  1. Filter the beers_reviews dataset to contain beers available only in a specific season (Fall, Winter, Spring, Summer).

    Only click below if you are stuck!

    This function will help you do this operation.

  2. Make a side-by-side boxplot comparing abv for each season availability.

    Only click below if you are stuck!

    This function will help you do this operation.

  3. Make sure to use the labs function to have nice x-axis label and y-axis label.

    This is more information on labs.

Use pipelines, resulting in a single chunk of "piped-together" code.

Use the fill argument to this function to color your boxplots differently for each season.

Write 1-2 sentences comparing the beers in terms of abv between the specific seasons. Are the results surprising or did you expect them?

The aes part of many ggplot plots may be confusing at first. In a nutshell, aes is used to match x-axis and y-axis values to columns of data in the given dataset. You should read this documentation and the examples carefully to better understand.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • 1-2 sentences comparing the beers in terms of abv between the specific seasons. Are the results surprising or did you expect them?

Question 3

Modify your code from question 2 to:

  1. Create a new variable is_good that is 1 or TRUE if beer_goodness_indicator is greater than 3.5, and 0 or FALSE otherwise.

  2. Facet your boxplot based on is_good. The resulting graphic should make it easy to compare the "good" vs "bad" beers for each season.

    facet_grid and facet_wrap are two other functions that can be a bit confusing at first. With that being said, they are incredible powerful and make creating really impressive graphics very straightforward.

Make sure to use piping %>% as well as layers (+) to create your final ggplot plot, using a single chunk of piped/layered code.

How do beers differ in terms of ABV and being considered good or not (based on our definition) for the different seasons? Write 1-2 sentences describing what you see based on the plots.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • 1-2 sentences answering the question.

Question 4

Modify your code from question 3 to answer the question based on summary statistics instead of graphical displays.

Make sure you compare the ABV per season availability and is_good using mean, median and sd. Your final dataframe should have 8 rows and the following columns: is_good, availability, mean_abv, median_abv, std_abv.

The following function will be useful for this question: filter, mutate, group_by, summarize (within summarize: mean, median, sd).

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

In this question, we want to make comparison in terms of ABV and beer_goodness_indicator for US states.

Feel free to use whichever data-driven method you desire to answer this question! You can take summary statistics, make a variety of plots, and even filter to compare specific US states — you can even create new columns combining states (based on region, political affiliation, etc).

Write a question related to US states, ABV and our "beer_goodness_indicator". Use your data-driven method(s) to answer it (if only anecdotally).

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • Write 1-2 sentences explaining your question and data-driven method(s).

  • Write 1-2 sentences answering your question.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.