This is a lesson on tidying/untidying data, remixed from Jenny Bryan’s similar lesson using “Lord of the Rings” data. Most text + code is Jenny’s, basically we plopped a new dataset in there 😉
Enough about tidy data. How do I make it messy?
Regardless of which gather
ing adventure you embarked upon, we’ll all use the Bachelor/Bachelorette data from 538 to practice spreading.
Use the following code to read in and tidy the data (Note: you’ll get a bunch of parsing errors and that is OK to ignore):
# load package
library(tidyverse)
# import and tidy
b_tidy <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/bachelorette/bachelorette.csv",
col_types = cols(SEASON = col_integer())) %>%
filter(!SEASON == "SEASON") %>%
select(SHOW, SEASON, CONTESTANT, starts_with("ELIMINATION")) %>%
gather(key = "week", value = "eliminated", starts_with("ELIMINATION"), na.rm = TRUE) %>%
mutate(week = str_replace(week, "-", "_"),
week = parse_number(week))
What are the possible values in the new eliminated
column? (hint: distinct
possible values)
#> # A tibble: 8 x 1
#> eliminated
#> <chr>
#> 1 R1
#> 2 E
#> 3 EQ
#> 4 EF
#> 5 EU
#> 6 R
#> 7 ED
#> 8 W
What do those mean? Here is a key:
Using that tidy data, count
the values in your new eliminated
column by contestant name and show.
#> # A tibble: 1,128 x 4
#> CONTESTANT SHOW eliminated n
#> <chr> <chr> <chr> <int>
#> 1 01_ALEXA_X Bachelor E 1
#> 2 01_AMANDA_M Bachelor W 1
#> 3 01_AMBER_X Bachelor E 1
#> 4 01_AMY_X Bachelor E 1
#> 5 01_ANGELA_X Bachelor E 1
#> 6 01_ANGELIQUE_X Bachelor E 1
#> 7 01_BILLY_X Bachelorette E 1
#> 8 01_BOB_G Bachelorette E 1
#> 9 01_BRIAN_C Bachelorette E 1
#> 10 01_BRIAN_H Bachelorette EQ 1
#> # ... with 1,118 more rows
Again, we can squint hard at this 1128 row tibble, but if we want to look at numbers like this:
We need to reshape this data (again).
Let’s spread
that counted data, so that we get a column for each possible value in the eliminated
column, and those columns hold the values in the n
column. Set fill
to 0.
Make a facetted bar plot with this data to show how many winning/losing contestants (hint: geom_col()
might be a good choice here) in each show got first impression roses.
Calculate percentages of contestants in each show to answer what percentage of contestants that got a first impression rose were not winners?
# use tidy data to calculate percentages
(bummers <- first_impressions %>%
count(SHOW, R1, W) %>%
add_count(SHOW, R1, wt = n) %>%
mutate(perc = n/nn * 100))
#> # A tibble: 8 x 6
#> SHOW R1 W n nn perc
#> <chr> <dbl> <dbl> <int> <int> <dbl>
#> 1 Bachelor 0 0 501 519 96.53179
#> 2 Bachelor 0 1 18 519 3.468208
#> 3 Bachelor 1 0 25 27 92.59259
#> 4 Bachelor 1 1 2 27 7.407407
#> 5 Bachelorette 0 0 311 319 97.49216
#> 6 Bachelorette 0 1 8 319 2.507837
#> 7 Bachelorette 1 0 13 18 72.22222
#> 8 Bachelorette 1 1 5 18 27.77778
Answers: Only 7.41 % of Bachelor contestants who got a first impression rose won. Bachelorette contestants were a little better off: 27.78 % who got a first impression rose won.
# plot percentages
ggplot(first_impressions, aes(x = as.factor(SHOW), y = R1, fill = as.factor(W))) +
geom_col(position = "fill")