This is a lesson on tidying/untidying data, remixed from Jenny Bryan’s similar lesson using “Lord of the Rings” data. Most text + code is Jenny’s, basically we plopped a new dataset in there 😉

Enough about tidy data. How do I make it messy?

Regardless of which gathering adventure you embarked upon, we’ll all use the Bachelor/Bachelorette data from 538 to practice spreading.

Import & Tidy

Use the following code to read in and tidy the data (Note: you’ll get a bunch of parsing errors and that is OK to ignore):

# load package
library(tidyverse)

# import and tidy
b_tidy <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/bachelorette/bachelorette.csv",
                         col_types = cols(SEASON = col_integer())) %>% 
  filter(!SEASON == "SEASON") %>% 
  select(SHOW, SEASON, CONTESTANT, starts_with("ELIMINATION")) %>% 
  gather(key = "week", value = "eliminated", starts_with("ELIMINATION"), na.rm = TRUE) %>% 
  mutate(week = str_replace(week, "-", "_"),
         week = parse_number(week))

Explore

What are the possible values in the new eliminated column? (hint: distinct possible values)

#> # A tibble: 8 x 1
#>   eliminated
#>   <chr>     
#> 1 R1        
#> 2 E         
#> 3 EQ        
#> 4 EF        
#> 5 EU        
#> 6 R         
#> 7 ED        
#> 8 W

What do those mean? Here is a key:

“E” connotes a standard elimination, typically at a rose ceremony.
“EQ” means the contestant quits.
“EF” means the contestant was fired by production.
“ED” connotes a date elimination.
“EU” connotes an unscheduled elimination, one that takes place at a time outside of a date or rose ceremony.
“R” means the contestant received a rose.
“R1” means the contestant got a first impression rose.
“W” means the contestant won.

Count

Using that tidy data, count the values in your new eliminated column by contestant name and show.

#> # A tibble: 1,128 x 4
#>    CONTESTANT     SHOW         eliminated     n
#>    <chr>          <chr>        <chr>      <int>
#>  1 01_ALEXA_X     Bachelor     E              1
#>  2 01_AMANDA_M    Bachelor     W              1
#>  3 01_AMBER_X     Bachelor     E              1
#>  4 01_AMY_X       Bachelor     E              1
#>  5 01_ANGELA_X    Bachelor     E              1
#>  6 01_ANGELIQUE_X Bachelor     E              1
#>  7 01_BILLY_X     Bachelorette E              1
#>  8 01_BOB_G       Bachelorette E              1
#>  9 01_BRIAN_C     Bachelorette E              1
#> 10 01_BRIAN_H     Bachelorette EQ             1
#> # ... with 1,118 more rows

Again, we can squint hard at this 1128 row tibble, but if we want to look at numbers like this:

We need to reshape this data (again).

Spread

Let’s spread that counted data, so that we get a column for each possible value in the eliminated column, and those columns hold the values in the n column. Set fill to 0.

Plot

Make a facetted bar plot with this data to show how many winning/losing contestants (hint: geom_col() might be a good choice here) in each show got first impression roses.

Bonus material

Calculate percentages of contestants in each show to answer what percentage of contestants that got a first impression rose were not winners?

# use tidy data to calculate percentages
(bummers <- first_impressions %>% 
  count(SHOW, R1, W) %>% 
  add_count(SHOW, R1, wt = n) %>% 
  mutate(perc = n/nn * 100))
#> # A tibble: 8 x 6
#>   SHOW            R1     W     n    nn      perc
#>   <chr>        <dbl> <dbl> <int> <int>     <dbl>
#> 1 Bachelor         0     0   501   519 96.53179 
#> 2 Bachelor         0     1    18   519  3.468208
#> 3 Bachelor         1     0    25    27 92.59259 
#> 4 Bachelor         1     1     2    27  7.407407
#> 5 Bachelorette     0     0   311   319 97.49216 
#> 6 Bachelorette     0     1     8   319  2.507837
#> 7 Bachelorette     1     0    13    18 72.22222 
#> 8 Bachelorette     1     1     5    18 27.77778

Answers: Only 7.41 % of Bachelor contestants who got a first impression rose won. Bachelorette contestants were a little better off: 27.78 % who got a first impression rose won.

# plot percentages
ggplot(first_impressions, aes(x = as.factor(SHOW), y = R1, fill = as.factor(W))) +
  geom_col(position = "fill")

Spread to Tidy Data

Alison Hill & Daniel Anderson

04 November, 2018