Perhaps the most interesting aspect of ESPN’s NFL game recap page is the win probability graph that they’ve included in the bottom left corner since the 2016 season. Without knowing anything else about the game, one could follow the twists and turns on this chart to relive all the exciting moments (if there were any).
In fact there’s a metric called the Excitement Index which adds up all the absolute changes in win probability to arrive at a single number that expresses a contest’s cumulative drama. You can think of the Excitement Index as pulling the win probability graph’s jagged path until it’s a tight horizontal line and then measuring its length. So can we retrieve all these win probability graphs in order to calculate and visualize their Excitement Index?
library(tidyverse)
library(rvest)
library(gganimate)
library(grid)
library(png)
library(RCurl)
options(dplyr.summarise.inform = FALSE)
First we have to collect ESPN’s gamecast webpages and arrange them in a logical file structure. Since web-scraping is not the point of this blogpost and a somewhat questionable act to begin with, I won’t include the code that fetches each webpage. Feel free to message me if you’d like some help building a scraper.
Once we have the raw HTML, we can sift through it to gather the information we need. In particular, the away
and home
team identities are not too hard to find, along with their logos. The raw data for the win probability chart is hardcoded as a JSON object in one of the page’s accompanying scripts:
setwd("/Users/walkerharrison/Desktop/ESPN WinProb/")
html_files <- list.files(recursive = T)
dfs <- map(html_files, ~ {
year <- .x %>%
str_extract("(?<=eason/)\\d+") %>%
as.numeric()
season <- str_extract(.x, "[^/]+(?=/)")
week <- .x %>%
str_extract("(?<=Week\\s)\\d+") %>%
as.numeric()
game <- .x %>%
str_extract("\\d+(?=\\.html)") %>%
as.numeric()
page <- read_html(.x)
away <- page %>%
html_nodes("body") %>%
html_nodes(".away") %>%
html_nodes(".short-name") %>%
html_text()
home <- page %>%
html_nodes("body") %>%
html_nodes(".home") %>%
html_nodes(".short-name") %>%
html_text()
away_logo <- page %>%
html_nodes("body") %>%
html_nodes(".away") %>%
html_nodes(".team-logo") %>%
html_attr("src")
home_logo <- page %>%
html_nodes("body") %>%
html_nodes(".home") %>%
html_nodes(".team-logo") %>%
html_attr("src")
probability_data <- page %>%
html_nodes("script") %>%
keep(~str_detect(html_text(.x), "probability.data")) %>%
html_text() %>%
str_extract("probability\\.data.+") %>%
str_replace("probability.data = ", "") %>%
substr(1, nchar(.) -1) %>%
jsonlite::fromJSON(flatten = TRUE)
payload <- data.frame(year = year,
season = season,
week = week,
game = game,
home = home,
away = away,
home_logo = home_logo,
away_logo = away_logo)
if(length(probability_data) != 0) {
payload <- probability_data %>%
mutate(year = year,
season = season,
week = week,
game = game,
home = home,
away = away,
home_logo = home_logo,
away_logo = away_logo,
play_number = row_number())
}
payload
}
)
We bind these dataframes together to create our master list, all_plays
, and do a little bit of cleaning to weed out bad data – specifically, the first filter
makes sure that the scores are non-decreasing, since occasionally a penalty-negated score will be momentarily counted, and the second filter
stipulates that once a graph reaches 0% or 100% win probability, that game is over. The all important change
variable is simply the (absolute) delta between the previous home win probability and the current one.
all_plays <- bind_rows(dfs) %>%
arrange(game, play_number) %>%
group_by(game) %>%
filter(play.homeScore <= lead(play.homeScore, defaul = Inf),
play.awayScore <= lead(play.awayScore, default = Inf)) %>%
filter(lag(cumsum(homeWinPercentage %in% c(0, 1)), default = 0) < 1) %>%
mutate(change = abs(homeWinPercentage - lag(homeWinPercentage))) %>%
ungroup()
Now for the fun stuff! We aggregate the total change
by each game and throw in some other identifying features. Maybe it shouldn’t be a surprise that the two games with the highest Excitement Index in recent years are both overtime ties:
game_EI <- all_plays %>%
group_by(year, season, week, game, home, home_logo, away, away_logo) %>%
summarize(home_final_score = max(play.homeScore),
away_final_score = max(play.awayScore),
ExcitementIndex = sum(change, na.rm = T)) %>%
mutate(winner = case_when(
home_final_score > away_final_score ~ "home",
home_final_score < away_final_score ~ "away",
TRUE ~ "tie"
)) %>%
arrange(desc(ExcitementIndex)) %>%
ungroup() %>%
mutate(score_text = ifelse(home_final_score > away_final_score,
paste0(home, " ", home_final_score, ", ", away, " ", away_final_score),
paste0(away, " ", away_final_score, ", ", home, " ", home_final_score)))
game_EI %>%
filter(season == "Regular Season") %>%
head(10) %>%
select(year, season, week, score_text, ExcitementIndex) %>%
knitr::kable()
year | season | week | score_text | ExcitementIndex |
---|---|---|---|---|
2020 | Regular Season | 3 | Bengals 23, Eagles 23 | 14.8690 |
2016 | Regular Season | 8 | Redskins 27, Bengals 27 | 14.7430 |
2020 | Regular Season | 7 | Cardinals 37, Seahawks 34 | 13.2360 |
2016 | Regular Season | 8 | Raiders 30, Buccaneers 24 | 12.9310 |
2020 | Regular Season | 5 | Saints 30, Chargers 27 | 12.6710 |
2021 | Regular Season | 6 | Vikings 34, Panthers 28 | 12.3731 |
2018 | Regular Season | 4 | Raiders 45, Browns 42 | 12.0920 |
2020 | Regular Season | 1 | Chargers 16, Bengals 13 | 12.0320 |
2020 | Regular Season | 1 | Titans 16, Broncos 14 | 12.0270 |
2021 | Regular Season | 6 | Cowboys 35, Patriots 29 | 11.9796 |
Of course, no one really cares about a Week 3 Bengals game (sorry, Cincy), so let’s also print out the top postseason games, which feature some classic (and sometimes controversial) conference championships:
game_EI %>%
filter(season == "Postseason") %>%
head(10) %>%
select(year, season, week, score_text, ExcitementIndex) %>%
knitr::kable()
year | season | week | score_text | ExcitementIndex |
---|---|---|---|---|
2019 | Postseason | 1 | Texans 22, Bills 19 | 10.136 |
2019 | Postseason | 1 | Vikings 26, Saints 20 | 9.565 |
2018 | Postseason | 3 | Rams 26, Saints 23 | 8.998 |
2018 | Postseason | 3 | Patriots 37, Chiefs 31 | 8.814 |
2018 | Postseason | 1 | Eagles 16, Bears 15 | 8.153 |
2017 | Postseason | 2 | Vikings 29, Saints 24 | 7.983 |
2018 | Postseason | 1 | Cowboys 24, Seahawks 22 | 7.962 |
2017 | Postseason | 2 | Eagles 15, Falcons 10 | 7.118 |
2017 | Postseason | 5 | Eagles 41, Patriots 33 | 7.010 |
2019 | Postseason | 1 | Titans 20, Patriots 13 | 7.006 |
So how do we straighten out a win probability chart to measure its length? First let’s take the original form of that top playoff game:
best_game <- game_EI %>% filter(season == "Postseason") %>% head(1)
plays <- all_plays %>% inner_join(best_game)
plays %>%
ggplot(aes(play_number, homeWinPercentage)) +
geom_line() +
theme_bw()
What we’re trying to do is “pull” each line segment until it’s horizontal. Another way to think of that is that we have a vector from one win probability to the next lying at an angle determined by the severity of the win probability change. If we slowly reduce that angle to 0, the vector will lie flat with a length of the distance between the original points.
reps <- 10
plays_flattened <- plays %>%
mutate(homeWinPercentage = 100*homeWinPercentage) %>%
mutate(change = (lead(homeWinPercentage) - homeWinPercentage)) %>%
# determine length and angle of each line segment
mutate(length = sqrt(1 + change^2),
angle = atan(change),
angled = angle*180/pi) %>%
# copy each row ten times
slice(rep(1:nrow(.), each = reps)) %>%
group_by(game, play_number) %>%
mutate(idx = row_number()) %>%
# gradually reduce the angle in each row copy
mutate(angled_new = (reps-idx)/(reps-1)*angled) %>%
mutate(angle_new = angled_new*pi/180) %>%
arrange(idx, play_number) %>%
group_by(game, idx) %>%
# use sine and cosine to determine the new x and y coordinates
mutate(newx = coalesce(lag(cos(angle_new)*length), play_number),
newy = coalesce(lag(sin(angle_new)*length), homeWinPercentage)) %>%
# accumulate these differences to get final location
mutate(new_play_number = cumsum(newx),
new_wp = cumsum(newy))
Here’s what pulling the slack out of a single play looks like, courtesy of gganimate
:
p <- plays_flattened %>%
filter(play_number %in% 1:2) %>%
ggplot(aes(new_play_number, new_wp)) +
geom_line() +
theme_bw() +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank()) +
labs(y = "Win Probability", x = "Play Number") +
transition_time(idx)
animate(p, nframes = 150, fps = 25, start_pause = 50, end_pause = 50)
And here’s how it looks when you do all plays simultaneously:
p <- plays_flattened %>%
ggplot(aes(new_play_number, new_wp)) +
geom_line() +
scale_y_continuous(limits = c(0, 100), breaks = c(0, 50, 100)) +
scale_x_continuous(limits = c(0, max(plays_flattened$new_play_number) + 50)) +
theme_bw() +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank()) +
labs(y = "Win Probability") +
transition_time(idx)
animate(p, nframes = 150, fps = 25, start_pause = 50, end_pause = 50)
Okay let’s spice this up a little bit. We’re going to do multiple games at once, pull in some logos, and make it so that the final straight line lands on the winning team, which we’ll accomplish my gradually moving the y-component toward 1 or 0, depending on who won the game.
best_games <- game_EI %>% filter(season == "Postseason") %>% head(5)
plays_flattened <- all_plays %>%
inner_join(best_games) %>%
arrange(desc(ExcitementIndex)) %>%
# ensures the're in decscending excitement order on the facet
mutate(score_text = factor(score_text, levels = unique(.$score_text))) %>%
group_by(game) %>%
# same process as before
mutate(homeWinPercentage = 100*homeWinPercentage) %>%
mutate(change = (lead(homeWinPercentage) - homeWinPercentage)) %>%
mutate(length = sqrt(1 + change^2),
angle = atan(change),
angled = angle*180/pi) %>%
slice(rep(1:nrow(.), each = reps)) %>%
group_by(game, play_number) %>%
mutate(idx = row_number()) %>%
mutate(angled_new = (reps-idx)/(reps-1)*angled) %>%
mutate(angle_new = angled_new*pi/180) %>%
arrange(idx, play_number) %>%
group_by(game, idx) %>%
mutate(newx = coalesce(lag(cos(angle_new)*length), play_number),
newy = coalesce(lag(sin(angle_new)*length), homeWinPercentage)) %>%
mutate(new_play_number = cumsum(newx),
new_wp = cumsum(newy)) %>%
# now we gradually move the line up toward 1 or 0
group_by(game, play_number) %>%
mutate(new_wp_adj = case_when(
winner == "home" ~ new_wp + (idx-1)/(reps-1)*(100-new_wp),
winner == "away" ~ new_wp + (idx-1)/(reps-1)*(-new_wp),
winner == "tie" ~ new_wp + (idx-1)/(reps-1)*(50-new_wp))
) %>%
ungroup()
p <- plays_flattened %>%
ggplot(aes(new_play_number, new_wp_adj)) +
geom_line() +
scale_y_continuous(limits = c(-20, 120), breaks = c(0, 50, 100)) +
scale_x_continuous(limits = c(-150, max(plays_flattened$new_play_number) + 50)) +
facet_wrap(~score_text, ncol = 1) +
theme_bw() +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank()) +
labs(y = "Win Probability",
title = "Most Exciting Postseason Games Since 2016")
The annotation_custom2
function from this Stack Overflow post allows us to add the team logos to individual facets:
annotation_custom2 <- function (grob, xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf, data){
layer(data = data, stat = StatIdentity, position = PositionIdentity,
geom = ggplot2:::GeomCustomAnn,
inherit.aes = TRUE, params = list(grob = grob,
xmin = xmin, xmax = xmax,
ymin = ymin, ymax = ymax))}
for (gm in best_games$game){
away_img <- best_games %>% filter(game == gm) %>% pull(away_logo) %>% getURLContent() %>% readPNG()
home_img <- best_games %>% filter(game == gm) %>% pull(home_logo) %>% getURLContent() %>% readPNG()
away_annotation <- annotation_custom2(rasterGrob(away_img, interpolate=TRUE), xmin = -150, xmax = -10, ymin = -20, ymax = 40,
data = plays_flattened %>% filter(game == gm))
home_annotation <- annotation_custom2(rasterGrob(home_img, interpolate=TRUE), xmin = -150, xmax = -10, ymin = 60, ymax = 120,
data = plays_flattened %>% filter(game == gm))
p <- p + away_annotation + home_annotation
}
p <- p + transition_time(idx)
animate(p, nframes = 150, fps = 25, start_pause = 30, end_pause = 50)
Now that’s a good looking GIF!