Visualizing time series data - II

Lecture 6

Dr. Mine Çetinkaya-Rundel

Duke University
STA 313 - Spring 2024

Warm up

Announcements

  • Reading Quiz 2 is due at 10 am on Tuesday

  • Project 1 proposals due at 1 pm on Wednesday

Setup

# load packages
library(countdown)
library(tidyverse)
library(janitor)
library(colorspace)
library(broom)
library(fs)

# set theme for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 14))

# set figure parameters for knitr
knitr::opts_chunk$set(
  fig.width = 7, # 7" width
  fig.asp = 0.618, # the golden ratio
  fig.retina = 3, # dpi multiplier for displaying HTML output on retina
  fig.align = "center", # center align figures
  dpi = 300 # higher dpi, sharper image
)

Working with dates

Air Quality Index

  • The AQI is the Environmental Protection Agency’s index for reporting air quality

  • Higher values of AQI indicate worse air quality

AQI Basics for Ozone and Particle Pollution

AQI levels

The previous graphic in tibble form, to be used later…

aqi_levels <- tribble(
  ~aqi_min, ~aqi_max, ~color,    ~level,
  0,        50,       "#D8EEDA", "Good",
  51,       100,      "#F1E7D4", "Moderate",
  101,      150,      "#F8E4D8", "Unhealthy for sensitive groups",
  151,      200,      "#FEE2E1", "Unhealthy",
  201,      300,      "#F4E3F7", "Very unhealthy",
  301,      400,      "#F9D0D4", "Hazardous"
)

AQI data

  • Source: EPA’s Daily Air Quality Tracker

  • 2016 - 2022 AQI (Ozone and PM2.5 combined) for Durham-Chapel Hill, NC core-based statistical area (CBSA), one file per year

  • 2016 - 2022 AQI (Ozone and PM2.5 combined) for San Francisco-Oakland-Hayward, CA CBSA, one file per year

2022 Durham-Chapel Hill

  • Load data
dch_2022 <- read_csv(here::here("data/durham-chapel-hill/ad_aqi_tracker_data-2022.csv"))
  • Metadata
dim(dch_2022)
[1] 365  11
names(dch_2022)
 [1] "Date"                       "AQI Value"                 
 [3] "Main Pollutant"             "Site Name"                 
 [5] "Site ID"                    "Source"                    
 [7] "20-year High (2000-2019)"   "20-year Low (2000-2019)"   
 [9] "5-year Average (2015-2019)" "Date of 20-year High"      
[11] "Date of 20-year Low"       

Clean variable names

dch_2022 <- dch_2022 |>
  janitor::clean_names()

names(dch_2022)
 [1] "date"                      "aqi_value"                
 [3] "main_pollutant"            "site_name"                
 [5] "site_id"                   "source"                   
 [7] "x20_year_high_2000_2019"   "x20_year_low_2000_2019"   
 [9] "x5_year_average_2015_2019" "date_of_20_year_high"     
[11] "date_of_20_year_low"      

First look

This plot looks quite bizarre. What might be going on?

ggplot(dch_2022, aes(x = date, y = aqi_value, group = 1)) +
  geom_line()

Peek at data

dch_2022 |>
  select(date, aqi_value, site_name, site_id)
# A tibble: 365 × 4
   date       aqi_value site_name     site_id    
   <chr>      <chr>     <chr>         <chr>      
 1 01/01/2022 22        Durham Armory 37-063-0015
 2 01/02/2022 12        Durham Armory 37-063-0015
 3 01/03/2022 10        Durham Armory 37-063-0015
 4 01/04/2022 21        Durham Armory 37-063-0015
 5 01/05/2022 35        Durham Armory 37-063-0015
 6 01/06/2022 29        Durham Armory 37-063-0015
 7 01/07/2022 15        Durham Armory 37-063-0015
 8 01/08/2022 28        Durham Armory 37-063-0015
 9 01/09/2022 28        Durham Armory 37-063-0015
10 01/10/2022 15        Durham Armory 37-063-0015
# ℹ 355 more rows

Transforming date

Using lubridate::mdy():

dch_2022 |>
  mutate(date = mdy(date))
# A tibble: 365 × 11
   date       aqi_value main_pollutant site_name     site_id     source
   <date>     <chr>     <chr>          <chr>         <chr>       <chr> 
 1 2022-01-01 22        PM2.5          Durham Armory 37-063-0015 AQS   
 2 2022-01-02 12        PM2.5          Durham Armory 37-063-0015 AQS   
 3 2022-01-03 10        PM2.5          Durham Armory 37-063-0015 AQS   
 4 2022-01-04 21        PM2.5          Durham Armory 37-063-0015 AQS   
 5 2022-01-05 35        PM2.5          Durham Armory 37-063-0015 AQS   
 6 2022-01-06 29        PM2.5          Durham Armory 37-063-0015 AQS   
 7 2022-01-07 15        PM2.5          Durham Armory 37-063-0015 AQS   
 8 2022-01-08 28        PM2.5          Durham Armory 37-063-0015 AQS   
 9 2022-01-09 28        PM2.5          Durham Armory 37-063-0015 AQS   
10 2022-01-10 15        PM2.5          Durham Armory 37-063-0015 AQS   
# ℹ 355 more rows
# ℹ 5 more variables: x20_year_high_2000_2019 <dbl>,
#   x20_year_low_2000_2019 <dbl>, x5_year_average_2015_2019 <dbl>,
#   date_of_20_year_high <chr>, date_of_20_year_low <chr>

Transforming AQI values

What does this warning mean?

dch_2022 |>
  mutate(aqi_value = as.numeric(aqi_value))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `aqi_value = as.numeric(aqi_value)`.
Caused by warning:
! NAs introduced by coercion
# A tibble: 365 × 11
  date  aqi_value main_pollutant site_name site_id source x20_year_high_2000_2…¹
  <chr>     <dbl> <chr>          <chr>     <chr>   <chr>                   <dbl>
1 01/0…        22 PM2.5          Durham A… 37-063… AQS                       111
2 01/0…        12 PM2.5          Durham A… 37-063… AQS                        76
3 01/0…        10 PM2.5          Durham A… 37-063… AQS                        66
# ℹ 362 more rows
# ℹ abbreviated name: ¹​x20_year_high_2000_2019
# ℹ 4 more variables: x20_year_low_2000_2019 <dbl>,
#   x5_year_average_2015_2019 <dbl>, date_of_20_year_high <chr>,
#   date_of_20_year_low <chr>

Investigating AQI values

  • Take a peek at distinct values of AQI
dch_2022 |>
  distinct(aqi_value) |>
  pull()
 [1] "22" "12" "10" "21" "35" "29" "15" "28" "25" "36" "49" "19" "24" "46" "38"
[16] "48" "34" "47" "57" "26" "45" "42" "13" "31" "39" "62" "40" "30" "20" "18"
[31] "27" "67" "41" "56" "37" "32" "43" "51" "50" "44" "33" "54" "52" "77" "74"
[46] "53" "58" "64" "61" "93" "16" "17" "23" "5"  "6"  "."  "9"  "14" "59"
  • "." likely indicates NA, and it’s causing the entire column to be read in as characters

Rewind, and start over

dch_2022 <- read_csv(
  here::here("data/durham-chapel-hill/ad_aqi_tracker_data-2022.csv"),
  na = c(".", "")
)
glimpse(dch_2022)
Rows: 365
Columns: 11
$ Date                         <chr> "01/01/2022", "01/02/2022", "01/03/2022",…
$ `AQI Value`                  <dbl> 22, 12, 10, 21, 35, 29, 15, 28, 28, 15, 2…
$ `Main Pollutant`             <chr> "PM2.5", "PM2.5", "PM2.5", "PM2.5", "PM2.…
$ `Site Name`                  <chr> "Durham Armory", "Durham Armory", "Durham…
$ `Site ID`                    <chr> "37-063-0015", "37-063-0015", "37-063-001…
$ Source                       <chr> "AQS", "AQS", "AQS", "AQS", "AQS", "AQS",…
$ `20-year High (2000-2019)`   <dbl> 111, 76, 66, 61, 83, 71, 75, 76, 57, 71, …
$ `20-year Low (2000-2019)`    <dbl> 10, 8, 14, 9, 8, 15, 18, 15, 19, 20, 14, …
$ `5-year Average (2015-2019)` <dbl> 39.2, 36.8, 38.2, 30.4, 26.0, 32.4, 37.0,…
$ `Date of 20-year High`       <chr> "01/01/2000", "01/02/2005", "01/03/2004",…
$ `Date of 20-year Low`        <chr> "01/01/2007", "01/02/2012", "01/03/2012",…

Data cleaning

dch_2022 <- dch_2022 |>
  janitor::clean_names() |>
  mutate(date = mdy(date))

dch_2022
# A tibble: 365 × 11
   date       aqi_value main_pollutant site_name     site_id     source
   <date>         <dbl> <chr>          <chr>         <chr>       <chr> 
 1 2022-01-01        22 PM2.5          Durham Armory 37-063-0015 AQS   
 2 2022-01-02        12 PM2.5          Durham Armory 37-063-0015 AQS   
 3 2022-01-03        10 PM2.5          Durham Armory 37-063-0015 AQS   
 4 2022-01-04        21 PM2.5          Durham Armory 37-063-0015 AQS   
 5 2022-01-05        35 PM2.5          Durham Armory 37-063-0015 AQS   
 6 2022-01-06        29 PM2.5          Durham Armory 37-063-0015 AQS   
 7 2022-01-07        15 PM2.5          Durham Armory 37-063-0015 AQS   
 8 2022-01-08        28 PM2.5          Durham Armory 37-063-0015 AQS   
 9 2022-01-09        28 PM2.5          Durham Armory 37-063-0015 AQS   
10 2022-01-10        15 PM2.5          Durham Armory 37-063-0015 AQS   
# ℹ 355 more rows
# ℹ 5 more variables: x20_year_high_2000_2019 <dbl>,
#   x20_year_low_2000_2019 <dbl>, x5_year_average_2015_2019 <dbl>,
#   date_of_20_year_high <chr>, date_of_20_year_low <chr>

Another look

ggplot(dch_2022, aes(x = date, y = aqi_value, group = 1)) +
  geom_line()

Improve

How would you improve this visualization?

Visualizing Durham AQI - v1

Recreate the following visualization.

Visualizing Durham AQI - v2

dch_2022 |>
  ggplot(aes(x = date, y = aqi_value, group = 1)) +
  geom_line(linewidth = 0.8) +
  scale_x_date(
    name = NULL, date_labels = "%b",
    limits = c(ymd("2022-01-01"), ymd("2023-03-01"))
  ) +
  scale_y_continuous(breaks = c(0, 50, 100, 150, 200, 300, 400)) +
  geom_text(
    data = aqi_levels,
    aes(
      x = ymd("2023-02-28"), y = aqi_mid, 
      label = level, color = darken(color, 0.3)
    ),
    hjust = 1, size = 6, fontface = "bold"
  ) +
  scale_color_identity() +
  annotate(
    geom = "text",
    x = c(ymd("2022-01-01"), ymd("2023-01-01")), y = -80,
    label = c("2022", "2023"), size = 4
  ) +
  coord_cartesian(clip = "off", ylim = c(0, 400)) +
  labs(
    x = NULL, y = "AQI",
    title = "Ozone and PM2.5 Daily AQI Values",
    subtitle = "Durham-Chapel Hill, NC",
    caption = "\nSource: EPA Daily Air Quality Tracker"
  ) +
  theme(
    plot.title.position = "plot",
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank()
  )

Highlights

  • The lubridate package is useful for converting to dates from character strings in a given format, e.g. mdy(), ymd(), etc.

  • The colorspace package is useful for programmatically darkening / lightening colors

  • scale_x_date: Set date_labels as "%b %y" for month-2 digit year, "%D" for date format such as %m/%d/%y, etc. See help for strptime() for more.

  • scale_color_identity() or scale_fill_identity() can be useful when your data already represents aesthetic values that ggplot2 can handle directly. By default doesn’t produce a legend.

What’s missing?

What’s missing in this plot?

Filling in the gaps

Filling in the gaps - Code

dch_2022 |>
  filter(!is.na(aqi_value)) |>
  ggplot(aes(x = date, y = aqi_value, group = 1)) +
  geom_line(linewidth = 0.8) +
  labs(
    x = NULL, y = "AQI",
    title = "Ozone and PM2.5 Daily AQI Values",
    subtitle = "Durham-Chapel Hill, NC",
    caption = "\nSource: EPA Daily Air Quality Tracker"
  )
dch_2022 |>
  mutate(
    impute = if_else(is.na(aqi_value), TRUE, FALSE),
    impute = if_else(is.na(lead(aqi_value)), TRUE, impute),
    aqi_value = if_else(is.na(aqi_value), (lag(aqi_value) + lead(aqi_value))/2, aqi_value)
  ) |>
  ggplot(aes(x = date, y = aqi_value, group = 1)) +
  geom_line(
    aes(color = impute), show.legend = FALSE,
    linewidth = 0.8
  ) +
  scale_color_manual(values = c("black", "cornflowerblue")) +
  labs(
    x = NULL, y = "AQI",
    title = "Ozone and PM2.5 Daily AQI Values",
    subtitle = "Durham-Chapel Hill, NC",
    caption = "\nSource: EPA Daily Air Quality Tracker"
  )

Calculating cumulatives

Cumulatives over time

  • When visualizing time series data, a somewhat common task is to calculate cumulatives over time and plot them

  • In our example we’ll calculate the number of days with “good” AQI (\(\le\) 50) and plot that value on the y-axis and the date on the x-axis

Calculating cumulatives

Step 1. Arrange your data

dch_2022 |>
  select(date, aqi_value) |>
  filter(!is.na(aqi_value)) |>
  arrange(date)
# A tibble: 364 × 2
  date       aqi_value
  <date>         <dbl>
1 2022-01-01        22
2 2022-01-02        12
3 2022-01-03        10
4 2022-01-04        21
5 2022-01-05        35
# ℹ 359 more rows

Calculating cumulatives

Step 2. Identify good days

dch_2022 |>
  select(date, aqi_value) |>
  filter(!is.na(aqi_value)) |>
  arrange(date) |>
  mutate(good_aqi = if_else(aqi_value <= 50, 1, 0))
# A tibble: 364 × 3
  date       aqi_value good_aqi
  <date>         <dbl>    <dbl>
1 2022-01-01        22        1
2 2022-01-02        12        1
3 2022-01-03        10        1
4 2022-01-04        21        1
5 2022-01-05        35        1
# ℹ 359 more rows

Calculating cumulatives

Step 3. Sum over time

dch_2022 |>
  select(date, aqi_value) |>
  filter(!is.na(aqi_value)) |>
  arrange(date) |>
  mutate(
    good_aqi = if_else(aqi_value <= 50, 1, 0),
    cumsum_good_aqi = cumsum(good_aqi)
  )
# A tibble: 364 × 4
  date       aqi_value good_aqi cumsum_good_aqi
  <date>         <dbl>    <dbl>           <dbl>
1 2022-01-01        22        1               1
2 2022-01-02        12        1               2
3 2022-01-03        10        1               3
4 2022-01-04        21        1               4
5 2022-01-05        35        1               5
# ℹ 359 more rows

Plotting cumulatives

dch_2022 |>
  select(date, aqi_value) |>
  filter(!is.na(aqi_value)) |>
  arrange(date) |>
  mutate(
    good_aqi = if_else(aqi_value <= 50, 1, 0),
    cumsum_good_aqi = cumsum(good_aqi)
  ) |>
  ggplot(aes(x = date, y = cumsum_good_aqi, group = 1)) +
  geom_line() +
  scale_x_date(date_labels = "%b %Y") +
  labs(
    x = NULL, y = "Number of days",
    title = "Cumulative number of good AQI days (AQI < 50)",
    subtitle = "Durham-Chapel Hill, NC",
    caption = "\nSource: EPA Daily Air Quality Tracker"
  ) +
  theme(plot.title.position = "plot")

Detrending

Detrending

  • Detrending is removing prominent long-term trend in time series to specifically highlight any notable deviations

  • Let’s demonstrate using multiple years of AQI data

Multiple years of Durham-Chapel Hill data

dch_files <- fs::dir_ls(here::here("data/durham-chapel-hill"))
dch_files
/Users/mine/Desktop/teaching/Duke/sta313-s24/vizdata-s24/slides/06/data/durham-chapel-hill/ad_aqi_tracker_data-2016.csv
/Users/mine/Desktop/teaching/Duke/sta313-s24/vizdata-s24/slides/06/data/durham-chapel-hill/ad_aqi_tracker_data-2017.csv
/Users/mine/Desktop/teaching/Duke/sta313-s24/vizdata-s24/slides/06/data/durham-chapel-hill/ad_aqi_tracker_data-2018.csv
/Users/mine/Desktop/teaching/Duke/sta313-s24/vizdata-s24/slides/06/data/durham-chapel-hill/ad_aqi_tracker_data-2019.csv
/Users/mine/Desktop/teaching/Duke/sta313-s24/vizdata-s24/slides/06/data/durham-chapel-hill/ad_aqi_tracker_data-2020.csv
/Users/mine/Desktop/teaching/Duke/sta313-s24/vizdata-s24/slides/06/data/durham-chapel-hill/ad_aqi_tracker_data-2021.csv
/Users/mine/Desktop/teaching/Duke/sta313-s24/vizdata-s24/slides/06/data/durham-chapel-hill/ad_aqi_tracker_data-2022.csv

Reading multiple files

dch <- read_csv(dch_files, na = c(".", ""))

dch <- dch |>
  janitor::clean_names() |>
  mutate(
    date = mdy(date),
    good_aqi = if_else(aqi_value <= 50, 1, 0)
  ) |>
  filter(!is.na(aqi_value)) |>
  arrange(date) |>
  mutate(cumsum_good_aqi = cumsum(good_aqi), .after = aqi_value)

dch
# A tibble: 2,547 × 13
  date       aqi_value cumsum_good_aqi main_pollutant site_name   site_id source
  <date>         <dbl>           <dbl> <chr>          <chr>       <chr>   <chr> 
1 2016-01-01        32               1 PM2.5          Durham Arm… 37-063… AQS   
2 2016-01-02        37               2 PM2.5          Durham Arm… 37-063… AQS   
3 2016-01-03        45               3 PM2.5          Durham Arm… 37-063… AQS   
4 2016-01-04        33               4 PM2.5          Durham Arm… 37-063… AQS   
5 2016-01-05        27               5 PM2.5          Durham Arm… 37-063… AQS   
# ℹ 2,542 more rows
# ℹ 6 more variables: x20_year_high_2000_2019 <dbl>,
#   x20_year_low_2000_2019 <dbl>, x5_year_average_2015_2019 <dbl>,
#   date_of_20_year_high <chr>, date_of_20_year_low <chr>, good_aqi <dbl>

Plot trend since 2016

dch |>
  ggplot(aes(x = date, y = cumsum_good_aqi, group = 1)) +
  geom_smooth(method = "lm", color = "pink") +
  geom_line() +
  scale_x_date(
    expand = expansion(mult = 0.07),
    date_labels = "%Y"
  ) +
  labs(
    x = NULL, y = "Number of days",
    title = "Cumulative number of good AQI days (AQI < 50)",
    subtitle = "Durham-Chapel Hill, NC",
    caption = "\nSource: EPA Daily Air Quality Tracker"
  ) +
  theme(plot.title.position = "plot")
`geom_smooth()` using formula = 'y ~ x'

Detrend

Step 1. Fit a simple linear regression

m <- lm(cumsum_good_aqi ~ date, data = dch)

m

Call:
lm(formula = cumsum_good_aqi ~ date, data = dch)

Coefficients:
(Intercept)         date  
 -1.341e+04    7.954e-01  

Detrend

Step 2. Augment the data with model results (using broom::augment())

dch_aug <- augment(m)

dch_aug
# A tibble: 2,547 × 8
  cumsum_good_aqi date       .fitted .resid    .hat .sigma .cooksd .std.resid
            <dbl> <date>       <dbl>  <dbl>   <dbl>  <dbl>   <dbl>      <dbl>
1               1 2016-01-01   -42.8   43.8 0.00157   25.4 0.00234       1.73
2               2 2016-01-02   -42.0   44.0 0.00157   25.4 0.00236       1.74
3               3 2016-01-03   -41.3   44.3 0.00156   25.4 0.00238       1.74
4               4 2016-01-04   -40.5   44.5 0.00156   25.4 0.00240       1.75
5               5 2016-01-05   -39.7   44.7 0.00156   25.4 0.00242       1.76
# ℹ 2,542 more rows

Detrend

Step 3. Divide the observed value of cumsum_good_aqi by the respective value in the long-term trend (i.e., .fitted)

dch_aug <- dch_aug |>
  mutate(ratio = cumsum_good_aqi / .fitted, .after = .fitted)


dch_aug
# A tibble: 2,547 × 9
  cumsum_good_aqi date       .fitted   ratio .resid    .hat .sigma .cooksd
            <dbl> <date>       <dbl>   <dbl>  <dbl>   <dbl>  <dbl>   <dbl>
1               1 2016-01-01   -42.8 -0.0233   43.8 0.00157   25.4 0.00234
2               2 2016-01-02   -42.0 -0.0476   44.0 0.00157   25.4 0.00236
3               3 2016-01-03   -41.3 -0.0727   44.3 0.00156   25.4 0.00238
4               4 2016-01-04   -40.5 -0.0989   44.5 0.00156   25.4 0.00240
5               5 2016-01-05   -39.7 -0.126    44.7 0.00156   25.4 0.00242
# ℹ 2,542 more rows
# ℹ 1 more variable: .std.resid <dbl>

Visualize detrended data

dch_aug |>
  ggplot(aes(x = date, y = ratio, group = 1)) +
  geom_hline(yintercept = 1, color = "gray") +
  geom_line() +
  scale_x_date(
    expand = expansion(mult = 0.07),
    date_labels = "%Y"
  ) +
  labs(
    x = NULL, y = "Number of days\n(detrended)",
    title = "Cumulative number of good AQI days (AQI < 50)",
    subtitle = "Durham-Chapel Hill, NC",
    caption = "\nSource: EPA Daily Air Quality Tracker"
  ) +
  theme(plot.title.position = "plot")

Air Quality in Durham



barely anything interesting happening!

let’s look at data from somewhere with a bit more “interesting” air quality data…

Read in multiple years of SF data

sf_files <- fs::dir_ls(here::here("data/san-francisco"))
sf <- read_csv(sf_files, na = c(".", ""))

sf <- sf |>
  janitor::clean_names() |>
  mutate(
    date = mdy(date),
    good_aqi = if_else(aqi_value <= 50, 1, 0)
  ) |>
  filter(!is.na(aqi_value)) |>
  arrange(date) |>
  mutate(cumsum_good_aqi = cumsum(good_aqi), .after = aqi_value)

sf
# A tibble: 2,557 × 13
  date       aqi_value cumsum_good_aqi main_pollutant site_name   site_id source
  <date>         <dbl>           <dbl> <chr>          <chr>       <chr>   <chr> 
1 2016-01-01        32               1 PM2.5          Durham Arm… 37-063… AQS   
2 2016-01-02        37               2 PM2.5          Durham Arm… 37-063… AQS   
3 2016-01-03        45               3 PM2.5          Durham Arm… 37-063… AQS   
4 2016-01-04        33               4 PM2.5          Durham Arm… 37-063… AQS   
5 2016-01-05        27               5 PM2.5          Durham Arm… 37-063… AQS   
# ℹ 2,552 more rows
# ℹ 6 more variables: x20_year_high_2000_2019 <dbl>,
#   x20_year_low_2000_2019 <dbl>, x5_year_average_2015_2019 <dbl>,
#   date_of_20_year_high <chr>, date_of_20_year_low <chr>, good_aqi <dbl>

Plot trend since 2016

sf |>
  ggplot(aes(x = date, y = cumsum_good_aqi, group = 1)) +
  geom_smooth(method = "lm", color = "pink") +
  geom_line() +
  scale_x_date(
    expand = expansion(mult = 0.07),
    date_labels = "%Y"
  ) +
  labs(
    x = NULL, y = "Number of days",
    title = "Cumulative number of good AQI days (AQI < 50)",
    subtitle = "San Francisco-Oakland-Hayward, CA",
    caption = "\nSource: EPA Daily Air Quality Tracker"
  ) +
  theme(plot.title.position = "plot")
`geom_smooth()` using formula = 'y ~ x'

Detrend

  1. Fit a simple linear regression
m_sf <- lm(cumsum_good_aqi ~ date, data = sf)
  1. Augment the data with model results
sf_aug <- augment(m_sf)
  1. Divide the observed value of cumsum_good_aqi by the respective value in the long-term trend (i.e., .fitted)
sf_aug <- sf_aug |>
  mutate(ratio = cumsum_good_aqi / .fitted, .after = .fitted)

Visualize detrended data

sf_aug |>
  ggplot(aes(x = date, y = ratio, group = 1)) +
  geom_hline(yintercept = 1, color = "gray") +
  geom_line() +
  scale_x_date(
    expand = expansion(mult = 0.07),
    date_labels = "%Y"
  ) +
  labs(
    x = NULL, y = "Number of days\n(detrended)",
    title = "Cumulative number of good AQI days (AQI < 50)",
    subtitle = "San Francisco-Oakland-Hayward, CA",
    caption = "\nSource: EPA Daily Air Quality Tracker"
  ) +
  theme(plot.title.position = "plot")

Detrending

  • In step 2 we fit a very simple model

  • Depending on the complexity you’re trying to capture you might choose to fit a much more complex model

  • You can also decompose the trend into multiple trends, e.g. monthly, long-term, seasonal, etc.