Visualizing time series data - II

Lecture 6

Dr. Mine Çetinkaya-Rundel

Duke University
STA 313 - Spring 2024

Warm up

Announcements

Reading Quiz 2 is due at 10 am on Tuesday
Project 1 proposals due at 1 pm on Wednesday

Setup

# load packages
library(countdown)
library(tidyverse)
library(janitor)
library(colorspace)
library(broom)
library(fs)

# set theme for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 14))

# set figure parameters for knitr
knitr::opts_chunk$set(
  fig.width = 7, # 7" width
  fig.asp = 0.618, # the golden ratio
  fig.retina = 3, # dpi multiplier for displaying HTML output on retina
  fig.align = "center", # center align figures
  dpi = 300 # higher dpi, sharper image
)

Working with dates

Air Quality Index

The AQI is the Environmental Protection Agency’s index for reporting air quality
Higher values of AQI indicate worse air quality

AQI levels

The previous graphic in tibble form, to be used later…

aqi_levels <- tribble(
  ~aqi_min, ~aqi_max, ~color,    ~level,
  0,        50,       "#D8EEDA", "Good",
  51,       100,      "#F1E7D4", "Moderate",
  101,      150,      "#F8E4D8", "Unhealthy for sensitive groups",
  151,      200,      "#FEE2E1", "Unhealthy",
  201,      300,      "#F4E3F7", "Very unhealthy",
  301,      400,      "#F9D0D4", "Hazardous"
)

AQI data

Source: EPA’s Daily Air Quality Tracker
2016 - 2022 AQI (Ozone and PM2.5 combined) for Durham-Chapel Hill, NC core-based statistical area (CBSA), one file per year
2016 - 2022 AQI (Ozone and PM2.5 combined) for San Francisco-Oakland-Hayward, CA CBSA, one file per year

2022 Durham-Chapel Hill

Load data

dch_2022 <- read_csv(here::here("data/durham-chapel-hill/ad_aqi_tracker_data-2022.csv"))

Metadata

dim(dch_2022)

[1] 365  11

names(dch_2022)

 [1] "Date"                       "AQI Value"                 
 [3] "Main Pollutant"             "Site Name"                 
 [5] "Site ID"                    "Source"                    
 [7] "20-year High (2000-2019)"   "20-year Low (2000-2019)"   
 [9] "5-year Average (2015-2019)" "Date of 20-year High"      
[11] "Date of 20-year Low"

Clean variable names

dch_2022 <- dch_2022 |>
  janitor::clean_names()

names(dch_2022)

 [1] "date"                      "aqi_value"                
 [3] "main_pollutant"            "site_name"                
 [5] "site_id"                   "source"                   
 [7] "x20_year_high_2000_2019"   "x20_year_low_2000_2019"   
 [9] "x5_year_average_2015_2019" "date_of_20_year_high"     
[11] "date_of_20_year_low"

First look

This plot looks quite bizarre. What might be going on?

ggplot(dch_2022, aes(x = date, y = aqi_value, group = 1)) +
  geom_line()

Peek at data

dch_2022 |>
  select(date, aqi_value, site_name, site_id)

# A tibble: 365 × 4
   date       aqi_value site_name     site_id    
   <chr>      <chr>     <chr>         <chr>      
 1 01/01/2022 22        Durham Armory 37-063-0015
 2 01/02/2022 12        Durham Armory 37-063-0015
 3 01/03/2022 10        Durham Armory 37-063-0015
 4 01/04/2022 21        Durham Armory 37-063-0015
 5 01/05/2022 35        Durham Armory 37-063-0015
 6 01/06/2022 29        Durham Armory 37-063-0015
 7 01/07/2022 15        Durham Armory 37-063-0015
 8 01/08/2022 28        Durham Armory 37-063-0015
 9 01/09/2022 28        Durham Armory 37-063-0015
10 01/10/2022 15        Durham Armory 37-063-0015
# ℹ 355 more rows

Transforming date

Using lubridate::mdy():

dch_2022 |>
  mutate(date = mdy(date))

# A tibble: 365 × 11
   date       aqi_value main_pollutant site_name     site_id     source
   <date>     <chr>     <chr>          <chr>         <chr>       <chr> 
 1 2022-01-01 22        PM2.5          Durham Armory 37-063-0015 AQS   
 2 2022-01-02 12        PM2.5          Durham Armory 37-063-0015 AQS   
 3 2022-01-03 10        PM2.5          Durham Armory 37-063-0015 AQS   
 4 2022-01-04 21        PM2.5          Durham Armory 37-063-0015 AQS   
 5 2022-01-05 35        PM2.5          Durham Armory 37-063-0015 AQS   
 6 2022-01-06 29        PM2.5          Durham Armory 37-063-0015 AQS   
 7 2022-01-07 15        PM2.5          Durham Armory 37-063-0015 AQS   
 8 2022-01-08 28        PM2.5          Durham Armory 37-063-0015 AQS   
 9 2022-01-09 28        PM2.5          Durham Armory 37-063-0015 AQS   
10 2022-01-10 15        PM2.5          Durham Armory 37-063-0015 AQS   
# ℹ 355 more rows
# ℹ 5 more variables: x20_year_high_2000_2019 <dbl>,
#   x20_year_low_2000_2019 <dbl>, x5_year_average_2015_2019 <dbl>,
#   date_of_20_year_high <chr>, date_of_20_year_low <chr>

Transforming AQI values

What does this warning mean?

dch_2022 |>
  mutate(aqi_value = as.numeric(aqi_value))

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `aqi_value = as.numeric(aqi_value)`.
Caused by warning:
! NAs introduced by coercion

# A tibble: 365 × 11
  date  aqi_value main_pollutant site_name site_id source x20_year_high_2000_2…¹
  <chr>     <dbl> <chr>          <chr>     <chr>   <chr>                   <dbl>
1 01/0…        22 PM2.5          Durham A… 37-063… AQS                       111
2 01/0…        12 PM2.5          Durham A… 37-063… AQS                        76
3 01/0…        10 PM2.5          Durham A… 37-063… AQS                        66
# ℹ 362 more rows
# ℹ abbreviated name: ¹x20_year_high_2000_2019
# ℹ 4 more variables: x20_year_low_2000_2019 <dbl>,
#   x5_year_average_2015_2019 <dbl>, date_of_20_year_high <chr>,
#   date_of_20_year_low <chr>

Investigating AQI values

Take a peek at distinct values of AQI

dch_2022 |>
  distinct(aqi_value) |>
  pull()

 [1] "22" "12" "10" "21" "35" "29" "15" "28" "25" "36" "49" "19" "24" "46" "38"
[16] "48" "34" "47" "57" "26" "45" "42" "13" "31" "39" "62" "40" "30" "20" "18"
[31] "27" "67" "41" "56" "37" "32" "43" "51" "50" "44" "33" "54" "52" "77" "74"
[46] "53" "58" "64" "61" "93" "16" "17" "23" "5"  "6"  "."  "9"  "14" "59"

"." likely indicates NA, and it’s causing the entire column to be read in as characters

Rewind, and start over

dch_2022 <- read_csv(
  here::here("data/durham-chapel-hill/ad_aqi_tracker_data-2022.csv"),
  na = c(".", "")
)

glimpse(dch_2022)

Rows: 365
Columns: 11
$ Date                         <chr> "01/01/2022", "01/02/2022", "01/03/2022",…
$ `AQI Value`                  <dbl> 22, 12, 10, 21, 35, 29, 15, 28, 28, 15, 2…
$ `Main Pollutant`             <chr> "PM2.5", "PM2.5", "PM2.5", "PM2.5", "PM2.…
$ `Site Name`                  <chr> "Durham Armory", "Durham Armory", "Durham…
$ `Site ID`                    <chr> "37-063-0015", "37-063-0015", "37-063-001…
$ Source                       <chr> "AQS", "AQS", "AQS", "AQS", "AQS", "AQS",…
$ `20-year High (2000-2019)`   <dbl> 111, 76, 66, 61, 83, 71, 75, 76, 57, 71, …
$ `20-year Low (2000-2019)`    <dbl> 10, 8, 14, 9, 8, 15, 18, 15, 19, 20, 14, …
$ `5-year Average (2015-2019)` <dbl> 39.2, 36.8, 38.2, 30.4, 26.0, 32.4, 37.0,…
$ `Date of 20-year High`       <chr> "01/01/2000", "01/02/2005", "01/03/2004",…
$ `Date of 20-year Low`        <chr> "01/01/2007", "01/02/2012", "01/03/2012",…

Data cleaning

dch_2022 <- dch_2022 |>
  janitor::clean_names() |>
  mutate(date = mdy(date))

dch_2022

# A tibble: 365 × 11
   date       aqi_value main_pollutant site_name     site_id     source
   <date>         <dbl> <chr>          <chr>         <chr>       <chr> 
 1 2022-01-01        22 PM2.5          Durham Armory 37-063-0015 AQS   
 2 2022-01-02        12 PM2.5          Durham Armory 37-063-0015 AQS   
 3 2022-01-03        10 PM2.5          Durham Armory 37-063-0015 AQS   
 4 2022-01-04        21 PM2.5          Durham Armory 37-063-0015 AQS   
 5 2022-01-05        35 PM2.5          Durham Armory 37-063-0015 AQS   
 6 2022-01-06        29 PM2.5          Durham Armory 37-063-0015 AQS   
 7 2022-01-07        15 PM2.5          Durham Armory 37-063-0015 AQS   
 8 2022-01-08        28 PM2.5          Durham Armory 37-063-0015 AQS   
 9 2022-01-09        28 PM2.5          Durham Armory 37-063-0015 AQS   
10 2022-01-10        15 PM2.5          Durham Armory 37-063-0015 AQS   
# ℹ 355 more rows
# ℹ 5 more variables: x20_year_high_2000_2019 <dbl>,
#   x20_year_low_2000_2019 <dbl>, x5_year_average_2015_2019 <dbl>,
#   date_of_20_year_high <chr>, date_of_20_year_low <chr>

Another look

ggplot(dch_2022, aes(x = date, y = aqi_value, group = 1)) +
  geom_line()

Improve

How would you improve this visualization?

Visualizing Durham AQI - v1

Recreate the following visualization.

dch_2022 |>
  ggplot(aes(x = date, y = aqi_value, group = 1)) +
  geom_line(linewidth = 0.8) +
  scale_x_date(
    name = NULL, date_labels = "%b",
    limits = c(ymd("2022-01-01"), ymd("2023-03-01"))
  ) +
  scale_y_continuous(breaks = c(0, 50, 100, 150, 200, 300, 400)) +
  geom_text(
    data = aqi_levels,
    aes(
      x = ymd("2023-02-28"), y = aqi_mid, 
      label = level, color = darken(color, 0.3)
    ),
    hjust = 1, size = 6, fontface = "bold"
  ) +
  scale_color_identity() +
  annotate(
    geom = "text",
    x = c(ymd("2022-01-01"), ymd("2023-01-01")), y = -80,
    label = c("2022", "2023"), size = 4
  ) +
  coord_cartesian(clip = "off", ylim = c(0, 400)) +
  labs(
    x = NULL, y = "AQI",
    title = "Ozone and PM2.5 Daily AQI Values",
    subtitle = "Durham-Chapel Hill, NC",
    caption = "\nSource: EPA Daily Air Quality Tracker"
  ) +
  theme(
    plot.title.position = "plot",
    panel.grid.minor.y = element_blank(),
    panel.grid.minor.x = element_blank()
  )

Highlights

The lubridate package is useful for converting to dates from character strings in a given format, e.g. mdy(), ymd(), etc.
The colorspace package is useful for programmatically darkening / lightening colors
scale_x_date: Set date_labels as "%b %y" for month-2 digit year, "%D" for date format such as %m/%d/%y, etc. See help for strptime() for more.
scale_color_identity() or scale_fill_identity() can be useful when your data already represents aesthetic values that ggplot2 can handle directly. By default doesn’t produce a legend.

What’s missing?

What’s missing in this plot?

Filling in the gaps

Option 1
Option 2

Filling in the gaps - Code

Option 1
Option 2

dch_2022 |>
  filter(!is.na(aqi_value)) |>
  ggplot(aes(x = date, y = aqi_value, group = 1)) +
  geom_line(linewidth = 0.8) +
  labs(
    x = NULL, y = "AQI",
    title = "Ozone and PM2.5 Daily AQI Values",
    subtitle = "Durham-Chapel Hill, NC",
    caption = "\nSource: EPA Daily Air Quality Tracker"
  )

dch_2022 |>
  mutate(
    impute = if_else(is.na(aqi_value), TRUE, FALSE),
    impute = if_else(is.na(lead(aqi_value)), TRUE, impute),
    aqi_value = if_else(is.na(aqi_value), (lag(aqi_value) + lead(aqi_value))/2, aqi_value)
  ) |>
  ggplot(aes(x = date, y = aqi_value, group = 1)) +
  geom_line(
    aes(color = impute), show.legend = FALSE,
    linewidth = 0.8
  ) +
  scale_color_manual(values = c("black", "cornflowerblue")) +
  labs(
    x = NULL, y = "AQI",
    title = "Ozone and PM2.5 Daily AQI Values",
    subtitle = "Durham-Chapel Hill, NC",
    caption = "\nSource: EPA Daily Air Quality Tracker"
  )

Calculating cumulatives

Cumulatives over time

When visualizing time series data, a somewhat common task is to calculate cumulatives over time and plot them
In our example we’ll calculate the number of days with “good” AQI (\(\le\) 50) and plot that value on the y-axis and the date on the x-axis

Calculating cumulatives

Step 1. Arrange your data

dch_2022 |>
  select(date, aqi_value) |>
  filter(!is.na(aqi_value)) |>
  arrange(date)

# A tibble: 364 × 2
  date       aqi_value
  <date>         <dbl>
1 2022-01-01        22
2 2022-01-02        12
3 2022-01-03        10
4 2022-01-04        21
5 2022-01-05        35
# ℹ 359 more rows

Calculating cumulatives

Step 2. Identify good days

dch_2022 |>
  select(date, aqi_value) |>
  filter(!is.na(aqi_value)) |>
  arrange(date) |>
  mutate(good_aqi = if_else(aqi_value <= 50, 1, 0))

# A tibble: 364 × 3
  date       aqi_value good_aqi
  <date>         <dbl>    <dbl>
1 2022-01-01        22        1
2 2022-01-02        12        1
3 2022-01-03        10        1
4 2022-01-04        21        1
5 2022-01-05        35        1
# ℹ 359 more rows

Calculating cumulatives

Step 3. Sum over time

dch_2022 |>
  select(date, aqi_value) |>
  filter(!is.na(aqi_value)) |>
  arrange(date) |>
  mutate(
    good_aqi = if_else(aqi_value <= 50, 1, 0),
    cumsum_good_aqi = cumsum(good_aqi)
  )

# A tibble: 364 × 4
  date       aqi_value good_aqi cumsum_good_aqi
  <date>         <dbl>    <dbl>           <dbl>
1 2022-01-01        22        1               1
2 2022-01-02        12        1               2
3 2022-01-03        10        1               3
4 2022-01-04        21        1               4
5 2022-01-05        35        1               5
# ℹ 359 more rows

Plotting cumulatives

Plot
Code

dch_2022 |>
  select(date, aqi_value) |>
  filter(!is.na(aqi_value)) |>
  arrange(date) |>
  mutate(
    good_aqi = if_else(aqi_value <= 50, 1, 0),
    cumsum_good_aqi = cumsum(good_aqi)
  ) |>
  ggplot(aes(x = date, y = cumsum_good_aqi, group = 1)) +
  geom_line() +
  scale_x_date(date_labels = "%b %Y") +
  labs(
    x = NULL, y = "Number of days",
    title = "Cumulative number of good AQI days (AQI < 50)",
    subtitle = "Durham-Chapel Hill, NC",
    caption = "\nSource: EPA Daily Air Quality Tracker"
  ) +
  theme(plot.title.position = "plot")

Detrending

Detrending is removing prominent long-term trend in time series to specifically highlight any notable deviations
Let’s demonstrate using multiple years of AQI data

Multiple years of Durham-Chapel Hill data

dch_files <- fs::dir_ls(here::here("data/durham-chapel-hill"))
dch_files

/Users/mine/Desktop/teaching/Duke/sta313-s24/vizdata-s24/slides/06/data/durham-chapel-hill/ad_aqi_tracker_data-2016.csv
/Users/mine/Desktop/teaching/Duke/sta313-s24/vizdata-s24/slides/06/data/durham-chapel-hill/ad_aqi_tracker_data-2017.csv
/Users/mine/Desktop/teaching/Duke/sta313-s24/vizdata-s24/slides/06/data/durham-chapel-hill/ad_aqi_tracker_data-2018.csv
/Users/mine/Desktop/teaching/Duke/sta313-s24/vizdata-s24/slides/06/data/durham-chapel-hill/ad_aqi_tracker_data-2019.csv
/Users/mine/Desktop/teaching/Duke/sta313-s24/vizdata-s24/slides/06/data/durham-chapel-hill/ad_aqi_tracker_data-2020.csv
/Users/mine/Desktop/teaching/Duke/sta313-s24/vizdata-s24/slides/06/data/durham-chapel-hill/ad_aqi_tracker_data-2021.csv
/Users/mine/Desktop/teaching/Duke/sta313-s24/vizdata-s24/slides/06/data/durham-chapel-hill/ad_aqi_tracker_data-2022.csv

Reading multiple files

dch <- read_csv(dch_files, na = c(".", ""))

dch <- dch |>
  janitor::clean_names() |>
  mutate(
    date = mdy(date),
    good_aqi = if_else(aqi_value <= 50, 1, 0)
  ) |>
  filter(!is.na(aqi_value)) |>
  arrange(date) |>
  mutate(cumsum_good_aqi = cumsum(good_aqi), .after = aqi_value)

dch

# A tibble: 2,547 × 13
  date       aqi_value cumsum_good_aqi main_pollutant site_name   site_id source
  <date>         <dbl>           <dbl> <chr>          <chr>       <chr>   <chr> 
1 2016-01-01        32               1 PM2.5          Durham Arm… 37-063… AQS   
2 2016-01-02        37               2 PM2.5          Durham Arm… 37-063… AQS   
3 2016-01-03        45               3 PM2.5          Durham Arm… 37-063… AQS   
4 2016-01-04        33               4 PM2.5          Durham Arm… 37-063… AQS   
5 2016-01-05        27               5 PM2.5          Durham Arm… 37-063… AQS   
# ℹ 2,542 more rows
# ℹ 6 more variables: x20_year_high_2000_2019 <dbl>,
#   x20_year_low_2000_2019 <dbl>, x5_year_average_2015_2019 <dbl>,
#   date_of_20_year_high <chr>, date_of_20_year_low <chr>, good_aqi <dbl>

Plot trend since 2016

Plot
Code

dch |>
  ggplot(aes(x = date, y = cumsum_good_aqi, group = 1)) +
  geom_smooth(method = "lm", color = "pink") +
  geom_line() +
  scale_x_date(
    expand = expansion(mult = 0.07),
    date_labels = "%Y"
  ) +
  labs(
    x = NULL, y = "Number of days",
    title = "Cumulative number of good AQI days (AQI < 50)",
    subtitle = "Durham-Chapel Hill, NC",
    caption = "\nSource: EPA Daily Air Quality Tracker"
  ) +
  theme(plot.title.position = "plot")

`geom_smooth()` using formula = 'y ~ x'

Detrend

Step 1. Fit a simple linear regression

m <- lm(cumsum_good_aqi ~ date, data = dch)

m


Call:
lm(formula = cumsum_good_aqi ~ date, data = dch)

Coefficients:
(Intercept)         date  
 -1.341e+04    7.954e-01

Detrend

Step 2. Augment the data with model results (using broom::augment())

dch_aug <- augment(m)

dch_aug

# A tibble: 2,547 × 8
  cumsum_good_aqi date       .fitted .resid    .hat .sigma .cooksd .std.resid
            <dbl> <date>       <dbl>  <dbl>   <dbl>  <dbl>   <dbl>      <dbl>
1               1 2016-01-01   -42.8   43.8 0.00157   25.4 0.00234       1.73
2               2 2016-01-02   -42.0   44.0 0.00157   25.4 0.00236       1.74
3               3 2016-01-03   -41.3   44.3 0.00156   25.4 0.00238       1.74
4               4 2016-01-04   -40.5   44.5 0.00156   25.4 0.00240       1.75
5               5 2016-01-05   -39.7   44.7 0.00156   25.4 0.00242       1.76
# ℹ 2,542 more rows

Detrend

Step 3. Divide the observed value of cumsum_good_aqi by the respective value in the long-term trend (i.e., .fitted)

dch_aug <- dch_aug |>
  mutate(ratio = cumsum_good_aqi / .fitted, .after = .fitted)


dch_aug

# A tibble: 2,547 × 9
  cumsum_good_aqi date       .fitted   ratio .resid    .hat .sigma .cooksd
            <dbl> <date>       <dbl>   <dbl>  <dbl>   <dbl>  <dbl>   <dbl>
1               1 2016-01-01   -42.8 -0.0233   43.8 0.00157   25.4 0.00234
2               2 2016-01-02   -42.0 -0.0476   44.0 0.00157   25.4 0.00236
3               3 2016-01-03   -41.3 -0.0727   44.3 0.00156   25.4 0.00238
4               4 2016-01-04   -40.5 -0.0989   44.5 0.00156   25.4 0.00240
5               5 2016-01-05   -39.7 -0.126    44.7 0.00156   25.4 0.00242
# ℹ 2,542 more rows
# ℹ 1 more variable: .std.resid <dbl>

Visualize detrended data

Plot
Code

dch_aug |>
  ggplot(aes(x = date, y = ratio, group = 1)) +
  geom_hline(yintercept = 1, color = "gray") +
  geom_line() +
  scale_x_date(
    expand = expansion(mult = 0.07),
    date_labels = "%Y"
  ) +
  labs(
    x = NULL, y = "Number of days\n(detrended)",
    title = "Cumulative number of good AQI days (AQI < 50)",
    subtitle = "Durham-Chapel Hill, NC",
    caption = "\nSource: EPA Daily Air Quality Tracker"
  ) +
  theme(plot.title.position = "plot")

Air Quality in Durham

barely anything interesting happening!

let’s look at data from somewhere with a bit more “interesting” air quality data…

Read in multiple years of SF data

sf_files <- fs::dir_ls(here::here("data/san-francisco"))

sf <- read_csv(sf_files, na = c(".", ""))

sf <- sf |>
  janitor::clean_names() |>
  mutate(
    date = mdy(date),
    good_aqi = if_else(aqi_value <= 50, 1, 0)
  ) |>
  filter(!is.na(aqi_value)) |>
  arrange(date) |>
  mutate(cumsum_good_aqi = cumsum(good_aqi), .after = aqi_value)

sf

# A tibble: 2,557 × 13
  date       aqi_value cumsum_good_aqi main_pollutant site_name   site_id source
  <date>         <dbl>           <dbl> <chr>          <chr>       <chr>   <chr> 
1 2016-01-01        32               1 PM2.5          Durham Arm… 37-063… AQS   
2 2016-01-02        37               2 PM2.5          Durham Arm… 37-063… AQS   
3 2016-01-03        45               3 PM2.5          Durham Arm… 37-063… AQS   
4 2016-01-04        33               4 PM2.5          Durham Arm… 37-063… AQS   
5 2016-01-05        27               5 PM2.5          Durham Arm… 37-063… AQS   
# ℹ 2,552 more rows
# ℹ 6 more variables: x20_year_high_2000_2019 <dbl>,
#   x20_year_low_2000_2019 <dbl>, x5_year_average_2015_2019 <dbl>,
#   date_of_20_year_high <chr>, date_of_20_year_low <chr>, good_aqi <dbl>

Plot trend since 2016

Plot
Code

sf |>
  ggplot(aes(x = date, y = cumsum_good_aqi, group = 1)) +
  geom_smooth(method = "lm", color = "pink") +
  geom_line() +
  scale_x_date(
    expand = expansion(mult = 0.07),
    date_labels = "%Y"
  ) +
  labs(
    x = NULL, y = "Number of days",
    title = "Cumulative number of good AQI days (AQI < 50)",
    subtitle = "San Francisco-Oakland-Hayward, CA",
    caption = "\nSource: EPA Daily Air Quality Tracker"
  ) +
  theme(plot.title.position = "plot")

`geom_smooth()` using formula = 'y ~ x'

Detrend

Fit a simple linear regression

m_sf <- lm(cumsum_good_aqi ~ date, data = sf)

Augment the data with model results

sf_aug <- augment(m_sf)

Divide the observed value of cumsum_good_aqi by the respective value in the long-term trend (i.e., .fitted)

sf_aug <- sf_aug |>
  mutate(ratio = cumsum_good_aqi / .fitted, .after = .fitted)

Visualize detrended data

Plot
Code

sf_aug |>
  ggplot(aes(x = date, y = ratio, group = 1)) +
  geom_hline(yintercept = 1, color = "gray") +
  geom_line() +
  scale_x_date(
    expand = expansion(mult = 0.07),
    date_labels = "%Y"
  ) +
  labs(
    x = NULL, y = "Number of days\n(detrended)",
    title = "Cumulative number of good AQI days (AQI < 50)",
    subtitle = "San Francisco-Oakland-Hayward, CA",
    caption = "\nSource: EPA Daily Air Quality Tracker"
  ) +
  theme(plot.title.position = "plot")

Detrending

In step 2 we fit a very simple model
Depending on the complexity you’re trying to capture you might choose to fit a much more complex model
You can also decompose the trend into multiple trends, e.g. monthly, long-term, seasonal, etc.