Deep dive into ggplot2 layers

Lecture 2

Dr. Mine Çetinkaya-Rundel

Duke University
STA 313 - Spring 2026

Warm up

Announcements

Office hours + locations finalized at vizdata.org/course-team.html
Don’t forget to complete the getting to know you survey by 8pm today

Setup

# load packages
library(tidyverse)
library(scales)
library(openintro)

# set theme for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 14))

# set figure parameters for knitr
knitr::opts_chunk$set(
  fig.width = 7, # 7" width
  fig.asp = 0.618, # the golden ratio
  fig.retina = 3, # dpi multiplier for displaying HTML output on retina
  fig.align = "center", # center align figures
  dpi = 300 # higher dpi, sharper image
)

A/B testing

Data: Sale prices of houses in Duke Forest

Data on houses that were sold in the Duke Forest neighborhood of Durham, NC around November 2020
Scraped from Zillow
Source: openintro::duke_forest

`openintro::duke_forest`

glimpse(duke_forest)

Rows: 98
Columns: 13
$ address    <chr> "1 Learned Pl, Durham, NC 27705", "1616 Pinecrest Rd, Durha…
$ price      <dbl> 1520000, 1030000, 420000, 680000, 428500, 456000, 1270000, …
$ bed        <dbl> 3, 5, 2, 4, 4, 3, 5, 4, 4, 3, 4, 4, 3, 5, 4, 5, 3, 4, 4, 3,…
$ bath       <dbl> 4.0, 4.0, 3.0, 3.0, 3.0, 3.0, 5.0, 3.0, 5.0, 2.0, 3.0, 3.0,…
$ area       <dbl> 6040, 4475, 1745, 2091, 1772, 1950, 3909, 2841, 3924, 2173,…
$ type       <chr> "Single Family", "Single Family", "Single Family", "Single …
$ year_built <dbl> 1972, 1969, 1959, 1961, 2020, 2014, 1968, 1973, 1972, 1964,…
$ heating    <chr> "Other, Gas", "Forced air, Gas", "Forced air, Gas", "Heat p…
$ cooling    <fct> central, central, central, central, central, central, centr…
$ parking    <chr> "0 spaces", "Carport, Covered", "Garage - Attached, Covered…
$ lot        <dbl> 0.97, 1.38, 0.51, 0.84, 0.16, 0.45, 0.94, 0.79, 0.53, 0.73,…
$ hoa        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ url        <chr> "https://www.zillow.com/homedetails/1-Learned-Pl-Durham-NC-…

A simple visualization

ggplot(duke_forest, aes(x = area, y = price)) +
  geom_point(alpha = 0.7, size = 2) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 0.7) +
  labs(
    x = "Area (square feet)",
    y = "Sale price (USD)",
    title = "Price and area of houses in Duke Forest"
  )

`geom_smooth()` using formula = 'y ~ x'

New variable: `decade_built`

duke_forest <- duke_forest |>
  mutate(decade_built = (year_built %/% 10) * 10)

duke_forest |>
  select(year_built, decade_built)

# A tibble: 98 × 2
   year_built decade_built
        <dbl>        <dbl>
 1       1972         1970
 2       1969         1960
 3       1959         1950
 4       1961         1960
 5       2020         2020
 6       2014         2010
 7       1968         1960
 8       1973         1970
 9       1972         1970
10       1964         1960
# ℹ 88 more rows
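
The decade calculation relies on integer division: `%/%` drops the last digit of the year, and multiplying by 10 puts a zero back in its place. Here is a minimal sketch of that arithmetic on three of the years shown above. The vector name `years` is made up for illustration, and the modulo version is an alternative formulation, not something used in the slides.

```r
# Three build years taken from the output above
years <- c(1972, 1969, 2020)

# Integer division by 10 drops the last digit
years %/% 10
# 197 196 202

# Multiplying by 10 restores the scale, giving the decade
decades <- (years %/% 10) * 10
decades
# 1970 1960 2020

# An equivalent formulation (not from the slides): subtract the remainder
decades_alt <- years - years %% 10
```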

New variable: `decade_built_cat`

duke_forest <- duke_forest |>
  mutate(
    decade_built_cat = case_when(
      decade_built <= 1940 ~ "1940 or before",
      decade_built >= 1990 ~ "1990 or after",
      .default = as.character(decade_built)
    )
  )

duke_forest |>
  count(decade_built_cat)

# A tibble: 6 × 2
  decade_built_cat     n
  <chr>            <int>
1 1940 or before       8
2 1950                26
3 1960                32
4 1970                11
5 1980                13
6 1990 or after        8

A slightly more complex visualization

ggplot(
  duke_forest,
  aes(x = area, y = price, color = decade_built_cat)
) +
  geom_point(alpha = 0.7, show.legend = FALSE) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 0.5, show.legend = FALSE) +
  facet_wrap(~decade_built_cat) +
  labs(
    x = "Area (square feet)",
    y = "Sale price (USD)",
    color = "Decade built",
    title = "Price and area of houses in Duke Forest"
  )

`geom_smooth()` using formula = 'y ~ x'

A/B testing

In the next two slides, the same plots are created with different “cosmetic” choices. Examine the two plots given (Plot A and Plot B), and indicate your preference by voting for one of them in the Vote tab.

Test 1

Plot A
Plot B
Vote

Test 2

Plot A
Plot B
Vote

What makes figures bad?

Bad taste

Data-to-ink ratio

Tufte strongly recommends maximizing the data-to-ink ratio in The Visual Display of Quantitative Information (Tufte, 1983).

Graphical excellence is the well-designed presentation of interesting data—a matter of substance, of statistics, and of design … [It] consists of complex ideas communicated with clarity, precision, and efficiency. … [It] is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space … [It] is nearly always multivariate … And graphical excellence requires telling the truth about the data. (Tufte, 1983, p. 51).
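
One way to experiment with the data-to-ink ratio in ggplot2 is to draw the same data twice and strip non-data elements from one version. The sketch below is only an illustration of that idea, not a rule from Tufte or from these slides. It uses the first seven houses from the `glimpse()` output above, and the object names (`houses`, `base_plot`, `more_ink`, `less_ink`) are made up for this example.

```r
library(ggplot2)

# First seven houses, copied from the glimpse() output above
houses <- data.frame(
  area  = c(6040, 4475, 1745, 2091, 1772, 1950, 3909),
  price = c(1520000, 1030000, 420000, 680000, 428500, 456000, 1270000)
)

base_plot <- ggplot(houses, aes(x = area, y = price)) +
  geom_point(size = 2)

# More non-data ink: filled gray panel plus major and minor grid lines
more_ink <- base_plot + theme_gray()

# Less non-data ink: no panel fill and no minor grid lines
less_ink <- base_plot +
  theme_minimal() +
  theme(panel.grid.minor = element_blank())
```

Printing `more_ink` and `less_ink` side by side shows the same seven points with different amounts of ink spent on things other than the data.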

Cover of The Visual Display of Quantitative Information

Which of the plots has a higher data-to-ink ratio?

Plot A
Plot B
Vote

A deeper look

at the plotting code

Summary statistics

mean_area_decade <- duke_forest |>
  group_by(decade_built_cat) |>
  summarize(mean_area = mean(area))

mean_area_decade

# A tibble: 6 × 2
  decade_built_cat mean_area
  <chr>                <dbl>
1 1940 or before       2072.
2 1950                 2545.
3 1960                 2873.
4 1970                 3413.
5 1980                 2889.
6 1990 or after        2822.

Barplot

ggplot(
  mean_area_decade,
  aes(y = decade_built_cat, x = mean_area)
) +
  geom_col() +
  labs(
    x = "Mean area (square feet)",
    y = "Decade built",
    title = "Mean area of houses in Duke Forest, by decade built"
  )

Scatterplot

ggplot(
  mean_area_decade,
  aes(y = decade_built_cat, x = mean_area)
) +
  geom_point(size = 4) +
  labs(
    x = "Mean area (square feet)",
    y = "Decade built",
    title = "Mean area of houses in Duke Forest, by decade built"
  )

Lollipop chart – a happy medium?
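
A lollipop chart can be built by layering a thin segment (the stick) under a point (the candy), which sits between the bar and the point versions above. The sketch below is one possible way to do this, not necessarily the version built in the application exercise. To keep it self-contained, the means are typed in from the rounded values printed in the summary table, and the object name `lollipop` is made up for this example.

```r
library(ggplot2)

# Mean areas copied from the printed summary table (rounded values)
mean_area_decade <- data.frame(
  decade_built_cat = c(
    "1940 or before", "1950", "1960", "1970", "1980", "1990 or after"
  ),
  mean_area = c(2072, 2545, 2873, 3413, 2889, 2822)
)

lollipop <- ggplot(
  mean_area_decade,
  aes(y = decade_built_cat, x = mean_area)
) +
  # the stick: a thin line from 0 to the mean, one per category
  geom_segment(
    aes(x = 0, xend = mean_area, yend = decade_built_cat),
    linewidth = 0.7
  ) +
  # the candy: a point at the mean, drawn on top of the stick
  geom_point(size = 4) +
  labs(
    x = "Mean area (square feet)",
    y = "Decade built",
    title = "Mean area of houses in Duke Forest, by decade built"
  )

lollipop
```

Starting the segment at 0 keeps the length comparison that the bar plot offers, while the thin line uses far less ink than a filled bar.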

Application exercise

Go to the course GitHub organization: https://github.com/vizdata-s26
Clone the repo called ae-01-YOUR-GITHUB-NAME and work on the exercise.
Once you’re done, commit and push your changes to GitHub.
Label your cell(s) and pay attention to code style and formatting!
