Deep dive into ggplot2 layers

Lecture 2

Dr. Mine Çetinkaya-Rundel

Duke University
STA 313 - Spring 2026

Warm up

Announcements

Setup

# load packages
library(tidyverse)
library(scales)
library(openintro)

# set theme for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 14))

# set figure parameters for knitr
knitr::opts_chunk$set(
  fig.width = 7, # 7" width
  fig.asp = 0.618, # the golden ratio
  fig.retina = 3, # dpi multiplier for displaying HTML output on retina
  fig.align = "center", # center align figures
  dpi = 300 # higher dpi, sharper image
)

A/B testing

Data: Sale prices of houses in Duke Forest

  • Data on houses that were sold in the Duke Forest neighborhood of Durham, NC around November 2020

  • Scraped from Zillow

  • Source: openintro::duke_forest

Modernist house in Duke Forest

openintro::duke_forest

glimpse(duke_forest)
Rows: 98
Columns: 13
$ address    <chr> "1 Learned Pl, Durham, NC 27705", "1616 Pinecrest Rd, Durha…
$ price      <dbl> 1520000, 1030000, 420000, 680000, 428500, 456000, 1270000, …
$ bed        <dbl> 3, 5, 2, 4, 4, 3, 5, 4, 4, 3, 4, 4, 3, 5, 4, 5, 3, 4, 4, 3,…
$ bath       <dbl> 4.0, 4.0, 3.0, 3.0, 3.0, 3.0, 5.0, 3.0, 5.0, 2.0, 3.0, 3.0,…
$ area       <dbl> 6040, 4475, 1745, 2091, 1772, 1950, 3909, 2841, 3924, 2173,…
$ type       <chr> "Single Family", "Single Family", "Single Family", "Single …
$ year_built <dbl> 1972, 1969, 1959, 1961, 2020, 2014, 1968, 1973, 1972, 1964,…
$ heating    <chr> "Other, Gas", "Forced air, Gas", "Forced air, Gas", "Heat p…
$ cooling    <fct> central, central, central, central, central, central, centr…
$ parking    <chr> "0 spaces", "Carport, Covered", "Garage - Attached, Covered…
$ lot        <dbl> 0.97, 1.38, 0.51, 0.84, 0.16, 0.45, 0.94, 0.79, 0.53, 0.73,…
$ hoa        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ url        <chr> "https://www.zillow.com/homedetails/1-Learned-Pl-Durham-NC-…

A simple visualization

ggplot(duke_forest, aes(x = area, y = price)) +
  geom_point(alpha = 0.7, size = 2) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 0.7) +
  labs(
    x = "Area (square feet)",
    y = "Sale price (USD)",
    title = "Price and area of houses in Duke Forest"
  )
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

New variable: decade_built

duke_forest <- duke_forest |>
  mutate(decade_built = (year_built %/% 10) * 10)

duke_forest |>
  select(year_built, decade_built)
# A tibble: 98 × 2
   year_built decade_built
        <dbl>        <dbl>
 1       1972         1970
 2       1969         1960
 3       1959         1950
 4       1961         1960
 5       2020         2020
 6       2014         2010
 7       1968         1960
 8       1973         1970
 9       1972         1970
10       1964         1960
# ℹ 88 more rows

New variable: decade_built_cat

duke_forest <- duke_forest |>
  mutate(
    decade_built_cat = case_when(
      decade_built <= 1940 ~ "1940 or before",
      decade_built >= 1990 ~ "1990 or after",
      .default = as.character(decade_built)
    )
  )

duke_forest |>
  count(decade_built_cat)
# A tibble: 6 × 2
  decade_built_cat     n
  <chr>            <int>
1 1940 or before       8
2 1950                26
3 1960                32
4 1970                11
5 1980                13
6 1990 or after        8

A slightly more complex visualization

ggplot(
  duke_forest,
  aes(x = area, y = price, color = decade_built_cat)
) +
  geom_point(alpha = 0.7, show.legend = FALSE) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 0.5, show.legend = FALSE) +
  facet_wrap(~decade_built_cat) +
  labs(
    x = "Area (square feet)",
    y = "Sale price (USD)",
    color = "Decade built",
    title = "Price and area of houses in Duke Forest"
  )
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

A/B testing

In the next two slides, the same plots are created with different “cosmetic” choices. Examine the plots two given (Plot A and Plot B), and indicate your preference by voting for one of them in the Vote tab.

Test 1

Test 2

What makes figures bad?

Bad taste

Data-to-ink ratio

Tufte strongly recommends maximizing the data-to-ink ratio this in the Visual Display of Quantitative Information (Tufte, 1983).

Graphical excellence is the well-designed presentation of interesting data—a matter of substance, of statistics, and of design … [It] consists of complex ideas communicated with clarity, precision, and efficiency. … [It] is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space … [It] is nearly always multivariate … And graphical excellence requires telling the truth about the data. (Tufte, 1983, p. 51).

Cover of The Visual Display of Quantitative Information

Which of the plots has a higher data-to-ink ratio?

A deeper look

at the plotting code

Summary statistics

mean_area_decade <- duke_forest |>
  group_by(decade_built_cat) |>
  summarize(mean_area = mean(area))

mean_area_decade
# A tibble: 6 × 2
  decade_built_cat mean_area
  <chr>                <dbl>
1 1940 or before       2072.
2 1950                 2545.
3 1960                 2873.
4 1970                 3413.
5 1980                 2889.
6 1990 or after        2822.

Barplot

ggplot(
  mean_area_decade,
  aes(y = decade_built_cat, x = mean_area)
) +
  geom_col() +
  labs(
    x = "Mean area (square feet)",
    y = "Decade built",
    title = "Mean area of houses in Duke Forest, by decade built"
  )

Scaterplot

ggplot(
  mean_area_decade,
  aes(y = decade_built_cat, x = mean_area)
) +
  geom_point(size = 4) +
  labs(
    x = "Mean area (square feet)",
    y = "Decade built",
    title = "Mean area of houses in Duke Forest, by decade built"
  )

Lollipop chart – a happy medium?

Application exercise

  • Go to the course GitHub organization: https://github.com/vizdata-s26

  • Clone the repo called ae-01-YOUR-GITHUB-NAME and work on the exercise.

  • Once you’re done, commit and push your changes to GitHub.

  • Label your cell(s) and pay attention to code style and formatting!