Data wrangling + tidying for visualization

Lecture 3

Dr. Mine Çetinkaya-Rundel

Duke University
STA 313 - Spring 2026

Warm up

Clarifications…

requested in the “Getting to know you” survey

If I have to be absent for the in-class quizzes (e.g., field trip for another class, tournament for a sport, etc.), is there any way to make it up?

No. From the Assessment section of the syllabus: “There are no make-ups or late submissions for missed quizzes.” However, the “lowest 3-4 quizzes (roughly 25% dropped, depending on how many quizzes we end up with throughout the semester) will be dropped.” Of course, university-recognized excused absences are handled differently, and if a long-term illness affects your participation in the quizzes, I will work with your Dean to come up with a policy for your case.

How long is the extension on the one-time waiver?

The one-time waiver waives the late penalty, so it only applies to the late penalty period for assignments and assessments where late work is accepted with a penalty. See the Late work & extensions section of the syllabus for details. TL;DR (but please do read!!!): 48 hours.

Setup

# load packages
library(tidyverse)
library(scales)
library(openintro)
library(ggthemes)

# set theme for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 14))

# set figure parameters for knitr
knitr::opts_chunk$set(
  fig.width = 7, # 7" width
  fig.asp = 0.618, # the golden ratio
  fig.retina = 3, # dpi multiplier for displaying HTML output on retina
  fig.align = "center", # center align figures
  dpi = 300 # higher dpi, sharper image
)

Positron tip (optional)

Tired of trying to properly style your code? Sick of collaborators with messy code?

Open the Command Palette with ⌘⇧P (Command + Shift + P)
Type “preferences” and select Preferences: Open User Settings (JSON)
Add the following to auto format your R code in R scripts and Quarto documents with Air:

{
    // other preferences (if any) here
    "[r]": {
        "editor.formatOnSave": true,
        "editor.defaultFormatter": "Posit.air-vscode"
    },
    "[quarto]": {
        "editor.formatOnSave": true,
        "editor.defaultFormatter": "quarto.quarto"
    }
}

From last time

Lollipop chart

Code

duke_forest <- duke_forest |>
  mutate(
    decade_built = (year_built %/% 10) * 10,
    decade_built_cat = case_when(
      decade_built <= 1940 ~ "1940 or before",
      decade_built >= 1990 ~ "1990 or after",
      .default = as.character(decade_built)
    )
  )

mean_area_decade <- duke_forest |>
  group_by(decade_built_cat) |>
  summarize(mean_area = mean(area))

ggplot(
  mean_area_decade,
  aes(y = decade_built_cat, x = mean_area)
) +
  geom_point(size = 4) +
  geom_segment(
    aes(
      x = 0,
      xend = mean_area,
      y = decade_built_cat,
      yend = decade_built_cat
    )
  ) +
  labs(
    x = "Mean area (square feet)",
    y = "Decade built",
    title = "Mean area of houses in Duke Forest, by decade built"
  )

Bad data

Figure: faceted plot showing the average importance of democracy in 6 countries over time.

Bad perception

Aspect ratios affect our perception of rates of change, modeled after an example by William S. Cleveland.

Aesthetic mappings in ggplot2

Activity: Spot the differences I

Can you spot the differences between Plots 1 and 2? How about the differences between Code 1 and Code 2?

Code 1

ggplot(
  mean_area_decade,
  aes(y = decade_built_cat, x = mean_area)
) +
  geom_point(size = 4) +
  geom_segment(
    aes(
      x = 0,
      xend = mean_area,
      y = decade_built_cat,
      yend = decade_built_cat
    )
  ) +
  labs(
    x = "Mean area (square feet)",
    y = "Decade built",
    title = "Mean area of houses in Duke Forest, by decade built"
  )

Code 2

ggplot(
  mean_area_decade,
  aes(y = decade_built_cat, x = mean_area)
) +
  geom_point(size = 4) +
  geom_segment(
    aes(
      xend = 0,
      yend = decade_built_cat
    )
  ) +
  labs(
    x = "Mean area (square feet)",
    y = "Decade built",
    title = "Mean area of houses in Duke Forest, by decade built"
  )

Global vs. layer-specific aesthetics

Aesthetic mappings can be supplied in the initial ggplot() call, in individual layers, or in some combination of both.
Within each layer, you can add, override, or remove mappings.
If you only have one layer in the plot, the way you specify aesthetics doesn’t make any difference. However, the distinction is important when you start adding additional layers.
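The points above can be illustrated with a minimal sketch. It uses the built-in mtcars data as a stand-in (an assumption, chosen so the example runs without extra packages beyond ggplot2); the object names p_global, p_layer, and p_removed are hypothetical.

```r
library(ggplot2)

# Global mapping: every layer inherits x, y, and color,
# so geom_smooth() fits one line per cylinder group.
p_global <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Layer-specific mapping: only the points are colored,
# so geom_smooth() fits a single line to all cars.
p_layer <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl))) +
  geom_smooth(method = "lm", se = FALSE)

# Removing an inherited mapping within a layer by setting it to NULL:
# the points stay colored, but the smoother ignores color.
p_removed <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  geom_smooth(aes(color = NULL), method = "lm", se = FALSE)
```

With a single layer, p_global and p_layer would look the same; the difference only shows up once geom_smooth() is added, which is why the second and third plots should end up with one fitted line while the first has one per group.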

Activity: Spot the differences II

Do you expect the following plots to be the same or different? If different, how? Discuss in a pair (or group) without running the code and sketch the resulting plots based on what you think the code will produce.

# Plot A
ggplot(duke_forest, aes(x = area, y = price)) +
  geom_point(aes(color = decade_built_cat))

# Plot B
ggplot(duke_forest, aes(x = area, y = price)) +
  geom_point(color = "blue")

# Plot C
ggplot(duke_forest, aes(x = area, y = price)) +
  geom_point(color = "#a493ba")

Recall

Think back to all the plots you saw in this lecture so far and the previous lecture, without flipping back through the slides. Which plot first comes to mind?

Geoms

Geometric objects, or geoms for short, perform the actual rendering of the layer, controlling the type of plot that you create.
You can think of them as “the geometric shape used to represent the data”.

One variable

Discrete:
- geom_bar(): display distribution of discrete variable.

Continuous:
- geom_histogram(): bin and count continuous variable, display with bars
- geom_density(): smoothed density estimate
- geom_dotplot(): stack individual points into a dot plot
- geom_freqpoly(): bin and count continuous variable, display with lines

Aside

Always use “typewriter text” (monospace font) when writing function names, and follow with (), e.g.,

geom_freqpoly()
mean()
lm()

`geom_dotplot()`

What does each point represent? How are their locations determined? What do the x and y axes represent?

ggplot(duke_forest, aes(x = price)) +
  geom_dotplot(binwidth = 50000, dotsize = 0.7)

Comparing across groups

Which of the following allows for easier comparison across groups?

Histogram

ggplot(duke_forest, aes(x = price, fill = decade_built_cat)) +
  geom_histogram(binwidth = 100000)

Frequency polygon

ggplot(duke_forest, aes(x = price, color = decade_built_cat)) +
  geom_freqpoly(binwidth = 100000, linewidth = 1)

Two variables - both continuous

geom_point(): scatterplot
geom_quantile(): smoothed quantile regression
geom_rug(): marginal rug plots
geom_smooth(): smoothed line of best fit
geom_text(): text labels

Two variables - show density

geom_bin2d(): bin into rectangles and count
geom_density2d(): smoothed 2d density estimate
geom_hex(): bin into hexagons and count

`geom_hex()`

Not very helpful for 98 observations:

ggplot(duke_forest, aes(x = area, y = price)) +
  geom_hex()

`geom_hex()`

More helpful for 53940 observations:

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_hex()

`geom_hex()`

(Maybe) even more helpful on the (natural) log scale:

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_hex() +
  scale_fill_gradient(trans = "log")

`geom_hex()` and warnings

Requires installing the hexbin package separately!

install.packages("hexbin")

Otherwise you might see

Warning: Computation failed in `stat_binhex()`
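One way to catch this before plotting is to check whether hexbin is available. This is a minimal sketch, not part of the original slides; the variable name hexbin_available is hypothetical.

```r
# requireNamespace() returns TRUE or FALSE without attaching the package.
hexbin_available <- requireNamespace("hexbin", quietly = TRUE)

if (!hexbin_available) {
  message('hexbin is not installed; run install.packages("hexbin") first.')
}
```

The check does not install anything on its own, so it should be safe to leave at the top of a script or document.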

Two variables

At least one discrete:
- geom_count(): count the number of points at distinct locations
- geom_jitter(): randomly jitter overlapping points

One continuous, one discrete:
- geom_col(): a bar chart of pre-computed summaries
- geom_boxplot(): boxplots
- geom_violin(): show density of values in each group

`geom_jitter()`

How are the following three plots different?

Plot A

ggplot(duke_forest, aes(x = bed, y = price)) +
  geom_point()

Plot B

ggplot(duke_forest, aes(x = bed, y = price)) +
  geom_jitter()

Plot C

ggplot(duke_forest, aes(x = bed, y = price)) +
  geom_jitter()

`geom_jitter()` and `set.seed()`

Plot A

set.seed(1234)

ggplot(duke_forest, aes(x = bed, y = price)) +
  geom_jitter()

Plot B

set.seed(1234)

ggplot(duke_forest, aes(x = bed, y = price)) +
  geom_jitter()
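The effect of set.seed() can be checked numerically with a minimal sketch. Here base R's jitter() stands in for the random noise that geom_jitter() adds (an assumption made so the result can be compared directly); the vector and object names are hypothetical.

```r
x <- c(1, 2, 3)

# Same seed before each call: the random noise is identical.
set.seed(1234)
first_draw <- jitter(x)

set.seed(1234)
second_draw <- jitter(x)

# A different seed gives different noise.
set.seed(5678)
third_draw <- jitter(x)

same_seed_matches <- identical(first_draw, second_draw)
different_seed_matches <- identical(first_draw, third_draw)
```

In ggplot2, a seed can also be supplied to the layer itself, e.g., geom_jitter(position = position_jitter(seed = 1234)), which keeps the jitter reproducible without changing the global random number state.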

Two variables

One time, one continuous:
- geom_area(): area plot
- geom_line(): line plot
- geom_step(): step plot

Display uncertainty:
- geom_crossbar(): vertical bar with center
- geom_errorbar(): error bars
- geom_linerange(): vertical line
- geom_pointrange(): vertical line with center

Spatial:
- geom_sf(): for map data (more on this later…)
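As a minimal sketch of one of the uncertainty geoms listed above, the following summarizes the built-in mtcars data (a stand-in chosen as an assumption, so the example is self-contained) and draws each group mean with a range of one standard deviation on either side. The names mpg_by_cyl and p_range are hypothetical.

```r
library(dplyr)
library(ggplot2)

# One row per group, with the mean and standard deviation of mpg.
mpg_by_cyl <- mtcars |>
  group_by(cyl) |>
  summarize(
    mean_mpg = mean(mpg),
    sd_mpg = sd(mpg)
  )

# geom_pointrange() needs y for the center and ymin/ymax for the range.
p_range <- ggplot(mpg_by_cyl, aes(x = factor(cyl), y = mean_mpg)) +
  geom_pointrange(
    aes(ymin = mean_mpg - sd_mpg, ymax = mean_mpg + sd_mpg)
  ) +
  labs(
    x = "Number of cylinders",
    y = "Mean miles per gallon (± 1 SD)"
  )
```

Swapping geom_pointrange() for geom_errorbar(), geom_linerange(), or geom_crossbar() should work with the same ymin and ymax mappings, since they all describe a vertical interval.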

Average price per year built

mean_price_year <- duke_forest |>
  group_by(year_built) |>
  summarize(
    n = n(),
    mean_price = mean(price),
    sd_price = sd(price)
  )

mean_price_year

# A tibble: 44 × 4
   year_built     n mean_price sd_price
        <dbl> <int>      <dbl>    <dbl>
 1       1923     1     285000      NA 
 2       1934     1     600000      NA 
 3       1938     1     265000      NA 
 4       1940     1     105000      NA 
 5       1941     2     432500   28284.
 6       1945     2     525000  530330.
 7       1951     2     567500  258094.
 8       1952     2     531250  469165.
 9       1953     2     575000   35355.
10       1954     4     600000   33912.
# ℹ 34 more rows

`geom_line()`

ggplot(mean_price_year, aes(x = year_built, y = mean_price)) +
  geom_line()

`geom_area()`

ggplot(mean_price_year, aes(x = year_built, y = mean_price)) +
  geom_area()

`geom_step()`

ggplot(mean_price_year, aes(x = year_built, y = mean_price)) +
  geom_step()

Transforming and reshaping a single data frame

Scenario 1

We…

- have a single data frame
- want to slice it, and dice it, and juice it, and process it, so we can plot it

Data: Hotel bookings

Data from two hotels: one resort and one city hotel
Observations: Each row represents a hotel booking

hotels <- read_csv(
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv"
)
names(hotels)

 [1] "hotel"                          "is_canceled"                   
 [3] "lead_time"                      "arrival_date_year"             
 [5] "arrival_date_month"             "arrival_date_week_number"      
 [7] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
 [9] "stays_in_week_nights"           "adults"                        
[11] "children"                       "babies"                        
[13] "meal"                           "country"                       
[15] "market_segment"                 "distribution_channel"          
[17] "is_repeated_guest"              "previous_cancellations"        
[19] "previous_bookings_not_canceled" "reserved_room_type"            
[21] "assigned_room_type"             "booking_changes"               
[23] "deposit_type"                   "agent"                         
[25] "company"                        "days_in_waiting_list"          
[27] "customer_type"                  "adr"                           
[29] "required_car_parking_spaces"    "total_of_special_requests"     
[31] "reservation_status"             "reservation_status_date"

dplyr 101

Which of the following (if any) are unfamiliar to you?

distinct()
select(), relocate()
arrange(), arrange(desc())
slice(), slice_head(), slice_tail(), slice_sample()
filter()
mutate()
summarize(), count()
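For anyone who finds some of these unfamiliar, here is a minimal sketch of several of the verbs chained together. The bookings tibble is made-up toy data (an assumption, not the real hotels data); it only borrows the column names hotel, adr, and is_canceled from the list of variables shown earlier.

```r
library(dplyr)

# Hypothetical toy data, five made-up bookings.
bookings <- tibble(
  hotel = c("Resort Hotel", "City Hotel", "City Hotel", "Resort Hotel", "City Hotel"),
  adr = c(80, 120, 95, 150, 110),
  is_canceled = c(0, 1, 0, 0, 1)
)

# filter() keeps rows, mutate() adds a column,
# arrange(desc()) sorts from high to low, select() keeps columns.
kept <- bookings |>
  filter(is_canceled == 0) |>
  mutate(adr_per_week = adr * 7) |>
  arrange(desc(adr)) |>
  select(hotel, adr, adr_per_week)

# summarize() collapses each group to one row.
mean_adr <- bookings |>
  group_by(hotel) |>
  summarize(mean_adr = mean(adr))

# count() is shorthand for group_by() + summarize(n = n()).
n_by_hotel <- bookings |>
  count(hotel)

# distinct() keeps unique values; slice_head() keeps the first rows.
hotel_types <- bookings |> distinct(hotel)
first_two <- bookings |> slice_head(n = 2)
```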

Average cost of daily stay

ae-02 - Part 1: Let’s recreate this visualization!
