Deep dive into ggplot2 layers - II

Lecture 2

Dr. Mine Çetinkaya-Rundel

Duke University
STA 313 - Spring 2024

Warm up

Announcements

  • Thank you for filling out the survey!

  • At this point everyone should:

    • Be on Slack, in the public channels #general and #random, and in the private channel for their lab section.
    • Have a profile photo/avatar and name that match between GitHub and Slack.
  • Reading Quiz 1 posted, due next Tuesday at 10 am (before class).

Setup

# load packages
library(tidyverse)
library(openintro)
library(countdown)
library(palmerpenguins)
library(scales)

# set theme for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 14))

# set width of code output
options(width = 65)

# set figure parameters for knitr
knitr::opts_chunk$set(
  fig.width = 7,        # 7" width
  fig.asp = 0.618,      # the golden ratio
  fig.retina = 3,       # dpi multiplier for displaying HTML output on retina
  fig.align = "center", # center align figures
  dpi = 300             # higher dpi, sharper image
)

From last time

duke_forest <- duke_forest |>
  mutate(
    decade_built = (year_built %/% 10) * 10,
    decade_built_cat = case_when(
      decade_built <= 1940 ~ "1940 or before",
      decade_built >= 1990 ~ "1990 or after",
      TRUE ~ as.character(decade_built)
    )
  )

mean_area_decade <- duke_forest |>
  group_by(decade_built_cat) |>
  summarise(mean_area = mean(area))

mean_area_decade
# A tibble: 6 × 2
  decade_built_cat mean_area
  <chr>                <dbl>
1 1940 or before       2072.
2 1950                 2545.
3 1960                 2873.
4 1970                 3413.
5 1980                 2889.
6 1990 or after        2822.
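A minimal sketch of how the decade binning above works, using a few hypothetical years chosen only for illustration (they are not taken from duke_forest): %/% is integer division, so dividing by 10 drops the last digit and multiplying by 10 restores the decade.

```r
library(dplyr)

# Hypothetical years, for illustration only
years <- c(1923, 1945, 1987, 1990, 2004)

# Integer division by 10 drops the last digit; multiplying by 10 gives the decade
decades <- (years %/% 10) * 10

# Same collapsing rule as in the mutate() call above
decade_cat <- case_when(
  decades <= 1940 ~ "1940 or before",
  decades >= 1990 ~ "1990 or after",
  TRUE ~ as.character(decades)
)

data.frame(years, decades, decade_cat)
```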

Geoms

Geoms

  • Geometric objects, or geoms for short, perform the actual rendering of the layer, controlling the type of plot that you create

  • You can think of them as “the geometric shape used to represent the data”

One variable

  • Discrete:

    • geom_bar(): display the distribution of a discrete variable
  • Continuous:

    • geom_histogram(): bin and count continuous variable, display with bars

    • geom_density(): smoothed density estimate

    • geom_dotplot(): stack individual points into a dot plot

    • geom_freqpoly(): bin and count continuous variable, display with lines
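As a rough sketch of the continuous geoms listed above, the same variable can be drawn with different geoms by swapping only the layer. The bin width of 100000 is an assumption picked for this example.

```r
library(ggplot2)
library(openintro)

# Shared base: one continuous variable mapped to x
p_base <- ggplot(duke_forest, aes(x = price))

# Bin and count, displayed with bars
p_hist <- p_base + geom_histogram(binwidth = 100000)

# Bin and count, displayed with lines
p_freq <- p_base + geom_freqpoly(binwidth = 100000)

# Smoothed density estimate
p_dens <- p_base + geom_density()

p_hist
p_freq
p_dens
```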

Aside

Always use “typewriter text” (monospace font) when writing function names, and follow with (), e.g.,

  • geom_freqpoly()

  • mean()

  • lm()

geom_dotplot()

What does each point represent? How are their locations determined? What do the x and y axes represent?

ggplot(duke_forest, aes(x = price)) +
  geom_dotplot(binwidth = 50000, dotsize = 0.7)

Comparing across groups

Which of the following allows for easier comparison across groups?

ggplot(duke_forest, aes(x = price, fill = decade_built_cat)) +
  geom_histogram(binwidth = 100000)

ggplot(duke_forest, aes(x = price, color = decade_built_cat)) +
  geom_freqpoly(binwidth = 100000, linewidth = 1)

Two variables - both continuous

  • geom_point(): scatterplot

  • geom_quantile(): smoothed quantile regression

  • geom_rug(): marginal rug plots

  • geom_smooth(): smoothed line of best fit

  • geom_text(): text labels
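Several of these geoms can be layered in a single plot. The sketch below is one possible combination, not the only sensible one; the linear fit (method = "lm") and the alpha values are assumptions made for the example.

```r
library(ggplot2)
library(openintro)

p_scatter <- ggplot(duke_forest, aes(x = area, y = price)) +
  geom_point(alpha = 0.6) +                      # scatterplot
  geom_smooth(method = "lm", formula = y ~ x) +  # line of best fit (linear, by assumption)
  geom_rug(alpha = 0.3)                          # marginal rug plots on both axes

p_scatter
```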

Application exercise - Part 1

  • Go to the course GitHub organization: https://github.com/vizdata-s24

  • Clone the repo called ae-02-[YOUR-GITHUB-USERNAME] and work on the exercises for Part 1.

  • Once you’re done, share your plots on Slack in #general.

  • Label your chunk(s) and pay attention to code style and formatting!


Two variables - show density

  • geom_bin2d(): bin into rectangles and count

  • geom_density2d(): smoothed 2d density estimate

  • geom_hex(): bin into hexagons and count
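For comparison with the hexagon examples, here is a small sketch of rectangular binning on the diamonds data that ships with ggplot2. The choice of bins = 30 is an assumption for the example; layer_data() is used only to peek at the counts that the stat computes.

```r
library(ggplot2)

# Bin carat and price into a 30 x 30 grid of rectangles and count points per cell
p_bin2d <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_bin2d(bins = 30)

# Inspect the computed bins: one row per drawn rectangle, with a count column
binned <- layer_data(p_bin2d)
head(binned[, c("xmin", "xmax", "ymin", "ymax", "count")])

p_bin2d
```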

geom_hex()

Not very helpful for 98 observations:

ggplot(duke_forest, aes(x = area, y = price)) +
  geom_hex()

geom_hex()

More helpful for 53940 observations:

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_hex()

geom_hex()

(Maybe) even more helpful on the (natural) log scale:

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_hex() +
  scale_fill_gradient(trans = "log")

geom_hex() and warnings

  • Requires installing the hexbin package separately!
install.packages("hexbin")
  • Otherwise you might see
Warning: Computation failed in `stat_binhex()`
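One way to guard against this warning is to check for hexbin before plotting. This is only a sketch: falling back to geom_bin2d() is a choice made for this example, not something ggplot2 does on its own.

```r
library(ggplot2)

p_base <- ggplot(diamonds, aes(x = carat, y = price))

if (requireNamespace("hexbin", quietly = TRUE)) {
  # hexbin is available, so hexagonal binning can be computed
  p_hex <- p_base + geom_hex()
} else {
  # hexbin is missing: tell the user and fall back to rectangular bins
  message('hexbin is not installed; run install.packages("hexbin") to use geom_hex()')
  p_hex <- p_base + geom_bin2d()
}

p_hex
```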

Two variables

  • At least one discrete
    • geom_count(): count number of points at distinct locations
    • geom_jitter(): randomly jitter overlapping points
  • One continuous, one discrete
    • geom_col(): a bar chart of pre-computed summaries
    • geom_boxplot(): boxplots
    • geom_violin(): show density of values in each group

geom_jitter()

How are the following three plots different?

ggplot(duke_forest, aes(x = bed, y = price)) +
  geom_point()

ggplot(duke_forest, aes(x = bed, y = price)) +
  geom_jitter()

ggplot(duke_forest, aes(x = bed, y = price)) +
  geom_jitter()

geom_jitter() and set.seed()

set.seed(1234)

ggplot(duke_forest, aes(x = bed, y = price)) +
  geom_jitter()

set.seed(1234)

ggplot(duke_forest, aes(x = bed, y = price)) +
  geom_jitter()
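To check that the seed really fixes the jitter, the jittered positions can be pulled out with layer_data() and compared. This is a sketch; jitter_x() is a hypothetical helper written for this example.

```r
library(ggplot2)
library(openintro)

# Hypothetical helper: optionally set the seed, then build the plot
# and return the jittered x positions
jitter_x <- function(seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  p <- ggplot(duke_forest, aes(x = bed, y = price)) +
    geom_jitter()
  layer_data(p)$x
}

x_a <- jitter_x(1234)
x_b <- jitter_x(1234)
x_c <- jitter_x()      # no seed set, so the random state has moved on

identical(x_a, x_b)    # same seed: positions should match
identical(x_a, x_c)    # no reseeding: positions should differ
```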

Two variables

  • One time, one continuous
    • geom_area(): area plot
    • geom_line(): line plot
    • geom_step(): step plot
  • Display uncertainty:
    • geom_crossbar(): vertical bar with center
    • geom_errorbar(): error bars
    • geom_linerange(): vertical line
    • geom_pointrange(): vertical line with center
  • Spatial
    • geom_sf(): for map data (more on this later…)

Average price per year built

mean_price_year <- duke_forest |>
  group_by(year_built) |>
  summarize(
    n = n(),
    mean_price = mean(price),
    sd_price = sd(price)
  )

mean_price_year
# A tibble: 44 × 4
   year_built     n mean_price sd_price
        <dbl> <int>      <dbl>    <dbl>
 1       1923     1     285000      NA 
 2       1934     1     600000      NA 
 3       1938     1     265000      NA 
 4       1940     1     105000      NA 
 5       1941     2     432500   28284.
 6       1945     2     525000  530330.
 7       1951     2     567500  258094.
 8       1952     2     531250  469165.
 9       1953     2     575000   35355.
10       1954     4     600000   33912.
# ℹ 34 more rows

geom_line()

ggplot(mean_price_year, aes(x = year_built, y = mean_price)) +
  geom_line()

geom_area()

ggplot(mean_price_year, aes(x = year_built, y = mean_price)) +
  geom_area()

geom_step()

ggplot(mean_price_year, aes(x = year_built, y = mean_price)) +
  geom_step()
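The uncertainty geoms listed earlier can be tried on the same yearly summary. The sketch below makes two assumptions for illustration: the range is the mean plus or minus one standard deviation, and years with a single sale are dropped because their standard deviation is NA. The name price_by_year is used here to avoid overwriting mean_price_year.

```r
library(dplyr)
library(ggplot2)
library(openintro)

price_by_year <- duke_forest |>
  group_by(year_built) |>
  summarize(
    n = n(),
    mean_price = mean(price),
    sd_price = sd(price)
  ) |>
  filter(n > 1) # sd() is NA when there is only one sale in a year

# Vertical line from mean - sd to mean + sd, with a point at the mean
p_range <- ggplot(price_by_year, aes(x = year_built, y = mean_price)) +
  geom_pointrange(
    aes(ymin = mean_price - sd_price, ymax = mean_price + sd_price)
  )

p_range
```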

Application exercise - Part 2

  • Go to the course GitHub organization: https://github.com/vizdata-s24

  • Clone the repo called ae-02-[YOUR-GITHUB-USERNAME] and work on the exercises for Part 2.

  • Once you’re done, share your plot on Slack in #general.

  • Label your chunk(s) and pay attention to code style and formatting!


Let’s clean things up a bit!

Let’s clean things up a bit!

ggplot(duke_forest, aes(x = area, y = price)) +
  geom_point(alpha = 0.6, size = 2, color = "#012169") +
  scale_x_continuous(labels = label_number(big.mark = ",")) +
  scale_y_continuous(labels = label_dollar(scale = 1/1000, suffix = "K")) +
  labs(
    x = "Area (square feet)",
    y = "Sale price (USD)",
    title = "Sale prices of homes in Duke Forest",
    subtitle = "As of November 2020",
    caption = "Source: Zillow.com"
  )