October 7, 2017

Visualizations

  • Statistics can be misleading
  • When it's life or death
  • Grammar of Graphics & ggplot2
  • Exploratory Data Analysis
  • HTML Widgets
  • Shiny

Statistics can be misleading

“The greatest value of a picture is when it forces us to notice what we never expected to see.” — John Tukey

Anscombe's Quartet

Wikipedia: Anscombe's Quartet
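
A minimal sketch of the point behind the quartet, using the `anscombe` data frame that ships with base R (columns `x1`..`x4`, `y1`..`y4`). The helper name `stats` is only illustrative. The four sets share nearly identical summary statistics, which is why plotting them matters.

```r
# Summary statistics for each of the four x/y pairs in base R's anscombe data
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y), var_x = var(x), cor_xy = cor(x, y))
})
colnames(stats) <- paste("set", 1:4)

# Rounded to two decimals the four columns look the same
round(stats, 2)
```

Plotting each pair (for example `plot(anscombe$x4, anscombe$y4)`) shows how different the four sets really are.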

Dino Data


Dino Data in R
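
A rough sketch of how the dino data might be explored in R, assuming the `datasauRus` package is installed and that its `datasaurus_dozen` table (columns `dataset`, `x`, `y`) is the data shown on these slides. The object names `means`, `sds`, and `p` are illustrative only.

```r
library(datasauRus)  # assumption: installed from CRAN
library(ggplot2)

# Per-dataset means and standard deviations are nearly identical ...
means <- aggregate(cbind(x, y) ~ dataset, data = datasaurus_dozen, FUN = mean)
sds   <- aggregate(cbind(x, y) ~ dataset, data = datasaurus_dozen, FUN = sd)

# ... but the scatterplots are not: one small panel per dataset
p <- ggplot(datasaurus_dozen, aes(x, y)) +
  geom_point(size = 0.5) +
  facet_wrap(~ dataset)
p
```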

Box Plots

When it's Life or Death

Cholera in London: 1854

John Snow

The other John Snow

Snow's Map

Tufte's Design

  1. Place data in an appropriate context for assessing cause and effect.

  2. Make quantitative comparisons (e.g., Workhouse & Brewery).

  3. Consider alternative explanations and contrary cases.

  4. Assess possible errors (e.g., Compared with what?)

Compared with What?

Mark Monmonier's "How to Lie with Maps" (1991)

Challenger

Challenger: Presidential Commission

Chartjunk, lack of cause-effect, & wrong order (Tufte, p. 48). See the bottom.

Richard Feynman's Ice Dunk

Tufte's Take

Box Plot 1

Source: Tufte, Visual Explanations

Grammar of Graphics & ggplot2

1812 Napoleon Russia Invasion

Napoleon Invasion

ggplot2

ggplot2 (code)

library(ggplot2); library(scales)

# Minard's data: troop paths (with survivor counts) and city labels
troops <- read.table("../data/minard-troops.txt", header = TRUE)
cities <- read.table("../data/minard-cities.txt", header = TRUE)

plot_troops <- ggplot(troops, aes(long, lat)) +
  geom_path(aes(size = survivors, colour = direction, group = group))
  
plot_both <- plot_troops + 
  geom_text(aes(label = city), size = 4, data = cities)
  
plot_polished <- plot_both + 
  scale_size(breaks = c(1, 2, 3) * 10^5, labels = comma(c(1, 2, 3) * 10^5)) + 
  scale_colour_manual(values = c("grey50","red")) +
  xlab(NULL) + 
  ylab(NULL)

Follow along: Viz Chapter

Exploratory Data Analysis: Tukey

geom_bar

Two Questions

  • What type of variation occurs within my variables?

  • What type of covariation occurs between my variables?

Variation: Categorical

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
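
The bar heights above are just counts per category. As a quick check, the same numbers can be computed directly; this sketch uses base R's `table()` and the illustrative name `cut_counts`.

```r
library(ggplot2)  # for the diamonds dataset

# Number of diamonds in each cut category (the heights geom_bar draws)
cut_counts <- table(diamonds$cut)
cut_counts
```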

Variation: Continuous

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
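
A histogram bins a continuous variable and counts the observations in each bin. As a sketch, roughly the same counts can be obtained by hand with ggplot2's `cut_width()`; the name `carat_bins` is illustrative.

```r
library(ggplot2)  # for diamonds and cut_width()

# Split carat into bins of width 0.5 and count observations per bin
carat_bins <- table(cut_width(diamonds$carat, width = 0.5))
carat_bins
```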

Variation: Box-Plot

ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_boxplot()

Variation: Box-Plot

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  coord_flip()

Co-Variation: Discrete

ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = color))

Co-Variation: Discrete (2)

library(dplyr)  # provides %>% and count()

diamonds %>% count(color, cut) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
    geom_tile(mapping = aes(fill = n))

Co-Variation: Continuous

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price))

Application of Co-variation: Continuous

Random walks: a stochastic process used in pricing financial derivatives.

set.seed(2)
d <- purrr::map_dfr(letters, ~ data.frame(idx = 1:400,
               value = cumsum(runif(400, -1, 1)),
               type = .,
               stringsAsFactors = FALSE))

# save ggplot object as g
g <- ggplot(d) +
  geom_line(aes(idx, value, colour = type))

Application of Co-variation: Continuous

# printing the saved ggplot object renders the plot
g

HTML Widgets

Shiny

Other Shiny Materials

"Assignment"

  1. Create a Shiny app that runs k-means clustering on the datasauRus package data. Ideally, it would have a dropdown widget to choose which dataset.

  2. Even better, get a shinyapps.io account and publish the app.

  3. Even even better, push your code onto GitHub. Try to follow the R package structure.
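
One possible starting point for step 1, offered as a rough skeleton rather than a finished solution. It assumes the `shiny` and `datasauRus` packages are installed and that the data come from `datasaurus_dozen`. The input IDs (`dataset`, `k`), the output ID (`clusters`), and the slider range are all hypothetical choices.

```r
library(shiny)
library(datasauRus)  # assumption: provides datasaurus_dozen (dataset, x, y)

ui <- fluidPage(
  # Dropdown to pick one of the datasets
  selectInput("dataset", "Dataset", choices = unique(datasaurus_dozen$dataset)),
  sliderInput("k", "Number of clusters", min = 2, max = 8, value = 3),
  plotOutput("clusters")
)

server <- function(input, output, session) {
  output$clusters <- renderPlot({
    # Keep only the x/y points of the chosen dataset
    pts <- subset(datasaurus_dozen, dataset == input$dataset, select = c(x, y))
    fit <- kmeans(pts, centers = input$k)
    # Colour points by assigned cluster and mark the cluster centres
    plot(pts$x, pts$y, col = fit$cluster, pch = 19, xlab = "x", ylab = "y")
    points(fit$centers, pch = 4, cex = 2, lwd = 2)
  })
}

app <- shinyApp(ui, server)
# runApp(app)  # uncomment to launch the app locally
```

Because `kmeans()` starts from random centres, the clusters can change between runs; calling `set.seed()` inside `renderPlot()` is one way to make them repeatable.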

How well does k-means clustering do?

Which distributions does it do really badly on?