ETC5523: Communicating with Data

Tutorial 5

Author

Michael Lydeamore

Published

May 1, 2001

🎯 Objectives

  • create new functions and generic methods
  • understand and follow guidelines for best way to communicate with tables
  • construct regression, descriptive, and interactive tables with R Markdown
  • deconstruct and manipulate “novel” model objects
  • modify the look of HTML tables produced by R Markdown or Quarto

Install the R-packages

install.packages(c("broom", "DT", "kableExtra", "palmerpenguins", "modelsummary", "gtsummary"))

🛠️ Exercise 5A

Create new generic methods

The range function in R returns a vector containing the minimum and maximum. However, the statistical definition of a range of a set of data is the difference between the largest and smallest values.

  • Create a new function called myrange that computes the difference between the largest and smallest values for vectors of classes: integer, numeric, Date and POSIXct (or more generally POSIXt).
  • In addition if the vector is a character with only one letter for each element, compute the position difference in the English alphabet (e.g. a vector with “A” and “B” should return 1). Cases can be ignored (e.g. a vector with “a” and “B” should return 1). Hint: ?match, ?tolower and ?letters.
  • Design your functions as S3 generic methods.
  • Test your function for the following data:
x1 <- c(2L, 1L, 0L, 3L, 9L, 2L, 3L, 1L, 1L, 5L)
x2 <- c(6.1, -0.6, 5.2, 7.8, NA, 0.7, 4.3, 5.6, 5.6, 4, 0.4)
x3 <- as.Date(c("2021-08-01", "2021-08-04", NA, "2021-08-02", "2021-07-08", "2021-07-09",  "2021-07-30", "2021-07-12", "2021-07-26", "2021-07-22", "2021-08-08"))
x4 <- as.POSIXct(c("2021-08-08 19:26:32", "2021-08-08 19:26:48", "2021-08-03 01:27:02", 
"2021-08-08 19:26:31", "2021-08-08 19:26:29", "2021-08-08 19:26:57", 
"2021-08-08 19:26:54", "2021-08-08 19:26:59", "2021-08-08 19:26:27", 
"2021-08-08 19:26:38"), tz = "UTC")
x5 <- c("N", "X", "E", "l", "Y", "u", "M", "S", "D", "A")

👥 Exercise 5B

Working with robust linear models

Consider the artificial data in the object df. There are two obvious outliers. Which observations are the outliers?

library(tidyverse)
library(broom)
library(palmerpenguins)
library(MASS) # needed for rlm
df <- tibble(x = 1:20, y = c(1:18, 49:50))
ggplot(df, aes(x, y)) + 
  geom_point(size = 2) + theme_bw(base_size = 12)

A simple linear model where parameters are estimated using least squares estimate are not robust to outliers. Below we fit two models:

  1. a linear model with least square estimates contained in fit_lm and
  2. a robust linear model contained in fit_rlm.

To explain briefly, a robust linear model down-weights the contribution of the observations that are outlying observations to estimate parameters.

fit_lm <- lm(y ~ x, data = df) 
fit_rlm <- rlm(y ~ x, data = df) 

What is the class of fit_rlm? Study the object structure of both fit_lm and fit_rlm. How are the structures different between fit_lm and fit_rlm?

Below the augment method from the broom package extracts some model-related values but it does not return the weights from the iterated re-weighted least squares. Modify the augment method for the appropriate class so that the code below gets the desired output.

# modify augment method to the desired output
dfa <- augment(fit_rlm)
Open here for the desired output for dfa
dfa
# A tibble: 20 x 7
       y     x .fitted     .resid      .hat .sigma         w
   <int> <int>   <dbl>      <dbl>     <dbl>  <dbl>     <dbl>
 1     1     1   0.999  0.000546  0.206      10.3  1        
 2     2     2   2.00   0.000452  0.172      10.3  1        
 3     3     3   3.00   0.000358  0.143      10.3  1        
 4     4     4   4.00   0.000263  0.118      10.3  1        
 5     5     5   5.00   0.000169  0.0974     10.3  1        
 6     6     6   6.00   0.0000749 0.0808     10.3  1        
 7     7     7   7.00  -0.0000193 0.0685     10.3  1        
 8     8     8   8.00  -0.000114  0.0603     10.3  1        
 9     9     9   9.00  -0.000208  0.0564     10.3  1        
10    10    10  10.0   -0.000302  0.0566     10.3  1        
11    11    11  11.0   -0.000396  0.0611     10.3  1        
12    12    12  12.0   -0.000490  0.0698     10.3  1        
13    13    13  13.0   -0.000585  0.0827     10.3  1        
14    14    14  14.0   -0.000679  0.0998     10.3  1        
15    15    15  15.0   -0.000773  0.121      10.3  1        
16    16    16  16.0   -0.000867  0.147      10.3  1        
17    17    17  17.0   -0.000961  0.172      10.3  0.977    
18    18    18  18.0   -0.00106   0.187      10.3  0.890    
19    49    19  19.0   30.0       0.0000184   7.28 0.0000742
20    50    20  20.0   30.0       0.0000216   7.28 0.0000742


Once the above is done, you can use the code below to get the plot seen at the end of this question. What do you think the w is doing?

ggplot(dfa, aes(x, y, color = w)) + 
  geom_point(size = 2) + 
  geom_smooth(method = "lm", se = FALSE, color = "#C8008F") +
  geom_smooth(method = "rlm", se = FALSE, color = "#027EB6") + 
  annotate("text", x = 20.5, y = c(21, 31), 
           label = c("rlm", "lm")) + 
  theme_bw(base_size = 12) + 
  labs(title = "Robust vs. non-robust linear models") + 
  scale_color_continuous(type = "viridis")

👥 Exercise 5C

Reporting regression tables

Create regression tables, respecting the standard guidelines of presentation for tables, with models lm1 and rlm1 below using modelsummary, gtsummary or otherwise.

library(palmerpenguins)
lm1 <- lm(bill_length_mm ~ species + bill_depth_mm + flipper_length_mm + body_mass_g + sex, data = penguins)
rlm1 <- rlm(bill_length_mm ~ . , data = penguins)

🛠️ Exercise 5D

Interactive data tables

Produce the same table that you find below using DT using the penguins data in the palmerpenguins package.