ETC5523: Communicating with Data
Tutorial 5
🎯 Objectives
- create new functions and generic methods
- understand and follow guidelines for best way to communicate with tables
- construct regression, descriptive, and interactive tables with R Markdown
- deconstruct and manipulate “novel” model objects
- modify the look of HTML tables produced by R Markdown or Quarto
Install the R-packages
install.packages(c("broom", "DT", "kableExtra", "palmerpenguins", "modelsummary", "gtsummary"))
🛠️ Exercise 5A
Create new generic methods
The range
function in R returns a vector containing the minimum and maximum. However, the statistical definition of a range of a set of data is the difference between the largest and smallest values.
- Create a new function called
myrange
that computes the difference between the largest and smallest values for vectors of classes:integer
,numeric
,Date
andPOSIXct
(or more generallyPOSIXt
). - In addition if the vector is a character with only one letter for each element, compute the position difference in the English alphabet (e.g. a vector with “A” and “B” should return 1). Cases can be ignored (e.g. a vector with “a” and “B” should return 1). Hint:
?match
,?tolower
and?letters
. - Design your functions as S3 generic methods.
- Test your function for the following data:
<- c(2L, 1L, 0L, 3L, 9L, 2L, 3L, 1L, 1L, 5L)
x1 <- c(6.1, -0.6, 5.2, 7.8, NA, 0.7, 4.3, 5.6, 5.6, 4, 0.4)
x2 <- as.Date(c("2021-08-01", "2021-08-04", NA, "2021-08-02", "2021-07-08", "2021-07-09", "2021-07-30", "2021-07-12", "2021-07-26", "2021-07-22", "2021-08-08"))
x3 <- as.POSIXct(c("2021-08-08 19:26:32", "2021-08-08 19:26:48", "2021-08-03 01:27:02",
x4 "2021-08-08 19:26:31", "2021-08-08 19:26:29", "2021-08-08 19:26:57",
"2021-08-08 19:26:54", "2021-08-08 19:26:59", "2021-08-08 19:26:27",
"2021-08-08 19:26:38"), tz = "UTC")
<- c("N", "X", "E", "l", "Y", "u", "M", "S", "D", "A") x5
👥 Exercise 5B
Working with robust linear models
Consider the artificial data in the object df
. There are two obvious outliers. Which observations are the outliers?
library(tidyverse)
library(broom)
library(palmerpenguins)
library(MASS) # needed for rlm
<- tibble(x = 1:20, y = c(1:18, 49:50))
df ggplot(df, aes(x, y)) +
geom_point(size = 2) + theme_bw(base_size = 12)
A simple linear model where parameters are estimated using least squares estimate are not robust to outliers. Below we fit two models:
- a linear model with least square estimates contained in
fit_lm
and - a robust linear model contained in
fit_rlm
.
To explain briefly, a robust linear model down-weights the contribution of the observations that are outlying observations to estimate parameters.
<- lm(y ~ x, data = df)
fit_lm <- rlm(y ~ x, data = df) fit_rlm
What is the class of fit_rlm
? Study the object structure of both fit_lm
and fit_rlm
. How are the structures different between fit_lm
and fit_rlm
?
Below the augment
method from the broom
package extracts some model-related values but it does not return the weights from the iterated re-weighted least squares. Modify the augment
method for the appropriate class so that the code below gets the desired output.
# modify augment method to the desired output
<- augment(fit_rlm) dfa
Open here for the desired output for dfa
dfa
# A tibble: 20 x 7
y x .fitted .resid .hat .sigma w
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 0.999 0.000546 0.206 10.3 1
2 2 2 2.00 0.000452 0.172 10.3 1
3 3 3 3.00 0.000358 0.143 10.3 1
4 4 4 4.00 0.000263 0.118 10.3 1
5 5 5 5.00 0.000169 0.0974 10.3 1
6 6 6 6.00 0.0000749 0.0808 10.3 1
7 7 7 7.00 -0.0000193 0.0685 10.3 1
8 8 8 8.00 -0.000114 0.0603 10.3 1
9 9 9 9.00 -0.000208 0.0564 10.3 1
10 10 10 10.0 -0.000302 0.0566 10.3 1
11 11 11 11.0 -0.000396 0.0611 10.3 1
12 12 12 12.0 -0.000490 0.0698 10.3 1
13 13 13 13.0 -0.000585 0.0827 10.3 1
14 14 14 14.0 -0.000679 0.0998 10.3 1
15 15 15 15.0 -0.000773 0.121 10.3 1
16 16 16 16.0 -0.000867 0.147 10.3 1
17 17 17 17.0 -0.000961 0.172 10.3 0.977
18 18 18 18.0 -0.00106 0.187 10.3 0.890
19 49 19 19.0 30.0 0.0000184 7.28 0.0000742
20 50 20 20.0 30.0 0.0000216 7.28 0.0000742
Once the above is done, you can use the code below to get the plot seen at the end of this question. What do you think the w
is doing?
ggplot(dfa, aes(x, y, color = w)) +
geom_point(size = 2) +
geom_smooth(method = "lm", se = FALSE, color = "#C8008F") +
geom_smooth(method = "rlm", se = FALSE, color = "#027EB6") +
annotate("text", x = 20.5, y = c(21, 31),
label = c("rlm", "lm")) +
theme_bw(base_size = 12) +
labs(title = "Robust vs. non-robust linear models") +
scale_color_continuous(type = "viridis")
👥 Exercise 5C
Reporting regression tables
Create regression tables, respecting the standard guidelines of presentation for tables, with models lm1
and rlm1
below using modelsummary
, gtsummary
or otherwise.
library(palmerpenguins)
<- lm(bill_length_mm ~ species + bill_depth_mm + flipper_length_mm + body_mass_g + sex, data = penguins)
lm1 <- rlm(bill_length_mm ~ . , data = penguins) rlm1
🛠️ Exercise 5D
Interactive data tables
Produce the same table that you find below using DT
using the penguins
data in the palmerpenguins
package.