--- title: 'ETC5523: Communicating with Data' subtitle: "Tutorial 3" author: "Emi Tanaka" date: "Week 3" output: html_document: toc: true css: "tutorial.css" --- **Please do [Preparation](#preparation) *before* the tutorial!** ## `r emo::ji("target")` Objectives * create new functions and generic methods * understand and follow guidelines for best way to communicate with tables * construct regression, descriptive, and interactive tables with R Markdown * deconstruct and manipulate "novel" model objects * modify the look of *HTML tables* produced by R Markdown ## `r emo::ji("wrench")` Preparation 1. Install R-packages ```{r, eval = FALSE} install.packages(c("broom", "DT", "GT", "kableExtra", "palmerpenguins", "modelsummary", "gtsummary")) ``` ## 🛠️ Exercise 3A **Create new generic methods** The `range` function in R returns a vector containing the minimum and maximum. However, the statistical definition of a range of a set of data is the *difference* between the largest and smallest values. * Create a new function called `myrange` that computes the difference between the largest and smallest values for vectors of classes: `integer`, `numeric`, `Date` and `POSIXct` (or more generally `POSIXt`). * In addition if the vector is a character with only one letter for each element, compute the position difference in the English alphabet (e.g. a vector with "A" and "B" should return 1). Cases can be ignored (e.g. a vector with "a" and "B" should return 1). Hint: `?match`, `?tolower` and `?letters`. * Design your functions as S3 generic methods. * Test your function for the following data: ```{r} x1 <- c(2L, 1L, 0L, 3L, 9L, 2L, 3L, 1L, 1L, 5L) x2 <- c(6.1, -0.6, 5.2, 7.8, NA, 0.7, 4.3, 5.6, 5.6, 4, 0.4) x3 <- as.Date(c("2021-08-01", "2021-08-04", NA, "2021-08-02", "2021-07-08", "2021-07-09", "2021-07-30", "2021-07-12", "2021-07-26", "2021-07-22", "2021-08-08")) x4 <- as.POSIXct(c("2021-08-08 19:26:32", "2021-08-08 19:26:48", "2021-08-03 01:27:02", "2021-08-08 19:26:31", "2021-08-08 19:26:29", "2021-08-08 19:26:57", "2021-08-08 19:26:54", "2021-08-08 19:26:59", "2021-08-08 19:26:27", "2021-08-08 19:26:38"), tz = "UTC") x5 <- c("N", "X", "E", "l", "Y", "u", "M", "S", "D", "A") ``` ## 👥 Exercise 3B **Working with robust linear models** Consider the artificial data in the object `df`. There are two obvious outliers. Which observations are the outliers? ```{r, message = FALSE, warning = FALSE} library(tidyverse) library(broom) library(palmerpenguins) library(MASS) # needed for rlm df <- tibble(x = 1:20, y = c(1:18, 49:50)) ggplot(df, aes(x, y)) + geom_point(size = 2) + theme_bw(base_size = 12) ``` A simple linear model where parameters are estimated using least squares estimate are not robust to outliers. Below we fit two models: 1. a linear model with least square estimates contained in `fit_lm` and 2. a robust linear model contained in `fit_rlm`. To explain briefly, a robust linear model down-weights the contribution of the observations that are outlying observations to estimate parameters. ```{r} fit_lm <- lm(y ~ x, data = df) fit_rlm <- rlm(y ~ x, data = df) ``` What is the class of `fit_rlm`? Study the object structure of both `fit_lm` and `fit_rlm`. How are the structures different between `fit_lm` and `fit_rlm`? Below the `augment` method from the `broom` package extracts some model-related values but it does not return the weights from the iterated re-weighted least squares. Modify the `augment` method for the appropriate class so that the code below gets the desired output. ```{r} # modify augment method to the desired output dfa <- augment(fit_rlm) ```
Open here for the desired output for `dfa` ``` dfa # A tibble: 20 x 7 y x .fitted .resid .hat .sigma w 1 1 1 0.999 0.000546 0.206 10.3 1 2 2 2 2.00 0.000452 0.172 10.3 1 3 3 3 3.00 0.000358 0.143 10.3 1 4 4 4 4.00 0.000263 0.118 10.3 1 5 5 5 5.00 0.000169 0.0974 10.3 1 6 6 6 6.00 0.0000749 0.0808 10.3 1 7 7 7 7.00 -0.0000193 0.0685 10.3 1 8 8 8 8.00 -0.000114 0.0603 10.3 1 9 9 9 9.00 -0.000208 0.0564 10.3 1 10 10 10 10.0 -0.000302 0.0566 10.3 1 11 11 11 11.0 -0.000396 0.0611 10.3 1 12 12 12 12.0 -0.000490 0.0698 10.3 1 13 13 13 13.0 -0.000585 0.0827 10.3 1 14 14 14 14.0 -0.000679 0.0998 10.3 1 15 15 15 15.0 -0.000773 0.121 10.3 1 16 16 16 16.0 -0.000867 0.147 10.3 1 17 17 17 17.0 -0.000961 0.172 10.3 0.977 18 18 18 18.0 -0.00106 0.187 10.3 0.890 19 49 19 19.0 30.0 0.0000184 7.28 0.0000742 20 50 20 20.0 30.0 0.0000216 7.28 0.0000742 ```

Once the above is done, you can use the code below to get the plot seen at the end of this question. What do you think the `w` is doing? ```{r, eval = FALSE} ggplot(dfa, aes(x, y, color = w)) + geom_point(size = 2) + geom_smooth(method = "lm", se = FALSE, color = "#C8008F") + geom_smooth(method = "rlm", se = FALSE, color = "#027EB6") + annotate("text", x = 20.5, y = c(21, 31), label = c("rlm", "lm")) + theme_bw(base_size = 12) + labs(title = "Robust vs. non-robust linear models") + scale_color_continuous(type = "viridis") ``` ```{r, echo = FALSE} knitr::include_graphics("images/tutorial03-rlm-1.png") ``` ## 👥 Exercise 3C **Reporting regression tables** Create regression tables, respecting the standard guidelines of presentation for tables, with models `lm1` and `rlm1` below using `modelsummary`, `gtsummary` or otherwise. ```{r} library(palmerpenguins) lm1 <- lm(bill_length_mm ~ species + bill_depth_mm + flipper_length_mm + body_mass_g + sex, data = penguins) rlm1 <- rlm(bill_length_mm ~ . , data = penguins) ``` ## 🛠️ Exercise 3D **Interactive data tables** Produce the same table that you find below using `DT`: