ETC5523: Communicating with Data

Statistical model outputs

Lecturer: Michael Lydeamore

Department of Econometrics and Business Statistics

michael.lydeamore@monash.edu
Week 5
cwd.numbat.space

Aim

Extract information from model objects
Understand and create functions in R

Why

Working with model objects is necessary for you to get the information you need for communication
These concepts will be helpful later when we start developing R-packages

📈 Statistical models

All models are approximations of the unknown data generating process
How good of an approximation depends on the collected data and the model choice

🎯 Characterise mpg in terms of wt.

Fitting linear models in R

We fit the model:

\[\texttt{mpg}_i = \beta_0 + \beta_1\texttt{wt}_i + e_i\]

In R we fit this as

fit <- lm(mpg ~ wt, data = mtcars)

\(\hat{\beta}_0 = 37.285\) and \(\hat{\beta}_1 = -5.344\)

ℹ️ Extracting information from the fitted model

When you fit a model, there would be a number of information you will be interested in extracting from the fit including:
- the model parameter estimates,
- model-related summary statistics, e.g. \(R^2\), AIC and BIC,
- model-related values, e.g. residuals, fitted values and predictions.
So how do you extract these values from the fit?
What does fit even contain?

ℹ️ Extracting information from the fitted model

str(fit)

List of 12
 $ coefficients : Named num [1:2] 37.29 -5.34
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "wt"
 $ residuals    : Named num [1:32] -2.28 -0.92 -2.09 1.3 -0.2 ...
  ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
 $ effects      : Named num [1:32] -113.65 -29.116 -1.661 1.631 0.111 ...
  ..- attr(*, "names")= chr [1:32] "(Intercept)" "wt" "" "" ...
 $ rank         : int 2
 $ fitted.values: Named num [1:32] 23.3 21.9 24.9 20.1 18.9 ...
  ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
 $ assign       : int [1:2] 0 1
 $ qr           :List of 5
  ..$ qr   : num [1:32, 1:2] -5.657 0.177 0.177 0.177 0.177 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
  .. .. ..$ : chr [1:2] "(Intercept)" "wt"
  .. ..- attr(*, "assign")= int [1:2] 0 1
  ..$ qraux: num [1:2] 1.18 1.05
  ..$ pivot: int [1:2] 1 2
  ..$ tol  : num 1e-07
  ..$ rank : int 2
  ..- attr(*, "class")= chr "qr"
 $ df.residual  : int 30
 $ xlevels      : Named list()
 $ call         : language lm(formula = mpg ~ wt, data = mtcars)
 $ terms        :Classes 'terms', 'formula'  language mpg ~ wt
  .. ..- attr(*, "variables")= language list(mpg, wt)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "mpg" "wt"
  .. .. .. ..$ : chr "wt"
  .. ..- attr(*, "term.labels")= chr "wt"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(mpg, wt)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
 $ model        :'data.frame':  32 obs. of  2 variables:
  ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
  ..$ wt : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language mpg ~ wt
  .. .. ..- attr(*, "variables")= language list(mpg, wt)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "mpg" "wt"
  .. .. .. .. ..$ : chr "wt"
  .. .. ..- attr(*, "term.labels")= chr "wt"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. .. ..- attr(*, "predvars")= language list(mpg, wt)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
 - attr(*, "class")= chr "lm"

ℹ️ Extracting information from the fitted model

Accessing model parameter estimates:

fit$coefficients

(Intercept)          wt 
  37.285126   -5.344472

# OR using
coef(fit)

(Intercept)          wt 
  37.285126   -5.344472

This gives us the estimates of \(\beta_0\) and \(\beta_1\).

But what about \(\sigma^2\)? Recall \(e_i \sim NID(0, \sigma^2)\).

sigma(fit)^2

[1] 9.277398

ℹ️ Extracting information from the fitted model

You can also get a summary of the model object:

summary(fit)


Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

So how do I extract these summary values out?

Model objects to tidy data

broom::tidy(fit)

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    37.3      1.88      19.9  8.24e-19
2 wt             -5.34     0.559     -9.56 1.29e-10

broom::glance(fit)

# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.753         0.745  3.05      91.4 1.29e-10     1  -80.0  166.  170.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

broom::augment(fit)

# A tibble: 32 × 9
   .rownames           mpg    wt .fitted .resid   .hat .sigma .cooksd .std.resid
   <chr>             <dbl> <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>      <dbl>
 1 Mazda RX4          21    2.62    23.3 -2.28  0.0433   3.07 1.33e-2    -0.766 
 2 Mazda RX4 Wag      21    2.88    21.9 -0.920 0.0352   3.09 1.72e-3    -0.307 
 3 Datsun 710         22.8  2.32    24.9 -2.09  0.0584   3.07 1.54e-2    -0.706 
 4 Hornet 4 Drive     21.4  3.22    20.1  1.30  0.0313   3.09 3.02e-3     0.433 
 5 Hornet Sportabout  18.7  3.44    18.9 -0.200 0.0329   3.10 7.60e-5    -0.0668
 6 Valiant            18.1  3.46    18.8 -0.693 0.0332   3.10 9.21e-4    -0.231 
 7 Duster 360         14.3  3.57    18.2 -3.91  0.0354   3.01 3.13e-2    -1.31  
 8 Merc 240D          24.4  3.19    20.2  4.16  0.0313   3.00 3.11e-2     1.39  
 9 Merc 230           22.8  3.15    20.5  2.35  0.0314   3.07 9.96e-3     0.784 
10 Merc 280           19.2  3.44    18.9  0.300 0.0329   3.10 1.71e-4     0.100 
# ℹ 22 more rows

But how do these functions work?

Some common tables

Descriptive summary statistics tables

library(tidyverse)
mtcars %>% 
  select(mpg, wt, vs, cyl) %>% 
  glimpse()

Rows: 32
Columns: 4
$ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, …
$ wt  <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.4…
$ vs  <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, …
$ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8, …

	4 (N=11)	6 (N=7)	8 (N=14)	Overall (N=32)
mpg
Mean (SD)	26.7 (4.51)	19.7 (1.45)	15.1 (2.56)	20.1 (6.03)
Median [Min, Max]	26.0 [21.4, 33.9]	19.7 [17.8, 21.4]	15.2 [10.4, 19.2]	19.2 [10.4, 33.9]
wt
Mean (SD)	2.29 (0.570)	3.12 (0.356)	4.00 (0.759)	3.22 (0.978)
Median [Min, Max]	2.20 [1.51, 3.19]	3.22 [2.62, 3.46]	3.76 [3.17, 5.42]	3.33 [1.51, 5.42]
vs
0	1 (9.1%)	3 (42.9%)	14 (100%)	18 (56.3%)
1	10 (90.9%)	4 (57.1%)	0 (0%)	14 (43.8%)

Regression results tables

df <- mutate(mtcars, 
             cyl = as.factor(cyl))
fit1 <- lm(mpg ~ wt + vs + cyl, data = df)
fit2 <- lm(mpg ~ wt + cyl,  data = df)

Characteristic	`fit1`			`fit2`
Characteristic	Beta	95% CI ¹	p-value	Beta	95% CI ¹	p-value
wt	-3.3	-4.9, -1.7	<0.001	-3.2	-4.7, -1.7	<0.001
vs	0.86	-2.5, 4.2	0.6
cyl
4	—	—		—	—
6	-3.9	-7.1, -0.67	0.020	-4.3	-7.1, -1.4	0.005
8	-5.1	-10, -0.10	0.046	-6.1	-9.5, -2.7	<0.001
¹ CI = Confidence Interval

What statistics to present?

This depends on what you want to convey and your audience.
There are two key purposes of the table:
1. display information; and
2. communicate information.
In general, tables tend to be about display of information and graphs are preferred for communication.
However, if precision matters then tables can be better at communicating this than graphs.

Descriptive summary statistics

The main goal is to give a numerical summary to give a “feel” of what the data contains. These generally should convey:
- variables in the data,
- the number of observations for each variable,
- missing values (if any),
- the distribution, e.g. in the form of five number of summaries or counts/percentages for categorical variables.

Descriptive summary statistics

For numerical variables, you may have a correlation table which displays the correlation coefficients of every pair of variables. Because it is symmetrical, you can omit out the upper triangle.

Variables	Mean	SD	1.	2.
1. Miles per gallon	20.09	6.03
2. Weight	3.22	0.98	-0.87
3. Horsepower	146.69	68.56	-0.78	0.66

Descriptive summary statistics

You may have cross-tabulations (also called contingency tables)

cyl/gear	3	4	5	Total
4	3.1% (1)	25.0% (8)	6.2% (2)	34.4% (11)
6	6.2% (2)	12.5% (4)	3.1% (1)	21.9% (7)
8	37.5% (12)	0.0% (0)	6.2% (2)	43.8% (14)
Total	46.9% (15)	37.5% (12)	15.6% (5)	100.0% (32)

Regression results

Or more generally, the important numerical characteristics of your model. This may include:
- parameter estimates of your model,
- the standard error or confidence/credible interval of your estimates,
- model fit quality, e.g. \(R^2\), AIC, BIC and so on.
You see often the inclusion of \(p\)-value, usually from the significance test of the corresponding variable and some indicate the significance level by the amount of *.
It is important to convey the uncertainty for the estimates or predictions from your model.

Regression results

There is also what is called the ANOVA table that shows the variation according to different sources.

glimpse(ToothGrowth)

Rows: 60
Columns: 3
$ len  <dbl> 4.2, 11.5, 7.3, 5.8, 6.4, 10.0, 11.2, 11.2, 5.2, 7.0, 16.5, 16.5,…
$ supp <fct> VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, V…
$ dose <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0, …

	Df	Sum Sq	Mean Sq	F value	Pr(>F)
supp	1	205.350	205.35000	15.571979	0.0002312
factor(dose)	2	2426.434	1213.21717	91.999965	0.0000000
supp:factor(dose)	2	108.319	54.15950	4.106991	0.0218603
Residuals	54	712.106	13.18715

Communication considerations for tables

Numerical precision

Select an appropriate precision for your goal and audience:

	Weight
Car brand	5 d.p.	2 d.p.	0 d.p.
Mazda RX4	2.61937	2.62	3
Mazda RX4 Wag	2.87518	2.88	3
Datsun 710	2.31916	2.32	2
Hornet 4 Drive	3.21660	3.21	3

Numerical precision

Display trailing zeroes to match selected precision of the column:

Trailing zeroes
Yes	No
0.233	0.233
0.320	0.32
0.400	0.4
0.343	0.343

Numerical precision

Change and display units as appropriate:

Car brand	Weight (lbs)	Weight (1000 lbs)	Weight (kg)	Weight (g)	Weight (mg)
Mazda RX4	2620	2.620	1188	1190000	1,188,000,000
Mazda RX4 Wag	2875	2.875	1304	1300000	1,304,000,000
Datsun 710	2320	2.320	1052	1050000	1,052,000,000
Hornet 4 Drive	3215	3.215	1458	1460000	1,458,000,000

Show comma every 3 digits (or other marks as needed).
E.g. 1000000 is harder to read than 1,000,000.

scales::comma(c(439024, 4900343), accuracy = 1000)

[1] "439,000"   "4,900,000"

Column alignment

Spanner labels are usually aligned in center.
Right-align numbers
Left-align texts

Car brand			Horsepower
Left	Center	Right	Left	Center	Right
Mazda RX4	Mazda RX4	Mazda RX4	110	110	110
Mazda RX4 Wag	Mazda RX4 Wag	Mazda RX4 Wag	110	110	110
Datsun 710	Datsun 710	Datsun 710	93	93	93
Hornet 4 Drive	Hornet 4 Drive	Hornet 4 Drive	110	110	110

Labels within tables

It is possibly obvious, but tables designed as final product (e.g. in report) should have polished labels
For columns, the unit may be written in the column header label
You shouldn’t label the unit within the table

✅

Car brand	Weight<br> (1000 lbs)	Displacement<br> (cubic inches)
Mazda RX4	2.620	160
Mazda RX4 Wag	2.875	160
Datsun 710	2.320	108
Hornet 4 Drive	3.215	258
Hornet Sportabout	3.440	360
Valiant	3.460	225

❌

car	wt	disp
Mazda RX4	2620 lbs	160 cubic inches
Mazda RX4 Wag	2875 lbs	160 cubic inches
Datsun 710	2320 lbs	108 cubic inches
Hornet 4 Drive	3215 lbs	258 cubic inches
Hornet Sportabout	3440 lbs	360 cubic inches
Valiant	3460 lbs	225 cubic inches

Texts accompanying tables

Besides the contents of table, a table may be accompanied with: table header, caption, footnotes and/or source notes.
The conventions of how and what to write will depend on your audience and medium of report
Generally if you are communicating information, your caption should:
- summarise the take-away message, in other words, why should the audience care about this table?
- give context of the table (e.g. “\(R_0 > 1\) means that the virus is more infectious”)

Making tables with R

🏗️ How to make tables in R?

There are many packages that make table in R, including ones that wrangle the data for you to make specialised table output. E.g. knitr::kable, kableExtra, formattable, gt, DT, pander, xtable, stargazer.
You can read the documentation for each packages to make the table you want (I mainly use knitr::kable, kableExtra and DT).
Whatever package you use, it’s important that you understand the output and how it works with the medium you are trying to display the table.

Table in Markdown

In markdown file:

First Header  | Second Header
------------- | -------------
Content Cell  | Content Cell
Content Cell  | Content Cell

Possible display:

First Header	Second Header
Content Cell	Content Cell
Content Cell	Content Cell

Table in HTML & PDF

HTML → Web browser

<table>
<thead>
<tr>
<th>First Header</th> <th>Second Header</th>
</tr>
</thead>
<tbody>
<tr>
<td>Content Cell</td> <td>Content Cell</td>
</tr>
<tr>
<td>Content Cell</td> <td>Content Cell</td>
</tr>
</tbody>
</table>

LaTeX → PDF

\begin{tabular}{cc}
\hline
First Header & Second Header\\
\hline
Content Cell & Content Cell\\
Content Cell & Content Cell\\
\hline
\end{tabular}

HTML is for HTML output
LaTeX is for PDF output
Most HTML elements do not work in LaTeX and vice versa.

Case study: building static tables by components with `gt`

Source: https://gt.rstudio.com/

Case study: building static tables by components with `gt`

library(gt)
df <- select(head(mtcars), wt, disp, cyl)

Case study: building static tables by components with `gt`

library(gt)
df <- select(head(mtcars), wt, disp, cyl)
gt(df)

wt	disp	cyl
2.620	160	6
2.875	160	6
2.320	108	4
3.215	258	6
3.440	360	8
3.460	225	6

Case study: building static tables by components with `gt`

library(gt)
df <- select(head(mtcars), wt, disp, cyl)
gt(df) %>% 
  tab_header(title = "Motor Trend Car Road Tests", 
             subtitle = "Design and performance of 1973-74 automobile models")

Motor Trend Car Road Tests
Design and performance of 1973-74 automobile models
wt	disp	cyl
2.620	160	6
2.875	160	6
2.320	108	4
3.215	258	6
3.440	360	8
3.460	225	6

Case study: building static tables by components with `gt`

library(gt)
df <- select(head(mtcars), wt, disp, cyl)
gt(df) %>% 
  tab_header(title = "Motor Trend Car Road Tests",
             subtitle = "Design and performance of 1973-74 automobile models") %>% 
  tab_source_note(md("Source: 1974 *Motor Trend* US magazine"))

Motor Trend Car Road Tests
Design and performance of 1973-74 automobile models
wt	disp	cyl
2.620	160	6
2.875	160	6
2.320	108	4
3.215	258	6
3.440	360	8
3.460	225	6
Source: 1974 Motor Trend US magazine

Case study: building static tables by components with `gt`

library(gt)
df <- select(head(mtcars), wt, disp, cyl)
gt(df) %>% 
  tab_header(title = "Motor Trend Car Road Tests",
             subtitle = "Design and performance of 1973-74 automobile models") %>% 
  tab_source_note(md("Source: 1974 *Motor Trend* US magazine")) %>% 
  cols_label( 
    wt = html("Weight<br>(1000lbs)"), 
    disp = html("Displacement<br> (inch<sup>3</sup>)"), 
    cyl = html("Number of<br>cylinders") 
  )

Motor Trend Car Road Tests
Design and performance of 1973-74 automobile models
Weight (1000lbs)	Displacement (inch³)	Number of cylinders
2.620	160	6
2.875	160	6
2.320	108	4
3.215	258	6
3.440	360	8
3.460	225	6
Source: 1974 Motor Trend US magazine

Case study: interactive tables with `DT`

DT::datatable(mtcars, options = list(pageLength = 4))

DT::datatable works best for HTML documents.
For PDF documents, it takes a webshot of the HTML table and inserts it as an image.
It’s useful for data exploration where the main goal of the table is display of information rather than communication.

When do you make tables over plots?

When you want to show exact values or the accuracy of the values are important to convey.

You can combine plots with tables!

Case study heatmap table with `formattable`

library(formattable)
mtcars %>% 
  select(mpg, disp, wt, hp) %>% 
  cor() %>% 
  as.data.frame() %>% 
  rownames_to_column("Variables") %>% 
  formattable(list(area(col = 2:5) ~ color_tile("#F5B7B1", "#7DCEA0")))

Variables	mpg	disp	wt	hp
mpg	1.0000000	-0.8475514	-0.8676594	-0.7761684
disp	-0.8475514	1.0000000	0.8879799	0.7909486
wt	-0.8676594	0.8879799	1.0000000	0.6587479
hp	-0.7761684	0.7909486	0.6587479	1.0000000

Case study: inline plots with `sparkline`

Sparkline refers to a small chart drawn without axes or coordinates.
This will be out of scope for this unit.

Week 5 Lesson

Summary

You saw various common tables that present information and the motivation behind it
We went through some guidelines for best way to communicate with tables

Resources

ETC5523: Communicating with Data

Statistical model outputs

📈 Statistical models

Fitting linear models in R

ℹ️ Extracting information from the fitted model

ℹ️ Extracting information from the fitted model

ℹ️ Extracting information from the fitted model

ℹ️ Extracting information from the fitted model

Model objects to tidy data

Some common tables

Descriptive summary statistics tables

Regression results tables

What statistics to present?

Descriptive summary statistics

Descriptive summary statistics

Descriptive summary statistics

Regression results

Regression results

Communication considerations for tables

Numerical precision

Numerical precision

Numerical precision

Column alignment

Labels within tables

Texts accompanying tables

Making tables with R

🏗️ How to make tables in R?

Table in Markdown

Table in HTML & PDF

Case study: building static tables by components with gt

Case study: building static tables by components with gt

Case study: building static tables by components with gt

Case study: building static tables by components with gt

Case study: building static tables by components with gt

Case study: building static tables by components with gt

Case study: interactive tables with DT

When do you make tables over plots?

Case study heatmap table with formattable

Case study: inline plots with sparkline

Week 5 Lesson

Case study: building static tables by components with `gt`

Case study: building static tables by components with `gt`

Case study: building static tables by components with `gt`

Case study: building static tables by components with `gt`

Case study: building static tables by components with `gt`

Case study: building static tables by components with `gt`

Case study: interactive tables with `DT`

Case study heatmap table with `formattable`

Case study: inline plots with `sparkline`