We started out with vectors, thinking about what we can get by adding them and multiplying them by scalars:
Image: 3B1B linearity definition
We generalized to “vector-ish” objects to get the intuition behind basis functions.
Image: 3B1B Abstract vector spaces
In GAMs, we get splines by taking linear combinations of basis functions.
Figure: Gavin Simpson’s basis function animation
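To make “splines are linear combinations of basis functions” concrete, here is a minimal Python sketch (my own illustration, assuming scipy is available; it is not from the animation): each column of `basis` is one fixed cubic B-spline basis function, and the smooth is just the weighted sum `basis @ coefs`.

```python
# Minimal sketch (illustration only): a spline as a linear combination of
# fixed B-spline basis functions.
import numpy as np
from scipy.interpolate import BSpline

k = 3                                       # cubic basis functions
knots = np.concatenate(([0.0] * k, np.linspace(0, 1, 8), [1.0] * k))
n_basis = len(knots) - k - 1                # number of basis functions

x = np.linspace(0, 1, 200)
# Evaluate each basis function by handing BSpline a one-hot coefficient vector
basis = np.column_stack(
    [BSpline(knots, np.eye(n_basis)[j], k)(x) for j in range(n_basis)]
)

coefs = np.random.default_rng(1).normal(size=n_basis)  # the "beta_j" weights
spline = basis @ coefs                      # linear combination = one smooth curve
```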
How do we find the “best” line?
For PCA, we want to maximize the variance captured, which also minimizes the residuals.
> As I said before, PCA will find the “best” line according to two different criteria of what is the “best”. First, the variation of values along this line should be maximal. Pay attention to how the “spread” (we call it “variance”) of the red dots changes while the line rotates; can you see when it reaches maximum? Second, if we reconstruct the original two characteristics (position of a blue dot) from the new one (position of a red dot), the reconstruction error will be given by the length of the connecting red line. Observe how the length of these red lines changes while the line rotates; can you see when the total length reaches minimum?
>
> If you stare at this animation for some time, you will notice that “the maximum variance” and “the minimum error” are reached at the same time, namely when the line points to the magenta ticks I marked on both sides of the wine cloud. This line corresponds to the new wine property that will be constructed by PCA.
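A quick numerical version of the animation’s claim (my own sketch, not from the quoted answer): scanning candidate directions for 2-D data, the direction that maximizes the variance of the projections is the same one that minimizes the total squared reconstruction error.

```python
# Check the two "best line" criteria numerically: scan candidate directions
# and compare projected variance with reconstruction error.
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[3, 1.5], [1.5, 1]], size=500)
X = X - X.mean(axis=0)

angles = np.linspace(0, np.pi, 500)
dirs = np.column_stack([np.cos(angles), np.sin(angles)])   # unit direction vectors

proj = X @ dirs.T                        # scores along each candidate line
var_along = proj.var(axis=0)             # "spread of the red dots"
recon_err = (X ** 2).sum() - (proj ** 2).sum(axis=0)  # Pythagoras: residual SS

best_var = angles[var_along.argmax()]
best_err = angles[recon_err.argmin()]
print(best_var, best_err)                # same angle: max variance = min error
```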
How do we find the best-fit line, the one that minimizes the error? The best-fit line is in fact the maximum likelihood estimate! Note the role of the independence assumption in the likelihood computation. (Remember: minimizing/maximizing typically means computing a derivative and setting it to zero.)
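To spell out why (a standard derivation, not quoted from the readings): if the errors \(\epsilon_i\) are independent and Gaussian, the likelihood factors into a product over observations, and maximizing its log is exactly minimizing the sum of squared residuals.

\[ \mathcal{L}(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2}\right) \]

\[ \log \mathcal{L} = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2 \]

The product in the first line is only valid because the \(\epsilon_i\) are independent; maximizing the second line over \(\beta_0, \beta_1\) is the same as minimizing \(\sum_i (y_i - \beta_0 - \beta_1 x_i)^2\), which is where “take the derivative and set it to zero” comes in.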
We talked the most about likelihood, posterior, etc. when we discussed Ch. 4 of Introduction to Statistical Learning.
Aside: that’s also how we estimate the smooths for GAMs: as a maximum likelihood estimate with a penalty term for wiggliness (penalized maximum likelihood), as schematized in Gavin Simpson’s animation:
Simpson animation: MLE for GAMs
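Schematically, the penalized criterion has the standard form (my notation, not taken from the animation): a fit term plus \(\lambda\) times a wiggliness penalty, with \(\lambda\) itself estimated (e.g., by REML or GCV).

\[ \min_{f} \; \sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2 \;+\; \lambda \int f''(x)^2 \, dx \]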
Reminder: concept of span: “The span of two vectors is basically a way of asking what are all the possible vectors you can reach using these two by only using those fundamental operations of vector addition and scalar multiplication.” 3B1B span animation
From a tutorial vid Josh found
Linear regression and the hat matrix
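As a quick reminder of the algebra behind that (standard results, not specific to the video): the least-squares coefficients and fitted values are

\[ \hat{\beta} = (X^\top X)^{-1} X^\top y, \qquad \hat{y} = X\hat{\beta} = \underbrace{X (X^\top X)^{-1} X^\top}_{H}\, y \]

The hat matrix \(H\) “puts the hat on” \(y\): it projects \(y\) onto the span of the columns of \(X\), which is exactly the span idea above.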
Binary outcome variable, estimates of probabilities: getting back to a linear model via transformation of logistic function to log-odds space!
\[p(X) = \frac{e^{\beta_0+\beta_1X}}{1 + e^{\beta_0+\beta_1X}} \]
\[\log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1X \]
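A quick numerical check (my own sketch, not from ISLR) that the two formulas above describe the same model: the log-odds of \(p(X)\) recover the linear predictor.

```python
# Verify that the logistic form and the log-odds form are inverses of each other.
import numpy as np

beta0, beta1 = -1.0, 2.0             # arbitrary illustrative coefficients
X = np.linspace(-3, 3, 7)

eta = beta0 + beta1 * X              # linear predictor in log-odds space
p = np.exp(eta) / (1 + np.exp(eta))  # logistic transform: probabilities in (0, 1)
log_odds = np.log(p / (1 - p))       # back-transform to log-odds

print(np.allclose(log_odds, eta))    # True
```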
We saw this especially in the Gubian tutorials. A reminder, from Soskuthy GAMMs tutorial, Section 2.4, p. 14:
Figure: increasing CIs as random effects added
For a quick and very accessible intro to mixed models, besides Gelman and Hill, you can also try Winter (2019), Chs. 14 and 15.
Winter (2019, p. 232)
> What is independence? Rolling a die repeatedly is a nice example of a truly independent process. Granted that you shake the die thoroughly before rolling it, the outcome of each roll is independent from another one. A dependence, then, is any form of connection between data points. For example, two data points could be connected by virtue of coming from the same participant. In that case, these data points are not independent anymore. Most of the time, multiple data points from the same participant are more similar to each other than data points from different participants. You can think of this as a statement about the residuals: if participant A performs overall differently from participant B, then all of participant A’s residuals will act as a group, and so will all of participant B’s residuals.
Violations of the independence assumption have massive effects on the Type I error (false positive) rates of a study.
Strategies for dealing with violations of independence:
p. 233:
> How can one deal with violations of the independence assumption? Whether or not the independence assumption has been violated is something that has to do with the design of a study, as well as with how a study is analyzed. One can deal with the independence assumption by designing studies that minimize dependence between data points. For example, in some circumstances it may be possible to perform single-trial between-participant experiments where each participant only contributes one data point.
p. 234:

> Another way of dealing with non-independences is via aggregation. If you had multiple data points from the same participant, why not average everything so that each participant only contributes one data point? This is a possible way of dealing with non-independent cases, but it’s not the optimal way, because whenever you compute averages you lose information (see Chapter 4). In particular, the variation across the non-independent cases is not retained in the final analysis. If your statistical model only ‘sees’ the averaged data points, it is going to underestimate the amount of variation present in the data. This also means that the model has less information available for making adequate inferences.
Mixed models! Section 14.4, p. 234
> These models allow incorporating non-independent clusters of data into one’s analysis. In other words: you can tell your mixed model about the dependency structures within a dataset so that it makes appropriate estimates and draws appropriate inferences.
>
> The primary workhorse for dealing with clusters of non-independent data points are what many researchers call ‘random effects’; specifically, ‘random intercepts’ and ‘random slopes’.
On random effects as “factors” (categorical)
> While fixed effects can be continuous (Chapters 4–6) or categorical (Chapter 7), random effects are necessarily categorical. Why? Think about it this way: the whole point of fitting a mixed model is to account for dependent clusters of data points that somehow group together. The concept of a ‘group’ is a categorical one… it may also help to think from a sampling perspective: sampling from a population, such as the population of all speakers, involves sampling discrete units of that population.
Note also that with random effects, “mixed models do not actually estimate one parameter per participant in this dataset. Instead, mixed models estimate the variation around the specified random effect. So, if you allow intercepts to vary by participants in your model, then this only adds one term to your model.”
Figure: Adding in random effects
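A minimal code sketch of the random-intercepts idea (hypothetical reaction-time data, my own example; the readings use R’s lme4/mgcv, here statsmodels’ `MixedLM` stands in):

```python
# Random intercepts by participant, on simulated (hypothetical) data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n_subj, n_trials = 20, 30
subj = np.repeat(np.arange(n_subj), n_trials)
cond = np.tile([0, 1], n_subj * n_trials // 2)          # within-subject condition
subj_intercept = rng.normal(0, 50, n_subj)[subj]        # by-participant offsets
rt = 500 + 30 * cond + subj_intercept + rng.normal(0, 40, n_subj * n_trials)
df = pd.DataFrame({"rt": rt, "condition": cond, "participant": subj})

# Random intercept only: one extra *variance* term, not one parameter per subject
m = smf.mixedlm("rt ~ condition", df, groups=df["participant"]).fit()
print(m.summary())

# Random intercepts plus by-participant random slopes for condition
m2 = smf.mixedlm("rt ~ condition", df, groups=df["participant"],
                 re_formula="~condition").fit()
```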
Another perspective from the Soskuthy tutorial, Section 2.4, pp. 12-14
Without “telling” the model about structure in the data, we artificially inflate “n”.
> When we fit a regression model to our data set without random effects or an appropriate error model, we essentially ignore the grouping/temporal/spatial structure in the data: we pretend that the data set consists of independent measurements, each of them taken from a separate vowel. As a result, we will also erroneously assume that all the information in the individual data points can be used towards estimating the underlying parameters, even though some of the information is actually about other things like the separate trajectories. Since we think that we have more information than we actually do, we become overconfident about our estimates, which leads to anti-conservative results: p-values that are biased downwards and overly narrow confidence intervals. When the random effects / error model are correctly specified, the model uses the right amount of information towards estimating the underlying parameters, and the overconfidence issue disappears.
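To see the overconfidence concretely, here is a rough simulation sketch (my own illustration, not from the tutorial): a between-participant predictor with no true effect, many trials per participant, analyzed with plain OLS that ignores the grouping. The false-positive rate ends up well above the nominal 5%.

```python
# Ignoring the grouping structure inflates "n" and the Type I error rate.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n_subj, n_trials, n_sims = 20, 50, 200
false_pos = 0
for _ in range(n_sims):
    subj_offset = np.repeat(rng.normal(0, 1, n_subj), n_trials)   # subject clusters
    group = np.repeat(rng.integers(0, 2, n_subj), n_trials)       # between-subject, NO true effect
    y = subj_offset + rng.normal(0, 1, n_subj * n_trials)
    df = pd.DataFrame({"y": y, "group": group})
    fit = smf.ols("y ~ group", df).fit()      # pretends all 1000 rows are independent
    false_pos += fit.pvalues["group"] < 0.05

print(false_pos / n_sims)    # typically far above the nominal 0.05
```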
What assumptions come with a linear model?
Linear model: linear combination of predictors
\[y_i = \beta_0 +\sum_j \color{red}{ \beta_j x_{ji}} +\epsilon_i\]
Additive model (generalize from linear combination of predictors to smooths over predictors) \[y_i = \beta_0 + \sum_j \color{red}{s_j(x_{ji})} + \epsilon_i\]
(What do we assume about \(\epsilon_i\)?)
So in linear models, you restrict the smooth functions \(s_j\) to the form:
\[ s_j(x_{ji}) = \beta_jx_{ji} \]
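In code, the contrast between the restricted (linear) form and a free smooth might look like this sketch (my own example; the course tutorials use R’s mgcv, and I’m assuming the Python pyGAM package as a stand-in):

```python
# Linear fit vs. penalized spline smooth on the same (simulated) data.
import numpy as np
from pygam import LinearGAM, s

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 300)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 300)   # clearly non-linear signal
X = x.reshape(-1, 1)

# "Linear model": s_j(x) restricted to beta_j * x, i.e. a straight line
linear_coefs = np.polyfit(x, y, deg=1)

# Additive model: s_j(x) is a penalized spline smooth over x
gam = LinearGAM(s(0)).fit(X, y)
y_smooth = gam.predict(X)
```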
From ISLR2 Ch. 4
Recall what we mean by “linear” and “additive”; see p. 87:
> The additivity assumption means that the association between a predictor \(X_j\) and the response \(Y\) does not depend on the values of the other predictors.
>
> The linearity assumption states that the change in the response \(Y\) associated with a one-unit change in \(X_j\) is constant, regardless of the value of \(X_j\).
See also bike example
So what happened to interactions? Do we add them manually?
How can we get the linear model to include interactions? Before you fit the linear model, add a column to the feature matrix that represents the interaction between the features and fit the model as usual. The solution is elegant in a way, since it does not require any change of the linear model, only additional columns in the data.
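A tiny sketch of that recipe (my own illustration, with hypothetical features `x1` and `x2`): the interaction is literally just one more column in the design matrix, and the model is fit exactly as before.

```python
# "Interactions = extra columns": build the design matrix by hand and fit OLS.
import numpy as np

rng = np.random.default_rng(5)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)

X = np.column_stack([np.ones(100), x1, x2])      # intercept + main effects
X_inter = np.column_stack([X, x1 * x2])          # interaction as another column

y = 1 + 2 * x1 - x2 + 0.5 * x1 * x2 + rng.normal(0, 0.1, 100)
beta, *_ = np.linalg.lstsq(X_inter, y, rcond=None)   # ordinary least squares
```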
See also this comparison of mixed effects models with GAMs.
Using GAMMs to model trial-by-trial fluctuations in experimental data: More risks but hardly any benefit: Thul, Conklin and Barr (JML, 2021), OSF repo.
From an old Twitter thread by Dale Barr (now available only in cached archives):
> Psychology data often include trial-by-trial fluctuations due to human factors such as practice or fatigue. Should you model these patterns, or just ignore them? Here are highlights from a study I did with Rüdiger Thul and Kathy Conklin:
>
> We investigated the use of Generalized Additive Mixed Models (GAMMs), a technique which can account for wiggly patterns in data. Advocates of using GAMMs claim that if you ignore by-trial fluctuations, you end up with autocorrelated residuals. They claim autocorrelated residuals inflate false positives. Wrong: we show temporal autocorrelation is irrelevant for randomized experiments, except in special cases.
>
> BUT, you could reduce noise by modeling the wiggly patterns using GAMMs. The question is: SHOULD you? This depends on the costs and benefits of using GAMMs. We used Monte Carlo simulation, evaluating the performance of GAMMs versus Linear Mixed-Effects Models (LMEMs) across a wide variety of by-trial patterns. The simulated experiments had one within-subject and one between-subject factor.
>
> What we found surprised us: GAMMs slightly increased power for within-subject effects but they seemed only to do this by sacrificing power for between-subject effects, often severely! LMEMs had good power and showed no signs of excess false positives.
>
> We conclude: (1) it’s not really necessary to use GAMMs if time is not of interest; (2) take care to randomize trial order to avoid problems; and (3) don’t use GAMMs if between-subject effects are important!
And an update thread from April 2021:

> The current thread describes new results, available in the updated preprint as well as in the final published version in JML.
>
> A main criticism of the original manuscript was that we used default (unpenalized) random smooths, which behave like fixed, not random effects. We re-ran the simulations to estimate power for penalized smooths. GAMMs still showed poor power vs LMEMs for between-subject effects.
>
> However, these simulations used artificial autocorrelated errors, raising concerns about external validity. So we ran a second set of simulations where we randomly sampled residuals from the Stroop task in the Many Labs 3 mega-study by @CharlieEbersole et al #hoorayforopendata
>
> The Stroop residuals from ML3 show strong practice effects as well as other patterns across three blocks of trials. Here are residual-by-trial plots from a subset of subjects as well as mean residual-by-trial plots for the 3K+ participants in the dataset.
>
> Despite the practice effects, mixed effects models were totally fine. GAMMs showed the same basic pattern as in the first simulations: small benefit to within-subject power, loss of power for between-subject effects (but only for the recommended ‘penalized’ version).
>
> Bottom line: most experiments already negate temporal autocorrelation by randomizing. GAMMs may have hidden costs, so handle with care.
>
> P.S.: The repo has all the simulation data, supplementary simulation results, as well as an R package (autocorr) bundled into a Singularity software container for reproducibility. It includes instructions for creating simulations with your own time-varying error patterns.
From the paper:
> Keeping in mind that we are focused on experiments with univariate data and random presentation order, observing temporal autocorrelation in residuals from models where time does not appear as a variable should not be cause for concern about violating independence assumptions. If temporal autocorrelation appears in the residuals and the analyst chooses to ignore it, it is wrong to view the analyst’s model as statistically unsound on that basis alone. They are non-independent only with respect to a variable that is invisible to the model. It is more accurate to view the analyst as having foregone the opportunity to obtain more precise estimates by incorporating the temporal structure in the model. Opting out from doing so can be a rational choice: GAMMs are technically challenging, they yield complex output that can be difficult to interpret, and their potential pitfalls are not well known. Consequently, using them does not guarantee better insight into data.
>
> Lest it still seem ‘wrong’ or ‘inappropriate’ for an analyst to ignore residual autocorrelation when analyzing data from a randomized experiment, let us attempt to further dislodge this notion by considering yet another way in which experimental data form a time series. Just as the individual measurements taken from a given subject are influenced by the characteristics of the particular moment at which the measurement is taken, so is the overall performance of a subject influenced by the time of day at which testing occurs, with subjects often showing more efficient performance on tasks during afternoons compared to mornings (Kleitman, 1963; Blake, 1967). Time of day fluctuations are very likely to induce residual autocorrelation, at least for tasks that are minimally cognitively demanding. In a study where you attempt to predict participants’ mean performance by experimental group, this autocorrelation could be seen by plotting the residual for each subject against the time of day at which the session took place. Yet despite this variable being a likely source of non-independence, and despite it being a variable that is recorded in every experiment (if only by computer timestamps) it is almost always ignored during analysis. Why does no one question this?
>
> We do not worry about time-of-day effects for the same reason we need not worry about time-of-trial effects: because we randomized. Randomization prospectively guards against the contamination of variance we care about by variance that we don’t. Even beyond time of trial or time of day effects, in any study there is a potentially infinite number of possible variables for which residuals, when plotted against the variable values, would show non-independence: day of the week, season of the year, temperature of the room, degree of ambient visual or acoustic noise, subject conscientiousness, visual angle subtended by stimuli relative to each subject’s visual acuity, and so on. So long as we have randomized, we don’t need to worry about non-independence relative to these variables. Trial-by-trial fluctuations in performance are no different.
Distinction between time as a “nuisance factor” and as an inescapable part of the data structure itself:
> We have also only considered univariate data where there is a single observation per trial. This excludes from consideration many types of research such as visual world eyetracking, EEG/MEG, pupillometry, and longitudinal studies. Our intention is not to dampen enthusiasm for the use of GAMMs in analyses where time is a critical variable. Indeed, GAMMs strike us as a promising approach in these contexts (van Rij et al., 2019).