Skip to content

Instantly share code, notes, and snippets.

@richardbeare
Last active July 7, 2023 07:01
Show Gist options
  • Save richardbeare/b679c38dcb644ec50ea34ac061200504 to your computer and use it in GitHub Desktop.
Save richardbeare/b679c38dcb644ec50ea34ac061200504 to your computer and use it in GitHub Desktop.
ggplot gam smoothing with random effects

Tricks for plotting formant data with gam smoothing

Speech data sets typically contain multiple repetitions of the same speech sound, each sampled at multiple time points. The samples within a given repetition are thus not independent and smoothing procedures should take this into account. One way of doing this with generalized additive models (gams) is via a random effect. However, random effects are not included when performing smoothing with ggplot2 using default options. This gist illustrates some tricks for incuding random effects in ggplot smoothing.

The methods below have been used in the following papers:

A formant study of the alveolar versus retroflex contrast in three Central Australian languages: Stop, nasal, and lateral manners of articulation. Tabain, Butcher, Breen, Beare. Journal of the Acoustical Society of America. 2020. https://doi.org/10.1121/10.0001012

An articulatory study of the alveolar versus retroflex contrast in pre-and post-stress position in Arrernte. Tabain, Beare. Journal of Phonetics. 2020. https://doi.org/10.1016/j.wocn.2019.100952

Generate some simulated data

The following creates some very simple data in the right format. There are a set of group columns - manner, place and speaker and the tcknum column identifies a specific trace.

library(tidyverse)

genTrace <- function(tcknum=1) {
  # generate a noisy trace representing a single formant.
  #
  mannerchoice <- c("stop", "nasal", "lateral")
  placechoice <- c("dental", "alveolar")
  speakerchoice <- c("spkA", "spkB")
  manner <- sample(mannerchoice, 1)
  place <- sample(placechoice, 1)
  speaker <- sample(speakerchoice, 1)

  MO <- c(-100, 0, 100)
  names(MO) <- mannerchoice
  SO <- c(-1.1, 1, 1.1)
  names(SO) <- placechoice
  time <- seq(from=0.005, to=0.1, by=0.005)
  slope <- runif(1, min=0, max=5000) * SO[place]
  intercept <- runif(1, min=1400, max=1750) +MO[manner]
  SampleValue <- time*slope + intercept + rnorm(length(time), mean=0, sd=20)
  return(tibble(time=time, SampleValue=SampleValue, place=place, speaker.labs=speaker, manner=manner, tcknum=tcknum))
}

Generate 400 traces.

pretenddata <- bind_rows(map(1:400, genTrace))

Standard ggplot

Standard gam smoothing via geom_smooth

ggplot(pretenddata, aes(x=time, y=SampleValue)) +
  facet_grid(place ~ speaker.labs) +
  geom_line(aes(group=tcknum, colour=manner), alpha=0.05) +
  geom_smooth(aes(group=manner, colour=manner),
              method="gam", formula = y ~ s(x, bs = "cs") ) +
  ggtitle("Standard gam smoothing") +
  xlab("time (seconds)") +
  ylab("Hz")

standard

Including a random effect

The following snippet replaces the ggplot predict function for gams with one that will include a random effect.

predictdf.gam <- function(model, xseq, se, level) {
  olddata <- model.frame(model)
  if (is.null(olddata$randomid)) {
    newdata= tibble(x=xseq)
  } else {
    newdata = tibble(x=xseq, randomid=olddata$randomid[1])
  }
  pred <- predict(model, exclude="s(randomid)", newdata = newdata,
                  se.fit = se, level = level, interval = if (se)
                    "confidence"
                  else "none")
  if (se) {
    y = pred$fit
    ci <- pred$se.fit * 1.96
    ymin = y - ci
    ymax = y + ci
    tibble(x = xseq, y, ymin, ymax, se = pred$se.fit)
  }
  else {
    tibble(x = xseq, y = as.vector(pred))
  }

}
environment(predictdf.gam) <- environment(ggplot2:::predictdf.glm)

Now plot with smoothing formula including a random effect

ggplot(pretenddata, aes(x=time, y=SampleValue)) +
  facet_grid(place ~ speaker.labs) +
  geom_line(aes(group=tcknum, colour=manner), alpha=0.05) +
  geom_smooth(aes(group=manner, colour=manner, randomid=factor(tcknum)),
              method="gam", formula = y ~ s(x, bs = "cs") + s(randomid,
                                                              bs="re")) +
  ggtitle("gam smoothing with random effect") +
  xlab("time (seconds)") +
  ylab("Hz")

randomid

Comments

The confidence intervals are larger when the random effect is used because the correlation between residuals from the same track (lack of independence) has been included in the estimation.

There are also subtle difference in shape of the result. In some cases the difference in shape can be dramatic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment