library(tidyr)
Warning messages:
1: The closing backticks on line 162 ("````") in morrisonkondaurova09-explorations-ky.Rmd do not match the opening backticks "```" on line 153. You are recommended to fix either the opening or closing delimiter of the code chunk to use exactly the same numbers of backticks and same level of indentation (or blockquote).
2: The closing backticks on line 162 ("````") in morrisonkondaurova09-explorations-ky.Rmd do not match the opening backticks "```" on line 153. You are recommended to fix either the opening or closing delimiter of the code chunk to use exactly the same numbers of backticks and same level of indentation (or blockquote).
3: The closing backticks on line 162 ("````") in morrisonkondaurova09-explorations-ky.Rmd do not match the opening backticks "```" on line 153. You are recommended to fix either the opening or closing delimiter of the code chunk to use exactly the same numbers of backticks and same level of indentation (or blockquote).
library(ggplot2)
library(dplyr)
This code explores Morrison and
Kondaurova (2009)’s methods paper using the data available from
Morrison’s website.
The introduction of the paper says:
Experiment 1: First-language (L1) English listeners, L1-Spanish
L2-English listeners, and L1-Russian L2-English listeners classified
vowels from an English /i/-/ɪ/ continuum. The continuum consisted of a
99-point two-dimensional grid of linear-predictive-coding resynthesized
/bVt/ tokens, in which the vowel duration ranged from 35 to 275 ms in 30
ms steps, and the vowel formant center frequencies ranged from F1=458Hz,
F2=1876Hz, and F3=2523Hz to F1=326Hz, F2=2056Hz, and F3=2943Hz,
respectively, in equal mel steps. The grid was labeled using reference
numbers where duration and spectral values of 1 refer to the most
/i/-like properties (long duration, and low F1 and high F2) and duration
and spectral values of 9 refer to the most /ɪ/-like properties (short
duration, and high F1 and low F2). The 81 stimuli were each presented
ten times in random order and on each trial the listener classified the
stimulus as either English /bit/ or /bɪt/ (there were a total of 16
L1-English listeners, 18 L1-Spanish listeners, and 19 L1-Russian
listeners).
Read in data and label
dat.eng <- read.table('data/data_English.txt', sep = '\t')
dat.rus <- read.table('data/data_Russian.txt', sep = '\t')
dat.span <- read.table('data/data_Spanish.txt', sep = '\t')
Unfortunately I’m not sure where to find the information about what
the columns mean but we can reverse-engineer:
summary(dat.eng)
V1 V2 V3 V4 V5
Min. : 1.00 Min. :0 Min. :0 Min. : 0.000 Min. : 0.000
1st Qu.: 4.75 1st Qu.:2 1st Qu.:2 1st Qu.: 0.000 1st Qu.: 0.000
Median : 8.50 Median :4 Median :4 Median : 5.000 Median : 5.000
Mean : 8.50 Mean :4 Mean :4 Mean : 4.916 Mean : 5.084
3rd Qu.:12.25 3rd Qu.:6 3rd Qu.:6 3rd Qu.:10.000 3rd Qu.:10.000
Max. :16.00 Max. :8 Max. :8 Max. :10.000 Max. :10.000
apply(dat.eng, 2, unique)
$V1
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
$V2
[1] 0 1 2 3 4 5 6 7 8
$V3
[1] 0 1 2 3 4 5 6 7 8
$V4
[1] 2 4 1 3 8 5 0 7 9 6 10
$V5
[1] 8 6 9 7 2 5 10 3 1 4 0
So the unique values in each column in the English data are: - Col 1:
1-16, in unit increments (must be listener id) - Col 2: 0-8, in unit
increments - Col 3: 0-8, in unit increments - Col 4: 0-10, in unit
increments - Col 5: 0-10, in unit increments
And in Logistic_Regression_input_arguments.txt it
says:
stimcols {‘listener’ ‘spec’ ‘dur’} respcols {‘/I/’ ‘/i/’}
So maybe column 2 is the spectral steps, column 3 is the duration
steps, and column 4 and 5 give how many times listener responded /I/ and
/i/, respectively. Unfortunately, we don’t have trial-by-trial data here
(but we could reconstruct that, except missing block/repetition
number).
So let’s label the data frame accordingly and write a function to do
this for each of the data sets.
load.label.dat <- function(filename, directory = 'data/') {
filepath = paste(directory, filename, sep = '')
dat = read.table(filepath, sep = '\t')
names(dat) = c("listener", "step.spec", "step.dur", "num.ih", "num.iy")
dat$listener = as.factor(dat$listener)
return(dat)
}
and process each data set this way:
dat.eng <- load.label.dat("data_English.txt")
dat.rus <- load.label.dat("data_Russian.txt")
dat.span <- load.label.dat("data_Spanish.txt")
str(dat.eng)
'data.frame': 1296 obs. of 5 variables:
$ listener : Factor w/ 16 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ step.spec: int 0 0 0 0 0 0 0 0 0 1 ...
$ step.dur : int 0 1 2 3 4 5 6 7 8 0 ...
$ num.ih : int 2 4 1 1 2 4 3 8 5 0 ...
$ num.iy : int 8 6 9 9 8 6 7 2 5 10 ...
Exploratory plots
We can make plots by listener for proportion of /ih/ responses
depending on one kind of step, aggregated over the other kind of step,
like in Figure 1.
Here we aggregate over spectral steps to get by-listener responses
for each duration step.
dat.span %>%
group_by(listener,step.dur) %>%
summarize(num.ih = sum(num.ih),
num.iy = sum(num.iy)) %>%
mutate(prop.ih = num.ih/(num.ih + num.iy)) -> dat.span.by.dur
`summarise()` has grouped output by 'listener'. You can override using the `.groups` argument.
ggplot(data = dat.span.by.dur, aes(x = step.dur, y = prop.ih, group = listener, color = listener)) + geom_line() + ylim(0,1.0) + theme(aspect.ratio=1)

We can also plot a subplot for each individual listener:
ggplot(data = dat.span.by.dur, aes(x = step.dur, y = prop.ih)) + geom_line() + ylim(0,1.0) + theme(aspect.ratio=1) + facet_wrap(~listener)

Endpoint calculations
In Ref. 3, in Experiment 1, the duration endpoint-difference scores
were calculated as the proportion of /ɪ/ responses for all stimuli with
duration value of 9 minus the proportion of /ɪ/ responses for all
stimuli with duration value of 1. Likewise the spectral
endpoint-difference scores were calculated as the difference in the
proportion of /ɪ/ responses at the two spectral extremes of the stimulus
set.
So we could compute endpoint difference scores with a function like
this. Note that we must hold the step continuum over a single
variable—either spectral quality or duration, but not both—basically we
just drop the information about the other dimension. Or we could set one
of the variables to some constant, e.g., let’s compute endpoint
difference scores for spectral quality, with duration held constant at
step 5.
dat.span.by.dur %>%
dplyr::select(listener,step.dur,prop.ih) %>%
filter(step.dur %in% c(0, 8)) %>%
pivot_wider(names_from = step.dur, names_prefix = "prop.ih.", values_from = prop.ih) %>%
mutate(endpoint_diff = prop.ih.8 - prop.ih.0)
NA
NA
You can write a function to generalize the computations, as I started
to do below.
# compute.endpoint.diff <- function(data = dat.span, step = "step.spec") {
# require(dplyr)
#
# dat.by.step <- dat %>%
# group_by(listener,.data[[step]]) %>%
# summarize(num.ih = sum(num.ih),
# num.iy = sum(num.iy)) %>%
# mutate(prop.ih = num.ih/(num.ih + num.iy))
#
# dat.by.step %>%
Linear discriminant analysis
The authors point out that data from an identification task is not
normally distributed.
Here’s what two normal distributions over steps might look like:
norm_lower = rnorm(n = 1000, mean = 2, sd = 2)
norm_higher = rnorm(n = 1000, mean = 5, sd = 2)
plot(density(norm_lower))
lines(density(norm_higher))

NA
NA
Compare to the sigmoidal identification curves:
sigmoid = function(x) {
1 / (1 + exp(-x))
}
x1 <- seq(-5, 5, 0.1)
plot(x1, sigmoid(x1),pch = '.')
points(x1+2, sigmoid(x1), pch = '.')

Why do the authors say that LDA assumes a normal distribution? This
comes from thinking about LDA as a special case of Bayes
classification….
---
title: "Morrison and Kondaurova (2009) investigations"
output: html_notebook
---

```{r import-libs}
library(tidyr)
library(ggplot2)
library(dplyr)
```

This code explores [Morrison and Kondaurova (2009)'s methods paper](https://asa.scitation.org/doi/10.1121/1.3216917) using [the data available from Morrison's website](https://geoff-morrison.net/#LogReg2009).

The introduction of the paper says:

> Experiment 1: First-language (L1) English listeners, L1-Spanish L2-English listeners, and L1-Russian L2-English listeners classified vowels from an English /i/-/ɪ/ continuum. The continuum consisted of a 99-point two-dimensional grid of linear-predictive-coding resynthesized /bVt/ tokens, in which the vowel duration ranged from 35 to 275 ms in 30 ms steps, and the vowel formant center frequencies ranged from F1=458Hz, F2=1876Hz, and F3=2523Hz to F1=326Hz, F2=2056Hz, and F3=2943Hz, respectively, in equal mel steps. The grid was labeled using reference numbers where duration and spectral values of 1 refer to the most /i/-like properties (long duration, and low F1 and high F2) and duration and spectral values of 9 refer to the most /ɪ/-like properties (short duration, and high F1 and low F2). The 81 stimuli were each presented ten times in random order and on each trial the listener classified the stimulus as either English /bit/ or /bɪt/ (there were a total of 16 L1-English listeners, 18 L1-Spanish listeners, and 19 L1-Russian listeners).

# Read in data and label

```{r read-in}
dat.eng <- read.table('data/data_English.txt', sep = '\t')
dat.rus <- read.table('data/data_Russian.txt', sep = '\t')
dat.span <- read.table('data/data_Spanish.txt', sep = '\t')

```

Unfortunately I'm not sure where to find the information about what the columns mean but we can reverse-engineer:

```{r label-cols}

summary(dat.eng)
apply(dat.eng, 2, unique)

```

So the unique values in each column in the English data are:
- Col 1: 1-16, in unit increments (must be listener id)
- Col 2: 0-8, in unit increments
- Col 3: 0-8, in unit increments
- Col 4: 0-10, in unit increments
- Col 5: 0-10, in unit increments

And in `Logistic_Regression_input_arguments.txt` it says:

> stimcols        {'listener' 'spec' 'dur'}
respcols        {'/I/' '/i/'}

So maybe column 2 is the spectral steps, column 3 is the duration steps, and column 4 and 5 give how many times listener responded /I/ and /i/, respectively.  Unfortunately, we don't have trial-by-trial data here (but we could reconstruct that, except missing block/repetition number).

So let's label the data frame accordingly and write a function to do this for each of the data sets.

```{r fxn-load-label-data}

load.label.dat <- function(filename, directory = 'data/') {
  filepath = paste(directory, filename, sep = '')
  dat = read.table(filepath, sep = '\t')
  names(dat) = c("listener", "step.spec", "step.dur", "num.ih", "num.iy")
  dat$listener = as.factor(dat$listener)
  return(dat)
}
```

and process each data set this way:

```{r}

dat.eng <- load.label.dat("data_English.txt")
dat.rus <- load.label.dat("data_Russian.txt")
dat.span <- load.label.dat("data_Spanish.txt")

str(dat.eng)
```

# Exploratory plots

We can make plots by listener for proportion of /ih/ responses depending on one kind of step, aggregated over the other kind of step, like in Figure 1.

Here we aggregate over spectral steps to get by-listener responses for each duration step.

```{r dat-id-curves-by-spec}

dat.span %>%
  group_by(listener,step.dur) %>%
  summarize(num.ih = sum(num.ih),
            num.iy = sum(num.iy)) %>%
  mutate(prop.ih = num.ih/(num.ih + num.iy)) -> dat.span.by.dur


```





```{r plot-id-curves-dur}

ggplot(data = dat.span.by.dur, aes(x = step.dur, y = prop.ih, group = listener, color = listener)) + geom_line() + ylim(0,1.0) + theme(aspect.ratio=1)

```

We can also plot a subplot for each individual listener:

```{r plot-id-curves-dur-subj}

ggplot(data = dat.span.by.dur, aes(x = step.dur, y = prop.ih)) + geom_line() + ylim(0,1.0) + theme(aspect.ratio=1) + facet_wrap(~listener)

```


# Endpoint calculations

> In Ref. 3, in Experiment 1, the duration endpoint-difference scores were calculated as the proportion of /ɪ/ responses for all stimuli with duration value of 9 minus the proportion of /ɪ/ responses for all stimuli with duration value of 1. Likewise the spectral endpoint-difference scores were calculated as the difference in the proportion of /ɪ/ responses at the two spectral extremes of the stimulus set.

So we could compute endpoint difference scores with a function like this. Note that we must hold the step continuum over a single variable---either spectral quality or duration, but not both---basically we just drop the information about the other dimension. Or we could set one of the variables to some constant, e.g., let's compute endpoint difference scores for spectral quality, with duration held constant at step 5.

```{r example-compute-endpoint-diff}

dat.span.by.dur %>%
  dplyr::select(listener,step.dur,prop.ih) %>%
  filter(step.dur %in% c(0, 8)) %>%
  pivot_wider(names_from = step.dur, names_prefix = "prop.ih.", values_from = prop.ih) %>%
  mutate(endpoint_diff = prop.ih.8 - prop.ih.0)


```

You can write a function to generalize the computations, as I started to do below.

```{r fxn-compute-endpoint-diff}

# compute.endpoint.diff <- function(data = dat.span, step = "step.spec") {
#   require(dplyr)
#   
#   dat.by.step <- dat %>%
#     group_by(listener,.data[[step]]) %>%
#     summarize(num.ih = sum(num.ih),
#             num.iy = sum(num.iy)) %>%
#     mutate(prop.ih = num.ih/(num.ih + num.iy)) 
#   
#   dat.by.step %>%
    


```


# Linear discriminant analysis

The authors point out that data from an identification task is not normally distributed.

Here's what two normal distributions over steps might look like:

```{r}

norm_lower = rnorm(n = 1000, mean = 2, sd = 2)
norm_higher = rnorm(n = 1000, mean = 5, sd = 2)

plot(density(norm_lower))
lines(density(norm_higher))


````

Compare to the sigmoidal identification curves:

```{r plot-sigmoids}

sigmoid = function(x) {
   1 / (1 + exp(-x))
}

x1 <- seq(-5, 5, 0.1)
plot(x1, sigmoid(x1),pch = '.')
points(x1+2, sigmoid(x1), pch = '.')
```

Why do the authors say that LDA assumes a normal distribution?
This comes from thinking about LDA as a special case of Bayes classification....















