# Evidence of a frequency effect in Spanish L1 word duration
## Brown Bag Lunch: Spring 2022
### Kyle Parrish and Juan José Garrido Pozu
### Rutgers University

---

# The English words "time" and "thyme" are not homophones

---

# Gahl (2008, 2009)

.pull-left[
**The Study**
- Compared the duration of **80,179 tokens** of **223 homophone pairs** from the Switchboard Corpus

- Switchboard: telephone conversations between strangers who speak American English.

- Tested the claim that frequency effects on duration are part of phonological form.

- If so, homophones (e.g., *time* and *thyme*) should shorten equally, given that they were assumed to have the same phonological form.
]

.pull-right[
**The Results**
- Overall, the **higher-frequency** member of each homophone pair had a **shorter duration** than the lower-frequency member.

- This result held when several other predictors were held constant.
]

---

# Background

.large[
**Big picture**: This is important because language-external factors may impact language representations
]

.large[
**Justification**: Little research has investigated whether frequency-driven duration effects occur across language classes (e.g., stress-timed versus syllable-timed)
]

???
This is important for language acquisition, broadly, since it provides evidence that frequency effects in L2 Spanish might be related to the baseline (and perhaps input by extension), rather than some underlying L1 effect.

---

# RQs

.pull-left[
.big[
**RQ1:** Is there an association between whole-word duration and lexical frequency in Spanish?

**H1:** Yes, higher lexical frequency will be associated with shorter word duration.
]]

.pull-right[
.big[
**RQ2:** Will the frequency effect found in previous studies in English be replicated?

**H2:** Yes, more frequent words in English will also shorten. 
]]

---
# Method

.large[
We examined whether **increased lexical frequency** was associated with **shorter duration** in Spanish monolingual corpus data.
]

.large[
Additionally, we aimed to **replicate** the previous frequency effects found in **English**.
]

---

# The Corpora

Free ST American English Corpus (https://openslr.org/45/).

Cell-phone-recorded utterances from 10 speakers (5 female).

350 utterances per speaker.

A total of 2806 recordings were included.

1162 unique words were analyzed.

Crowdsourced high-quality Argentinian Spanish speech data set (https://openslr.org/61/).

Recorded by volunteers in Buenos Aires, Argentina.

The total number of speakers is unclear.

1928 recordings containing 858 unique words.

---

# Lexical Frequency

---

# Procedure

.pull-left[
- Each corpus contained a series of `.wav` files and a single transcript file covering every utterance.

- We used R to create an individual `.txt` file for each utterance and matched it to its audio file.

- Each `.wav` and `.txt` pair was automatically segmented in WebMAUS Basic.

- Quality checks have been promising, though large-scale hand correction is still needed.
]

.pull-right[
**Data tidying**
- Only words of 4-8 graphemes were included in both languages, largely to exclude function words.

- Cleaned words in `.txt` files by removing capital letters, spaces, and punctuation.

- Loaded and joined the lexical frequency corpus data with the word duration data. 
]
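
The tidying steps above can be sketched roughly as follows. This is a minimal Python illustration, not the R code used for the analysis; the helper names (`clean_word`, `keep_word`, `join_frequency`) and the toy durations and frequencies are hypothetical, and character count is used as a stand-in for grapheme count.

```python
import string

def clean_word(token: str) -> str:
    """Lowercase a token and strip spaces and punctuation."""
    drop = string.punctuation + string.whitespace + "¿¡"
    return "".join(ch for ch in token.lower() if ch not in drop)

def keep_word(word: str, min_len: int = 4, max_len: int = 8) -> bool:
    """Keep only words of 4-8 graphemes (characters used as a proxy)."""
    return min_len <= len(word) <= max_len

def join_frequency(durations, freq_table):
    """Inner-join (token, duration) rows with a word -> frequency lookup."""
    rows = []
    for token, duration in durations:
        word = clean_word(token)
        if keep_word(word) and word in freq_table:
            rows.append({"word": word, "duration": duration, "freq": freq_table[word]})
    return rows

# Toy values for illustration only (not taken from the corpora).
durations = [("Tiempo,", 0.41), ("de", 0.08), ("¿Camino?", 0.39)]
freq_table = {"tiempo": 1500.0, "camino": 300.0}
tidy = join_frequency(durations, freq_table)
```

Here the short word "de" is dropped by the length filter, which is the intended way of screening out most function words.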

---

# Statistical Analysis

**Method**: Bayesian multilevel regressions (one per language)

**Outcome variable**: Duration z-score

**Predictor**: *Log-transformed Lexical Frequency Z*: taken directly from the corpora and converted to a z-score.

**Predictor**: *Speech Rate*: measured as segments per second and standardized as a z-score

**Predictor**: *Length*: measured as the number of graphemes per word and converted to a z-score

`duration_z ~ log_freq_z + rate_z + length_z + (1 | word) + (1 | participant)`
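
As a rough illustration of how the three standardized predictors could be derived, here is a minimal Python sketch. It is not the authors' R pipeline: the natural log, the sample standard deviation, computing rate per utterance, and the toy rows are all assumptions made for the example.

```python
import math
from statistics import mean, stdev

def z_scores(values):
    """Standardize values: subtract the mean, divide by the sample SD."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def speech_rate(n_segments: int, utterance_seconds: float) -> float:
    """Speech rate as segments per second (assumed to be per utterance)."""
    return n_segments / utterance_seconds

# Toy rows for illustration only:
# (word, raw frequency, segments in utterance, utterance duration in seconds)
rows = [
    ("tiempo", 1500.0, 30, 2.5),
    ("camino", 300.0, 24, 2.0),
    ("ventana", 60.0, 36, 2.4),
]

log_freq_z = z_scores([math.log(freq) for _, freq, _, _ in rows])
rate_z = z_scores([speech_rate(seg, sec) for _, _, seg, sec in rows])
length_z = z_scores([len(word) for word, _, _, _ in rows])
```

Putting every predictor on the z-scale is what makes the coefficients in the model directly comparable in size.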

---

# Results

##### Table 1: Spanish model 
<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> Estimate </th>
   <th style="text-align:right;"> Est.Error </th>
   <th style="text-align:right;"> Q2.5 </th>
   <th style="text-align:right;"> Q97.5 </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Intercept </td>
   <td style="text-align:right;"> 0.050 </td>
   <td style="text-align:right;"> 0.068 </td>
   <td style="text-align:right;"> -0.080 </td>
   <td style="text-align:right;"> 0.185 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> log_freq_z </td>
   <td style="text-align:right;"> -0.182 </td>
   <td style="text-align:right;"> 0.026 </td>
   <td style="text-align:right;"> -0.231 </td>
   <td style="text-align:right;"> -0.133 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> rate_z </td>
   <td style="text-align:right;"> -0.153 </td>
   <td style="text-align:right;"> 0.018 </td>
   <td style="text-align:right;"> -0.189 </td>
   <td style="text-align:right;"> -0.118 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> length_z </td>
   <td style="text-align:right;"> 0.564 </td>
   <td style="text-align:right;"> 0.025 </td>
   <td style="text-align:right;"> 0.516 </td>
   <td style="text-align:right;"> 0.612 </td>
  </tr>
</tbody>
</table>

##### Table 2: English model 
<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> Estimate </th>
   <th style="text-align:right;"> Est.Error </th>
   <th style="text-align:right;"> Q2.5 </th>
   <th style="text-align:right;"> Q97.5 </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Intercept </td>
   <td style="text-align:right;"> 0.104 </td>
   <td style="text-align:right;"> 0.025 </td>
   <td style="text-align:right;"> 0.053 </td>
   <td style="text-align:right;"> 0.153 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> log_freq_z </td>
   <td style="text-align:right;"> -0.199 </td>
   <td style="text-align:right;"> 0.025 </td>
   <td style="text-align:right;"> -0.248 </td>
   <td style="text-align:right;"> -0.151 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> length_z </td>
   <td style="text-align:right;"> 0.417 </td>
   <td style="text-align:right;"> 0.022 </td>
   <td style="text-align:right;"> 0.374 </td>
   <td style="text-align:right;"> 0.461 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> rate_z </td>
   <td style="text-align:right;"> -0.256 </td>
   <td style="text-align:right;"> 0.012 </td>
   <td style="text-align:right;"> -0.278 </td>
   <td style="text-align:right;"> -0.232 </td>
  </tr>
</tbody>
</table>
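
One way to read the two tables side by side is to check whether each 95% interval for `log_freq_z` excludes zero and whether the two intervals overlap. The Python sketch below does this with the values copied from Tables 1 and 2; the helper names are hypothetical, and interval overlap is only an informal comparison, not a test of the cross-language difference.

```python
# Frequency effects copied from Tables 1 and 2: (estimate, Q2.5, Q97.5).
freq_effects = {
    "Spanish": (-0.182, -0.231, -0.133),
    "English": (-0.199, -0.248, -0.151),
}

def excludes_zero(lower: float, upper: float) -> bool:
    """True when the whole interval lies on one side of zero."""
    return lower > 0 or upper < 0

def intervals_overlap(a, b) -> bool:
    """True when two (lower, upper) intervals share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]

# Does each language's interval for log_freq_z exclude zero?
summary = {lang: excludes_zero(lo, hi) for lang, (_, lo, hi) in freq_effects.items()}

# Do the Spanish and English intervals overlap?
overlap = intervals_overlap(freq_effects["Spanish"][1:], freq_effects["English"][1:])
```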

---

# Summary

- Higher lexical frequency was associated with shorter whole-word duration in both English and Spanish.

- Overall, we replicated the English frequency effect reported by Gahl (2008, 2009).

- We extended this finding to Spanish, where the effect was similar overall.

- Future work could examine whether L2 learners acquire native-like frequency-driven duration effects in perception and production.

---

# Thank you! Questions?