James Magnuson (https://magnuson.psy.uconn.edu/) will present a talk sponsored by the Five College Cognitive Science Speaker Series in ILC N400 at noon on Wednesday the 27th. Pizza will be served. The title and abstract are below. All are welcome!
EARSHOT: A minimal neural network model of human speech recognition that learns to map real speech to semantic patterns
James S. Magnuson, Heejo You, Hosung Nam, Paul Allopenna, Kevin Brown, Monty Escabi, Rachel Theodore, Sahil Luthra, Monica Li, & Jay Rueckl
One of the great unsolved challenges in the cognitive and neural sciences is understanding how human listeners achieve phonetic constancy (seemingly effortless perception of a speaker’s intended consonants and vowels under typical conditions) despite a lack of invariant cues to speech sounds. Models (mathematical, neural network, or Bayesian) of human speech recognition have been essential tools in the development of theories over the last forty years. However, they have been of little help in understanding phonetic constancy because most do not operate on real speech (they instead focus on mapping from a sequence of consonants and vowels to words in memory), and most do not learn. The few models that work on real speech borrow elements from automatic speech recognition (ASR), but do not achieve high accuracy and are arguably too complex to provide much theoretical insight. Over the last two decades, however, advances in deep learning have revolutionized ASR, with neural network approaches that emerged from the same framework as those used in cognitive models. These models do not offer much guidance for human speech recognition because of their complexity. Our team asked whether we could borrow minimal elements from ASR to construct a simple cognitive model that would work on real speech. The result is EARSHOT (Emulation of Auditory Recognition of Speech by Humans Over Time), a neural network trained on 1000 words produced by 10 talkers. It learns to map spectral slice inputs to sparse “pseudo-semantic” vectors via recurrent hidden units. The element we have borrowed from ASR is the “long short-term memory” (LSTM) node. LSTM nodes have a memory cell and internal “gates” that allow nodes to become differentially sensitive to variable time scales. EARSHOT achieves high accuracy and moderate generalization, and exhibits human-like over-time phonological competition.
Analyses of hidden units – based on approaches used in human electrocorticography – reveal that the model learns a distributed phonological code to map speech to semantics that resembles responses to speech observed in human superior temporal gyrus. I will discuss the implications for cognitive and neural theories of human speech learning and processing.