My lips are moving and the sound's coming out
The words are audible but I have my doubts
-Missing Persons, ‘Words’
Uncanny Valley
Voice activation has become nearly ubiquitous in recent years, with a growing number of households including devices focused on this interaction. It is a remarkable market growth for units that do not work as well as intended.
Not so long ago, talking to a machine was considered an oddity, usually the habit of folks with a limited grasp of reality.
The leap from rarity to ubiquity came on suddenly, stunning even the most evangelical advocates. Today, the proliferation of 'smart speakers' (a double meaning if one was ever to be had) has brought even six-pack Joe into the heady world of voice control.
Just how did we get here? Voice interaction with our computers has been a dream for the wunderkinder of MIT and a core character in much of technical science fiction. The computers imagined by the futurists of the day had personalities built to fit the 'human' interaction, from HAL, the semi-psychotic computer of 2001 with its ulterior motives, to the overtly chirpy interface central to the Infinite Improbability-driven Heart of Gold.
The current generation of voice-driven smart devices is a leap from the first real-world voice-capable computers. Audrey (Bell Labs, 1952), developed to minimize voice bandwidth before automated switching, could understand spoken numbers only for specific operator voices. Shoebox (IBM, 1962) could understand 16 spoken English words, including the digits 0 through 9. Unlike Audrey, the system did not rely on specific voices; it identified three parts of each word via analog filter circuitry.
Canned Reality
The reality of voice interaction is far stranger than our expectations. True, there is still the over-reaching effort to have these machines respond to us in a way that approaches the uncanny valley. We want our machines to be 'part of the family,' a natural call and response between intimates. This feeling of closeness belies a subtle manipulation of our day-to-day interactions by the process itself.
Anyone who has interacted with Alexa, Google Home, HomePod, or the mobile phone speech-to-text tool knows how little these devices understand natural-cadence speech. For the most part, a person cannot simply utter a complex command or request without at least a few rounds of hearing 'sorry, I don't understand' or having the device play Darling Nikki when asked to 'set to do not disturb.'
As a native New Yorker, I find my Alexa units' inability to keep up with my fast-paced speech damn frustrating; I end up making the same request over and over. The same is true for non-native speakers of English. Ask the numerous European friends who have had extended stays in my home. These folks often speak English better than many of my native-born associates. Still, the frustrations they encountered in just attempting to have Alexa set a timer while cooking were off-putting. Ultimately, we could reliably interact with the devices; it only took a complete change in how we spoke.
Me Talk Pretty?
The technology may affect our speech patterns, constructing a more banal and common form of pronunciation. Until the technology can catch up, we are forced to perform a bit of code-switching, speaking in our regular cadence and pronunciation to each other while addressing the technology with something else. Commands to these devices require slower, more sharply articulated speech, with accentuated Ps, Ds, and Bs. The process can feel like being forced to speak a staccato version of 'The Queen's English' (or the now-defunct Mid-Atlantic speech).
This is not the first time technology has influenced the way humans talk. Each new leap in voice communication has forced an alternate voice from its users to ensure efficient intelligibility.
Modern music has a very intimate characteristic that did not and could not exist before the first decade of the 1900s. Before then, singers and performers needed to 'reach the back of the room' by sheer skill. They also needed their vocals to cut through the instruments and often the sound of dancing feet. Opera singers could do this with raw power, albeit with quieter, more considerate audiences.
Pre-War Dance Hall singers (not to be mistaken for the Jamaican blending of reggae, hip-hop, and R&B) needed a specific range and technique to avoid being washed out and to be heard clearly across the room. The falsetto (or, more rightly, a countertenor) voice and a passive megaphone provided just the right sound to make the vocals an explicit part of the song. You can hear a bit of this style in early World War I and II movies showing soldiers dancing while on R&R.
That Voice
Some remnants of the style can also be heard in the early vocalists added by Swing Big Bands, who were soon replaced by the mellow, more intimate whiskey voices of Bing Crosby, Frank Sinatra, Helen Forrest, and Billie Holiday. This new sultry style was made possible only by the addition of microphone amplification. All of these singers were capable of belting it to the back of the room, but the expressiveness and seduction required a more subtle delivery while still being front and center of the composition. The cultural switch did not go easily for some; many saw these crooners as too mushy and antithetical to the music. Of course, the young kids loved how it gave them a new sensuality and how it had them dancing close. It's no wonder it was the sound of a generation at war.
Newscasters and newsreaders are still influenced by the significantly affected vocal delivery of early presenters on the radio. If you have ever listened to early newsreels (like they played in movie theaters after the talkies took over) or recordings of presenters like Walter Winchell, you have heard that voice. It is a voice that relies on a sharp but deliberate delivery, a higher, almost nasal register, and a pronunciation that sounds like a mix of public school British and proper Boston. This is due partly to the era's social ideas on what an authoritative voice should sound like and partly to the limited capability of the early condenser microphones used for radio broadcasting.
The 'voice' itself carried on far past its technical reason for being, which was to let an announcer be understood through the noise of typical radio transmission, especially at the receiver end. The sound became a hallmark of a radio/TV news person, with many taking on the style to show that they were professional broadcasters. You can hear it in how Edward R. Murrow or Walter Cronkite spoke and delivered the news; the affectation is smoother, but the deliberate punchiness remains. The infamous Roger Grimsby of NYC's WABC in the '70s and '80s is a direct descendant and one of the last I can recall who overtly presented in that style. It is worth noting that many of the top-flight national news hosts also employ a modern take on the style, but in keeping with the contemporary casual feel, it is an understated one.
Subtle Singularity
Is the technology we have become so enamored with changing how we speak? Several evolutionary biologists have shown evidence that our dependence on digital communication is changing how we think and how we store and retain memories. Some discussion and newer studies are looking at whether young users of voice-controlled smart devices are doing a version of code-switching or defaulting to the more pronounced pronunciation used to tell Alexa what they want. The research is looking, in particular, at how kids talk to each other during noisy, messy playtime or when they are frustrated in getting a point across.
Is this the step that brings the devotees of John von Neumann's Singularity to mass acceptance?