The end of the calendar year always seems like a good time to pause for breath and reflect on what’s been happening over the last 12 months, and that’s as true in the world of commercial NLP as it is in any other domain. In particular, 2019 has been a busy year for voice assistance, thanks to the focus placed on this area by all the major technology players. So, we take this opportunity to review a number of key themes that have defined recent developments in the commercialisation of voice technology.
A version of this article also appears in the Journal of Natural Language Engineering.
For just over a year, I’ve been curating This Week in NLP, a weekly newsletter which highlights the key events and happenings in the world of commercial NLP. On reviewing a wide range of sources over the last year, what I’ve found most striking is the significant proportion of news coverage that focuses on voice assistance. Striking, but not really surprising: the ubiquity of voice assistants means they are the most directly accessible NLP technology for the majority of end users.
In this post, I reflect on what’s made the news over the last year, and draw out what I see as the major themes that have defined developments in voice in 2019.
1. The Battle of the Giants: Amazon Alexa vs Google Assistant
If you’re looking to buy a smart speaker, it’s very likely you’ll be making a choice between an Amazon Echo and a Google Home. Although there are a number of other players in the market — and we’ll get to those further below — at this point in time the main contenders are devices powered by Amazon’s Alexa, which celebrated its fifth birthday in November 2019, and Google’s Assistant, which turned three in October.
Between these two, who’s winning the war for your voice bandwidth and all it might reveal depends on what you count. Apparently more than 100 million Alexa-compatible devices have been sold; Alexa has been integrated with 85,000 smart home products; and the Alexa app store hosts over 100,000 skills. Google Assistant, on the other hand, is said to be on over 1 billion individual devices; it works on over 10,000 smart home products; and, at the beginning of the year, it supported just over 4,000 actions.
Of course, device count isn’t the same as active user count: just about every Android device, and those are mostly phones, has Google Assistant pre-installed, but it’s hard to find numbers for how many people actually use it. Product count isn’t overly helpful either: a lot of those smart home products might be differently branded but otherwise identical light bulbs. And Amazon has made it so easy to develop for Alexa (more on that below) that a fair proportion of those 100k skills are likely to have (at most) one user, being the result of a developer kicking the tyres.
Smart speaker sales numbers might be a better indicator of which virtual assistant is gaining more traction, since the only reason you’d buy a smart speaker is that you actually want to talk to it. Here, for 2019Q3 at least, Amazon is way ahead, with 10.4m sales (that’s a stunning 36.6% of smart speaker sales worldwide), roughly three times Google’s 3.5m. And those numbers for Amazon are up 5% on the same quarter in the previous year, so its lead appears to be increasing.
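As a rough sanity check, the quoted sales and market share figures imply a worldwide total for the quarter, and from that an approximate share for Google. The following is a minimal Python sketch using only the numbers quoted above; the variable names are mine, and the derived values are approximate because the inputs are rounded.

```python
# Figures quoted in the text for 2019Q3 (units in millions; rounded in the source).
amazon_units = 10.4
amazon_share = 0.366
google_units = 3.5

# Implied worldwide smart speaker sales for the quarter.
total_units = amazon_units / amazon_share

# Google's implied share, and the Amazon-to-Google sales ratio.
google_share = google_units / total_units
ratio = amazon_units / google_units

print(f"Implied total: {total_units:.1f}m units")
print(f"Implied Google share: {google_share:.1%}")
print(f"Amazon sold {ratio:.1f}x as many units as Google")
```

On these figures the implied worldwide total is a little over 28m units, which would put Google at about 12% of the market.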
At the end of the day, though, you just want these things to answer your questions. So where do the various voice assistants stand in terms of actual performance? Twice a year, Loup Ventures runs what you might think of as an IQ test for virtual assistants, asking Google, Siri and Alexa 800 questions each on a variety of topics. In the most recent round, all three have improved (substantially, in Alexa’s case), but Google is still on top, answering only 57 questions (7%) incorrectly, whereas Alexa got 162 (20%) wrong.
Perhaps in response to this deficit, Amazon has introduced Alexa Answers, a kind of ‘Audible Quora’ where users can provide answers for questions that leave Alexa stumped; Alexa then delivers these answers preceded by the message that the answer is ‘according to an Amazon customer’. The feature has generated some criticism, particularly in regard to Amazon’s apparent lack of quality control.
2. Other Voice Assistants
Of course, there are a number of other players in the voice assistant space. In fact, for 2019Q3, Chinese manufacturers, about whom relatively little is heard in the West, had a good showing sales-wise: both Alibaba and Baidu narrowly outsold Google in smart speakers, shipping 3.9m and 3.7m units respectively, and Xiaomi just trailed Google at 3.4m units.
Meanwhile, other voice assistants have been struggling. Microsoft’s Cortana has had an uncertain year: introduced in 2014 to compete with Siri as a digital assistant for the now-dead Windows Phone, by the beginning of 2019 it had become a key feature of the Windows 10 interface; then around mid-year it became a separate app in the Windows store; and by year end it was being folded into Outlook and Office 365 as a productivity aid. Microsoft is one of around 30 companies that have signed up to Amazon’s Voice Interoperability Initiative, the aim of which is to allow multiple voice assistants to comfortably exist on the same device. This looks like a clear recognition that Cortana will exist alongside other voice assistants rather than compete with them.
Samsung’s Bixby, first introduced in 2017, has also had a hard time. Despite the company selling over 500 million ‘Bixby-compatible’ devices each year, it struggles to be heard above the chatter created by all the others. This hasn’t stopped Samsung rolling out a number of initiatives in an attempt to gain greater acceptance in the market. Recognising the success of Amazon’s strategy of providing lots of support for third-party developers, this year Samsung has announced Bixby Developer Studio, which lets third-party developers create skills (known as ‘capsules’ in Bixby-speak); a third-party app marketplace where developers can sell those apps; Bixby Views, which lets you build voice apps for visual devices from TVs to watches; and Bixby DevJam, a developer contest for new Bixby capsules, with prizes totalling US$125,000. The company aims to put AI in every device and appliance it makes by 2020.
Siri, the voice assistant that started it all back in 2011, has generated relatively little news in 2019. And where’s Facebook in all of this, you might ask? Around mid-year, Mark Zuckerberg announced that Facebook would soon be launching a number of voice-controlled products, but nothing has appeared yet. The widely-noted product placement of a Facebook Portal in September’s season 11 premiere of Modern Family — ‘Hey Portal, call Dad’ was the second line in the episode — doesn’t really count, since the Portal uses Alexa.
3. Conversational Ability
Inevitably, all this competition has contributed to a steady stream of advances in the technology underlying voice assistants. The big news in 2018 was Google Duplex, a hyper-realistic appointment-making voice dialog app that the company demo’d at Google I/O, and piloted in New York, Atlanta, Phoenix, and San Francisco towards the end of that year.
This year saw the progressive rolling out of that technology: by mid-year, Duplex was available for restaurant bookings in 43 US states, and a New Zealand trial was mooted by the end of the year. In May, The New York Times reported that Duplex calls were often still made by human operators at call centers: around a quarter of calls start with a live human voice, and of the calls that start with machines, 15% require a human to intervene. Some degree of human fallback is a sensible strategy when rolling out a service as ground-breaking as this, but it’s unclear to what extent Duplex has become more self-confident over time.
Duplex’s scary realism provoked concerns that people wouldn’t know whether they were talking to a human or a machine. By mid-2019, California had passed a law requiring chatbots to disclose that they’re not human; Duplex now begins a call by identifying itself as being from Google.
This technology seems to have given Google a real lead, but it looks like the other voice assistants are not far behind. Of all the goodies announced at Amazon’s annual product launch in September, the Doorbell Concierge was the most interesting. A functionality to be provided by the soon-to-be-released Ring Video Doorbell Elite, this uses Alexa to gather information from visitors and then utilises its knowledge base and other services to complete a task on their behalf, in a manner that is very reminiscent of Duplex’s apparent autonomy.
Meanwhile, it’s been suggested that Alibaba’s voice assistant, currently used to negotiate package delivery, is already ahead of Google’s Duplex. And Microsoft aims to use its acquisition of conversational AI technology from Semantic Machines to achieve something similar.
4. Tools for Building Voice Apps
The sophistication of Duplex’s performance makes the average Alexa skill or Google action seem trivial: the complexity of the conversation in the Google Duplex demo is in an entirely different league from what happens when you ask your Google Home what the weather is like. As a small step towards narrowing the gap, all the major vendors have introduced features in their developer platforms that enable extended conversations, making it possible to go beyond simple one-shot question-and-answer dialogs. In 2018, Amazon introduced Follow-up mode, and Google responded with Continued Conversation, features that cause the platforms to listen for additional queries or follow-up questions after an initial exchange, so that you don’t have to keep saying the wake word. Baidu’s DuerOS acquired the same capability this year, and Xiaomi’s version is on the way too.
Taking the next step, Amazon this year introduced Alexa Conversations, a set of tools that lets you build multi-turn dialogues that can interconnect with other Alexa skills. The feature is based on machine learning of what skills are used in close proximity to each other (for example, booking cinema tickets then organising an Uber to get to the cinema), and then automatically invoking the appropriate skill for the next step in the dialog, providing a kind of data-driven mixed-initiative where the application is increasingly able to predict the user’s next requirement.
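To make the idea concrete, here is a toy sketch of predicting the next skill from counts of which skills follow which. It is not how Alexa Conversations is actually implemented; the skill names, the session data and the function names are all hypothetical, and a simple frequency count stands in for whatever machine learning Amazon uses.

```python
from collections import Counter, defaultdict

def train(sessions):
    """Count how often each skill is immediately followed by another."""
    follow_counts = defaultdict(Counter)
    for session in sessions:
        for current, nxt in zip(session, session[1:]):
            follow_counts[current][nxt] += 1
    return follow_counts

def predict_next(follow_counts, current_skill):
    """Return the skill most often used right after current_skill, or None."""
    counts = follow_counts.get(current_skill)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical usage logs: each list is one user's sequence of skills.
sessions = [
    ["cinema_tickets", "ride_hailing"],
    ["cinema_tickets", "ride_hailing", "restaurant_booking"],
    ["cinema_tickets", "restaurant_booking"],
    ["weather", "ride_hailing"],
]

model = train(sessions)
print(predict_next(model, "cinema_tickets"))  # ride_hailing
```

In this made-up data, booking cinema tickets is followed by a ride request twice and a restaurant booking once, so the ride skill is the one that would be offered next.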
On another front, both Google and Amazon announced features that support personalisation in applications: Google’s Personal References remembers data specific to you and your family, and Alexa skill personalization leverages voice profiles so that developers can provide personalized experiences, greetings, and prompts for recognised users.
The year also saw developments oriented towards making it easier to build these increasingly complex applications. Amazon has always seemed to care more about this than Google; from quite early on in Alexa’s development, Amazon realised the importance of making it easy for outside developers to plug their apps into Alexa’s code, and to plug Alexa’s code into all kinds of third-party devices. Taking this approach a step further, this year Amazon introduced new business-focused no-code Alexa Skills Blueprints, which facilitate the development of a variety of common application types. In March, Voicebot reported that, since the launch of the feature a few months earlier, over 25% of the 4,000+ new skills published in the Alexa skills store were Blueprint-based; but it turns out that more than two million Blueprint-based skills had been developed just for private family use.
In related developments, Samsung launched Bixby Templates, and Microsoft released Power Virtual Agents, both no-code tools for creating conversational bots; and Nuance, a long-time player in the voice space, released PathFinder, a tool that builds dialog models using existing conversations as input data. In a virtuous cycle, as more applications are developed on these platforms, it becomes obvious which higher-level abstractions are most useful to support.
5. Speech Synthesis
A number of incremental improvements in speech recognition performance were announced by various vendors throughout the year, but these are not necessarily obvious to the end user. Much more visible (audible?) are improvements in speech synthesis.
During the year, Amazon announced the general availability of its newscaster style in the Amazon Polly text-to-speech service, which provides pretty impressive intonation: it’s worth checking out the sample in this piece at VentureBeat. And in the US at least, you can now ask Alexa to speak slower or faster.
Google’s Cloud Text-to-Speech has also seen improvements: it now supports 187 voices, 95 of which are neural-network-based WaveNet voices, covering 33 languages and dialects.
If you’ve ever been just a bit uncomfortable that the current crop of digital assistants all have female voices, you may be interested to know that a Danish team has created a voice called Q that’s designed to be perceived as neither male nor female; a demo of the genderless voice is available online. Or you could switch to using Cortana, which is acquiring a masculine voice as one of its newest set of features.
Other use cases for speech synthesis have appeared during the year. Amazon has developed technology that mimics shifts in tempo, pitch, and volume from a given source voice. Facebook’s MelNet closely mimics the voices of famous people; you can check out its rendering of Bill Gates at VentureBeat. In fact, Amazon has decided that celebrity voices are going to be a thing. First cab off the rank is Samuel L. Jackson: embodying his voice in your Echo device will cost you $4.99, complete with profanities.
VocaliD and Northeastern University in Boston have teamed up to preserve and re-create the voices of people who face losing their ability to speak: you can have a voice built for US$1,499. Thanks to Descript, now you can deep-fake your own voice on your desktop: this is intended as an aid to podcast editing, but no doubt we’ll see some other interesting uses of this technology. Indeed, 2019 brought us the first noted instance of a voice deep-fake being used in a scam, with the CEO of a UK company being persuaded to transfer approximately US$243,000 to the bank account of a fraudster who sounded just like his boss.
6. Privacy
For AI applications in general, 2019 was the year in which a number of ethics issues came to the forefront. For voice, the big issue of the year was data privacy.
There’s been a concern about listening devices ever since we first invited them into our homes, but our early fears were allayed by vendor insistence that our smart speakers only ever started listening once the appropriate wake word was uttered.
That there might be other concerns was first hinted at towards the end of 2016, when the Arkansas police asked Amazon to hand over Echo voice recordings that might provide evidence in a murder case. And then in 2018, Alexa recorded a couple’s conversation in their home and sent it to a random person from their contact list. How could this happen? According to Amazon:
Echo woke up due to a word in background conversation sounding like ‘Alexa.’ Then, the subsequent conversation was heard as a ‘send message’ request. At which point, Alexa said out loud ‘To whom?’ At which point, the background conversation was interpreted as a name in the customers contact list. Alexa then asked out loud, ‘[contact name], right?’ Alexa then interpreted background conversation as ‘right’.
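Amazon’s account describes a chain of individually plausible misrecognitions. A toy state machine makes the sequence easier to follow; this is purely illustrative, with hypothetical state names and matching rules, and says nothing about how Alexa is actually built.

```python
def step(state, heard):
    """Advance a toy 'send message' dialog given what the device thinks it heard."""
    if state == "idle" and heard == "alexa":
        return "awake"
    if state == "awake" and heard == "send message":
        return "awaiting_recipient"  # device asks 'To whom?'
    if state == "awaiting_recipient" and heard.startswith("contact:"):
        return "awaiting_confirmation"  # device asks '[contact name], right?'
    if state == "awaiting_confirmation" and heard == "right":
        return "message_sent"
    return "idle"  # anything else abandons the request

# What the device (mis)heard in background conversation, per Amazon's account.
misheard = ["alexa", "send message", "contact:someone", "right"]

state = "idle"
for heard in misheard:
    state = step(state, heard)
print(state)  # message_sent
```

In this sketch, any one of the four inputs being heard differently would have dropped the dialog back to idle, which is why the incident needed four misrecognitions in a row.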
Fast forward to April 2019, when Bloomberg reported that Amazon was using third-party contractors to transcribe and annotate voice recordings. That humans transcribe voice data for training and evaluation purposes won’t come as a surprise to anyone who knows anything about the industry, but it caused quite a stir in the mainstream press.
Very quickly every voice platform was in the spotlight: Apple, Google, Microsoft, Samsung and, thanks to voice recording in Messenger, Facebook.
The general tone of the reporting suggested there were two regards in which the conduct was considered egregious: not only were humans listening to conversations that users thought they were having privately with their voice assistants, but the human listeners in question were not even direct employees of the companies concerned, so who knows where your data might end up …
The platform vendors quickly went into damage limitation mode in response to these events, assuring users that they could opt out of data collection, and that they could delete data that had been collected.
Amazon even added a feature that allows users to delete their voice recordings by saying ‘Alexa, delete what I just said’ or ‘Alexa, delete everything I said today’.
Apple’s Tim Cook gave the commencement address at Stanford in June, in which he emphasised strongly the need for data privacy — a not-so-subtle reminder that, unlike Google and Amazon, Apple doesn’t have an independent rationale for collecting your data.
The lesson from all of this is clear: transparency is important, so that users have a clear understanding of how and when their data is being used.
One response to concerns about privacy violations arising from your utterances being spirited up to the cloud is on-device processing. There are other good reasons for edge computing when it comes to voice processing, but developers of small-footprint standalone voice processors like Sensory and Picovoice were quick to emphasise the data privacy benefits of keeping it local.
Google also announced a new faster version of Google Assistant, thanks in part to an on-device language model, although the announcement doesn’t acknowledge privacy as a benefit.
On a side note, the ‘wake word defence’ may have its limits: as it happens, Amazon has a patent that would allow your Echo to make use of what you say before the wake word, rather than just after it, although the company stresses that the tech is not currently in use. And in July, a Belgian news organisation reported listening to over a thousand leaked Google Assistant recordings, of which around 150 were not activated by the wake word.
7. Voice Ubiquity
So it’s been quite a year, and it seems like voice is almost everywhere. Just to reinforce that thought, here are 12 new things you can do using your voice that appeared during 2019.
(1) Talk to more machines in your kitchen: Gourmia introduced an air fryer, a crock pot, and a coffee maker that can be controlled by Google Assistant and Amazon’s Alexa; Instant Brands announced that its Instant Pot Smart WiFi pressure cooker supports Google Assistant; and of course there’s the new Amazon Smart Oven.
(2) Or, if you can’t be bothered cooking, talk to the drive-thru: you can already order breakfast by voice at Good Times Burger & Frozen Custard in Denver, and soon you’ll be able to do the same at McDonald’s, which has acquired Apprente, a Bay Area voice-recognition startup, with the aim of using it to take orders at drive-through windows.
(3) While you’re there, you can now apply for that job flipping burgers at McDonald’s via Alexa or Google Assistant.
(4) If you’re in China, order deliveries from Starbucks through Alibaba’s smart speaker.
(5) Once you’ve got your meal, settle down to talk to more inhabitants of your smart home: use Alexa to control your Roku devices, talk to your LG Smart TV, or talk to your ceiling fan or IKEA’s smart blinds.
(6) Rather listen to music? Talk to Pandora, which has added a voice assistant to its iOS app, and is testing interactive voice ads you can talk to between music tracks.
(7) Talk water: to your shower, your smart sprinklers, or your irrigation systems.
(8) Plan a holiday: access TripIt travel planning by talking to Google Assistant or Alexa; or get EasyJet flight information by voice.
(9) Get information when you’re on holiday, by talking to Angie Hospitality, the virtual assistant for hotel guest rooms, or MSC Cruises’ ‘personal cruise assistant’.
(10) Or just stay at home and play a game: Monopoly now offers a voice-controlled banker; HQ Trivia, the viral trivia app, is now available via Google Assistant; and Sony has released Alexa skills for Jeopardy! and Who Wants to Be a Millionaire.
(11) Bored with games? Experience the joy of hands-free betting with BetConstruct’s Hoory voice assistant.
(12) For higher stakes gambling, ask Alexa to make a donation to your favourite US presidential candidate.
Isn’t life just so much better than it used to be?
If you enjoyed reading this post, sign up for This Week in NLP for an update every Friday on what’s happening in the world of commercial natural language processing.