A rough guide to navigating the voice landscape




To borrow the parlance of the modern-day tech writer, voice technology appears to have fully ‘emerged’ – as a nice-to-have, flawed piece of technology.

Blame the sudden convergence of smart speakers on a hungry consumer electronics market. Blame confusion over the factors that separate ‘artificial intelligence’ from machine learning models. Blame a long list of contributing factors (suggestions including, but not limited to: thought leaders, hardware producers, everyone who’s given a TEDx talk on emerging tech in the 2010s)… and you’d be missing the point.

The marketing landscape has failed voice technology, and we have reacted by accusing it of failing us. The circumstances are eerily reminiscent of a modern-day show business tragedy: young actor bursts onto the scene with an inspired performance in an independent film. Actor subsequently gets snapped up for a series of lucrative blockbusters but fails to impress audiences. Becomes blacklisted by directors and is relegated to the DVD bargain bin… you get the idea.

In dismissing the prospect of developing a viable approach to voice technology, we inaccurately appraise its shortcomings, and perpetuate its misuse. This whitepaper has been written to address some of these shortcomings, and provide a brand-agnostic guide on how best to engage with consumers through a screenless medium. Within this, we’ll dive into the history of voice and some statistical insights around voice adoption, before outlining the immovable pillars that can and should act as a framework for ideation and execution.

Our aim? To make voice great (or at the very least viable) again – courtesy of an honest take on our current relationship with digital assistants/smart speakers, and the factors that must be taken into consideration when brands look to develop a screenless marketing strategy.

Voice, for reasons that will become clear, has the potential to transform the way brands communicate with their audiences both now and in the future. Read on for a closer look at ‘how’ and ‘why’.




The foundations of voice were built a long time ago. The road towards voice-activated software started with Audrey, a speech recognition program developed by Bell Labs, and continued with ELIZA – an early natural language processing program created between 1964 and 1966 by MIT’s Joseph Weizenbaum. ELIZA was capable of simulating conversation using pattern matching, and frequently amazed test subjects (to Weizenbaum’s despair) with her ability to deliver nuanced, ‘intelligent’ responses to their questions.

If you think this sounds a little too good to be true for a computer program built in the 1960s, you’d be correct. Whilst ELIZA was designed with incremental improvement in mind, new capabilities had to be added manually as direct edits to her script. It would be a full 15 years before the public became aware of personal computers, and almost 30 years before natural language processing systems were introduced (with varying degrees of success) to the likes of Ask.com and Microsoft’s Clippit. Say what you like about her interface, but ELIZA was most definitely ahead of the curve.

Examining the origins of the technologies that drive voice synthesis and speech-to-text recognition reveals a legacy of innovation that goes back centuries. Blaise Pascal first sowed the seeds of what would become machine learning with a mechanical calculator he modestly called the Pascaline back in 1643. His countryman Joseph Marie Jacquard (born in 1752) picked up where Pascal left off, using punched cards to program the looms that produced his trademark patterning in the early 1800s.

The cycle of human ingenuity continued to push data innovation forward at a steady speed for centuries. Then, the third industrial revolution came along – exponentially increasing levels of data production. Even the most sweeping technological innovations have not been enough to match our current output and, as we edge closer to the advent of Industry 4.0, it can feel like some tech categories are dropping off the pace.

Google and Amazon have undoubtedly moved the needle for voice-activated technology, improving speech recognition models to near-human levels of accuracy, but both software giants have flattered to deceive when it comes to true innovation. Scripting a bot/skill using Dialogflow or Alexa Skill Builder feels simple. The results are straightforward, but… fairly linear. In a bid to be first-to-market, marketers have failed to recognise what their consumers really need from screenless technology.




To understand the divergence of opinion that exists within the topic of voice, we must find a suitably diverse region to examine. A region where vertical screens reign supreme, and one where smartphone penetration and Internet speeds provide an ideal environment to test and adopt screenless technology.

We are of course talking about Asia Pacific (APAC). Home to over 2 billion mobile subscribers (with over a billion in China alone) and 3 of the top 10 ranked countries for Internet usage, APAC is both smartphone addicted (with a total of 2.8 bn mobile subscriptions across the region) and, in places, mobile averse (the Japanese, for example, still spend more time consuming traditional media than digital media). Residents of South Korea enjoy the fastest broadband speeds on the planet (111.00 Mbps as of August 2019), whilst their northern neighbours are rarely allowed access to the Internet. Two thirds of Chinese smartphone users make regular payments through their mobile device, but full Internet penetration sits at just 58% in Southeast Asia.

Different stages of development make APAC an ideal test subject to put under the microscope, and the cultural nuances present make this a truly fascinating exercise. A 2018 study on key APAC markets stated that 62% of surveyed smartphone users had used voice technology in the last 6 months (and 54% in the last month). The research, conducted by idstats on behalf of global performance agency iProspect, went on to group countries as conservative and dynamic growth markets. India (82% adoption), China (77%) and, a little further back, Indonesia (62%) emerged as market leaders – whilst markets like Japan, Singapore and Australia elicited a far more sceptical response to questions around voice usage.

The correlation between conservative growth and economic maturity is hard to ignore. Early access to technology gives users the chance to cultivate technological habits, many of which become deeply embedded in the processes used to exchange and disseminate information. Disrupting a well-established system built around screens, typing and other visual stimuli will take some time. On a global level, the smart speaker market is buoyant – with US$7.2 billion in sales across 2018 – but consumers seem unwilling to let screenless technology take a more prominent position within their routines.

A combination of social discomfort and concerns over privacy factors heavily into this reluctance among APAC technology users. Screens allow for confidential interactions in a public environment. Moving these exchanges into a hands-free scenario has the potential to cause embarrassment within conservative societies (58% of surveyed users in Japan cited ‘embarrassment’ as a factor behind their decision not to use voice technology) and raise questions around data security.

These misgivings are not shared by voice technology users in more dynamic growth markets. China, home to Baidu (aka the world’s second largest producer of smart speakers), ‘gets it’. 42% of Chinese respondents to iProspect’s survey identified as daily users, and over 68% of users noted that their usage of voice technology had increased over the last 6 months.

The reason behind this uptake? China’s legacy of voice recognition software. Choosing from thousands of characters does not lend itself to fast typing or text-friendly abbreviations (unlike languages written in the Latin alphabet), which meant that speech recognition software became an instant hit upon release. Now, the latest iterations from Baidu (which controls over 75% of the search market in China) translate 3 times faster than text entry, and have shown signs of rapid improvement thanks to sustained use.

Things get even more positive for voice in India, where the technology inhabits a position between ‘this season’s must-have fashion accessory’ and ‘indispensable time saver’. 78% of surveyed users reported increased usage over the last 6 months, with convenience and multitasking cited as key criteria for adoption. ‘Being part of the tech revolution’ and ‘makes me cool’ also featured heavily in the responses. Add to this a booming e-commerce market (powered by local unicorn Flipkart and a slew of regional/global competitors), and India’s honeymoon phase with voice technology looks set to continue.

APAC’s complex relationship with voice technology uncovers several key points for marketers to take into consideration.

  1. Emerging markets have fewer barriers to entry for voice technology. A generation that migrated straight to mobile will intuitively understand screenless technology better. Test and learn with an audience that’s receptive (and willing to overlook small errors) in pursuit of a blueprint that can be adapted and brought to market in more mature regions.
  2. Mature tech economies have more exacting standards for new technology (and have deeper relationships with existing interfaces). Conservative societies are less likely to endure public embarrassment to change up an established tech routine. Building trust within these economies will take time.
  3. Focusing in on a specific subset of a problem area (that could be improved by a screenless interface) in order to achieve the equity required for more extensive application is crucial. Choose your territories wisely and select your ideas carefully. Reducing friction within an already well-oiled process is not an impactful way to allocate and expend resources.

Salesforce’s research into attitudes around the applications of AI in Asia lends further credence to the talking points brought up in iProspect’s research. In short, the outlook for new technologies looks positive. 63% of respondents believed that AI will improve the world and their lives in the future, though trust levels were still low when it came to traditionally interpersonal interactions/exchanges. For example, 90% of Singaporean consumers indicated they would trust a human over a chatbot and/or robo-advisor when it came to financial services and wealth management. That said, respondents in Indonesia and the Philippines from the same survey indicated they held both in equal regard – perhaps giving an indication as to the regions and verticals voice marketers should target in order to create positive impact.




Whilst it’s clear where the opportunity lies for brands geographically, the efforts of technology companies reveal contrasting opinions on how best to leverage voice. Google has doubled down on education with Bolo, a voice-activated application designed to promote literacy among young children. Amazon has banked on continued success within English speaking markets, enlisting Samuel L. Jackson to voice Alexa and cement Echo’s position as the world’s best selling smart speaker. Baidu’s highly impressive Deep Voice AI system has so far been used to improve translation services/voice to machine interfaces, but only within its native China. All seem content to build from positions of strength, without taking into account the white space that sits between their existing hardware and products.

Brands have similar challenges when it comes to positioning voice within the marketing mix and targeting the right audiences. To date, brands have used voice applications to help consumers:

The above use-cases are symbolic of the challenges faced by voice marketers in markets with the resources to invest in emerging technology. Uncertainty and unfamiliarity have led to scepticism. Marketers have responded to this sentiment with use-cases that are designed to ‘cut through the static’ in a playful manner, but prioritising novelty factor over purpose feels too much like the tech equivalent of a single use coffee cup. Useful when full, useless once drained of its contents.




Paving the way for purposeful screenless interactions requires a different set of pillars to inform the process. Planning for a successful voice application launch should therefore take into account the following 3 key factors:

  1. Singular Focus. Marketers must home in on their customer requirements to work out exactly where they can make the most impact. Doing one task really well (as opposed to creating a one-size-fits-all product) results in a use-case that builds a framework for success – and instills trust in what can become (with time) a highly receptive audience.
  2. Repetition of Use. The only way to build a true sense of advocacy is to engage your consumers frequently. Voice applications and skills should pertain to tasks that are embedded in consumers’ routines, or functions that enable them to do more with time spent between screens. A one-trick pony, however amusing or headline grabbing, is not a sustainable investment for marketers looking to win market share.
  3. Make It Authentic. Voice marketing is more than the sum of its technical parts. The tone of the commands and the voice in which they are articulated need to be authentic to your brand personality. Building a shared understanding of how an application should communicate is critical before pushing ‘go’ on the prototyping process. Get the branding right, and the build will follow…




Brands unprepared to test and learn with their voice strategy in today’s market conditions are failing to see the bigger picture. Whilst current applications may fail to penetrate the screen bias of some adults, younger users in developed markets (young adults, teenagers and children) are embracing screenless technology with gusto.

The proliferation of smart speakers within homes is a key contributing factor to this surge. Nielsen’s research into smart speaker penetration indicates widespread usage – 4 out of 10 surveyed households had more than one smart speaker, with 45% planning to add to this number in the future. A survey conducted by Common Sense highlighted that 9 out of 10 American parents with children aged 2-8 said they (or someone in their family) regularly interact with smart speakers at the family home. 58% of respondents specifically indicated that their children interacted with them, and a further 38% indicated that these interactions happened a couple of times a day.

SuperAwesome’s survey of American smart speaker users dives a little deeper into usage stats. Their data indicates that 91% of children aged 2-11 have access to voice-activated technology. 26% of these children engage with smart speakers 2-4 hours every week, and a further 20% spend 5 hours or more using voice technology during the same time period.

Hardware providers have picked up on this trend on a global level. Initially built for the Indian market, the Ojoy A1 smartwatch is marketed across APAC as a child-friendly alternative to smart devices, and comes equipped with a GPS tracker and always-on parental control mode. Like the Verizon GizmoWatch, it is designed to reduce children’s screen time and is fully voice activated. Though less portable, the iPal robot has become popular in its native China as a class assistant and senior citizen companion. Voice-activated technology allows it to tell jokes, take roll call and answer questions in two different languages.

The statistical insights are clear. Children in the US engage with voice-activated technology free from the embarrassment caused by social conditioning. They’re establishing technological habits that have already bypassed screens – building relationships with screenless technology that transcend utility, subconsciously accepting smart speakers and digital assistants into their everyday lives. Whilst cultural nuances associated with voice adoption will undoubtedly play a part in dictating how and when children within APAC adopt screenless technology, hardware providers in the region are already preparing for the shift.

Gen Z had access to smartphones and tablets before they started high school (the average age of smartphone acquisition according to Google is 12 years old). Millennials may have been cast as mobile pioneers, but today’s teenagers are true mobile natives – for them, content consumption and connectivity begin and end on a vertical screen.

We’ve yet to assign a letter to the generation of kids growing up as screenless natives, but their story looks set to be even more interesting than that of their big-spending older brothers and sisters (teenagers have a purchasing power of $44bn in America alone; they must be delighted at the prospect of adults finally taking them seriously). And, whilst hardware providers still have questions to answer around data collection and parental controls, it’s clear that whoever makes the most effort to understand their technological habits will have the best chance of capturing share of voice and future-proofing their commerce strategy.





We owe it to ourselves to explore the potential of voice-activated outcomes with more focus and application. This starts by dropping the ‘emerging’ prefix when it comes to screenless interactions, and continues with a commitment to improving the ways in which voice-activated technology is applied. To paraphrase a quote by SpokenLayer’s Will Mayo from a previous IAB US report, the KPI for success is simply being there to begin with. Targeting the very young (with products designed to educate and inform), the elderly (with applications to improve access to services or improve testing) and enterprises (with assistants to improve the efficiency of drug testing) could be the most ethical way to begin the journey. The final destination for this adventure is largely down to the appetite of marketers to be future focused and boldly go where no brand has gone before.


IAB SEA+India New Media Working Group