The popularity of voice bot frameworks built on Natural Language Processing (NLP) and Artificial Intelligence (AI) is on the rise. Even blockchain can no longer compete with Alexa and Google Home for attention. They are in the
press, on radio and TV shows, basically everywhere. Techies and corporate executives were quick to recognize, rightfully, the potential of voice-based user interaction, and many now have innovation teams in place, actively exploring and experimenting
with the underlying technologies.
For example, TD Bank was one of the very early adopters, releasing its Alexa skill back in November 2017 and offering a set of mainly informational voice capabilities, like:
- locating a branch and/or ATM
- obtaining an FX rate and/or stock quote
- choosing the credit card and/or bank account type that best suits a customer’s needs and preferences.
Beyond these informational, brochure-ware type skills, the next innovation frontier could be in the field of sophisticated call center automation and voice-enabled banking services, delivered through voice assistants from the comfort of customers’ homes.
High Level Architecture
Amazon’s Alexa, Google’s Google Home, Microsoft’s Cortana and Apple’s HomePod are the best-known NLP voice assistant devices today. They all follow a fairly similar basic architecture, as shown in the picture below, with:
- a ‘smart’ speaker device, which is just a ‘dummy’ hardware interface to the end user, with a set of microphones and a speaker, for capturing audio commands and playing audio responses back to the user
- a proprietary AI / NLP service layer, provided by the voice assistant device vendor (Amazon, Google, Microsoft or Apple), which: (a) receives the unstructured audio command files from the smart speaker; (b) uses
built-in proprietary AI / NLP capabilities to interpret the audio and construct a structured text utterance; (c) automatically prompts the user for any required information missing from the utterance, such as pre-defined ‘slot’ (i.e. variable) values; (d) attempts
to match the constructed and completed utterance to a pre-defined ‘intent’ or ‘action’ handler; (e) invokes the ‘skill’/’action’ handler with structured JSON commands and handles the JSON responses
- a ‘skill’ / ‘action’ / ‘intent’ fulfillment service layer, either built-in or provided by 3rd-party developers, which is invoked by the NLP service layer with a JSON command as payload, and which returns JSON responses back to the NLP layer
Figure 1 - Generic Smart Speaker Framework Architecture
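To make step (e) concrete, here is a minimal sketch of a fulfillment handler receiving a structured JSON command and returning a JSON response. The shapes loosely follow the Alexa request/response format, but the intent name, slot names and helper function are illustrative assumptions, not any vendor’s complete contract.

```javascript
// Sketch of a fulfillment handler: the NLP layer has already done the hard
// work (speech-to-text, intent matching, slot extraction) and hands over a
// structured JSON envelope. Names here are illustrative, not a real contract.

function handleIntentRequest(requestEnvelope) {
  const intent = requestEnvelope.request.intent;

  if (intent.name === 'GetFxRateIntent') {
    // Slot values arrive pre-extracted by the vendor's NLP layer.
    const currency = intent.slots.currency.value;
    return buildSpeechResponse(`Here is the latest ${currency} exchange rate.`);
  }

  return buildSpeechResponse("Sorry, I didn't understand that request.");
}

// Wrap plain text in a speech-response envelope (Alexa-like shape).
function buildSpeechResponse(text) {
  return {
    version: '1.0',
    response: {
      outputSpeech: { type: 'PlainText', text },
      shouldEndSession: true,
    },
  };
}
```

The key point is that the fulfillment layer never touches audio: it is a plain JSON-in, JSON-out service, which is what makes it deployable as an ordinary web endpoint or cloud function.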
The Challenges Slowing Down Serious Innovation
Although all the mainstream voice bot frameworks look similar at the architectural level (described above), they differ considerably in their basic features, like:
- how ‘skill’ developers must organize and deploy their ‘intent’ or ‘action’ handler code
- what set of tools for voice conversation design, development and testing is available
- how utterances for voice command activation should be syntactically defined and organized
- what is the level of NLP sophistication and flexibility, i.e. the ability to tolerate and handle slight variations from pre-defined utterance constructs
- what is the level of conversational automation in how they handle ‘slots’ (i.e. ‘variables’) in utterances
- what is the device’s ‘machine learning’ ability, i.e. the framework’s ability to self-evolve its utterance-handling sophistication after skill deployment, without the need for developers’ constant and tedious manual utterance updates
- what are the specific JSON command/response message formats
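The last bullet is easy to illustrate: the same spoken reply must be wrapped in a different JSON envelope for each platform. The Alexa shape below follows its documented response format, and the Google shape uses the `fulfillmentText` field from the Dialogflow v2 webhook format; both are simplified sketches rather than complete contracts.

```javascript
// The same reply text, wrapped per platform. Simplified shapes: Alexa-style
// response envelope vs. a Dialogflow v2-style webhook response.

function toAlexaResponse(text) {
  return {
    version: '1.0',
    response: { outputSpeech: { type: 'PlainText', text } },
  };
}

function toDialogflowResponse(text) {
  return { fulfillmentText: text };
}
```

A skill targeting both ecosystems needs this kind of translation everywhere: requests, responses, slot formats and session attributes all differ, even when the conversation being modeled is identical.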
These platform differences significantly impede and limit the ability of corporate innovators to offer voice bot skills that could potentially run on any device. Today, if a corporation wants to enable voice interaction for as many of its customers as possible, who may
be using a variety of smart speakers, ‘skill’ developers have to plan for separate development projects (or teams) to develop, port and test identical skills on each device that needs to be supported. The alternative is to bet on one device and ignore the others.
Either way, that’s potentially very risky, limiting, inefficient and error-prone.
There are 3rd-party attempts like the Jovo Framework that provide, as much as possible, a platform-independent development environment for Alexa and Google Home skills. Jovo seems very
interesting as a framework and is worth playing and experimenting with (yes, we are currently evaluating it). It offers a decent abstraction layer for consolidating:
- portable ‘skill intent’ handler code segments, intent session context state, slots handling
- platform specific behaviour through specific intent handlers for each supported device
- portable intent invocation phrases, using the generic Jovo Language Model, which can later be converted into platform-specific configuration files, deployable in the context of each specific device’s execution environment.
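The abstraction pattern behind the first two bullets can be sketched as follows. To keep this self-contained, the sketch does not use the real Jovo API; it only mimics the idea of one portable handler map, with optional per-platform overrides consulted first so device-specific behaviour stays isolated. All names are hypothetical.

```javascript
// Portable handlers shared by every platform (hypothetical names).
const handlers = {
  FxRateIntent(ctx) {
    return `One US dollar buys ${ctx.slots.amount} of your currency.`;
  },
};

// Platform-specific overrides, consulted before the portable handlers,
// so device-specific phrasing or features stay isolated in one place.
const platformOverrides = {
  alexa: {
    FxRateIntent(ctx) {
      return `Alexa says: one US dollar buys ${ctx.slots.amount}.`;
    },
  },
  googleAssistant: {},
};

// Dispatch: prefer a platform override, fall back to the portable handler.
function dispatch(platform, intentName, ctx) {
  const override = (platformOverrides[platform] || {})[intentName];
  const handler = override || handlers[intentName];
  return handler(ctx);
}
```

The value of this pattern is that the bulk of the conversation logic is written once, and only genuine platform differences are duplicated.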
Jovo is not ideal and may not be the answer, though. Not all devices are supported by it … currently only Alexa and Google Home are (although these are the ones that probably matter most at the moment). Questions also arise about Jovo’s ability
to keep up efficiently with the latest developments of the supported underlying platforms, and about its roadmap plans for including support for Microsoft and Apple devices. But with all of the existing fragmentation, something like Jovo could be your best shot at the
moment, especially if you are looking for as much platform independence as you can get.
Time For Standard Voice Browsers Maybe?
In my opinion, instead of trying to address the current lack of standardization in the voice assistant development space through 3rd-party abstraction layers like Jovo, a better approach could be for device vendors to work together, potentially under the W3C umbrella,
to come up with a standard ‘voice conversation markup language’ (which could be a next generation of the already existing VoiceXML standard, with new additions, upgrades and
contributions from Google, Amazon, Microsoft and Apple). Such a voice conversation markup language would further be supported by a standardized ‘Voice Browser’ execution environment, with a built-in voice conversation markup language parser, content interpreter
and standard conversation navigation manager ... implemented on top of each vendor's proprietary NLP services.
In a nutshell, it would be really great if developers building voice skills could describe:
- Skill’s intent definitions
- Utterances and their mapping to intent handlers,
- Utterance variables or slots
- Conversation flow and voice navigation
- Session handling
in one standard markup document, interpreted by a compliant ‘Voice Browser’ layer. This is not a new pattern, but one that very much emulates the approach that already exists for modern web browsing.
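As a reminder of how close we already are, here is a small fragment of the existing VoiceXML standard, which already covers several of the items above declaratively: a form collects one ‘slot’ (a field), prompts for it if it is missing, and submits the result to a fulfillment endpoint. The grammar file and URL are placeholders.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- VoiceXML 2.1 fragment: declarative slot collection and fulfillment. -->
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="fxRate">
    <field name="currency">
      <prompt>Which currency would you like the rate for?</prompt>
      <grammar src="currencies.grxml" type="application/srgs+xml"/>
    </field>
    <filled>
      <submit next="https://example.com/fx" namelist="currency"/>
    </filled>
  </form>
</vxml>
```

A next-generation standard would need to add what VoiceXML lacks for smart speakers, such as hooks into each vendor’s statistical NLP for free-form utterance matching, but the declarative skeleton is already there.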
A compliant Voice Browser (like web browsers today) would provide a standard, out-of-the-box set of voice conversation navigation commands like START, BACK, REPEAT, STOP, CALL, etc., with developers able to extend that set with value-added, skill-specific commands.
Such a standard voice browser environment and voice conversation markup language would likely enable significantly higher adoption and penetration of voice-enabled services among customers, and far more scalable and reusable code development
by corporate developers. Everybody would benefit.
Let’s hope Amazon, Google, Microsoft and Apple can come together, start working on a next generation of the VoiceXML standard, and support it in their next-generation devices. Voice assistant development would be significantly simplified and in much better
shape than what we have today, with its fragmentation and 'voice assistant battles'.
I feel that even if just one of the major voice assistant vendors takes this route, the developer community would love and embrace it. The others would likely have to follow, as happened in the world of web browser standardization and HTML.