Voice Search Conference Summary
Bill Meisel, Publisher & Editor
Reprinted with permission from Speech Strategy News April 2008
The inaugural Voice Search Conference was held in San Diego, March 10-12. The conference was created by its co-organizers, the Applied Voice Input Output Society and your editor (Bill Meisel). This editorial summarizes the categories covered by the conference, and why they fit a broad definition of “voice search.” The conference was motivated by the following observations:
- The capabilities of speech recognition technology now support application designs that focus on satisfying the user, rather than designs that make major compromises to compensate for limitations of the technology;
- The core benefit of speech recognition when those technology constraints are reduced is shortening a user interaction, most simply put as “Just say what you want and get it” or “Use text to find what you want in a speech file”;
- The growing use of mobile devices with screens allows multimodal interaction in many cases: the results of speech searches can be delivered as text or graphics, and text input can be used as a fallback when speech cannot be used. Thus, some speech applications will be less “speech-centric.”
“Voice Search” seemed to us to summarize these observations. The idea received strong industry support, with primary sponsors Call Genie, IBM, Nuance Communications, Vlingo Corporation, and VoiceBox Technologies; supporting sponsors BBN Technologies, Convergys, Genesys, Loquendo, Nexidia, Novauris, and West Interactive; and broad participation by respected industry experts in delivering conference content.
Defining Voice Search
Not everyone agreed with the very broad interpretation of “Voice Search” used by the conference organizers, and some discussion on panels (plus comments during Q&A sessions) revolved around how the term should be defined. Because of the implied analogy to Web search, the most obvious interpretation fit mobility applications on wireless phones where a spoken utterance was used to search a long list, including directory assistance services and specifying a location. A common comment was that speech technology doesn’t stand alone, other than in obvious cases where its use could improve safety (while driving, in particular). Applications should take advantage of other modalities where available and appropriate, for example, to deliver long replies as text.
Voice Search is intended as a unifying concept behind a number of application categories. The following sections of this article discuss major areas addressed at the conference.
Audio response
Text-to-speech technology has advanced along with speech recognition technology. It is now clearly intelligible, reaching its own “tipping point.” The latest advances go beyond intelligibility to “naturalness,” one of those concepts that is hard to define, but which we recognize when it is there. Text-to-speech can deliver the results of a voice search or create dynamic responses to verify a voice search. These are critical capabilities that could not be addressed by recorded speech. In addition, text-to-speech often stands on its own, for example, by delivering email as a voice message.
Directory assistance and voice portals
In wireless phone applications, directory assistance seems to be the primary point of entry for major players, with the focus on business listings by name or category and the opportunities they offer for focused advertising. The success of some smaller pioneers such as Call Genie and Jingle Networks in this business has encouraged entry by larger companies such as Google, Microsoft, and AT&T (the larger companies have run only limited trials so far). Some of the services offer more than classical directory assistance, including search by category and other information, such as weather or news. Some can provide movie information and sell you a ticket. (We tend to talk about ads as if they were the objective of some of these services, but creating revenue by closing a sale is even better.)
In addition, there are specialized services, such as those that provide driving directions. A clear trend suggested by conference talks is the expansion of the directory assistance services into “voice portals” with multiple services accessed through a single, familiar number. (The term “voice portal” wasn’t popular because of past abuse of the term, but the concept is still useful.)
Long-list search
Directory assistance is one example of searching a long list—of business names in the typical case. But it exemplifies a more general case where speech recognition is particularly advantageous. In many cases, the user desires to find an item on a list too long to say in a menu or even display effectively on a screen. Other than business names, examples include addresses, contact lists, employee lists, music selections, ringtones, and retail items such as books. In these cases, the user knows the list (or at least their items of interest). Speech recognition can provide access to these hidden lists, and make access to them more flexible by including variations on the way items are spoken (for example, including nicknames in contact lists or “halo” areas when a city is specified to include nearby cities). Another advantage is that the user does not need to be able to spell the entry correctly, as they might with text entry.
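The idea of accepting variations on the way list items are spoken can be illustrated with a minimal sketch. It assumes the recognizer has already produced text; the contact names, nicknames, halo table, and function names below are all hypothetical and are not taken from any vendor's system.

```python
# Hypothetical data: canonical contact names with spoken variants (nicknames).
CONTACT_VARIANTS = {
    "Robert Smith": ["robert smith", "bob smith", "bobby smith"],
    "Elizabeth Jones": ["elizabeth jones", "liz jones", "beth jones"],
}

# Hypothetical "halo" table: a spoken city also covers nearby cities.
CITY_HALO = {
    "san diego": ["san diego", "la jolla", "chula vista"],
}


def build_variant_index(variants):
    """Map every spoken variant to its canonical list entry."""
    index = {}
    for canonical, spoken_forms in variants.items():
        for form in spoken_forms:
            index[form.lower()] = canonical
    return index


def lookup_contact(recognized_text, index):
    """Return the canonical entry for a recognized utterance, or None."""
    return index.get(recognized_text.strip().lower())


def cities_to_search(recognized_city, halo=CITY_HALO):
    """Expand a spoken city to its halo; fall back to the city alone."""
    city = recognized_city.strip().lower()
    return halo.get(city, [city])
```

In this sketch, saying "Bob Smith" resolves to the entry "Robert Smith," and a search for San Diego is widened to the nearby cities listed in the halo table. A deployed system would more likely build such variants into the recognition grammar itself than look them up afterward.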
Unifying the mobile user experience
Several companies at the conference are developing or offering solutions that make it easier to use more services and features on mobile phones or in automobiles. Wireless service providers have been trying to increase revenues by encouraging increased use of the phone as a standard text-based Web browser, with mixed (mostly poor) results. Part of the problem is the device itself, with a small screen and difficulty in entering text, making the usual visually oriented web experience less satisfying. In addition, with alarming statistics on the risks of using mobile phones (and particularly the risk of entering text) while driving, an alternative to visual interfaces is almost mandatory.
Beyond the obvious limitations of the devices, vendors of unifying voice interfaces for mobile phones or automobile systems emphasize “feature creep,” the proliferation of what can be done on the devices themselves and with network-based services. Although approaches to achieving a unifying experience differ among vendors, they all endeavor to make it possible for the user to get what they want without knowing specific commands or navigating long menus—the core power of a well-executed speech interface. Part of the solution to the problem in some cases is learning preferences of each user to shorten the interaction.
A typical point made by vendors is that speech can provide what appears to the user as a single interface to diverse applications. The appropriate use of multimodality is also a typical theme.
Audio search and speech analytics
Accessing the content of recorded phone calls in contact centers, voice mail, and audio/video content on the web is not an easy task for speech technology because of unconstrained and often less-than-fluent speech, sometimes with significant background noise. The saving grace is that one can determine the gist of such speech without necessarily requiring a high level of word accuracy. If, for example, a caller to a company contact center is angry, they are likely to signal that anger in more than one word. One can then detect a problem call accurately despite perhaps misrecognizing an expletive or two.
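The point that gist survives word errors can be shown with a simple sketch: flag a call when several cue words appear in an imperfect transcript, so that losing one of them to misrecognition does not change the outcome. The cue list, threshold, and function names are assumptions made for illustration; commercial speech analytics systems are considerably more sophisticated.

```python
# Hypothetical cue words; a real system would learn or curate these.
ANGER_CUES = {"angry", "ridiculous", "unacceptable", "cancel", "complaint", "furious"}


def count_cues(transcript, cues=ANGER_CUES):
    """Count transcript words that match a cue, ignoring case and punctuation."""
    words = (w.strip(".,!?\"'").lower() for w in transcript.split())
    return sum(1 for w in words if w in cues)


def is_problem_call(transcript, min_hits=2):
    """Flag a call when at least min_hits cue words survive recognition."""
    return count_cues(transcript) >= min_hits
```

Here a transcript in which "furious" was misrecognized but "ridiculous" and "unacceptable" came through would still be flagged, while a single routine use of "cancel" would not.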
Speech analytics in call centers can be used for conventional monitoring that goes beyond the very limited sampling often done by call center managers. It can give a more accurate statistical picture of what callers are doing and how well calls are being handled; and can allow finding specific calls that reveal issues in design of an IVR system or in agent training. A number of vendors offer increasingly sophisticated analytic systems.
As audio/visual sources on the web proliferate (e.g., YouTube), there is increased need to search the content and not just the metadata. Think, for example, of a company that wants to know what is being said about it in blogs, newscasts, or consumer-deployed videos. This is a difficult task, but several companies at the conference were attempting to address the need.
Contact center automation
What does voice search have to do with call centers? One could ask a related question by analogy: What does Web search have to do with individual web sites? The answer for Web sites is that users now expect a complex web site to have the same search box as they used to find the site. The answer for call centers may be that users will come to expect to be able to “just say what they want and get it” when they find this works for directory assistance and other mobile services. Beyond just satisfying expectations, when a caller finds a voice search interface at a call center, both they and the company benefit from the quick interaction. The overly structured interaction usually found is in part a historical anomaly created by touch-tone interfaces and past limitations of speech technology.
Customers will be exposed to voice search interfaces in mobile services such as directory assistance and driving directions. They will view these services fondly, as time- and money-saving alternatives. Finding an equivalent experience at a call center may avoid the typical response of fighting the system to be connected to an agent.
In addition, ad-supported telephone services will almost always include an option to be connected to the advertiser. This will create a high-volume application that call centers must address, and which may require speech automation to be economically feasible. Such calls must be answered promptly, lest the caller’s impulse to buy evaporate.
Many existing call-center technologies support the voice search paradigm. One is “natural-language” call steering. After a general prompt with examples of what can be said, the caller just states his or her problem or objective. The semantic model within the call steering software has learned from examples what should be done with the call, and handles it appropriately with another level of automation or by transfer to an agent with the proper skills.
There are other techniques that reduce navigation that are offered by vendors today and which are in use by some companies. Some of those mentioned at the conference include:
- Good dialog design practices: Just do it better and avoid designing the same call flow one used with touch-tone navigation. Make full use of the capabilities of today’s speech technology.
- Robust parsing: A technology variant that is a simpler alternative to natural-language technology in some applications, and is available from major technology vendors. Robust parsing can ignore parts of utterances that aren’t relevant without resorting to “please repeat.”
- Personalization and other data-driven dynamic interaction: An interaction can be shortened by taking advantage of knowledge of the caller or external circumstances. For example, American Airlines’ system automatically indicates the flight status when a caller calls within 23 hours of a reserved flight and allows users to jump among tasks without waiting for a prompt.
- Shortcuts: Don’t force an overly structured “main-menu” interaction. Build in the ability for a caller to interrupt a prompt and jump to a particular service.
- Specialized, but common, cases: Allow typical ways that callers try to shorten an interaction. For example, “one-step correction” allows a caller to correct an erroneous confirmation by saying “no” and the correct instruction in the same utterance.
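Two of the techniques listed above, robust parsing and one-step correction, can be sketched in a few lines. The sketch works on recognized text, uses simple phrase spotting as a stand-in for a vendor's parser, and assumes a hypothetical banking-style set of intents; none of the names come from a real product.

```python
import re

# Hypothetical phrase-to-intent table for a banking-style application.
INTENT_PHRASES = {
    "account balance": "balance",
    "balance": "balance",
    "transfer": "transfer",
    "lost card": "lost_card",
}


def robust_parse(utterance, phrases=INTENT_PHRASES):
    """Return the intent of the longest known phrase found, ignoring other words."""
    text = re.sub(r"[^a-z\s]", " ", utterance.lower())
    text = " ".join(text.split())
    best = None
    for phrase in phrases:
        if re.search(r"\b" + re.escape(phrase) + r"\b", text):
            if best is None or len(phrase) > len(best):
                best = phrase
    return phrases[best] if best else None


def one_step_correction(reply, phrases=INTENT_PHRASES):
    """Interpret a reply to a confirmation prompt.

    Returns ("confirmed", None), ("corrected", intent), ("rejected", None),
    or ("unclear", None) when the reply starts with neither yes nor no.
    """
    text = reply.strip().lower()
    if re.match(r"^(yes|yeah|correct)\b", text):
        return ("confirmed", None)
    if re.match(r"^no\b", text):
        # Parse whatever follows "no" as the caller's corrected request.
        intent = robust_parse(text[2:], phrases)
        return ("corrected", intent) if intent else ("rejected", None)
    return ("unclear", None)
```

With this sketch, "um, I'd like to, uh, check my account balance please" yields the balance intent without a "please repeat," and the reply "No, I want to transfer money" is treated as a rejection and a new instruction in a single turn.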
Unified Communications
“Unified Communications” (UC) seems to be a sponge with ambitions to absorb all enterprise communications under one “brand.” It includes telephone communications, voice mail, email, instant messaging, video, and more. It even aims, in some cases, to include IVR systems. [See, for example, the Aspect/Microsoft partnership, p. 1.] The IDC estimate of revenues of $17 billion in 2011 for the segment illustrates this all-inclusive philosophy, perhaps reflecting the sensible point of view that telephone applications shouldn’t be put into silos.
The downside of this unifying approach is that, fully accepted, it implies the daunting challenge of replacing all the communications in the enterprise—even those that seem to work just fine, thank you—with new platforms. And the concepts of UC go beyond the enterprise, extending to the enhancement of subscription-based network services.
Where do UC and Voice Search overlap? Voice interfaces in auto attendants are an obvious case—just say who you want and be connected. Dialing or setting up conference calls by name is another long-list application. Converting voicemail to text for search and storage is an increasingly popular application, although not currently on the feature list of the major UC vendors.
A more fundamental view is the use of a voice interface to manage the many features of UC. The more functions one bundles into a single system, the more challenging the user interface. Speech has the potential to become a “communications assistant” that lets users just say what they want (“If John Doe is available, add him to this call”). No vendor has fully embraced this model, perhaps because of past technology limitations, but its time has come. Microsoft includes its Speech Server as a built-in part of the Office Communications Server, and Avaya discussed its Speech Access option at the Voice Search Conference. The latter allows commands such as “Read my messages,” “Call the sender,” “Find free time tomorrow,” “Give me a wake up call,” “Read my urgent messages from my boss,” and “Connect all calls,” according to the presentation. Certainly this fits the paradigm of “just say what you want and get it.”
Unifying communications versus Unified Communications
But unifying communications doesn’t have to be fully comprehensive. In particular, many network-based services just attempt to solve particular communications problems. One important example is the delivery of voice mail as text, allowing it to be reviewed quickly and allowing archives of voice mail to be searched as text. Today’s systems use a combination of speech technology and human transcription, reflecting the difficulty of transcribing unconstrained telephone speech. This increasingly popular service was addressed at the conference and is exemplified by a number of articles in this issue.
If communicating with oneself is communication, then services which allow the equivalent of voice notes (transcribed to text and often with categorization of the note) are another application in this category. Some services try to rise to the level of personal assistant, allowing commands such as “Create an appointment on Friday at 2 PM with Dan Smith” to create an entry in a calendar application. The utility of such applications, some of which use human agents to do the speech recognition, is clear if they are well designed.
So…
To return to the introductory theme, Voice Search simply summarizes a new era in the use of speech technology. Speech technology has passed a “tipping point.” It’s not the computer in Star Trek yet (and may never be), but it can solve user interface problems that might otherwise require many confusing steps to achieve a result.
TMA Associates (www.tmaa.com)