mypie is the industry's first multimodal mobile interface implemented on Android devices, allowing users to interact with their email, contacts and calendar through simultaneously active voice and screen interfaces. Users can interact freely using voice alone, screen alone, or dynamic combinations of both. By offering these multimodal interfaces, mypie greatly improves the usability and accessibility of user information. On top of the multimodal interface, mypie also supports value-added services such as semantic queries and updates, which further enhance the usability of user information with minimal effort. This article examines the advantages and requirements of multimodal technology, focusing on the mobile environment, and describes how mypie resolved the technical challenges of multimodality with its own proprietary technology developed for Android devices.
A multimodal interface allows users to use two or more forms of input and output interchangeably in the same interaction with a computing device. For instance, simultaneously active voice and screen interfaces accept user commands in either modality and then present the result in both modalities in synchronization. With multimodal interface support, users choose a modality either explicitly, by issuing a modality selection command, or implicitly, by simply using the desired modality. Users may mix the modalities in any way they want to complete a given task in an application. For mobile applications, the voice modality is particularly useful because mobile devices are typically constrained by the physical limitations of a small screen and a tiny keypad. The voice modality alone, however, has its own limitations, as voice interaction may not be appropriate in some situations. The visual modality can be used interchangeably in those situations. In summary, a multimodal interface offers two distinct advantages: improved usability and improved accessibility.
When an application is driven by a multimodal interface, its usability increases, as the weaknesses of one modality are offset by the strengths of another[2][3]. On a mobile handset with a small display and keypad, a word may be quite difficult to type but very easy to say. Patient information in an operating room may be better accessed verbally, to maintain an antiseptic environment, and presented visually, to maximize comprehension.
A multimodal interface helps situationally impaired users access their applications. For instance, workers wearing gloves can use the voice modality, and drivers may use it to enter and obtain the information they need. A multimodal interface can also increase the accessibility of an application, as it can be used by people with a wide variety of impairments. Visually impaired users rely on the voice modality, while hearing-impaired users rely on the visual modality.
Mobile applications have to deal with the physical limitations of mobile devices, often in environments that are noisy or that raise privacy concerns. Multimodal functionality is particularly useful here, as it allows users to switch input modality to cope with those situations. To overcome the limited capabilities of physically reduced keyboards and displays, users can use the voice modality to enter long or hard-to-spell words.
When the user has to interact with an application in an environment that is noisy, or that requires silence or privacy, the visual modality is the better choice[1][3]. In other words, while a voice interface may greatly enhance usability and accessibility, voice-only mobile interaction may limit the overall usability of an application. To be useful for mobile applications, a voice interface must be augmented with visual modality interaction[3].
For mypie to support multimodal interaction with email, contacts and calendar on mobile handsets, we had to develop a set of new technologies that accept commands from both modalities freely and recognize voice commands and data accurately. Ranging from realtime audio streaming to dynamic grammar compilation, these new, patent-pending technologies were developed and tuned for a generic mobile environment, allowing them to be used on a wide variety of mobile handsets.
mypie dynamically accepts input from either the voice modality or the visual modality. The intrinsic requirement is that an input in one modality must be propagated to the processing agents of the other modality in time for them to react. This synchronization requires realtime behavior, as delayed propagation of an input event may result in an out-of-step reaction from the other modalities. To guarantee a lively, unified reaction from every modality, input events must be propagated to all of the modalities in realtime.
mypie ensures this modality synchronization by converting commands from the different modalities into a set of standardized internal commands, independent of the originating modality. These standardized commands are then processed by the mypie backend server to produce results in a modality-independent form. Each result is forwarded to both the voice and visual modality agents and presented in both modalities simultaneously.
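The conversion step described above can be illustrated with a small Python sketch. It is not mypie's actual implementation: the Command shape, the from_voice and from_screen helpers, the Agent class and the result string are all hypothetical names chosen for illustration. The point is only that two different inputs reduce to one internal command, and that one result is fanned out to every modality agent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    """Modality-independent internal command (hypothetical shape)."""
    action: str
    target: str
    args: tuple = ()


def from_voice(utterance: str) -> Command:
    """Map a recognized utterance such as 'open email 3' to a Command."""
    action, target, *rest = utterance.lower().split()
    return Command(action, target, tuple(rest))


def from_screen(menu_path: tuple) -> Command:
    """Map a screen selection such as ('email', 'open', '3') to a Command."""
    target, action, *rest = menu_path
    return Command(action, target, tuple(rest))


class Agent:
    """Stand-in for a modality agent; records what it was asked to present."""

    def __init__(self, name):
        self.name = name
        self.presented = []

    def present(self, result):
        self.presented.append(result)


def handle(command: Command, agents) -> str:
    """Process the command once, then send the same result to every agent."""
    # Placeholder for the backend work: here the "result" is just a string.
    result = f"{command.action}:{command.target}:{'/'.join(command.args)}"
    for agent in agents:
        agent.present(result)
    return result


voice, screen = Agent("voice"), Agent("screen")
spoken = from_voice("open email 3")
tapped = from_screen(("email", "open", "3"))
handle(spoken, [voice, screen])
```

Because spoken and tapped compare equal, the backend never needs to know which modality produced a command, and both agents receive an identical result to present.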
The mypie voice modality agent is always active during a session, ready to accept voice commands at any time. This always-active voice agent allows lively, transparent interaction without any modality selection command. Users just need to speak to enter a voice command (or select a screen menu to use the visual modality).
To keep the voice modality always active, mypie continuously samples audio input from the microphone and transports the sampled audio to the voice modality agent. The channel from the microphone to the voice agent has to go through a wireless network connection, which may introduce jitter and packet loss unexpectedly. To ensure the usability of the voice modality, it is essential to remove jitter and avoid packet loss in the delivery of voice samples. mypie uses an RTP proxy to remove jitter and relies on a wireless TCP connection to the mobile handset to deal with packet loss.
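As a rough illustration of what removing jitter involves, the Python sketch below shows a minimal reordering buffer keyed on packet sequence numbers. It is a generic technique, not a description of mypie's RTP proxy; the JitterBuffer class, its depth parameter and the frame names are assumptions made for this example, and sequence-number wraparound is deliberately ignored.

```python
import heapq


class JitterBuffer:
    """Reorders audio packets by sequence number before playout."""

    def __init__(self, depth=3):
        self.depth = depth      # packets to hold back before releasing any
        self._heap = []         # min-heap of (sequence_number, payload)
        self._next_seq = None   # lowest sequence number still acceptable

    def push(self, seq, payload):
        """Store one packet; return the payloads that can now be released."""
        if self._next_seq is not None and seq < self._next_seq:
            return []           # too late: a later packet was already released
        heapq.heappush(self._heap, (seq, payload))
        released = []
        while len(self._heap) > self.depth:
            seq_out, data = heapq.heappop(self._heap)
            self._next_seq = seq_out + 1
            released.append(data)
        return released

    def flush(self):
        """Release everything still buffered, in sequence order."""
        return [heapq.heappop(self._heap)[1] for _ in range(len(self._heap))]


buf = JitterBuffer(depth=2)
out = []
for seq in [1, 3, 2, 5, 4]:          # packets arriving out of order
    out += buf.push(seq, f"frame{seq}")
out += buf.flush()
late = buf.push(0, "frame0")          # arrives after its slot has passed
```

Holding back a few packets trades a small, fixed delay for in-order delivery; a larger depth tolerates more reordering at the cost of more latency.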
Continuously feeding sampled audio to the voice modality agent has the side effect of injecting the echo of the voice agent's own output. Since mypie always plays voice prompts while it presents information on screen, the echo may get back into the voice modality agent as if it were valid input. This echo must be cancelled to avoid misrecognition by the voice agent. mypie uses the Geigel double-talk detection (DTD) algorithm in cancelling the echo.
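The decision rule at the heart of the Geigel algorithm is simple enough to sketch in Python. The detector compares the magnitude of the current microphone (near-end) sample with the largest recent magnitude of the played-out prompt (far-end) signal, and flags user speech when the microphone level exceeds a fixed fraction of that peak. The class name, the window length and the threshold of 0.5 below are illustrative assumptions; how mypie tunes the detector or combines it with the rest of its echo handling is not stated in this article.

```python
from collections import deque


class GeigelDetector:
    """Geigel double-talk decision rule over a sliding far-end window."""

    def __init__(self, window=256, threshold=0.5):
        self.threshold = threshold          # assumed value, tune per device
        self._far = deque(maxlen=window)    # recent far-end magnitudes

    def is_double_talk(self, far_sample, near_sample):
        """Return True when the near-end level suggests real user speech."""
        self._far.append(abs(far_sample))
        return abs(near_sample) > self.threshold * max(self._far)


det = GeigelDetector(window=4, threshold=0.5)
echo_only = det.is_double_talk(far_sample=1.0, near_sample=0.3)
user_speaks = det.is_double_talk(far_sample=0.2, near_sample=0.8)
```

In the first call the microphone level stays below half of the recent prompt peak, so it is treated as echo; in the second call it exceeds that level, so it is treated as the user speaking over the prompt.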
With speech recognition built into the latest smartphones, many mobile users rely on the voice search function implemented on top of the speech recognition services. Using voice search, users can command a search with a single utterance. Voice search, however, does not always produce the desired result, because of recognizer errors. There are many reasons behind these errors, such as background noise or uncommon keywords.
Voice search over personal information may produce more errors, as personal information often involves local or foreign keywords that are uncommon among general web search keywords. To improve accuracy, mypie builds personalized recognition grammars by extracting keywords from the contents of the user's email, contacts and calendar. With a personalized grammar built dynamically for each individual user, the mypie voice recognizer can recognize the user's utterances with higher accuracy, resulting in an improved user experience.
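To make the idea of a personalized grammar concrete, the Python sketch below extracts keywords from a few sample strings and emits a grammar in JSGF (Java Speech Grammar Format). JSGF is used here only because it is a widely known grammar notation; the article does not say which grammar format mypie uses, and the function names and sample data are hypothetical.

```python
import re


def extract_keywords(texts):
    """Collect unique lowercase word tokens, preserving first-seen order."""
    seen = {}
    for text in texts:
        # [^\W\d_]+ matches runs of letters, including non-ASCII letters.
        for word in re.findall(r"[^\W\d_]+", text.lower()):
            seen.setdefault(word, None)
    return list(seen)


def build_jsgf(name, keywords):
    """Render the keywords as one public rule of alternatives."""
    # <VOID> is JSGF's built-in rule that can never be spoken; it keeps the
    # grammar valid when there are no keywords yet.
    alternatives = " | ".join(keywords) or "<VOID>"
    return (
        "#JSGF V1.0;\n"
        f"grammar {name};\n"
        f"public <{name}> = {alternatives};\n"
    )


senders = ["Agnieszka Nowak", "Jun-ho Park", "A. Nowak"]
keywords = extract_keywords(senders)
grammar = build_jsgf("sender", keywords)
```

A general-purpose recognizer is unlikely to rank a name like "Agnieszka" highly, whereas a grammar restricted to the user's own keywords makes it one of only a handful of candidates.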
To improve speech recognizer accuracy, mypie creates personalized grammars for each user when they register for the mypie service. There are three types of grammar: email, calendar and contact. Each grammar supports the mypie voice search process for its own type of item. Once created, the grammars are securely stored, then used and updated dynamically as new email, contact and calendar information arrives.
mypie updates the personalized grammars as new email, calendar and contact information becomes available. Depending on the number of entries, a grammar update may incur significant overhead, as the entire grammar file needs to be regenerated. To reduce this overhead, mypie uses a different update frequency for each of the email, calendar and contact grammars. All of the grammars, however, are updated at least once a day.
mypie creates and updates two separate email grammars: one for sender search and the other for subject search. The sender and subject keywords are extracted from the email headers and kept in hash tables that support the grammar updates. When a new email arrives, mypie immediately updates the hash tables if the email brings in a new sender or subject keyword. mypie, however, does not regenerate the grammar right after updating the hash table. Instead, it regenerates the grammar when the user logs into the system.
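The deferred regeneration described above can be sketched in Python as a pair of keyword tables plus a "dirty" flag. This is an illustrative model only: the EmailGrammarIndex class, its method names and the use of sorted keyword lists as a stand-in for a compiled grammar file are assumptions, not details taken from mypie.

```python
class EmailGrammarIndex:
    """Keyword hash tables with grammar rebuilds deferred until login."""

    def __init__(self):
        self.senders = {}     # sender keyword -> occurrence count
        self.subjects = {}    # subject keyword -> occurrence count
        self.dirty = False    # True when a new keyword awaits a rebuild
        self.builds = 0       # number of (expensive) rebuilds performed
        self.grammar = {"sender": [], "subject": []}

    def on_new_email(self, sender, subject):
        """Cheap path: update the tables, never rebuild the grammar here."""
        for table, text in ((self.senders, sender), (self.subjects, subject)):
            for word in text.lower().split():
                if word not in table:
                    self.dirty = True
                table[word] = table.get(word, 0) + 1

    def on_login(self):
        """Expensive path: rebuild only if a new keyword has been seen."""
        if self.dirty:
            self.grammar = {
                "sender": sorted(self.senders),
                "subject": sorted(self.subjects),
            }
            self.builds += 1
            self.dirty = False
        return self.grammar


idx = EmailGrammarIndex()
idx.on_new_email("Ana Lima", "Budget review")
idx.on_new_email("Ana Lima", "Budget review")   # no new keywords
first = idx.on_login()                           # triggers one rebuild
idx.on_new_email("Ana Lima", "Budget review")   # still nothing new
second = idx.on_login()                          # no rebuild needed
```

Any number of incoming emails between two logins costs at most one rebuild, and emails that introduce no new keyword cost none.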
The calendar and contact search grammar files are created when a new mypie user completes registration for the service, from all of the user's calendar and contact entries at the time of registration. As with the email grammars, the searchable keywords are extracted and stored in separate hash tables, from which the grammar files supporting voice search are created.
Once created, the calendar and contact grammar files are regenerated at regular intervals, as well as at user sign-on, if the associated keyword table has been updated.
Experience mypie on Android
Download mypie from the Android Market to try out the pre-built mypie demo account. Try the voice commands shown in the demo slides; the demo account login ID and PIN can be found in the demo slide's voice record.
Register for the mypie service from Android for your own Gmail, Gcontact and Gcalendar.
NOTE: mypie is currently in private Beta. Please contact us at beta@mypie.net to get private Beta permission.
Our first product, mypie, is the industry's first multimodal mobile interface to enable users to manage their email, contacts and calendar using simultaneously active voice and visual interactions.
In the future, we plan to apply our multimodal technology to develop mobile entertainment and travel applications that help users access the information they want with minimal effort.