Sara Morgan Rea specializes in designing enterprise-wide solutions using Microsoft technologies. Over the years she has written several articles for development journals such as MSDN, Enterprise Development, and Visual Studio Magazine. Her first book, “Building Intelligent .NET Applications: Agents, Data Mining, Rule-based Systems, and Speech Processing” (ISBN: 0321246268), was released in March 2005. You can learn more about Sara from her web site at http://www.custsolutions.net. You can also reach her by email at SaraRea1@custsolutions.net.
What you need:
Visual Studio .NET 2003
Microsoft Speech Application SDK
Related Resources:
Microsoft Speech Server web site
Building Intelligent .NET Applications: Agents, Data Mining, Rule-based Systems, and Speech Processing, ISBN: 0321246268
Top Ten Tips for Designing Telephony Applications
Microsoft Speech Server allows .NET developers to build telephony, or voice-only, applications quickly and easily. This article lists ten tips you may want to consider before designing this type of application.
Chances are good that you have used a speech-based application at least once. It may have been to access your bank information, check the status of an airline flight, or get the latest weather. The number and breadth of speech-based applications are growing every day. Part of the reason for this is the increasing use of cell phones. Another reason is the expectation that information must be available quickly and easily twenty-four hours a day, seven days a week. Most companies cannot afford to keep staff on hand to answer telephones at all times of the day and night. So telephony applications, in which data can be accessed over a telephone, have emerged as a logical choice.
For telephony applications, the user interface is the telephone. Input is received through spoken commands or DTMF (Dual-Tone Multi-Frequency), which involves the user pressing keys on the telephone keypad. These types of applications are not new. Also known as Interactive Voice Response (IVR) systems, they have been used by large organizations and call centers for years. Only recently, though, has the technology behind speech recognition and speech synthesis advanced significantly.
Introduced in 2004, Microsoft Speech Server includes an SDK (Software Development Kit) that allows .NET developers to build telephony applications using Visual Studio .NET. This article is not an introduction to using the Microsoft Speech Application SDK (SASDK). For that, you can refer to the online tutorial that accompanies the SDK or chapters two through four of my book, “Building Intelligent .NET Applications: Agents, Data Mining, Rule-based Systems, and Speech Processing”. Instead, this article presents ten things I think you should consider before writing a telephony application. The items are based on my own experience using the Microsoft Speech Server SDK as well as my opinions as a user of telephony applications.
# 1 – Allow the user to barge-in for all prompts
Recently I called my bank because I noticed some unusual activity on my account. I was told to contact their fraud hotline. When I called the hotline, a female voice recited a welcome message and then asked me to press one for options in English. After pressing one, I was forced to listen to all of the available menu choices. For some reason, the designers of this telephony application did not or could not allow the caller to barge in and enter a menu choice. Instead, callers must wait and listen to the prompts for all four menu choices before being allowed to select one. Now, I suppose some might argue that this forces the caller to hear all the choices rather than just selecting the first option. I would argue instead that all you are doing is aggravating the user, and that barge-in should always be allowed.
A prompt is a message that the system plays to the user to ask for input. The SASDK lets you set a BargeIn property on each prompt. The property accepts a Boolean value that indicates whether the prompt will stop playing and accept input when it detects either speech or keypad (DTMF) input. The default value is True, so you do not need to set it explicitly. Before setting it to False, consider the effect this might have on user acceptance.
# 2 – Use Dynamic Grammars when the content is not known ahead of time or is expected to change.
Just like traditional applications, telephony applications will typically interact with a database. Because the data within a database is dynamic, the information you collect from the user may need to be dynamic as well.
Grammars define what the user can say to the application. Telephony applications built with the SASDK utilize a grammar editor that allows developers to graphically design the Extensible Markup Language (XML) based files that comprise the grammar. The resulting grammar files have a .grxml file extension and are used by the speech recognition engine to understand what the user is saying. Each interaction with the user is associated with one or more of these grammar files.
Dynamic grammars are built each time a web page is executed. The following code shows you how a dynamic grammar can be built using the results of a stored procedure call.
// Name, connstr, and sp are supplied by the calling page.
// SqlHelper is assumed to come from the Microsoft Data Access Application Block.
StringBuilder s = new StringBuilder();
s.Append("<grammar xml:lang=\"en-US\" ");
s.Append("tag-format=\"semantics-ms/1.0\" ");
s.Append("version=\"1.0\" root=\"" + Name);
s.Append("Rule\" mode=\"voice\" ");
s.Append("xmlns=\"http://www.w3.org/2001/06/grammar\">");
s.Append("<rule id=\"" + Name + "Rule\" scope=\"public\">");
s.Append("<one-of>");
// Add one grammar item for each row returned by the stored procedure
SqlDataReader dr = SqlHelper.ExecuteReader(connstr, CommandType.StoredProcedure, sp);
while (dr.Read()) {
    s.Append(ConvertGrammarItem(Name, dr.GetString(0)));
}
dr.Close();
s.Append("</one-of>");
s.Append("</rule>");
s.Append("</grammar>");
// Attach the generated grammar to the QA control as an inline grammar
Microsoft.Speech.Web.UI.Grammar gram = new Microsoft.Speech.Web.UI.Grammar();
gram.InlineGrammar = s.ToString();
QAControl.Reco.Grammars.Add(gram);
QAControl.Visible = true;
The resulting grammar is then associated with the question and answer control named QAControl. The individual grammar items are built in the ConvertGrammarItem method which is listed below.
private string ConvertGrammarItem(string Name, string GrammarItem) {
    StringBuilder s = new StringBuilder();
    s.Append("<item>");
    s.Append("<item>");
    s.Append(GrammarItem);
    s.Append("</item>");
    // Semantic tag: assigns the recognized item to a property named after the rule
    s.Append("<tag>$." + Name + " = \"" + GrammarItem + "\"</tag>");
    s.Append("</item>");
    return s.ToString();
}
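One caveat with building grammar items from database values: a value containing a character such as & or < would make the generated grammar invalid XML. The following is a minimal sketch of one way to guard against that, assuming the standard .NET SecurityElement.Escape method is adequate for your data. The class and method names (GrammarItemBuilder, BuildItem) are hypothetical and are not part of the SASDK.

```csharp
using System;
using System.Security;
using System.Text;

// Hypothetical helper for illustration; not part of the SASDK.
public class GrammarItemBuilder
{
    // Builds one <item> element in the same shape as ConvertGrammarItem,
    // escaping XML-reserved characters in the value first.
    public static string BuildItem(string name, string value)
    {
        // SecurityElement.Escape replaces <, >, &, " and ' with XML entities.
        string safe = SecurityElement.Escape(value);

        StringBuilder s = new StringBuilder();
        s.Append("<item>");
        s.Append("<item>" + safe + "</item>");
        // Note: a value that contains a double quote would also need
        // script-level escaping inside the tag; that is outside this sketch.
        s.Append("<tag>$." + name + " = \"" + safe + "\"</tag>");
        s.Append("</item>");
        return s.ToString();
    }
}
```

With this helper, a database value such as "Baton Rouge & Co" is written to the grammar as "Baton Rouge &amp; Co", so the grammar still parses.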
# 3 – Allow the user the option of entering more than one piece of information in a single command.
In many cases it is faster for a user to speak a complex command than it is to type in a query or click a series of controls. Take for example an airline reservation system. To place a reservation, you need to provide several key pieces of information such as departure and destination locations, arrival dates and times, and travel preferences. For online systems, this information is typically collected through a series of text and combo boxes. For a telephony system, this information is collected entirely through spoken commands.
One way to do this is to prompt the user for each piece of information required. A telephony system like this is the most straightforward and involves a relatively simple grammar. However, it is faster to allow users the option of speaking more than one piece of information at a time. For instance, they might be allowed to say “Find me all Delta flights out of Louisiana for October 18th.” The downside to allowing this type of query is that the potential for error or misunderstanding increases. Also, more development time must be spent building the grammar. This is a tradeoff that must be considered when developing telephony applications.
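One common way to manage this tradeoff is to accept whatever the caller supplies in a single utterance and then prompt only for the pieces that are still missing. The following is a minimal sketch of that bookkeeping, independent of the SASDK. The ReservationSlots class and the slot names are hypothetical, and the recognized values are assumed to arrive as plain strings.

```csharp
using System;
using System.Collections;

// Hypothetical bookkeeping class for illustration; not part of the SASDK.
public class ReservationSlots
{
    // Pieces of information the reservation needs, in prompting order.
    private static readonly string[] Required = { "Airline", "Origin", "Date" };

    private Hashtable values = new Hashtable();

    // Records a value the recognizer extracted from the caller's utterance.
    // Empty or missing values are ignored so the slot stays unfilled.
    public void Fill(string slot, string value)
    {
        if (value != null && value.Length > 0)
            values[slot] = value;
    }

    // Returns the first slot that still needs a prompt,
    // or null when everything has been collected.
    public string NextMissing()
    {
        foreach (string slot in Required)
        {
            if (!values.ContainsKey(slot))
                return slot;
        }
        return null;
    }
}
```

If the caller says only "Find me all Delta flights out of Louisiana", the application would fill Airline and Origin, and NextMissing would return "Date", so the next prompt asks only for the travel date.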
# 4 – Don’t give the user more than three menu choices and limit nested menus to no more than two levels.
The overall goal of a telephony application should be to make information available to the caller as quickly as possible. For the same reasons users of traditional web-based applications do not want to waste time clicking through nested menus, telephony callers do not want to waste time navigating complex voice-only menus. This means as the designer of a telephony application, you will need to limit the number of menu choices and nested submenus. I recommend limiting each menu to no more than three choices and the number of nested menus (excluding the main menu) to no more than two levels.
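These limits are easy to check mechanically once the call flow is written down as a tree. The following is a rough sketch, not tied to the SASDK, that walks a menu tree and reports whether it stays within the limits. The VoiceMenu class is hypothetical, and the limits are passed in as parameters so you can adjust them.

```csharp
using System;

// Hypothetical model of a voice menu, used only to check the guideline.
public class VoiceMenu
{
    public string Name;
    public VoiceMenu[] Choices;   // empty for a leaf choice (an action)

    public VoiceMenu(string name, params VoiceMenu[] choices)
    {
        Name = name;
        Choices = choices;
    }

    // Call with nestingLevel = 0 for the main menu, which is not counted
    // as a nested level. Returns true when every menu offers at most
    // maxChoices choices and submenus are nested at most maxNesting deep.
    public static bool FollowsGuidelines(VoiceMenu menu, int nestingLevel,
                                         int maxChoices, int maxNesting)
    {
        if (menu.Choices.Length == 0)
            return true;   // a leaf is an action, not a menu
        if (nestingLevel > maxNesting || menu.Choices.Length > maxChoices)
            return false;
        foreach (VoiceMenu child in menu.Choices)
        {
            if (!FollowsGuidelines(child, nestingLevel + 1, maxChoices, maxNesting))
                return false;
        }
        return true;
    }
}
```

Calling FollowsGuidelines(mainMenu, 0, 3, 2) applies the limits recommended here: three choices per menu and two nested levels below the main menu.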
In addition to limiting the number of menu choices, you also need to limit the number of words in each menu choice. If the user is expected to speak a menu choice such as “get bank account balance”, make sure that the grammar allows the user to alternatively say “get bank balance” or “account balance”. If the user is forced to speak a four-word menu choice in its entirety, it is likely a mistake will be made and the user will become frustrated.
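One way to accept several wordings for the same choice is to list each phrasing as its own grammar item and have all of them set the same semantic value. The following is a minimal sketch that generates such a list in the same item and tag shape used by ConvertGrammarItem earlier in this article. The PhraseGrammar class and the "Action" and "Balance" names are hypothetical, and the phrasings are assumed to contain no XML-reserved characters.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of the SASDK.
public class PhraseGrammar
{
    // Builds a <one-of> list in which every phrasing sets the same
    // semantic value, so the application sees one result regardless
    // of which wording the caller used.
    public static string BuildChoices(string name, string value, string[] phrasings)
    {
        StringBuilder s = new StringBuilder();
        s.Append("<one-of>");
        foreach (string phrase in phrasings)
        {
            s.Append("<item>");
            s.Append("<item>" + phrase + "</item>");
            s.Append("<tag>$." + name + " = \"" + value + "\"</tag>");
            s.Append("</item>");
        }
        s.Append("</one-of>");
        return s.ToString();
    }
}
```

For example, passing the phrasings "get bank account balance", "get bank balance", and "account balance" with the value "Balance" produces three items that all report the same result to the application.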
# 5 – Limit the length and complexity of prompt messages.
Be careful to limit the length of prompt messages so you do not overwhelm users with too much information. Typically, users of telephony applications know exactly what piece of information they are trying to retrieve. A long prompt that goes into great detail about all the options available to the user will aggravate anyone but a first-time caller. Even for first-time callers, too much information may overwhelm and confuse them.
An alternative to lengthy instructional prompts is the inclusion of help commands. You can design the application so that help can be accessed any time the user simply says “help” or “I need help”. The messages associated with this help can change depending on where in the application the user is. Remember, you might have to periodically remind the user that a help command is available to them.
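The idea of help that changes with the caller's position in the application can be sketched as a simple lookup keyed by the current step, with a general message as the fallback. This is only an illustration of the bookkeeping, not SASDK code. The HelpPrompts class, the step names, and the reminder interval are all hypothetical.

```csharp
using System;
using System.Collections;

// Hypothetical helper for illustration; not part of the SASDK.
public class HelpPrompts
{
    private Hashtable byStep = new Hashtable();
    private string fallback;

    public HelpPrompts(string fallbackMessage)
    {
        fallback = fallbackMessage;
    }

    // Registers a help message for one step of the call flow.
    public void Add(string step, string message)
    {
        byStep[step] = message;
    }

    // Returns help specific to the caller's current step, or the
    // general message when no specific help was registered.
    public string GetHelp(string step)
    {
        string message = (string)byStep[step];
        return message != null ? message : fallback;
    }

    // True on every Nth question, so the reminder that help exists
    // is periodic rather than repeated after each prompt.
    public static bool ShouldRemind(int questionNumber, int every)
    {
        return every > 0 && questionNumber > 0 && questionNumber % every == 0;
    }
}
```

A short fallback such as "You can say help at any time" keeps the general message brief, while step-specific entries carry the detail only when the caller asks for it.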
Limiting the length of the prompt message is especially important for the welcome message. A welcome message is played when a telephony application is first initiated. If the message is too long or complex, a first-time caller will be discouraged from using the system. Return callers will be frustrated and will generally tune out the message, even if it changes and contains some piece of valuable information.
# 6 – Make sure prompts are spoken slowly and clearly and take advantage of tuning abilities.
A telephony application built with the Microsoft SASDK utilizes a prompt database to store all potential prompts. Prompts can be delivered using text-to-speech or associated with pre-recorded messages. The pre-recorded messages can be recorded using the developer’s voice or a professional voice talent. Care must be taken to have recorded messages spoken slowly and clearly so that all users can understand them. In addition, if a telephony application will be accessed by callers from different areas of the country or outside the United States, regional dialects must be considered.
The Microsoft SASDK allows developers to tune both pre-recorded and text-to-speech prompts. For text-to-speech prompts, SSML (Speech Synthesis Markup Language) tags can be used to specify volume and control the speaking rate. The following markup uses the ssml:prosody element to adjust the pitch and rate of the spoken text. It is wrapped inside the ssml:speak tag, which is the required root element for SSML.
<ssml:speak version="1.0"
xmlns:ssml="http://www.w3.org/2001/10/synthesis"
xml:lang="en-US">
<ssml:sentence>
No flights are available on October <ssml:prosody pitch="+0.5st" rate="-10%">18th</ssml:prosody>
</ssml:sentence>
</ssml:speak>
Pre-recorded messages can be tuned using the spectrogram view which is part of the Microsoft SASDK’s built-in Prompt Editor. When a recording is imported into the prompt database, the speech recognizer does a good job of aligning the recording with the typed text. However, sometimes adjustments are needed to make the prompt sound clearer. The spectrogram view allows the developer to adjust the word boundaries for imported recordings.
# 7 – Allow the user to ask for help, go back to a main menu, or repeat a prompt.
At any time during a call, telephony users should have the option of asking for help, going back to the main menu or repeating a prompt. In addition, callers will need to be told and then reminded that these options are available to them. It is not enough to tell them about these commands at the beginning of the call. Periodically, or at the end of each question you need to remind them of these commands with a prompt such as, “At any time, you can ask for help, go back to the main menu or repeat a prompt.”
To make the implementation of these commands easier to manage, user controls can be used. Just as in any ASP.NET application, user controls can group similar functionality that is used across multiple web pages. For example, the following markup can be placed inside a user control that is included with each web page in the application.
<speech:command id="HelpCmd" runat="server"
Settings="GlobalQASettings" type="Help"
width="129px" xpathtrigger="/SML/HelpCmd">
<GRAMMAR src="Grammars/Common/Help.grxml">
</GRAMMAR>
</speech:command>
<speech:command id="RepeatCmd" runat="server"
Settings="GlobalQASettings" type="Repeat"
xpathtrigger="/SML/RepeatCmd">
<GRAMMAR src="Grammars/Common/Repeat.grxml">
</GRAMMAR>
</speech:command>
<speech:Command id="MainMenuCmd" runat="server"
Settings="GlobalQASettings"
xpathtrigger="/SML/MainMenuCmd"
AutoPostBack="True" type="MainMenu">
<Prompt ID="MainMenuCmd_Prompt"></Prompt>
<Grammar Src="Grammars/Common/MainMenu.grxml"
ID="MainMenuCmd_Grammar">
</Grammar>
<DtmfGrammar ID="MainMenuCmd_DtmfGrammar">
</DtmfGrammar>
</speech:Command>
# 8 – Log exceptions and have a periodic review of those logs.
Telephony developers may need to debug their applications long after the initial development process. Logging the results of interactions with actual users will be critical to the success of these applications. By default, Microsoft Speech Server logs each call and stores the results in a Windows event trace log file with an .etl (event trace log) extension. Developers can then import these files into a SQL database and use tools provided with the SASDK to extract data from the interactions.
Obviously, developers will want to monitor these logs closely after an application is first deployed. However, once a telephony application is considered stable, this exercise should not stop. Periodic reviews of these logs should be established to monitor the success of user interactions.
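A periodic review is easier when the imported log data is reduced to a few numbers per prompt. The following sketch tallies how often each prompt failed to get a recognized answer. It is not based on the actual log schema or the SASDK tools. The PromptStats class and the idea of a (prompt ID, recognized) pair per interaction are assumptions made for illustration.

```csharp
using System;
using System.Collections;

// Hypothetical tally for illustration; the real log schema will differ.
public class PromptStats
{
    private Hashtable attempts = new Hashtable();
    private Hashtable failures = new Hashtable();

    // Records one caller interaction with the given prompt.
    public void Record(string promptId, bool recognized)
    {
        Increment(attempts, promptId);
        if (!recognized)
            Increment(failures, promptId);
    }

    // Fraction of interactions with this prompt that were not recognized.
    // Returns 0 for a prompt that has no recorded interactions.
    public double FailureRate(string promptId)
    {
        object total = attempts[promptId];
        if (total == null)
            return 0.0;
        object failed = failures[promptId];
        int failedCount = failed == null ? 0 : (int)failed;
        return (double)failedCount / (int)total;
    }

    private static void Increment(Hashtable table, string key)
    {
        object current = table[key];
        table[key] = (current == null ? 0 : (int)current) + 1;
    }
}
```

Prompts whose failure rate climbs between reviews are good candidates for a reworded prompt or a more forgiving grammar.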
# 9 – Allow plenty of time for application design and even more time for testing and revisions.
In the world of speech application development, the GUI (Graphical User Interface) has been replaced with the VUI (Voice User Interface). This means there are different challenges and considerations when designing these types of applications. To identify and face these new obstacles, telephony developers should allow plenty of time for application design. The call flow should be carefully designed to guide users through the application and not abandon them. It would be helpful to use a tool like Visio to diagram the call flow so it will be easy to visually identify all the possible paths a user can take.
# 10 – Try to design the prompts and navigation to be as natural and intuitive as possible and get input from actual users of the system before proceeding too far.
Speech-based applications will continue to grow in use as companies like Microsoft and others remain committed to making speech mainstream. Just as mistakes were made when the number of web applications seemed to double every day, speech-based applications will face similar challenges as they mature. Developers will be quick to implement telephony applications using newly available tools like Microsoft’s SASDK without first considering the complexity of speech-based applications.
What every new developer using the Microsoft SASDK must realize is that although telephony applications can be built in the same way as traditional web applications, they are not the same. The way callers use telephony applications, and the expectations they have for them, are very different from those of traditional applications. Developers should strive to design prompts and navigation so that the flow is natural and intuitive. More importantly, user feedback is critical. Before proceeding too far with the development of a telephony application, perform some type of field test using a sample of actual callers. This will help to ensure positive user acceptance and the fewest necessary rewrites.
The Microsoft Speech Server and SASDK are wonderful new tools every developer should be familiar with. If you have not had a chance to play with this product, I encourage you to visit the Microsoft Speech Server web site at http://www.microsoft.com/speech and download a copy of the SASDK.