
Design and Conquer

Designing the right speech application is the first step. Designing the code that implements the application is the next step. Conquering the development tasks is the third step. There are several later steps, but in this blog I intend to focus on the design and development.
Another new gig

I have decided to take a development job with a different focus. I'll still be doing development, but not in the realm of speech recognition applications. My last day with Gold Systems will be February 15th.

I'm bummed that I won't get a chance to work with Marshall! I think he's a great catch for Gold Systems, and I hope he enjoys his new position!

I plan to check in on the forums once in a while to see if I can lend a hand with any questions. I'm also interested to see how gotspeech.net evolves in the next several months with OCS 2007 coming out. It should be an exciting year!

I don't plan on having any future blog posts, since I won't be coding anything in MSS 2004 or OCS 2007, unless it's just for fun. :-)

I've really enjoyed being a part of this community. You all have been a great resource for advice and questions on how to best develop applications in MSS 2004. I hope this site only grows in its value to Speech Server and OCS developers!

 

Posted Tuesday, February 13, 2007 11:32 AM by Chris Jansen | 3 Comments

Quick post about n-best testing with TASim

I originally learned about the specifics of testing n-best lists with the TASim on the MS Speech Server Tips & Tricks webpage (http://www.microsoft.com/speech/community/tips.mspx). The tip indicates that you have to modify the TASim configuration file to use an actual Speech Server in order to test QAs that have n-best enabled. You couldn't test n-best with just the SASDK.

This has changed. I haven't seen any documentation about this (hence my post), but MSS 2004 R2 supports testing n-best natively. I happened to test a QA with n-best enabled, and could see in the SML tab of the Speech Debugging Console that multiple alternates were returned. This is a nice change because you can test n-best recognition without needing access to an actual Speech Server.

Posted Friday, September 22, 2006 11:34 AM by Chris Jansen | 2 Comments

Playing prompts, Part 2

As promised, here is the next chapter in a series of posts about playing prompts.

First, a brief digression to describe my target audience. When I think of you, I think:

  • You're familiar with the Speech Application SDK (SASDK)
  • You have developed at least one basic application, such as the pizza demo.
  • You have a working knowledge of all the basic concepts of the SASDK, such as QAs, semantic items, client side vs. server side (a.k.a. code-behind), etc.
  • You find yourself looking for more information than what's directly covered in the documentation
  • (Optional) You know a lot more about speech server development than I do and are prepared to enlighten me. :-)

Now, on to the prompt database.

Prompt Database

The prompt database is the primary method for playing prompts in an MSS application. At a high level, the prompt database is similar to a resource file – the key is the full text of the prompt, and the value is recorded audio that corresponds to the full text of the prompt.

There are advantages and disadvantages to using the prompt database to integrate recorded prompts into your application.

Advantages

  1. If you haven't recorded your prompt audio yet, the speech server (and the TASim) automatically falls back to using TTS. This allows for easy develop-test iterations, because you don't need to have any of your prompts recorded to test application functionality.
  2. In your code, it’s easy to know what the application will say at any given prompt because you have the full prompt text defined in the code.
  3. Extractions. Once you have a prompt defined in the prompt database, you can then define extractions of that prompt. I won't go into great detail on what extractions are (they are well covered in the documentation), but essentially you can re-use smaller chunks of any full prompt by defining an extraction. This is both an advantage and a disadvantage.
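
To make the lookup-and-fallback behavior from the first advantage concrete, here is a rough sketch. It is written in Python purely for illustration and is not SASDK code: the `resolve_prompt` function, the `PlaybackPlan` type, and the sample prompts and file names are all hypothetical. It only models the idea that the full prompt text is the key and that TTS is used when no recording exists for that key.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class PlaybackPlan:
    """What a (hypothetical) player would do for one prompt."""
    text: str
    source: str                      # "recording" or "tts"
    audio_file: Optional[str] = None


def resolve_prompt(text: str, recordings: Dict[str, str]) -> PlaybackPlan:
    """Look up a prompt by its full text; fall back to TTS if nothing is recorded."""
    audio_file = recordings.get(text)
    if audio_file is not None:
        return PlaybackPlan(text=text, source="recording", audio_file=audio_file)
    return PlaybackPlan(text=text, source="tts")


# Hypothetical data: only one of the two prompts has been recorded so far.
recordings = {"Welcome to the pizza line.": "welcome.wav"}

recorded = resolve_prompt("Welcome to the pizza line.", recordings)
unrecorded = resolve_prompt("What size pizza would you like?", recordings)
print(recorded.source, recorded.audio_file)
print(unrecorded.source, unrecorded.audio_file)
```

The second prompt still "plays" (as TTS), which is why you can test the call flow before any audio has been recorded.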

Disadvantages

  1. The key into the prompt database is the full text of the prompt. If the text of your prompts starts to get lengthy, you start to spend a lot of time entering prompt text, both in your code and in the prompt database.
  2. Once you import the recorded prompt into the database, the prompt database will align your prompt. This means that it will attempt to match the text of the prompt with the words spoken in the recording. If the prompt you recorded doesn’t use the same verbiage as the prompt key, the prompt will not align. You can then either manually align the prompt, modify the prompt text to match the new verbiage, or re-record the prompt to match the prompt text.
  3. If the application prompt text isn't finalized before you begin development (and in my experience this is quite often the case), you are left with a prompt administration challenge. Depending on how and where you define your prompt text, updating this text when your prompts change can be a difficult task. In addition you have to update it in two places – your code and the prompt database.
  4. It's difficult to use variations of the same prompt. For example, let's say you have an application that says Stephen Potter's favorite phrase, "Got it." Even though it's a small phrase, it's not a bad idea to have more than one version of this prompt recorded so the caller can hear some variability in the prompts the application plays. This is similar to a real conversation – a person doesn't say "Got it" exactly the same way every time. However, this is difficult to do with the prompt database – it would require use of PEML tags.
  5. Extractions, although beneficial, can produce less-than-optimal sounding prompts. In looking at just the text of a prompt, it may seem that you could re-use portions of a prompt somewhere else. However, there may be minor inflection differences between two uses of prompt audio that aren’t apparent until you hear how the extraction sounds in a different context.
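
The prompt-variation idea from item 4 is easy to picture outside of the prompt database. The sketch below is plain Python, not PEML or SASDK code, and the `PromptVariations` class and the file names are made up for illustration. It simply picks one of several recordings of the same phrase and avoids playing the same one twice in a row.

```python
import random
from typing import Dict, List, Optional


class PromptVariations:
    """Pick among several recordings of the same prompt text (hypothetical helper)."""

    def __init__(self, variations: Dict[str, List[str]], seed: Optional[int] = None):
        self._variations = variations
        self._last: Dict[str, str] = {}      # last file played for each prompt text
        self._rng = random.Random(seed)

    def pick(self, text: str) -> str:
        options = self._variations[text]
        # Exclude the recording played last time; if only one exists, reuse it.
        candidates = [f for f in options if f != self._last.get(text)] or options
        choice = self._rng.choice(candidates)
        self._last[text] = choice
        return choice


# Hypothetical recordings of the same short phrase.
picker = PromptVariations(
    {"Got it.": ["got_it_1.wav", "got_it_2.wav", "got_it_3.wav"]}, seed=7
)
played = [picker.pick("Got it.") for _ in range(6)]
print(played)
```

Whether something like this is practical in a real MSS application depends on how the prompts are referenced, so treat it as a picture of the goal rather than a recipe.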

One way to mitigate some of the disadvantages of the prompt database is to use a resource file. I tried this with the last two applications that I developed, and I was happy with the results. I’ll discuss the specifics in my next post.

 

Posted Wednesday, August 30, 2006 12:52 PM by Chris Jansen | 0 Comments

Playing prompts, Part 1
Yikes! How did a month and a half go by since my last post? I've had half my head immersed in an MSS 2004 project, the other half in messing around with MSS 2007, and the third half trying to figure out where my son's going to go to kindergarten!

On to the topic at hand. If you've had a little experience with MSS 2004, you probably know that there is more than one way to play recorded prompts. I'm going to try to cover the different options and identify pros and cons of each approach.

First, my list of the various ways you can play a recorded prompt in MSS 2004.
  1. You specify the text of the prompt directly in your code, and you add matching text in one or more entries in a prompt database. The prompt audio is retrieved from the prompt database and played.
  2. You use SSML markup to reference an audio file. The file is retrieved and played.
  3. You use a prompt database for your recorded prompts. Each prompt is labeled with a tag. In your code, a peml:withtag element references the specific prompt you want to play.
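
For option 2, the markup involved is the standard SSML audio element, which takes a src attribute and can carry fallback text that is spoken if the file can't be played. Here is a small sketch that builds such a fragment. It uses Python only as a neutral way to show the markup; the `ssml_audio_prompt` helper and the URL are hypothetical, and how the fragment needs to be wrapped inside an MSS prompt is something to check against the SASDK documentation.

```python
import xml.etree.ElementTree as ET


def ssml_audio_prompt(audio_url: str, fallback_text: str) -> str:
    """Build an SSML <audio> element whose content is the fallback text."""
    audio = ET.Element("audio", {"src": audio_url})
    audio.text = fallback_text           # ElementTree escapes &, <, > for us
    return ET.tostring(audio, encoding="unicode")


# Hypothetical location of a recorded prompt.
fragment = ssml_audio_prompt(
    "http://example.com/prompts/welcome.wav", "Welcome to the pizza line."
)
print(fragment)
```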
I'll start with #1 in my next post. If you know of other useful ways to play recorded prompts, leave a comment and let me know.

Posted Thursday, July 27, 2006 9:58 PM by Chris Jansen | 1 Comments

The bright side of a bevy of bugs

I've been working on an MSS 2004 speech rec application for several weeks now. For the past week and a half it has been in QA testing. I've had to track down and fix a few bugs, which has made me conscious of some of the beneficial tools available while developing in this environment. Here are some of the things I have found invaluable.

  1. The TASim and the Speech Debugging Console. These are pretty obvious ones. They're essential for doing any sort of testing of an app. I really like the fact that I can run my application on my own development machine and play around with the code before I deploy it on the server.

    The Speech Debugging Console is quite comprehensive. I can see the prompt text, the SML result, the active grammar, and a whole host of other real-time information. I especially like the Output tab, since it provides a running trace of a call. You can also embed logging statements in your JScript code and see this output in the Output tab (if you're not familiar with how to do this, leave a comment and I can post about it).

  2. Debugging and adding break statements through Visual Studio. Another obvious one, but based on my past experiences with other IVR development platforms, it's actually a novel thing to be able to break and step through a running application. It's not flawless, what with having to deal with both client and server sides, but it's still very helpful.
  3. The Speech Debugging Console Log Player. This tool allows you to "replay" previous TASim calls. You can, for example, use TASim to make a call to your application, go through the first five QAs, and then hang up. TASim generates an XML file with details about the call. You can then edit the XML file and remove the disconnect at the end. This allows you to use the console log player to replay the call and then continue using the TASim once the console log player finishes "playing". I have found this especially useful when I want to test something that's five QAs into an application and I don't want to provide input for those QAs over and over again. The console log player can get to a certain point in the app more quickly.
  4. log4net. This logging framework has been helpful for me to understand what my back-end data integration is doing while I'm making a test call. I usually use this to understand higher level server-side state without using the debugger to step through the code line by line. Of course I have log statements integrated throughout my code-behind to make log4net useful.
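
To show the kind of back-end logging described in item 4, here is a sketch that uses Python's standard logging module as a stand-in for log4net. None of this is log4net or SASDK code: the `lookup_account` function, the logger name, and the account data are invented for the example. The point is only the pattern of logging at the data-integration boundary, tagged with a call identifier, so a test call can be followed without stepping through the debugger.

```python
import io
import logging
from typing import Dict, Optional, TextIO


def make_call_logger(stream: TextIO) -> logging.Logger:
    """Create a logger that writes one line per event, tagged with the call id."""
    logger = logging.getLogger("ivr.backend")
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s call=%(call_id)s %(message)s")
    )
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def lookup_account(logger: logging.Logger, call_id: str, account_number: str,
                   accounts: Dict[str, float]) -> Optional[float]:
    """Hypothetical back-end lookup with log statements around it."""
    extra = {"call_id": call_id}
    logger.info("looking up account %s", account_number, extra=extra)
    balance = accounts.get(account_number)
    if balance is None:
        logger.warning("account %s not found", account_number, extra=extra)
    else:
        logger.debug("account %s balance=%s", account_number, balance, extra=extra)
    return balance


log_output = io.StringIO()
logger = make_call_logger(log_output)
accounts = {"1001": 42.50}                      # made-up test data
found = lookup_account(logger, "call-1", "1001", accounts)
missing = lookup_account(logger, "call-1", "9999", accounts)
print(log_output.getvalue(), end="")
```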

If you have any of your own tips for debugging, let me know!

 

 

Posted Monday, June 12, 2006 4:12 PM by Chris Jansen | 1 Comments

More than one way to skin a cat

I'm in the process of finalizing development of a custom speech recognition application. It's due to go to our QA "group" (aka Jim) next Wednesday.

I've done a handful of applications with the MS SASDK now, which is just enough to be really dangerous. As the title implies, I've learned that there are many, many ways to do the same thing in this environment. Some are good, some are not so good, and some are a downright bad idea.

Last week, Marshall pointed me to a section of the SASDK docs that discusses dynamic content (you can find it here: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/SASDK_UserManual/html/UM_deploy_Dynamic.asp). I learned that some of the things I was doing could cause performance issues in a production environment. The main one was changing a QA's Enabled property in the code-behind, which can cause the page script to be re-compiled. I discussed this issue with another developer here, and it seems that you're essentially causing the TAS to cache a copy of each distinct state of your page. So, for example, if the page has one QA that has its Enabled property set in the code-behind, the page would have two versions cached -- one where the QA is enabled, one where it's disabled.

This isn't nearly as big a performance issue as setting your inline prompt text dynamically. If the prompt text is truly dynamic, the TAS will cache a new copy every time a QA's prompt is different. To misquote Martha Stewart, "it's a bad thing".
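
Here is a toy model of the difference, based only on my understanding described above that a copy is cached for each distinct version of the page. It is plain Python, not a description of TAS internals: `render_page`, `PageCache`, and the QA names are invented for the illustration. Toggling one QA's Enabled flag produces two distinct pages no matter how many calls come in, while putting per-caller text in a prompt produces a new page for every call.

```python
import hashlib
from typing import Dict, Set


def render_page(qa_enabled: Dict[str, bool], prompts: Dict[str, str]) -> str:
    """Stand-in for the page the server sends: enabled QAs and their prompt text."""
    lines = []
    for qa_name in sorted(qa_enabled):
        if qa_enabled[qa_name]:
            lines.append(f"QA {qa_name}: {prompts[qa_name]}")
    return "\n".join(lines)


class PageCache:
    """Counts distinct page versions, keyed by a hash of the rendered page."""

    def __init__(self) -> None:
        self._entries: Set[str] = set()

    def store(self, page_text: str) -> None:
        self._entries.add(hashlib.sha256(page_text.encode("utf-8")).hexdigest())

    def __len__(self) -> int:
        return len(self._entries)


static_prompts = {"AskSize": "What size?", "AskCrust": "What crust?"}

# Case 1: one QA's Enabled flag is set in the code-behind.
toggle_cache = PageCache()
for call in range(1000):
    enabled = {"AskSize": True, "AskCrust": call % 2 == 0}
    toggle_cache.store(render_page(enabled, static_prompts))

# Case 2: one QA's prompt text is different for every caller.
dynamic_cache = PageCache()
for call in range(1000):
    prompts = {"AskSize": f"Hello caller {call}, what size?", "AskCrust": "What crust?"}
    dynamic_cache.store(render_page({"AskSize": True, "AskCrust": True}, prompts))

print(len(toggle_cache))   # 2
print(len(dynamic_cache))  # 1000
```

With several toggled QAs the first case grows too (up to two versions per flag, multiplied together), but it stays bounded, whereas the second case grows with the number of distinct prompt texts.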

(For anyone interested in the origin of "more than one way to skin a cat", look here: http://www.worldwidewords.org/qa/qa-mor1.htm)

Posted Friday, May 26, 2006 10:56 AM by Chris Jansen | 0 Comments

Introduction

Thanks to Brandon Tyler for getting me set up with a blog on gotspeech.net! Since its inception, I've been able to get some very helpful answers from this site. I've been working on IVR applications for many years, and this is the most organized community (okay, I'll be honest, other than the netspeechsdk newsgroup, it's the only community) I've ever discovered for discussing a specific speech recognition development platform.

A quick bio -- I've been with Gold Systems for 12 years. Since the company has only been around for 15 years, I've been termed a "lifer". I started working here before I graduated from college, and have worked with a variety of IVR application development platforms -- Script Builder, Graphical Designer (aka Graphical Disaster), IVR Designer, CRA Editor, and most recently the MS SASDK.

Learning to develop for Microsoft Speech Server was a big challenge, because it's paradigmatically different from other speech reco development environments. Although it has its challenges, I like it because it allows me to program in a more "real" programming language (C#), which for the most part I haven't been able to do on other platforms.

 

Posted Friday, May 26, 2006 8:58 AM by Chris Jansen | 2 Comments