Using Windows 7 for Speech Server Development

For those who may think that Windows 7 can't be used to develop Speech Server 2007 applications, I have found it can work.  Below are the steps I took to install Speech Server on a fresh install of Windows 7.

1.  Follow all the instructions explicitly for Windows Vista in the ReleaseNotes.htm file that is in the OCS2007SpeechServer folder.

2.  One of the steps has you install the Visual Studio 2005 Update for Windows Vista. It does apply to Windows 7.

3.  Don't do what I tried to do and install the Hotfix for Microsoft .NET Framework 2.0.  It is only for Windows Server 2003 and XP.

4.  Be sure to follow very carefully the install for IIS and MSMQ on Windows Vista.  This also takes the place of the step entitled "To Enable Integrated Windows Authentication."

5.  All other steps are self-explanatory.

6.  The last thing you will do, after you have installed at least one language, is install the Speech Server Cumulative Update. This gets rid of the Red X problem and a few other things.  This update cannot be applied directly by running the .MSP file.  Because of the security involved, you must run the install as administrator.  The problem is that if you right-click on the file to choose Run as Administrator, that option is not offered.  So what I did was create a simple .BAT file in Notepad, add the path to the .MSP file, and then run the batch file with the Run as Administrator option selected.
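The batch-file workaround in step 6 can be sketched in a few lines of Python that generate the .BAT file. This is only a sketch under stated assumptions: it assumes the update is an ordinary Windows Installer patch that `msiexec /p` can apply (my batch file simply contained the path to the file), and the paths shown in the usage note are hypothetical placeholders, not the real name of the cumulative update.

```python
from pathlib import Path


def batch_file_text(msp_path: str) -> str:
    """Return the text of a .BAT file that applies a Windows Installer patch."""
    # Quote the path so folder names containing spaces stay one argument.
    # "msiexec /p" is the Windows Installer switch for applying a patch.
    return f'@echo off\r\nmsiexec /p "{msp_path}"\r\n'


def write_batch_file(bat_path: str, msp_path: str) -> Path:
    """Write the batch file to disk and return its location."""
    target = Path(bat_path)
    # newline="" keeps the explicit CRLF line endings untouched.
    with open(target, "w", newline="") as handle:
        handle.write(batch_file_text(msp_path))
    return target
```

For example, `write_batch_file("install_update.bat", r"C:\Temp\update.msp")` (both names made up) produces a file that you can then right-click and run with Run as Administrator.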

That's it.  There is nothing more to do.  I have tested Speech Server and have run and edited apps already created in Vista and XP.

So now the only remaining question is, does Microsoft officially support the use of Speech Server development in Windows 7?  To date, I have not heard, but at least for now I suspect not.

Posted by kstep | 1 Comments

VPCs and a Softphone to Test Transfers

I don't know if anyone has tried this, but the other day when I was playing around with transfers using only soft phones and virtual PCs, I was able to do something I wasn't sure you could do. I was able to initiate a call to one speech server, which in turn transferred the call to a second speech server and started another application. The scenario is like this…

Create application A that does a transfer after it completes. Load application A onto Speech Server A running inside a virtual (or physical) PC (I'll call it PC-A).

Create application B, which starts up once a call is transferred to it from application A and uses the ANI for identification. Application B runs on a second virtual (or physical) PC, PC-B.

On my laptop I had a soft phone that called PC A's application. PC A's application successfully transfers the call to PC B's application. Except for the fact that I produced prompts that indicated that I was transferring from one app to another, and that the app on PC B had a prompt saying I just came from another app, the user's perception of the transfer is non-existent. That is, the user would not even know that the transfer took place except possibly for some latency.

A couple of things to note in this endeavor…

  1. The port numbers for each app need to be different. For example, I put application A on port 8090 and application B on port 8092.
  2. The softphone must use TCP by default. For example, X-Lite only works if you use a URI like this… sip:1234@192.168.8.5:8080;transport=tcp. Since Speech Server cannot append ";transport=tcp" to the end of the transfer URI (at least I don't know of a way), you need to use a phone where TCP is the default, such as sipXPhone.
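To make the URI point concrete, here is a small Python sketch that builds a SIP URI with an optional transport parameter and reads the parameter back. It is an illustration only: the function names are made up, the parsing is deliberately naive, and the UDP fallback reflects the usual SIP default when no transport parameter is present.

```python
def build_sip_uri(user, host, port, transport=None):
    """Build a SIP URI, optionally forcing a transport such as 'tcp'."""
    uri = f"sip:{user}@{host}:{port}"
    if transport:
        uri += f";transport={transport}"
    return uri


def transport_of(uri, default="udp"):
    """Return the transport parameter of a SIP URI, or the default if absent."""
    # Drop any "?header" section, then walk the ";name=value" parameters.
    without_headers = uri.split("?")[0]
    for param in without_headers.split(";")[1:]:
        name, _, value = param.partition("=")
        if name.lower() == "transport":
            return value.lower()
    return default
```

A transfer URI without the parameter, such as `sip:1234@192.168.8.5:8092`, comes back as `udp` from `transport_of`, which is why a phone that only listens on TCP when told to never sees the transferred call.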

     

This kind of gives one some hope that without any telephony backend, you can test real transfers.

Posted by kstep | 6 Comments

Debugging Speech Server: Lesson Learned

If you've worked with computers long enough you know that there are times when you run across a problem that seems so elusive that you just want to attribute it to, "that's just the way it is", and move on.

Such was the case not too long ago when I ran across a problem using Speech Server. It got to the point where all I wanted to do was backup our databases, wipe clean the Speech Server, put on a fresh operating system, re-install Visual Studio 2005, SQL Server, prerequisites for Speech Server and Speech Server itself… not a very appealing option, but one I was really close to accepting. It was an extremely frustrating experience.

So the scenario is I have a laptop with a soft phone, calling across the local network to a Windows 2003 Server with Speech Server installed. I was able to call the Speech Server and everything about the call was fine except for one important part. I could not hear anything. I knew the soft phone was being answered, because of the message on the soft phone that said I was connected. I could verify the connection was being made on the Speech Server because I could see the additional call show up through the Performance Monitor. I even used Wireshark to do a trace to show that the SIP traffic was going back and forth, and to see if there was some anomaly with the audio. But everything looked fine. There were no errors in any system or speech logs either on the Speech Server or my laptop. My natural inclination was to look into a firewall problem. After all, firewall problems always seem to be the culprit when it comes to SIP calls. After shutting down the firewall, however, I still had the problem.

Just for grins, I took a virtual machine that had Windows 2003 Server and Speech Server loaded, and ran the virtual machine on my laptop and attempted to make a call using the same scenario. I had a soft phone on my host operating system calling a Speech Server on a Virtual machine, all on my laptop. Everything worked fine, including speech.

My conclusion… my laptop was configured properly, my soft phone was configured properly, and the problem with getting no audio had nothing to do with the laptop. Strangely, the audio simply was not coming from the Speech Server.

So what had changed? The Speech Server I was using for development had been working fine before I had made changes to my application. With this new version of the app, it simply would not produce audio. Had I done something with the application to cause that? After testing the old version of the application, I still was not producing audio to my soft phone. I even tested the built in "Welcome to Microsoft Speech Server" app that comes with Speech Server and that didn't produce audio either.

Another thing that had changed was that we had installed .NET 3.5 on the Speech Server box to accommodate the latest WCF (Windows Communication Foundation). Could that have been the problem? To eliminate that possibility I also installed .NET 3.5 on the virtual machine I used earlier. After testing, it worked just fine. Audio passed to the soft phone without a problem.

So I still was not getting audio from our Speech Server to my laptop and I still hadn't a clue why. Could it be that the problem was not the Speech Server but my laptop? It seemed unlikely since everything worked fine when I used a virtual machine, but just in case I asked one of my colleagues to set up their laptop with a soft phone configured the way mine was. I then went to Speech Server to configure a trusted SIP peer. My colleague made the call and lo and behold, he received audio on his soft phone!

So now things pointed back to my laptop. But what could be the problem? What was different now as compared to two weeks ago when things were working fine?

It turns out that the week before I discovered the problem I am describing here, we were at Microsoft going through a tech review of our application, and one of the things needed was a demonstration of our application. To do that I used my laptop running a virtual machine with Speech Server, all on my laptop. Because I was not going to be connected to a standard network, I had set up my laptop with a virtual network using a loopback adapter. In case you are not familiar with a loopback adapter, this is a piece of software that ships with Windows that allows you to configure a virtual network card that can be used to test network connections. It can also be used to set up a virtual machine that connects back to the host operating system through a virtual network without having a physical one.

Just like a real network card, you can have the loopback adapter connected or disabled. So for grins, I decided to look into the possibility that the loopback adapter was causing the problem. It turns out the loopback adapter was still connected as it was while I was at Microsoft. So I disabled it and tried my test again. IT WORKED! With the loopback adapter disabled, I was now getting audio from the Speech Server.

So there are three morals of this story. One, if you use a loopback adapter, make sure you shut it off before doing testing over a physical network. Two, don't just arbitrarily wipe a box clean and spend many hours getting that box up and running without exhausting other possibilities. If I had done that, I would have found out that the problem still existed, and that would have really been frustrating. Finally and probably most important, when debugging, always think about what changed since the last time it worked. Nine times out of ten, it is a change that seems to be benign that causes your problem.

To this day I am not really sure why the loopback adapter being connected causes a problem with audio being passed from Speech Server to a soft phone over a physical network, but it does. And maybe someone can help me understand that. This is just another one of those lessons learned that I wanted to share with you.

Using VS 2008 and VS 2005 on the Same Computer

There has been some discussion about using Visual Studio 2008 and Visual Studio 2005 with speech server on the same box. I thought I would write about my experiences and provide some warnings and alternatives.

I personally have loaded both VS 2008 and VS 2005 on an XP Pro box and things seem to be OK with one exception...  It seems there was a slight change in the way the workflow activities work.  Now, whenever a required property is needed on an activity, and in some cases even when things should not indicate requirements, instead of displaying a red exclamation mark, there is a red X through the entire activity icon box.

As long as there is a red X, it appears as though you cannot select the activity. However, when you do click or right-click on the activity everything seems to work as it should.  The only thing you can't do is delete the activity unless you minimize it by clicking the double chevron button. This is more of an annoyance than anything else.  I can still write and debug speech apps.

Using Vista is a whole other issue. A colleague of mine installed VS 2008 on the same box as VS 2005 and Speech Server. When he tried running an app within Visual Studio, there was an error indicating that the debugger could not attach to the workflow process. Even when he tried using CTRL-F5 (non-debug mode) he got the same error.

So the bottom line is I would not recommend loading VS 2008 alongside VS 2005 if you are working on speech apps, especially on Vista. With XP Pro, things seem to work OK but with the caveat I mentioned above. There may be some other issues I have not uncovered yet.

My suggestion would be to create a virtual machine and load VS 2008 on the virtual machine if you need to work on VS 2008 and VS 2005 on the same box.

Test Blog From Word 2007

This is a test blog.  For those that blog to this site and would like to know how to set up Word 2007 to post a blog, let me know.

Posted by kstep | 2 Comments

The nBest Choice

There are times when you receive ambiguous responses to very direct questions. For example, what if Speech Server asks a user what their last name is and the response is "Smith."   If it matters, the spelling could be s-m-i-t-h or s-m-y-t-h. Or what if the question is,  "Which city are you traveling to?", and the answer is "Austin."  The problem here is it could be interpreted as Boston.

There are times when we need to get the list of possible responses that the recognition engine derived from the grammar it used.  Armed with the list, we can then ask the user which of the possible answers is correct, instead of asking the same question over and over.  For example, consider this scenario...

IVR:  Which city are you traveling to?

USER: Austin

IVR: I think you said Boston. Is that correct?

USER: No

IVR: I'm sorry which city are you traveling to?

USER: Austin

IVR: I think you said Boston. Is that correct? ....

(By the way, it is possible to have the system allow the user to say, "No, I said Austin.", but that's a whole other discussion and it could still result in the system thinking the user said Boston.)

As you can see, this could get quite tedious for the user and the programmer.  Instead, we could present the user with other possible answers like this...

IVR:  Which city are you traveling to?

USER: Austin

IVR: I think you said Boston. Is that correct?

USER: No

IVR: I'm sorry, did you say Austin?

USER: Yes

Michael Dunn had a wonderful blog post, called N-Best Type Functionality, about how to obtain an nBest within a Workflow application. In that post Michael pointed out that there is a property associated with a Speech Activity called RecognitionResult.Alternates. Alternates is an object of type List<RecognitionPhrase>, and RecognitionPhrase is an object that contains information about a possible answer returned in the Alternates list. Properties associated with RecognitionPhrase include but are not limited to:

Confidence - This is a score with a value of 0.0 - 1.0, where 1.0 is 100% confidence that the answer is correct.  (In the real world a confidence score of 1.0 is nearly impossible, and besides, a recognized phrase would not be in the alternates list if its confidence score was 1.0.)

Homophones - A homophone is a word that sounds the same as another but is spelled differently. This property is a list of homophones for the recognized phrase.  It could be used as a way to present the user with alternative spellings of a response, using something like this:

   qaLastName.MainPrompt.AppendTextWithHint("Boston", Microsoft.SpeechServer.Synthesis.SayAs.SpellOut);

Text - The word or phrase that was recognized.  In the example dialog above, this is where I would obtain Austin as the possible alternative to what the speech engine thought was said, namely Boston.
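The second dialog above boils down to "offer the best alternate the caller has not already rejected." Here is a rough sketch of that selection logic in Python. It does not use the Speech Server API at all: the `Alternate` class is a hypothetical stand-in for RecognitionPhrase carrying only Text and Confidence, and the scores in the usage note are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alternate:
    """Hypothetical stand-in for a recognition alternate (text plus score)."""
    text: str
    confidence: float


def next_candidate(alternates, rejected):
    """Return the highest-confidence alternate the caller has not rejected."""
    remaining = [a for a in alternates if a.text not in rejected]
    if not remaining:
        # Nothing left to offer; the app should re-ask the question.
        return None
    return max(remaining, key=lambda a: a.confidence)
```

With alternates of Boston (0.62) and Austin (0.58), the first call with an empty rejected set returns Boston; once the caller says "No" and Boston is added to the rejected set, the next call returns Austin, and a third rejection returns `None`.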

BIG NOTE:

Unfortunately (and I learned this the hard way) you cannot test alternates using the built-in debugger and soft phone by pressing F5.  Instead, use Ctrl-F5 to put the soft phone into a non-debug mode.  Doing this prevents you from using the User Input and Call History tabs. It does not shut off debugging in Visual Studio; it simply prevents you from inputting your response by typing it in, so you will have to use a microphone and speak your response or use the DTMF buttons. It turns out that when you use the debugger's soft phone in debug mode, even if you record your response, the soft phone puts the result into the Text Input: field.  You are then required to click on the Submit button.  Doing this is equivalent to typing in the text and yields effectively a 100% confidence score from the recognition engine, which eliminates any possibility of alternative phrases showing up in the Alternates list.

Time Wasted Learning?

Well, maybe I should not call it time wasted, just learning experiences. I have spent a lot of time on things that, if I had done them a little differently, could have saved me some heartache.  Below is a list of items that I have learned, some as recently as yesterday. If you are interested in more details on any of these, I will elaborate in later blogs.

 

  1. Don’t give your speech server computer a name that starts with numbers. The Speech Server Process does not like that.  Maybe a bug?
  2. Firewalls are a necessary evil but you can set your applications to run on a specified port number... including a port that is already open. This works well for machines where you have no control over the port allocation.
  3. I can’t call Microsoft Speech Server using a soft phone on the same box as Microsoft Speech Server… I have wasted an inordinate amount of time trying.
  4. Using Virtual PC to set up a virtual Speech Server is a great learning and debugging tool. All you need is about 2 GB of memory on the box to make it worth your while.
  5. Set the UseDefaultGrammars property to false if you are using custom grammars for controls like MenuActivities and NavigableListActivities.  You can get some strange results when your activities use data sources that have items that are in your grammars.  The activities use the items in the data sources to create the default grammars.
  6. There is nothing built into the prompt engine that converts numbers and dates to their word equivalents. You’ve got to create your own code to do that if you don’t want text to speech (TTS) as an output for that type of data.
  7. Don’t expect all the features of WorkFlow to work in a SpeechWorkflow. For example, in a regular Workflow application you can add and remove activities dynamically. That does not work so well with Speech Workflows.
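As an illustration of item 6, here is the kind of conversion code I mean, sketched in Python rather than in the C# you would use inside a speech app. It only handles whole numbers from 0 to 999,999 and uses American-style wording with no "and"; each word it returns would map to a recorded prompt. The function name and word tables are my own and not part of any Speech Server library.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out a whole number between 0 and 999,999 in English words."""
    if not 0 <= n <= 999_999:
        raise ValueError("only 0 through 999,999 is supported")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = " " + number_to_words(rest) if rest else ""
        return ONES[hundreds] + " hundred" + tail
    thousands, rest = divmod(n, 1000)
    tail = " " + number_to_words(rest) if rest else ""
    return number_to_words(thousands) + " thousand" + tail
```

Dates need the same treatment (month names, ordinals for the day, a spoken form for the year), which is why this ends up being more code than you might expect.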

So hopefully some of this will be news to you and it will have been worth the time to read this blog. 

Posted by kstep | 4 Comments

Dynamic Workflows - Session Data

I was tasked to develop an application that has to be very dynamic. What I mean by this is that I needed to be able to have call flow generated at run time based on some parameters that are set in a database. To this end, one of the issues I ran into was the ability to create and maintain session data and have that data available to custom activities.  The idea is that depending on parameters that are obtained from a data source, some or all of these custom activities would be added to the call flow at the instant that call was initiated and each of these activities would need access to some global data dictionary in order to store and retrieve data that was collected by other activities.

 In OCS Speech Server, when using work flow to develop speech apps, there is a work flow engine that processes each activity in a very serial way unless some event is triggered to cause the work flow to suddenly jump from the main call flow to some other call flow. There is a public object that is available called UserData.

 UserData is a Dictionary that stores Key/Value pairs. At first I thought this would be a perfect variable to store any data that I collected from the user while their session was active. After some testing and a little research I quickly came to the realization that the UserData dictionary did a real good job at storing user data but that data was maintained for all users, which is to say, UserData was an instance variable of the WorkFlow and was not a session variable provided anew for each caller. So I needed something similar that would be available for each caller.

 The solution appears to be something I found in a BLOG by Chad Cambell.  In his BLOG he talks about creating your own session variable and how to get to that session variable from the main call flow or from any custom activity.
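To show the general idea of per-caller session data (this is not Chad's code, just a language-neutral sketch in Python), imagine a store that hands each call its own dictionary, keyed by some unique call identifier. The `call_id` values and the class name are hypothetical; in a real speech app the key would be whatever uniquely identifies the call.

```python
import threading


class SessionStore:
    """Keeps one private dictionary per call instead of one shared by all."""

    def __init__(self):
        self._sessions = {}
        self._lock = threading.Lock()  # calls are handled concurrently

    def for_call(self, call_id):
        """Return the dictionary for this call, creating it on first use."""
        with self._lock:
            return self._sessions.setdefault(call_id, {})

    def end_call(self, call_id):
        """Discard the call's data once the caller hangs up."""
        with self._lock:
            self._sessions.pop(call_id, None)
```

Each custom activity would ask the store for the current call's dictionary and read or write its values there, so data collected for one caller never leaks into another caller's session.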

Although this solution seems to be sound, I would like to hear from anyone that may have a different solution.

Posted by kstep | 1 Comments

VXML vs. WorkFlow for Speech Server 2007.

Before I get to the core of what this blog is all about, some explanation is in order as to where the heck I have been.

About a year ago I started, and miserably failed at, writing tutorial blogs on how to use Speech Server 2004. That was during a time when I was trying to build a business writing speech software with my company, CRK Software.  Well, CRK Software was not going in the direction I had hoped it would, and unfortunately I had to put CRK Software on the back burner and work at a real job with Countrywide Mortgage as an ASP.NET developer. For a long time now I have done little to nothing with Speech Server and felt like I was not going to be much help in this arena.

Well, time went on and I found out about another contract that could potentially put CRK Software back on the front burner, so now I am working a new contract doing, you guessed it, OCS Speech Server work. One of the first questions I was asked was whether to use VXML or Workflow to build their product. At first it seemed like a no-brainer. Workflow all the way. Especially since other software development outside the speech world would be using Workflow.  Then I got to thinking... why not VXML?

There are several reasons to choose VXML over Workflow and several reasons not to. So below is a bit of a comparison.  Let me know if you can come up with other compelling reasons to use one standard over another.

VXML Pros

1.       A more universal and well established standard.

2.       Less proprietary so there are more abundant choices in VXML interpreters/speech servers.

3.       Can be written based on J2EE or .NET.

4.       Because it’s XML, tools abound for authoring XML and are quite good in Visual Studio.

VXML Cons

1.       Development tools although abundant for XML, are not necessarily well established to debug voice apps with grammars and the like.

2.       I believe there is a longer development cycle but I guess familiarity with VXML, the development tools and other factors can affect development life cycle.  Since I am woefully lacking in VXML development experience I could be persuaded otherwise.

3.       Both client side (javascript) and server side (Java, C# or VB.NET) code is necessary. Development is more of a web development paradigm. 

OCS Speech Workflow Pros

1.       Because speech applications tend to be rather linear in their execution, speech apps lend themselves quite well to workflow.

2.       Very good tools for writing and debugging within Visual Studio.

3.       A graphical development environment for both dragging and dropping call flow and seeing the call flow.

4.       Very little, if any, client-side code is necessary to develop speech apps.

5.       Much of the .Net Speech library applies equally well to Vista desktop speech development but I am not sure how big of a deal this is when you are only concerned about IVR development unless you are looking to have a skill set outside IVRs.

OCS Speech Workflow Cons

1.       Very proprietary. You must write on Microsoft tools for Microsoft Speech Server.

2.       Very young technology and to some extent not very well supported, even by Microsoft, although that is improving. Be prepared to plow new paths without much documentation.

3.       If you are not a .NET developer, you will have a steeper learning curve to get used to the development tools and .NET in general.  Even if you have written Speech Server 1.1 SDK apps, there is a large shift in the way you develop IVR apps.

4.       It's in .NET. Although I don't consider this a negative, others, especially the J2EE die-hards, would.

And what about SALT?

 

All the same pros and cons as for Workflow apply (although documentation is more plentiful), plus two additional cons…

 

1.       Microsoft seems to be moving away from SALT as a standard.

2.       SALT can be tedious since much of the development requires client-side JavaScript.

Posted by kstep | 5 Comments

What is Necessary to Make Speech Server Run? - A System Requirements Discussion

After a good vacation it’s hard, number one, to get back into the swing of things, and number two, to catch up on all the things you need to in order to get back into the swing of things.  Did that make any sense?

 

At any rate I am back at it, and in this installment I want to continue to talk about some of the hardware and software pieces that make up the Microsoft Speech Server environment. In the last installment I spent some time talking about the TIM and telephony board. Those are the pieces that allow a telephone network to connect up to the Microsoft Speech Server.  But what about the computers necessary to make Speech Server work? What kind of hardware is necessary there? Also in my last installment I said I would talk about configuring the MSS server, but I think instead it makes more sense to talk first about what is required for a speech system to run, then develop an application, and then talk about configuring MSS.

 

Before I go much further I want to take a minute to address those of you who are new to Microsoft Speech and simply want to experiment with this technology. DON’T LET THIS ARTICLE SCARE YOU. I say this because as you read through, it may seem like you will be getting in over your head.  Rest assured you can actually develop a speech app using the Microsoft Speech SDK without any additional hardware expense. With an MSDN license or a trial version of Microsoft Speech Server, and a developer’s license of a SIP TIM (like Vail Systems), you can develop an IVR without any additional hardware, and it will even run in a Virtual PC or Virtual Server, but you will still need to obtain a license of Windows Server 2003.  So read on but don’t get discouraged.

 

If you’ve been around the IVR industry long, you know that one of the arts within the art of providing a good speech application is being able to determine how much hardware needs to be put into place to make a system with good scalability, as well as good response time.  To tell you the truth I am no expert in this realm, and yet it’s one of the most asked questions.  For example, how many computers should I have? What speed processor? How much memory and hard disk space?  How many ports (telephone lines) do I need? Should the Speech Server be on the same box as the Web Server? (Also known as the application server.)  Should I put the database on the same box? And on it goes.

 

First, to answer these many questions, let’s look at all the pieces that are necessary to make a Speech Server work.  It should be noted that when you install the Microsoft Speech Server, you have the option to install two primary pieces. Those pieces are the Speech Engine Services (SES) and the Telephony Application Services (TAS). Let’s talk about what each of these does.

 

SES is broken down into two parts: the Speech Recognizer (SR) engine and the Text-To-Speech (TTS) engine. As the names imply, the SR engine is used to recognize speech by comparing what is spoken to a pre-determined grammar. The recognition output of the SR engine is an XML document called SML. It contains the text of what was recognized along with confidence scores and semantic items. We will talk more about SML in the next installment. The TTS engine will generate a digitized stream of audio data from text that is part of another XML document called SSML, and again we will talk more about SSML later.  So in reality SES is the heart and soul of a speech system. It recognizes spoken words and plays back text as audio. As you may have already guessed, this part of the Speech Server system requires a lot of processing power, i.e. processor speed and memory.

 

The other part of the Microsoft Speech Server is the TAS. As I stated in the last installment, you can think of the TAS as a web browser for speech. Since telephones do not have the capability to interpret web pages in the form of HTML, SALT, and JScript, it becomes incumbent on the TAS to do that on behalf of the telephony hardware. So TAS is the speech browser and interpreter for speech applications. It understands HTML, JScript and those special HTML tags for speech called SALT.  TAS too can require a lot of processing power, especially if TAS is required to keep track of a lot of telephone calls at one time.  It would be like having your desktop run multiple Internet Explorer browsers with each one running web pages with sound and video.  Clearly if we try to run too many of these web pages at once, we start to run out of computer resources and latency is the result.

 

Because speech IVR applications are fundamentally web applications, we also need a web server. For Microsoft Speech, this requires Internet Information Services (IIS).  This software is used to send to the client (TAS in the case of a Microsoft Speech Server IVR) the necessary HTML, JScript and SALT web pages. IIS does a lot of work. Mainly this work comes in the form of taking C# or Visual Basic code and turning it into the appropriate HTML, JScript and SALT so it can send this information to the TAS.  Of course other functions include dealing with security and keeping track of application and session state. (If you are not familiar with the terms ‘application state’ and ‘session state’ take a look at this article from Microsoft.) The point here is that IIS does more than just find a web page file and send it out to the system that requests it.

 

One other major part of a typical Speech Server system is a database of some sort. The database engine that you use will of course depend on your company’s preference.  But when we talk Microsoft we are talking Microsoft SQL Server. Others come to mind as well, such as Oracle, Sybase, and the beloved and free MySQL.  Regardless of the database engine you use, you must consider how big the database will be, or whether it already exists. Many companies tie their Speech Server to existing database systems.  Typically a speech system is added as an additional interface to existing applications. Now users have the added flexibility to get to a company’s data either through a web site or through a speech portal via phone.

 

Now that we understand the pieces that need to be put into place for a speech system, we must now do the real work. We must somehow come up with a way to determine if, at one extreme, all these pieces will be working on separate boxes tied together by a network, or if our system will be handling a sufficiently small number of users that the entire set of components can be hosted on a single box. Usually it’s somewhere in between.

 

There are many factors to consider. The number of concurrent users tops these considerations. Sometimes just knowing how many concurrent users there will be isn’t enough, however.  Sometimes we need to know when our peak load will be, and whether it is OK during those peaks to simply put the user into a queue or give them a busy signal.  We should also know how long we expect the user to be on the line and how much of the application requires intense SR activity or backend processing.  Then there is the issue of the network. Usually, the bottleneck in the whole system is the network. Should we consider putting all the components on a local area network, or is that impossible because our company’s topology requires databases and web servers to be geographically separated?  Microsoft Speech Server even allows SES and TAS to be on separate boxes with load balancing. (My understanding is that in the upcoming version of Speech Server included in Office Communications Server, separating TAS and SES will not be possible.)  The big IVR companies typically have some sort of process in place to calculate how many computer systems will be required to handle a given deployment. Even at that, many times once the system is installed, tweaking and tuning cause that estimate to be revised.
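For the "how many ports do I need" part of the question, one classic starting point from telephony (not something specific to Speech Server, and not a sizing method I am claiming the big IVR companies use) is the Erlang B formula. It estimates the chance a caller gets a busy signal for a given number of ports and offered load, assuming calls arrive at random and blocked calls are simply turned away rather than queued. The sketch below is in Python and the call volumes in the usage note are made up.

```python
def erlang_b(ports: int, offered_load: float) -> float:
    """Probability a call is blocked, given ports and offered load in erlangs."""
    blocking = 1.0  # with zero ports every call is blocked
    for n in range(1, ports + 1):
        # Standard recurrence: B(n) = A*B(n-1) / (n + A*B(n-1))
        blocking = offered_load * blocking / (n + offered_load * blocking)
    return blocking


def ports_needed(calls_per_hour: float, avg_call_seconds: float,
                 max_blocking: float = 0.01) -> int:
    """Smallest port count whose blocking probability meets the target."""
    # Offered load in erlangs = arrival rate multiplied by average hold time.
    offered_load = calls_per_hour * avg_call_seconds / 3600.0
    ports = 0
    while erlang_b(ports, offered_load) > max_blocking:
        ports += 1
    return ports
```

For example, 60 calls in the peak hour at 60 seconds each is an offered load of 1 erlang; if you can tolerate one caller in four getting a busy signal, `ports_needed(60, 60, 0.25)` says two ports. This only covers telephone lines. It says nothing about CPU, memory, or the network, which still take experience and testing.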

 

This I can tell you. You can find articles on this subject that go into detail on what you can expect, like this one from Intel.  Many times it takes experience, knowledge of your system and trial and error to come up with the right answer.  The other thing I can tell you is I have developed a “works-in-a-box” system (SES, TAS, IIS and a database on the same box) with 4 ports that has no problem running with four users concurrently. That box is a 2.4 GHz Pentium D with 2 GB of memory.  So while this article may make it seem like developing a relatively cheap system for development or small user loads is elusive, it’s not. In fact, developing a speech system that runs with a standard sound card and no telephony is essentially free, once you have the PC.

 

In the next installment I plan on a discussion of a very simple speech application along with a Hello World tutorial. So check back in a few weeks.

Posted by kstep | 0 Comments

What do TIM and Brooktrout have to do with Speech? - A Telephony Interface Primer.

I remember as a kid taking two cans, punching holes in the bottoms, threading string into the holes and tying knots to hold the string. Then my friend and I would hold the cans so that the string was taut, and one of us would talk into a can while the other held their can up to their ear. We were never quite sure if we were actually hearing the voices through the cans or just hearing each other because we were so close. But it was a fun little experiment that entertained us for a while. The physics behind this assumes the bottom of the can you talk into vibrates with your voice, which in turn vibrates the string, which finally vibrates the bottom of the other can. So in theory we had built ourselves a voice user interface (the cans) along with a network (the string) that allowed us to talk to each other over a distance. It was a pretty crude interface, but nonetheless an interface.

 

With Microsoft Speech Server (MSS), there are a multitude of interfaces, all of which need the “cans” and “string” to tie them together. In previous installments I have discussed how a speech application is very similar to a web application. On one end you have the web server, which serves up the web pages, and on the other end you have a user sitting in front of a computer using a web browser, or in the case of a speech application, talking on a phone. The user types in an address such as www.microsoft.com in their browser and at the other end, the web server gets that request and sends back the start or default page of the web application. As the user interacts with the web application, more requests are made to the web server and the web server responds with new or updated pages back to the user’s web browser.

 

Think of MSS as a computerized version of a person with a web browser. MSS requests a speech application from the web server, the web server in turn sends back a web page, but instead of pictures and text, the speech server gets some special instructions that tell it to play prompts, and accept speech input for recognition. So fundamentally, the interface between the web server and the Speech Server is a standard web connection, but there is another interface between MSS and the telephony network. In this installment, I would like to talk in some detail about that interface.

 

If a user wanting to use the internet had no keyboard to “talk” to the internet with, and no screen to see the results of a request to the internet, that would make for a pretty boring user experience. So to allow a user to “talk” to the internet, we have a piece of hardware called a keyboard, and a piece of software that translates the electromechanical input from the keyboard to the text that is required to make a request. By the same token, to see results, we have software and a special circuit that translate the ones and zeros into an electrical signal, which is sent to the computer monitor to display the results of a web request.

 

Setting aside the use of a telephone for a moment, it is possible to plug a microphone and speaker into the MSS computer to speak to MSS, and then hear spoken words from MSS through the speaker. But having hundreds of users with microphones and headsets plugged into one machine is quite obviously not very practical. So to accommodate lots of users from anywhere in the world, MSS somehow needs to connect to an entire telephone network. “Out of the box,” that is not something MSS can do without some special hardware and software.

 

MSS understands computer networks and connects easily to a web server, but it "knows" very little about connecting to a telephone network. So Microsoft relies on third-party vendors to provide that telephone interface. As with a keyboard and monitor (the hardware), there is also accompanying software (drivers) that allows any company that makes those devices to connect them to a computer running Microsoft Windows. So for MSS to connect to a telephone network, other companies make special boards that fit into an expansion slot in your computer and make the physical connection to standard telephony systems. These companies also provide software drivers for the board and a software interface between MSS and the board called the Telephony Interface Manager (TIM).

 

These boards and software can be purchased from several sources. As a way of detailing how they work, I would like to tell you about the boards and software from Cantata Technology (formerly known as Brooktrout). Cantata offers several versions of their Brooktrout TR1000 boards. For development, testing and small deployments they offer a 4-port (4 telephone line) analog board. Other configurations use ISDN digital cards with up to 96 ports. For an inexpensive test system, you can purchase a Starter Kit from Cantata that includes a 4-port analog board, the TIM software, and a 90-day license for Microsoft Speech Server. I have used this configuration to set up a small office system that connects right to my standard telephone line.

 

First let’s talk about the TIM. The TIM software, as I mentioned, is shipped with the board. Without the TIM, the board would not be able to interface with MSS. The purpose of the TIM software is two-fold. First, it translates the digitized audio from the board into a format that is acceptable to MSS, and it takes the audio from MSS and translates it into a format acceptable to the TR1000 boards. Second, the TIM takes instructions from MSS to do things like answer incoming telephone calls, hang up, transfer calls and place outbound calls.

 

The installation of the Brooktrout TR1000 4-port analog telephony card and TIM was a rather simple process. The board does require a full-length board slot, so don’t try to install it in a compact desktop computer. The user manual is well written and provides excellent step-by-step installation instructions as well as troubleshooting techniques. It is important that you follow the instructions carefully. For example, they tell you to install the TR1000 software first, before installing the board. That way, when you do install the board, the Plug and Play features kick in and automatically detect and install the appropriate drivers. (The TR1000 software has recently been updated with a number of enhancements and fixes. The software is going through the final Microsoft certification for MSS 2004 R2, but has not been released as of this writing.) After you install the board, you are asked to run the configuration software to determine how you want the ports set up. For example, you can set how many rings occur before the line gets picked up, as well as whether the ports will be used for inbound or outbound calls. After the configuration, it’s time to install the TIM software. Since I was familiar with Intervoice’s TIM installation, I quickly recognized the process. The Brooktrout TIM software is really a re-branding of the Intervoice TIM.

 

I have installed three different versions of TIM and two different boards (the other being the Intel Dialogic board). None of them went as smoothly as this install did. I do think it’s fair to say, however, that nothing is problem free, and so I do have a couple of minor issues.

 

First, for some reason when installing the MMC interface for TIM, the system seemed to lock up during the install. There was no indication that the MMC snap-in had completed its install, but upon inspection of the system, it had been installed. I decided to re-install the entire system to see if the problem repeated. The second installation went without a hitch.

 

Second, near the completion of the TIM installation, a pop-up box appeared saying that the IviGrp group had been installed, and it asked me if I wanted to install a user for that group. At this point I could have chosen no, but I was not really sure. I looked for documentation to help explain the question, but the documentation had not yet been installed. So I guessed and chose yes, and I was immediately taken to the Computer Management console. My next thought was what user to install, and I decided not to do anything. Later, after successfully installing the TIM software (even without a new user), I was able to read and understand what the installation software was asking me to do. It turns out that to run some of the command-line utilities for the TIM software, your user account must be a member of the IviGrp group. So I was later able to assign my login to that group without a problem.

 

Following the install of the TR1000 and the TIM, I installed Microsoft Speech Server. To test the system from end-to-end, I plugged Port 0 of the Brooktrout board into the telephone jack in the wall and dialed the number. The first time I called, the system took three rings before it answered, even though the board was set up to answer on the second ring. After it answered, it took about 10 seconds before MSS sent me the “Welcome to Microsoft Speech Server” message. The delays were likely due to the server caching all the necessary software, because on the second call, everything worked without delay.

 

So that’s it. When compared to developing a speech application and then configuring and optimizing the application, installing the Brooktrout TR1000 and TIM software was a breeze. In the next installment, I will talk a little about configuring MSS and how MSS knows what application to run when a caller calls the system.

 

Posted by kstep | 3 Comments

Technologies You Should Know For Microsoft Speech Server

OK, I’m going to date myself here, but I’m sure some of you young puppies out there have seen this as well. There was an “I Love Lucy” show, where Lucy somehow ended up at a candy factory on the production line. Her job was to take the candy off the moving conveyor belt and place it into packaging as it slid by. At first, she had no problem. As time went on, however, the conveyor belt started to increase in speed, so much so she started eating some of the candy instead of packing it. Eventually all the candy began to fall on the floor around her.

 

Sometimes, when it comes to keeping up with technology, especially software development technology, that’s exactly how I feel. Just when I think I have caught up, suddenly there is a flood of new information to learn about. I know others have touched on this before, but I thought it might be interesting to sort of list the technologies that a developer using Microsoft Speech Server would need to know in order to effectively develop a speech application. And if you think this is all, wait until Speech Server 2007 comes out. While much of what you learn for Speech Server 2004 can be applied, there will suddenly be a new Visual Studio, with a new version of C#, along with a new Workflow paradigm for programming… and well you get what I’m saying.

 

As I stated in my last installment, writing speech applications for Speech Server 2004 is much like writing a web site, except you include speech stuff. That speech stuff is SALT (Speech Application Language Tags). Just knowing SALT won’t help much in the Visual Studio environment. In fact, the way the Speech SDK works, you really don’t write SALT at all. You write an ASP.NET application and use the Microsoft Speech controls that are included in the toolbox. By dragging and dropping these speech controls, you are in essence writing some HTML-like/SALT-like code that is ultimately converted by the web server into true HTML and true SALT. So that brings up the first technology you need to understand… ASP.NET. If you are already an ASP.NET programmer, you already have many of the skills necessary to build speech apps.

 

For those unfamiliar with ASP.NET, or for that matter any of the Microsoft .NET technologies, I recommend that you understand what .NET is and what can be done with it. Essentially it’s Microsoft’s latest technology for developing applications. Applications developed in .NET can be console applications, Windows desktop applications, web sites, web services, and just about any other kind of application you can imagine. Those of you who program using Java technologies will be very comfortable with the similarities between .NET and Java.

 

When you use Visual Studio (or any other development environment) to write ASP.NET web applications (including speech applications), you are really writing code for two computers. One computer is the server that serves up the application, called a web server (for Microsoft, that is IIS), and the other is the client computer that requests the web pages from the web server (for speech, that can be Microsoft Speech Server). The thing to keep in mind is that when you write an application using ASP.NET, you are writing code that tells the web server what to send to the client. It is up to the web server to take that code and convert it to code that is run in the client’s browser. There are cases, however, when you will write code directly for the browser and the server just passes that code to the client without doing any conversion.

 

So why would you write code that runs on the web server in some cases and write some code that runs on the client’s browser in other cases? Well this is the dilemma that plagues web development; and this is one reason why Microsoft developed the .NET environment. The idea is that if all I had to do was write code for the server then theoretically, I should be able to write all of my code in one language (for example, C# or Visual Basic.NET) so that it runs exclusively on the web server, and rely on the fact that the web server will convert that code to HTML and JavaScript (as well as SALT for speech) which runs on any client’s browser. For standard web site applications, this works fairly well.  However in practice, there are times when writing code directly for the browser (using HTML and JavaScript) makes the system more efficient. So from a design perspective, it makes sense to write code for the browser so the system as a whole is more efficient. So why not just write HTML and JavaScript? Well, there are also times that there is code that MUST run on the web server. For example, since I can’t have every client connect to a database server to get data, I have to rely on the web server to make those connections on behalf of the client. Therefore, I need to write code for the server that queries the database.
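To sketch that split between server code and browser code, here is a small example. It uses Python and an in-memory SQLite database as stand-ins for C#, ASP.NET and a real database server, so treat every name in it (the flights table, the flight numbers, render_page) as hypothetical. The query runs on the server. The HTML and the one line of JavaScript are simply handed to the client to run.

```python
import sqlite3
from html import escape


def build_demo_db() -> sqlite3.Connection:
    """Create a throwaway in-memory database standing in for a real one."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE flights (number TEXT, status TEXT)")
    conn.executemany(
        "INSERT INTO flights VALUES (?, ?)",
        [("CRK100", "On time"), ("CRK200", "Delayed")],
    )
    return conn


def render_page(conn: sqlite3.Connection, flight_number: str) -> str:
    """Server-side code: query the database, then emit markup for the client."""
    row = conn.execute(
        "SELECT status FROM flights WHERE number = ?", (flight_number,)
    ).fetchone()
    status = row[0] if row else "Unknown flight"
    return (
        "<html><body>"
        f"<p id='status'>{escape(status)}</p>"
        # Client-side code: passed through untouched for the browser to run
        "<script>document.title = 'Flight status';</script>"
        "</body></html>"
    )


if __name__ == "__main__":
    print(render_page(build_demo_db(), "CRK200"))
```

The client never sees the database at all. It only sees the finished page.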

 

So already we see at least three programming languages that we must learn to be an efficient web/speech developer in the Microsoft world: C# or Visual Basic .NET, HTML, and JavaScript (also known as ECMAScript, with a Microsoft version called JScript).

 

But that just gets us part way. If we need to connect to a database we also need to understand the database server and its language, usually SQL (Structured Query Language). Although SQL is pretty universal for database systems from different companies, there are usually some proprietary nuances that need to be learned. For example, if you use Microsoft SQL Server, you will need to learn T-SQL, which is very SQL-like but has its own syntax for writing scripts stored on the server (called stored procedures).

 

If that weren’t enough, many times we need to transfer data between computer systems that are not physically connected. For various reasons, these computers may not be able to have direct connections. Since nearly all computers these days are connected to the Internet, it stands to reason that we need a format that allows us to transfer data in a way that computers connected to the Internet can understand. That calls for another technology, called XML. Without going into details on XML, let me just say that speech technology in the telephony world relies heavily on XML. As a matter of fact, the whole .NET environment relies heavily on XML. Fortunately the .NET Framework hides most of the XML stuff from us.

 

In speech there are three XML-based formats that you must be familiar with. They are GRXML (Grammar XML), SML (Semantic Markup Language) and SSML (Speech Synthesis Markup Language). Let’s talk about each of these briefly.

 

In Microsoft Speech Server, recognition can only occur if what is spoken by the user matches a pre-determined “list” of things that we expect the user to say. That list of potential statements is called a grammar. Part of the challenge of writing speech apps is knowing ahead of time what we expect a response to a question to be. For example, if I ask, “What color do you like?” as the programmer, I have to provide to the recognition engine a grammar that has a list of the colors I expect the user to say. That list must be in a special text file that is an XML document that uses the GRXML standard.
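For the color question, a bare-bones grammar might look like the one embedded in the sketch below. The markup follows the W3C SRGS XML format that GRXML files use. The rule name and the three colors are made up for illustration, and a real grammar would normally also carry semantic tags. The few lines of Python simply list the phrases the grammar would accept.

```python
import xml.etree.ElementTree as ET

# W3C SRGS namespace used by GRXML grammar files
SRGS_NS = "http://www.w3.org/2001/06/grammar"

# Hypothetical grammar for the question "What color do you like?"
COLOR_GRAMMAR = """
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" root="color">
  <rule id="color" scope="public">
    <one-of>
      <item>red</item>
      <item>green</item>
      <item>blue</item>
    </one-of>
  </rule>
</grammar>
""".strip()


def phrases_in_grammar(grxml_text: str) -> list:
    """Return the text of every <item> in the grammar, in document order."""
    root = ET.fromstring(grxml_text)
    return [item.text.strip() for item in root.iter(f"{{{SRGS_NS}}}item")]


if __name__ == "__main__":
    print(phrases_in_grammar(COLOR_GRAMMAR))
```

If the caller says “purple,” nothing in this list matches, and the engine has nothing to recognize against.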

 

Whenever we tell the speech engine we want it to recognize something, two things have to be present for the recognition engine to do its work. First we must tell it which GRXML file, or grammar, to use, then we must provide it with the digitized audio data spoken by the user. If the recognition engine is able to match the digitized audio with something in the GRXML file, the recognition engine then produces another XML document called SML. In the SML document there will be the text of what the engine “thought” was spoken along with a confidence score. The confidence score is a number from 0.0 to 1.0 where 1.0 represents 100% confidence and 0.0 is no confidence. We will talk more about this in later installments.
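Here is a rough sketch of what reading that result could look like. The sample SML document is illustrative only: the element and attribute names are my assumption of a typical result for a color grammar, and the 0.5 threshold is an arbitrary choice, not a recommended setting.

```python
import xml.etree.ElementTree as ET

# Illustrative SML result; names and values are assumptions for this sketch
SAMPLE_SML = (
    '<SML text="red" utteranceConfidence="0.82" confidence="0.82">'
    '<color confidence="0.82">red</color>'
    "</SML>"
)


def read_result(sml_text: str, threshold: float = 0.5) -> dict:
    """Pull out the recognized text and decide whether to trust it."""
    root = ET.fromstring(sml_text)
    confidence = float(root.get("confidence", "0.0"))
    return {
        "text": root.get("text", ""),
        "confidence": confidence,
        # Below the threshold, an application would typically re-prompt or confirm
        "accepted": confidence >= threshold,
    }


if __name__ == "__main__":
    print(read_result(SAMPLE_SML))
```

Raising the threshold means fewer wrong answers get through, but more callers get asked to repeat themselves.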

 

To convert written text to spoken words, a text-to-speech (TTS) engine is used. In order for the TTS engine to do its work, it requires an XML document that follows the SSML standard. This document not only has the text that is to be converted to digitized audio, but also some special markup that indicates how the text is to be spoken. For example, when most people speak (unlike one of my college professors) there is inflection and volume change (prosody) as well as some other information. The point is, this SSML document tells the TTS engine what text to convert and how to convert it to a digitized representation of audible speech.
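As a small illustration, the sketch below builds a minimal SSML document with a prosody element that controls speaking rate and volume. The speak and prosody elements and the namespace come from the W3C SSML specification. The function name and defaults are hypothetical, and whether a particular TTS engine honors every attribute value is something to check against its documentation.

```python
import xml.etree.ElementTree as ET

# W3C SSML namespace and the predefined XML namespace (for xml:lang)
SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_ssml(text: str, rate: str = "medium", volume: str = "medium") -> str:
    """Wrap text in <speak>/<prosody> so the TTS engine knows how to say it."""
    ET.register_namespace("", SSML_NS)  # serialize SSML as the default namespace
    speak = ET.Element(f"{{{SSML_NS}}}speak", {"version": "1.0"})
    speak.set(f"{{{XML_NS}}}lang", "en-US")
    prosody = ET.SubElement(
        speak, f"{{{SSML_NS}}}prosody", {"rate": rate, "volume": volume}
    )
    prosody.text = text
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml("Welcome to Microsoft Speech Server", rate="slow", volume="loud"))
```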

 

So let’s recap. Some of the technologies you need to know to program speech in the Microsoft world are:

 

.NET – Microsoft’s application development framework.

 

ASP.NET – Web application development.

 

C# or Visual Basic .NET – One of the languages that the .NET framework understands and is used to program server side code for ASP.NET.

 

HTML – Hypertext Markup language for telling a client’s browser how to display things on the screen.

 

JavaScript – A scripting language that tells the client browser what to do.

 

SQL – A query language used to return results of a query from a database, as well as the proprietary form of SQL for each brand of database.

 

XML – A markup language used to transfer data across the internet.

 

GRXML – An XML document containing a grammar.

 

SML – An XML document that contains the response from the Speech Server recognition.

 

SSML – An XML document that contains text and prosody markup for a TTS engine.

 

SALT – The Speech HTML tags and markup used by the browser to run a speech application. Understand, however, that if you are using the Speech SDK for Microsoft Speech Server you will not directly write SALT in your application.

 

While this all seems like a lot, I certainly don’t want to dissuade you from learning how to program speech applications. Much of what you need to learn for speech is very useful outside speech application development. For those of you so inclined, be assured that by learning speech now, you will have a skill that few people have. Speech as a technology will only become more popular over time, so you will be ahead of those who know nothing about it. Plus… it’s pretty cool stuff.

 

Posted by kstep | 3 Comments

The Beginning - What is Microsoft Speech Server?

When I talk to people about what I do, they think it’s really cool that I am involved in speech recognition software… that is, until they find out the types of systems I program. You know what I am talking about if you have been in this business a while. The response is just about always the same. “Oh, you’re the one!” I tell you this because, unlike the kind of speech recognition that people have seen in the movie 2001: A Space Odyssey, the types of systems we are talking about on the GotSpeech.Net web site are not about dictation or free-form speech. They are not about controlling our environment by telling the computer to turn lights on and off and adjust the music volume. The systems we are talking about here deal with the ability of a user to provide and get information in much the same way we would provide and get information if we were on a web site.

 

This brings me to the point of this post. Although one could argue that the systems we develop using Microsoft Speech Server (MSS) could ultimately do the kinds of things described above, what we are really talking about is the ability of an application to allow a user to talk over the phone to get at information, or provide information, to a computer in circumstances when a computer is unavailable or impractical.

 

Fundamentally MSS is really nothing more than a web browser, kind of like Internet Explorer (IE). But wait, you say. In Internet Explorer I can type things in, read things, do searches, and see pretty pictures. So how is Microsoft Speech Server like IE? Especially when you are using a telephone and speech? Well, if you think about it, your keyboard does not have a browser built in, and neither does your computer screen. When you use your keyboard, your keystrokes are translated through some software and hardware into the letters and words you see in the browser software. So too with MSS (the browser): when you speak through the telephone, your words are somehow translated through some hardware and software into words that are entered into the browser (MSS). And what about the words and pictures you see in IE? Well, I will concede that MSS can’t really “say” a picture, unless of course there is an associated description of that picture that can be spoken. But MSS can certainly take words and convert them to spoken sounds that you can hear over the phone. So is it possible that MSS is like an internet browser? Well, yes. In fact, in some cases that is what it is called. In the case of Microsoft it is called a SALT or Voice browser. (In future releases of MSS it can also be a VXML browser and handle managed code.) I will leave the discussion of VXML and managed code until after MSS 2007 is released.

 

SALT stands for Speech Application Language Tags. VXML stands for Voice XML. The two are surprisingly similar, and both do essentially the same thing. They provide a special markup language to a speech browser in much the same way that HTML provides a special markup language to IE (or Netscape, Firefox, and Mozilla). With the markup language we can tell the browser how to present and accept information from the user. MSS 2004 understands SALT. In fact it also understands HTML and JScript, just like IE. So what is SALT used for? Look at SALT as special HTML, because that is what it is. It is special tags used exclusively for speech recognition systems like MSS. But before we go much further, one thing I don’t want to do is leave our non-web developer brothers and sisters behind. So let me talk a little about the infrastructure that is required for MSS to work.

 

For non-web developers, the learning curve for developing speech applications for MSS is going to be a bit steeper. This is because if you don’t already know ASP.NET (Microsoft’s web development framework), it will be something you must become familiar with in order to effectively write MSS applications. (Yes, you can write speech apps without ASP.NET, but not nearly as easily.) If you are already a web programmer (and Java/JSP programmers are welcome, by the way) then you already have the basis for quickly understanding how MSS works.

 

To make a speech application work on MSS, you must first write a web application, with speech stuff included. Then you deploy your application on a web server (in the case of Microsoft Speech Server, we use IIS). Then instead of using a computer with IE to browse the web site, you use a computer with a speech browser, i.e. MSS; and instead of using a keyboard and mouse to get input and output, you use a telephone attached to a special telephony card that is plugged into the computer that has MSS installed (GASP). The application is written just like a web application written for the Internet. The difference, of course, is that instead of having just HTML (and a scripting language like JavaScript), we add those special HTML tags called SALT to tell MSS when to recognize speech ( <LISTEN> ), when to say something back to the caller ( <PROMPT> ), when to record sound to a .WAV file ( <RECORD> ) and when to do telephone stuff like answering the call or placing an outbound call ( <SMEX> ). We will talk more about this in a later installment.
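To give a feel for what such a page might look like by the time it reaches the speech browser, here is a hand-written fragment embedded in a short sketch. The element names come from the SALT specification. The ids, the grammar file name and the namespace URI are my assumptions for illustration, and remember that with the Speech SDK the web server generates this markup for you. The Python around it just lists the SALT tags it finds.

```python
import xml.etree.ElementTree as ET

# Namespace URI assumed here for SALT elements
SALT_NS = "http://www.saltforum.org/2002/SALT"

# Hypothetical page: ask a question, then listen using a color grammar
SAMPLE_PAGE = """
<html xmlns:salt="http://www.saltforum.org/2002/SALT">
  <body>
    <salt:prompt id="askColor">What color do you like?</salt:prompt>
    <salt:listen id="recoColor">
      <salt:grammar src="colors.grxml" />
    </salt:listen>
  </body>
</html>
""".strip()


def salt_tags(page_text: str) -> list:
    """Return the local names of all SALT elements, in document order."""
    root = ET.fromstring(page_text)
    prefix = f"{{{SALT_NS}}}"
    return [el.tag[len(prefix):] for el in root.iter() if el.tag.startswith(prefix)]


if __name__ == "__main__":
    print(salt_tags(SAMPLE_PAGE))
```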

 

I don’t want to complicate matters too much but I know there are a lot of you out there screaming at your screen, “But Ken, what about the multimodal capabilities?”  Yes there are multimodal capabilities. Multimodal is a way for speech applications to be incorporated into standard web applications. This allows not only for speech recognition on standard web pages, but also it allows the user to see the results of their speech. So with multimodal you have the benefits of speech along with the capabilities of seeing pictures as well as using other input devices such as keyboards, mice and styluses. Microsoft’s vision of multimodal is best realized using “smart” devices like Pocket PC, where you can surf web pages with speech, as well as see text and pictures. Imagine for example checking into your flight at an airport and seeing on your screen the possible seats that are available.

 

That’s a lot to absorb for you beginners. And there is sooooo much more I could talk about. I hope to elaborate more on many of the subjects we talked about here, in future posts. So stay tuned. As for the rest of you, please feel free to elaborate more on anything discussed here.

Posted by kstep | 5 Comments

Speech Teach - The First Installment

Hello and thanks for taking the time to read this first installment of Speech Teach. My name is Ken, and if you have been around the Microsoft Speech arena over the past 3 years, you may very well have met me before. I was one of the few instructors teaching the Microsoft Speech SDK with Intervoice. (I truly was in the right place at the right time.) In addition, you may have seen me when I helped Microsoft introduce the Speech SDK at a couple of introductory demonstrations at SpeechTek in New York and San Francisco. So what am I doing now? Well, I nervously left Intervoice to start my own business in, you guessed it, speech software development. I am now the proud owner of a sole proprietorship with dreams of making it big. My company is called CRK Software and I specialize in .NET development with the goal of writing the next killer app… hopefully in speech recognition.

 

After talking with Brandon Taylor about GotSpeech.Net and watching it over the past several months, I’m beginning to believe that this site will soon become (if it hasn’t already) one of the premier sites for information on Microsoft Speech. So I thought it might be appropriate to use my experience to, hopefully, help improve the value of this site. I want to thank Brandon for giving me this opportunity, and I want to thank Marshall Harrison, recent recipient of the Microsoft MVP award, for making this site a success along with the other contributors.

 

Now that I’ve been afforded the opportunity to write a blog, my next task was deciding what to focus on. I decided that there are many folks out there who truly need a way to begin. They need some source for asking about the basics and learning from someone who frankly started with very little knowledge of speech recognition and continues to learn a great deal about speech, and in particular Microsoft Speech. So this blog is dedicated to those who are both literally and figuratively beginners in the speech recognition world as it pertains to telephony. In a later installment, I will talk more about the version of speech recognition that this site is dedicated to. As you may well know, Microsoft is involved in desktop as well as telephony-style speech recognition. My knowledge of speech is in the latter.

 

So there you have it. I hope to post once or twice a month. If there are any subjects that you would like to read about in future installments, please let me know. Otherwise, I will start from the beginning and slowly, with your help, take you through some fundamentals. Hopefully I will provide the new arrivals to speech a much-needed foundation for understanding this technology, and maybe provide a different perspective to those who have been around this technology a while. More than anything I want this to be a basis for discussion of the basics.

Posted by kstep | 3 Comments