
A Free UCMA VXML IVR Server

While many of you who use this site are most interested in Speech Server 2007 and UCMA Workflow, unfortunately Microsoft’s support for these products is waning. UCMA itself does seem to be moving forward, but I don’t think Microsoft is motivated to provide any real support for stand-alone IVR server development outside of Lync. As a result, developing stand-alone IVRs using .NET is getting more difficult instead of easier.

 

In response to this, I am now a partner in a new company called FoneRage. Its purpose is not so much to develop supported platforms for UCMA or development tools for IVRs in the .NET environment, but FoneRage does require a platform to run its products. After much deliberation, FoneRage made a strategic decision to move development to VXML, mainly because any IVR application that we develop can more easily move to platforms outside of UCMA if for some reason UCMA goes away. VXML is a de facto standard in the IVR world, so there are many supported platforms that run on Microsoft servers as well as Linux and Unix. At FoneRage we really love the Microsoft development environment for many reasons. While cost is certainly a factor, the reality is that we are most comfortable with the .NET environment and certainly love the Visual Studio development tools. Using UCMA and developing our own IVR platform allows us to serve up IVR applications that don't rely on systems that, in this industry, can be quite expensive and are almost always set up to charge by the port. That is probably why so many of you use Speech Server and UCMA.

 

In order for us to continue development on our IVR products using a Microsoft Server environment, we needed two things. First we needed an IVR server platform. So using UCMA and its built-in VXML Browser we were able, over the last 8 months or so, to come up with a preliminary version of a working VXML IVR server platform. At present the FoneRage IVR server allows for simultaneous inbound and outbound calls at a load that is theoretically limited only by the hardware resources. It is currently being developed and tested on an x64 Windows 2008 R2 server. The FoneRage IVR server can make outbound calls using MSMQ, much like Speech Server 2007, and we are also working to expose a method call that runs an outbound call without MSMQ.

 

At present there is no logging built in, but we are working on that and hope to have logging available soon. As part of the install there is a console version of the service that allows the IVR to run in console mode, which does output log messages to the console window so you can debug issues. There are also some tie-ins to the Windows Event Viewer.

 

In addition to an IVR platform, we also did not want to hand-write VXML for every application we build. So we also developed an API that allows us to write .ASPX pages in C# that output VXML. Which is to say, when we are done with the API, we will not need to know much about VXML in order to develop a VXML application. The API will do that for us.
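To give a feel for what "C# that outputs VXML" can mean, here is a minimal sketch. It is not the FoneRage API, which has not been released; the class and method names are hypothetical, and it only assumes the standard VoiceXML namespace and the System.Xml.Linq classes that ship with .NET 3.5.

```csharp
using System.Xml.Linq;

// Hypothetical helper, NOT part of the FoneRage API: builds a minimal
// VoiceXML 2.1 document that speaks one prompt and then ends.
public static class VxmlSketch
{
    // Standard VoiceXML namespace defined by the W3C.
    private static readonly XNamespace Vxml = "http://www.w3.org/2001/vxml";

    public static string SayPrompt(string text)
    {
        XElement root = new XElement(Vxml + "vxml",
            new XAttribute("version", "2.1"),
            new XElement(Vxml + "form",
                new XElement(Vxml + "block",
                    new XElement(Vxml + "prompt", text))));

        // XElement escapes characters such as & and < in the prompt text.
        return root.ToString();
    }
}
```

An .ASPX page could write the string returned by VxmlSketch.SayPrompt("Hello World") to its response, so the developer never types the markup by hand.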

 

I write this blog to let you know that our intent was never to develop the IVR Server and API only for our benefit. We also wanted this project to benefit anyone who does not want to go through the same trials and errors we have with UCMA and VXML. We have already made a version of the VXML server available on our web site for free and we intend to release the API soon, also for free. If you would like to try out either, the only thing we ask is that you provide feedback and that you register on our web site at www.fonerage.com.

 

At present both of these products are in a “development” state, maybe not even beta level code yet, but as time goes on we anticipate a great deal of improvement. We expect to be producing our own enterprise level systems, so we have as much interest in making these products as foolproof as you do.

 

We are releasing these products for free because we believe in two things that are important to the IVR community at large. First, we believe that IVRs have a bad rap. They are frustrating to users and expensive to build. FoneRage wants to make IVRs friendly and cheap. Second, we believe in giving back to the development community since the development community is so generous in helping each other. Much of what we know about VXML, IVRs, UCMA and the like comes from the generous information sent out by developers like you. This is our way of continuing that tradition.

 

Even if you are just curious and do not necessarily intend to build a system with our IVR, I encourage you to give it a try. In the process, give us some feedback and help us improve it, and maybe it will ultimately become a viable alternative to the Speech Server and UCMA workflow environment. Not only does that obviously benefit FoneRage, but it will clearly benefit you and the IVR community.

Posted by kstep | 0 Comments

A UCMA 3.0 "Hello World" App.

Writing a “Hello World” app using UCMA is rather simple. There are a few pitfalls, but I will walk you through them.

Obviously the first thing we need to do is make sure you have your development environment set up properly. At a minimum you will need a 64-bit Vista, Windows 7 or Server 2008 operating system. Additionally you will need Visual Studio 2008 or 2010. I have not tried this with the free Visual Studio Express edition. You will also need to install the UCMA SDK. Finally you will need some sort of soft phone like x-lite.

For those of you who are using Speech Server 2007 and are wondering if you can have both UCMA and Speech Server 2007 installed on the same machine, it is possible.  You just need to make sure Speech Server is not running when using a UCMA app and vice-versa.

For those who have UCMA 2.0 installed, you will need to uninstall it first before installing UCMA 3.0.  I should also note that installing and uninstalling UCMA 2.0, then installing UCMA 3.0 did cause a problem with my Speech Server install, or at least I think it did.  Ultimately I had to uninstall UCMA 3.0 then I had to uninstall .NET 4.0 framework (Speech Server will not reinstall with .NET 4.0), then I had to reinstall Speech Server, reinstall .NET 4.0 through the Visual Studio install and then reinstall UCMA 3.0. After all that, Speech Server 2007 functioned properly.

UCMA 3.0 installs quite a few components, including a speech and TTS engine, the Lync redistributables, the speech Workflow API and the other UCMA APIs.

After installation of all the components you are ready to crank up Visual Studio.  In Visual Studio:

1.       Click File/New/Project.

2.       Select  .NET Framework 3.5 from the dropdown that lists the different versions of the frameworks installed. You must be in .NET 3.5 mode to see the Communications Workflow template.

3.       In the Installed Templates expand out Visual C# (or if you prefer Visual Basic) and click on Communications Workflow.

4.       For this app we will choose Inbound Sequential Workflow Console Application.

5.       In the Name: field name the app UCMA Hello World.

6.       Choose your solution location in the Location: field.

7.       Click OK.

8.       In the Select Language dialog box click OK. You should now be presented with what may look rather familiar to Speech Server developers.

In the toolbox you should now see the familiar Workflow activities as well as the UCMA Speech activities that were loaded when you installed the UCMA SDK.

1.       Drag and drop a SpeechStatementActivity into the CommunicationsSequenceActivity just above the DisconnectCallActivity.

2.       Right Click the speechStatementActivity1 activity and choose properties.

3.       In the MainPrompt property, type Hello World.

At this point we have essentially completed the application and it will run. The only problem is we need to tell the code that we are not connecting to the default endpoint which is a Lync server.  Instead we want it to use a standard SIP connection.  To do that we need to make one little code change.

1.       In Solution Explorer double-click to open the Program.cs file.  It is this file that starts up the workflow runtime and in so doing starts up the CollaborationPlatform.  The CollaborationPlatform is what handles the incoming calls.

 

2.       If you scroll down to the Initialize() method you should find a line of code that looks like this:

_endpoint = new ApplicationEndpoint(_collabPlatform, settings);

 

3.       Just above the line of code indicated in step 2 above, enter this code:

settings.IsDefaultRoutingEndpoint = true;

 

4.       Now start the application in Debug mode.

 

5.       After compiling and running you should see a console window that displays the following:

 

Step 1 of 3: Initializing the Workflow Runtime... Complete.

Step 2 of 3: Starting CollaborationPlatform... Complete.

Step 3 of 3: Establishing Endpoint... Complete.

Press Enter to stop. 

 

NOTE: If you do not see “Establishing Endpoint... Complete”, there is a problem, and usually the problem is that some other application is listening on port 5060 (the standard SIP port). For example, Speech Server may still be running on your box. If so, shut down the process and try again.
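If you would rather check for a port conflict before starting the app, a small helper along these lines can tell you whether anything is already listening on a TCP port. This is only a sketch: the class and method names are made up for illustration, and it relies solely on the standard System.Net.NetworkInformation classes, not on anything in UCMA.

```csharp
using System.Linq;
using System.Net.NetworkInformation;

// Hypothetical helper: reports whether any process on this machine
// currently has a TCP listener open on the given port.
public static class PortCheckSketch
{
    public static bool IsTcpPortInUse(int port)
    {
        return IPGlobalProperties.GetIPGlobalProperties()
            .GetActiveTcpListeners()          // all local TCP listening endpoints
            .Any(endpoint => endpoint.Port == port);
    }
}
```

Calling PortCheckSketch.IsTcpPortInUse(5060) before the endpoint is established would show whether Speech Server, or anything else, is still holding the SIP port.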

 

At this point you are ready to dial into the app.  Be sure your softphone is able to connect to a transport of TCP.  x-lite is capable of doing this by specifying it in the SIP URI.  For example if I use x-lite on my local box I need to use something like this:

 

sip:1234@localhost:5060;transport=tcp
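If you test against several machines or ports, it can help to assemble that URI in code rather than retyping it. The helper below is a hypothetical sketch that simply concatenates the parts in the same shape as the example above; it does no validation beyond the port range.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: builds a SIP URI of the form
// sip:user@host:port;transport=xxx for pasting into a soft phone.
public static class SipUriSketch
{
    public static string Build(string user, string host, int port, string transport)
    {
        if (port < 1 || port > 65535)
            throw new ArgumentOutOfRangeException("port");

        string uri = "sip:" + user + "@" + host + ":" +
                     port.ToString(CultureInfo.InvariantCulture);

        // Only append the transport parameter when one was requested.
        if (!string.IsNullOrEmpty(transport))
            uri += ";transport=" + transport.ToLowerInvariant();

        return uri;
    }
}
```

SipUriSketch.Build("1234", "localhost", 5060, "tcp") returns the same string shown above.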

 

As you can see, this is a very simple application. From here, however, Speech Server developers can probably get going with some decent development. Note that the “server” for this application is a console application. Clearly, putting a console application into production is probably not a good idea. Here is Michael Dunn’s solution from a UCMA 2.0 blog. I wrote my own Windows Service app that calls the necessary methods from the Program class, and it seems to work fine for now.
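For readers who have not written a Windows Service before, the general shape looks something like the sketch below. It is an assumption-laden outline, not the service described above: the two delegates stand in for whatever start-up and shut-down methods your own Program class exposes, and the class and service names are hypothetical.

```csharp
using System;
using System.ServiceProcess;

// Sketch only: startHost and stopHost stand in for the methods of your own
// Program class; their names are NOT taken from the UCMA project template.
public class IvrHostService : ServiceBase
{
    private readonly Action _startHost;
    private readonly Action _stopHost;

    public IvrHostService(Action startHost, Action stopHost)
    {
        if (startHost == null) throw new ArgumentNullException("startHost");
        if (stopHost == null) throw new ArgumentNullException("stopHost");

        ServiceName = "IvrHostService"; // hypothetical service name
        _startHost = startHost;
        _stopHost = stopHost;
    }

    // Called by the Service Control Manager when the service starts.
    protected override void OnStart(string[] args)
    {
        _startHost();
    }

    // Called by the Service Control Manager when the service stops.
    protected override void OnStop()
    {
        _stopHost();
    }
}
```

The service's Main method would then call something like ServiceBase.Run(new IvrHostService(Program.Start, Program.Stop)), where Start and Stop are placeholder names. The start delegate should return promptly instead of blocking the way the console version does while it waits for Enter.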

 

Later I will blog on some pitfalls you are likely to run across and some suggested solutions.

Posted by kstep | 6 Comments

UCMA 3.0 RC is available for download.

The release candidate for UCMA Managed API 3.0 SDK is now available for download. 

To use this SDK you must be running a 64 bit client operating system or Windows Server 2008.  Owners of XP, 64 bit or otherwise, are not invited to this party.  You will also need Visual Studio 2008 or 2010 set for .NET 3.5.  After installation, you should see a starter template under C# or Visual Basic that creates a starter workflow page.

I will blog later about how to get a "Hello World" app up and running in your development environment.

Posted by kstep | 1 Comments

Using UCMA to write IVR Applications

It looks like SpeechServer 2007 is in the sunset of its life and many are wondering how to move forward. While I am disappointed in the direction Microsoft is going with regard to IVR development, I continue to hold out hope that Microsoft will provide a means to develop speech and DTMF IVR platforms using workflow. UCMA 2.0 and the soon-to-be-released UCMA 3.0 appear to be a possibility if you are insistent on using managed code for your development.

At the outset, I should point out that UCMA 2.0 is not a one-for-one replacement of Speech Server. It is not intended to be. UCMA is intended as an SDK that gives you tie-ins to OCS (soon to be Lync). As part of its SDK you have the capability to develop IVR applications that can use DTMF and/or Speech Recognition in a workflow environment, or if you prefer, without workflow, using the raw Speech API. For those who have built Speech Server 2007 apps, there is a lot that is familiar. There is also a lot that is no longer available, and quite a few pitfalls, which I will try to blog about later.

For example you still have QA's, Statement Activities, telephony control, goto activities, commands, no-reco and silence handling and the ability to create your own custom activities.  Things that are missing are the built-in softphone debugger, grammar editors, prompt database and some of the built-in activities such as the Record activities, navigable list, form filling dialog, get and confirm, detect answering machine, menu, validators, SALT and VXML interpreters, set task status and some other features. The net result is, if you are using any of the features not included in UCMA, you will have to roll your own, which of course increases your development time.

I have been working with UCMA for about 4 months now. All in all I have been able to overcome the shortfalls in UCMA by building my own custom activities. While I have not tried to duplicate all the activities that Speech Server provided, I am fairly convinced I could. Building grammars is more difficult without the use of a grammar builder/tester. Fortunately the GRXML grammars built using the grammar builder from SpeechServer work very well in UCMA. So if you really need a grammar builder you can always use SpeechServer's. Hopefully Microsoft sees fit to add a grammar builder plugin for VS 2008 and VS 2010 as part of the UCMA SDK. For me the prompt database that came with SpeechServer was pretty much useless except for small IVR apps, so I don't miss that.

Debugging has not been much of a problem without a built-in softphone.  I simply point my x-lite soft phone to the box running the UCMA IVR app and I can add breakpoints in the code and test much like I did in SpeechServer.  Logging is non-existent.  So if you need to do any kind of logging, you will need to create your own.  Creating logging capabilities is not a big deal unless you rely on the mountains of details that the ETL logs provided.  If so, you will need to create log output for those parameters you most want to see.
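As an illustration of how little is needed to get started, here is a minimal roll-your-own logger. It is a sketch under simple assumptions: one tab-separated line per event appended to a text file, with a lock so that simultaneous calls do not interleave their writes. The class name and the notion of a call ID string are my own placeholders, not anything supplied by UCMA.

```csharp
using System;
using System.IO;

// Hypothetical minimal logger: appends one timestamped, tab-separated
// line per event to a plain text file.
public sealed class CallLogSketch
{
    private readonly object _gate = new object();
    private readonly string _path;

    public CallLogSketch(string path)
    {
        _path = path;
    }

    public void Write(string callId, string message)
    {
        // {0:o} formats the UTC timestamp in round-trip (ISO 8601) form.
        string line = string.Format("{0:o}\t{1}\t{2}{3}",
            DateTime.UtcNow, callId, message, Environment.NewLine);

        // Serialize writers so lines from concurrent calls stay intact.
        lock (_gate)
        {
            File.AppendAllText(_path, line);
        }
    }
}
```

A custom activity could hold one shared instance and call Write with whatever parameters you most want to see, such as the recognized text or the confidence score.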

One of the big things I do miss is the set of performance counters that SpeechServer provides. Again, those can be built, but as always, it adds development time. The other thing that seems to be a problem is supervised transfers. To date, I am not sure how to get that working. Also, if you think documentation on SpeechServer is sparse, wait until you get into UCMA. While there is some help and some sample apps, much of it does not answer many of the issues I ran into and some of it is just plain wrong.

So what are the advantages of UCMA? For one, the SDK is free to use and there is no requirement to have a licensed copy of OCS. Unless you need the ability to do things like monitor presence, or build in instant messaging (Instant Messaging activities are included in the UCMA SDK), the UCMA SDK will allow you to build IVR apps that can have all the functionality of a SpeechServer app. Another advantage, although my statement here is based on observation rather than formal testing, is that UCMA IVRs appear to use fewer system resources than SpeechServer does. From that perspective you should be able to handle more call volume on a single box. Again, I have not tested this yet, but based on the CPU and memory utilization of an equivalent app, there does seem to be quite a difference in scalability.

Should large enterprises migrate to UCMA? Well I guess the answer to that question is, as always, it depends. This is not a tried and true technology. The time and effort to build an equivalent app in SpeechServer is much less than is required for UCMA, but as I noted in a blog several years ago, the tools for building VXML apps won't provide much relief either. If you have a SpeechServer app that you want to migrate to UCMA, there will be quite a bit of work to do; it is by no means a drag and drop operation. On the other hand, much of the code could be reusable if you haven't used the built-in activities from SpeechServer that I mentioned were not available in UCMA.

My overall assessment of UCMA development for IVRs is... I am disappointed but at the same time encouraged. To be honest, I continue to consider other avenues to IVR development. I am disappointed because I had to jump through a lot of hoops to make a working UCMA IVR app. I am disappointed that Microsoft is backing away from further development of SpeechServer. Now that I understand the pitfalls of UCMA development, I am convinced it is a viable, if not fully proven, technology. I am encouraged because there is at least a path to the future of IVR development in a .NET managed workflow environment. I am also encouraged because I can now do IVR development without the need to purchase licenses for OCS features that I will never use.

We .NET IVR developers seem to have taken a huge step backwards with UCMA. My only hope is that from here things improve.

Posted by kstep | 4 Comments

Using Windows 7 for Speech Server Development

For those who may think that Windows 7 can't be used to develop Speech Server 2007 applications, I have found it can work.  Below are the steps I took to install Speech Server on a fresh install of Windows 7.

1.  Follow all the instructions explicitly for Windows Vista in the ReleaseNotes.htm file that is in the OCS2007SpeechServer folder.

2.  One of the steps has you install the Visual Studio 2005 Update for Windows Vista. It does apply to Windows 7.

3.  Don't do what I tried to do and install the Hotfix for Microsoft .NET Framework 2.0. It is only for Windows Server 2003 and XP.

4.  Be sure to follow very carefully the install for IIS and MSMQ on Windows Vista. This also takes the place of the step entitled "To Enable Integrated Windows Authentication."

5.  All other steps are self-explanatory.

6.  The last thing you will do, after you have installed at least one language, is install the Speech Server Cumulative Update. This gets rid of the Red X problem and a few other things. This update cannot be applied directly by running the .MSP file. Because of the security involved, you must run this install as administrator. The problem is that if you right-click on the file to choose Run as administrator, that option is not found. So what I did was create a simple .BAT file in Notepad, add the path to the file, and then run the batch file with the Run As Administrator option selected.
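The batch file is what worked here. For anyone who prefers to script the elevation, one untested alternative is to launch msiexec with the "runas" verb from a small program. The sketch below only builds the start information; the class name is made up, the path you pass in is your own, and whether the Cumulative Update installs cleanly this way is an assumption.

```csharp
using System.Diagnostics;

// Hypothetical alternative to the .BAT file: describes an elevated
// msiexec launch that applies a patch (.MSP) file.
public static class ElevatedPatchSketch
{
    public static ProcessStartInfo BuildStartInfo(string mspPath)
    {
        ProcessStartInfo info = new ProcessStartInfo();
        info.FileName = "msiexec.exe";
        info.Arguments = "/p \"" + mspPath + "\""; // /p applies a patch package
        info.UseShellExecute = true;               // required for the verb to take effect
        info.Verb = "runas";                       // asks Windows for an elevation prompt
        return info;
    }
}
```

Passing the result to Process.Start would bring up the elevation prompt and then run the patch.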

That's it.  There is nothing more to do.  I have tested Speech Server and have run and edited apps already created in Vista and XP.

So now the only remaining question is, does Microsoft officially support the use of Speech Server development in Windows 7?  To date, I have not heard, but at least for now I suspect not.

Posted by kstep | 1 Comments

VPCs and a Softphone to Test Transfers

I don't know if anyone has tried this, but the other day when I was playing around with transfers using only soft phones and virtual PCs, I was able to do something I wasn't sure you could do. I was able to initiate a call to one speech server, which in turn transferred the call to a second speech server and started another application. The scenario is like this…

Create application A that does a transfer after it completes. Load application A onto Speech Server A running inside a virtual (or physical) PC (I'll call it PC-A).

Create application B that runs an application that once transferred to, from application A, starts up and uses the ANI for identification. Application B runs on a second virtual (or physical) PC, PC-B.

On my laptop I had a soft phone that called PC A's application. PC A's application successfully transfers the call to PC B's application. Except for the fact that I produced prompts that indicated that I was transferring from one app to another, and that the app on PC B had a prompt saying I just came from another app, the user's perception of the transfer is non-existent. That is, the user would not even know that the transfer took place except possibly for some latency.

A couple of things to note in this endeavor…

  1. The port numbers for each app need to be different. For example, I put application A on port 8090 and application B on port 8092.
  2. The softphone must use TCP by default. For example, if you try using X-Lite, the only way that X-Lite can work is if you use a URI like this… sip:1234@192.168.8.5:8080;transport=tcp. Since Speech Server cannot append the ";transport=tcp" to the end of the transfer URI (at least I don't know of a way), you need to use a phone where TCP is the default, such as sipXPhone.

     

This kind of gives one some hope that without any telephony backend, you can test real transfers.

Posted by kstep | 6 Comments

Debugging Speech Server: Lesson Learned

If you've worked with computers long enough you know that there are times when you run across a problem that seems so elusive that you just want to attribute it to, "that's just the way it is", and move on.

Such was the case not too long ago when I ran across a problem using Speech Server. It got to the point where all I wanted to do was backup our databases, wipe clean the Speech Server, put on a fresh operating system, re-install Visual Studio 2005, SQL Server, prerequisites for Speech Server and Speech Server itself… not a very appealing option, but one I was really close to accepting. It was an extremely frustrating experience.

So the scenario is I have a laptop with a soft phone, calling across the local network to a Windows 2003 Server with Speech Server installed. I was able to call the speech server and everything about the call was fine except for one important part. I could not hear anything. I knew the soft phone was being answered, because of the message on the soft phone that said I was connected. I could verify the connection was being made on the Speech Server because I could see the additional call show up through the Performance Monitor. I even used Wireshark to do a trace to show that the SIP traffic was going back and forth, and to see if there was some anomaly with the audio. But everything looked fine. There were no errors in any system or speech logs either on the Speech Server or my laptop. My natural inclination was to look into a firewall problem. After all, firewall problems always seem to be the culprit when it comes to SIP calls. After shutting down the firewall, however, I still had the problem.

Just for grins, I took a virtual machine that had Windows 2003 Server and Speech Server loaded, and ran the virtual machine on my laptop and attempted to make a call using the same scenario. I had a soft phone on my host operating system calling a Speech Server on a Virtual machine, all on my laptop. Everything worked fine, including speech.

My conclusion… my laptop was configured properly, my soft phone was configured properly, and the missing audio had nothing to do with the laptop. Strangely, the audio was simply not arriving from the Speech Server.

So what had changed? The Speech Server I was using for development had been working fine before I had made changes to my application. With this new version of the app, it simply would not produce audio. Had I done something with the application to cause that? After testing the old version of the application, I still was not producing audio to my soft phone. I even tested the built in "Welcome to Microsoft Speech Server" app that comes with Speech Server and that didn't produce audio either.

Another thing that had changed was that we had installed .NET 3.5 on the Speech Server box to accommodate the latest WCF (Windows Communication Foundation). Could that have been the problem? To eliminate that possibility I also installed .NET 3.5 on the virtual machine I used earlier. After testing, it worked just fine. Audio passed to the soft phone without a problem.

So I still was not getting audio from our speech server to my laptop and I still hadn't a clue why. Could it be that the problem was not the Speech Server but my laptop? It seemed unlikely, since everything worked fine when I used a virtual machine, but just in case I asked one of my colleagues to set their laptop up with a soft phone configured the way mine was. I then went to Speech Server to configure a trusted SIP peer. My colleague made the call and lo and behold, he received audio on his soft phone!

So now things pointed back to my laptop. But what could be the problem? What was different now as compared to two weeks ago when things were working fine?

It turns out that the week prior to my discovering the problem I am describing here, we were at Microsoft going through a tech review of our application, and one of the things that was needed was a demonstration of our application. To do that I used my laptop running a virtual machine with Speech Server, all on my laptop. Because I was not going to be connected to a standard network, I had set up my laptop with a virtual network using a loopback adapter. In case you are not familiar with a loopback adapter, this is a piece of software that ships with Windows that allows you to configure a virtual network card that can be used to test network connections. It can also be used to set up a virtual machine that connects back to the host operating system through a virtual network without having a physical one.

Just like a real network card, you can have the loopback adapter connected or disabled. So for grins, I decided to look into the possibility that the loopback adapter was causing the problem. It turns out the loopback adapter was still connected as it was while I was at Microsoft. So I disabled it and tried my test again. IT WORKED! With the loopback adapter disabled, I was now getting audio from the Speech Server.

So there are three morals of this story. One, if you use a loopback adapter, make sure you shut it off before doing testing over a physical network. Two, don't just arbitrarily wipe a box clean and spend many hours getting that box up and running without exhausting other possibilities. If I had done that, I would have found out that the problem still existed, and that would have really been frustrating. Finally and probably most important, when debugging, always think about what changed since the last time it worked. Nine times out of ten, it is a change that seems to be benign that causes your problem.

To this day I am not really sure why the loopback adapter being connected causes a problem with audio being passed from Speech Server to a soft phone over a physical network, but it does. And maybe someone can help me understand that. This is just another one of those lessons learned that I wanted to share with you.

Using VS 2008 and VS 2005 on the Same Computer

There has been some discussion about using Visual Studio 2008 and Visual Studio 2005 with speech server on the same box. I thought I would write about my experiences and provide some warnings and alternatives.

I personally have loaded both VS 2008 and VS 2005 on an XP Pro box and things seem to be OK with one exception...  It seems there was a slight change in the way the workflow activities work.  Now, whenever a required property is needed on an activity, and in some cases even when things should not indicate requirements, instead of displaying a red exclamation mark, there is a red X through the entire activity icon box.

As long as there is a red X, it appears as though you cannot select the activity. However, when you do click or right-click on the activity everything seems to work as it should.  The only thing you can't do is delete the activity unless you minimize it by clicking the double chevron button. This is more of an annoyance than anything else.  I can still write and debug speech apps.

Using Vista is a whole other issue. A colleague of mine installed VS 2008 on the same box as VS 2005 and speech server. When he tried running an app within Visual Studio, there was an error indicating that the debugger could not attach to the workflow process. Even when he tried using CTRL-F5 (non-debug mode) he got the same error.

So the bottom line is I would not recommend loading VS 2008 alongside VS 2005 if you are working on speech apps, especially on Vista. With XP Pro, things seem to work OK but with the caveat I mentioned above. There may be some other issues I have not uncovered yet.

My suggestion would be to create a virtual machine and load VS 2008 on the virtual machine if you need to work on VS 2008 and VS 2005 on the same box.

Test Blog From Word 2007

This is a test blog.  For those that blog to this site and would like to know how to set up Word 2007 to post a blog, let me know.

Posted by kstep | 2 Comments

The nBest Choice

There are times when you receive ambiguous responses to very direct questions. For example, what if Speech Server asks a user what their last name is and the response is "Smith."   If it matters, the spelling could be s-m-i-t-h or s-m-y-t-h. Or what if the question is,  "Which city are you traveling to?", and the answer is "Austin."  The problem here is it could be interpreted as Boston.

There are times when we need to get a list of the possible responses that the recognition engine derived from the utterance, based on the grammar it used.  Armed with the list, we can then ask the user which of the possible answers is correct, instead of asking the same question over and over.  For example, consider this scenario...

IVR:  Which city are you traveling to?

USER: Austin

IVR: I think you said Boston. Is that correct?

USER: No

IVR: I'm sorry which city are you traveling to?

USER: Austin

IVR: I think you said Boston. Is that correct? ....

(By the way, it is possible to have the system allow the user say, "No I said Austin.", but that's a whole other discussion and it could still result in the system thinking the user said Boston.)

As you can see, this could get quite tedious for the user and the programmer.  Instead, we could present the user with other possible answers like this...

IVR:  Which city are you traveling to?

USER: Austin

IVR: I think you said Boston. Is that correct?

USER: No

IVR: I'm sorry, did you say Austin?

USER: Yes

Michael Dunn had a wonderful blog post, called N-Best Type Functionality, about how to obtain an nBest within a Workflow application. In that blog Michael pointed out that there is a property associated with a Speech Activity called RecognitionResult.Alternates. Alternates is an object of type List<RecognitionPhrase>, and RecognitionPhrase is an object that contains information about the possible answer returned in the Alternates list. Properties associated with RecognitionPhrase include but are not limited to:

Confidence - This is a score with a value of 0.0 - 1.0 where 1.0 is 100% confidence that the answer is correct.  (In the real world a confidence score of 1.0 is nearly impossible, and besides, a recognized phrase would not be in the alternates list if its confidence score was 1.0.)

Homophones - A homophone is a word that sounds the same but is spelt differently. This property is a list of homophones for the recognized phrase.  This could be used as a way to present the user with alternative spellings of a response, using something like this:

   qaLastName.MainPrompt.AppendTextWithHint("Boston", Microsoft.SpeechServer.Synthesis.SayAs.SpellOut);

Text - The word or phrase that was recognized.  In the example dialog above, this is where I would obtain Austin as the possible alternative to what the speech engine thought was said, namely Boston.
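The selection logic behind the second dialog can be sketched independently of Speech Server. In the example below, Alternate is a stand-in type I made up with just a Text and a Confidence, not the real RecognitionPhrase class, and the method name is hypothetical. It returns the highest-confidence candidate the caller has not already rejected.

```csharp
using System.Collections.Generic;

// Stand-in for a recognition alternate; NOT the Speech Server
// RecognitionPhrase class, just the two values the logic needs.
public sealed class Alternate
{
    public readonly string Text;
    public readonly double Confidence;

    public Alternate(string text, double confidence)
    {
        Text = text;
        Confidence = confidence;
    }
}

public static class NBestSketch
{
    // Returns the highest-confidence alternate whose text the caller has
    // not already rejected, or null when every candidate has been rejected.
    public static Alternate NextCandidate(IEnumerable<Alternate> alternates,
                                          ICollection<string> rejected)
    {
        Alternate best = null;
        foreach (Alternate candidate in alternates)
        {
            if (rejected.Contains(candidate.Text))
                continue; // the user already said "No" to this one

            if (best == null || candidate.Confidence > best.Confidence)
                best = candidate;
        }
        return best;
    }
}
```

With Boston at 0.62 and Austin at 0.58 (made-up scores), the first call offers Boston; once "Boston" is added to the rejected collection, the next call offers Austin instead of repeating the question.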

BIG NOTE:

Unfortunately (and I learned this the hard way) you cannot test alternatives using the built-in debugger and soft phone by pressing F5.  Instead, use Ctrl-F5 to put the soft phone into a non-debug mode.  Doing this prevents you from using the User Input and Call History tabs. This does not shut off debugging in Visual Studio; it simply prevents you from inputting your response by typing it in, so you will have to use a microphone and speak your response or use the DTMF buttons. It turns out that when you use the debugger's soft phone in debug mode, even if you record your response, the soft phone puts the result into the Text Input: field.  You are then required to click on the Submit button.  Doing this is equivalent to typing in the text and yields effectively a 100% confidence score from the recognition engine, which eliminates any possibility of alternative phrases showing up in the Alternates list.

Time Wasted Learning?

Well, maybe I should not call it time wasted, just learning experiences. I have spent a lot of time on things that, had I done them a little differently, could have saved me some heartache.  Below is a list of items that I have learned, some as recently as yesterday. If you are interested in more details on any of these, I will elaborate in later posts.

 

  1. Don’t give your speech server computer a name that starts with numbers. The Speech Server Process does not like that.  Maybe a bug?
  2. Firewalls are a necessary evil but you can set your applications to run on a specified port number... including a port that is already open. This works well for machines where you have no control over the port allocation.
  3. I can’t call Microsoft Speech Server using a soft phone on the same box as Microsoft Speech Server… I have wasted an inordinate amount of time trying.
  4. Using Virtual PC to set up a virtual Speech Server is a great learning and debugging tool. All you need is about 2 GB of memory on the box to make it worth your while.
  5. Set the UseDefaultGrammars property to false if you are using custom grammars for controls like MenuActivities and NavigableListActivities.  You can get some strange results when your activities use data sources that have items that are in your grammars.  The activities use the items in the data sources to create the default grammars.
  6. There is nothing built into the prompt engine that converts numbers and dates to their word equivalents. You’ve got to create your own code to do that if you don’t want text to speech (TTS) as an output for that type of data.
  7. Don’t expect all the features of WorkFlow to work in a SpeechWorkflow. For example, in a regular Workflow application you can add and remove activities dynamically. That does not work so well with Speech Workflows.
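
As an illustration of item 6, here is a minimal sketch of the kind of helper you might write yourself to turn a number into the individual words that recorded prompts would need. NumberWords and ToWords are hypothetical names, the 0 to 999,999 range is an arbitrary choice, and nothing here calls the Speech Server prompt engine; how you map each word to a recorded prompt is left out.

```csharp
using System;
using System.Collections.Generic;

public static class NumberWords
{
    private static readonly string[] Ones =
    {
        "zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"
    };

    private static readonly string[] Tens =
    {
        "", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"
    };

    // Splits a number from 0 to 999,999 into the words a prompt
    // database would need recordings for, in speaking order.
    public static List<string> ToWords(int n)
    {
        if (n < 0 || n > 999999) throw new ArgumentOutOfRangeException(nameof(n));

        var words = new List<string>();
        if (n == 0)
        {
            words.Add(Ones[0]);
            return words;
        }
        if (n >= 1000)
        {
            AppendBelowThousand(n / 1000, words);
            words.Add("thousand");
            n %= 1000;
        }
        AppendBelowThousand(n, words);
        return words;
    }

    // Appends the words for a value from 0 to 999; adds nothing for 0.
    private static void AppendBelowThousand(int n, List<string> words)
    {
        if (n >= 100)
        {
            words.Add(Ones[n / 100]);
            words.Add("hundred");
            n %= 100;
        }
        if (n >= 20)
        {
            words.Add(Tens[n / 10]);
            n %= 10;
        }
        if (n > 0) words.Add(Ones[n]);
    }
}
```

For example, ToWords(1234) yields the words one, thousand, two, hundred, thirty, four. Dates, currency and ordinals would each need their own rules on top of this.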

So hopefully some of this will be news to you and it will have been worth the time to read this blog. 

Posted by kstep | 4 Comments

Dynamic Workflows - Session Data

I was tasked to develop an application that has to be very dynamic. What I mean by this is that I needed to be able to have call flow generated at run time based on some parameters that are set in a database. To this end, one of the issues I ran into was the ability to create and maintain session data and have that data available to custom activities.  The idea is that depending on parameters that are obtained from a data source, some or all of these custom activities would be added to the call flow at the instant that call was initiated and each of these activities would need access to some global data dictionary in order to store and retrieve data that was collected by other activities.

 In OCS Speech Server, when using work flow to develop speech apps, there is a work flow engine that processes each activity in a very serial way unless some event is triggered to cause the work flow to suddenly jump from the main call flow to some other call flow. There is a public object that is available called UserData.

 UserData is a Dictionary that stores Key/Value pairs. At first I thought this would be a perfect variable to store any data that I collected from the user while their session was active. After some testing and a little research I quickly came to the realization that the UserData dictionary did a real good job at storing user data but that data was maintained for all users, which is to say, UserData was an instance variable of the WorkFlow and was not a session variable provided anew for each caller. So I needed something similar that would be available for each caller.

 The solution appears to be something I found in a blog post by Chad Cambell.  In his post he talks about creating your own session variable and how to get to that session variable from the main call flow or from any custom activity.

Although this solution seems to be sound, I would like to hear from anyone that may have a different solution.
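
For anyone wanting a picture of what a per-caller store could look like, here is a minimal sketch. It is not necessarily the approach from the post mentioned above, and it does not touch any Speech Server types: CallSessions, For and End are hypothetical names, and the assumption is that each call has some unique string ID you can use as a key.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public static class CallSessions
{
    // One data bag per call, keyed by an ID that is assumed unique per call.
    private static readonly ConcurrentDictionary<string, Dictionary<string, object>> Store =
        new ConcurrentDictionary<string, Dictionary<string, object>>();

    // Returns the data bag for one call, creating it on first use.
    public static Dictionary<string, object> For(string callId)
    {
        return Store.GetOrAdd(callId, _ => new Dictionary<string, object>());
    }

    // Call when the caller hangs up so finished calls do not accumulate.
    public static void End(string callId)
    {
        Dictionary<string, object> removed;
        Store.TryRemove(callId, out removed);
    }
}
```

A custom activity would then write something like CallSessions.For(callId)["LastName"] = "Smith" and a later activity on the same call would read it back, while a second caller gets an empty bag of their own. The outer dictionary is thread-safe across calls; the inner one is a plain Dictionary on the assumption that activities within a single call run one at a time.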

Posted by kstep | 1 Comments

VXML vs. WorkFlow for Speech Server 2007.

Before I get to the core of what this blog is all about, some explanation is in order as to where the heck I have been.

About a year ago I started, and miserably failed at, writing tutorial blogs on how to use Speech Server 2004. That was during a time when I was trying to build a business writing speech software with my company, CRK Software.  Well, CRK Software was not going in the direction I had hoped it would, and unfortunately I had to put CRK Software on the back burner and work at a real job with Countrywide Mortgage as an ASP.NET developer. For a long time now I have done little to nothing with Speech Server and felt like I was not going to be much help in this arena.

Well, time went on and I found out about another contract that could potentially put CRK Software back on the front burner, so now I am working a new contract doing, you guessed it, OCS Speech Server work. One of the first questions I was asked was whether to use VXML or Workflow to build their product. At first it seemed like a no-brainer: Workflow all the way, especially since other software development outside the speech world would be using Workflow.  Then I got to thinking... why not VXML?

There are several reasons to choose VXML over Workflow, and several reasons not to. So below is a bit of a comparison.  Let me know if you can come up with other compelling reasons to use one standard over the other.

VXML Pros

1.       A more universal and well established standard.

2.       Less proprietary so there are more abundant choices in VXML interpreters/speech servers.

3.       Can be written based on J2EE or .NET.

4.       Because it’s XML, tools for authoring it abound, and the ones in Visual Studio are quite good.

VXML Cons

1.       Development tools, although abundant for XML, are not necessarily well suited to debugging voice apps with grammars and the like.

2.       I believe there is a longer development cycle but I guess familiarity with VXML, the development tools and other factors can affect development life cycle.  Since I am woefully lacking in VXML development experience I could be persuaded otherwise.

3.       Both client-side (JavaScript) and server-side (Java, C# or VB.NET) code is necessary. Development is more of a web development paradigm.

OCS Speech Workflow Pros

1.       Because speech applications tend to be rather linear in their execution, speech apps lend themselves quite well to workflow.

2.       Very good tools for writing and debugging within Visual Studio.

3.       A graphical development environment for both dragging and dropping call flow and seeing the call flow.

4.       Very little, if any, client-side code is necessary to develop speech apps.

5.       Much of the .NET Speech library applies equally well to Vista desktop speech development, but I am not sure how big a deal this is when you are only concerned with IVR development, unless you want a skill set that extends beyond IVRs.

OCS Speech Workflow Cons

1.       Very proprietary. You must write on Microsoft tools for Microsoft Speech Server.

2.       Very young technology and to some extent not very well supported, even by Microsoft, although that is improving. Be prepared to plow new paths without much documentation.

3.       If you are not a .NET developer, you will have a steeper learning curve to get used to the development tools and .NET in general.  Even if you have written Speech Server 1.1 SDK apps, there is a large shift in the way you develop IVR apps.

4.       It's in .NET. I don't consider this a negative myself, but others, especially the J2EE die-hards, would.

And what about SALT?

 

All the same pros and cons as for Workflow (although documentation is more plentiful), plus two additional cons:

 

1.       Microsoft seems to be moving away from SALT as a standard.

2.       SALT can be tedious since much of the development requires client-side JavaScript development.

Posted by kstep | 6 Comments

What is Necessary to Make Speech Server Run? - A System Requirements Discussion

After a good vacation it’s hard, number one, to get back into the swing of things, and number two, to catch up on all the things you need to in order to get back into the swing of things.  Did that make any sense?

 

At any rate I am back at it, and in this installment I want to continue to talk about some of the hardware and software pieces that make up the Microsoft Speech Server environment. In the last installment I spent some time talking about the TIM and telephony board. Those are the pieces that allow a telephone network to connect up to the Microsoft Speech Server.  But what about the computers necessary to make Speech Server work? What kind of hardware is necessary there? Also in my last installment I said I would talk about configuring the MSS server, but I think it makes more sense to first talk about what is required for a speech system to run, then develop an application, and then talk about configuring MSS.

 

Before I go much further I want to take a minute to address those of you who are new to Microsoft Speech and simply want to experiment with this technology. DON’T LET THIS ARTICLE SCARE YOU. I say this because as you read through, it may seem like you will be getting in over your head.  Rest assured you can actually develop a speech app using the Microsoft Speech SDK without any additional hardware expense. With an MSDN license or a trial version of Microsoft Speech Server, and a developer's license of a SIP TIM (like Vail Systems), you can develop an IVR without any additional hardware, and it will even run in a Virtual PC or Virtual Server, but you will still need to obtain a license of Windows Server 2003.  So read on but don’t get discouraged.

 

If you’ve been around the IVR industry long, you know that one of the arts within the art of providing a good speech application is being able to determine how much hardware needs to be put into place to make a system with good scalability, as well as good response time.  To tell you the truth I am no expert in this realm, and yet it’s one of the most asked questions.  For example, how many computers should I have? What speed processor? How much memory and hard disk space?  How many ports (telephone lines) do I need? Should the Speech Server be on the same box as the Web Server? (Also known as the application server.)  Should I put the database on the same box? And on it goes.

 

To answer these many questions, let's look at all the pieces that are necessary to make a Speech Server work.  First it should be noted that when you install the Microsoft Speech Server, you have the option to install two primary pieces. Those pieces are the Speech Engine Services (SES) and the Telephony Application Services (TAS). Let’s talk about what each of these does.

 

SES is broken down into two parts: the Speech Recognizer (SR) engine and the Text-To-Speech (TTS) engine. As the names imply, the SR engine is used to recognize speech by comparing what is spoken to a pre-determined grammar. The recognition output of the SR engine is an XML document called SML. It contains the text of what was recognized along with confidence scores and semantic items. We will talk more about SML in the next installment. The TTS engine generates a digitized stream of audio data from text that is part of another XML document called SSML, and again we will talk more about SSML later.  So in reality SES is the heart and soul of a speech system. It recognizes the spoken word and plays back text as audio. As you may have already guessed, this part of the Speech Server system requires a lot of processing power, i.e. processor speed and memory.

 

The other part of the Microsoft Speech Server is the TAS. As I stated in the last installment, you can think of the TAS as a web browser for speech. Since telephones do not have the capability to interpret web pages in the form of HTML, SALT, and JScript, it becomes incumbent on the TAS to do that on behalf of the telephony hardware. So TAS is the speech browser and interpreter for speech applications. It understands HTML, JScript and those special HTML tags for speech called SALT.  TAS too can require a lot of processing power, especially if it is required to keep track of a lot of telephone calls at one time.  It would be like having your desktop run multiple Internet Explorer browsers with each one running web pages with sound and video.  Clearly if we try to run too many of these web pages at once, we start to run out of computer resources and latency is the result.

 

Because speech IVR applications are fundamentally web applications, we also need a web server. For Microsoft Speech, this requires Internet Information Services (IIS).  This software is used to send to the client (TAS in the case of a Microsoft Speech Server IVR) the necessary HTML, JScript and SALT web pages. IIS does a lot of work. Mainly this work comes in the form of taking C# or Visual Basic code and turning it into the appropriate HTML, JScript and SALT so it can send this information to the TAS.  Of course other functions include dealing with security and keeping track of application and session state. (If you are not familiar with the terms ‘application state’ and ‘session state’ take a look at this article from Microsoft.) The point here is that IIS does more than just find a web page file and send it out to the system that requests it.

 

One other major part of a typical Speech Server system is a database of some sort. The database engine that you use will of course depend on your company’s preference.  But when we talk Microsoft we are talking Microsoft SQL Server. Others come to mind as well, such as Oracle, Sybase, and the beloved and free MySQL.  Regardless of the database engine you use, consider how big the database will be and whether it already exists. Many companies tie their Speech Server to existing database systems.  Typically a speech system is added as an additional interface to existing applications. Users then have the added flexibility to get to a company's data either through a web site or through a speech portal via phone.

 

Now that we understand the pieces that need to be put into place for a speech system, we must do the real work. We must somehow come up with a way to determine whether, at one extreme, all these pieces will be working on separate boxes tied together by a network, or whether our system will be handling a sufficiently small number of users that the entire set of components can be hosted on a single box. Usually it’s somewhere in between.

 

There are many factors to consider. The number of concurrent users tops these considerations. Sometimes just knowing how many concurrent users there will be isn’t enough, however.  Sometimes we need to know when our peak load will be, and whether it is OK during those peaks to simply put the user into a queue or give them a busy signal.  We should also know how long we expect the user to be on the line and how much of the application requires intense SR activity or backend processing.  Then there is the issue of the network. Usually, the bottleneck in the whole system is the network. Should we consider putting all the components on a local area network, or is that impossible because our company’s topology requires databases and web servers to be geographically separated?  Microsoft Speech Server even allows SES and TAS to be on separate boxes with load balancing. (My understanding is that in the upcoming version of Speech Server included in Office Communications Server, separating TAS and SES will not be possible.)  The big IVR companies typically have some sort of process in place to calculate how many computer systems will be required to handle a given deployment. Even at that, many times once the system is installed, tweaking and tuning cause that estimate to be revised.
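
As a rough illustration of how concurrent users, call length and busy signals relate, here is a minimal sketch that uses the classic Erlang B formula from telephony to estimate port counts. This is a standard textbook model that I am bringing in for illustration, not something specific to Speech Server; PortSizing and its method names are hypothetical, and the model assumes blocked callers get a busy signal rather than waiting in a queue. It says nothing about CPU, memory or network sizing.

```csharp
using System;

public static class PortSizing
{
    // Offered load in erlangs: calls per hour times average call length in hours.
    public static double OfferedLoad(double callsPerHour, double avgCallSeconds)
    {
        return callsPerHour * avgCallSeconds / 3600.0;
    }

    // Erlang B blocking probability for a given load and number of ports,
    // using the recurrence B(0) = 1, B(n) = A*B(n-1) / (n + A*B(n-1)).
    public static double Blocking(double erlangs, int ports)
    {
        double b = 1.0;
        for (int n = 1; n <= ports; n++)
        {
            b = erlangs * b / (n + erlangs * b);
        }
        return b;
    }

    // Smallest port count whose blocking probability is at or below the target.
    public static int PortsNeeded(double erlangs, double maxBlocking)
    {
        if (erlangs < 0) throw new ArgumentOutOfRangeException(nameof(erlangs));
        if (maxBlocking <= 0 || maxBlocking >= 1)
            throw new ArgumentOutOfRangeException(nameof(maxBlocking));

        int ports = 0;
        while (Blocking(erlangs, ports) > maxBlocking)
        {
            ports++;
        }
        return ports;
    }
}
```

For example, 40 calls an hour averaging three minutes each is an offered load of 2 erlangs. Under this model that needs 7 ports to keep the chance of a busy signal at or below 1%, while 4 ports would turn away roughly 9.5% of callers at that load. Treat the result as a starting estimate to be revised by testing, as described above.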

 

This I can tell you: you can find articles on this subject, like this one from Intel, that go into detail on what you can expect.  Many times it takes experience, knowledge of your system, and trial and error to come up with the right answer.  The other thing I can tell you is that I have developed a “works-in-a-box” system (SES, TAS, IIS and a database on the same box) with 4 ports that has no problem running with four users concurrently. That box is a 2.4 GHz Pentium D with 2 GB of memory.  So while this article may make it seem like developing a relatively cheap system for development or small user loads is elusive, it’s not. In fact, developing a speech system that runs with a standard sound card and no telephony is essentially free, once you have the PC.

 

In the next installment I plan on a discussion of a very simple speech application along with a Hello World tutorial. So check back in a few weeks.

Posted by kstep | 0 Comments

What does TIM and Brooktrout have to do with Speech? - A Telephony Interface Primer.

I remember as a kid taking two cans, punching holes in the bottoms, threading string into the holes and tying knots to hold the string. Then my friend and I would hold the cans so that the string was taut and talk into one can while the other had their can up to their ear. We were never quite sure if we were actually hearing the voices through the cans or just hearing each other because we were so close. But it was a fun little experiment that entertained us for a while. The physics behind this assumes the bottom of the can you talk into vibrates with your voice, which in turn vibrates the string, which finally vibrates the bottom of the other can. So in theory we had built ourselves a voice user interface (the cans) along with a network (the string) that allowed us to talk to each other over a distance. It was a pretty crude interface, but nonetheless an interface.

 

With Microsoft Speech Server (MSS), there are a multitude of interfaces, all of which need the “cans” and “string” to tie it all together. In previous installments I have discussed how a speech application is very similar to a web application. On one end you have the web server, which services up the web pages, and on the other end you have a user sitting in front of a computer using a web browser, or in the case of a speech application, talking on a phone. The user types in an address such as www.microsoft.com in their browser and at the other end, the web server gets that request and sends back the start or default page of the web application. As the user interacts with the web application, more requests are made to the web server and the web server responds with new or updated pages back to the user’s web browser.

 

Think of MSS as a computerized version of a person with a web browser. MSS requests a speech application from the web server, the web server in turn sends back a web page, but instead of pictures and text, the speech server gets some special instructions that tell it to play prompts, and accept speech input for recognition. So fundamentally, the interface between the web server and the Speech Server is a standard web connection, but there is another interface between MSS and the telephony network. In this installment, I would like to talk in some detail about that interface.

 

If a user wanting to use the internet had no keyboard to “talk” to the internet with, and no screen to see the results of a request to the internet, that would make for a pretty boring user experience. So to allow a user to “talk” to the internet, we have a piece of hardware called a keyboard, and a piece of software that translates the electromechanical input from the keyboard to the text that is required to make a request. By the same token, to see results, we have some software that translates the ones and zeros to an electrical signal by a special circuit that sends a signal to the computer monitor to display the results of a web request.

 

Setting aside the use of a telephone for a moment, it is possible to plug a microphone and speaker into the MSS computer to speak to MSS, and then hear spoken words from MSS through the speaker. But having hundreds of users with microphones and headsets plugged into one machine is quite obviously not very practical. So to accommodate lots of users from anywhere in the world, somehow MSS needs to connect to an entire telephone network. “Out-of-the-box,” that is not something MSS can do without some special hardware and software.

 

MSS understands computer networks and connects easily to a web server, but it "knows" very little about connecting to a telephone network. So Microsoft relies on third party vendors to provide that telephone interface. As with a keyboard and monitor (the hardware) there is also accompanying software (drivers) that allow any company who makes those devices to connect to a computer running Microsoft Windows. So for MSS to connect to a telephone network, other companies make special boards that fit into an ISA slot in your computer that make the physical connection to standard telephony systems. These companies also provide software drivers for the board and a software interface between MSS and the board called the Telephony Interface Manager (TIM).

 

These boards and software can be purchased from several sources; however as a way of detailing how these boards and software work, I would like to tell you about the boards and software from Cantata Technology (formerly known as Brooktrout). Cantata offers several versions of their Brooktrout TR1000 boards. For development, testing and small deployments they offer a 4 port (4 telephone line) analog board. Other configurations use ISDN digital cards up to 96 ports. For an inexpensive test system, you can purchase a Starter Kit from Cantata that includes a 4 port analog board, the TIM software, and a 90 day license for Microsoft Speech Server.  I have used this configuration to set up a small office system that connects right to my standard telephone line.

 

First let’s talk about the TIM. The TIM software, as I mentioned, is something that is shipped with the board. Without the TIM, the board would not be able to interface with MSS. The purpose of the TIM software is two-fold. First, it translates the digitized audio from the board into a standard that is acceptable to MSS, and takes the audio from MSS and translates it into a standard acceptable to the TR1000 boards. Second, the TIM takes instructions from MSS to do things like answer incoming telephone calls, as well as hang up, transfer calls and place outbound calls.

 

The installation of the Brooktrout TR1000 4 port analog telephony card and TIM was a rather simple process. The board does require a full length board slot, so don’t try to install it in a compact desktop computer. The user manual is well written and provides excellent step-by-step installation instructions as well as troubleshooting techniques. It is important that you follow the instructions carefully. For example, they tell you to install the TR1000 software first, before installing the board. In this way, when you do install the board, the Plug-and-play features kick in and automatically detect and install the appropriate drivers. (The TR1000 software has recently been updated with a number of enhancements and fixes. The software is going through the final Microsoft certification for MSS 2004 R2, but has not been released as of this writing.)  After you install the board, you are asked to run the configuration software to determine how you want the ports set up. For example, you can set how many rings occur before the line gets picked up, as well as whether the ports will be used for in-bound or out-bound calls. After the configuration, it’s time to install the TIM software. Since I was familiar with Intervoice’s TIM installation, I quickly recognized the process.  The Brooktrout TIM software is really a re-branding of the Intervoice TIM.

 

I have installed three different versions of TIM and two different boards (the other being the Intel Dialogic board). None of them went as smoothly as this install did. I do think it’s fair to say, however, that nothing is problem free, and so I do have a couple of minor issues.

 

First, for some reason when installing the MMC interface for TIM, the system seemed to lock up during the install. There was no indication that the MMC console had completed its install, but upon inspection of the system, it did get installed.  I decided to re-install the entire system to see if the problem repeated. The second installation went without a hitch.

 

Second, near the completion of the installation of the TIM a pop-up box appeared saying that the IviGrp group had been installed, and it asked me if I wanted to install a user for that group. At this point I could have chosen no, but I was not really sure. I looked for documentation to help explain the question, but the documentation had not yet been installed. So my guess was to choose yes, and I was immediately taken to the Computer Management console. My next thought was what user to install, and I decided not to do anything.  Later, after successfully installing the TIM software (even without a new user), I was able to read and understand what the installation software was having me do. It turns out that to run some of the command line utilities for the TIM software, your user account must be a member of the IviGrp group. So I was later able to assign my login to that group without a problem.

 

Following the install of the TR1000 and the TIM, I installed Microsoft Speech Server. To test the system from end-to-end, I plugged Port 0 of the Brooktrout board into the telephone jack in the wall and dialed the number. The first time I called, the system took three rings before it answered, even though the board was set up to answer on the second ring. After it answered, it took about 10 seconds before MSS sent me the “Welcome to Microsoft Speech Server” message. The delays were likely due to the server caching all the necessary software, because on the second call, everything worked without delay.

 

So that’s it. When compared to developing a speech application and then configuring and optimizing the application, installing the Brooktrout TR1000 and TIM software was a breeze. In the next installment, I will talk a little about configuring MSS and how MSS knows what application to run when a caller calls the system.

 

Posted by kstep | 3 Comments