What is Necessary to Make Speech Server Run? - A System Requirements Discussion
After a good vacation it’s hard, number one, to get back into the swing of things, and number two, to catch up on all the things you need to in order to get back into the swing of things. Did that make any sense?
At any rate, I am back at it, and in this installment I want to continue talking about the hardware and software pieces that make up the Microsoft Speech Server environment. In the last installment I spent some time on the TIM and the telephony board, the pieces that allow a telephone network to connect to Microsoft Speech Server. But what about the computers necessary to make Speech Server work? What kind of hardware is necessary there? Also, in my last installment I said I would talk about configuring the MSS server, but I think it makes more sense to first talk about what a speech system requires in order to run, then develop an application, and then talk about configuring MSS.
Before I go much further I want to take a minute to address those of you who are new to Microsoft Speech and simply want to experiment with this technology. DON’T LET THIS ARTICLE SCARE YOU. I say this because as you read through, it may seem like you are getting in over your head. Rest assured you can develop a speech app using the Microsoft Speech SDK without any additional hardware expense. With an MSDN license or a trial version of Microsoft Speech Server, and a developer’s license for a SIP TIM (like the one from Vail Systems), you can develop an IVR without any additional hardware. It will even run in Virtual PC or Virtual Server, although you will still need a license for Windows Server 2003. So read on, but don’t get discouraged.
If you’ve been around the IVR industry long, you know that one of the arts within the art of providing a good speech application is being able to determine how much hardware needs to be put into place to make a system with good scalability, as well as good response time. To tell you the truth I am no expert in this realm, and yet it’s one of the most asked questions. For example, how many computers should I have? What speed processor? How much memory and hard disk space? How many ports (telephone lines) do I need? Should the Speech Server be on the same box as the Web Server? (Also known as the application server.) Should I put the database on the same box? And on it goes.
To answer these questions, let’s first look at all the pieces that are necessary to make a Speech Server work. When you install Microsoft Speech Server, you have the option to install two primary pieces: Speech Engine Services (SES) and Telephony Application Services (TAS). Let’s talk about what each of these does.
SES is broken down into two parts: the Speech Recognizer (SR) engine and the Text-To-Speech (TTS) engine. As the names imply, the SR engine recognizes speech by comparing what is spoken to a predetermined grammar. The recognition output of the SR engine is an XML document called SML (Semantic Markup Language). It contains the text of what was recognized along with confidence scores and semantic items. We will talk more about SML in the next installment. The TTS engine generates a digitized stream of audio data from text that is part of another XML document called SSML (Speech Synthesis Markup Language), and again we will talk more about SSML later. So in reality SES is the heart and soul of a speech system: it recognizes the spoken word and plays back text as audio. As you may have already guessed, this part of the Speech Server system requires a lot of processing power, i.e. processor speed and memory.
The other part of Microsoft Speech Server is the TAS. As I stated in the last installment, you can think of the TAS as a web browser for speech. Since telephones cannot interpret web pages made up of HTML, SALT, and JScript, it becomes incumbent on the TAS to do that on behalf of the telephony hardware. So TAS is the speech browser and interpreter for speech applications. It understands HTML, JScript and those special HTML tags for speech called SALT. TAS too can require a lot of processing power, especially if it has to keep track of a lot of telephone calls at one time. It would be like having your desktop run multiple Internet Explorer browsers, each one running web pages with sound and video. Clearly, if we try to run too many of these web pages at once, we start to run out of computer resources, and latency is the result.
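To make the idea of recognition output a little more concrete, here is a small Python sketch that parses an SML-style document and pulls out the recognized text, the confidence score and the semantic items. Treat it as an illustration only: the sample document, the City element and the summarize_sml helper are all assumptions made up for this example, and the real SML you get back depends entirely on your grammar.

```python
import xml.etree.ElementTree as ET

# Hypothetical SML-style result, invented for illustration.
# Real output from the SR engine depends on the grammar in use.
SAMPLE_SML = """
<SML text="fly to boston" confidence="0.87">
  <City text="boston" confidence="0.91">Boston</City>
</SML>
"""


def summarize_sml(sml_text):
    """Return the recognized text, overall confidence and semantic items."""
    root = ET.fromstring(sml_text.strip())
    items = {}
    # Each child element is treated as one semantic item.
    for child in root:
        items[child.tag] = {
            "value": (child.text or "").strip(),
            "confidence": float(child.get("confidence", "0")),
        }
    return {
        "text": root.get("text", ""),
        "confidence": float(root.get("confidence", "0")),
        "items": items,
    }


if __name__ == "__main__":
    print(summarize_sml(SAMPLE_SML))
```

An application would typically compare the confidence values against a threshold of its own choosing before acting on a semantic item, and re-prompt the caller when the score is too low.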
Because speech IVR applications are fundamentally web applications, we also need a web server. For Microsoft Speech, this requires Internet Information Services (IIS). This software is used to send to the client (TAS in the case of a Microsoft Speech Server IVR) the necessary HTML, JScript and SALT web pages. IIS does a lot of work. Mainly this work comes in the form of taking C# or Visual Basic code and turning it into the appropriate HTML, JScript and SALT so it can send this information to the TAS. Of course other functions include dealing with security and keeping track of application and session state. (If you are not familiar with the terms ‘application state’ and ‘session state’ take a look at this article from Microsoft.) The point here is that IIS does more than just find a web page file and send it out to the system that requests it.
One other major part of a typical Speech Server system is a database of some sort. The database engine that you use will of course depend on your company’s preference. But when we talk Microsoft we are talking Microsoft SQL Server. Others come to mind as well, such as Oracle, Sybase, and the beloved and free MySQL. Regardless of the database engine you use, you must consider how big the database will be, or whether it already exists. Many companies tie their Speech Server to existing database systems. Typically a speech system is added as an additional interface to existing applications. Users then have the added flexibility to get to a company’s data either through a web site or through a speech portal via phone.
Now that we understand the pieces that need to be put into place for a speech system, we must do the real work. We must somehow determine whether, at one extreme, all these pieces will run on separate boxes tied together by a network, or whether our system will be handling a sufficiently small number of users that the entire set of components can be hosted on a single box. Usually it’s somewhere in between.
There are many factors to consider. The number of concurrent users tops these considerations. Sometimes, however, just knowing the number of concurrent users isn’t enough. We may also need to know when our peak load will be, and whether it is OK during those peaks to simply put the user into a queue or give them a busy signal. We should also know how long we expect the user to be on the line and how much of the application requires intense SR activity or backend processing. Then there is the issue of the network. Usually, the bottleneck in the whole system is the network. Should we put all the components on a local area network, or is that impossible because our company’s topology requires databases and web servers to be geographically separated? Microsoft Speech Server even allows SES and TAS to be on separate boxes with load balancing. (My understanding is that in the upcoming version of Speech Server included in Office Communications Server, separating TAS and SES will not be possible.) The big IVR companies typically have some sort of process in place to calculate how many computer systems will be required to handle a given deployment. Even then, once the system is installed, tweaking and tuning often cause that estimate to be revised.
This I can tell you: you can find articles on this subject that go into detail on what you can expect, like this one from Intel. Many times it takes experience, knowledge of your system, and trial and error to come up with the right answer. The other thing I can tell you is that I have developed a “works-in-a-box” system (SES, TAS, IIS and a database on the same box) with 4 ports that has no problem running with four concurrent users. That box is a Pentium D running at 2.4 GHz with 2 GB of memory. So while this article may make it seem like building a relatively cheap system for development or small user loads is elusive, it’s not. In fact, developing a speech system that runs with a standard sound card and no telephony is essentially free once you have the PC.
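For the “how many ports do I need?” question, one common starting point from classic telephony traffic engineering is the Erlang B formula, which estimates the chance that a caller gets a busy signal for a given load and port count. The Python sketch below is only a rough illustration of that idea. It is not a Microsoft tool or the process any particular IVR vendor uses, the function names and sample numbers are made up, and it says nothing about CPU, memory or network sizing.

```python
def erlang_b(traffic_erlangs, ports):
    """Blocking probability for the offered traffic on a number of ports."""
    blocking = 1.0  # with zero ports every call is blocked
    # Standard iterative form of the Erlang B formula.
    for n in range(1, ports + 1):
        blocking = (traffic_erlangs * blocking) / (n + traffic_erlangs * blocking)
    return blocking


def ports_needed(calls_per_hour, avg_call_seconds, max_blocking):
    """Smallest port count that keeps blocking at or below max_blocking."""
    if max_blocking <= 0:
        raise ValueError("max_blocking must be greater than zero")
    # Offered traffic in erlangs: calls per hour times hours per call.
    traffic = calls_per_hour * avg_call_seconds / 3600.0
    if traffic == 0:
        return 0
    ports = 0
    while erlang_b(traffic, ports) > max_blocking:
        ports += 1
    return ports


if __name__ == "__main__":
    # Hypothetical peak hour: 60 calls, 3 minutes each, 2% busy signals allowed.
    print(ports_needed(60, 180, 0.02))  # prints 8
```

Erlang B assumes blocked callers get a busy signal and go away. If you plan to queue callers during peaks instead, a different model would be more appropriate, and in either case the result is only a first estimate to refine with testing.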
In the next installment I plan on a discussion of a very simple speech application along with a Hello World tutorial. So check back in a few weeks.