OK, I’m going to date myself here, but I’m sure some of you young puppies out there have seen this as well. There was an episode of “I Love Lucy” where Lucy somehow ended up on the production line at a candy factory. Her job was to take the candy off the moving conveyor belt and place it into packaging as it slid by. At first, she had no problem. As time went on, however, the conveyor belt sped up, so much so that she started eating some of the candy instead of packing it. Eventually all the candy began to fall on the floor around her.
Sometimes, when it comes to keeping up with technology, especially software development technology, that’s exactly how I feel. Just when I think I have caught up, suddenly there is a flood of new information to learn. I know others have touched on this before, but I thought it might be interesting to list the technologies that a developer using Microsoft Speech Server needs to know in order to develop a speech application effectively. And if you think this is all, wait until Speech Server 2007 comes out. While much of what you learn for Speech Server 2004 can be applied, there will suddenly be a new Visual Studio, with a new version of C#, along with a new Workflow paradigm for programming… and, well, you get what I’m saying.
As I stated in my last installment, writing speech applications for Speech Server 2004 is much like writing a web site, except you include speech stuff. That speech stuff is SALT (Speech Application Language Tags). Just knowing SALT won’t help much in the Visual Studio environment. In fact, the way the Speech SDK works, you really don’t write SALT at all. You write an ASP.NET application and use the Microsoft Speech controls that are included in the toolbox. By dragging and dropping these speech controls, you are in essence writing some HTML-like/SALT-like code that the web server ultimately turns into true HTML and true SALT. So that brings up the first technology you need to understand… ASP.NET. If you are already an ASP.NET programmer, you already have many of the skills necessary to build speech apps.
For those unfamiliar with ASP.NET, or for that matter any of the Microsoft .NET technologies, I recommend that you learn what .NET is and what can be done with it. Essentially it’s Microsoft’s latest technology for developing applications. Applications developed in .NET can be console applications, Windows desktop applications, web sites, web services, and just about any other kind of application you can imagine. Those of you who program using Java technologies will be very comfortable with the similarities between .NET and Java.
When you use Visual Studio (or any other development environment) to write ASP.NET web applications (including speech applications), you are really writing code for two computers. One computer is the server that serves up the application, called a web server (for Microsoft that is IIS), and the other is the client computer that requests the web pages from the web server (for speech, the client can be Microsoft Speech Server). The thing to keep in mind is that when you write an application using ASP.NET, you are writing code that tells the web server what to send to the client. It is up to the web server to take that code and convert it to code that runs in the client’s browser. There are cases, however, when you will write code directly for the browser, and the server just passes that code to the client without doing any conversion.
So why would you write code that runs on the web server in some cases and code that runs in the client’s browser in other cases? Well, this is the dilemma that plagues web development, and it is one reason why Microsoft developed the .NET environment. The idea is that if all I had to do was write code for the server, then theoretically I should be able to write all of my code in one language (for example, C# or Visual Basic .NET) so that it runs exclusively on the web server, and rely on the web server to convert that code to HTML and JavaScript (as well as SALT for speech) that runs in any client’s browser. For standard web site applications, this works fairly well. In practice, however, there are times when writing code directly for the browser (using HTML and JavaScript) makes the system as a whole more efficient, so from a design perspective it makes sense to do so. So why not just write HTML and JavaScript? Well, there is also code that MUST run on the web server. For example, since I can’t have every client connect to a database server to get data, I have to rely on the web server to make those connections on behalf of the client. Therefore, I need to write code for the server that queries the database.
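To make the split concrete, here is a minimal sketch of the idea, written in Python purely as a neutral stand-in (a real Speech Server project would use C# or Visual Basic .NET inside ASP.NET). The function name render_color_page, the colors table, and the in-memory SQLite database are all assumptions made up for illustration. The server-side function queries the database and builds the HTML, while the small script it emits is passed through untouched and would run in the client's browser.

```python
import sqlite3
from html import escape


def render_color_page(conn):
    """Runs on the server: query the database, then build HTML for the client."""
    rows = conn.execute("SELECT name FROM colors ORDER BY name").fetchall()
    items = "".join(f"<li>{escape(name)}</li>" for (name,) in rows)
    # This script is not executed here; it is sent as-is and would run in the browser.
    script = "<script>document.title = 'Colors';</script>"
    return f"<html><body><ul>{items}</ul>{script}</body></html>"


# Hypothetical in-memory database standing in for a real database server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE colors (name TEXT)")
conn.executemany("INSERT INTO colors VALUES (?)", [("red",), ("green",), ("blue",)])

page = render_color_page(conn)
print(page)
```

The client never touches the database; it only receives the finished page.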
So already we see at least three languages that we must learn to be an efficient web/speech developer in the Microsoft world: C# or Visual Basic .NET, HTML, and JavaScript (also known as ECMAScript, with a Microsoft version called JScript).
But that just gets us part way. If we need to connect to a database, we also need to understand the database server and its language, usually SQL (Structured Query Language). Although SQL is pretty universal across database systems from different companies, there are usually some proprietary nuances that need to be learned. For example, if you use Microsoft SQL Server, you will need to learn T-SQL, which is very SQL-like but has its own syntax for writing scripts (called stored procedures).
If that weren’t enough, many times we need to transfer data between computer systems that, for various reasons, cannot be connected to each other directly. Since nearly all computers these days are connected to the Internet, it stands to reason that we need a way to format data so that any computer connected to the Internet can understand it. That brings in another technology called XML. Without going into details on XML, let me just say that speech technology in the telephony world relies heavily on XML. As a matter of fact, the whole .NET environment relies heavily on XML. Fortunately, the .NET framework hides most of the XML stuff from us.
In speech there are three XML-based formats that you must be familiar with. They are GRXML (Grammar XML), SML (Semantic Markup Language) and SSML (Speech Synthesis Markup Language). Let’s talk about each of these briefly.
In Microsoft Speech Server, recognition can only occur if what the user says matches a pre-determined “list” of things that we expect the user to say. That list of potential statements is called a grammar. Part of the challenge of writing speech apps is knowing ahead of time what we expect the response to a question to be. For example, if I ask, “What color do you like?” then as the programmer I have to provide the recognition engine with a grammar that lists the colors I expect the user to say. That list must be in a special text file: an XML document that follows the GRXML standard.
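As a rough illustration of what such a grammar file can look like, here is a small color grammar in the W3C SRGS XML form that GRXML follows. The rule name Color and the three colors are made up for this example, and a grammar produced by the Speech SDK tools would normally carry more than this. The Python around it is only there to show that the file is ordinary XML that any program can read.

```python
import xml.etree.ElementTree as ET

# A hypothetical grammar for the question "What color do you like?"
GRXML = """<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" root="Color">
  <rule id="Color" scope="public">
    <one-of>
      <item>red</item>
      <item>green</item>
      <item>blue</item>
    </one-of>
  </rule>
</grammar>"""

NS = {"g": "http://www.w3.org/2001/06/grammar"}

root = ET.fromstring(GRXML)
# Collect every phrase the recognizer would be allowed to match.
phrases = [item.text.strip() for item in root.findall(".//g:item", NS)]
print(phrases)  # ['red', 'green', 'blue']
```

Anything the caller says that is not on this list (say, "purple") cannot be recognized with this grammar.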
Whenever we tell the speech engine we want it to recognize something, two things have to be present for the recognition engine to do its work. First we must tell it which GRXML file, or grammar, to use, then we must provide it with the digitized audio data spoken by the user. If the recognition engine is able to match the digitized audio with something in the GRXML file, the recognition engine then produces another XML document called SML. In the SML document there will be the text of what the engine “thought” was spoken along with a confidence score. The confidence score is a number from 0.0 to 1.0 where 1.0 represents 100% confidence and 0.0 is no confidence. We will talk more about this in later installments.
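To give a feel for the result, here is a sketch that reads a small SML-style document and checks its confidence score. Treat the exact shape as an assumption: the text and confidence attributes and the Color element are illustrative, and the 0.5 threshold is a made-up value, not a recommendation from Speech Server.

```python
import xml.etree.ElementTree as ET

# Illustrative SML-style result; element and attribute names are assumptions.
SML = (
    '<SML text="red" confidence="0.82">'
    '<Color confidence="0.82">red</Color>'
    "</SML>"
)

CONFIDENCE_THRESHOLD = 0.5  # hypothetical cut-off chosen for this example


def read_result(sml_text, threshold=CONFIDENCE_THRESHOLD):
    """Return (recognized text, confidence, whether to accept the result)."""
    root = ET.fromstring(sml_text)
    recognized = root.get("text")
    confidence = float(root.get("confidence"))
    return recognized, confidence, confidence >= threshold


print(read_result(SML))  # ('red', 0.82, True)
```

An application would typically accept a high-confidence result, and confirm or re-ask when the score falls below whatever threshold it chooses.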
To convert written text to spoken words, a text-to-speech (TTS) engine is used. In order for the TTS engine to do its work, it requires an XML document that follows the SSML standard. This document not only has the text to be converted to digitized audio, but also some special markup that indicates how the text is to be spoken. For example, when most people speak (unlike one of my college professors) there is inflection and volume change (prosody), as well as some other information. The point is, this SSML document tells the TTS engine what written text to convert to a digitized representation of audible speech, and how to speak it.
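Here is a minimal sketch of such a document, using elements from the W3C SSML 1.0 standard (break, prosody and emphasis). The prompt wording is invented for the example, and the Python only separates the "what" (the text to speak) from the "how" (the markup), without calling any TTS engine.

```python
import xml.etree.ElementTree as ET

# A hypothetical prompt marked up with SSML 1.0 elements.
SSML = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  What color do you like?
  <break time="500ms"/>
  <prosody rate="slow" volume="loud">Take your time.</prosody>
  You can say <emphasis level="strong">red</emphasis>, green, or blue.
</speak>"""

NS = "{http://www.w3.org/2001/10/synthesis}"

root = ET.fromstring(SSML)
# The "what": all of the text, with whitespace collapsed.
spoken_text = " ".join("".join(root.itertext()).split())
# The "how": the markup elements that shape delivery.
markup = [el.tag.replace(NS, "") for el in root.iter() if el is not root]

print(spoken_text)
print(markup)  # ['break', 'prosody', 'emphasis']
```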
So let’s recap. Some of the technologies you need to know to program speech in the Microsoft world are:
.NET – Microsoft’s application development framework.
ASP.NET – Web application development.
C# or Visual Basic .NET – One of the languages that the .NET framework understands and is used to program server side code for ASP.NET.
HTML – HyperText Markup Language for telling a client’s browser how to display things on the screen.
JavaScript – A scripting language that tells the client browser what to do.
SQL – A query language used to return results from a database, as well as a proprietary form of SQL for each brand of database.
XML – A markup language used to transfer data across the internet.
GRXML – An XML document containing a grammar.
SML – An XML document that contains the response from the Speech Server recognition.
SSML – An XML document that contains text and prosody markup for a TTS engine.
SALT – The speech tags and markup used by the browser to run a speech application. Understand, however, that if you are using the Speech SDK for Microsoft Speech Server, you will not directly write SALT in your application.
While this all seems like a lot, I certainly don’t want to dissuade you from learning how to program speech applications. Much of what you need to learn for speech is very useful outside speech application development. For those of you so inclined, be assured that by learning speech now, you will have a skill that few people have. Speech as a technology will only become more popular over time, so you will be ahead of those who know nothing about it. Plus… it’s pretty cool stuff.