Speech synthesis and recognition were both introduced in .NET 3.0. They both live in System.Speech.dll. In the past, I already talked about speech synthesis in the context of ASP.NET Web Form applications, this time, I’m going to talk about speech recognition.
.NET has in fact two APIs for that:
System.Speech, which comes with the core framework;
Microsoft.Speech, offering a somewhat similar API, but requiring a separate download: Microsoft Speech Platform Software Development Kit.
I am going to demonstrate a technique that makes use of HTML5 features, namely, Data URIs and the getUserMedia API and also ASP.NET Client Callbacks, which, if you have been following my blog, should know that I am a big fan of.
First, because we have two APIs that we can use, let’s start by creating an abstract base provider class:
It basically features one method, RecognizeFromWav, which takes a physical path and returns a SpeechRecognitionResult (code coming next). For completeness, it also implements the Dispose Pattern, because some provider may require it.
In a moment we will be creating implementations for the built-in .NET provider as well as Microsoft Speech Platform.
The SpeechRecognition property refers to our Web Forms control, inheriting from HtmlGenericControl, which is the one that knows how to instantiate one provider or the other:
The control exposes some custom properties:
Culture: an optional culture name, such as “pt-PT” or “en-US”; if not specified, it defaults to the current culture in the server machine;
Mode: one of the two providers: Desktop (for System.Speech) or Server (for Microsoft.Speech, of Microsoft Speech Platform);
SampleRate: a sample rate, by default, it is 44100;
Grammars: an optional collection of additional grammar files, with extension .grxml (Speech Recognition Grammar Specification), to add to the engine;
Choices: an optional collection of choices to recognize, if we want to restrict the scope, such as “yes”/”no”, “red”/”green”, etc.
The mode enumeration looks like this:
The SpeechRecognition control also has an event, SpeechRecognized, which allows overriding the detected phrases. Its argument is this simple class that follows the regular .NET event pattern:
Which in turn holds a SpeechRecognitionResult:
This class receives the phrase that the speech recognition engine understood plus an array of additional alternatives, in descending order.
You need to add an assembly-level attribute, WebResourceAttribute, possibly in AssemblyInfo.cs, of course, replacing MyNamespace for your assembly’s default namespace:
This attribute registers a script file with some content-type so that it can be included in a page by the RegisterClientScriptResource method.
And here is it:
OK, let’s move on the the provider implementations; first, Desktop:
What this provider does is simply receive the location of a .WAV file and feed it to SpeechRecognitionEngine, together with some parameters of SpeechRecognition (Culture, AudioRate, Grammars and Choices)
Finally, the code for the Server (Microsoft Speech Platform Software Development Kit) version:
As you can see, it is very similar to the Desktop one. Keep in mind, however, that for this provider to work you will have to download the Microsoft Speech Platform SDK, the Microsoft Speech Platform Runtime and at least one language from the Language Pack.
Here is a sample markup declaration:
If you want to add specific choices, add the Choices attribute to the control declaration and separate the values by commas:
Or add a grammar file:
By the way, grammars are not so difficult to create, you can find a good documentation in MSDN: http://msdn.microsoft.com/en-us/library/ee800145.aspx.
And that’s it. You start recognition by calling startRecording(), get results in onSpeechRecognized() (or any other function set in the OnClientSpeechRecognized property) and stop recording with stopRecording(). The values passed to onSpeechRecognized() are those that may have been filtered by the server-side SpeechRecognized event handler.
A final word of advisory: because generated sound files may become very large, do keep the recordings as short as possible.
Of course, this offers several different possibilities, I am looking forward to hearing them from you!