Speech Recognition in ASP.NET

Speech synthesis and recognition were both introduced in .NET 3.0. They both live in System.Speech.dll. In the past, I already talked about speech synthesis in the context of ASP.NET Web Form applications, this time, I’m going to talk about speech recognition.

.NET has in fact two APIs for that:

I am going to demonstrate a technique that makes use of HTML5 features, namely, Data URIs and the getUserMedia API and also ASP.NET Client Callbacks, which, if you have been following my blog, should know that I am a big fan of.

First, because we have two APIs that we can use, let’s start by creating an abstract base provider class:

   1: public abstract class SpeechRecognitionProvider : IDisposable
   2: {
   3:     protected SpeechRecognitionProvider(SpeechRecognition recognition)
   4:     {
   5:         this.Recognition = recognition;
   6:     }
   7:  
   8:     ~SpeechRecognitionProvider()
   9:     {
  10:         this.Dispose(false);
  11:     }
  12:  
  13:     public abstract SpeechRecognitionResult RecognizeFromWav(String filename);
  14:  
  15:     protected SpeechRecognition Recognition
  16:     {
  17:         get;
  18:         private set;
  19:     }
  20:  
  21:     protected virtual void Dispose(Boolean disposing)
  22:     {
  23:     }
  24:  
  25:     void IDisposable.Dispose()
  26:     {
  27:         GC.SuppressFinalize(this);
  28:         this.Dispose(true);
  29:     }
  30: }

It basically features one method, RecognizeFromWav, which takes a physical path and returns a SpeechRecognitionResult (code coming next). For completeness, it also implements the Dispose Pattern, because some provider may require it.

In a moment we will be creating implementations for the built-in .NET provider as well as Microsoft Speech Platform.

The SpeechRecognition property refers to our Web Forms control, inheriting from HtmlGenericControl, which is the one that knows how to instantiate one provider or the other:

   1: [ConstructorNeedsTag(false)]
   2: public class SpeechRecognition : HtmlGenericControl, ICallbackEventHandler
   3: {
   4:     public SpeechRecognition() : base("span")
   5:     {
   6:         this.OnClientSpeechRecognized = String.Empty;
   7:         this.Mode = SpeechRecognitionMode.Desktop;
   8:         this.Culture = String.Empty;
   9:         this.SampleRate = 44100;
  10:         this.Grammars = new String[0];
  11:         this.Choices = new String[0];
  12:     }
  13:  
  14:     public event EventHandler<SpeechRecognizedEventArgs> SpeechRecognized;
  15:  
  16:     [DefaultValue("")]
  17:     public String Culture
  18:     {
  19:         get;
  20:         set;
  21:     }
  22:  
  23:     [DefaultValue(SpeechRecognitionMode.Desktop)]
  24:     public SpeechRecognitionMode Mode
  25:     {
  26:         get;
  27:         set;
  28:     }
  29:  
  30:     [DefaultValue("")]
  31:     public String OnClientSpeechRecognized
  32:     {
  33:         get;
  34:         set;
  35:     }
  36:  
  37:     [DefaultValue(44100)]
  38:     public UInt32 SampleRate
  39:     {
  40:         get;
  41:         set;
  42:     }
  43:  
  44:     [TypeConverter(typeof(StringArrayConverter))]
  45:     [DefaultValue("")]
  46:     public String [] Grammars
  47:     {
  48:         get;
  49:         private set;
  50:     }
  51:  
  52:     [TypeConverter(typeof(StringArrayConverter))]
  53:     [DefaultValue("")]
  54:     public String[] Choices
  55:     {
  56:         get;
  57:         set;
  58:     }
  59:  
  60:     protected override void OnInit(EventArgs e)
  61:     {
  62:         if (this.Page.Items.Contains(typeof(SpeechRecognition)))
  63:         {
  64:             throw (new InvalidOperationException("There can be only one SpeechRecognition control on a page."));
  65:         }
  66:  
  67:         var sm = ScriptManager.GetCurrent(this.Page);
  68:         var reference = this.Page.ClientScript.GetCallbackEventReference(this, "sound", String.Format("function(result){{ {0}(JSON.parse(result)); }}", String.IsNullOrWhiteSpace(this.OnClientSpeechRecognized) ? "void" : this.OnClientSpeechRecognized), String.Empty, true);
  69:         var script = String.Format("\nvar processor = document.getElementById('{0}'); processor.stopRecording = function(sampleRate) {{ window.stopRecording(processor, sampleRate ? sampleRate : 44100); }}; processor.startRecording = function() {{ window.startRecording(); }}; processor.process = function(sound){{ {1} }};\n", this.ClientID, reference);
  70:  
  71:         if (sm != null)
  72:         {
  73:             this.Page.ClientScript.RegisterStartupScript(this.GetType(), String.Concat("process", this.ClientID), String.Format("Sys.WebForms.PageRequestManager.getInstance().add_pageLoaded(function() {{ {0} }});\n", script), true);
  74:         }
  75:         else
  76:         {
  77:             this.Page.ClientScript.RegisterStartupScript(this.GetType(), String.Concat("process", this.ClientID), script, true);
  78:         }
  79:  
  80:         this.Page.ClientScript.RegisterClientScriptResource(this.GetType(), String.Concat(this.GetType().Namespace, ".Script.js"));
  81:         this.Page.Items[typeof(SpeechRecognition)] = this;
  82:  
  83:         base.OnInit(e);
  84:     }
  85:  
  86:     protected virtual void OnSpeechRecognized(SpeechRecognizedEventArgs e)
  87:     {
  88:         var handler = this.SpeechRecognized;
  89:  
  90:         if (handler != null)
  91:         {
  92:             handler(this, e);
  93:         }
  94:     }
  95:  
  96:     protected SpeechRecognitionProvider GetProvider()
  97:     {
  98:         switch (this.Mode)
  99:         {
 100:             case SpeechRecognitionMode.Desktop:
 101:                 return (new DesktopSpeechRecognitionProvider(this));
 102:  
 103:             case SpeechRecognitionMode.Server:
 104:                 return (new ServerSpeechRecognitionProvider(this));
 105:         }
 106:  
 107:         return (null);
 108:     }
 109:  
 110:     #region ICallbackEventHandler Members
 111:  
 112:     String ICallbackEventHandler.GetCallbackResult()
 113:     {
 114:         AsyncOperationManager.SynchronizationContext = new SynchronizationContext();
 115:  
 116:         var filename = Path.GetTempFileName();
 117:         var result = null as SpeechRecognitionResult;
 118:  
 119:         using (var engine = this.GetProvider())
 120:         {
 121:             var data = this.Context.Items["data"].ToString();
 122:  
 123:             using (var file = File.OpenWrite(filename))
 124:             {
 125:                 var bytes = Convert.FromBase64String(data);
 126:                 file.Write(bytes, 0, bytes.Length);
 127:             }
 128:  
 129:             result = engine.RecognizeFromWav(filename) ?? new SpeechRecognitionResult(String.Empty);
 130:         }
 131:  
 132:         File.Delete(filename);
 133:  
 134:         var args = new SpeechRecognizedEventArgs(result);
 135:  
 136:         this.OnSpeechRecognized(args);
 137:  
 138:         var json = new JavaScriptSerializer().Serialize(result);
 139:  
 140:         return (json);
 141:     }
 142:  
 143:     void ICallbackEventHandler.RaiseCallbackEvent(String eventArgument)
 144:     {
 145:         this.Context.Items["data"] = eventArgument;
 146:     }
 147:  
 148:     #endregion
 149: }

SpeechRecognition implements ICallbackEventHandler for a self-contained AJAX experience; it registers a couple of JavaScript functions and also an embedded JavaScript file, for some useful sound manipulation and conversion. Only one instance is allowed on a page. On the client-side, this JavaScript uses getUserMedia to access an audio source and uses a clever mechanism to pack them as a .WAV file in a Data URI. I got these functions from http://typedarray.org/from-microphone-to-wav-with-getusermedia-and-web-audio/ and made some changes to them. I like them because they don’t require any external library, which makes all this pretty much self-contained.

The control exposes some custom properties:

  • Culture: an optional culture name, such as “pt-PT” or “en-US”; if not specified, it defaults to the current culture in the server machine;
  • Mode: one of the two providers: Desktop (for System.Speech) or Server (for Microsoft.Speech, of Microsoft Speech Platform);
  • OnClientSpeechRecognized: the name of a callback JavaScript version that will be called when there are results (more on this later);
  • SampleRate: a sample rate, by default, it is 44100;
  • Grammars: an optional collection of additional grammar files, with extension .grxml (Speech Recognition Grammar Specification), to add to the engine;
  • Choices: an optional collection of choices to recognize, if we want to restrict the scope, such as “yes”/”no”, “red”/”green”, etc.

The mode enumeration looks like this:

   1: public enum SpeechRecognitionMode
   2: {
   3:     Desktop,
   4:     Server
   5: }

The SpeechRecognition control also has an event, SpeechRecognized, which allows overriding the detected phrases. Its argument is this simple class that follows the regular .NET event pattern:

   1: [Serializable]
   2: public sealed class SpeechRecognizedEventArgs : EventArgs
   3: {
   4:     public SpeechRecognizedEventArgs(SpeechRecognitionResult result)
   5:     {
   6:         this.Result = result;
   7:     }
   8:  
   9:     public SpeechRecognitionResult Result
  10:     {
  11:         get;
  12:         private set;
  13:     }
  14: }

Which in turn holds a SpeechRecognitionResult:

   1: public class SpeechRecognitionResult
   2: {
   3:     public SpeechRecognitionResult(String text, params String [] alternates)
   4:     {
   5:         this.Text = text;
   6:         this.Alternates = alternates.ToList();
   7:     }
   8:  
   9:     public String Text
  10:     {
  11:         get;
  12:         set;
  13:     }
  14:  
  15:     public List<String> Alternates
  16:     {
  17:         get;
  18:         private set;
  19:     }
  20: }

This class receives the phrase that the speech recognition engine understood plus an array of additional alternatives, in descending order.

The JavaScript file containing the utility functions is embedded in the assembly:

image

You need to add an assembly-level attribute, WebResourceAttribute, possibly in AssemblyInfo.cs, of course, replacing MyNamespace for your assembly’s default namespace:

   1: [assembly: WebResource("MyNamespace.Script.js", "text/javascript")]

This attribute registers a script file with some content-type so that it can be included in a page by the RegisterClientScriptResource method.

And here is it:

   1: // variables
   2: var leftchannel = [];
   3: var rightchannel = [];
   4: var recorder = null;
   5: var recording = false;
   6: var recordingLength = 0;
   7: var volume = null;
   8: var audioInput = null;
   9: var audioContext = null;
  10: var context = null;
  11:  
  12: // feature detection 
  13: navigator.getUserMedia = navigator.getUserMedia || navigator.webkitGetUserMedia || navigator.mozGetUserMedia || navigator.msGetUserMedia;
  14:  
  15: if (navigator.getUserMedia)
  16: {
  17:     navigator.getUserMedia({ audio: true }, onSuccess, onFailure);
  18: }
  19: else
  20: {
  21:     alert('getUserMedia not supported in this browser.');
  22: }
  23:  
  24: function startRecording()
  25: {
  26:     recording = true;
  27:     // reset the buffers for the new recording
  28:     leftchannel.length = rightchannel.length = 0;
  29:     recordingLength = 0;
  30:     leftchannel = [];
  31:     rightchannel = [];
  32: }
  33:  
  34: function stopRecording(elm, sampleRate)
  35: {
  36:     recording = false;
  37:  
  38:     // we flat the left and right channels down
  39:     var leftBuffer = mergeBuffers(leftchannel, recordingLength);
  40:     var rightBuffer = mergeBuffers(rightchannel, recordingLength);
  41:     // we interleave both channels together
  42:     var interleaved = interleave(leftBuffer, rightBuffer);
  43:  
  44:     // we create our wav file
  45:     var buffer = new ArrayBuffer(44 + interleaved.length * 2);
  46:     var view = new DataView(buffer);
  47:  
  48:     // RIFF chunk descriptor
  49:     writeUTFBytes(view, 0, 'RIFF');
  50:     view.setUint32(4, 44 + interleaved.length * 2, true);
  51:     writeUTFBytes(view, 8, 'WAVE');
  52:     // FMT sub-chunk
  53:     writeUTFBytes(view, 12, 'fmt ');
  54:     view.setUint32(16, 16, true);
  55:     view.setUint16(20, 1, true);
  56:     // stereo (2 channels)
  57:     view.setUint16(22, 2, true);
  58:     view.setUint32(24, sampleRate, true);
  59:     view.setUint32(28, sampleRate * 4, true);
  60:     view.setUint16(32, 4, true);
  61:     view.setUint16(34, 16, true);
  62:     // data sub-chunk
  63:     writeUTFBytes(view, 36, 'data');
  64:     view.setUint32(40, interleaved.length * 2, true);
  65:  
  66:     // write the PCM samples
  67:     var index = 44;
  68:     var volume = 1;
  69:  
  70:     for (var i = 0; i < interleaved.length; i++)
  71:     {
  72:         view.setInt16(index, interleaved[i] * (0x7FFF * volume), true);
  73:         index += 2;
  74:     }
  75:  
  76:     // our final binary blob
  77:     var blob = new Blob([view], { type: 'audio/wav' });
  78:  
  79:     var reader = new FileReader();
  80:     reader.onloadend = function ()
  81:     {
  82:         var url = reader.result.replace('data:audio/wav;base64,', '');
  83:         elm.process(url);
  84:     };
  85:     reader.readAsDataURL(blob);
  86: }
  87:  
  88: function interleave(leftChannel, rightChannel)
  89: {
  90:     var length = leftChannel.length + rightChannel.length;
  91:     var result = new Float32Array(length);
  92:     var inputIndex = 0;
  93:  
  94:     for (var index = 0; index < length;)
  95:     {
  96:         result[index++] = leftChannel[inputIndex];
  97:         result[index++] = rightChannel[inputIndex];
  98:         inputIndex++;
  99:     }
 100:  
 101:     return result;
 102: }
 103:  
 104: function mergeBuffers(channelBuffer, recordingLength)
 105: {
 106:     var result = new Float32Array(recordingLength);
 107:     var offset = 0;
 108:  
 109:     for (var i = 0; i < channelBuffer.length; i++)
 110:     {
 111:         var buffer = channelBuffer[i];
 112:         result.set(buffer, offset);
 113:         offset += buffer.length;
 114:     }
 115:  
 116:     return result;
 117: }
 118:  
 119: function writeUTFBytes(view, offset, string)
 120: {
 121:     for (var i = 0; i < string.length; i++)
 122:     {
 123:         view.setUint8(offset + i, string.charCodeAt(i));
 124:     }
 125: }
 126:  
 127: function onFailure(e)
 128: {
 129:     alert('Error capturing audio.');
 130: }
 131:  
 132: function onSuccess(e)
 133: {
 134:     // creates the audio context
 135:     audioContext = (window.AudioContext || window.webkitAudioContext);
 136:     context = new audioContext();
 137:  
 138:     // creates a gain node
 139:     volume = context.createGain();
 140:  
 141:     // creates an audio node from the microphone incoming stream
 142:     audioInput = context.createMediaStreamSource(e);
 143:  
 144:     // connect the stream to the gain node
 145:     audioInput.connect(volume);
 146:  
 147:     /* From the spec: This value controls how frequently the audioprocess event is 
 148:     dispatched and how many sample-frames need to be processed each call. 
 149:     Lower values for buffer size will result in a lower (better) latency. 
 150:     Higher values will be necessary to avoid audio breakup and glitches */
 151:     var bufferSize = 2048;
 152:  
 153:     recorder = context.createScriptProcessor(bufferSize, 2, 2);
 154:     recorder.onaudioprocess = function (e)
 155:     {
 156:         if (recording == false)
 157:         {
 158:             return;
 159:         }
 160:  
 161:         var left = e.inputBuffer.getChannelData(0);
 162:         var right = e.inputBuffer.getChannelData(1);
 163:  
 164:         // we clone the samples
 165:         leftchannel.push(new Float32Array(left));
 166:         rightchannel.push(new Float32Array(right));
 167:  
 168:         recordingLength += bufferSize;
 169:     }
 170:  
 171:     // we connect the recorder
 172:     volume.connect(recorder);
 173:     recorder.connect(context.destination);
 174: }

OK, let’s move on the the provider implementations; first, Desktop:

   1: public class DesktopSpeechRecognitionProvider : SpeechRecognitionProvider
   2: {
   3:     public DesktopSpeechRecognitionProvider(SpeechRecognition recognition) : base(recognition)
   4:     {
   5:     }
   6:  
   7:     public override SpeechRecognitionResult RecognizeFromWav(String filename)
   8:     {
   9:         var engine = null as SpeechRecognitionEngine;
  10:  
  11:         if (String.IsNullOrWhiteSpace(this.Recognition.Culture) == true)
  12:         {
  13:             engine = new SpeechRecognitionEngine();
  14:         }
  15:         else
  16:         {
  17:             engine = new SpeechRecognitionEngine(CultureInfo.CreateSpecificCulture(this.Recognition.Culture));
  18:         }
  19:  
  20:         using (engine)
  21:         {
  22:             if ((this.Recognition.Grammars.Any() == false) && (this.Recognition.Choices.Any() == false))
  23:             {
  24:                 engine.LoadGrammar(new DictationGrammar());
  25:             }
  26:  
  27:             foreach (var grammar in this.Recognition.Grammars)
  28:             {
  29:                 var doc = new SrgsDocument(Path.Combine(HttpRuntime.AppDomainAppPath, grammar));
  30:                 engine.LoadGrammar(new Grammar(doc));
  31:             }
  32:  
  33:             if (this.Recognition.Choices.Any() == true)
  34:             {
  35:                 var choices = new Choices(this.Recognition.Choices.ToArray());
  36:                 engine.LoadGrammar(new Grammar(choices));
  37:             }
  38:  
  39:             engine.SetInputToWaveFile(filename);
  40:  
  41:             var result = engine.Recognize();
  42:  
  43:             return ((result != null) ? new SpeechRecognitionResult(result.Text, result.Alternates.Select(x => x.Text).ToArray()) : null);
  44:         }
  45:     }
  46: }

What this provider does is simply receive the location of a .WAV file and feed it to SpeechRecognitionEngine, together with some parameters of SpeechRecognition (Culture, AudioRate, Grammars and Choices)

Finally, the code for the Server (Microsoft Speech Platform Software Development Kit) version:

   1: public class ServerSpeechRecognitionProvider : SpeechRecognitionProvider
   2: {
   3:     public ServerSpeechRecognitionProvider(SpeechRecognition recognition) : base(recognition)
   4:     {
   5:     }
   6:  
   7:     public override SpeechRecognitionResult RecognizeFromWav(String filename)
   8:     {
   9:         var engine = null as SpeechRecognitionEngine;
  10:  
  11:         if (String.IsNullOrWhiteSpace(this.Recognition.Culture) == true)
  12:         {
  13:             engine = new SpeechRecognitionEngine();
  14:         }
  15:         else
  16:         {
  17:             engine = new SpeechRecognitionEngine(CultureInfo.CreateSpecificCulture(this.Recognition.Culture));
  18:         }
  19:  
  20:         using (engine)
  21:         {
  22:             foreach (var grammar in this.Recognition.Grammars)
  23:             {
  24:                 var doc = new SrgsDocument(Path.Combine(HttpRuntime.AppDomainAppPath, grammar));
  25:                 engine.LoadGrammar(new Grammar(doc));
  26:             }
  27:  
  28:             if (this.Recognition.Choices.Any() == true)
  29:             {
  30:                 var choices = new Choices(this.Recognition.Choices.ToArray());
  31:                 engine.LoadGrammar(new Grammar(choices));
  32:             }
  33:  
  34:             engine.SetInputToWaveFile(filename);
  35:  
  36:             var result = engine.Recognize();
  37:  
  38:             return ((result != null) ? new SpeechRecognitionResult(result.Text, result.Alternates.Select(x => x.Text).ToArray()) : null);
  39:         }
  40:     }
  41: }

As you can see, it is very similar to the Desktop one. Keep in mind, however, that for this provider to work you will have to download the Microsoft Speech Platform SDK, the Microsoft Speech Platform Runtime and at least one language from the Language Pack.

Here is a sample markup declaration:

   1: <web:SpeechRecognition runat="server" ID="processor" ClientIDMode="Static" Mode="Desktop" Culture="en-US" OnSpeechRecognized="OnSpeechRecognized" OnClientSpeechRecognized="onSpeechRecognized" />

If you want to add specific choices, add the Choices attribute to the control declaration and separate the values by commas:

   1: Choices="One, Two, Three"

Or add a grammar file:

   1: Grammars="~/MyGrammar.grxml"

By the way, grammars are not so difficult to create, you can find a good documentation in MSDN: http://msdn.microsoft.com/en-us/library/ee800145.aspx.

To finalize, a sample JavaScript for starting recognition and receiving the results:

   1: <script type="text/javascript">
   1:  
   2:     
   3:     function onSpeechRecognized(result)
   4:     {
   5:         window.alert('Recognized: ' + result.Text + '\nAlternatives: ' + String.join(', ', result.Alternatives));
   6:     }
   7:  
   8:     function start()
   9:     {
  10:         document.getElementById('processor').startRecording();
  11:     }
  12:  
  13:     function stop()
  14:     {
  15:         document.getElementById('processor').stopRecording();
  16:     }
  17:  
</script>

And that’s it. You start recognition by calling startRecording(), get results in onSpeechRecognized() (or any other function set in the OnClientSpeechRecognized property) and stop recording with stopRecording(). The values passed to onSpeechRecognized() are those that may have been filtered by the server-side SpeechRecognized event handler.

A final word of advisory: because generated sound files may become very large, do keep the recordings as short as possible.

Of course, this offers several different possibilities, I am looking forward to hearing them from you! Winking smile

                             

3 Comments

Add a Comment

As it will appear on the website

Not displayed

Your website