Speech Recognition in ASP.NET
Speech synthesis and recognition were both introduced in .NET 3.0, and both live in System.Speech.dll. In the past I talked about speech synthesis in the context of ASP.NET Web Forms applications; this time, I'm going to talk about speech recognition.
.NET has in fact two APIs for that:
- System.Speech, which comes with the core framework;
- Microsoft.Speech, offering a somewhat similar API, but requiring a separate download: the Microsoft Speech Platform Software Development Kit.
I am going to demonstrate a technique that makes use of HTML5 features, namely Data URIs and the getUserMedia API, as well as ASP.NET Client Callbacks, which, if you have been following my blog, you know I am a big fan of.
First, because we have two APIs that we can use, let’s start by creating an abstract base provider class:
public abstract class SpeechRecognitionProvider : IDisposable
{
    protected SpeechRecognitionProvider(SpeechRecognition recognition)
    {
        this.Recognition = recognition;
    }

    ~SpeechRecognitionProvider()
    {
        this.Dispose(false);
    }

    public abstract SpeechRecognitionResult RecognizeFromWav(String filename);

    protected SpeechRecognition Recognition
    {
        get;
        private set;
    }

    protected virtual void Dispose(Boolean disposing)
    {
    }

    void IDisposable.Dispose()
    {
        GC.SuppressFinalize(this);
        this.Dispose(true);
    }
}
It basically features one method, RecognizeFromWav, which takes the physical path of a .WAV file and returns a SpeechRecognitionResult (code coming next). For completeness, it also implements the Dispose pattern, because some providers may require it.
In a moment we will be creating implementations for the built-in .NET provider as well as Microsoft Speech Platform.
The SpeechRecognition property refers to our Web Forms control, inheriting from HtmlGenericControl, which is the one that knows how to instantiate one provider or the other:
[ConstructorNeedsTag(false)]
public class SpeechRecognition : HtmlGenericControl, ICallbackEventHandler
{
    public SpeechRecognition() : base("span")
    {
        this.OnClientSpeechRecognized = String.Empty;
        this.Mode = SpeechRecognitionMode.Desktop;
        this.Culture = String.Empty;
        this.SampleRate = 44100;
        this.Grammars = new String[0];
        this.Choices = new String[0];
    }

    public event EventHandler<SpeechRecognizedEventArgs> SpeechRecognized;

    [DefaultValue("")]
    public String Culture
    {
        get;
        set;
    }

    [DefaultValue(SpeechRecognitionMode.Desktop)]
    public SpeechRecognitionMode Mode
    {
        get;
        set;
    }

    [DefaultValue("")]
    public String OnClientSpeechRecognized
    {
        get;
        set;
    }

    [DefaultValue(44100)]
    public UInt32 SampleRate
    {
        get;
        set;
    }

    [TypeConverter(typeof(StringArrayConverter))]
    [DefaultValue("")]
    public String[] Grammars
    {
        get;
        set;
    }

    [TypeConverter(typeof(StringArrayConverter))]
    [DefaultValue("")]
    public String[] Choices
    {
        get;
        set;
    }

    protected override void OnInit(EventArgs e)
    {
        if (this.Page.Items.Contains(typeof(SpeechRecognition)))
        {
            throw (new InvalidOperationException("There can be only one SpeechRecognition control on a page."));
        }

        // builds the client callback reference (which posts the recorded sound and receives the JSON result)
        // and a script that exposes startRecording, stopRecording and process on the control's client-side element
        var sm = ScriptManager.GetCurrent(this.Page);
        var reference = this.Page.ClientScript.GetCallbackEventReference(this, "sound", String.Format("function(result){{ {0}(JSON.parse(result)); }}", String.IsNullOrWhiteSpace(this.OnClientSpeechRecognized) ? "void" : this.OnClientSpeechRecognized), String.Empty, true);
        var script = String.Format("\nvar processor = document.getElementById('{0}'); processor.stopRecording = function(sampleRate) {{ window.stopRecording(processor, sampleRate ? sampleRate : 44100); }}; processor.startRecording = function() {{ window.startRecording(); }}; processor.process = function(sound){{ {1} }};\n", this.ClientID, reference);

        if (sm != null)
        {
            this.Page.ClientScript.RegisterStartupScript(this.GetType(), String.Concat("process", this.ClientID), String.Format("Sys.WebForms.PageRequestManager.getInstance().add_pageLoaded(function() {{ {0} }});\n", script), true);
        }
        else
        {
            this.Page.ClientScript.RegisterStartupScript(this.GetType(), String.Concat("process", this.ClientID), script, true);
        }

        this.Page.ClientScript.RegisterClientScriptResource(this.GetType(), String.Concat(this.GetType().Namespace, ".Script.js"));
        this.Page.Items[typeof(SpeechRecognition)] = this;

        base.OnInit(e);
    }

    protected virtual void OnSpeechRecognized(SpeechRecognizedEventArgs e)
    {
        var handler = this.SpeechRecognized;

        if (handler != null)
        {
            handler(this, e);
        }
    }

    protected SpeechRecognitionProvider GetProvider()
    {
        switch (this.Mode)
        {
            case SpeechRecognitionMode.Desktop:
                return (new DesktopSpeechRecognitionProvider(this));

            case SpeechRecognitionMode.Server:
                return (new ServerSpeechRecognitionProvider(this));
        }

        return (null);
    }

    #region ICallbackEventHandler Members

    String ICallbackEventHandler.GetCallbackResult()
    {
        // the speech engines do not get along with ASP.NET's synchronization context, so we replace it for this request
        AsyncOperationManager.SynchronizationContext = new SynchronizationContext();

        var filename = Path.GetTempFileName();
        var result = null as SpeechRecognitionResult;

        using (var engine = this.GetProvider())
        {
            // the callback argument (stored in RaiseCallbackEvent) is the Base64-encoded .WAV recorded on the client
            var data = this.Context.Items["data"].ToString();

            using (var file = File.OpenWrite(filename))
            {
                var bytes = Convert.FromBase64String(data);
                file.Write(bytes, 0, bytes.Length);
            }

            result = engine.RecognizeFromWav(filename) ?? new SpeechRecognitionResult(String.Empty);
        }

        File.Delete(filename);

        var args = new SpeechRecognizedEventArgs(result);

        this.OnSpeechRecognized(args);

        var json = new JavaScriptSerializer().Serialize(result);

        return (json);
    }

    void ICallbackEventHandler.RaiseCallbackEvent(String eventArgument)
    {
        this.Context.Items["data"] = eventArgument;
    }

    #endregion
}
SpeechRecognition implements ICallbackEventHandler for a self-contained AJAX experience; it registers a couple of JavaScript functions and also an embedded JavaScript file with some useful sound manipulation and conversion routines. Only one instance is allowed on a page. On the client side, this JavaScript uses getUserMedia to access an audio source and a clever mechanism to pack the recorded samples as a .WAV file in a Data URI. I got these functions from http://typedarray.org/from-microphone-to-wav-with-getusermedia-and-web-audio/ and made some changes to them. I like them because they don't require any external library, which makes all this pretty much self-contained.
The control exposes some custom properties:
- Culture: an optional culture name, such as “pt-PT” or “en-US”; if not specified, it defaults to the current culture on the server machine;
- Mode: one of the two providers: Desktop (for System.Speech) or Server (for Microsoft.Speech, from the Microsoft Speech Platform);
- OnClientSpeechRecognized: the name of a client-side JavaScript function that will be called when there are results (more on this later);
- SampleRate: the recording sample rate; by default, it is 44100;
- Grammars: an optional collection of additional grammar files, with extension .grxml (Speech Recognition Grammar Specification), to add to the engine;
- Choices: an optional collection of choices to recognize, if we want to restrict the scope, such as “yes”/”no”, “red”/”green”, etc.
The mode enumeration looks like this:
public enum SpeechRecognitionMode
{
    Desktop,
    Server
}
The SpeechRecognition control also has an event, SpeechRecognized, which allows overriding the detected phrases. Its argument is this simple class that follows the regular .NET event pattern:
[Serializable]
public sealed class SpeechRecognizedEventArgs : EventArgs
{
    public SpeechRecognizedEventArgs(SpeechRecognitionResult result)
    {
        this.Result = result;
    }

    public SpeechRecognitionResult Result
    {
        get;
        private set;
    }
}
Which in turn holds a SpeechRecognitionResult:
public class SpeechRecognitionResult
{
    public SpeechRecognitionResult(String text, params String[] alternates)
    {
        this.Text = text;
        this.Alternates = alternates.ToList();
    }

    public String Text
    {
        get;
        set;
    }

    public List<String> Alternates
    {
        get;
        private set;
    }
}
This class holds the phrase that the speech recognition engine understood, plus an array of additional alternatives, in descending order of confidence.
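Because the result is sent back to the browser through the client callback, it is worth seeing what it looks like once GetCallbackResult runs it through JavaScriptSerializer. A quick sketch (the recognized phrases are, of course, made up):
// JavaScriptSerializer lives in System.Web.Script.Serialization (System.Web.Extensions.dll)
var result = new SpeechRecognitionResult("open the door", "open the doors", "hope for the door");
var json = new JavaScriptSerializer().Serialize(result);
// json will look something like:
// {"Text":"open the door","Alternates":["open the doors","hope for the door"]}
This is the object that the client-side callback function receives after JSON.parse, so on the browser side you read result.Text and result.Alternates.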
The JavaScript file containing the utility functions is embedded in the assembly (its Build Action set to Embedded Resource). You need to add an assembly-level attribute, WebResourceAttribute, possibly in AssemblyInfo.cs, of course replacing MyNamespace with your assembly's default namespace:
[assembly: WebResource("MyNamespace.Script.js", "text/javascript")]
This attribute registers a script file with some content-type so that it can be included in a page by the RegisterClientScriptResource method.
And here it is:
// variables
var leftchannel = [];
var rightchannel = [];
var recorder = null;
var recording = false;
var recordingLength = 0;
var volume = null;
var audioInput = null;
var audioContext = null;
var context = null;

// feature detection
navigator.getUserMedia = navigator.getUserMedia || navigator.webkitGetUserMedia || navigator.mozGetUserMedia || navigator.msGetUserMedia;

if (navigator.getUserMedia)
{
    navigator.getUserMedia({ audio: true }, onSuccess, onFailure);
}
else
{
    alert('getUserMedia not supported in this browser.');
}

function startRecording()
{
    recording = true;
    // reset the buffers for the new recording
    leftchannel.length = rightchannel.length = 0;
    recordingLength = 0;
    leftchannel = [];
    rightchannel = [];
}

function stopRecording(elm, sampleRate)
{
    recording = false;

    // we flat the left and right channels down
    var leftBuffer = mergeBuffers(leftchannel, recordingLength);
    var rightBuffer = mergeBuffers(rightchannel, recordingLength);
    // we interleave both channels together
    var interleaved = interleave(leftBuffer, rightBuffer);

    // we create our wav file
    var buffer = new ArrayBuffer(44 + interleaved.length * 2);
    var view = new DataView(buffer);

    // RIFF chunk descriptor
    writeUTFBytes(view, 0, 'RIFF');
    view.setUint32(4, 44 + interleaved.length * 2, true);
    writeUTFBytes(view, 8, 'WAVE');
    // FMT sub-chunk
    writeUTFBytes(view, 12, 'fmt ');
    view.setUint32(16, 16, true);
    view.setUint16(20, 1, true);
    // stereo (2 channels)
    view.setUint16(22, 2, true);
    view.setUint32(24, sampleRate, true);
    view.setUint32(28, sampleRate * 4, true);
    view.setUint16(32, 4, true);
    view.setUint16(34, 16, true);
    // data sub-chunk
    writeUTFBytes(view, 36, 'data');
    view.setUint32(40, interleaved.length * 2, true);

    // write the PCM samples
    var index = 44;
    var volume = 1;

    for (var i = 0; i < interleaved.length; i++)
    {
        view.setInt16(index, interleaved[i] * (0x7FFF * volume), true);
        index += 2;
    }

    // our final binary blob
    var blob = new Blob([view], { type: 'audio/wav' });

    var reader = new FileReader();
    reader.onloadend = function ()
    {
        var url = reader.result.replace('data:audio/wav;base64,', '');
        elm.process(url);
    };
    reader.readAsDataURL(blob);
}

function interleave(leftChannel, rightChannel)
{
    var length = leftChannel.length + rightChannel.length;
    var result = new Float32Array(length);
    var inputIndex = 0;

    for (var index = 0; index < length;)
    {
        result[index++] = leftChannel[inputIndex];
        result[index++] = rightChannel[inputIndex];
        inputIndex++;
    }

    return result;
}

function mergeBuffers(channelBuffer, recordingLength)
{
    var result = new Float32Array(recordingLength);
    var offset = 0;

    for (var i = 0; i < channelBuffer.length; i++)
    {
        var buffer = channelBuffer[i];
        result.set(buffer, offset);
        offset += buffer.length;
    }

    return result;
}

function writeUTFBytes(view, offset, string)
{
    for (var i = 0; i < string.length; i++)
    {
        view.setUint8(offset + i, string.charCodeAt(i));
    }
}

function onFailure(e)
{
    alert('Error capturing audio.');
}

function onSuccess(e)
{
    // creates the audio context
    audioContext = (window.AudioContext || window.webkitAudioContext);
    context = new audioContext();

    // creates a gain node
    volume = context.createGain();

    // creates an audio node from the microphone incoming stream
    audioInput = context.createMediaStreamSource(e);

    // connect the stream to the gain node
    audioInput.connect(volume);

    /* From the spec: This value controls how frequently the audioprocess event is
       dispatched and how many sample-frames need to be processed each call.
       Lower values for buffer size will result in a lower (better) latency.
       Higher values will be necessary to avoid audio breakup and glitches */
    var bufferSize = 2048;

    recorder = context.createScriptProcessor(bufferSize, 2, 2);
    recorder.onaudioprocess = function (e)
    {
        if (recording == false)
        {
            return;
        }

        var left = e.inputBuffer.getChannelData(0);
        var right = e.inputBuffer.getChannelData(1);

        // we clone the samples
        leftchannel.push(new Float32Array(left));
        rightchannel.push(new Float32Array(right));

        recordingLength += bufferSize;
    }

    // we connect the recorder
    volume.connect(recorder);
    recorder.connect(context.destination);
}
OK, let's move on to the provider implementations; first, the Desktop one:
public class DesktopSpeechRecognitionProvider : SpeechRecognitionProvider
{
    public DesktopSpeechRecognitionProvider(SpeechRecognition recognition) : base(recognition)
    {
    }

    public override SpeechRecognitionResult RecognizeFromWav(String filename)
    {
        var engine = null as SpeechRecognitionEngine;

        if (String.IsNullOrWhiteSpace(this.Recognition.Culture) == true)
        {
            engine = new SpeechRecognitionEngine();
        }
        else
        {
            engine = new SpeechRecognitionEngine(CultureInfo.CreateSpecificCulture(this.Recognition.Culture));
        }

        using (engine)
        {
            if ((this.Recognition.Grammars.Any() == false) && (this.Recognition.Choices.Any() == false))
            {
                engine.LoadGrammar(new DictationGrammar());
            }

            foreach (var grammar in this.Recognition.Grammars)
            {
                var doc = new SrgsDocument(Path.Combine(HttpRuntime.AppDomainAppPath, grammar));
                engine.LoadGrammar(new Grammar(doc));
            }

            if (this.Recognition.Choices.Any() == true)
            {
                var choices = new Choices(this.Recognition.Choices.ToArray());
                engine.LoadGrammar(new Grammar(choices));
            }

            engine.SetInputToWaveFile(filename);

            var result = engine.Recognize();

            return ((result != null) ? new SpeechRecognitionResult(result.Text, result.Alternates.Select(x => x.Text).ToArray()) : null);
        }
    }
}
What this provider does is simply receive the location of a .WAV file and feed it to a SpeechRecognitionEngine, configured with some of SpeechRecognition's properties (Culture, Grammars and Choices).
Finally, the code for the Server (Microsoft Speech Platform Software Development Kit) version:
public class ServerSpeechRecognitionProvider : SpeechRecognitionProvider
{
    public ServerSpeechRecognitionProvider(SpeechRecognition recognition) : base(recognition)
    {
    }

    public override SpeechRecognitionResult RecognizeFromWav(String filename)
    {
        var engine = null as SpeechRecognitionEngine;

        if (String.IsNullOrWhiteSpace(this.Recognition.Culture) == true)
        {
            engine = new SpeechRecognitionEngine();
        }
        else
        {
            engine = new SpeechRecognitionEngine(CultureInfo.CreateSpecificCulture(this.Recognition.Culture));
        }

        using (engine)
        {
            // note: unlike the Desktop engine, the server engine has no dictation grammar,
            // so at least one grammar file or set of choices must be supplied
            foreach (var grammar in this.Recognition.Grammars)
            {
                var doc = new SrgsDocument(Path.Combine(HttpRuntime.AppDomainAppPath, grammar));
                engine.LoadGrammar(new Grammar(doc));
            }

            if (this.Recognition.Choices.Any() == true)
            {
                var choices = new Choices(this.Recognition.Choices.ToArray());
                engine.LoadGrammar(new Grammar(choices));
            }

            engine.SetInputToWaveFile(filename);

            var result = engine.Recognize();

            return ((result != null) ? new SpeechRecognitionResult(result.Text, result.Alternates.Select(x => x.Text).ToArray()) : null);
        }
    }
}
As you can see, it is very similar to the Desktop one. Keep in mind, however, that for this provider to work you will have to download the Microsoft Speech Platform SDK, the Microsoft Speech Platform Runtime and at least one language from the Language Pack.
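To check which recognizers and languages are actually available on the server, you can enumerate them; a quick sketch (it works with either API, depending on whether you import System.Speech.Recognition or Microsoft.Speech.Recognition):
// lists the speech recognizers available to the current process, with their cultures
foreach (var info in SpeechRecognitionEngine.InstalledRecognizers())
{
    Console.WriteLine("{0}: {1} ({2})", info.Id, info.Description, info.Culture);
}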
Here is a sample markup declaration:
<web:SpeechRecognition runat="server" ID="processor" ClientIDMode="Static" Mode="Desktop" Culture="en-US" OnSpeechRecognized="OnSpeechRecognized" OnClientSpeechRecognized="onSpeechRecognized" />
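The OnSpeechRecognized attribute points to the server-side event handler, which runs before the result is serialized back to the client and can therefore change it; a minimal sketch of such a handler in the page's code-behind (the filtering logic is just an illustration) might be:
protected void OnSpeechRecognized(Object sender, SpeechRecognizedEventArgs e)
{
    // hypothetical filtering: normalize the recognized text and drop empty alternates
    e.Result.Text = (e.Result.Text ?? String.Empty).Trim().ToLowerInvariant();
    e.Result.Alternates.RemoveAll(x => String.IsNullOrWhiteSpace(x));
}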
If you want to add specific choices, add the Choices attribute to the control declaration and separate the values by commas:
Choices="One, Two, Three"
Or add a grammar file:
Grammars="~/MyGrammar.grxml"
By the way, grammars are not so difficult to create; you can find good documentation on MSDN: http://msdn.microsoft.com/en-us/library/ee800145.aspx.
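If you would rather build a grammar in code and save it as a .grxml file, a rough sketch using the SRGS classes (in the System.Speech.Recognition.SrgsGrammar namespace; the rule contents and file name are just examples) could look like this:
// builds a tiny grammar that only recognizes a few color names and writes it out as .grxml
var colors = new SrgsOneOf("red", "green", "blue");
var rule = new SrgsRule("colors", colors);

var document = new SrgsDocument();
document.Rules.Add(rule);
document.Root = rule;

using (var writer = XmlWriter.Create(Path.Combine(HttpRuntime.AppDomainAppPath, "MyGrammar.grxml")))
{
    document.WriteSrgs(writer);
}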
To finalize, a sample JavaScript for starting recognition and receiving the results:
<script type="text/javascript">

    function onSpeechRecognized(result)
    {
        window.alert('Recognized: ' + result.Text + '\nAlternatives: ' + result.Alternates.join(', '));
    }

    function start()
    {
        document.getElementById('processor').startRecording();
    }

    function stop()
    {
        document.getElementById('processor').stopRecording();
    }

</script>
And that’s it. You start recognition by calling startRecording(), get results in onSpeechRecognized() (or any other function set in the OnClientSpeechRecognized property) and stop recording with stopRecording(). The values passed to onSpeechRecognized() are those that may have been filtered by the server-side SpeechRecognized event handler.
A final word of advice: because the generated sound files may become quite large, do keep the recordings as short as possible.
Of course, this opens up several different possibilities, and I am looking forward to hearing about them from you!