Speech Synthesis with ASP.NET and HTML5
The .NET Framework includes the SpeechSynthesizer class, which gives access to the Windows speech synthesis engine. The problem for web applications is, of course, that this class runs on the server. Because I wanted a way to trigger speech synthesis (text-to-speech) from JavaScript, without requiring any plugin, I decided to implement one myself.
Once again, I will be using client callbacks, my favorite out-of-the-box ASP.NET AJAX technique. I will also be using HTML5’s AUDIO tag and Data URIs. What I’m going to do is:
- Set up a control that renders an AUDIO tag;
- Add to it a JavaScript function that takes a string parameter and causes a client callback;
- Generate a voice sound from the passed text parameter on the server and save it into an in-memory stream;
- Convert the stream’s contents to a Data URI;
- Return the generated Data URI to the client as the response to the client callback.
Of course, all of this works cross-browser, provided your browser supports the AUDIO tag and Data URIs, which all modern browsers do.
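If you want to check for browser support up front, a minimal sketch could look like the following. The supportsWavAudio name is my own invention, and the check only covers the AUDIO tag and WAV playback (the format the control returns); it does not probe Data URI support directly, so take it as a rough guard rather than a complete test.

```javascript
// Hypothetical helper: returns true when the given document can create an
// AUDIO element that claims to be able to play WAV content.
function supportsWavAudio(doc) {
    var audio = doc.createElement('audio');

    // canPlayType returns "", "maybe" or "probably"; an empty string means "no".
    return typeof audio.canPlayType === 'function' &&
        audio.canPlayType('audio/wav') !== '';
}

// In a page you would pass the real document, for example:
// if (!supportsWavAudio(document)) { /* hide the Speak button */ }
```

When the check fails, the simplest option is probably to hide or disable whatever triggers the speech.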
So, first of all, my markup looks like this:
<web:SpeechSynthesizer runat="server" ID="synthesizer" Ssml="false" VoiceName="Microsoft Anna" Age="Adult" Gender="Male" Culture="en-US" Rate="0" Volume="100" />
As you can see, the SpeechSynthesizer control features a few optional properties:
- Age: the age for the generated voice (default is the one of the system’s default language);
- Gender: gender of the generated voice (same default as per Age);
- Culture: the culture of the generated voice (system default);
- Rate: the speaking rate, from -10 (slowest) to 10 (fastest), where the default is 0 (normal rate);
- Ssml: if the text is to be considered SSML or not (default is false);
- Volume: the volume, as a percentage between 0 and 100 (the default is 100);
- VoiceName: the name of a voice that is installed on the server machine.
The Age, Gender and Culture properties on one side and VoiceName on the other are mutually exclusive: you specify either the hints or a voice name (if both are set, the code below uses VoiceName). If you want to know the voices installed on your machine, have a look at the GetInstalledVoices method. If no property is specified, the speech will be synthesized with the operating system’s default voice (Microsoft Anna on Windows 7; Microsoft David, Hazel and Zira on Windows 8; and so on). By the way, you can get additional voices, either commercially or for free; just search for them.
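A note on the Ssml property: when it is set to true, the text you pass from the client has to be a complete SSML document rather than plain text. Here is a minimal sketch of a client-side helper that wraps plain text in a speak element; the toSsml name and its parameters are hypothetical, not part of the control. Also keep in mind that, depending on your site's configuration, ASP.NET request validation may reject posted markup.

```javascript
// Hypothetical helper: wraps plain text in a minimal SSML document.
// "lang" is a culture name such as "en-US".
function toSsml(text, lang) {
    // Escape the characters that would otherwise break the XML.
    var escaped = String(text)
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;');

    return '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="' +
        lang + '">' + escaped + '</speak>';
}

// In a page, with Ssml="true" on the control:
// document.getElementById('synthesizer').speak(toSsml('Hello, world', 'en-US'));
```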
Without further delay, here is the code:
using System;
using System.ComponentModel;
using System.Globalization;
using System.IO;
using System.Speech.Synthesis;
using System.Threading;
using System.Web.UI;
using System.Web.UI.HtmlControls;

[ConstructorNeedsTag(false)]
public class SpeechSynthesizer : HtmlGenericControl, ICallbackEventHandler
{
    // The engine that does the actual work; fully qualified because the control shares its name.
    private readonly System.Speech.Synthesis.SpeechSynthesizer synth = new System.Speech.Synthesis.SpeechSynthesizer();

    public SpeechSynthesizer() : base("audio")
    {
        this.Age = VoiceAge.NotSet;
        this.Gender = VoiceGender.NotSet;
        this.Culture = CultureInfo.CurrentCulture;
        this.VoiceName = String.Empty;
        this.Ssml = false;
        // DefaultValue is only designer metadata, so the real defaults are set here.
        this.Volume = 100;
        this.Rate = 0;
    }

    [DefaultValue("")]
    public String VoiceName { get; set; }

    [DefaultValue(100)]
    public Int32 Volume { get; set; }

    [DefaultValue(0)]
    public Int32 Rate { get; set; }

    [TypeConverter(typeof(CultureInfoConverter))]
    public CultureInfo Culture { get; set; }

    [DefaultValue(VoiceGender.NotSet)]
    public VoiceGender Gender { get; set; }

    [DefaultValue(VoiceAge.NotSet)]
    public VoiceAge Age { get; set; }

    [DefaultValue(false)]
    public Boolean Ssml { get; set; }

    protected override void OnInit(EventArgs e)
    {
        AsyncOperationManager.SynchronizationContext = new SynchronizationContext();

        var sm = ScriptManager.GetCurrent(this.Page);
        // Client callback: sends "text" to the server, then plays the returned Data URI.
        var reference = this.Page.ClientScript.GetCallbackEventReference(this, "text", String.Format("function(result){{ document.getElementById('{0}').src = result; document.getElementById('{0}').play(); }}", this.ClientID), String.Empty, true);
        // Attaches a speak(text) function to the rendered AUDIO element.
        var script = String.Format("\ndocument.getElementById('{0}').speak = function(text){{ {1} }};\n", this.ClientID, reference);

        if (sm != null)
        {
            // With a ScriptManager on the page, re-attach speak() after every page load, partial updates included.
            this.Page.ClientScript.RegisterStartupScript(this.GetType(), String.Concat("speak", this.ClientID), String.Format("Sys.WebForms.PageRequestManager.getInstance().add_pageLoaded(function() {{ {0} }});\n", script), true);
        }
        else
        {
            this.Page.ClientScript.RegisterStartupScript(this.GetType(), String.Concat("speak", this.ClientID), script, true);
        }

        base.OnInit(e);
    }

    protected override void OnPreRender(EventArgs e)
    {
        // Strip any attributes set in markup; the element is only a hidden player.
        this.Attributes.Remove("class");
        this.Attributes.Remove("src");
        this.Attributes.Remove("preload");
        this.Attributes.Remove("loop");
        this.Attributes.Remove("autoplay");
        this.Attributes.Remove("controls");

        this.Style[HtmlTextWriterStyle.Display] = "none";
        this.Style[HtmlTextWriterStyle.Visibility] = "hidden";

        base.OnPreRender(e);
    }

    public override void Dispose()
    {
        this.synth.Dispose();

        base.Dispose();
    }

    #region ICallbackEventHandler Members

    String ICallbackEventHandler.GetCallbackResult()
    {
        using (var stream = new MemoryStream())
        {
            this.synth.Rate = this.Rate;
            this.synth.Volume = this.Volume;
            this.synth.SetOutputToWaveStream(stream);

            // An explicit voice name takes precedence over the age/gender/culture hints.
            if (String.IsNullOrWhiteSpace(this.VoiceName) == false)
            {
                this.synth.SelectVoice(this.VoiceName);
            }
            else
            {
                this.synth.SelectVoiceByHints(this.Gender, this.Age, 0, this.Culture);
            }

            if (this.Ssml == false)
            {
                this.synth.Speak(this.Context.Items["data"] as String);
            }
            else
            {
                this.synth.SpeakSsml(this.Context.Items["data"] as String);
            }

            // Return the WAV bytes as a Base64 Data URI.
            return (String.Concat("data:audio/wav;base64,", Convert.ToBase64String(stream.ToArray())));
        }
    }

    void ICallbackEventHandler.RaiseCallbackEvent(String eventArgument)
    {
        // Keep the text sent by the client until GetCallbackResult runs.
        this.Context.Items["data"] = eventArgument;
    }

    #endregion
}
As you can see, the SpeechSynthesizer control inherits from HtmlGenericControl, the simplest out-of-the-box class that lets me render my tag of choice (in this case, AUDIO). Because the control passes the tag name to the base constructor itself, the class has to be decorated with a ConstructorNeedsTagAttribute set to false, but you don’t have to worry about it. The control implements ICallbackEventHandler for the client callback mechanism. I make sure that all of AUDIO’s attributes are removed from the output, because I don’t want them around, and I hide the element with CSS.
Inside of it, I have an instance of the System.Speech.Synthesis.SpeechSynthesizer class, the one that will be used to do the actual work. Because this class is disposable, I make sure it is disposed at the end of the control’s life cycle. Depending on whether VoiceName is set, I call either the SelectVoice or the SelectVoiceByHints method. One thing to note: we need to set up a synchronization context, because the SpeechSynthesizer works asynchronously, so that we can wait for its result.
The generated sound will be output to an in-memory buffer and then converted into a WAV Data URI, which is basically a Base64-encoded string prefixed with its MIME type.
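To make the Data URI format concrete, here is a small JavaScript sketch that builds one from raw bytes and splits it apart again. The toDataUri and parseDataUri names are hypothetical helpers for illustration only; the control itself does the encoding on the server with Convert.ToBase64String.

```javascript
// Hypothetical helper: builds "data:<MIME type>;base64,<payload>" from raw bytes.
function toDataUri(bytes, mimeType) {
    var binary = '';
    for (var i = 0; i < bytes.length; i++) {
        binary += String.fromCharCode(bytes[i]);
    }
    return 'data:' + mimeType + ';base64,' + btoa(binary);
}

// Hypothetical helper: splits a Base64 Data URI into its MIME type and bytes.
function parseDataUri(uri) {
    var match = /^data:([^;,]+);base64,(.*)$/.exec(uri);
    if (!match) {
        throw new Error('Not a Base64 Data URI');
    }
    var binary = atob(match[2]);
    var bytes = new Uint8Array(binary.length);
    for (var i = 0; i < binary.length; i++) {
        bytes[i] = binary.charCodeAt(i);
    }
    return { mimeType: match[1], bytes: bytes };
}

// Every WAV file starts with the four ASCII characters "RIFF".
var riff = new Uint8Array([0x52, 0x49, 0x46, 0x46]);
var uri = toDataUri(riff, 'audio/wav');
console.log(uri); // data:audio/wav;base64,UklGRg==
```

Since Base64 grows the payload by about a third, this approach is best suited to short utterances.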
Finally, on the client side, all it takes is to set the returned Data URI as the AUDIO element's src property and call play(), and that's it.
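Written out as a standalone function, the client-side step might look like the sketch below. The playDataUri name and the prefix check are my own additions for illustration; the control generates an inline callback that simply assigns src and calls play().

```javascript
// Hypothetical helper mirroring what the generated callback does with the result.
function playDataUri(audio, dataUri) {
    // Guard against anything that is not an audio Data URI (an error message, for example).
    if (typeof dataUri !== 'string' || dataUri.indexOf('data:audio/') !== 0) {
        throw new Error('Expected an audio Data URI');
    }

    audio.src = dataUri;
    return audio.play();
}

// In a page, "result" being the string returned by the client callback:
// playDataUri(document.getElementById('synthesizer'), result);
```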
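A full markup example would be the following (note that the Register directive has to reference the assembly and namespace that actually contain the SpeechSynthesizer control shown above, so adjust it for your project):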
<%@ Register Assembly="System.Speech, Version=4.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35" Namespace="System.Speech" TagPrefix="web" %>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
    <script type="text/javascript">

        function onSpeak(text)
        {
            document.getElementById('synthesizer').speak(text);
        }

    </script>
</head>
<body>
    <form runat="server">
    <div>
        <web:SpeechSynthesizer runat="server" ID="synthesizer" Age="Adult" Gender="Male" Culture="en-US" Rate="0" Volume="100" />
        <input type="text" id="text" name="text"/>
        <input type="button" value="Speak" onclick="onSpeak(this.form.text.value)"/>
    </div>
    </form>
</body>
</html>
And that’s it! Have fun with speech on your web apps!