Enabling Voice Input into the Open Web and Firefox OS

With the advent of smartphones, triggered by the iPhone in 2007, touch became the primary mode of input for interacting with these devices. Now, with the rise of wearables (and other hands-free technologies that preceded them), voice is becoming another key input method. The range of experiences voice input enables is enormous, to say the least.

They go well beyond interacting with in-vehicle devices, accessories and wearables. Just think of the avenues voice input opens up for bringing more technologies to more people: accessibility, literacy, gaming, VR, and the list goes on. There is a social dimension here that resonates strongly with our mission at Mozilla, detailed in the Mozilla Manifesto.

How it started

Both of today’s leading mobile OS/ecosystem providers, Apple and Google, have their native experiences with Siri and “OK Google” (coupled with Google Now). We really needed an effort to enable voice input in the first ecosystem that existed: the open Web. Around MWC 2013 in Barcelona, Desigan Chinniah introduced me to André Natal, a Firefox contributor from Brazil. We had a conversation about this and instantly agreed to do something about it in whatever way possible. André told me he had been inspired by a talk Brendan Eich gave at BrazilJS, so I did not have much convincing to do. :-)

First steps

We had numerous calls and meetings over the past year on the approach and tactics. Since “code wins arguments”, the basic work started in parallel on Firefox desktop and Firefox OS Unagi devices, later switching to Mozilla Flame devices. Along the way, we had several meetings with Mozilla engineering leads on the exact approach and decided to break this effort into several smaller phases (“baby steps”).

The first target was getting the Web Speech API implemented, integrating acoustic/language models with a decoder, and giving that a try. Lots of like-minded folks in Mozilla Engineering/QA and the community helped with guidance and code reviews, while André moonlighted (on top of his day job) with very high focus. Things have moved fast in the past month or so. (Well, to be honest, the only day this effort slowed down was when Team Brazil lost to Germany in the 2014 FIFA World Cup. :-)) Full credit to André for his hard work!

Where are we?

Our current thinking is to get a grammar-based (limited commands) app working first and distribute it within our rich and diverse international Mozilla community for accent-based testing and enhancements. Once we have this stabilized, we will move into phase 2, where we can focus more on natural language processing and get closer to a virtual assistant experience that can give users voice-based answers sometime in the future. There is a lot of work to do there, and we are just beginning.
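To make the “grammar-based (limited commands)” idea concrete, here is a small illustrative sketch of a JSGF command grammar of the sort the Web Speech API’s SpeechGrammarList is designed to accept. The buildCommandGrammar helper and the command list are my own examples, not the project’s actual code:

```javascript
// Build a small JSGF grammar string for a fixed set of voice commands.
// Illustrative sketch only; the grammars used by the project may differ.
function buildCommandGrammar(name, commands) {
  return '#JSGF V1.0; grammar ' + name + '; public <command> = ' +
         commands.join(' | ') + ' ;';
}

var grammar = buildCommandGrammar('phone', ['call', 'dial', 'answer', 'hang up']);
console.log(grammar);

// In a browser implementing the API, this grammar would be registered like:
//   var list = new SpeechGrammarList();
//   list.addFromString(grammar, 1.0);
//   recognition.grammars = list;
```

Restricting recognition to a small command grammar like this is what makes accent-based testing across the community tractable before tackling open-ended natural language.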

I will save the rest of the details for later and jump to the current status this month. Where are we so far?

We now have the Web Speech API ready for testing, and we have a couple of demos for you to see!
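If you want to poke at the API yourself once it lands in your build, a minimal, feature-detected setup might look like the sketch below. The createRecognizer helper is a hypothetical example of mine; the property names (lang, continuous, onresult) follow the W3C Web Speech API draft:

```javascript
// Minimal, feature-detected Web Speech API recognition sketch.
// `global` is passed in so the wiring can be exercised outside a browser;
// in a page you would simply call createRecognizer(window).
function createRecognizer(global) {
  var SR = global.SpeechRecognition || global.webkitSpeechRecognition;
  if (!SR) {
    return null; // API not available in this environment
  }
  var recognition = new SR();
  recognition.lang = 'en-US';     // language to recognize
  recognition.continuous = false; // stop after a single result
  recognition.onresult = function (event) {
    // Transcript of the best alternative of the first result.
    console.log('Heard: ' + event.results[0][0].transcript);
  };
  return recognition;
}

// In a browser:
//   var rec = createRecognizer(window);
//   if (rec) rec.start();
```

The feature detection matters here: the API is still rolling out, so code should degrade gracefully where it is absent.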

Desktop: Firefox Nightly on Mac

Editor’s note: for full effect, start playing the two videos above at the same time.

Firefox OS demo

Come, join us!

If you want to follow along, please look at the SpeechRTC – Speech enabling the open web wiki and Bug 1032964 – Enabling Voice input in Firefox OS.

So jump in and help out if you can. We need all of you (and your voices). Remember “Many Voices, One Mozilla”!

About Sandip Kamat

Sandip Kamat is part of Mozilla's Connected Devices Product Management team. He has spent most of his career building mobile technologies and products. Prior to joining Mozilla, he worked at Motorola Mobility (later owned by Google) and Siemens Mobile. He is an alum of IIT Madras and UCSD (Rady). He is passionate about bringing cutting-edge technologies to everyday people to make their lives meaningfully better.

More articles by Sandip Kamat…

About Robert Nyman [Editor emeritus]

Technical Evangelist & Editor of Mozilla Hacks. Gives talks & blogs about HTML5, JavaScript & the Open Web. Robert is a strong believer in HTML5 and the Open Web and has been working with front-end web development since 1999, in Sweden and in New York City. He also blogs regularly at http://robertnyman.com and loves to travel and meet people.

More articles by Robert Nyman [Editor emeritus]…


14 comments

  1. szimek

    Are there any chances to update Web Speech API to optionally accept media stream object as its input instead of always using a mic?

    September 9th, 2014 at 15:21

    1. Andre Natal

      Hello szimek.

This is possible, but as we follow the W3C specification [1], it is not part of the spec yet.

      But please file a bug on the Speech API bug tree in Bugzilla [2] asking for this implementation!

      [1] https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html

      [2] https://bugzilla.mozilla.org/showdependencytree.cgi?id=1032964&hide_resolved=1

      September 10th, 2014 at 06:12

      1. Marco Chen

Please refer to [1], which already adds a media stream as an optional input source.

        [1] https://bugzilla.mozilla.org/show_bug.cgi?id=1038061

        September 10th, 2014 at 07:58

        1. Andre Natal

This is great, Marco. I didn’t know about it.

          Thanks

          September 11th, 2014 at 00:12

        2. szimek

          That’s great, thanks!

          September 11th, 2014 at 14:02

    2. Sandip Kamat

      Hi szimek,

André commented above, but can you elaborate on what kind of use cases you have in mind? That type of implementation is definitely possible, provided we can work with the W3C on standards.

      Sandip

      September 10th, 2014 at 15:10

      1. szimek

        Hi Sandip,

        I was thinking about a scenario where you make a phone call from a browser – being able to pass remote stream to voice recognition would allow you to display a transcript of whatever the person using a phone is saying. This would be pretty awesome e.g. for people with hearing loss.

        Additionally, you could quite easily translate such transcript using some translation service. I wrote an experimental app that translates whatever you’re saying and reads back the translation on the other side (https://webrtc-translate.herokuapp.com), but of course it would be much more awesome if you could create a WebRTC client that does it with any (local or remote) audio stream.

        September 11th, 2014 at 06:40

        1. Sandip Kamat

szimek, great ideas. We are thinking along the same lines here. Please continue following these bugs and see if you can integrate your app with the current implementation. Thx!

          September 18th, 2014 at 04:58

  2. Riccardo

Cool! One thing I’d like to see is Mozilla helping the VoxForge project complete free acoustic models for more languages.

    September 9th, 2014 at 23:13

    1. Andre Natal

      Hello Riccardo!

      Yes, this is one of the goals!!

We expect to set up a flow to ask the community for their voices, to enhance existing models and create new ones!

      Please follow the bug tree in Bugzilla [1] and open a bug about it!

      [1] https://bugzilla.mozilla.org/showdependencytree.cgi?id=1032964&hide_resolved=1

      September 10th, 2014 at 06:17

      1. Riccardo

        Hello Andre,

        filed this bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1065904

        September 11th, 2014 at 01:35

        1. Andre Natal

          Thanks Riccardo!

Hopefully we can soon set up a platform for the Mozilla community to contribute their voices!

          September 11th, 2014 at 01:37

  3. Noitidart

    So awesome guys!!! Can’t wait to be using this API! :)

    September 13th, 2014 at 01:47

    1. Noitidart

      How can I get the app on my fxos phone? It says there are two apps already out there.

      September 13th, 2014 at 01:50

Comments are closed for this article.