A yet-to-be-named project related to this post recently led me on a search for a good way to transcribe dictation audio (convert free-form speech into text). The only really usable package out there is the Microsoft Speech API. It does a great job of recognizing speech from a limited set of options. However, when it comes to free-form speech, it is downright terrible. This is odd, I thought, because I've seen some services that do an excellent job of transcribing voicemail messages. TrapCall is one of these services. So, a couple of days ago, I emailed TrapCall and asked them how they did it. Their response: "We use human transcribers not automation." Wow. I guess that since the quality of automated transcription is so terrible, it is more economical to do the transcription manually.
That leaves me two options for my service. Option 1 is manual transcription. Option 2 is to improve on the existing speech-to-text engine.
Now, keep in mind that free-form speech transcription is not essential for my service at this time. But having it would blow the competition out of the water. Also, if I were to develop an accurate free-form speech transcription engine, it would be valuable enough that I could sell it on its own and make a pretty penny. So, I started thinking about how I could improve upon the existing technology.
- Performance matters. Microsoft has a default 'DictationGrammar' that you are supposed to use for free-form speech. That's the one that sucks. So, I had the idea to create my own grammar consisting of a simple list of 80,000 words and see how well that did. When I went to load the grammar file, I sat there for 15 minutes and it was still loading. So I killed it. Then I tried 20,000 common words. That took 2 or 3 minutes to load and about 30 seconds to process each word, it would only process one word at a time, and the quality still sucked. Another failure. What this tells me is that even though Microsoft's DictationGrammar produces terrible results, at least it is fast. It also rules out a massive, complicated grammar based on common sentence structure. It would just be too slow. We need to process this audio at no worse than half real-time speed (that is, at most 2 seconds of processing per 1 second of audio).
- Consider certainty. Did she say "I want to play with your bells" (you know, the ones you play with your Christmas orchestra) or did she say something else? That is a very important word, and it's better to ask and verify than to assume she wants to go shoot some hoops. The problem is, computers lack linguistic skills and there is no way for a computer to know if a word is important. Better safe than sorry. Every word whose recognition confidence falls below a certain threshold is suspect.
- Making a list, checking it twice. My engine would take a two-pronged approach. First, results are stored in a two-dimensional array. The sentence proceeds across the top row, and each column holds similar-sounding alternatives to the engine's pick. For example, the first column might consist of A, Lay, Pay; the second column: fog, dog, cog; and the third: guns, nuns, runs. Second, the engine looks at word triplets that were compiled from web content. These are three-word groups that appear together in text, rated by how common they are. Using this approach, one can clearly see that the transcription should be "A dog runs," as those words appear together most commonly. We would run this test on each individual word. For the first word, which scores higher: a fog/a dog/a cog, lay fog/lay dog/lay cog, or pay fog/pay dog/pay cog? "A" is clearly the correct choice. The same goes for dog, but since it's in the middle, we can use the word before and the word after. We have already chosen "A," so we don't have to compare the other first-word options here. So we have (a fog guns/a fog nuns/a fog runs), (a dog guns/a dog nuns/a dog runs), and (a cog guns/a cog nuns/a cog runs). "Dog" would likely stand out from the rest.
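Here is a rough Python sketch of the last two ideas, just to make them concrete. It is not a real engine: the names `flag_suspect_words` and `pick_words`, the 0.6 threshold, the confidence values, and every count in `TOY_COUNTS` are made up for illustration. It also assumes that a candidate's score is the sum of the counts over all alternatives in the next column, and that the last word is scored against the two words already chosen before it; the post doesn't pin either of those down.

```python
from typing import Dict, List, Sequence, Tuple

# Word-group -> how often it appears in text. Hypothetical numbers only.
NGramCounts = Dict[Tuple[str, ...], int]


def flag_suspect_words(
    words: Sequence[Tuple[str, float]], threshold: float = 0.6
) -> List[str]:
    """Return every word whose confidence is below the (assumed) threshold."""
    return [word for word, confidence in words if confidence < threshold]


def pick_words(columns: Sequence[Sequence[str]], counts: NGramCounts) -> List[str]:
    """Pick one word per column, left to right, using word-group counts.

    Each column holds similar-sounding alternatives for one spoken word.
    A candidate is scored together with the word already chosen to its left
    and every alternative in the column to its right.
    """
    chosen: List[str] = []
    for i, candidates in enumerate(columns):
        if i + 1 < len(columns):
            # There is a next column: use the previous pick (if any) as left
            # context and try each alternative of the next word on the right.
            left = chosen[-1:]
            rights = [[r] for r in columns[i + 1]]
        else:
            # Last word: nothing on the right, so lean on the two previous picks.
            left = chosen[-2:]
            rights = [[]]

        def score(word: str) -> int:
            # Assumption: sum the counts over all right-hand alternatives.
            return sum(counts.get(tuple(left + [word] + r), 0) for r in rights)

        chosen.append(max(candidates, key=score))
    return chosen


# Made-up counts standing in for data compiled from web content.
TOY_COUNTS: NGramCounts = {
    ("a", "fog"): 300, ("a", "dog"): 900, ("a", "cog"): 50,
    ("lay", "fog"): 1, ("lay", "dog"): 5, ("pay", "dog"): 8,
    ("a", "fog", "runs"): 1, ("a", "dog", "runs"): 120, ("a", "cog", "runs"): 2,
}

COLUMNS = [["a", "lay", "pay"], ["fog", "dog", "cog"], ["guns", "nuns", "runs"]]

# Made-up recognizer output: (word, confidence between 0 and 1).
HEARD = [("i", 0.95), ("want", 0.90), ("to", 0.97), ("play", 0.88),
         ("with", 0.93), ("your", 0.91), ("bells", 0.42)]

print(pick_words(COLUMNS, TOY_COUNTS))  # ['a', 'dog', 'runs']
print(flag_suspect_words(HEARD))        # ['bells']
```

Because each word is locked in before moving on to the next, the number of lookups grows with the number of alternatives per column rather than with every possible sentence, which matters given the speed budget above. The trade-off is that one bad early pick carries through the rest of the sentence.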
Tags: audio, transcription, dictation, free-form speech transcription