Intro to Speech Recognition

Speech Recognition (also referred to as speech to text) is the first stage in a pipeline of algorithms that handle user input provided by voice. You can see this technology at work in virtual assistants like Google Assistant, Alexa, and Siri. It takes sound, whether recorded or live, and converts it into text for other techniques to parse. Natural Language Processing can then be used to extract meaning from the sentences produced by the Speech Recognition algorithm.

Supervised learning is used to train Speech Recognition algorithms. This means that a large sample of audio clips is fed into the algorithm along with the text transcription of each clip. Through training on a large amount of audio, the algorithm develops rules that it can use to match audio to words. There are several publicly available speech-to-text data sets that can be used for training, but in order to build a robust model, the algorithm has to be trained on the words and phrases that users will actually speak as input. For more general Speech Recognition algorithms, the training data does not have to include every word that users will speak, but it must include the phonemes that the algorithm will encounter.
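
To make the idea of training on labeled pairs concrete, here is a minimal sketch in Python. It is not how a production recognizer works: the two-number "feature vectors" are invented stand-ins for real audio features, and the nearest-centroid model and the names train and predict are assumptions chosen only to show the pattern of learning from (audio, transcript) pairs and then matching new audio to a word.

```python
from collections import defaultdict
from math import dist

# Hypothetical training set of (feature_vector, transcript) pairs.
# Real systems extract much richer features from audio; these 2-D
# vectors are invented purely for illustration.
TRAINING_DATA = [
    ((0.9, 0.1), "yes"),
    ((0.8, 0.2), "yes"),
    ((0.1, 0.9), "no"),
    ((0.2, 0.8), "no"),
]


def train(examples):
    """Average the feature vectors of each label (a nearest-centroid model)."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: tuple(sum(axis) / len(vectors) for axis in zip(*vectors))
        for label, vectors in grouped.items()
    }


def predict(model, features):
    """Return the label whose centroid is closest to the given features."""
    return min(model, key=lambda label: dist(model[label], features))


model = train(TRAINING_DATA)
prediction = predict(model, (0.85, 0.3))  # features of an unseen clip
```

The "rules" the model develops here are just one average vector per word. Real systems learn far more complex mappings, but the training signal is the same: audio paired with its transcription.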

Traditional approaches to Speech Recognition are built on something called a Hidden Markov Model (HMM). Newer approaches utilizing Deep Neural Networks have improved recognition accuracy, but HMM-based models are still widely in use today because they can be more efficient at encoding longer sections of speech. We will get into how these algorithms work in detail in a future article.
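
As a small preview of the HMM idea, the sketch below uses the Viterbi algorithm, a standard way to find the most likely sequence of hidden states behind a series of observations. Everything about the toy model is an assumption made for illustration: the two hidden states ("consonant" and "vowel"), the two observation symbols ("quiet" and "loud"), and all of the probabilities. A real recognizer works with many more states and with acoustic features rather than two labels.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for the observations."""
    # Each column maps state -> (probability of the best path ending in
    # that state, previous state on that path).
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states
            )
        best.append(column)

    # Start from the most likely final state and follow the stored
    # back-pointers to recover the full path.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))


# Toy model: all numbers below are invented for illustration.
STATES = ("consonant", "vowel")
START_P = {"consonant": 0.6, "vowel": 0.4}
TRANS_P = {
    "consonant": {"consonant": 0.3, "vowel": 0.7},
    "vowel": {"consonant": 0.6, "vowel": 0.4},
}
EMIT_P = {
    "consonant": {"quiet": 0.8, "loud": 0.2},
    "vowel": {"quiet": 0.1, "loud": 0.9},
}

decoded = viterbi(["quiet", "loud", "quiet"], STATES, START_P, TRANS_P, EMIT_P)
# decoded == ["consonant", "vowel", "consonant"]
```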

Speech Recognition algorithms must overcome several challenges. The first is quality of input. Accuracy decreases significantly with audio recorded in a noisy environment or with poor transmission quality (for example, audio from the other end of a telephone line). Pre-processing algorithms are therefore used to clean up the audio signal before it is fed into the Speech Recognition algorithm.
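
A very simple version of such a clean-up step might look like the sketch below. It assumes the audio is already available as a list of floating-point samples, and it applies three basic operations: removing any constant offset, scaling the signal to a consistent peak level, and silencing samples below a noise threshold. The function names and the threshold of 0.05 are assumptions made for illustration, and production systems generally rely on more sophisticated noise reduction than this.

```python
def remove_dc_offset(samples):
    """Shift the signal so that its average value is zero."""
    if not samples:
        return []
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]


def peak_normalize(samples, target_peak=1.0):
    """Scale the signal so that its loudest sample reaches target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silent or empty input: nothing to scale
    return [s * target_peak / peak for s in samples]


def noise_gate(samples, threshold=0.05):
    """Zero out samples quieter than the (assumed) noise threshold."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]


def preprocess(samples):
    """Run the three clean-up steps in order."""
    return noise_gate(peak_normalize(remove_dc_offset(samples)))


cleaned = preprocess([1.0, -1.0, 0.02, -0.02])
# cleaned == [1.0, -1.0, 0.0, 0.0]
```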

Another issue is variability in speech patterns. People speak in different languages and with different accents. In order to properly recognize speech across languages and accents, an algorithm must be trained using a data set that includes those languages and accents. It must also be able to identify which language or accent is being spoken and apply the rules it has developed for encoding that language or accent.

The last issue faced by Speech Recognition algorithms is one of domain. If you have a virtual assistant that is programmed to give financial advice, you can assume that users will generally be talking about topics related to finances. The data set you will need to use to train your virtual assistant can be smaller because you can limit it to a specific domain. This is called a “closed domain problem”. Google Assistant, Alexa, and Siri are trying to solve an “open domain problem”. Because their users can use nearly any word or talk about nearly any topic, much larger data sets become necessary and the importance of context increases. In the case of Speech Recognition, context is used to separate homophones (words that sound the same but are spelled differently) like “your” and “you’re”. Deep Neural Networks (and particularly bidirectional Recurrent Neural Networks) can look backwards and forwards through a sentence to determine context.
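
The role of context in choosing between homophones can be sketched with a much simpler technique than a neural network: look at the word that follows and pick whichever spelling has been seen with it more often. The word counts below are invented for illustration, and the names HOMOPHONES, NEXT_WORD_COUNTS, and disambiguate are hypothetical. A neural network would learn this kind of information from data and could consider the whole sentence instead of a single neighboring word.

```python
# Spellings that sound alike and so cannot be told apart from audio alone.
HOMOPHONES = {
    "your": ("your", "you're"),
    "you're": ("your", "you're"),
}

# Hypothetical counts of how often each spelling is followed by a word.
NEXT_WORD_COUNTS = {
    "your": {"car": 40, "house": 35, "welcome": 1, "going": 1},
    "you're": {"welcome": 50, "going": 45, "car": 1, "house": 1},
}


def disambiguate(words):
    """Pick the likelier spelling of each homophone using the next word."""
    result = []
    for i, word in enumerate(words):
        candidates = HOMOPHONES.get(word)
        if candidates is None or i + 1 >= len(words):
            result.append(word)  # not a homophone, or no following word
            continue
        next_word = words[i + 1]
        scores = {c: NEXT_WORD_COUNTS[c].get(next_word, 0) for c in candidates}
        if max(scores.values()) == 0:
            result.append(word)  # no evidence either way: keep the original
        else:
            result.append(max(scores, key=scores.get))
    return result


fixed = disambiguate(["is", "that", "you're", "car"])
# fixed == ["is", "that", "your", "car"]
```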

Speech Recognition is only the first step toward extracting meaning from spoken words. Once text has been encoded, it is then passed to Natural Language Processing algorithms to derive meaning. You can read more about Natural Language Processing here.