Using the Google Cloud Speech API

By Douglas Starnes

Jul 21, 2020 • 9 Minute Read

Introduction

There are many ways to interact with apps. Obviously the keyboard has been, and still is, one of the most often-used devices to communicate with computers. More recently (in human years, not computer years) the mouse allowed us to move past text-based interfaces. And improving upon the mouse, touch has let us use more natural interactions. But when people are communicating with each other, in close proximity, we don't type what we are thinking or point to it, we use spoken language. So why should it be any different with a computer?

Google Cloud Speech APIs

There are two sides to a spoken conversation: the speaker and the listener. The speaker must generate the sounds to express ideas, and the listener interprets those sounds to reconstruct the ideas. This might seem like stating the obvious, as people do this every day. But a computer doesn't understand any of this unless these interactions are explained to it in great detail. This is why there are two speech APIs in Google Cloud.

First is the Text-to-Speech (or TTS) API. This service converts written (or typed) text into sounds resembling a human voice. It offers over 200 voices in over 40 languages. It supports Speech Synthesis Markup Language, or SSML, which lets you annotate written text with a set of "stage instructions" that customize the sounds for a more realistic effect. This includes pauses in text and the pronunciation of acronyms and abbreviations.

The other API is the complement, Speech-to-Text (or STT). If TTS is the speaker, then STT is the listener. The STT API can transcribe speech in more than 125 languages. And like the TTS API, it can be customized. A common use case is recognizing jargon present in specific industries. The STT API can even transcribe streaming audio in real time.

The Text-to-Speech API

Both the TTS and STT APIs support client libraries for Python, Node.js, C#, Java, and other popular languages. They also both support a REST API. I'll be using the Python client libraries for this guide.

To get started with the API, you'll need to enable the Text-to-Speech API in a Cloud Console project. Then you'll need to download the credentials for a service account and store the path to the credentials file in an environment variable named GOOGLE_APPLICATION_CREDENTIALS. My previous guide, Computer Vision with Google Cloud Vision, walks through this process. Refer to the link for the details.
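Since a missing or mistyped credentials variable is a common first stumbling block, here is a minimal sketch of a check you could run before creating a client. It uses only the standard library. The helper name credentials_path is hypothetical and not part of any Google library; the only thing it relies on is the GOOGLE_APPLICATION_CREDENTIALS variable name that the client libraries look for.

```python
import os
from pathlib import Path


def credentials_path(env=os.environ):
    """Return the service account key path, or raise with a readable message.

    Hypothetical helper for illustration; `env` is any mapping of
    environment variables (defaults to the real environment).
    """
    value = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not value:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    path = Path(value)
    if not path.is_file():
        raise RuntimeError(f"credentials file not found: {path}")
    return path
```

Calling credentials_path() at the top of a script turns a confusing authentication error later on into an immediate, descriptive one.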

To install the client library for Python, use pip.

      $ pip install google-cloud-texttospeech

The simplest example begins by importing the texttospeech module.

      from google.cloud import texttospeech

Next create a new TextToSpeechClient.

      tts_client = texttospeech.TextToSpeechClient()

The VoiceSelectionParams configure the generated voice.

      params = texttospeech.VoiceSelectionParams(language_code='en-US')

The language_code keyword argument is required. There are currently 40 different language codes. Here the language code is set to United States English. You can also set the name of the voice to use one of the current 237 different voices. A voice is a combination of a language code and a gender. For example, the voice "en-GB-Standard-A" uses a language code of en-GB for Great Britain English and a gender of 2, which is female. Alternatively, you can specify a gender with the ssml_gender keyword argument.

          params = texttospeech.VoiceSelectionParams(language_code='en-US', ssml_gender=texttospeech.SsmlVoiceGender.FEMALE)
    

The SsmlVoiceGender type is one of four values: SSML_VOICE_GENDER_UNSPECIFIED, MALE, FEMALE, or NEUTRAL.
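To select a specific voice by name rather than by gender, the name keyword argument of VoiceSelectionParams can be used together with the language code. The sketch below assumes that voice names begin with the language code, as "en-GB-Standard-A" does; the two helper functions are hypothetical names introduced here for illustration.

```python
def language_code_from_voice(voice_name):
    """Derive 'en-GB' from 'en-GB-Standard-A'.

    Assumes the name starts with '<language>-<REGION>-', which matches the
    example voice in this guide but is not guaranteed for every voice.
    """
    parts = voice_name.split("-")
    if len(parts) < 3:
        raise ValueError(f"unexpected voice name: {voice_name!r}")
    return "-".join(parts[:2])


def voice_params_for(voice_name):
    """Build VoiceSelectionParams for a named voice."""
    # Imported here so the parsing helper above works even without the
    # client library installed.
    from google.cloud import texttospeech

    return texttospeech.VoiceSelectionParams(
        language_code=language_code_from_voice(voice_name),
        name=voice_name,
    )
```

With this in place, voice_params_for('en-GB-Standard-A') could be passed as the voice argument in place of params.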

The AudioConfig will select a format for the generated audio file.

      audio = texttospeech.AudioConfig(audio_encoding=texttospeech.AudioEncoding.MP3)

Valid AudioEncodings are MP3, LINEAR16, OGG_OPUS and AUDIO_ENCODING_UNSPECIFIED.
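If you switch encodings, the output file extension should change with it. Here is a small sketch of one way to keep the two in sync. The mapping is my assumption: in particular, .wav for LINEAR16 assumes the returned audio carries a WAV header, so check the API documentation before relying on it. The names EXTENSIONS and output_filename are hypothetical.

```python
# Assumed mapping from AudioEncoding names to file extensions.
EXTENSIONS = {
    "MP3": ".mp3",
    "LINEAR16": ".wav",  # assumption: LINEAR16 output includes a WAV header
    "OGG_OPUS": ".ogg",
}


def output_filename(stem, encoding_name):
    """Return a file name whose extension matches the chosen encoding."""
    try:
        return stem + EXTENSIONS[encoding_name]
    except KeyError:
        raise ValueError(f"no extension known for encoding {encoding_name!r}") from None
```

For example, output_filename('en_us_female', 'MP3') gives the file name used later in this guide.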

The text to speak is in a SynthesisInput.

          si = texttospeech.SynthesisInput(text='Peter Piper picked a peck of pickled peppers.')
    

And finally, the synthesize_speech method will return a SynthesizeSpeechResponse containing the generated audio.

          response = tts_client.synthesize_speech(input=si, voice=params, audio_config=audio)
    

The audio_content property of the response is the audio data that can be written to a file. Make sure to write binary data with the mode wb.

      with open('en_us_female.mp3', 'wb') as f:
          f.write(response.audio_content)

You can download the generated audio file (en_us_female.mp3) from GitHub. I've also generated a male voice (en_us_male.mp3) by setting ssml_gender to MALE, and another with a British accent (en_gb_male.mp3) by also setting the language_code keyword argument to en-GB.
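The steps so far can be collected into one function so that the variations (female, male, British) differ only in their arguments. This is a sketch built from the same calls used above, not a definitive implementation; the function name synthesize_to_file and its parameters are my own.

```python
def synthesize_to_file(text, path, language_code="en-US", gender="FEMALE", client=None):
    """Generate MP3 speech for `text` and write it to `path`.

    `gender` is the name of an SsmlVoiceGender member, e.g. "MALE".
    Hypothetical wrapper around the calls shown in this guide.
    """
    # Imported lazily so that defining the function does not require the library.
    from google.cloud import texttospeech

    client = client or texttospeech.TextToSpeechClient()
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=text),
        voice=texttospeech.VoiceSelectionParams(
            language_code=language_code,
            ssml_gender=getattr(texttospeech.SsmlVoiceGender, gender),
        ),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.MP3
        ),
    )
    # Binary mode, because audio_content is bytes.
    with open(path, "wb") as f:
        f.write(response.audio_content)
    return path
```

A call such as synthesize_to_file(text, 'en_gb_male.mp3', language_code='en-GB', gender='MALE') would then correspond to the British example.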

SSML

What happens if we generate an audio file for the text "My zip code is 20202"? Listen to the file (zip_code_no_ssml.mp3) on GitHub.

The API reads "20202" as "twenty thousand two hundred two". But that is not how we read zip codes. We speak each number, as in "two oh two oh two". How can we make Google understand this?

The answer is SSML, or Speech Synthesis Markup Language. It is a set of tags used to mark up the text to be spoken. Here is the SSML that would tell Google to read out each digit in the zip code.

      ssml = """
      <speak>
          My zip code is
          <say-as interpret-as="characters">
          20202
          </say-as>
      </speak>
      """

By telling Google to interpret the zip code as characters, it will read each number. To have Google use SSML, look at the SynthesisInput and change the text keyword argument to ssml.

      si = texttospeech.SynthesisInput(ssml=ssml)

The generated audio file (zip_code_ssml.mp3) can be found on GitHub.
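Because SSML is XML, text that comes from users or other data sources should be escaped before it is placed between the tags. Here is a minimal sketch of a builder for the zip code pattern that does so, with an optional pause before the digits. The function name spell_out_ssml and its parameters are hypothetical; the say-as and break tags are standard SSML.

```python
from xml.sax.saxutils import escape


def spell_out_ssml(prefix, digits, pause_ms=0):
    """Build SSML that speaks `prefix`, then reads `digits` one character at a time.

    `pause_ms` optionally inserts a <break> of that many milliseconds
    before the digits. Both text arguments are XML-escaped.
    """
    pause = f'<break time="{int(pause_ms)}ms"/>' if pause_ms else ""
    return (
        "<speak>"
        f"{escape(prefix)} {pause}"
        f'<say-as interpret-as="characters">{escape(digits)}</say-as>'
        "</speak>"
    )


print(spell_out_ssml("My zip code is", "20202"))
# <speak>My zip code is <say-as interpret-as="characters">20202</say-as></speak>
```

The returned string can be passed as the ssml keyword argument of SynthesisInput.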

This is just one example of what SSML supports. For more details, consult the Text-to-Speech API documentation.

The Speech-to-Text API

The setup for the STT API is similar to the TTS API. You'll need to enable the Speech-to-Text API for a Cloud Console project and store the path to the service account credentials file in an environment variable named GOOGLE_APPLICATION_CREDENTIALS. And finally, install the client library package with pip.

      $ pip install google-cloud-speech

The audio file to be transcribed needs to be stored somewhere the API can access it. A good choice is a bucket in Google Cloud Storage.
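If the file is not in a bucket yet, it can be uploaded from Python as well. The sketch below assumes the separate google-cloud-storage package is installed (pip install google-cloud-storage), which this guide does not otherwise use, and that the bucket already exists. The helper names gs_uri and upload_audio are hypothetical.

```python
def gs_uri(bucket_name, blob_name):
    """Build the gs:// URI the Speech-to-Text API expects for a stored file."""
    return f"gs://{bucket_name}/{blob_name.lstrip('/')}"


def upload_audio(local_path, bucket_name, blob_name):
    """Upload a local audio file to an existing bucket and return its gs:// URI."""
    # Assumption: google-cloud-storage is installed; imported lazily so
    # gs_uri stays usable without it.
    from google.cloud import storage

    blob_name = blob_name.lstrip("/")
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    blob.upload_from_filename(local_path)
    return gs_uri(bucket_name, blob_name)
```

For the file used below, gs_uri('ps-guide-speech', 'en_us_male.mp3') produces gs://ps-guide-speech/en_us_male.mp3.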

First import the speech package.

      from google.cloud import speech_v1p1beta1 as gcp_speech

A SpeechClient handles all interaction with the API.

      stt_client = gcp_speech.SpeechClient()

You must also tell the API the language being spoken in the audio file and the sample rate. I'm going to use one of the generated files, which use a rate of 24000 Hertz.

      language = 'en-US'
      sr = 24000

For MP3 files the encoding must be given. The encoding is in the enums module.

      from google.cloud.speech_v1p1beta1 import enums

      MP3 = enums.RecognitionConfig.AudioEncoding.MP3

The recognize method will call the API.

      response = stt_client.recognize(
          {
              'language_code': language,
              'sample_rate_hertz': sr,
              'encoding': MP3
          }, {
              'uri': 'gs://ps-guide-speech/en_us_male.mp3'
          }
      )

The first dictionary is the recognition configuration, and the second tells the API the location of the audio file to transcribe. The response holds a list of results. Each result has a list of alternatives, each with a transcript and confidence score.
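The call above matches the version of the client library I used when writing this guide. As far as I know, releases of google-cloud-speech from 2.0 onward no longer ship the enums module and expect keyword arguments instead, so on a newer install the equivalent call would look roughly like the sketch below. Treat it as an assumption to verify against the library's upgrade notes; the function name transcribe_mp3 is hypothetical.

```python
def transcribe_mp3(gcs_uri, language_code="en-US", sample_rate_hertz=24000, client=None):
    """Transcribe an MP3 stored in Cloud Storage (sketch for library 2.x+)."""
    # Imported lazily so that defining the function does not require the library.
    from google.cloud import speech_v1p1beta1 as speech

    client = client or speech.SpeechClient()
    config = speech.RecognitionConfig(
        language_code=language_code,
        sample_rate_hertz=sample_rate_hertz,
        # The encoding enum is reached through RecognitionConfig itself.
        encoding=speech.RecognitionConfig.AudioEncoding.MP3,
    )
    audio = speech.RecognitionAudio(uri=gcs_uri)
    return client.recognize(config=config, audio=audio)
```

The returned response has the same results and alternatives structure in either version.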

      transcript: "Peter Piper picked a peck of pickled peppers"
      confidence: 0.9863739

This is the same text that was used to generate the audio file. And the confidence is quite high.
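To pull the transcript and confidence out of the response in code, you can walk the results and take the first alternative of each, which the API lists as the most probable. Here is a minimal sketch; the helper name best_alternatives is hypothetical, and the stand-in response built with SimpleNamespace only mimics the shape of a real response so the example runs without calling the API.

```python
from types import SimpleNamespace


def best_alternatives(response):
    """Yield (transcript, confidence) for the top alternative of each result."""
    for result in response.results:
        if result.alternatives:
            top = result.alternatives[0]
            yield top.transcript, top.confidence


# Stand-in with the same attribute layout as a recognize() response.
fake_response = SimpleNamespace(
    results=[
        SimpleNamespace(
            alternatives=[
                SimpleNamespace(
                    transcript="Peter Piper picked a peck of pickled peppers",
                    confidence=0.9863739,
                )
            ]
        )
    ]
)

for transcript, confidence in best_alternatives(fake_response):
    print(f"{confidence:.2f}  {transcript}")
# 0.99  Peter Piper picked a peck of pickled peppers
```

With a real response from stt_client.recognize, the same loop applies unchanged.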

Conclusion

Don't forget that Python is not the only language the client libraries support. And it is always possible to call the REST API directly using any HTTP framework. If you combine the Google Cloud Text-to-Speech and Speech-to-Text APIs, you almost have enough to create a virtual assistant. The only thing remaining is to parse the text and extract meaning from it. The Google Cloud Natural Language API can provide that. The STT API lets you transcribe up to 60 minutes of audio every month for free. And you can generate audio files for up to 4 million characters of text free per month with the TTS API. Thanks for reading!

Douglas S.

Douglas Starnes is a tech author, professional explainer and Microsoft Most Valuable Professional in developer technologies in Memphis, TN. He is published on Pluralsight, Real Python and SkillShare. Douglas is co-director of the Memphis Python User Group, Memphis .NET User Group, Memphis Xamarin User Group and Memphis Power Platform User Group. He is also on the organizing committees of Scenic City Summit in Chattanooga, and TDevConf, a virtual conference in the state of Tennessee. A frequent conference and user group speaker, Douglas has delivered more than 70 featured presentations and workshops at more than 35 events over the past 10 years. He holds a Bachelor of Music degree with an emphasis on Music Composition from the University of Memphis.
