Douglas Starnes

Create Lifelike Speech with Amazon Polly

  • Jul 15, 2020
  • 7 Min read
  • 114 Views

Introduction

We interact with computers in many ways. Some of the more common are with a keyboard and mouse. Lately, touchscreens have given us a more natural interface. But the most common way that humans communicate with each other is through speaking, and only in the last few years have voice interfaces become feasible for humans to communicate with computers and apps.

AWS Polly

There are two roles in a spoken conversation, the speaker and the listener. AWS Polly enables an app to play the role of the speaker. Polly is a text-to-speech (or TTS) service. Given a collection of written text, Polly will synthesize audio that mimics a human reading the text.

This might sound simple, but there is more to it than that. First, speakers have different genders, and voices sound different depending on age. Polly takes these factors into account and provides different "characters". And Polly also understands over 30 languages and dialects. You can use Speech Synthesis Markup Language (SSML) to control the pronunciation of numbers or acronyms or place pauses in the generated audio for a more natural recitation. The generated audio can be saved in multiple formats or streamed in real time.

AWS Console

To experiment with the voices that Polly offers, you can create audio files and download them from the AWS console. Go to https://console.aws.amazon.com/polly in your browser. You'll need to sign in with your AWS credentials. This is the interface.

[Image: Amazon Polly console]

Enter some text in the text area and select a voice. Press the Listen to speech button to hear the generated audio.

Notice there is also a dropdown to change the language and region. The default is United States English, and each language offers its own set of voices. You can also download the audio file in several different formats. Audio for longer passages of text will be saved to an S3 bucket. In addition to plain text, you can enter SSML for more control over the audio; you'll see more about that later in this guide. The console is great for experimenting, but the real power of Polly comes from integrating it with your own app, and you can do that from several different programming languages.

Programming Polly

I'm going to use Python to demonstrate how to access the Polly API from code, but other languages such as Java are also supported. There is no special Polly Python package to install. Instead, the boto3 package lets you access the AWS APIs all from the same place. I won't detail the process of configuring boto3, but it's fairly simple. You need to create an IAM user, get the access key ID and secret access key for the user, and store them in a file named credentials in a .aws directory in your home directory. This is the default location where boto3 will look for the file.
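As a rough illustration of that setup, the sketch below uses Python's standard configparser module to check whether the credentials file contains a usable profile. The helper name profile_is_configured is hypothetical and not part of boto3; aws_access_key_id and aws_secret_access_key are the key names the AWS credentials file uses, and default is the profile boto3 falls back to when none is named.

```python
import configparser
from pathlib import Path

# Key names used by the AWS shared credentials file.
REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key")


def profile_is_configured(path=None, profile="default"):
    """Return True if `profile` in the credentials file defines both keys.

    Hypothetical helper for illustration; boto3 does its own lookup.
    """
    if path is None:
        # Default location: ~/.aws/credentials
        path = Path.home() / ".aws" / "credentials"
    parser = configparser.ConfigParser()
    # read() silently skips files that do not exist, so a missing file
    # simply leaves the parser with no sections.
    parser.read(path)
    if not parser.has_section(profile):
        return False
    return all(
        parser.get(profile, key, fallback="").strip() for key in REQUIRED_KEYS
    )
```

Calling profile_is_configured() with no arguments checks the default profile in the default location. It only confirms that the keys are present, not that AWS accepts them.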

To use Polly with boto3, import the boto3 module.

```python
import boto3
```

A client will be the entry point to the API. The client needs to know which AWS region your app uses. This is done with a Config object, which needs to be imported.

```python
from botocore.config import Config

# Create a Polly client bound to a specific AWS region
polly_client = boto3.client('polly', config=Config(
    region_name='us-east-1'
))
```

This example uses the N. Virginia region. I'll use a popular tongue twister for the text to synthesize.

```python
text = """
    Peter Piper picked a peck of pickled peppers.
"""
```

The synthesize_speech method will accept the text and return the synthesized voice.

```python
response = polly_client.synthesize_speech(Text=text, VoiceId='Matthew', OutputFormat='mp3')
```

This example also uses the Matthew voice, a US English (en-US) voice, and will generate an mp3 file. You'll see more about the keyword arguments later, but first, let's write this to a file.

Open a file handle, then write the AudioStream to the file. Don't forget to open the file as binary with the wb mode.

```python
f = open('polly.mp3', 'wb')
f.write(response['AudioStream'].read())
```

Before closing the file, close the AudioStream.

```python
response['AudioStream'].close()
f.close()
```

And that's all! You will now have a file named polly.mp3 that you can listen to. The audio files for this guide have been uploaded to GitHub. You can download a zip of the repository and play the files. This example is in a file named polly.mp3.
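The open, write, and close steps can also be folded into one function. This is a minimal sketch rather than the code used to produce the files in the repository: the name save_speech and its defaults are hypothetical, and it assumes only that the client exposes synthesize_speech and that the response carries an AudioStream with read and close methods, as in the example above.

```python
from contextlib import closing


def save_speech(client, text, path, voice_id="Matthew", output_format="mp3"):
    """Synthesize `text` and write the audio to `path`.

    Hypothetical helper for illustration. Returns the number of bytes written.
    """
    response = client.synthesize_speech(
        Text=text, VoiceId=voice_id, OutputFormat=output_format
    )
    # closing() closes the AudioStream and `with open` closes the file,
    # even if reading or writing raises an exception.
    with closing(response["AudioStream"]) as stream, open(path, "wb") as f:
        return f.write(stream.read())
```

With the client created earlier, save_speech(polly_client, text, 'polly.mp3') would produce the same file as the step-by-step version. The with statement means neither handle is left open if something goes wrong partway through.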

How about a British accent? Add the LanguageCode keyword argument and set it to en-GB. The Matthew voice is not available for this language, so I'll use Amy instead.

```python
response = polly_client.synthesize_speech(Text=text, VoiceId='Amy', OutputFormat='mp3', LanguageCode='en-GB')
```

Listen to polly_gb.mp3 to hear Amy read the text.
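If you are unsure which voices go with a language, you can ask Polly instead of guessing. The sketch below wraps the client's describe_voices call; the function name voice_ids_for is hypothetical, and the loop assumes the response may carry a NextToken when more results remain.

```python
def voice_ids_for(client, language_code):
    """Return the IDs of the voices Polly reports for `language_code`.

    Hypothetical helper for illustration.
    """
    voice_ids = []
    kwargs = {"LanguageCode": language_code}
    while True:
        page = client.describe_voices(**kwargs)
        voice_ids.extend(voice["Id"] for voice in page["Voices"])
        # A NextToken in the response means more voices remain.
        token = page.get("NextToken")
        if not token:
            return voice_ids
        kwargs["NextToken"] = token
```

Calling voice_ids_for(polly_client, 'en-GB') with the client created earlier returns a list of voice IDs that can be passed as VoiceId.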

Or in Spanish? Check out polly_es.mp3 in the repo.

```python
text_es = """
    Peter Piper recogió un picotazo de pimientos en vinagre
"""
# Miguel is a US Spanish voice, so the language code is es-US
response = polly_client.synthesize_speech(Text=text_es, VoiceId='Miguel', OutputFormat='mp3', LanguageCode='es-US')
```

SSML

The tongue twister is no match for Polly, even in different languages. Given that, you'd think this simple sentence would be no problem.

```python
zip_code = """
    My zip code is 20202.
"""
```

Listen to polly_zip_code.mp3 to hear Matthew read the zip.

It's not what you'd expect. In the United States, we usually read the digits of the zip code, as in "two oh two oh two". But by default, Polly reads the number "twenty thousand two hundred two". Using Speech Synthesis Markup Language, or SSML, you can include "stage directions" so Polly will know to read the digits.

```python
ssml_zip_code = """
    <speak>
        My zip code is <say-as interpret-as='digits'>20202</say-as>
    </speak>
"""
```

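The entire text is enclosed in <speak> tags. The zip code is enclosed in say-as tags, and the interpret-as attribute tells Polly to pronounce the digits one at a time instead of reading the entire number. When you pass SSML to synthesize_speech, also set the TextType keyword argument to 'ssml'; otherwise Polly treats the input as plain text. To hear the voice, open the polly_ssml_zip_code.mp3 file in the repo.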

That sounds much better. This is just the beginning of what SSML can do. Check the documentation for the supported tags and features.
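To tie the SSML markup back to the API, here is a minimal sketch of building the markup and sending it to Polly. The function names digits_ssml and synthesize_ssml are hypothetical; the sketch assumes the client exposes synthesize_speech as in the earlier examples, and it escapes the surrounding text so that characters such as & do not break the markup.

```python
from xml.sax.saxutils import escape


def digits_ssml(before, number, after=""):
    """Wrap `number` in a say-as tag so it is read one digit at a time.

    Hypothetical helper for illustration.
    """
    return (
        "<speak>"
        f"{escape(before)}"
        f"<say-as interpret-as='digits'>{escape(str(number))}</say-as>"
        f"{escape(after)}"
        "</speak>"
    )


def synthesize_ssml(client, ssml, voice_id="Matthew", output_format="mp3"):
    """Send SSML to Polly; TextType='ssml' marks the input as markup."""
    return client.synthesize_speech(
        Text=ssml, TextType="ssml", VoiceId=voice_id, OutputFormat=output_format
    )
```

For example, synthesize_ssml(polly_client, digits_ssml('My zip code is ', 20202, '.')) returns a response whose AudioStream can be written to a file in the same way as before.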

Conclusion

Using AWS Polly lets your app speak to your users in a lifelike voice. It supports different voices with different dialects to give your users an authentic experience for their location, and the voices also simulate age and gender. With SSML, you have precise control over how text is read and pronounced in the synthesized voice. The boto3 package lets you integrate Polly into a Python app with only a few lines of code. But this is only one side of the conversation. AWS Lex provides the other part, speech recognition, so that your app can have a complete voice interface. Thanks for reading!
