We interact with computers in many ways. Some of the most common are the keyboard and mouse. More recently, touchscreens have given us a more natural interface. But the most common way that humans communicate with each other is by speaking, and only in the last few years have voice interfaces become practical for communicating with computers and apps.
There are two roles in a spoken conversation: the speaker and the listener. AWS Polly enables an app to play the role of the speaker. Polly is a text-to-speech (TTS) service: given written text, it synthesizes audio that mimics a human reading the text aloud.
This might sound simple, but there is more to it than that. Speakers have different genders, and voices sound different depending on age. Polly takes these factors into account and provides a range of different voices, or "characters". Polly also supports over 30 languages and dialects. You can use Speech Synthesis Markup Language (SSML) to control how numbers or acronyms are pronounced, or to place pauses in the generated audio for a more natural recitation. The generated audio can be saved in multiple formats or streamed in real time.
To experiment with the voices that Polly offers, you can create audio files and download them from the AWS console. Go to https://console.aws.amazon.com/polly in your browser. You'll need to sign in with your AWS credentials. This is the interface.
Enter some text in the text area and select a voice. Press the Listen to speech button to hear the generated audio.
Notice there is also a dropdown to change the language and region. The default is US English, and different voices are available in different languages. You can also download the audio file in several different formats, and audio for long text can be saved to an S3 bucket instead. In addition to plain text, you can enter SSML for more control over the audio; you'll see more about that later in this guide. The console is great for experimenting, but the real power of Polly comes from integrating it with your own app, which you can do from several different programming languages.
I'm going to use Python to demonstrate how to access the Polly API from code, but other languages such as Java are also supported. There is no special Polly package for Python to install. Instead, the `boto3` package lets you access all of the AWS APIs from the same place. I won't detail the process of configuring `boto3`, but it's fairly simple: create an IAM user, get the access key ID and secret access key for that user, and store them in a `credentials` file in a `.aws` directory in your home directory. This is the default location where `boto3` looks for the file.
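For reference, the shared credentials file is a short INI-style file with one section per profile, and `boto3` uses the `[default]` profile unless told otherwise. The sketch below parses a sample with Python's standard `configparser` purely to show that layout. The key values are the placeholder examples used in AWS documentation, not real credentials, and `read_profile` is a hypothetical helper.

```python
import configparser

# Sample contents of ~/.aws/credentials (placeholder values, not real keys).
SAMPLE_CREDENTIALS = """
[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
"""


def read_profile(text, profile="default"):
    """Hypothetical helper: return (access key ID, secret key) for a profile."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser[profile]
    return section["aws_access_key_id"], section["aws_secret_access_key"]


key_id, secret = read_profile(SAMPLE_CREDENTIALS)
print(key_id)  # AKIAIOSFODNN7EXAMPLE
```

Your app never needs to read this file itself; `boto3` finds it automatically.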
To use Polly with `boto3`, import the `boto3` package. A client is the entry point to the API, and it needs to know which AWS region your app uses. This is done with a `Config` object, which also needs to be imported.
```python
import boto3
from botocore.config import Config

# Create a Polly client in the N. Virginia (us-east-1) region.
polly_client = boto3.client('polly', config=Config(
    region_name='us-east-1'
))
```
This example uses the N. Virginia region. I'll use a popular tongue twister for the text to synthesize.
```python
text = """
Peter Piper picked a peck of pickled peppers.
"""
```
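This text is short, but a single speech request has a size limit (3,000 billed characters per `synthesize_speech` call at the time of writing; check the current Polly quotas). Longer text can go through the S3 option mentioned earlier, or be split into pieces that are synthesized one at a time. The sketch below shows the splitting approach. `split_text` is a hypothetical helper, its sentence detection is a simple regular expression rather than real sentence parsing, and the limit is a parameter you supply, not a value fetched from AWS.

```python
import re


def split_text(text, max_chars=3000):
    """Hypothetical helper: split text into chunks of at most max_chars.

    Chunks break between sentences, where a sentence is approximated as text
    ending in ., ! or ? followed by whitespace. A single sentence longer than
    max_chars is hard-split at the limit as a fallback.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Fallback: hard-split any sentence that alone exceeds the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate  # the sentence still fits in this chunk
        else:
            chunks.append(current)  # close the chunk and start a new one
            current = sentence
    if current:
        chunks.append(current)
    return chunks


chunks = split_text("Peter Piper picked a peck of pickled peppers. " * 3, max_chars=100)
print([len(c) for c in chunks])  # [91, 45]
```

Each chunk would then be passed to its own `synthesize_speech` call.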
The client's `synthesize_speech` method accepts the text and returns the synthesized voice.
```python
response = polly_client.synthesize_speech(Text=text, VoiceId='Matthew', OutputFormat='mp3')
```
This example uses the `Matthew` voice with the default `en-US` language and generates an MP3 file. You'll see more about the keyword arguments later, but first, let's write the audio to a file.
Open a file handle, then write the `AudioStream` from the response to the file. Don't forget to open the file in binary mode with `'wb'`.
```python
f = open('polly.mp3', 'wb')
f.write(response['AudioStream'].read())
```
Before closing the file, close the `AudioStream` with `response['AudioStream'].close()`, then close the file with `f.close()`.
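The open, write, and close steps can also be combined with `with` statements so that both the stream and the file are closed even if writing fails partway. The sketch below does that using `contextlib.closing`, and reads the stream in fixed-size chunks instead of loading the whole clip into memory at once. `save_audio` is a hypothetical helper; in the demo, an in-memory `io.BytesIO` object stands in for the `AudioStream` that `boto3` returns, since both offer `read()` and `close()`.

```python
import io
import os
import tempfile
from contextlib import closing


def save_audio(response, path, chunk_size=64 * 1024):
    """Hypothetical helper: stream response['AudioStream'] into a file.

    Leaving the with-block closes the stream first and then the file, even if
    an exception is raised while writing.
    """
    with open(path, "wb") as f, closing(response["AudioStream"]) as stream:
        while True:
            chunk = stream.read(chunk_size)  # at most chunk_size bytes
            if not chunk:
                break
            f.write(chunk)
    return path


# Demo: io.BytesIO stands in for the AudioStream that boto3 would return.
fake_audio = b"\xff\xfb" + b"\x00" * 100
fake_response = {"AudioStream": io.BytesIO(fake_audio)}
out_path = save_audio(fake_response, os.path.join(tempfile.mkdtemp(), "polly.mp3"))
```

With a real client, the call would simply be `save_audio(response, 'polly.mp3')`.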
And that's all! You will now have a file named `polly.mp3` that you can listen to. The audio files for this guide have been uploaded to GitHub, so you can download a zip of the repository and play the files. This example is in a file named
How about a British accent? Add the `LanguageCode` keyword argument and set it to `en-GB`. The `Matthew` voice is not valid for this language, so I'll use `Amy` instead.
```python
response = polly_client.synthesize_speech(Text=text, VoiceId='Amy', OutputFormat='mp3', LanguageCode='en-GB')
```
Open `polly_gb.mp3` in the repo to hear Amy read the text.
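`LanguageCode` is optional, so it appears in some calls and not in others. One way to keep optional arguments like it tidy is to build the keyword arguments in one place and include them only when they are set, since `boto3` validates parameter types and rejects `None` for string parameters. The sketch below is a hypothetical helper (`build_request` is not part of `boto3`) that also checks `OutputFormat` against the values the Polly API accepts (`json`, `mp3`, `ogg_vorbis`, `pcm`) and `TextType` against `text` and `ssml`.

```python
# Values accepted by Polly's SynthesizeSpeech operation for these parameters.
VALID_FORMATS = {"json", "mp3", "ogg_vorbis", "pcm"}
VALID_TEXT_TYPES = {"text", "ssml"}


def build_request(text, voice_id, output_format="mp3",
                  language_code=None, text_type="text"):
    """Hypothetical helper: keyword arguments for synthesize_speech().

    LanguageCode is only included when given, because boto3 rejects None
    for string parameters.
    """
    if output_format not in VALID_FORMATS:
        raise ValueError(f"unsupported output format: {output_format!r}")
    if text_type not in VALID_TEXT_TYPES:
        raise ValueError(f"unsupported text type: {text_type!r}")
    request = {
        "Text": text,
        "VoiceId": voice_id,
        "OutputFormat": output_format,
        "TextType": text_type,
    }
    if language_code is not None:
        request["LanguageCode"] = language_code
    return request


us_request = build_request("Peter Piper picked a peck of pickled peppers.", "Matthew")
gb_request = build_request("Peter Piper picked a peck of pickled peppers.", "Amy",
                           language_code="en-GB")
# With a real client: response = polly_client.synthesize_speech(**gb_request)
```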
Or how about Spanish? Check out `polly_es.mp3` in the repo.
```python
text_es = """
Peter Piper recogió un picotazo de pimientos en vinagre
"""
# Miguel is a US Spanish voice, so the matching language code is es-US.
response = polly_client.synthesize_speech(Text=text_es, VoiceId='Miguel', OutputFormat='mp3', LanguageCode='es-US')
```
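Each voice belongs to a particular language, which is why the British example switched from Matthew to Amy. Instead of hard-coding voice names, you can ask Polly which voices exist: the client's `describe_voices` method returns a dictionary with a `Voices` list, and each entry includes fields such as `Id` and `LanguageCode`, with bilingual voices listing extra languages under `AdditionalLanguageCodes`. The sketch below filters a response of that shape. The sample dictionary is hand-written and trimmed to a few fields, so treat it as an illustration of the structure rather than real API output; `voices_for` is a hypothetical helper.

```python
def voices_for(describe_response, language_code):
    """Hypothetical helper: sorted IDs of voices that speak language_code.

    describe_response has the shape returned by polly_client.describe_voices().
    Bilingual voices list extra languages under AdditionalLanguageCodes.
    """
    matches = []
    for voice in describe_response["Voices"]:
        languages = [voice["LanguageCode"], *voice.get("AdditionalLanguageCodes", [])]
        if language_code in languages:
            matches.append(voice["Id"])
    return sorted(matches)


# Hand-written sample in the shape of a describe_voices() response.
sample_response = {
    "Voices": [
        {"Id": "Matthew", "LanguageCode": "en-US", "Gender": "Male"},
        {"Id": "Amy", "LanguageCode": "en-GB", "Gender": "Female"},
        {"Id": "Brian", "LanguageCode": "en-GB", "Gender": "Male"},
        {"Id": "Miguel", "LanguageCode": "es-US", "Gender": "Male"},
    ]
}
print(voices_for(sample_response, "en-GB"))  # ['Amy', 'Brian']
```

With a live client, `polly_client.describe_voices(LanguageCode='en-GB')` does the filtering on the service side; a local filter like this is handy if you fetch the full list once and cache it.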
The tongue twister is no match for Polly, even in different languages. Given that, you'd think this simple sentence would be no problem.
```python
zip_code = """
My zip code is 20202.
"""
```
Open `polly_zip_code.mp3` in the repo to hear Matthew read the zip code.
It's not what you'd expect. In the United States, we usually read the digits of the zip code, as in "two oh two oh two". But by default, Polly reads the number "twenty thousand two hundred two". Using Speech Synthesis Markup Language, or SSML, you can include "stage directions" so Polly will know to read the digits.
```python
ssml_zip_code = """
<speak>
My zip code is <say-as interpret-as='digits'>20202</say-as>
</speak>
"""
```
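The entire text is enclosed in `<speak>` tags. The zip code is enclosed in `<say-as>` tags, and the `interpret-as` attribute tells Polly to pronounce the digits one at a time instead of reading the entire number. When you pass SSML to `synthesize_speech`, also set `TextType='ssml'`; otherwise Polly treats the input as plain text rather than markup. To hear the voice, open the `polly_ssml_zip_code.mp3` file in the repo.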
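Writing the markup by hand is fine for one sentence, but if the text comes from somewhere else, such as user input, characters like `&` and `<` must be escaped before they go inside `<speak>`, or the SSML will not be well-formed. The sketch below is one hypothetical way to build the zip code example programmatically: `zip_codes_as_digits` escapes the text with `xml.sax.saxutils.escape`, wraps every standalone five-digit number in a `say-as` tag (the five-digit pattern is an assumption suited to US zip codes, not something Polly does itself), and parses the result with `xml.etree.ElementTree` as a well-formedness check.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Assumption: a US zip code is a standalone run of exactly five digits.
ZIP_PATTERN = re.compile(r"\b\d{5}\b")


def zip_codes_as_digits(text):
    """Hypothetical helper: build SSML that reads five-digit numbers digit by digit.

    The text is escaped first so & and < cannot break the markup; escaping
    never touches digits, so the zip codes are still found afterwards.
    """
    body = ZIP_PATTERN.sub(
        lambda m: f"<say-as interpret-as='digits'>{m.group(0)}</say-as>",
        escape(text),
    )
    ssml = f"<speak>{body}</speak>"
    ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
    return ssml


ssml = zip_codes_as_digits("My zip code is 20202.")
print(ssml)
# To synthesize it with a real client:
# polly_client.synthesize_speech(Text=ssml, TextType='ssml',
#                                VoiceId='Matthew', OutputFormat='mp3')
```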
That sounds much better. This is just the beginning of what SSML can do; check the Polly documentation for the supported tags and features.
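For example, the pauses mentioned at the start of this guide come from the `<break>` tag, whose `time` attribute sets the length of the pause. The sketch below is a hypothetical helper that joins sentences into one SSML document with a pause between each. The 500 ms default is an arbitrary choice, and the result must be sent with `TextType='ssml'`.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def with_pauses(sentences, pause="500ms"):
    """Hypothetical helper: join sentences with an SSML <break> between each."""
    separator = f'<break time="{pause}"/>'
    body = separator.join(escape(s) for s in sentences)  # escape &, <, >
    ssml = f"<speak>{body}</speak>"
    ET.fromstring(ssml)  # raises ParseError if the markup is malformed
    return ssml


ssml = with_pauses(["Peter Piper picked a peck of pickled peppers.",
                    "How many pickled peppers did Peter Piper pick?"])
print(ssml)
```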
Using AWS Polly lets your app speak to your users in a lifelike voice. It offers voices that simulate different ages and genders, in different dialects, to give your users an authentic experience for their location. And with SSML, you have precise control over how text is read and pronounced in the synthesized voice. The `boto3` package lets you integrate Polly into a Python app with only a few lines of code. But this is only one side of the conversation. AWS Lex provides the other part, speech recognition, so that your app can have a complete voice interface. Thanks for reading!