I cloned my voice using Azure, Eleven Labs, Play HT, and Resemble AI.
The same data (script) was used to train the Azure, Eleven Labs, and Play HT voices. Resemble AI’s Basic subscription has you record their script using your computer with whatever microphone is available to you (internal, headset, USB mic, etc.).
Azure: The custom ‘Lite’ version is restricted to playing back Microsoft’s text. I edited that audio to create a prompt, which I then replicated on the other platforms. The read is somewhat flat.
Eleven Labs: There is personality in the delivery, though the first half of the rendered audio had unwanted noise that I edited out.
Play HT: The word ‘created’ is read back strangely. Like Eleven Labs, there is some personality in the delivery.
Resemble: The platform added breath noises to the audio file, which I edited out. The read is flat and monotone, and the audio quality is the lowest of the four. The Azure and Resemble audio were recorded using the same microphone.
Azure: Pay-as-you-go plus rendering time for their ‘Lite’ version. Training my voice cost ~$50.
Eleven Labs: The ‘Starter’ subscription is currently $5/month though they have promotions where it’s $1/month.
Play HT: Free. The next level is $29/month and includes unlimited instant voices and one High Fidelity clone. The sample used in this post is an ‘instant voice’.
Resemble: Basic plan at $0.006/second of generated audio.
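To put the per-second Resemble price next to the flat monthly subscriptions, here is a rough Python sketch of the arithmetic. The prices are the ones quoted above; the function names and the 10-minute example are my own illustration, and the comparison ignores any character or usage quotas the subscription plans may have, since those are not covered in this post.

```python
# Prices quoted in this post (USD). Everything else here is illustrative.
RESEMBLE_PER_SECOND = 0.006    # Resemble Basic: per second of generated audio
ELEVEN_LABS_STARTER = 5.00     # Eleven Labs Starter: flat fee per month
PLAY_HT_PAID = 29.00           # Play HT first paid tier: flat fee per month


def resemble_cost(seconds: float) -> float:
    """Cost in USD of generating `seconds` of audio on the per-second plan."""
    return round(seconds * RESEMBLE_PER_SECOND, 2)


def break_even_seconds(flat_monthly_fee: float) -> float:
    """Seconds of generated audio per month at which the per-second plan
    costs as much as a flat monthly fee (quotas are ignored)."""
    return flat_monthly_fee / RESEMBLE_PER_SECOND


if __name__ == "__main__":
    print("10 minutes on Resemble:", resemble_cost(10 * 60), "USD")
    for name, fee in [("Eleven Labs Starter", ELEVEN_LABS_STARTER),
                      ("Play HT paid tier", PLAY_HT_PAID)]:
        minutes = break_even_seconds(fee) / 60
        print(f"{name}: break-even at about {minutes:.1f} minutes per month")
```

At these prices, the per-second plan is the cheaper route only for small amounts of generated audio each month; past the break-even point a flat subscription costs less, assuming its quota covers the usage.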
Eleven Labs and Play HT provide the best-sounding cloned voices because they both allow you to upload your own audio files to train the TTS model. They are also inexpensive options.
Azure and Resemble are at a disadvantage because audio has to be recorded directly into their platforms. This affected the overall sound quality. Besides the degraded sound, the rendered audio lacked expressiveness, which was especially noticeable with Resemble.
Varied intonation, emphasis, and pausing make TTS audio easier to follow and its information easier to retain. The samples presented here were rendered with each platform's default settings. Eleven Labs and Play HT both offer sliders for Stability and Similarity, plus Style Exaggeration (Eleven Labs) and Intensity (Play HT). The Stability setting controls expressiveness. Eleven Labs states that the more expressive settings can lead to instability in playback.
Resemble AI has no style settings for its free cloned voices. However, there is the option to use SSML to control playback parameters, which I will explore in the near future. Azure's custom voices should be able to use SSML as well, but the 'Lite' version does not allow it.
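For readers who have not seen SSML, here is a minimal Python sketch of the kind of markup involved. It uses only elements from the W3C SSML specification (`speak`, `prosody`, `emphasis`, `break`); the `build_ssml` helper, its parameters, and the sample sentence are my own assumptions, and I have not verified which of these tags each platform accepts or whether a platform requires extra attributes on `speak`.

```python
import xml.etree.ElementTree as ET


def build_ssml(before: str, emphasized: str, after: str,
               pause_ms: int = 400, rate: str = "95%") -> str:
    """Hypothetical helper: wrap one sentence in SSML with a slightly
    slower speaking rate, one emphasized word, and a pause after it."""
    speak = ET.Element("speak")

    # <prosody> sets the speaking rate for everything inside it.
    prosody = ET.SubElement(speak, "prosody", {"rate": rate})
    prosody.text = before

    # <emphasis> asks the engine to stress the enclosed word or phrase.
    emphasis = ET.SubElement(prosody, "emphasis", {"level": "strong"})
    emphasis.text = emphasized

    # <break> inserts a pause; the rest of the sentence follows it.
    pause = ET.SubElement(prosody, "break", {"time": f"{pause_ms}ms"})
    pause.tail = after

    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml("I cloned my voice on ", "four", " platforms."))
```

Building the markup with an XML library rather than string concatenation keeps the tags well-formed and escapes characters such as `&` in the text automatically.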