Text to Speech: The Promise and the Perils from a Writer’s Perspective

It’s no secret at this point that voice technology is drastically changing the way in which we interact with our electronic devices. Whether it’s using Amazon’s Alexa to order a weighted blanket or pulling up Apple’s Siri to help you find the way to that juice bar you heard about from that one guy, this tech is creeping into our lives in a way we can no longer ignore.

Although this type of assistance isn’t new, for the longest time many people avoided speech-producing software for no other reason than that they couldn’t stand the robotic-sounding default voices on their chosen machine. I’ll be the first to admit this was the case with me up until the mid-2010s. It was around this time that my desire to broaden my horizons as a writer and content creator made me revisit voice technology, in particular text-to-speech (TTS) software, as a means of refining my workflow and possibly creating new income opportunities.

Then

For context, let’s turn the clock back to about 2014 (or even a few years earlier had I been more hip and savvy)…

From there you can comfortably kiss the Speak & Spell voice of your particular OS goodbye and say hello to Amy and Brian — who couldn’t sound more British if they were above-the-line actors on Downton Abbey.

For those of you who missed the carriage, I’m referring to the TTS tech known as Ivona Voices, which at the time had a free app through the Google Play store and a desktop standalone going for a one-time payment of around $35 for personal use.

Yes, you read that right: FREE and ONE-TIME PAYMENT. None of this subscription service for everything nonsense.

As a writer, I definitely saw the benefits of using TTS from a creative perspective since the voices had evolved to the point of being tolerable for the duration of a blog article or eBook. You didn’t have to look all that hard to see that audio was the new frontier for media consumption, especially with the rising popularity of podcasts and indie creators starting to look for new ways to diversify their IP portfolios.

For me, TTS provided several benefits. First, it meant I could bypass subscription services like Audible, since I could literally copy a full-length book into a dedicated player, press a button, and have it spit out a complete audiobook in less than half an hour. In addition to that, the software served as a second set of eyes on anything I was writing, allowing me to see and hear my work in a way that made it easier to edit.

With the most obvious uses considered, the progress of the technology also made it plausible to envision future commercial iterations eliminating the need to hire voice actors for any narration work I might eventually experiment with. Don’t get me wrong, neither Brian nor Amy would be winning any acting awards, but in a world in which emotive inflection is quickly becoming a rarity among modern orators, one could hardly say either would be completely out of place in the market.

All in all, I would have said I had found an enviable balance between both the consumer and producer perspectives.

Yeah…

That was pretty much my exact level of naive, first-round optimism — until I realized the logical progression of the whole idea and how it would basically punch any frugal indie author’s attempt at a long-term business model square in the jaw.

Now

Jump to the start of the 2020s, and we (somewhat predictably) find the referee has already made it to the count of five while the next Tolkien is still face-first on the canvas wondering what the hell happened.

Naturally, it stands to reason that the aforementioned money-saving tactic offered by text-to-speech is the exact reason why the idea of making real money by the same means is becoming more and more far-fetched. After all, who’s going to buy your audiobook if they can use the same technology to make their own at little to no additional cost? Even worse, what if, for the same reason, consumer expectations begin to shift to a point where meeting the needs of a willing buyer puts production costs out of your price range?

Polly Wants All Your Crackers

In the years since I initially started using Ivona, the company has been acquired by Amazon. Shocker, I know. (Google must have slept in that day.)

The technology now powers Amazon Polly and is included as part of Amazon Web Services (AWS), which means you can no longer get Ivona’s previous standalone desktop application or the free voices on Google Play. (Oh, brave new world of cloud-based “convenience”!)

The only real upside is that Amazon’s commercial use pricing only costs fractions of a cent per character, giving you up to a million characters for around $4.00. This means you can download and make money from the voice files if you want, but casual use for something like editing might not be as beneficial as it once was.
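To put that pricing in perspective, here is a rough back-of-the-envelope sketch. It assumes the flat rate of about $4.00 per million characters mentioned above, and the manuscript size (80,000 words at roughly six characters per word, spaces included) is a hypothetical figure of mine, not a number from any pricing page. Actual rates vary by voice type and change over time, so treat the result as an estimate only.

```python
# Assumption: a flat rate of $4.00 per one million characters,
# the approximate figure quoted in the article.
PRICE_PER_MILLION_CHARS = 4.00


def estimate_tts_cost(char_count: int,
                      price_per_million: float = PRICE_PER_MILLION_CHARS) -> float:
    """Return the estimated cost in dollars of synthesizing char_count characters."""
    if char_count < 0:
        raise ValueError("char_count must be non-negative")
    return char_count / 1_000_000 * price_per_million


if __name__ == "__main__":
    # Hypothetical manuscript: 80,000 words at roughly 6 characters per word,
    # spaces included.
    book_chars = 80_000 * 6
    print(f"Estimated cost: ${estimate_tts_cost(book_chars):.2f}")
```

Under those assumptions a full-length novel comes in at a couple of dollars, which is exactly why the economics look so different for buyers and sellers alike.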

However, none of this is the real point. No, the true area of concern arises courtesy of Google’s DeepMind and certain advances in machine learning. (You probably should have seen that coming. Early bird or not, Google always wins… always.)

Now that I think about it, you should definitely throw IBM’s Watson in the TTS pile for good measure. The days of the one-stop shop are over, which essentially means if you had any chance of capitalizing on the ignorance of your competition regarding the available options for production, you can pretty much tuck those dreams away with that one Blockbuster gift card you never got to use (but won’t get rid of for some reason).

So, what does that mean for the small-time creator as far as making money with the technology we currently have and also with what’s coming? I can only say that the terrain is very interesting.

The Outlook

Concerning Celebrities

The most likely winners in all of this, from what I can tell, will be celebrities.

We’ve already seen that deep learning TTS models, like RealTalk, are capable of creating realistic voice replicas of public figures like Joe Rogan, Barack Obama, Donald Trump, and Elon Musk, so the idea of celebrity-endorsed and licensed TTS voices doesn’t seem like a stretch once someone figures out the legalese (which they had better start working on pretty soon before things get out of hand).

The way I see it, this influx of celebrity options will be the Swiss Army knife equivalent of the proverbial double-edged sword. On the one blade, you could have the voice of Morgan Freeman reading all your audiobooks, narrating your documentary, and playing the lead character in your video game. On another, though, you would have to worry about paying the real Morgan Freeman for every possible license under the sun to use his voice, which probably means you won’t make any money. In response, of course, you could always make a TTS of your own voice to cut costs, but you would quickly realize that no self-respecting buyer would choose you when they could get Morgan Freeman.

Basically what I’m saying is that Morgan Freeman will be everyone and everywhere…

Like… like…

God…

Just as he seemed destined to be…

In all seriousness, though, I believe platforms like Audible will eventually start partnering with celebrities to make custom TTS voices that they can then offer exclusively to subscription members who want to hear their favorite books read by certain people. It’s not difficult to fathom that many authors will have a hard time competing with any celebrity who has a halfway decent voice. In that market, the only thing that could preserve the author’s royalty check would be the collective goodwill of a very loyal fan base that just prefers their favorite author unconditionally — a circumstance which conveniently brings me to my next point.

Concerning Nonfiction Writers

As I subtly alluded to above, one segment of writers might still benefit from advances in TTS. Specifically, I’m referring to those who deal in nonfiction topics. Unlike fiction, nonfiction, especially in the personal development category, is one area in which readers might prefer authors take charge of their own words. Depending on the subject, there’s always something special about hearing an author’s voice connect with the subject of their passion, and for many fans there’s often no substitute for that. Still, as it concerns TTS, one can’t help but also embrace the irony as this desire for intimacy brushes harshly against the inherent conflict presented by the production method. In this respect, it’s difficult to predict how the process will be perceived.

I see the resolution of this predicament happening in one of two ways. Either authors will defy reader (listener) expectations in favor of revolutionizing their own workflow and profitability, or they will ultimately succumb to the pressure to do things the old-fashioned way by going through the full recording process page by page. One can only hope the product of the latter choice will fetch a higher price for the apparent courtesy, since I have no doubt authors of a certain persuasion will try to milk that particular perception for all they can get.

Concerning Fiction Writers

The TTS revolution will likely hit this group the hardest. When faced with the competition described above, productions of fictional works will have to become much more elaborate in order to make a listener’s purchase worthwhile. We are already seeing the effects of this in the resurgence in popularity of radio-style audio dramas, which often come complete with music, sound effects, and sometimes even a full cast of voice actors. To be sure, expectations are only going to get more demanding in both podcast and audiobook circles, but that doesn’t mean this type of creator should be completely discouraged. After all, pressure tends to breed innovation.

Conclusion

As with most other aspects of our current digital revolution, advancements in voice technology bring both hope and anxiety to the table, and a moment of reflection is always welcome. Indeed, recognizing both the promise and the perils is the first step to dealing with the challenges we will inevitably have to face. From there, preparation will have to take us the rest of the way.