Interview With Alex Serdiuk - CEO and Co-Founder of Respeecher

Shauli Zacks
Content Editor

SafetyDetectives had the privilege of sitting down with Alex Serdiuk, CEO and Co-Founder of Respeecher, to discuss the intricate world of voice cloning technology and its impact on the future of the entertainment industry. In this interview, Alex shares the origins of Respeecher and the company’s journey from a winning hackathon idea to Emmy-winning projects in Hollywood, and addresses common misconceptions about voice cloning. He also explores the opportunities this technology presents for content creators and the entertainment industry at large, along with the ethical considerations and the democratization of content creation. Join us as we uncover the nuances of this groundbreaking technology with one of the leading experts in the field.

Can you share your journey and the inspiration behind founding Respeecher, and how did you become involved in the field of voice cloning?

Respeecher was started a little over six years ago. However, before that, we spent several years playing with the technology and the idea. It started with a hackathon, an IT competition where teams try to build something overnight. My partner and co-founder Dmytro Bielievtsov, who is currently CTO at Respeecher, chose to try voice conversion. The concept was basically to generate a target voice from another person’s vocal input, through speech-to-speech voice conversion.

Back then, there was no heavy deep learning, just basic machine learning models. We ended up winning that hackathon, and then we started to think about possible applications of the technology.

In 2018, we started the company; by 2019 we had begun our first Hollywood projects, and we even won an Emmy. Respeecher is a team of 53 and growing. We have delivered more than 150 projects to date, including plenty of pieces in the Star Wars universe, Netflix projects, and some big video games like God of War and Cyberpunk 2077, including its Phantom Liberty expansion.

In addition to the high-profile projects, we built a product for small creators as well. We also work in healthcare and security. Lastly, we do real-time voice cloning, so we can change someone’s accent in real time, which opens up new lines of business for us.

Can you share some background on the evolution of voice cloning technology and its applications in the entertainment and content creation industries?

When we started the company, we were thinking about the high-end market: high-profile content creators. We thought it was a very cool technology that could be used in high-profile content creation. We started talking with film studios and video game companies, and we got very clear feedback:

  • They did want to use a technology like the one we were offering.
  • The quality of voice cloning they were seeing in the market wasn’t good enough.

The reason other programs were not good enough, and in many cases still are not, is that many of the providers used synthetic speech. That technology is focused on text-to-speech applications, while we were focused on speech-to-speech applications.

The difference between the two is significant, and it comes down to the input. A text-to-speech model takes written text and tries to make the output resemble a human voice. It relies heavily on language models and vocabularies, so if it encounters words it doesn’t know, it fails.

An even bigger issue is that text-to-speech models don’t give you control over emotions. If you want to produce high-quality content, you need to be able to convey emotion. Our speech-to-speech model relies heavily on the performance. A person performs the lines, you can direct them on how to perform, and then their voice is simply converted to the desired voice. The technology essentially removes the limitation of a person being tied to just one voice.

What are some common misconceptions about voice cloning?

There are two main misconceptions about voice cloning:

  • How people perceive voice cloning. People who are not exposed to projects like Respeecher’s, or who don’t understand how the technology works, assume that it is primarily a vehicle for disinformation or deepfakes.
  • That voice cloning steals jobs. This is not true at all, especially in the case of Respeecher, because we do speech-to-speech conversion. We rely on voice actors, who also happen to be the early adopters of our technology.

Our technology extends their ability to perform beyond the voice they were born with and the voice they have at a particular moment in their life. It also lets them scale their voices and performances, so a single actor can perform many different voices.

In your experience, what are the opportunities that voice transformation technology presents?

There’s a huge opportunity to disrupt the way we do voiceovers. Since we learned how to record sound, there hasn’t been much disruption in this area. The typical setup is a microphone and a person in front of it: they record something, and the recording gets used. There have been improvements in sound quality and in audio post-processing, but in terms of logistics, not much has changed.

The technology of voice cloning removes the hard limitation of a person having only one voice, the voice they have at one particular moment in their life. Voice cloning allows one person to speak in many voices, including other versions of their own voice, even if they no longer sound the way they did 20 years ago. Another opportunity is character voices, which are really hard to get for animation. Some voice actors can only sustain a character voice for about 15 minutes a day and then have to rest for the remainder of the day because it physically hurts.

What is your perspective on the potential impact of voice cloning on the future of voice acting and talent in the entertainment industry?

I’m a firm believer that voice cloning will be just another tool that lets talent enhance what they can do. It will also allow producers to create content in a better, faster, less expensive, and more efficient way. Content will become easier to produce, and the resources currently spent on organizing the production process can be reallocated to the creative side of things.

We also see a broader trend of democratization. Over the last 5 to 7 years, we have seen the democratization of everything, including the tools and technologies that empower creators. Technology has become a commodity, and we are at the stage where, even with a budget of a few hundred dollars, you can still produce something in your basement that could end up on Netflix or in front of a global audience.

This means that we can start living in a world where creators compete with their creative ideas, not with their budgets, and that’s an amazing world I would love to live in. Voice technology plays a big role here, because voice is a major part of the huge technology stack used to produce content globally.

Can you discuss any notable trends or innovations in voice cloning that content creators should be aware of?

There are plenty of tools being released now, including some very sophisticated text-to-speech tools that are much higher quality than what existed five years ago. Respeecher also has its own text-to-speech tools, and, like our speech-to-speech tools, they used to be available only to big studios with big budgets. But now, you can use Hollywood-grade tools yourself, without needing sound professionals or a team.

Content creators should be very cautious about which technology and provider they use. We see plenty of companies operating unethically, meaning they don’t put any boundaries in place, such as content moderation or verifying that voices belong to the people who claim them. This creates significant problems for the industry, because anyone can go to a website, clone whatever voice they want, and make it say whatever they want.

That happens when the technology is simply dropped into a product without safeguards. If you’re producing high-quality content, the kind that can be shown to millions of people, you should be concerned about the ethical side of things. If your software provider engages in practices like that, it should be a significant concern to you. It can become a PR problem and also contributes to an unsafe environment.

About the Author
Shauli Zacks
Content Editor

Shauli Zacks is a content editor at SafetyDetectives.

He has worked in the tech industry for over a decade as a writer and journalist. Shauli has interviewed executives from more than 350 companies to hear their stories, advice, and insights on industry trends. As a writer, he has conducted in-depth reviews and comparisons of VPNs, antivirus software, and parental control apps, offering advice both online and offline on which apps are best based on users' needs.

Shauli began his career as a journalist for his college newspaper, breaking stories about sports and campus news. After a brief stint in the online gaming industry, he joined a high-tech company and discovered his passion for online security. Leveraging his journalistic training, he researched not only his company’s software but also its competitors, gaining a unique perspective on what truly sets products apart.

He joined SafetyDetectives during the COVID years and found that the role allows him to combine his professional passions without being confined to a single product. It gives him the flexibility and freedom he craves while letting him help others stay safe online.
