In this SafetyDetectives interview, Rafael Delgado, the founder and CEO of DTec Biometria, brings a unique blend of technical expertise and business acumen to his role. With a background in Computer Science and extensive experience in Advanced Speech Technologies, Delgado played a pivotal role in developing DTec’s flagship product, BioVox, a voice biometrics solution. His interview sheds light on DTec’s core offerings and his vision for the evolving landscape of voice technology in cybersecurity.
Can you introduce yourself and talk about your role at DTec?
My name is Rafael Delgado and I’m the founder and CEO of DTec. My professional background is technical: I have a BSc in Computer Science from the UPM (Polytechnic University of Madrid) in Spain, and before founding DTec I spent most of my professional career working on Advanced Speech Technologies at Telefonica R&D. Thanks to that, I personally developed the core technology of DTec’s first product, BioVox, our voice biometrics solution, and even today I keep a technical role in the company. Before founding DTec I also acquired a background in business administration and marketing.
What are DTec’s main services?
DTec is a product-oriented company more than a service-oriented one, although we can provide custom development services to integrate our solutions into our clients’ systems. We license our core speech technologies and develop final solutions built upon them. These are our current speech solutions:
- BioVox: voice biometrics for identifying speakers in security and authentication applications.
- TruePersona: detection of AI voice cloning (deepfakes), useful for fighting fake news and also as an anti-spoofing feature that works together with BioVox in authentication applications based on voice biometrics.
- ReconVox: speech recognition robust to noisy conditions. It can work in natural-language, word-spotting, and custom-vocabulary modes, and it can adapt to changing acoustic conditions thanks to the exclusive AutoLearn feature.
- AudioWatermark: steganography technology for embedding inaudible information into an audio signal. It can be used for sending hidden messages, tracking the source of the audio, and verifying the integrity of the original conversation, because the watermark is bound to the audio signal itself, not to a file (a sketch of the general idea follows this list).
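To make that last point concrete, here is a minimal sketch of one classical signal-bound watermarking technique, spread-spectrum embedding: each payload bit modulates a keyed pseudo-random carrier that is added at low amplitude and later recovered by correlation. This is only an illustration of the general principle, not DTec’s proprietary AudioWatermark algorithm; the sample rate, chip length, and embedding strength are assumptions.

```python
# Minimal spread-spectrum audio watermark sketch (illustrative only,
# NOT DTec's AudioWatermark algorithm). The mark lives in the signal
# itself, so it is not bound to any particular file container.
import numpy as np

CHIP_LEN = 4000   # samples carrying one payload bit (assumed)
ALPHA = 0.01      # embedding strength, kept small relative to the host

def embed(audio: np.ndarray, bits: list[int], key: int) -> np.ndarray:
    """Add one keyed pseudo-random carrier per bit, signed by the bit."""
    rng = np.random.default_rng(key)          # shared secret key
    out = audio.astype(np.float64).copy()
    for i, bit in enumerate(bits):
        carrier = rng.standard_normal(CHIP_LEN)
        seg = slice(i * CHIP_LEN, (i + 1) * CHIP_LEN)
        out[seg] += ALPHA * (1 if bit else -1) * carrier
    return out

def extract(audio: np.ndarray, n_bits: int, key: int) -> list[int]:
    """Recover bits by correlating segments with the same carriers."""
    rng = np.random.default_rng(key)
    bits = []
    for i in range(n_bits):
        carrier = rng.standard_normal(CHIP_LEN)
        seg = audio[i * CHIP_LEN:(i + 1) * CHIP_LEN]
        bits.append(1 if np.dot(seg, carrier) > 0 else 0)
    return bits

# Round trip on synthetic "speech" (white noise stands in for audio).
host = np.random.default_rng(0).standard_normal(32000) * 0.1
marked = embed(host, [1, 0, 1, 1], key=42)
assert extract(marked, 4, key=42) == [1, 0, 1, 1]
```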
How do you see the role of voice technology evolving in the landscape of cybersecurity?
After a period in which voice technologies had a smaller presence in the call-center market, recent advances in voice-driven personal assistants for the home and the car have given these technologies new momentum. New conversational AI systems like ChatGPT will also most probably integrate with speech technologies for an even more human-like experience. The presence of these features in our daily lives makes security a very important asset, and because all these services are cloud based, always relying on an internet connection, cybersecurity is even more necessary. In addition, new voice cloning technology has created new threats in voice-driven procedures that were considered secure until now, so impersonation countermeasures are now also mandatory, whether for fighting phone-call scams or for protecting radio communications in the homeland security and military sectors.
With deepfake audio becoming increasingly sophisticated, what are the challenges in detecting real versus synthetic voice?
Deepfake technologies have evolved very fast recently, reaching a voice quality that was unthinkable a couple of years ago, and they are still improving. So, as developers of countermeasures for detecting synthetic voice, we also need to evolve and adapt to every new voice cloning technology. Think of it as an arms race. Our TruePersona solution currently models about 20 different voice cloning technologies, and it’s a very live product with frequent updates. Voice cloning technology will stabilize at some point, when the quality and the requirements for creating a new voice reach a very high standard. From a technological point of view, the challenges in detecting these artificially created voices are many, because they are designed precisely to sound natural and human-like, so we need to model and train our detection engines for each family of cloning technology; sometimes several systems share the same core cloning technology and can be grouped together. We also need to take advantage of our know-how in audio processing and analysis to detect anomalies not present in human voice. We humans are imperfect, sentiment-driven speech-producing machines, and that is what we search for. I don’t know whether we will be able to detect HAL-9000 when it comes out, but we will try…
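As a toy illustration of the “one detection model per cloning family” idea, the sketch below trains a separate binary classifier per family on simple long-term spectral statistics and flags a clip if any family-specific detector fires. The features, model choice, and all names here are illustrative assumptions, not TruePersona’s actual pipeline.

```python
# Toy per-family synthetic-voice detector (illustrative assumptions,
# not TruePersona's real features or models).
import numpy as np
from sklearn.linear_model import LogisticRegression

def spectral_stats(audio: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """Mean and std of the log-magnitude spectrum over fixed frames."""
    frames = audio[: len(audio) // n_fft * n_fft].reshape(-1, n_fft)
    logmag = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-8)
    return np.concatenate([logmag.mean(axis=0), logmag.std(axis=0)])

class FamilyDetector:
    """One binary detector per voice-cloning technology family."""
    def __init__(self) -> None:
        self.clf = LogisticRegression(max_iter=1000)

    def fit(self, human_clips: list, cloned_clips: list) -> None:
        X = np.stack([spectral_stats(c) for c in human_clips + cloned_clips])
        y = np.array([0] * len(human_clips) + [1] * len(cloned_clips))
        self.clf.fit(X, y)

    def score(self, clip: np.ndarray) -> float:
        """Probability the clip came from this cloning family."""
        return float(self.clf.predict_proba(spectral_stats(clip)[None])[0, 1])

def is_synthetic(clip: np.ndarray, detectors: dict, threshold: float = 0.5) -> bool:
    """Flag the clip if any family-specific detector fires."""
    return any(d.score(clip) > threshold for d in detectors.values())
```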
How does the AutoLearn feature in your speech recognition technology adapt to different environments, and what are its limitations?
AutoLearn is actually a very powerful feature because it can adapt to basically whatever you throw at it. It was originally designed to adapt the recognition engine to a specific speaker and improve recognition accuracy, but we discovered that you can also provide audio samples from a different dialect region, or even from a noisy acoustic environment, and AutoLearn can model and adapt to that too. It does require the user to have some knowledge of the problem being modelled in order to get the maximum benefit, because it analyses the audio and modifies the generic acoustic models of the recognition engine according to the training audio. It’s also best suited to acoustic properties that will repeat in the future: the voice of a speaker, a dialect region, an acoustic channel, a repetitive noise… But obviously it can’t predict sudden, non-repetitive noises like bumps, clicks, or horns; we have to cope with those in a different way, with specific audio filters.
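For readers curious what “modifying the generic acoustic models according to the training audio” can look like, here is a sketch of one classical technique for it: MAP adaptation of Gaussian mixture means, widely used for speaker and channel adaptation. It assumes a GMM-based acoustic model and is a stand-in for the general idea, not AutoLearn’s actual algorithm.

```python
# Classical MAP adaptation of GMM means toward new audio (a stand-in
# for acoustic-model adaptation in general, not AutoLearn itself).
import numpy as np

def map_adapt_means(means, weights, covars, feats, relevance=16.0):
    """Shift generic Gaussian means toward statistics of new audio.

    means:   (K, D) generic model means
    weights: (K,)   mixture weights
    covars:  (K, D) diagonal covariances
    feats:   (N, D) feature frames from the adaptation audio
    """
    K, D = means.shape
    # Posterior responsibility of each Gaussian for each frame.
    log_p = np.empty((len(feats), K))
    for k in range(K):
        diff = feats - means[k]
        log_p[:, k] = (np.log(weights[k])
                       - 0.5 * np.sum(diff**2 / covars[k]
                                      + np.log(2 * np.pi * covars[k]), axis=1))
    log_p -= log_p.max(axis=1, keepdims=True)
    post = np.exp(log_p)
    post /= post.sum(axis=1, keepdims=True)

    # Interpolate old means with the data mean, weighted by occupancy:
    # components that "saw" more adaptation frames move further.
    n_k = post.sum(axis=0)                             # soft frame counts
    data_mean = (post.T @ feats) / np.maximum(n_k, 1e-8)[:, None]
    alpha = (n_k / (n_k + relevance))[:, None]         # adaptation weight
    return alpha * data_mean + (1 - alpha) * means
```

Note how the relevance factor makes this well suited to properties that repeat, exactly as described above: rarely observed acoustic conditions barely move the model, while consistently repeated ones pull it strongly.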
What advancements in AI have most impacted the capabilities of your voice technology suite?
Without any doubt, recent advances in neural networks and deep-learning open-source toolkits. “AI” can be an overused word, too generic, but the new algorithms that form the building blocks of neural networks provide a training power that wasn’t available just a few years ago. New paradigms in AI learning, like generative adversarial networks (GANs), also bring very powerful tools for training. Thanks to that, we now take a hybrid approach in our solutions, combining traditional, more supervised stochastic and probabilistic algorithms with modern neural-network techniques. This way we get the best of both worlds.
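One common shape such a hybrid takes is late score fusion, where the score of a classical probabilistic model is combined with the score of a neural model. The sketch below uses placeholder scorers purely to show the pattern; the scorers and the fusion weight are assumptions, not DTec’s specific architecture.

```python
# Schematic late score fusion between a classical probabilistic model
# and a neural model (placeholder scorers; illustrative pattern only).
import numpy as np

def gmm_llr_score(features: np.ndarray) -> float:
    """Stand-in for a log-likelihood ratio from a classical GMM/UBM."""
    return float(features.mean())            # placeholder statistic

def nn_embedding_score(features: np.ndarray) -> float:
    """Stand-in for a cosine score from a neural speaker embedding."""
    return float(np.tanh(features.std()))    # placeholder statistic

def fused_score(features: np.ndarray, w: float = 0.6) -> float:
    """Weighted late fusion; w would normally be tuned on held-out data."""
    return w * nn_embedding_score(features) + (1 - w) * gmm_llr_score(features)
```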