Bhusan Chettri on AI and Computer-Generated Voice: Applications, Advantages, and Risks to Voice Biometrics



Dr. Bhusan Chettri, a Ph.D. graduate from Queen Mary University of London (QMUL), explains the fundamentals of how today's AI technology enables computers to produce human-like synthetic voices. He discusses their various applications and advantages, along with the threat they pose to voice-controlled applications. Before diving deep into computer-generated voice and its applications, let us walk through a simple banking scenario to set the context.

Bhusan Chettri offers a scenario: "Let us assume you are an employee of a reputed international bank. One day, while working late to meet a deadline, you receive a phone call from an unknown number abroad. The number appears to be from the UK. You answer, and the caller introduces himself as your manager. This makes sense: your manager is currently in London attending business meetings to close some international deals. Because of the country code and the familiar voice you hear, you are now confident that the person on the line is your manager. You engage in the conversation despite the urgent deadline. He asks you to make a wire transfer of a large sum to a new account, emphasizing that the money is essential to complete the deal he travelled to London for and must be sent as soon as possible. He then hangs up after giving clear instructions for making the transfer."

This does not sound right. Questions run through your head; you are confused and unsure what to do. On one hand, it is your manager's voice from London, and the country code confirms a UK number. But what he is asking is not in line with the company's policy: a transfer to a new account must go through verification checks. There are procedures to follow, yet he has asked you to ignore everything else and make the wire transfer immediately.

You weigh your options. If you stick to the company's protocol and refuse, he may be very upset; the deal might fall through, and you could lose your long-awaited promotion next month. With all these thoughts running through your mind, you follow your instincts and finally transfer the sum to the account provided.

What do you think could have happened? According to Bhusan Chettri, there are two likely outcomes.

  1. The voice really was your manager's and the phone call was genuine (and urgent). You are praised for your prompt action and support, the deal goes through, and you get your long-awaited promotion. Everybody is happy.
  2. The second outcome has no happy ending. The voice was fake, produced by a scammer who had been planning to attack the bank for a long time. You have just been scammed, and the bank has lost a huge sum. Instead of a promotion, your job is now at serious risk; you will most likely face an internal investigation, and legal action might be taken against you for not following company policy. This is very ugly indeed.

The second outcome is the bigger concern. With recent advances in AI-driven Text-to-Speech (TTS) and Voice Conversion (VC) technology, computers can now produce fake, synthetic voices that sound as natural as if spoken by a real human. The technology is typically driven by deep learning, which uses massive amounts of speech data to learn the patterns in voices (similarities and differences across speakers) and can build synthetic voices that sound flawless to human ears. One reason humans often cannot distinguish today's computer-generated voices from real ones is that our ears are not tuned to the tiny artifacts AI algorithms leave in synthetic speech. Instead, our hearing focuses on the bigger picture, the content (spoken words, speaking style, and so on), rather than small algorithm-induced differences.

It is therefore important to understand the basics of these technologies that can produce artificial voices humans cannot tell apart from real ones: their pros and cons (as we will see towards the end of this blog, the technology can be used equally for good and bad, and there are both good and bad actors), their applications, and the risks they pose to voice-driven applications in settings such as banking and the digital home.

Next, we provide a basic overview of TTS and VC technology and discuss their applications.

Text to Speech (TTS) transforms input text into speech. Given input text, the goal of TTS is to produce the corresponding speech while preserving naturalness as closely as possible, so that human ears find it hard to distinguish from real human speech. Figure 1 illustrates a typical TTS system. It consists of two major components: a text analysis module and a speech waveform generation module. The text analysis module analyses the input text and produces a sequence of phonemes defining its linguistic specification. These are then passed to the waveform generation module, which produces the speech waveform from the phonemes. It is also worth noting that today's advanced AI algorithms (so-called end-to-end deep learning) can produce speech waveforms directly from input text.


Figure 1: A typical Text-to-Speech system
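The two-stage pipeline in Figure 1 can be sketched in a few lines of code. This is a purely illustrative toy, not a real synthesizer: the grapheme-to-phoneme table, the per-phoneme pitches, and the sine-wave "vocoder" are all invented placeholders that only show how text analysis feeds waveform generation.

```python
import math

# Hypothetical grapheme-to-phoneme table (the text analysis stage).
G2P = {"hi": ["HH", "AY"], "there": ["DH", "EH", "R"]}

# Hypothetical per-phoneme pitch in Hz (the waveform generation stage).
PHONEME_PITCH = {"HH": 140.0, "AY": 220.0, "DH": 150.0, "EH": 200.0, "R": 170.0}

def text_analysis(text):
    """Map input text to a phoneme sequence (its linguistic specification)."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(G2P.get(word, []))
    return phonemes

def generate_waveform(phonemes, sample_rate=16000, dur=0.1):
    """Render each phoneme as a short sine segment, a crude stand-in
    for the neural vocoder used in real systems."""
    samples = []
    for ph in phonemes:
        freq = PHONEME_PITCH[ph]
        n = int(sample_rate * dur)
        samples.extend(math.sin(2 * math.pi * freq * t / sample_rate)
                       for t in range(n))
    return samples

phonemes = text_analysis("hi there")
wave = generate_waveform(phonemes)
print(phonemes)   # ['HH', 'AY', 'DH', 'EH', 'R']
print(len(wave))  # 5 phonemes x 1600 samples = 8000
```

An end-to-end deep learning system collapses these two hand-written stages into one learned model that maps text directly to a waveform.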

Such recent advances, driven by big data and deep learning (a form of AI), have brought significant progress to the TTS field. For example, the Canadian start-up Lyrebird claims its technology can produce a synthetic voice after listening to just one minute of sample audio from the target speaker. In simple terms, the technology can clone any voice from only 60 seconds of speech and make it say anything. Furthermore, they claim it can incorporate emotion into the synthesized speech, so customers can create synthetic voices that sound angry, joyful, or stressed. Just as computers can photoshop images to create fakes, such commercial speech technology lets people edit and manipulate voices very easily. For example, check out The Verge post to hear computer-generated voices of the US politicians Donald Trump, Barack Obama, and Hillary Clinton. Big companies such as Google, Microsoft, Apple, and Adobe have also built AI systems that create human-like voices.


Applications of TTS

Bhusan Chettri explained that this technology can be used in a wide range of applications, such as automatic text reading on mobile phones, e-book narration in the voices of popular celebrities, voice synthesis for people with disabilities, and speech translation from one language to another, to name a few. However, it can also have severe implications when used with the wrong intentions: with such technologies, engineers can create convincing fake voices of anyone. The banking scenario at the start of this article already illustrated the impact a fake voice can have.

Combining synthetic voices with fake images, one can go further and create customized fake videos of any person (some celebrity, for example) doing something unethical that they never did. Imagine the consequences of a fake video, created using AI technologies for voice, images, and video, of President Joe Biden making negative comments about China and its culture going viral on social media. These synthetic voices can also trick the voice-controlled biometric systems used to verify a person's identity.


Figure 2: A typical Voice Conversion system showing both training and deployment steps.

Voice conversion (VC)

This technology aims to produce an artificial voice of a target speaker from a source speaker's voice. Unlike TTS, voice conversion typically takes two voices as input during its training step: the source speaker's (whose voice is to be transformed) and the target speaker's (the voice to be imitated). Note that the content, the spoken words, remains the same during conversion; only the voice identity changes. Figure 2 illustrates a typical VC system. Usually, the conversion algorithm works directly on speech signals of the source and target speakers uttering the same sentences. In other words, the system requires a parallel corpus from the two speakers to learn the transformation function that converts the vocal characteristics of one speaker into the other's.
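The parallel-corpus idea can be illustrated with a deliberately simple sketch: given time-aligned feature frames from the source and target speakers saying the same utterance, fit a mapping from one to the other. Real VC systems learn far richer transformations (Gaussian mixtures, neural networks) over spectral features; here a single made-up feature dimension and a plain linear fit stand in for that learned function.

```python
def fit_linear(xs, ys):
    """Least-squares fit of the mapping y = a*x + b from aligned frames."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

# Hypothetical time-aligned frames of one spectral feature,
# source speaker vs. target speaker saying the same words.
source_frames = [1.0, 2.0, 3.0, 4.0]
target_frames = [2.5, 4.5, 6.5, 8.5]

# Training: learn the transformation function from the parallel corpus.
a, b = fit_linear(source_frames, target_frames)

# Deployment: convert a new source frame into the target's voice space.
def convert(x):
    return a * x + b

print(round(a, 2), round(b, 2))  # 2.0 0.5
print(convert(5.0))              # 10.5
```

The spoken content never enters the mapping; only the per-frame vocal characteristics are transformed, which mirrors why VC changes voice identity while leaving the words intact.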

Applications of VC

Applications of VC technology include producing natural-sounding voices for people with speech disabilities and voice dubbing in the entertainment industry. The same technology can, however, be used with bad intentions to produce fake voices of a targeted speaker, either to defame the person or to steal their identity and act in their name.

Risks on Voice Biometrics

It is well known that images can be faked using Photoshop. When we see certain images, we instantly react: "oh, that image is photoshopped." We reach such conclusions either because the image was too good to be true or because we already had prior knowledge of its contents. Without such prior knowledge, anyone can easily be fooled into believing a fake image is real; it becomes hard to judge between a real and a photoshopped image.

As Chettri discussed above, TTS and VC technology can produce human-like synthetic voices that an attacker with malicious intentions can use against voice-biometric access systems: banks, voice-controlled automobile access, or another person's smartphone, for example. What you hear may not be 100% trustworthy, and the caller may not be who they claim to be. This article aims to raise awareness of the existing AI technology and algorithms capable of editing or synthesizing voices so they sound as natural as real human speech. Such awareness helps us think about how to safeguard ourselves against attacks launched by bad actors. Next, we provide a brief overview of the AI deployed to counter fake speech generated with TTS and VC technology.


Protecting Voice Biometrics against synthetic voices

Bhusan briefly discussed how voice biometrics can be protected from manipulation by computer-generated voices. Figure 3 illustrates the components of a typical countermeasure, also known as a fake speech detector. A countermeasure is an AI system, typically a binary classifier, whose task is to determine whether the input speech is real human speech or a computer-generated voice. To make such judgments, these systems are first trained on a large speech dataset containing both real and computer-generated voices collected from many speakers across the globe. During training, the algorithm learns discriminative patterns between real and fake voices.

Later, during deployment (the testing step), the detector looks for these learned patterns in an unknown voice to make its judgment. If it finds the signature of a fake voice, it classifies the input as fake; otherwise, it regards the voice as genuine and passes it on to the other components of the biometric system for further services. Note that one step is common to both training and testing: feature extraction. This step transforms the input speech into a simpler, meaningful representation that the algorithm can process further to build the desired classifier, and it must be identical during training and testing.
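A toy sketch makes the pipeline concrete: a single shared feature-extraction function feeds a binary classifier at both training and testing time. Everything here is invented for illustration; the "voices" are short number lists, and the lone feature (average absolute amplitude) stands in for the rich spectral features real countermeasures use.

```python
def extract_features(signal):
    """Shared feature extraction: identical at train and test time."""
    return sum(abs(s) for s in signal) / len(signal)

def train(real_voices, fake_voices):
    """Learn a decision threshold halfway between the two class means."""
    real_mean = sum(extract_features(v) for v in real_voices) / len(real_voices)
    fake_mean = sum(extract_features(v) for v in fake_voices) / len(fake_voices)
    return (real_mean + fake_mean) / 2, fake_mean > real_mean

def classify(signal, threshold, fake_is_higher):
    """Deployment: look for the learned pattern in an unknown voice."""
    feat = extract_features(signal)
    is_fake = feat > threshold if fake_is_higher else feat < threshold
    return "fake" if is_fake else "real"

# Hypothetical training data: the fake class happens to have higher energy.
real = [[0.1, -0.2, 0.1], [0.2, -0.1, 0.15]]
fake = [[0.8, -0.9, 0.7], [0.9, -0.85, 0.8]]

threshold, fake_is_higher = train(real, fake)
print(classify([0.12, -0.18, 0.1], threshold, fake_is_higher))  # real
print(classify([0.85, -0.8, 0.9], threshold, fake_is_higher))   # fake
```

In a production system the accepted ("real") voice would then be forwarded to the speaker-verification component, exactly as described above.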


Figure 3: Fake speech detector (countermeasure).

As fake speech detection and prevention has become a hot topic and an emerging research field, the speech community has been promoting it since 2015 through the Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof). ASVspoof's main goal is to raise awareness of voice spoofing techniques and bring researchers around the globe together to combat spoofing attacks. To this end, the organisers also release free spoofed-speech databases, available from their website.

In this article, Bhusan Chettri has explained how computers and AI can be used to generate synthetic voices. Much as one can photoshop images using commercial software (e.g., Adobe Photoshop), such technologies make it easy to edit and manipulate voices. Text-to-speech and voice conversion are the two technologies most commonly used to produce artificial voices that sound as natural as if spoken by a real human. We discussed their applications in different domains, the dangers of such human-like synthetic voices, and the risks computer-generated voices pose to voice-biometric systems.

