Content

1 Voice Cloning: A Revolutionary Technology
2 Steps to Clone Your Own Voice Using Tortoise TTS
3 Tortoise TTS
4 How Tortoise TTS Works
5 Demo: Running Tortoise TTS
6 Uploading Your Own Voice Recordings
7 Starting Voice Cloning
8 Modify Text and Re-run
9 Closing Thoughts

Vijona

25 Feb at 9:46

AI and Deep Learning: The Power of Voice Cloning

One of the coolest possibilities offered by AI and deep learning is the ability to replicate things from the real world. Whether it is generating realistic images from scratch, producing the right response to an incoming chat request, or composing appropriate music for a given theme, we can rely on AI to deliver impressive approximations of things that were previously possible only when guided directly by a human hand.

Voice Cloning: A Revolutionary Technology

Voice cloning is one of the interesting possibilities offered by this technology. It is the practice of mimicking a speaker's voice by using a deep learning model to recreate their specific intonation, accent, and pitch. When combined with technologies like Generative Pretrained Transformers and static image manipulators such as SadTalker, we can start to make some really interesting approximations of real-life human behavior, albeit from behind a screen and speaker.

Steps to Clone Your Own Voice Using Tortoise TTS

In this short article, we will walk through each of the steps required to clone your own voice, and then generate accurate impersonations of yourself using Tortoise TTS. We can then take these clips and combine them with other projects to create some really interesting outcomes with AI.

Tortoise TTS

Released by solo author James Betker, Tortoise is one of the most capable and easiest-to-use voice cloning models that can run on local and cloud machines without requiring any API or service payment. It can clone a voice from just a few (3-5) 10-second voice clips.

How Tortoise TTS Works

Both the inner workings of Tortoise and its inspiration come from image generation with autoregressive transformers and Denoising Diffusion Probabilistic Models (DDPMs). The author sought to recreate the success of those approaches, applied to speech generation. Those models learn image generation as a step-wise probabilistic procedure that, over time and with large amounts of data, learns the image distribution.
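To make the step-wise probabilistic procedure more concrete, here is a minimal sketch of the forward (noising) half of a generic DDPM, written with the Python standard library only. It is not Tortoise's code: the linear schedule, the step count of 1000, and all function names are illustrative assumptions. A diffusion model is trained to undo this gradual corruption one step at a time.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances that grow linearly (illustrative values)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar[t] is the product of (1 - beta) up to and including step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t from x_0 in one shot: sqrt(a) * x0 + sqrt(1 - a) * noise."""
    a = alpha_bars[t]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

rng = random.Random(0)
clean = [1.0, -0.5, 0.25, 0.0]  # stand-in for a few spectrogram values
slightly_noisy = add_noise(clean, 0, alpha_bars, rng)   # almost unchanged
mostly_noise = add_noise(clean, 999, alpha_bars, rng)   # signal nearly gone
```

Early steps leave the signal almost intact, while the last step is close to pure Gaussian noise. Generation runs the learned reverse process, starting from noise and denoising step by step.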

Tortoise is trained specifically on visual representations of speech data called mel spectrograms. These representations of the audio can be modeled using the same process as in typical DDPM setups, with only slight modification to account for voice data. Additionally, the model can mimic an existing voice by using reference clips of that voice as a conditioning input.

Together, this can be used to accurately recreate voice data using very little initial input.
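For readers unfamiliar with the "mel" in mel spectrogram, the sketch below shows the mel scale itself, using one common formula (2595 * log10(1 + f / 700)). This is an illustration of the general concept, not the exact filterbank Tortoise uses; the band count and the 12 kHz upper limit (half of a 24 kHz sample rate) are assumptions chosen for the example.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels using a common log-based formula."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(num_bands, low_hz, high_hz):
    """Edges (in Hz) of bands that are equally wide on the mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / num_bands
    return [mel_to_hz(low_mel + i * step) for i in range(num_bands + 1)]


# Eight illustrative bands between 0 Hz and 12 kHz.
edges = mel_band_edges(8, 0.0, 12000.0)
for low, high in zip(edges, edges[1:]):
    print(f"{low:8.1f} Hz to {high:8.1f} Hz")
```

The printed bands get wider as frequency rises, which mirrors how human hearing resolves low pitches more finely than high ones. A mel spectrogram summarizes the energy in bands like these over short time windows, producing an image-like array that a diffusion model can work with.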

Demo: Running Tortoise TTS

To get started with Tortoise TTS, open the Notebook environment, locate the tortoise_tts.ipynb file, and run the first cell. This will set up the necessary dependencies and initialize the model.

Selecting Voice Samples

Choosing the right voice samples is crucial for achieving high-quality voice cloning. Below are some tips to ensure the best results:

Use a high-quality microphone: If you don’t have a professional microphone, a mobile phone often provides better noise reduction than a standard laptop microphone.
Record in a quiet, echo-free environment: A closet full of clothing can help dampen any unwanted reverberations.
Prepare a script: Reading from a script minimizes unnecessary pauses, “um” sounds, and stutters.
Include diverse phonemes: Using phonetic pangrams like “That quick beige fox jumped in the air over each thin dog” helps the AI capture a broad range of speech sounds.
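Before uploading, it can help to sanity-check the recordings. The sketch below uses Python's standard wave module to count the clips in a folder and flag any that are far from the roughly 10-second length mentioned earlier. The function names and the 8 to 12 second tolerance are assumptions made for this example, and the wave module only reads uncompressed PCM WAV files.

```python
import wave
from pathlib import Path


def clip_seconds(path):
    """Duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_clips(folder, min_clips=3, min_seconds=8.0, max_seconds=12.0):
    """Return a list of human-readable warnings (empty if all looks fine)."""
    clips = sorted(Path(folder).glob("*.wav"))
    warnings = []
    if len(clips) < min_clips:
        warnings.append(f"found {len(clips)} clip(s); aim for at least {min_clips}")
    for clip in clips:
        seconds = clip_seconds(clip)
        if not min_seconds <= seconds <= max_seconds:
            warnings.append(
                f"{clip.name}: {seconds:.1f}s is outside {min_seconds}-{max_seconds}s"
            )
    return warnings
```

Calling check_clips on the folder holding your recordings returns an empty list when there are enough clips of suitable length. It says nothing about audio quality, so listening to each clip is still worthwhile.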

Following these suggestions, along with the original recommendations, should make for a smoother cloning process.

Ethical Considerations

Cloning voices raises ethical and legal concerns. You should never use someone’s voice without their explicit permission. We strongly discourage any malicious or deceptive use of voice cloning technology. It should only be used for parody, research, or personal projects.

Code Breakdown

Before we begin, we need to set up the workspace. The first step involves installing the required dependencies. Unfortunately, not all dependencies are listed in the original requirements.txt file, so we need to install some additional packages manually.

# First, follow the instructions in the README.md file under Local Installation
!pip3 install -r requirements.txt
!pip install librosa einops rotary_embedding_torch omegaconf pydub inflect
!python3 setup.py install

The next step involves importing the necessary modules and downloading the pre-trained model:

# Imports used through the rest of the notebook.
import torch
import torchaudio
import torch.nn as nn
import torch.nn.functional as F

import IPython

from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio, load_voice, load_voices

# This will download all the models used by Tortoise from the HF hub.
# tts = TextToSpeech()
# To use DeepSpeed, pass use_deepspeed=True; it is nearly 2x faster than normal.
tts = TextToSpeech(use_deepspeed=True, kv_cache=True)

Once the model is downloaded, we define the text to be spoken and choose a quality preset for generation:

# This is the text that will be spoken.
text = "Joining two modalities results in a surprising increase in generalization! What would happen if we combined them all?"

# Here's something for the poetically inclined.. (set text=)
"""
Then took the other, as just as fair,
And having perhaps the better claim,
Because it was grassy and wanted wear;
Though as for that the passing there
Had worn them really about the same,"""

# Pick a "preset mode" to determine quality. Options: {"ultra_fast", "fast" (default), "standard", "high_quality"}. See docs in api.py
preset = "ultra_fast"
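Since the preset is a plain string, a typo is easy to make and only surfaces later at generation time. The small helper below is a hypothetical convenience, not part of Tortoise, that checks the name against the four options listed in the cell comment before anything expensive runs.

```python
# The four preset names listed in the notebook comment.
PRESETS = ("ultra_fast", "fast", "standard", "high_quality")


def pick_preset(name):
    """Return name unchanged if it is a known preset, else raise ValueError."""
    if name not in PRESETS:
        options = ", ".join(PRESETS)
        raise ValueError(f"unknown preset {name!r}; choose one of: {options}")
    return name


preset = pick_preset("ultra_fast")
```

A reasonable workflow is to iterate on wording with ultra_fast and switch to a slower preset once the text is final.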

Uploading Your Own Voice Recordings

We can now upload our own voice recordings to the directory /notebooks/tortoise-tts/tortoise/voices. Use the file navigator on the left side of the GUI to find this folder, and create a new subdirectory named voice_test. Upload your sample recordings to this folder.
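If you prefer to do this step from a notebook cell rather than the file navigator, the following sketch creates the subdirectory and copies the clips into it using only the standard library. The add_voice helper and the example clip names are assumptions for illustration; the voices path and the voice_test name come from the step above.

```python
import shutil
from pathlib import Path


def add_voice(clip_paths, voices_dir, voice_name="voice_test"):
    """Copy clips into voices_dir/voice_name, creating the folder if needed."""
    target = Path(voices_dir) / voice_name
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for clip in clip_paths:
        destination = target / Path(clip).name
        shutil.copy2(clip, destination)
        copied.append(destination)
    return copied


# Example call (paths to your own recordings are placeholders):
# add_voice(["clip1.wav", "clip2.wav", "clip3.wav"],
#           "/notebooks/tortoise-tts/tortoise/voices")
```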

Once the upload is complete, we can run the next cell to see all available voices that we can use for the demo.

# Tortoise will attempt to mimic voices you provide.
# It comes pre-packaged with some voices you might recognize.

# Let's list all the available voices.
# These are some random clips I've gathered from the internet
# as well as a few voices from the training dataset.
# Feel free to add your own clips to the voices/ folder.
%ls tortoise/voices

# Play a sample voice
IPython.display.Audio('tortoise/voices/tom/1.wav')

List of Available Voices

After running the above code, you should see a list of available voices similar to this:

angie/                freeman/  myself/        tom/            train_grace/
applejack/            geralt/   pat/           train_atkins/   train_kennard/
cond_latent_example/  halle/    pat2/          train_daws/     train_lescault/
daniel/               jlaw/     rainbow/       train_dotrice/  train_mouse/
deniro/               lj/       snakes/        train_dreams/   weaver/
emma/                 mol/      tim_reynolds/  train_empire/   william/
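The same listing can be produced in plain Python, which makes it possible to confirm programmatically that your folder is present before moving on. This is a small sketch with hypothetical helper names; it assumes only that each voice is a subdirectory of the voices folder, as the listing above suggests.

```python
from pathlib import Path


def list_voices(voices_dir):
    """Sorted names of the subdirectories in voices_dir, one per voice."""
    return sorted(p.name for p in Path(voices_dir).iterdir() if p.is_dir())


def require_voice(name, voices_dir):
    """Return name if a matching voice folder exists, else raise an error."""
    available = list_voices(voices_dir)
    if name not in available:
        raise FileNotFoundError(
            f"voice {name!r} not found; available: {', '.join(available)}"
        )
    return name


# Example call, relative to the repository root:
# voice = require_voice("voice_test", "tortoise/voices")
```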

Starting Voice Cloning

Now we are finally ready to begin voice cloning. Use the code in the following cell to generate a sample clone using the text variable as input. Note that we can adjust the preset (ultra_fast, fast, standard, or high_quality), which trades generation speed against quality and can have a significant effect on the final output.

# Pick one of the voices from the output above
voice = 'voice_test'

# Define the text to be spoken
text = 'Hello, you have reached the voicemail of myname. Please leave a message.'

# Load the voice samples and conditioning latents
voice_samples, conditioning_latents = load_voice(voice)

# Generate the cloned speech using Tortoise
gen = tts.tts_with_preset(text, voice_samples=voice_samples, conditioning_latents=conditioning_latents, 
                          preset=preset)

# Save and play the generated audio file
torchaudio.save('generated.wav', gen.squeeze(0).cpu(), 24000)
IPython.display.Audio('generated.wav')

Modify Text and Re-run

Change the text variable in the code cell above and run it again to generate different speech outputs using your cloned voice!
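To generate several lines in one go instead of editing and re-running by hand, the loop can be wrapped in a small helper. The sketch below is an assumption-laden convenience: generate_many and the synthesize callable are hypothetical names, and the commented-out synthesize body simply reuses the calls from the cell above.

```python
def generate_many(texts, synthesize, prefix="generated"):
    """Call synthesize(text, path) for each text and return the output paths."""
    paths = []
    for index, text in enumerate(texts, start=1):
        path = f"{prefix}_{index:02d}.wav"
        synthesize(text, path)
        paths.append(path)
    return paths


# One way to define synthesize in the notebook, reusing the earlier cell
# (tts, voice_samples, conditioning_latents and preset must already exist):
#
# def synthesize(text, path):
#     gen = tts.tts_with_preset(text, voice_samples=voice_samples,
#                               conditioning_latents=conditioning_latents,
#                               preset=preset)
#     torchaudio.save(path, gen.squeeze(0).cpu(), 24000)
#
# generate_many(["First line to speak.", "Second line to speak."], synthesize)
```

Passing the synthesis step in as a function keeps the loop independent of the model, so the file naming can be tried out with a dummy function before spending GPU time.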

Closing Thoughts

Voice cloning with Tortoise TTS opens up a world of possibilities. Whether for entertainment, research, or personal projects, the ability to synthesize human-like speech from a few samples is an incredible feat of modern AI. Experiment with different inputs, tweak the settings, and explore the full potential of neural voice cloning.

Source: digitalocean.com
