In addition to visual deepfakes, there is also the possibility of creating deepfakes of audio recordings. This uses the so-called text-to-speech method. With the currently publicly available tools, a solid technical understanding is required to create audio deepfakes, so the general public cannot easily create one at the moment. Nevertheless, companies should think about how they will deal with, for example, fake calls in the future.

In 20, in a series of articles, deepfakes were analysed and our own deepfakes were created. At that time, only video and image deepfakes were considered. However, deepfakes can be created not only for videos or images; the possibility also exists for audio recordings. This article gives an overview of deepfakes of voice recordings.

The procedure for creating audio deepfakes is similar to that for visual deepfakes, yet still different. What is similar is that audio deepfakes are based on the same principles of computation with neural networks. The approach to processing the voice material, however, differs because of the starting point. To create an audio deepfake, clear recordings of a speaker are needed, preferably without interruptions, ambient sounds or background noise. The more such material is available, the better the audio deepfake will be.

The approach of current tools is to read out text in the voice of a selected person. In a first step, the model must be taught to read in a specified language and to reproduce what it has read. This is based on a generic voice, for which a large amount of voice material, at least 24 hours of audio, has to be available. In addition, transcripts must be available for the recordings so that text can eventually be converted into voice sounds. The model is fed with the recordings and the transcripts. The text and audio segments given to the model for training should not be longer than 10 seconds each and should stop at the end of a word. This means that a lot of effort is needed to prepare the data for such a generic model: on the one hand, enough good recordings must be available; on the other hand, the recordings and texts must be brought into the state the model expects.

After the material has been prepared, a lengthy computation is carried out that allows the model to establish the correlation between the text and the audio. This results in a generic base model in the desired language, which can be used in the next step to fine-tune the model with the target voice. The fine-tuning requires another 30% of the time needed for training the base model, and about 2.5 to 3 hours of voice recordings of the desired speaker are necessary to achieve a good result. However, the results obtained from this fine-tuning still sound relatively metallic and robot-like. The reason is that only the most important frequencies are trained, since the amount of computation and time required to train all frequencies present would be far too great. To turn the metallic voice into a better-sounding, or even indistinguishable, imitation of the voice, one last step is needed: the results are fed into a so-called neural vocoder, which fills in the gaps in the frequencies and thus gives the output a natural sound.

There are various publicly available tools. Two that look very promising are TTS from Mozilla and tacotron2 from NVIDIA. Both come with instructions on how to use them, but it quickly becomes clear that the currently available tools require technical understanding as well as an understanding of how audio deepfakes work.

On YouTube, you can find many examples of audio deepfakes produced with Tacotron2; the Vocal Synthesis channel, for example, uses this approach. With the publicly available tools, however, the results are mostly audibly manipulated. The examples from this YouTube channel are created with a method that is not publicly available. Nevertheless, they show that amazing results can already be achieved today. In combination with visual deepfakes, this could potentially create a complete imitation of a person, with similar advantages and disadvantages as already discussed in the introduction to deepfakes.
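The data-preparation rule described in this article (training segments of at most 10 seconds, each ending exactly on a word boundary) can be sketched as a small packing routine. This is a minimal illustration, assuming per-word time alignments are already available (for example from a forced aligner); the data format and function name are hypothetical and not part of Mozilla TTS or tacotron2.

```python
MAX_SEGMENT_S = 10.0  # training segments should stay short, per the article

def pack_segments(words, max_len=MAX_SEGMENT_S):
    """Greedily group consecutive (word, start_s, end_s) tuples into
    segments no longer than max_len seconds; every segment boundary
    falls on a word boundary."""
    segments = []
    current = []      # words collected for the segment being built
    seg_start = None
    for word, start, end in words:
        if seg_start is None:
            seg_start = start
        if end - seg_start > max_len and current:
            # close the segment where the previous word ended,
            # so it stays under max_len and ends on a word boundary
            segments.append((seg_start, current[-1][2],
                             " ".join(w for w, _, _ in current)))
            current = []
            seg_start = start
        current.append((word, start, end))
        # note: a single word longer than max_len is kept whole,
        # since splitting it would break the word-boundary rule
    if current:
        segments.append((seg_start, current[-1][2],
                         " ".join(w for w, _, _ in current)))
    return segments
```

Each resulting (start, end, text) triple corresponds to one audio clip plus its transcript, the pairing the model is fed with during training.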
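The rough resource figures from this article (at least 24 hours of generic audio for the base model, about 2.5 to 3 hours of the target speaker, and fine-tuning at roughly 30% of the base training time) can be bundled into a small planning helper. This is purely an illustrative sketch; the function and its thresholds are assumptions for the example, not part of any tool's API.

```python
# Rule-of-thumb figures taken from the article; everything else is assumed.
BASE_AUDIO_HOURS = 24.0    # minimum generic speech for the base model
TARGET_AUDIO_HOURS = 2.5   # minimum recordings of the target speaker

def training_plan(base_audio_h, target_audio_h, base_train_h):
    """Check the data against the rule-of-thumb minimums and estimate
    total training time (fine-tuning adds ~30% of base training time)."""
    if base_audio_h < BASE_AUDIO_HOURS:
        raise ValueError("need at least 24 hours of audio for the base model")
    if target_audio_h < TARGET_AUDIO_HOURS:
        raise ValueError("need about 2.5-3 hours of the target speaker")
    return {
        "base_training_h": base_train_h,
        "fine_tuning_h": 0.3 * base_train_h,   # ~30% of base training
        "total_h": 1.3 * base_train_h,
    }
```

For example, with a base model that takes 100 hours to train, the fine-tuning step adds roughly another 30 hours.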
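One way to see why the output sounds metallic before the vocoder step: systems like Tacotron2 predict a compressed mel spectrogram (commonly 80 bands) rather than the full frequency resolution, and the neural vocoder reconstructs the missing detail. The sketch below computes such mel band centre frequencies; the 80-band and 8 kHz figures are common defaults assumed for the example, not values from the article.

```python
import math

def hz_to_mel(f_hz):
    """Standard mel-scale conversion (O'Shaughnessy formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_centers(n_bands=80, f_min=0.0, f_max=8000.0):
    """Centre frequencies of n_bands bands spaced evenly on the mel
    scale -- the coarse frequency grid the model actually predicts."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_bands)]
```

Eighty centre frequencies for the whole audible band below 8 kHz is a very coarse grid; everything between those bands is exactly what the vocoder has to fill in.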