Jemine, also known as CorentinJ on GitHub, has developed a Voice Cloning Toolbox that can take a 5-second clip of a person speaking and simulate their voice. You simply type in a phrase, and the toolbox generates audio that sounds like that person speaking what you typed. It is an implementation of the paper hosted at https://arxiv.org/pdf/1806.04558.pdf. It converts the 5-second audio clip into a numerical representation of the voice (an embedding), which is then used to condition a text-to-speech model so that it generates new speech in that voice. This is all part of a three-stage deep learning framework called SV2TTS, the three stages being the encoder, the synthesizer, and the vocoder.
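To make the three stages a bit more concrete, here is a rough Python sketch of how data moves through an SV2TTS-style pipeline. It is not the toolbox's actual code or API: every function name, the constants `EMBED_DIM`, `N_MELS`, and `HOP`, and the arithmetic inside each stage are made-up stand-ins for what are really trained neural networks. The only thing it tries to show is the shape of the hand-off: clip to embedding, text plus embedding to spectrogram frames, frames to waveform.

```python
import math
from typing import List

# All constants below are illustrative assumptions, not values read from the repo.
EMBED_DIM = 256   # length of the fixed-size voice embedding
N_MELS = 80       # values per spectrogram frame
HOP = 200         # audio samples produced per spectrogram frame


def encode_speaker(reference_audio: List[float]) -> List[float]:
    """Stage 1 stand-in (encoder): squash a clip of any length into a
    fixed-length, unit-norm vector. A real encoder is a trained network."""
    embed = [0.0] * EMBED_DIM
    for i, sample in enumerate(reference_audio):
        embed[i % EMBED_DIM] += sample
    norm = math.sqrt(sum(v * v for v in embed)) or 1.0
    return [v / norm for v in embed]


def synthesize(text: str, speaker_embedding: List[float]) -> List[List[float]]:
    """Stage 2 stand-in (synthesizer): turn text into spectrogram-like frames,
    conditioned on the embedding. Here: one toy frame per character."""
    bias = sum(speaker_embedding) / len(speaker_embedding)
    return [[(ord(ch) % 32) / 32.0 + bias] * N_MELS for ch in text]


def vocode(mel: List[List[float]]) -> List[float]:
    """Stage 3 stand-in (vocoder): turn frames into audio samples.
    Here: a sine burst per frame, scaled by the frame's mean value."""
    wav: List[float] = []
    for frame in mel:
        level = sum(frame) / len(frame)
        wav.extend(level * math.sin(2 * math.pi * n / HOP) for n in range(HOP))
    return wav


def clone_voice(reference_audio: List[float], text: str) -> List[float]:
    """Chain the three stages: encoder -> synthesizer -> vocoder."""
    embedding = encode_speaker(reference_audio)
    mel = synthesize(text, embedding)
    return vocode(mel)


if __name__ == "__main__":
    # A fake "5-second clip" at an assumed 16 kHz sample rate.
    reference = [math.sin(0.01 * i) for i in range(5 * 16000)]
    wav = clone_voice(reference, "Hello world")
    print(len(wav), "samples generated")
```

The point of the split is that the embedding is computed once from the short clip and then reused for any text you type, which is why the toolbox only needs a few seconds of reference audio.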
A video showing the exact process to record and synthesize the voice can be found below:
I was interested in this repo because the thought of someone copying my voice so accurately fascinates and terrifies me far more than the thought of them copying my face. A voice is so personal, whereas people look like other people all the time. I feel like if this software develops further, along with things like FaceApp, we could see a change in the way we represent ourselves online. I also think that this is a great example of collaboration on GitHub, and a good first look at an open source project.
This software is available on GitHub here.