Joseph Chang, jdchang@ucsd.edu
The goal of this project is to produce music and speech from written text in Python using machine learning. This hands-off approach lets artists experiment with new music or find inspiration for new songs. A person writes whatever lyrics they want, and DeepVoice3 converts them into speech using a model created from a woman's voice. In this case, hopeful quotes are used as the text. Performance RNN is then used separately to produce a music piece by combining multiple generated music phrases. These phrases are similar to a bass guitar audio clip on which the model was trained. Gansynth is used to interpolate a MIDI file of Frank Mills's Musicbox Dancer. Finally, the speech, generated music, and interpolated music are combined in the DaVinci Resolve video editor. The project succeeds in producing comprehensible human speech and completely new music. However, the output is not near the quality one would expect from an actual artist. A future direction would be to train an RNN model for singing rather than just speaking.
DeepVoice3 uses a pretrained model configured by 20180505_deepvoice3_ljspeech.json, which is available online. The code downloads it automatically.
SGM-v2.01-Sal-Guit-Bass-V1.3.sf2
- https://sites.google.com/site/soundfonts4u/
- Performance RNN is run with this SoundFont from Soundfonts4u, which combines guitar and bass. The .sf2 file must be placed in the /tmp/ directory so the code can access it.
DeepVoice3 converts the following hopeful quotes to speech:
- "Good, better, best. Never let it rest. 'Til your good is better and your better is best."
- "The most beautiful things in the world cannot be seen or even touched. They must be felt with the heart."
- "The best preparation for tomorrow is doing your best."
- "Every next level of your life will demand a different you."
- "If your goals don't scare you. They aren't big enough."
- "Don't listen to what they say."
- "Be fearless in the pursuit of what sets your soul on fire."
- "The greatest glory in living lies not in never falling, but in rising every time we fall."
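A text-to-speech script typically reads its input one utterance at a time. The following is a minimal sketch, not the notebook's own code, that writes the eight quotes above to a plain-text file with one quote per line. The function name `write_text_list` and the file name `quotes.txt` are hypothetical, and the one-quote-per-line layout is an assumption about how the input would be fed to the synthesizer.

```python
from pathlib import Path

# The eight quotes listed above, copied verbatim.
QUOTES = [
    "Good, better, best. Never let it rest. 'Til your good is better and your better is best.",
    "The most beautiful things in the world cannot be seen or even touched. They must be felt with the heart.",
    "The best preparation for tomorrow is doing your best.",
    "Every next level of your life will demand a different you.",
    "If your goals don't scare you. They aren't big enough.",
    "Don't listen to what they say.",
    "Be fearless in the pursuit of what sets your soul on fire.",
    "The greatest glory in living lies not in never falling, but in rising every time we fall.",
]


def write_text_list(quotes, path):
    """Write one quote per line (hypothetical helper); return the line count."""
    # Collapse stray whitespace and skip empty entries so each line is one utterance.
    cleaned = [" ".join(q.split()) for q in quotes if q.strip()]
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return len(cleaned)


if __name__ == "__main__":
    count = write_text_list(QUOTES, "quotes.txt")  # file name is an assumption
    print(f"wrote {count} quotes")
```

Keeping each quote on its own line makes it easy to map every generated speech file back to the quote it came from.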
Frank_Mills_-_ Musicbox_Dancer.mid
- https://www.midiworld.com/search/?q=dance
- Gansynth interpolates this music piece from Midiworld. The .mid file must be added to the /gansynth/midi/ directory to be accessed by the code.
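Both input files have to sit in specific directories before the notebooks can read them. The sketch below shows one way to stage them with the Python standard library. The helper `stage_file` is hypothetical and is not part of the notebooks, and it assumes the files have already been downloaded or uploaded into the current working directory.

```python
import shutil
from pathlib import Path


def stage_file(src, dest_dir):
    """Copy src into dest_dir, creating the directory if needed (hypothetical helper)."""
    src = Path(src)
    if not src.is_file():
        raise FileNotFoundError(f"input file not found: {src}")
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / src.name
    shutil.copy2(src, target)
    return target


# Example calls, assuming both files are in the current working directory:
# stage_file("SGM-v2.01-Sal-Guit-Bass-V1.3.sf2", "/tmp/")
# stage_file("Frank_Mills_-_ Musicbox_Dancer.mid", "/gansynth/midi/")
```

Raising `FileNotFoundError` early gives a clearer message than a failure deep inside the generation code.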
DeepVoice3
- https://colab.research.google.com/drive/1JpWuvyPCZqGdsXuclHqKidvf2yx_NFtc
- Training and generation code
- Converts the hopeful quotes listed above to a woman's speech
Performance RNN
- https://colab.research.google.com/drive/1W6yGQP3bJ-IfvSpLgr9ELJ68jr6SBgES
- Takes SGM-v2.01-Sal-Guit-Bass-V1.3.sf2 music as input to build the RNN
- Generates similar-sounding music samples, each 5 seconds long (the length can be adjusted)
Gansynth
- https://colab.research.google.com/drive/1W6yGQP3bJ-IfvSpLgr9ELJ68jr6SBgES
- Takes Frank_Mills_-_ Musicbox_Dancer.mid as input and interpolates the music
The resulting speech and music can be found in this repository. The speech.wav files are the text-to-speech output generated by DeepVoice3. The music.mp3 files are the music generated by Performance RNN based on the guitar-bass audio. The musicbox-gansynth.wav file is the Musicbox Dancer music interpolated by Gansynth. The 8 speech outputs, 8 music outputs, and 1 interpolated song are combined in the DaVinci Resolve video editor and uploaded to YouTube for viewing. The resulting speech and music are well below the quality expected from a composer or songwriter, but for something generated by a machine, they are quite impressive.
https://www.youtube.com/watch?v=48HugqVAv9o
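The project combines the clips in DaVinci Resolve. As an alternative for quick previews, the speech .wav outputs could be joined in code. The sketch below uses only the standard-library `wave` module and is not the method used for the video. The function `concat_wavs` and the file names in the example call are hypothetical. It assumes all inputs are uncompressed PCM WAV files with the same channel count, sample width, and sample rate, and it does not handle the .mp3 music files.

```python
import wave


def concat_wavs(paths, out_path):
    """Join PCM WAV files end to end (hypothetical helper); return total frames."""
    fmt = None
    chunks = []
    for path in paths:
        with wave.open(str(path), "rb") as reader:
            current = (
                reader.getnchannels(),
                reader.getsampwidth(),
                reader.getframerate(),
            )
            if fmt is None:
                fmt = current
            elif current != fmt:
                # Mixing formats would produce garbled audio, so stop early.
                raise ValueError(f"format mismatch in {path}: {current} != {fmt}")
            chunks.append(reader.readframes(reader.getnframes()))
    if fmt is None:
        raise ValueError("no input files given")

    channels, sample_width, frame_rate = fmt
    with wave.open(str(out_path), "wb") as writer:
        writer.setnchannels(channels)
        writer.setsampwidth(sample_width)
        writer.setframerate(frame_rate)
        for chunk in chunks:
            writer.writeframes(chunk)
    # Each frame holds one sample per channel.
    return sum(len(chunk) for chunk in chunks) // (channels * sample_width)


# Example call with assumed file names:
# concat_wavs(["speech1.wav", "speech2.wav"], "speech_all.wav")
```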
This implementation requires Google Colab, a hosted Jupyter notebook service. The notebooks run only on Colab even though they are in Python notebook format.
- Online-Convert: MIDI to MP3 Converter
- Audio-Joiner: MP3 Audio Joiner
- Bear Audio: MP3 to MIDI Converter
- Trim Midi File: Trim MIDI File