Generally AI Episode 2: AI-Generated Speech and Music

Jan 31, 2024

In this podcast episode, Roland and Anthony explore the world of AI-generated voices and music. The discussion begins with Stephen Hawking and the topic of artificially generated voices. They touch upon the applications of generated voices, the use of AI-generated celebrity voices, the ethical considerations surrounding consent, and the risks of misuse. Moving on to music, they discuss the generation of musical scores then conclude with a live demonstration of AI-generated music.

Key Takeaways

AI-generated voices could be great solutions to help podcast editors and people at risk of losing their ability to speak
AI-generated voices do present risks for misuse as well as legal and ethical questions regarding “cloning” the voices of celebrities
There are two broad strategies for generating music with AI: score generation and audio generation
LLMs have inspired methods for score generation, while image generation AI inspired techniques for audio generation
Real-time and adaptive generation of audio could be helpful for solo performers

Introduction

Roland Meertens: In 1985, Stephen Hawking lost his voice during a lifesaving operation. Most people remember him not only for his scientific theories, but also as a man who was wheelchair bound and spoke with a very artificial-sounding voice. The voice he used was generated using a very old hardware speech synthesizer made in 1986, but he said he kept it because he didn't find a voice he liked better and because he identified with it.

Welcome, everybody, to Generally AI, an InfoQ Podcast. This is episode number two, where today I, Roland Meertens, will be diving deep into a topic with Anthony Alford, and we're diving into the world of synthetic sounds. And we're going to discuss artificial music, synthetic music, and artificially generated voices. And maybe at the end the artificial voices can sing over the generated music.

So yes, Anthony, how did you like the introduction story today?

Anthony Alford: That's pretty interesting. Of course, he's definitely one of the greats of 20th century physics and a very tragic figure. So, it was nice that technology was able to help him out like that.

Roland Meertens: Yes, so when I was researching this, I thought it was actually quite an interesting challenge to help people with ALS, this muscle disease. And apparently in 1997, Gordon Moore, the co-founder of Intel, met Stephen Hawking at a conference, and he knew that he was talking using an AMD processor. And basically he offered him a real computer, and Stephen Hawking couldn't say no to that.

So apparently since then they replaced his computer every two years and at some point, his disease got so bad that he only managed one or two words per minute. So he sent a letter to Moore to ask for help, and apparently Intel got a team together to help him. So it's quite interesting to read through what they all tried. And apparently in the end he was using a SwiftKey keyboard, which I also love using.

Anthony Alford: Can you describe that, I'm not familiar with it?

Roland Meertens: Okay, so SwiftKey is an app which is currently owned by Microsoft. It used to be a small startup in London. And it is extremely good at predicting what words you want to type and what words you want to type next. It also adapts based on where you are typing on your screen for each letter, so it has a bit more of a probabilistic approach. So, yes, basically I always go wild and then it still manages to pick up my thoughts.

Anthony Alford: Interesting.

Roland Meertens: The last fun fact I have, by the way, is that the speech synthesizer was called CallText 5010 and it was actually based on the voice of Dennis Klatt, who made it. So this guy who... he was inventing multiple voices, among which the voice which would eventually become the voice Stephen Hawking was known for using, which he called Perfect Paul. And apparently Dennis Klatt who invented his voice died in 1988 after a long struggle with cancer, which also took away his voice. So I guess in the end he could use his own invention in a very sad way.

AI-Generated Speech [03:33]

Roland Meertens: Okay, so anyways, the topics today, my topic is artificially generated voice. And I first figured, why not dive deep into why do people want a generated voice? Why do you think that people want a generated voice?

Anthony Alford: Well, certainly Stephen Hawking is an example of one reason one might want that.

Roland Meertens: Yes, indeed. So you could be handicapped or you could have lost your own voice. This is actually a bit of a coincidence that Apple is now introducing the feature Personal Voice in the new iOS. So it is currently in beta, so if you want to try that, you can go ahead. You need to speak to your iPhone for about an hour to create your own personal voice. And it's kind of aimed at people who are at risk of losing their ability to speak. And you can use it to type in FaceTime and the phone app and all the communication apps.

Anthony Alford: So do they have you read a script or is it just you say whatever is on your mind?

Roland Meertens: I honestly don't know because I couldn't get into it as beta yet. I only discovered that this was an option today.

Anthony Alford: It's homework.

Roland Meertens: Yes. Next time. But, yes, so that could be an option. What other reasons do you think there are for generating voices?

Anthony Alford: Well, obviously, there are ones of questionable moral and legal basis: people who want to defraud others and fool them, which I'm sure we're going to talk about.

Roland Meertens: Well, yes, so there are indeed all kinds of malicious reasons like bank scams, social media impersonations. It could even be used for larger scale attacks such as someone calling your company saying that they are you and they quickly need to have the Wi-Fi password or some database access, I don't know. So basically be careful out there, yes. The other thing, the reason I wanted to get into it is that it's now really easy to re-record small parts of podcasts.

Anthony Alford: Oh, right, you're going to fix the podcast when I say something that is garbled, you can just go back and edit it by using my voice to speak text.

Roland Meertens: Yes, so, right now, for example, there's actually a police car coming by my house. I don't know if you can hear that.

Anthony Alford: I don't.

Roland Meertens: In that case, my new microphone, which I bought today, is working perfectly. But this is actually a perfect moment for me to go in and regenerate that part. So in the past I would have had to go in and re-record that part, and now I can simply say overdub or regenerate this part using this tool called Descript, which I have been using to edit videos for a couple of weeks.

Anthony Alford: Nice.

Roland Meertens: Yes. And also for example, if I say the wrong thing, I can retype what I actually wanted to say and then it can generate that for me. So it kind of fills in gaps which is really powerful. And, this is even easy to do without needing some kind of pre-trained model.

Anthony Alford: Yes, in fact I think I recently did InfoQ News about Meta's VoiceBox model, which essentially does this. And I think they open-sourced it.

Roland Meertens: Well, the headline of your article was probably something along the lines of, "Meta Opens VoiceBox or Releases VoiceBox, But I Tried to Use It and They Keep It a Secret." They don't do anything with it.

Anthony Alford: Oh, you have to request the model, don't you? You have to request access.

AI-Generated Roland [06:49]

Roland Meertens: They only gave it to a very limited number of people and I am not one of them. So, that sucks. Now, you're probably wondering, "Well, why are we still talking to each other? Can't we just both generate our voice and then send it to the other person and then we've got a complete podcast?"

Well, you can create models of your own voice, so I'm going to send you what my cloned voice sounds like. And as a heads-up, I recorded this with my old microphone, and since today I have a new microphone, so you are going to hear the difference. But let me know what you think of it.

Anthony Alford: Okay.

Roland Meertens: I sent it now in the chat. And just, I don't know, listen to it and comment on it.

Anthony Alford: Here we go. Okay.

Generated Roland Meertens: Hey listeners, this is Roland speaking. This voice was generated using the Descript tool and is put directly into this podcast. How many seconds do you think you need to hear the difference between my real voice and this generated voice? Would you also hear it if I did not tell you that this was artificially generated?

Anthony Alford: I have to tell you, this one sounds a lot better than the real you. I don't know.

Roland Meertens: Maybe I can actually replace my voice with a computer and just go through life like this.

Anthony Alford: That was really quite high quality. I was impressed.

Roland Meertens: Yes. I must say that... I don't know, how would you judge it? If you would get a voice message like that saying that I'm abducted and to pay for me, would you pay for me?

Anthony Alford: I don't know, can we send Venmo or Cash App? I don't know. But that would be a good question. I would want to hear maybe some stress in the voice if you were really in a stressful situation. It seemed very calm, right? So I don't know if you can make it stressful, but I definitely believed that was you. I could easily believe that that was a recording of you actually speaking.

Roland Meertens: Okay, and now the next question is, I was first thinking about generating the introduction to this podcast. Would you have kept listening after the introduction if you heard this?

Anthony Alford: Interesting. Maybe.

Roland Meertens: Okay. So in my opinion it sounds a tiny bit too artificial. You can still hear that it's artificially generated, but it's mostly the endings of the sentences, and the flow isn't really good, is what I felt.

Anthony Alford: I see.

Roland Meertens: Yes.

Anthony Alford: So, you bring up an interesting point. I was going to say with a lot of machine learning and AI, we measure quality by doing accuracy on a test set. That really doesn't work as well with these generative AI types of models. How do you automatically judge the quality of that?

Roland Meertens: I didn't really dive into measuring quality. I mean, for things like natural language generation there are multiple scores which can judge things, but those also seem to be relatively hackable and not always good. Actually, the best way you could do it here would be an Elo rank test where you show people two pieces of generated voice, ask them which one is better, and in that way you can build up a whole Elo score for every generator if you want.
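The pairwise listening test described here can be sketched in a few lines of Python. This is only an illustration of the standard Elo update, not something used in the episode: the generator names, the votes, the starting rating of 1000, and the K-factor of 32 are all assumptions.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, winner, loser, k=32.0):
    """Update ratings in place after one listener preferred `winner`."""
    expected_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - expected_win)  # a surprising win moves ratings more
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical voice generators and listener votes, for illustration only.
ratings = {"generator_a": 1000.0, "generator_b": 1000.0, "generator_c": 1000.0}
votes = [
    ("generator_a", "generator_b"),  # (preferred clip, other clip)
    ("generator_a", "generator_c"),
    ("generator_b", "generator_c"),
    ("generator_a", "generator_b"),
]
for winner, loser in votes:
    update_elo(ratings, winner, loser)

ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

Each vote only needs a listener to say which of two clips sounds better, which is an easier question to answer than assigning an absolute quality score.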

Anthony Alford: Right.

Roland Meertens: I know that they have this on Hugging Face for the generative language models, which I think is really cool. Now that you have heard this voice, a random fun fact which I want to tell you is that when I was about 15 years old, you know you had these, we called it TomTom, but it's a navigation aid which tells you, "Go left. Go right."

I discovered that you could upload your personalized voices onto that. So when I was about 15 I recorded my own voice for a navigation system so, in my parents car I would always be the person saying, "Turn left."

So, I had to record about 50 to 100 sentences, something, and that's how I got my own personal navigation assistance voice.

Anthony Alford: Did they know you were doing that before the first time they heard it?

Roland Meertens: I think so, but sometimes it's a surprise for people in the car who were like, "Oh, that's me." Also, of course, while I went through puberty my actual voice kept drifting away from the one on the navigation device.

Anthony Alford: Well that's great.

AI-Generated Celebrity Voices [10:54]

Roland Meertens: Cool. Anyways, so we already touched on the malicious parts, and I think that one video which struck me, which is already, I think, about a year old, is by a person called James Feech. He managed to hack the voice of David Attenborough, and David Attenborough is the voice of a lot of nature documentaries like Planet Earth. And he managed to hack it because one specific app asked for training data, which you are to generate by reading a script of Planet Earth.

So, he thought, "Okay, well, I have the actual data so I can just upload that." But what's good if you want to generate your own voice is that all these tools always ask you for explicit consent. So, you always have to read something like, "I, Roland Meertens, blah blah blah blah..." I'm not going to complete this for obvious reasons.

And first he tried to sneak together bits and pieces of cut-outs from the shows, which sounded terrible and didn't work. But then he hired an impressionist, who actually recorded the consent for him. And that actually worked. And I will send you a piece of the final result so you can listen to how good you think it is.

Anthony Alford: Well, let's check this out.

Generated David Attenborough: Herds that used to consist of fifty animals are now rarely found in groups of more than... What's the point? You're not even listening are you? You're just sitting at home thinking, "Oh, look at the funny orange thing. It has a long neck."

I'm so tired. Tired of talking. Tired of cameras. Tired of narrating a day in the life of a walrus. You enjoy this bit? Oh you like the stripey horses? But when it comes to the important bit about the planet dying, you bury your heads in the sand.

Anthony Alford: That's not bad. You know, I was just thinking during the pandemic, celebrities managed to figure out a way to get a revenue stream by... you could pay them $15, $20, $50, $100 whatever to record a video message for your friends. So maybe this guy could have gotten David Attenborough to read his own script. Although, the impressionist was probably cheaper.

Roland Meertens: Well, the thing is that in the consent you very clearly state that it is okay to use your voice to create an artificial model. So it is very clear if you're asking someone to read it, that you're doing this.

Anthony Alford: I see.

Roland Meertens: And at first I thought, I have a lot of your voice because this is the second episode. But I, of course, didn't want to find a way to get around the consent part. So, there's now no artificial voice for you. Sorry about that.

Anthony Alford: I'll have to train one so that we can outsource this podcast to the robots.

Roland Meertens: Yes. For that I still think that the fun of listening to a podcast is knowing that you and I are interacting. But, yes, we could have ChatGPT record multiple podcasts for us and see how many people want to listen to that.

Anthony Alford: Surely somebody's doing that already.

Roland Meertens: Yes. I think it's very low quality content. Anyways... you were mentioning that the celebrities are offering services like recording something, right? I tried to dive into the rabbit hole of generated-voice offerings for famous voices. And it turns out that there are multiple offerings and the price seems quite reasonable. It's only like 10 or so dollars a month. And you have like 3000 celebrities available to use, from David Attenborough to Barack Obama; everyone is available. I will send you the one I generated from David Attenborough so you can see how much you like this.

Generated David Attenborough: Look at those scummy punks simply generating the voice of someone who did not even ask for this to happen.

Anthony Alford: Very nice.

Roland Meertens: Yes, it actually sounds amazingly good. Yes, and now you're probably wondering: where is the consent? Consent is sexy. Not in this case, because I found a comment from the creator of the website I used to generate this, who basically... he was asked on Reddit what he thought of the consent part of these voices.

And he basically said, "Oh, we should go after the people who abuse them, who abuse this and not the service who offers this. And the cat is out of the bag anyways, and the US will probably not ban deep fakes anyways and if they do people from another country will probably start generating it." So, he didn't seem to have a lot of issue with his own behavior.

Anthony Alford: Well, of course not. But it does raise some interesting legal questions, you know? Your voice. Can you copyright how your voice sounds? Who owns that? Or is it something that can be owned? For example, you can copyright a likeness or an image. But, I don't think we've quite figured out the legal ramifications of all this.

Roland Meertens: Yes, I don't know what parts you can copyright and what parts you can't. Like you can copyright a song, right? The characteristics of the song, but what about people who are impersonating your voice? Is that already okay, or not?

Anthony Alford: Let's ask Weird Al.

Defense Against the Dark AIs [15:55]

Roland Meertens: Good point. Anyways, the last part for me is that I was then wondering how to protect yourself from your voice being stolen. And, I think this list is very short, because I didn't really find a way. And I think tip number one from me, if you're afraid of being impersonated online, is to limit publicly available recordings. So maybe don't make a podcast about AI.

Anthony Alford: Wow. Well, we're already breaking rule number one, so... what are the other rules?

Roland Meertens: Yes, well I think that the second thing is that if someone is calling you using my voice, you have to really think about, "Oh, is what's happening reasonable?" So, hereby I want to say to everybody listening to this podcast, I will never ask for gift cards. Please, don't buy gift cards if I ask for them. If I tell you to go to the store and pour your savings into gift cards, it was not me. Okay?

So I think that if you are hearing a voice, you should think about: is what's happening okay? And then, if you get a call and I am asking for gift cards, now is a good moment to verify the personal relationship with the person you're interacting with by asking questions only that person can know.

And actually, my mom did that when I got a new phone number and I sent her my new number. And she actually started asking me personal questions because she thought that maybe I was a hacker. So, my mom is already ready for this century.

Anthony Alford: I was going to say, that's great that your mom is that savvy.

Roland Meertens: Yes, that is really good. But, since this is a very common scam, I think we will see more of this in the future with AI generated voices, unfortunately. And the last step is that, if there is something very urgent, maybe try to meet in person or meet using a webcam. And I guess the last step is something we will have to come back to when, in the future, everybody can just generate an image of a face on the webcam whenever they want.

Anthony Alford: Yes, right.

Roland Meertens: So yes, those were my tips for protecting yourself.

Anthony Alford: Very good tips.

Roland Meertens: And that's the end of generating voices. Any questions?

Anthony Alford: No, that was really good. In a lot of ways it's scary, but it's also kind of cool. People are making some pretty funny videos with the celebrity impersonations. I think as long as it's used in good fun and for good, then it's a good thing. But, that's everything, isn't it?

Roland Meertens: Yes, as long as it is used for entertainment value and for people who are losing their voice, like Stephen Hawking or like the new thing in Apple. It's great. And as I said, I really like it for podcasts. Because I can actually now correct one or two words which I mis-said, misspelled, I don't know. So there are definitely good use cases but then there are definitely malicious use cases as well. So, I figured I would cover both sides of the coin here.

QCon London 2024 [18:52]

Hey, it's Roland Meertens here. I wanted to tell you about QCon London 2024. It is QCon's flagship international software development conference that takes place in the heart of London next April 8 to 10. I will be there, learning about senior practitioners' experiences and exploring their points of view on emerging trends and best practices across topics like software architecture, generative AI, platform engineering, observability, and secure software supply chains.

Discover what your peers have learned, explore the techniques they are using, and learn about the pitfalls to avoid. Learn more at qconlondon.com and we really hope to see you there. Please say hi to me when you are.

Synthesized Music [19:45]

Anthony Alford: At the risk of getting a copyright strike, I'm going to play some music that sort of combines synthesized music and synthesized voice. I don't know if you recognize this. We may edit it. Here we go. No. Here we go.

So that's a song by the German group Kraftwerk. I don't know if... check my pronunciation on that. But they were pretty famous for synthesizers plus they used a vocoder, or you know, a speech synthesizer, for a lot of the vocals. So, sort of a nice little combination of the two topics there.

So, what I wanted to start with, we're talking about synthesized voice, now we're going to start talking about synthesized music. If you think about how the speech synthesis works, text to speech, you have the text that you wanted to say, it's just typed out in a file maybe, or in a text box. And then the speech synthesis module says it.

Music is very similar, you know, we've literally had music synthesizers for a very long time. Let's listen to a famous one from the 80s here. See if I can pull this up. Can you hear that?

Roland Meertens: I can hear that.

Anthony Alford: Do you recognize what that is?

Roland Meertens: I don't recognize what it is.

Anthony Alford: Roland! Do you know what that is…Roland?

Roland Meertens: I actually went to the Roland store here in London twice today, because I needed a cable and wanted to check out their really cool gear. Which one is it? Which synthesizer is it?

Anthony Alford: That's the 808. That's the classic 1980s original drum machine used by a lot of hip hop acts in the 80s including Afrika Bambaataa, who also sampled the Kraftwerk song that we started out with. So, I'm just pulling together all kinds of threads.

So, anyway, we're very familiar with this idea of synthesized sounds replacing the real instruments. And they're quite good. And, by the way, almost all of these synthesizers also have a digital interface that you can use to automatically play them. It's called MIDI: Musical Instrument Digital Interface. And that was developed in the 80s, also, the Roland company was an early proponent of it.

So let me quickly share the screen maybe, and you can see a visual representation. So, here at the bottom with these horizontal gray lines. That's a representation of the MIDI file. The MIDI notes, right, that the synthesizer would play. And so, as you can see, it's a stream of notes.

And the nice thing about this, is you can play it with any instrument voice that you want, just like you can write text and have that text synthesizer use different voices. You could have it use your voice or you could have it use David Attenborough's voice. So I'm going to play some MIDI notes using one synthesizer.

I'm sure you heard that.

Roland Meertens: I did hear that.

Anthony Alford: With the click of a button I can change it to be... Let's go back…

Anyway, so that's the nice thing about a MIDI representation is, what you have... this is the score. Right? So the text that you write to be spoken by the TTS, this is the same thing as the score.

Roland Meertens: Maybe just for my clarification, so a MIDI file just says, "I have an X amount of audio tracks," and then for every audio track it just has the music score like the exact notes. Or does it have the frequencies?

Anthony Alford: Yes. It's the note in terms of C, D, E, you know, A, B, C, that kind of thing, in a scale.

Roland Meertens: Yes, like C2, C3, in terms of height.

Anthony Alford: Exactly right.

Roland Meertens: Okay.

AI-Generated Musical Scores [23:48]

Anthony Alford: Yes. And it actually consists of a note on, and a note off, and the velocity or volume that you hit it at. So, just as you might use ChatGPT to write your text for you and then go send that to TTS and then just have a complete robot Podcast, people have been trying to do the same thing with the score. You could have a generative AI model output that score, and then pass that to a synthesizer using MIDI.
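To make the note-on, note-off, and velocity description concrete, here is a minimal Python sketch of such events. It is a simplified illustration rather than a MIDI file parser: the `NoteEvent` class and the use of seconds for timing are assumptions (real MIDI files store delta times in ticks), and the octave labels assume the convention where note number 60 is written C4, which some manufacturers number differently. The frequency formula is standard equal temperament with A4 at 440 Hz, and it is applied by the synthesizer, not stored in the file.

```python
from dataclasses import dataclass

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


@dataclass
class NoteEvent:
    kind: str      # "note_on" or "note_off"
    note: int      # MIDI note number, 0-127
    velocity: int  # how hard the key was hit, 0-127
    time: float    # seconds from the start (simplified; real files use ticks)


def note_name(note):
    # Assumes the convention where MIDI note 60 is written C4.
    return f"{NOTE_NAMES[note % 12]}{note // 12 - 1}"


def note_frequency(note):
    # Equal temperament, with A4 (MIDI note 69) tuned to 440 Hz.
    return 440.0 * 2 ** ((note - 69) / 12)


# A two-note score: C4 for half a second, then G4 for half a second.
events = [
    NoteEvent("note_on", 60, 100, 0.0),
    NoteEvent("note_off", 60, 0, 0.5),
    NoteEvent("note_on", 67, 100, 0.5),
    NoteEvent("note_off", 67, 0, 1.0),
]
for event in events:
    if event.kind == "note_on":
        print(note_name(event.note), round(note_frequency(event.note), 2), "Hz")
```

Because the score only carries note numbers and timing, the same list of events can be sent to any instrument voice.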

And you can imagine score generation has been tried using many of the same techniques used for text generation in language models. They're very similar, right? It's a sequence of discrete things, just like a sentence is a sequence of words, and the score is a sequence of notes.

So, in the 1950s for example, people were using Markov models to generate simple melodies.
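As a rough idea of what a Markov-model melody generator looks like, here is a first-order sketch in Python. It is not a reconstruction of any historical system: the training melody (the opening of "Twinkle Twinkle Little Star"), the function names, and the fixed random seed are all choices made for illustration.

```python
import random
from collections import defaultdict


def train_markov(melody):
    """Record which note follows which in the training melody."""
    transitions = defaultdict(list)
    for current, following in zip(melody, melody[1:]):
        transitions[current].append(following)
    return transitions


def generate(transitions, start, length, rng):
    """Walk the chain, sampling each next note from the observed followers."""
    melody = [start]
    while len(melody) < length:
        followers = transitions.get(melody[-1])
        if not followers:  # dead end: this note was never followed by anything
            break
        melody.append(rng.choice(followers))
    return melody


# Opening of "Twinkle Twinkle Little Star" as note names.
training = ["C", "C", "G", "G", "A", "A", "G", "F", "F", "E", "E", "D", "D", "C"]
chain = train_markov(training)
generated = generate(chain, "C", 16, random.Random(0))
print(generated)
```

Every step depends only on the previous note, which is why melodies from a model like this tend to wander without any larger structure.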

Roland Meertens: Yes. Did these melodies sound good?

Anthony Alford: Probably they're not going to be very complex, right? It's just a simple... just like I did with “Twinkle Twinkle Little Star.” You might get something like that. But, I mentioned ChatGPT, of course, OpenAI had the same idea. Both OpenAI and Google used transformer language models to generate sequences of MIDI notes.

Google called theirs Music Transformer and OpenAI called theirs MuseNet. And, these things work pretty decently. They could make fairly long sequences of notes even with multiple instruments. MuseNet, they encoded tokens that you could use to indicate the composer, style, and the instrumentation.

MuseNet actually had a demo that was a little more advanced, but right now it's very simple, where you can do things like play Lady Gaga's “Poker Face” in the style of Chopin. So the way these things work is they're completion models, just like we talked about with language models. You give it a sequence of input and it predicts the next things to come. These work the same way. You give it a sequence of notes and it predicts the next notes that it should play.
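The completion idea can be sketched as a loop that repeatedly asks a model for the next token and appends it. Everything specific in this Python sketch is hypothetical: the token format (`<composer:...>`, `note:60`) is invented for illustration and is not MuseNet's actual vocabulary, and `toy_predict_next` is a stand-in for a trained transformer that simply walks up a C major scale.

```python
def encode_prompt(composer, instrument, notes):
    """Build a token sequence: conditioning tokens first, then the notes."""
    tokens = [f"<composer:{composer}>", f"<instrument:{instrument}>"]
    tokens += [f"note:{n}" for n in notes]
    return tokens


def complete(tokens, predict_next, max_new_tokens):
    """Autoregressive loop: predict one token, append it, repeat."""
    tokens = list(tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens


def toy_predict_next(tokens):
    """Stand-in for a trained model: continues upward through a C major scale."""
    scale = [60, 62, 64, 65, 67, 69, 71, 72]
    notes = [int(t.split(":")[1]) for t in tokens if t.startswith("note:")]
    if not notes or notes[-1] not in scale:
        return f"note:{scale[0]}"
    position = scale.index(notes[-1])
    if position + 1 == len(scale):
        return "<end>"
    return f"note:{scale[position + 1]}"


prompt = encode_prompt("chopin", "piano", [60, 62])
result = complete(prompt, toy_predict_next, max_new_tokens=10)
print(result)
```

In a real system the conditioning tokens would steer a learned model toward a composer's style or an instrumentation; here they are carried along only to show where they sit in the sequence.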

Roland Meertens: Is it then predicting the tokens for a MIDI file?

Anthony Alford: Yes, exactly.

Roland Meertens: And they then set a synthesizer behind it? Or is it predicting something else?

Anthony Alford: Just the MIDI notes themselves. And so, that's got some pluses and minuses. So, if I'm a composer or a music producer that's probably the solution that I want. To generate MIDI because I have a whole set of tools to deal with that and then I can tweak it with those tools. I can have it performed with whatever virtual instruments I might like. Whether I want it to be virtual strings or if I want it to be brass, guitar, etc. So, we can think of that as similar to how, if I'm a writer, I might want ChatGPT to help me compose some text.

Roland Meertens: But if we think about ChatGPT it is trained on a massive corpus of text, because we humans are generating text the entire day. But, we humans are not generating MIDI files the entire day, right?

Anthony Alford: That's right.

Roland Meertens: Where do they get the training data from?

Anthony Alford: So, they actually did get a set of training data that's based on, let's see... so OpenAI has a whole webpage on it, but yes, they created a training dataset from many different sources. They essentially had collections of MIDI files.

Roland Meertens: Okay.

Anthony Alford: They trained it on that.

Roland Meertens: So they actually do have a large collection of training data as MIDI files.

Anthony Alford: Yes. Exactly.

Roland Meertens: Okay, interesting.

AI-Generated Text-To-Sound [27:13]

Anthony Alford: Yes. Now, as a composer or music producer, I might like something that can generate MIDI. But, if I'm not a composer, maybe if I'm just someone who wants to listen to interesting new music that's similar to something that I know I already like, I may not want MIDI. Because maybe I don't have a good device to play MIDI or I don't care about MIDI.

And so, the other generative AI, besides ChatGPT and those kinds of models that we are using a lot today are image generation models. And most of those are based on diffusion. And so, you'll not be surprised to find out that people tried that too.

Google did that. They did a model called Noise2Music where it takes audio noise and then progressively de-noises it. And you can guide those with a text prompt.

Roland Meertens: I really enjoy the name Noise2Music. That is so absolutely perfect that the person who gave that name should just get promoted. No matter what.

Anthony Alford: I mean it's the opposite of the old person’s "This music is just noise." But, anyway, now the other thing, you remember how we talked about the MIDI, I showed you on the screen. It's 2D, right? You've got pitch on the Y axis and time on the X axis. It's an image. And so you may wonder, "Well could we generate images that represent sound?"

And the answer is yes. The common way to represent sound with an image is something called a spectrogram where the Y axis is the frequency that's calculated using a Fourier Transform. And the X axis is time.
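A spectrogram of this kind can be computed by slicing the audio into short frames and taking a Fourier transform of each one. The Python sketch below uses only the standard library and a plain discrete Fourier transform so that it runs anywhere. It is a simplification: it applies no window function, it is far slower than an FFT library, and the sample rate, frame size, and test tone are arbitrary values chosen so the tone lands exactly on one frequency bin.

```python
import cmath
import math


def spectrogram(samples, frame_size, hop):
    """Return a list of frames; each frame lists magnitudes per frequency bin."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        magnitudes = []
        for k in range(frame_size // 2 + 1):  # bins from 0 Hz up to Nyquist
            total = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(frame)
            )
            magnitudes.append(abs(total))
        frames.append(magnitudes)
    return frames


sample_rate = 8000  # samples per second
frame_size = 64     # each bin spans sample_rate / frame_size = 125 Hz
tone_hz = 1000      # test tone, chosen to fall exactly on a bin
signal = [math.sin(2 * math.pi * tone_hz * n / sample_rate) for n in range(512)]

spec = spectrogram(signal, frame_size, hop=32)
loudest_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), "frames;", "peak at", loudest_bin * sample_rate / frame_size, "Hz")
```

Here `spec[t][k]` holds the magnitude at time frame `t` and frequency bin `k`, so drawing it with `t` on the X axis and `k` on the Y axis gives the image described above.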

So, some smart guys, they took a Stable Diffusion model, so that's the image generator, and they fine-tuned it on images of spectrograms of sound, and they also had text descriptions of that sound. So, now you can give it a prompt and say, "Play me a Reggae song." And it generates spectrograms.

And then you just decode that into sound. Because it's based on Stable Diffusion, it does the stuff that Stable Diffusion does. It supports image to image, so you can actually give it a sound clip and tell the AI to modify it, and say, "Take this sound clip but add piano."

And, there's a demo site, which we'll get to in a minute, it doesn't let you do that…it does have some pre-configured seeds instead of letting you upload. But, it will infinitely generate music based on your prompt.

Roland Meertens: Okay, but so, is the AI basically thinking, "Oh, you want this Mozart song but then a dubstep version. Oh, in dubstep there's always like the [wubwubwub]." That must have sounded terrible to whoever was listening.

Anthony Alford: Right, so they've taken Mozart, they've taken dubstep, and they've created spectrograms of those and used those images to fine-tune the Stable Diffusion.

Roland Meertens: And did someone then tag a lot of these short audio clips with, "In this song, you hear dubstep in the style of Skrillex. This is a song by Mozart, played on the piano and the violin."

Anthony Alford: Yes, I think the way it works is just like with Stable Diffusion and CLIP where you train a thing that maps sound and text to the same embedding space. And then you can combine them by doing vector math.

Roland Meertens: Okay.

Anthony Alford: Yes, so that was last year. This year, now, on InfoQ we've recently covered two new techniques for generating music down at the audio level. So right, so these image ones, the diffusion ones they're also at the audio level but... these are similar to the auto-regressive language models, but instead of outputting text tokens they output audio tokens. You may have come across this idea in your speech synthesis research.

But, of course, again the two players here are Google and Meta. So, I've got three demos. I'm going to use the Riffusion, which is the Stable Diffusion variant. We've got Meta's MusicGen which you can find on Hugging Face. And then Google has theirs called MusicLM and you have to apply for access to their AI Test Kitchen which can take a few days, but I've done that.

So, Roland, you have the job of giving me a prompt. And we're going to try this on all three of these, and we'll see which one we like the best.

Roland’s Riff [31:17]

Roland Meertens: All right. So, what if I want a blues riff in the style of Mozart.

Anthony Alford: Okay, so I'm going to type, "Blues riff in the style of Mozart." I'm going to kick it off on the Meta one, because that takes up to a minute to generate.

Roland Meertens: Let's ask it specifically for a 12 bar blues, to see if it can sustain... we can simply count to 12 and see if it gets it right.

Anthony Alford: Oh, see, I already... I got to start over. Well, we can try this. I'm going to say, "Twelve bar blues riff in the style of Mozart." So I'm going to kick that off on Meta and I'm going to kick that off on Google. Now, there's a couple of differences. The Meta one actually does the sound to sound transfer, but it only generates clips that are 12 seconds long.

Roland Meertens: Then it has one bar per second.

Anthony Alford: Okay, Google says I can't generate audio for that. That's interesting.

Roland Meertens: Oh. Did we hit…security?

Anthony Alford: I don't know. Let me try this.

Roland Meertens: Did we hit a safety switch?

Anthony Alford: Maybe I need to re-log in. That's weird. All right, well give me a different one.

Roland Meertens: Okay, how about a new song by Lady Gaga, but then I want the song to be a polka song. New polka song by Lady Gaga.

Anthony Alford: Okay. "A new polka song by Lady Gaga." It can't generate audio for that! That's weird.

Roland Meertens: I would say that even Mozart isn't copyrighted anymore.

Anthony Alford: I don't know. So, it says, "Try something like soothing instrumental music to help me focus. Tunes to listen to on a bumpy train ride."

Roland Meertens: Okay, well, I want some Lo-fi blues. The Lo-fi 12 bar blues.

Anthony Alford: "Lo-fi 12 bar blues." Let's see what that gives us. Okay, that's crunching. So, we'll also put that into Meta.

Roland Meertens: Maybe you just can't do artists anymore. Maybe they started protecting...

Anthony Alford: Can you hear? So it gives you like a 20 second clip and it actually gives you two and you can do that thing that you talked about, the Elo ranking where you can compare them. Here's the second one.

Which one's better?

Roland Meertens: I think I like the first one better.

Anthony Alford: All right, so we'll click the first one.

Roland Meertens: The first one was a bit more bluesy, the second one was a bit more Lo-fi.

Anthony Alford: So here's Meta's version.

That's all we get.

Roland Meertens: I think Meta's version is definitely better than the first two we heard.

Anthony Alford: So here comes Riffusion’s. And let's see what this sounds like. Here we go.

I'm going to stop it because it will, in theory, generate indefinitely.

Roland Meertens: Was it just me or did that sound pretty terrible, by the way?

Anthony Alford: Yes, it did. I noticed that all of these, there's very much like a digitized kind of noise, sound to it. Like a hiss. A very high-end kind of hiss that I've noticed, but... Yes I think I agree with you that the Meta one was pretty neat.

Roland Meertens: Yes, I would actually listen to that.

Anthony Alford: For ten seconds. Or fifteen seconds, at least.

Roland Meertens: No definitely, like, if the Meta computer was playing at a bar? I would not mind.

Anthony Alford: And so, that's the interesting thing here, right? So, is that what we're going to get? Instead of looking for ambient music, could we maybe see that in coffee shops or something? I don't know.

Roland Meertens: Yes. I genuinely mean it when I say that if the Meta song we just heard was playing in a bar, I would absolutely feel right at home and listen to it. I would have no issues with it whatsoever.

Anthony Alford: Would you tip the robot?

Roland Meertens: You are living in America where people have to tip the robots.

Anthony Alford: You got to tip everybody.

Roland Meertens: You tip everybody, but that's why you employ humans because they work for free. Unlike the robots.

Anthony Alford: A lot of musicians work just for tips.

Roland Meertens: Yes, I must say that lately I have been going a couple of times to those open mic jam sessions where you as a musician bring an instrument and you can join. And somehow I still have to pay to enter the bar. So, somehow as a musician I'm earning a negative income.

Anthony Alford: Well...

Roland Meertens: Yes, that's the life of a musician.

Anthony Alford: That's why we're in the software business, Roland.

Roland Meertens: Yes, indeed. Indeed.

Anthony Alford: We won't be replaced by robots.

Roland Meertens: One question I had is, "How fast are these tools generating things?" So far they seem to be quite fast, right?

Anthony Alford: Yes, so the Google one was pretty quick. As soon as I put it in, it generated pretty quickly. The one from Meta, it's hosted on Hugging Face and so they're taking their sweet time. This Riffusion one it's just kind of streaming. And I think it's fairly easy to do, I imagine a lot of it's done in the browser.

Roland Meertens: So these Stable Diffusion models, they don't have any grammar rules like music theory, what notes are in what key? They just generate everything from nothing?

Anthony Alford: Yes, right.

Robot Jam Band [36:55]

Roland Meertens: It's very good. Yes, also, one thing I was wondering about is the adoption rate of this music. Is there a market for it, and what can artists creatively do with it? For example, the drum machine you started with, the Roland drum machine: can drum machines already adapt themselves based on what you're playing?

Anthony Alford: Hmm. I don't know. A couple of these generation things that I looked at are trying to match other tracks, right? So, create a drum track that goes along with the rest of this song. Those are definitely an application for those kinds of things. And people are selling plugins to digital audio workstations.

So, for example, if you've got Pro Tools or Ableton Live, you can get a plugin that will generate drum tracks or generate melodies or something like that, that plugs right into your existing software and you can just paste it into your song. So that's definitely a use case. People are already selling those to actual musicians.

Roland Meertens: Yes because it seems to me that especially for street performers, if you only have one or two instruments, it makes so much sense that you can basically have an entire band in a GPU workstation next to you.

Anthony Alford: That would be a cool idea. You should start working on that.

Roland Meertens: I absolutely will as soon as I have more time available for this. One thing I also discovered, talking about GarageBand earlier and talking about generating drums: I noticed that in GarageBand, if you want a drum beat, you can actually select multiple kinds of personalities for your drum beats. And you can say which parts of the drum kit they are allowed to use. And you can set how much presence and how much aggression there should be. And then it just generates these beats the whole time. Which is really cool.

But the one thing I'm really missing at the moment, the one thing which isn't there, is that the loop has to be closed. So just as a practical example, last week I bought a loop station. I don't know if you know what that is, but you can basically record a small part of an instrument, in my case, guitar. And then you hit a button to say, "This has to be a loop." And it keeps looping and looping. And then you can create multiple loops. Ed Sheeran especially is very well known for creating loops and loops and loops, and that's why his music always sounds a bit the same in the background, because it's just looping.

And you can have drum kits which augment these loops. There's no drum kit which dynamically adapts to what you're playing or dynamically adapts to what is already in the loops. I really just want to press a button and say, "Now you go solo for a bit." Or like, "Please make a beat for what I just played." Or even adaptive, like predicting, you always know as a musician when the end of the song is, why can't the drum beat slow down then and do like a nice final riff. Like, when are we closing the loop? How does this work?

Anthony Alford: Open research problems. Somebody can get a PhD for that.

Roland Meertens: Yes, or if Ed Sheeran is listening please make this. We need this. Do you have anything else?

Anthony Alford: That is all my content. So, I'm slowing down, or fading out here toward the end of the song.

Conclusion [39:59]

Roland Meertens: Perfect. The last question then, to end on a nice final riff is, what did we learn today? What are the take home messages for the listener?

Anthony Alford: I learned not to create Podcasts to keep people from stealing my voice. You know, it's crazy it's like a trope in movies, “Don't take my picture, I don't want you to steal my soul.” But, that's almost where we are.

Roland Meertens: Yes, and especially because there's no way around it. It's not like, "I could start talking with a very high voice and then nobody can steal it anymore," because you always have an identity for your voice. So, this is not possible. The one thing I think I learned is that the Meta music you showed, especially, sounded way better than I expected the state of the art of music to be. And I'm also very impressed that OpenAI managed to scrape a lot of MIDI files and just treat it as a normal “predict the next token” problem. Which I'm very happy to have learned about.

Anthony Alford: Yes.

Roland Meertens: All right, thank you everybody for listening. This was Generally AI, an InfoQ Podcast. If you liked this Podcast when you were listening to it, please give us a like, five stars, whatever you can do on your favorite Podcast platform. I always say that the way I discover Podcasts is through friends telling me about them. So, if you enjoyed listening to this, if you managed to make it to the end, please tell your friends about this Podcast, tell them that you enjoyed listening to it. And, yes, that's everything for this week.

Thank you, listeners, for joining in. I hope to see you next time.

Anthony Alford: I guess the robot musician is not going to wind up with a bigger bar tab than he makes from the gig.

Roland Meertens: Yes, I don't know how much bars are currently paying as royalties for musicians. I know that for Stable Diffusion there was a huge uproar because artists felt, and are still feeling, that their art is being stolen and used to train the network. And I can really imagine that if you yourself, as an artist, worked for years on generating a plugin for your digital audio workstation, and someone with one push of a button manages to steal your digital audio workstation setup, basically, it's terrible.

Anthony Alford: Yes, I mean that's one of the downsides of software is the marginal reproduction cost is zero and it's so easy to copy everything.

Roland Meertens: Yes. Indeed. And scalability is quite big. The other fun fact, by the way, is that I actually have a record. I bought a record player a while ago, and I found in a second hand shop a record with the sounds of the Moog synthesizer from when it was just new. So they kind of demo what a synthesizer is capable of to try to convince you to buy it. And Moog still sells a lot of synthesizers. So, yes, that's an interesting fun fact.

Anthony Alford: Yes, well they're based in North Carolina. And they have a festival every year down the road here in Durham, North Carolina.

Roland Meertens: Oh, really? How's the festival?

Anthony Alford: I've never been. It's expensive to go.

Roland Meertens: Okay. I mean, do you get a free synthesizer if you go?

Anthony Alford: I don't think so.

Roland Meertens: But then, actually that's pretty cool.

Anthony Alford: Yes.

Roland Meertens: Cool.

About the Authors

Anthony Alford


Roland Meertens

