InfoQ Homepage Podcasts Generally AI Episode 5: Making Waves

Generally AI Episode 5: Making Waves

Feb 22, 2024

In this episode of the Generally AI podcast, hosts Roland and Anthony explore the fascinating world of audio waves by discussing the history of Fourier analysis and how the Fast Fourier Transform (FFT) revolutionized signal processing with its efficiency. They then shift to song recognition apps like Shazam and their underlying algorithms: breaking songs into snippets and using techniques such as the FFT or neural networks.

Key Takeaways

Joseph Fourier's work on differential equations led to the development of the Fourier Transform
The FFT's efficiency in transforming time-domain signals into the frequency domain has made it a cornerstone in signal processing.
Periodicity assumptions and sampling strategy can affect the accuracy of the FFT output
Music recognition involves breaking songs into snippets and using FFTs and neural networks to match and identify these snippets efficiently.
The challenges in recognizing music include sampled songs, transposing keys, and distinguishing between songs with similar elements .

Subscribe on:

Transcript

Introduction

Roland Meertens: Welcome everybody to episode five of Generally AI, an InfoQ podcast. The theme of the episode today is Making Waves, where we are going to talk about audio waves. The fun fact I wanted to start with is related to the loudness wars. Have you ever heard of the loudness wars, Anthony?

Anthony Alford: I don't believe I have. Tell me about it.

Roland Meertens: Okay, so I found this on Wikipedia. Apparently there is a trend of increasing audio levels in recorded music because every music producer wants their music to be really out there. So apparently, every producer started finding ways to increase the maximum loudness of their music. And it really started with CDs, which apparently have a maximum loudness.

But if you tweak the compression algorithm, you can make your music even louder. But it does mean that the fidelity of your music kind of dies out. So the loudness wars on Wikipedia talks about this phenomenon where music is getting louder and louder every year. But that, of course, also means that because of your compression, your details die out. And this is why Ian Sheppard organized the first Dynamic Range Day on April 21st to raise awareness that dynamic music sounds better.

Anthony Alford: Yes, I have heard about people complaining about compression and how all music now has a lot of compression applied. And we actually probably could do a whole talk on that. But Yes, I'm not surprised it can cause distortion because that's essentially what it is.

Roland Meertens: Yes, apparently some people hear it. One thing which I found interesting, that Daft Punk's album, Random Access Memories, sounds great. An engineer decided to use less compression on the project, and it won five Grammy Awards, including Best Engineered Album. And in the early 2020s, most major streaming services began normalizing audio by default. So I guess the problem of picking the loudest sound is gone, and now we just got to get our fidelity back.

Anthony Alford: So Yes, audio compression, I have heard of that, but I did not know that everyone was vying for the loudest song.

Roland Meertens: What I found interesting when I was looking at this on Wikipedia was that you can very clearly see that the old records, like the vinyl, if you look at the audio wave, you can very clearly see that they're just trying to be as loud as possible the whole time instead of just being loud at specific parts of the song. So I guess this is why we should all switch back to vinyl.

Anthony Alford: So my fun fact is a similar theme. We're making a podcast, so we are recording our voices. And if you're like me, when you hear your own voice played back from a recording, you don't like it. I don't like the way my voice sounds. So why is that?

Well, the fun fact is, and it's obvious when you think about it, when you hear your own voice, it partly is conducted through bone into your ear, so it sounds different, filtered through your skull essentially, versus hearing it coming from a recording. So it doesn't sound like your voice at all.

Roland Meertens: I guess, normally, I at least hear my voice a lot lower in my head than it actually sounds-

Anthony Alford: Exactly. So conducting it through the bones in your ears makes it sound deeper.

Roland Meertens: Yes, I definitely sound better in my head than I sound on any recording.

Anthony Alford: Don't we all?

Roland Meertens: Don't we all! Welcome to Generally AI, an InfoQ podcast. As I said, the theme today is making waves. We are going to talk about audio waves. And what I don't really know today, Anthony, is who should go first? You or me?

Anthony Alford: Let me go first.

Roland Meertens: Yes, please.

Anthony Alford: If you don't mind.

Roland Meertens: Yes, please.

Waterloo [04:33]

Anthony Alford: We're going to have a history lesson. So we never know, of course, when a given listener is going to hear this podcast, but we are recording it right now, it's late 2023, November of 2023 actually. And Ridley Scott's new film, Napoleon, is in theaters. And I did go to see it.

Now, regardless of what you think of the film or the subject matter, I feel like that's a good frame to start this discussion. Because this is a podcast, we're trying to tell a complex story in a limited amount of time. Just like Ridley Scott, I'm going to gloss over a few things, and maybe not get the details completely correctly, but I'm not going to deliberately make things up.

So imagine you're a French mathematician living in the Napoleonic era, the turn of the 19th century. You're trying to solve differential equations. By the way, you probably knew this, Napoleon was supposedly quite good at math. There's a theorem attributed to him, although nowadays the consensus is he probably did not come up with it. But there is a-

Roland Meertens: Well, what is his theorem?

Anthony Alford: It's called the Napoleonic Theorem. Sorry, Napoleon's Theorem. It has to do with equilateral triangles constructed on the sides of a triangle.

Roland Meertens: Oh, that sounds completely different than what I expected.

Anthony Alford: Yes. Well, of course, he was an artillery officer, so geometry and triangles were his thing. But the consensus is he got the idea from someone else. Anyway, I digress. So imagine you're a mathematician in this era. You're trying to solve a differential equation, in this case, a partial differential equation for heat transfer in a solid body, which is driven by a heat source or input.

Roland Meertens: As one does.

Anthony Alford: As one does. I guess they had a lot of free time. So there's no known general solution for this, but you do know two things. You know that a solution exists for sources that can be represented by a sine wave. And you also know a general property of differential equations of these types is that they're linear. Which means that if you have an input function that can be represented as the sum of two or more functions, the solution is the sum of the solution for each of those. That's called superposition.

So anyway, this is the situation in the year 1807, where Napoleon is doing his Napoleon things. The mathematician was actually a real person named Joseph Fourier. So he had this insight. He knew that there was a solution for sine waves or sinusoids. He also knew that the equation had the property of superposition. He realized if he could just figure out how to take other inputs and decompose them into a combination of sinusoids, then the problem was solved. And that's what he did. And that is the origin story of Fourier analysis.

Roland Meertens: Did he only invent the fact that these sinusoids could be combined, or did he also figure out all the ways to calculate them?

What’s the Frequency? [07:48]

Anthony Alford: Well, he figured out the way to calculate, given an arbitrary function, decompose it into a combination of sinusoids.

Roland Meertens: Okay, so it works for any arbitrary function?

Anthony Alford: Well, subject to certain constraints, but yes. So there's actually variations on this. There's one way that works for periodic functions. There's one that works for arbitrary functions. And I'm not sure what he actually came up with originally, but as with everything, he had their first idea and people who've built on it have continued to credit him with it.

So the mathematics is a little beyond the scope of what we want to do here. Really, we need a whiteboard and PowerPoints or charts. And it's calculus.

Roland Meertens: Yes, I am pretty sure that we will lose a lot of listeners if we start doing calculus lessons through a podcast.

Anthony Alford: Exactly. And we'll leave that to the professionals at Khan Academy. So let me do hit a few high points though. So in general, a sine wave or a sinusoid has a few parameters: the amplitude or height of the wave; the frequency or how fast it's going up and down; and the phase shift. Now that last one, if you visualize the wave plotted on an X-Y chart, you can think of the phase shift as moving it left or right along the X axis.

Roland Meertens: Yes.

Anthony Alford: Well, I said X axis, but in fact it's usually a function of time. So it would really be the t axis. And I didn't even mention time at all, did I? But yes, it's also a function of time. But when you do Fourier analysis, time gets taken completely out of the equation and everything becomes just a function of frequency.

And this operation, changing from a function of time to a function of frequency is called the Fourier Transform. It transforms a function from what we call the time domain into the frequency domain. Remember, the function's domain is its possible input value. So the input of the function is no longer time. It's frequency.

Roland Meertens: So the input becomes frequency and then the output becomes amplitudes?

Anthony Alford: Yes. Actually amplitude and phase. The value of the function is a combination of amplitude and phase. And so-

Roland Meertens: What is phase then in this case?

Anthony Alford: Remember, with the sine wave, phase is shifting it left or right along the axis, right?

Roland Meertens: Yes. Okay.

Anthony Alford: So how do you combine two numbers to get a single value? Well, you use complex numbers. So maybe you remember the polar form of a complex number. If you thought calculus was going to lose us listeners, wait till we get into complex numbers.

Roland Meertens: Maybe I do remember it, but maybe not. And I'm not going to admit it.

Anthony Alford: But anyway, with the complex number, the amplitude becomes the length. Remember a complex number is like an arrow. It has a length and a direction it's pointing. So the length is the amplitude and the direction it points is the phase shift.

It turns out you can actually represent these using exponentials or the number e, number 2.71828…, and you raise that to a complex exponent. So what you get out of a Fourier transform is a function whose input is frequency, and the output is a complex number. It's e to the...some complex exponent.

Roland Meertens: Yes.

Anthony Alford: Okay. So the Fourier Transform can take an arbitrary function of time and transform it into a function of frequency. What that means is it's a linear composition of these exponentials. And because exponentials are solutions to a lot of differential equations, and because those differential equations have the property of superposition, that's why the Fourier Transform is so useful. You can think of it as turning calculus into algebra. So maybe we won't lose as many people because we're now talking about algebra instead of calculus.

Roland Meertens: I'm sure a podcast about algebra is going to be a massive banger on Spotify.

Anthony Alford: I'm sure we're going to get a lot of streams. Oh, fun fact, you get a half a penny per stream on Spotify.

Roland Meertens: Oh, I didn't know that.

Anthony Alford: That's what I've heard. Anyway, so why is this useful? It turns calculus into algebra. So now you can solve those differential equations just using algebraic manipulation instead of having to do actual calculus. And there's another operation called the convolution, which is fairly complex, that just turns into multiplication. So you multiply your input function times some other function, and that's a convolution.

You could think of convolution as when a signal goes through some system like a filter, that system is performing convolution. So in other words, Fourier transform lets us do signal processing by using multiplication and division or algebra.

Roland Meertens: Yes, on the entire frequency domain at once rather than on the individual wave?

Anthony Alford: Exactly. So Yes, again, you can look at any particular frequency that you want because they're linearly independent. So that's actually a key assumption with a lot of these systems, is that they're linear. That means that whatever happens at a certain frequency doesn't affect things happening at other frequencies.

It turns out, of course, that's never true in reality. But for most engineering, you could pretend like certain things are true as long as you're okay with a little bit of an error there, like Ridley Scott shooting at the pyramids.

Roland Meertens: I didn't know what's happening with Ridley Scott shooting at the pyramids.

Anthony Alford: Well, in the movie, Napoleon's Army shoots cannons at the pyramids, which did not actually happen.

Roland Meertens: Okay.

Anthony Alford: It looked cool to Ridley Scott.

Roland Meertens: Okay, I should really go watch this movie.

Digital Love [13:42]

Anthony Alford: I have to have an anchor, a frame to talk about things. So we've discussed the Fourier transform somewhat at a high level, and from the time of Napoleon. Now we're going to fast-forward 150 years or so to the middle of the 20th century or the time of Oppenheimer. So now we're in the age of computers and digital signal processes. And one key difference in this world compared to the time of Napoleon is sampling.

Roland Meertens: What is sampling?

Anthony Alford: So with a time domain signal, typically it's continuous, right? The sine wave goes up and down with no breaks.

Roland Meertens: Yes.

Anthony Alford: To get that into a computer, you have to take a snapshot of those values every so often, and that's called sampling. And the key thing to remember, the rate at which you sample is called the sampling frequency. And that's very important.

But another thing, now that we're in the computer world, we're going to do a variation of the Fourier transform, called the Discrete Fourier Transform, or DFT. First you sample your signal. You collect some number of samples, N. And remember, we're doing this at the sampling frequency. The DFT gives you a transform of those N samples from a time domain function into the frequency domain.

Roland Meertens: And what sampling frequency should I think about? What sampling frequency are we currently recording at?

Anthony Alford: Well, that's a great question, and I will bring that up in my list of caveats at the end.

Roland Meertens: Okay.

Anthony Alford: So what you get from the DFT, again, you get a set of complex numbers, and those represent the amplitude and phase of sinusoids. And there are sinusoids that are fractions of that sampling frequency. The first one, the zero component, is actually zero frequency. So it's a constant. In the electrical engineering world, we call it DC or direct current.

The second component, index one, is the sampling frequency divided by N, the number of samples, and then you multiply...for each index i after that, you multiply it by i. So the first one is 1 over N, or the sampling frequency over N. The second one is two times the sampling frequency over N. And so on.

Now it turns out there's a mathematical formula for calculating the DFT. And if you just did that calculation for each component, you're looking at a so-called Big O complexity, which I think probably most of our podcast listeners will know this Big O complexity for an algorithm. It's N-squared.

Roland Meertens: Okay. But at the moment, I am sampling at every possible, right, like 1, 2, 3, 4? Or am I already doing the quadratic numbers?

Anthony Alford: What do you mean?

Roland Meertens: You say, I'm currently sampling at one times sampling frequency divided by N, two times sampling frequency divided by N.

Anthony Alford: That's the components. So when you get an output, you get N complex numbers. And each one of them represents a sinusoid or a frequency from zero to the sampling frequency divided by N. So the last number you get out represents something of the sampling frequency. The middle number you get out is half the sampling frequency.

Roland Meertens: Yes. Got it. Okay. And so then the N for the O(N^2), what is N in this case? Is it sampling frequency?

Anthony Alford: No, it's how many samples you collect.

Roland Meertens: Yes, okay. And if that is squared, then that's quite a lot. Okay, that will grow.

Fast Forward [17:27]

Anthony Alford: Right. Yes, you never want to see N squared if you can help it. So that was the problem. And in 1965, two mathematicians solved this problem. Their names were James Cooley and John Tukey, and they worked out a more efficient implementation that has complexity of N times the log of N, which is much better.

Roland Meertens: Way better.

Anthony Alford: Yes. It turns out probably Gauss also came up with this a couple of hundred years ago, but didn't publish it. Again, as Gauss did: it turned out that Gauss invented everything, he just never published it. But anyway, by the way, this guy, John Tukey, is also credited with introducing the box plot in a textbook that he wrote in the seventies. And he's sometimes called the father of data science.

Roland Meertens: Oh, interesting. I never thought that someone was the first person to do a box plot.

Anthony Alford: Yes, it's unclear if he's the first, but he introduced it into a textbook and made it really popular. So this implementation of the DFT, or Discrete Fourier Transform, this implementation was called the Fast Fourier Transform, or FFT. And it's probably one of the most influential and widely used signal processing algorithms today.

Roland Meertens: Nice.

Anthony Alford: Yes. A couple of interesting things about the DFT in general and the FFT specifically. So you asked, what sampling frequency should you use?

Roland Meertens: Yes.

Anthony Alford: Well, it turns out that you need to use a frequency that's twice as high as the highest component frequency in your signal. So if you look at human speech, I don't even remember the numbers. But human speech, let's say the highest frequency is like 20 kilohertz, 20,000 cycles per second. You need to go at least double that. And if you're familiar with basic digital audio, like compact discs and things like that, it's 44.1 kilohertz.

Roland Meertens: So perfect human voice plus a tiny bit for a piccolo.

Anthony Alford: I made up the number for human voice because it's about half. I picked half of that.

Roland Meertens: Okay.

Anthony Alford: I don't actually know what it is. But human hearing range does have a limit for most people, and it's probably about that.

Roland Meertens: Okay. So what you're saying is that my vinyl collection is worthless after all, because I could have just listened to Spotify.

Anthony Alford: Well, I'm not going to get into that debate. People do experiment sometimes with gear that audiophiles say is better. And with blind tests, a lot of times they can't tell the difference. So we did talk about how nothing is ever linear. So probably there's distortion and artifacts introduced from sampling.

Roland Meertens: Yes.

Anthony Alford: I don't know.

Roland Meertens: So is it then better to have a lot of bass in your songs or better to have a lot of high fidelity?

Anthony Alford: Probably the best thing is to have a good set of speakers, actually.

Roland Meertens: Okay.

Anthony Alford: But again, I digress. Okay. So sampling frequency needs to be twice as high as whatever the highest frequency in your signal is. So the other thing is, the DFT assumes that your signal is periodic. This means that it assumes that it repeats itself at some frequency. So again, I think you're going to talk about music.

We know a note on a piano like concert A is 440 hertz. That is the fundamental frequency. So that note, the signal of that repeats 440 times per second. So the DFT assumes all its input is like that, and you've just given it one period. If your signal's not actually periodic, the mathematics of the DFT assumes that you have applied a so-called window function.

And the output of the DFT is actually going to be a combination of the DFT of your real signal with the convolution of the window function's DFT applied. So it's actually kind of a mess. So long story short, the spectrum you get out is not quite exactly the real spectrum of the signal. It's going to be-

Roland Meertens: Because you never know how much it repeats.

Anthony Alford: Yes, it doesn't repeat, and you've pretended that it does. You've basically lied to the DFT telling it it's pretending. So the signal, the spectrum is going to be, it's called smeared. You can actually apply other window functions besides just the chop. The rectangular window is what you do if you don't do anything.

You can apply other window functions to try to get a little better response out of that. Again, we're probably going to lose some people, so I won't go too deep into that.

So that's the DFT in general. For the FFT specifically, there's a really weird result. I mentioned that the output of the DFT is a list of N numbers. Just like you had an input of N numbers, you have a list of N frequency components, spectrum components, and they go from zero to N minus 1, and those are fractions of the sampling frequency.

With the FFT, the indices of these components are a bit reversed. So what that means is that they're not in numeric order. Instead, you take the binary representation of the index and reverse the bits. So let me illustrate.

Roland Meertens: Okay.

Anthony Alford: Imagine that N is four. So we have the index as 0, 1, 2, 3. Index zero is still zero. But the next index, one, in binary, it's 01.

Roland Meertens: 01. Yes.

Anthony Alford: You actually reverse that to 10. Now that's actually index two. So those two, the second and third component, are actually swapped.

Roland Meertens: Okay. And why do you do that?

Anthony Alford: That's the mathematics of the transform. That's how it's more efficient. That's the trade-off. Instead of being N-squared, it's N*log(N), but the output is a bit reversed.

Roland Meertens: It's like black magic with binary numbers.

Anthony Alford: So the reason you can do N*log(N) is you reuse some of the intermediate steps.

Roland Meertens: Yes. Okay.

Anthony Alford: So if you draw a diagram of it, it looks like this butterfly thing. And so things get swapped around. It's weird, but it gives you the answer in N*log(N).

Roland Meertens: Nice.

Anthony Alford: But the output is essentially scrambled. So one interesting consequence of this is that back in the 1990s, I actually did a lot of this in my time in academia, there were dedicated hardware devices. Basically something like a GPU, but it was a digital signal processing hardware, dedicated hardware. It had a bit reversed addressing mode. So you could transparently read the data out of memory in the correct order.

Roland Meertens: Interesting.

Anthony Alford: Yes. So back then, again, this is mid '90s, it was people doing this for serious scientific and other instrumentation applications. They would try really hard to get the fastest, biggest FFT they could. And so just like with GPUs, there was dedicated hardware that could run these FFTs really fast.

Roland Meertens: I actually think that for a lot of software applications, you can transform them to signal processing problems. And then you can use these onboard signal processors, which very little software engineers are ever doing. But I think that once you manage it, you're absolutely an optimization ninja.

Anthony Alford: Yes. So again, I tried to throw the caveat out there that it's going to gloss over things, and maybe not get everything completely right. The reason is, I would be extremely embarrassed of any mistakes in this because my undergraduate and graduate education in electrical engineering, a good chunk of it was devoted to Fourier analysis and signal processing. So if I made mistakes, it's because time has erased some of that from my memory.

Roland Meertens: And we all know that when you are talking about time, you need to have as much details as your sampling frequency, right?

Anthony Alford: Exactly. Yes, we didn't even get into that. What happens, if your sampling frequency is not high enough, you get what's called aliasing. So things that should look like high frequency show up as low frequencies in the output of the DFT.

Roland Meertens: Yes. Interesting. Interesting.

Anthony Alford: So that's my material.

Fourier Fun Facts [26:17]

Roland Meertens: I will add one fun fact to round off the fact that you started talking about Napoleon. Do you know where Jean-Baptiste Joseph Fourier is buried?

Anthony Alford: I do not.

Roland Meertens: Okay. So with everything in this podcast, I'm only doing research by actually going to the grave of the people who invented things. So he is at the Cimetière du Père-Lachaise in Paris.

Anthony Alford: Interesting.

Roland Meertens: And I actually went to his grave in 2014 when I was visiting or living in Paris. And I was trying to find a photo of me with the grave just for the show notes, but I couldn't find it. I could only find a photo I took of someone else who was in this graveyard, which is Jim Morrison. So I guess we can see that my interest is more with old records than with people or old mathematicians. But I did remember visiting the grave, so I did remember that he was buried there.

Anthony Alford: That's great. Hey, Jim Morrison kind of fits in there. Actually, Fourier went with Napoleon to Egypt.

Roland Meertens: Yes, actually, wasn't Fourier some kind of leader or governor of Cairo or something?

Anthony Alford: I don't know. He got left behind.

Roland Meertens: I only know that his gravestone looks a bit Egyptian or something. It's very weird.

Anthony Alford: Could be. Maybe they brought it back with him. Who knows? So I completely forgot to add that to my notes, that when Napoleon went to Egypt, he actually did not shoot the cannonballs at the pyramids. But he did in fact bring quite a few scientists with him to do some research there. And one of them was Fourier.

Roland Meertens: I think I have a photo of his grave in front of me, and there is some kind of Egyptian thing on the top. And actually, his face looks more like what Voldemort looks like in Harry Potter. But we should just say that they took the essential components of his face and they left out the details.

Anthony Alford: That's perhaps quite accurate.

Roland Meertens: Which is the weirdest joke I could ever make about Joseph Fourier.

QCon London [28:45]

Roland Meertens: Hey, it's Roland Mertens here. I wanted to tell you about QCon London 2024. It is QCon's flagship international software development conference that takes place in the heart of London next April, 8 to 10.

I will be there learning about senior practitioners' experiences and exploring their points of view on the emerging trends and best practices across topics like software architecture, generative AI, platform engineering, observability and secure software supply chains. Discover what your peers have learned, explore the techniques they're using and learn about all the pitfalls to avoid. Learn more at qconlondon.com. And we really hope to see you there. Please say hi to me when you are.

Name That Tune [29:31]

Roland Meertens: The reason I was always excited about visiting or having visited the grave of Joseph Fourier is that I knew you could use this Fourier theory to recognize sound, like recognize songs. And I've always kind of consistently been giving this as a fun fact to people without really thinking of how does it actually work? So what I wanted to do with you is basically explain what I learned in the last couple of weeks of how I thought music recognition works and how I was wrong. Does that sound good?

Anthony Alford: Sounds great. That's the best way to learn.

Roland Meertens: Yes, because I thought, okay, well, Fourier theory can basically boil any signal down to its individual component. So surely you just take a song and then take the entire song, calculate your individual song components, and then you can use it to reconstruct a song.

But let's be honest, after what you just told me, that is going to sound absolutely terrible because most songs are not a very repeating signal or they will be very boring songs. So people have to break it up. And I also think that, actually, this is different than how humans recognize sounds.

Humans are surprisingly good at recognizing sounds, and I figured I would do this with you. And also, by preventing any kind of copyright strikes by me just humming the sound for you, and then you can recognize it. So if I say (singing).

Anthony Alford: That's “Smoke on the Water.”

Roland Meertens: Yes. And if I do (singing).

Anthony Alford: “Seven Nation Army.”

Roland Meertens: So you can already---

Anthony Alford: Oh, I'm good at this.

Roland Meertens: I took things which are very good for bass players to recognize. Those went well, right? And you can very quickly recognize them. But if I, for example, say (singing).

Anthony Alford: Well, that's a good question. Wait, there's an extra note in “Ice, Ice, Baby.” So I didn't catch if it's “Under Pressure" or “Ice, Ice, Baby.”

Roland Meertens: Yes.

Anthony Alford: There's an extra note. There actually is an extra note in “Ice, Ice, Baby.”

Roland Meertens: So this is indeed interesting because Ice Ice Baby and Under Pressure have very similar things.

Anthony Alford: Oh, it's the same. He sampled it. It's just there's an extra note.

Roland Meertens: I didn't even know there was an extra note. But for example, if I say (singing).

Anthony Alford: That's “Beat It.”

Roland Meertens: Yes. But if I would say (singing).

Anthony Alford: Oh, transposed. We still know.

Roland Meertens: Yes, it's still “Beat It,” right?

Anthony Alford: Yes.

Roland Meertens: I at least didn't find the correct key for all these samples, and you still recognized them, which means that somehow humans are good at recognizing sounds even if the key is wrong. And the fact that you recognized, for example, Beat It, you recognized it very quickly. What I think is interesting is that “Beat It” by Michael Jackson is number seven in the top list of most recognizable songs of all time.

Anthony Alford: What's number one? “Another One Bites the Dust?”

Roland Meertens: No, not even that. So I only found this list and it says Beat It is recognized in 2.8 seconds by most people. And number one seems to be Spice Girls' “Wannabe,” with 2.29 Seconds. “Yes. I'll tell you what I want, what I really, really want. So tell me what you want, what you really, really want.”

Here is where I thought it started becoming weird, because number two was Lou Bega with “Mambo No. 5,” in 2.48 seconds.

Anthony Alford: Is that that playing the song or humming it?

Roland Meertens: Well, the thing is that it starts with, "Ladies and gentlemen, this is Mambo No. 5."

Anthony Alford: Well, Yes, obviously. “What's the name of that song?”

Roland Meertens: Yes. So what I propose to you is that one day we just go to the studio and make a song called Most Recognizable Song, and we just start a song by, "Most Recognizable Song," and then we beat the records.

Anthony Alford: I have a teenage daughter, and a few years ago I found out that One Direction actually has a song called “Best Song Ever.”

Roland Meertens: Oh, smart.

Anthony Alford: But wait, it sounds exactly like “Baba O'Riley,” By The Who.

Roland Meertens: Oh, it is a pretty good song.

Anthony Alford: She played it for me. We were listening to Spotify and she played it for me. I said, "Go look up ‘Baba O'Riley,’ By The Who." And she's like, "Oh, this is the same song."

Roland Meertens: Also, you were talking before, that the song “Ice, Ice, Baby” is sampling. There are way more songs nowadays which are just using samples of existing songs, which makes sound recognition harder because, of course, they're going to compose into the same spectrum.

Anthony Alford: True.

Listen to the Music [34:56]

Roland Meertens: What I also found interesting is that, especially for us in Western music, it's very often about the progression, but sometimes you can also recognize songs just based on the instruments. Anyways, just to go back to what does this have to do with Fourier? As I said, I initially thought, oh, you can probably represent the entire song as individual sound waves.

And I never tried this, but I figured it would probably just sound very bad. And I also think that if you would try to recognize an entire song based on just the sound waves, it would probably also not work and probably be a mess. But there are a lot of apps nowadays which are good at this, even despite noise in the background, and some apps you can even hum a song to and it'll recognize it.

Anthony Alford: Oh, nice.

Roland Meertens: And I think the app, which most people are using is Shazam. So let me start with Shazam fun facts. When do you think Shazam started?

Anthony Alford:I know it's been around a while. So I don't know, 2005, 2008?

Roland Meertens: Yes, so you probably think it's from 2008 because that's when it became an app. But Shazam has been around since the year 2000, when it used to be a phone number.

Anthony Alford: Oh, interesting. Really?

Roland Meertens: When you were at the bar, you could use your mobile phone to call the service. You would just hold your phone in the air for 30 seconds and then it would send you a text afterwards with what the song is.

Anthony Alford: I can't imagine that worked very well from a bar. That must have been really challenging.

Roland Meertens: Yes, well, I mean, you have a mobile phone, you can recognize songs, which I think is insane. They're located in London, by the way. And they actually wrote a paper in 2003 where they were describing how they were able to recognize 2 million songs, which is already a lot. It was purchased by Apple in 2018 for $400 million.

Anthony Alford: Wow.

Roland Meertens: There are other services. So the others I found is Hum to Search from Google. So Google supports that. And apparently, on some Android phones, you can have a “now playing” function, which sounds insane. I unfortunately don't have one of those phones. But the entire time your phone is listening to, is there music in the background? And if there's music, it tells you what music is playing.

Anthony Alford: Well, we already knew they were listening to us, but I guess-

Roland Meertens: Here's the insane thing, they are actually doing this on device.

Anthony Alford: That's amazing.

Roland Meertens: So they have a really interesting paper describing how they do this on device because you need to do this on device, fast, while keeping your battery power in mind. And you have to have this entire music database.

Anthony Alford: On device?

Roland Meertens: On device. Yes.

Anthony Alford: That's bonkers.

Roland Meertens: Yes, it is. Because they have a way to get all these songs down to as little bites as possible for recognition.

Anthony Alford: Send me the link to that paper.

Roland Meertens: I will send you the link to that paper. But let me first start giving you some secrets to this music recognition because maybe then you can implement it yourself.

Anthony Alford: Sure.

Roland Meertens: And I think the first thing I already talked about is, instead of taking the entire song at once, you split the music into parts, into snippets which are individually recognized.

Anthony Alford: Yes.

Anthony Alford: So for each snippet you try to get a match for individual songs. So for each snippet you have basically a list from what song it could be. And then already guessing the song, if you have a couple of snippets which you're not very sure of yet, you try to match those snippets to those songs as well.

Anthony Alford: I see.

Roland Meertens: So you start with a couple of snippets you are relatively sure of, and then you try to fill in the remaining snippets to see if they can actually fit. It's basically starting with a puzzle. If you have a puzzle and you know what the pattern is probably going to be, you start with very clear pieces, and then you can try to fill in the rest. And if that works, I mean now the puzzle analogy breaks down because if that works, you can actually find the correct box.

Anthony Alford: Well, I mean you could think about it as a sequence matching in a way.

Roland Meertens: Yes, indeed. Indeed. You are matching a sequence. The second thing I was initially thinking of is, okay, well you just take the waves and then you take this image basically, the spectrogram, and then you match that entire thing.

So whenever videos or explanations would say, oh, you just take a fingerprint, I would be a bit annoyed because I was like, oh, but you got a spectrogram. Isn't that the entire fingerprint? No, probably not, because that thing is massive. So people take representations or hashes of the spectrogram.

And I found multiple approaches which worked over the years. I think the Shazam paper talks about if you, okay, I'm now trying to make you visualize this, but if you have the spectrogram and you see all those waves and hotspots going around over your time periods, you can, instead of having all these details in all the sound waves, you can get the most dominant frequency at each time period.

So for example, every hundred milliseconds or every 10 milliseconds, you take the most dominant frequency at that time point. That gives you a lot of dots, and then you can start doing constellation matching to find a song with that constellation.

Anthony Alford: Interesting.

Pitch Perfect [40:46]

Roland Meertens: Yes. And I think one thing which is especially cool here is that this is important if the frequency can change. Like when I am humming songs to you, I don't know exactly if I am on key for the songs I'm playing. So in this case, the fact that you are doing constellation matching solves that problem.

The other thing I found, by the way, is that you as a musician probably know that if I go one octave up, I am doubling the frequency of my sound. So if I would sing (singing), that doubles the frequency of the sound wave my mouth is producing (singing). Fun fact. But we as humans perceive that as the same note. So maybe I was humming a low A and then later I'm humming a high A. But we perceive this as the same note.

When you sing “Happy Birthday,” and suddenly you have to say, "Happy birthday, dear Anthony." Sometimes you will automatically in your brain just go lower instead of higher because you can't hit a high note. That's when your brain's deciding to just take the same note but then lower instead of higher. And our brain interprets it as the same note. So I also saw that some papers had a way of encoding frequencies which are changing by an octave basically in the same way to account for this problem.

Anthony Alford: Interesting. So another thing. Again, you probably know this, but sometimes we have a hard time determining the exact pitch, A versus A-flat. But we can detect intervals. We can distinguish intervals correctly. That's how I can pick the same melody, even though you're humming it at a different key.

Roland Meertens: Yes. So what you're describing is how most humans are interpreting songs, which is by relative pitch.

Anthony Alford: Yes.

Roland Meertens: Do you know how many people have absolute pitch?

Anthony Alford: Not very many. My son says that Charlie Puth does, but I don't know beyond him.

Roland Meertens: Yes. So apparently 1 in 10,000 people have absolute pitch, so they know exactly if they hear a sound basically what frequency it is on, what note it is.

Anthony Alford: Yes, it's amazing.

Roland Meertens: Yes. Apparently these people are sometimes having a harder time learning music or recognizing melodies because it is not exactly the same melody.

Anthony Alford: I can believe that.

Roland Meertens: And this is what I tell myself whenever I'm having issues playing music, is that maybe there's a chance I have absolute pitch.

Anthony Alford: Oh, is that it? I thought you were going to say, that's the upside. The fact that I'm terrible at singing, that's a good thing because then I can appreciate good stuff other people make.

Roland Meertens: Yes, I always tell myself if I'm having an issue recognizing or playing music, I think, oh, maybe I secretly have absolute pitch. I just didn't notice it yet.

Anthony Alford: It's possible. It's possible.

Roland Meertens: Yes. I don't think so. I think I'm just very bad at music. So the thing for this matching the snippets and calculating the fingerprints, nowadays, people use neural networks. This has also been taken over by neural networks. And the loss function you can use here is basically, you say that two snippets are matching if they are just like a few hundred milliseconds apart.

So you can take a tiny piece of a song, take the same piece of a song, or take the same song, but then a hundred milliseconds shift it, and you tell the neural network, these are actually the same snippets. You should recognize it as the same thing. And that gives you a fingerprint. That gives you this idea, this hash, which you need to recognize songs.

And that brings me to the last problem which I addressed. And that is that some songs sound a lot like other songs, and some songs sound very unique. But you still want to have a function to recognize the song as fast as possible despite these differences. So issues you could have here are, for example, songs with loads of samples, maybe some generic house songs, which all just start with a random beat and you just don't know what's happening.

And one paper I found, which I found super interesting, is a paper which did a density based distance function. So basically they changed the threshold for picking songs based on how many songs there are in the cluster where you are searching.

Anthony Alford: Oh, interesting.

Roland Meertens: So if you would start with the (singing), it would probably pick up, oh, it is in this cluster of songs. But because there are so many songs which have this bassline, I'm not picking the exact song yet until I am really, really sure and I gathered enough samples, I gave it a long enough string of patterns which match.

Anthony Alford: Nice. That makes sense.

Roland Meertens:

Yes.

Anthony Alford: Here's a random thought. We often hear of lawsuits about "This song is actually a copy of some other song." So we talked about Ice Ice Baby. But maybe you've heard, this happened to Led Zeppelin with Stairway to Heaven. It happened with, there's a pop song that was considered a rip-off of a Marvin Gaye song.

Anyway, has anybody ever thought of using one of these song recognition algorithms to basically figure out, well, they're actually not. Yes, they are very similar. Or no, they're not actually really that similar? That would take the subjectiveness out of it.

Roland Meertens: Yes. It still seems to be really about objectiveness, unfortunately, because it's really also about how many notes do you have in a row? And also, what is the chance that you already heard a song seems to play into it.

Anthony Alford: “Blurred Lines.” “Blurred Lines,” that was the song.

Roland Meertens: “Blurred Lines?”

Anthony Alford:By Robin Thicke. But it was supposedly a rip-off of a Marvin Gaye song.

Roland Meertens: Oh, interesting.

Mr Roboto [47:14]

Anthony Alford: Anyway, I have another cool FFT story related to absolute pitch. I alluded to doing this sort of work in graduate school. Perhaps you've heard of a musical instrument called the theremin.

Roland Meertens: Yes.

Anthony Alford: If you've ever watched a 1950 sci-Fi movie that goes whoo, whoo, whoo, whoo, that's probably a theremin.

Roland Meertens: Yes.

Anthony Alford: And it consists of a simple oscillator. It's controlled by the proximity of your hand to an antenna. You move your hand closer, the pitch goes up, whoo. Move it away, it goes down, whoo. When I was working in the robotics lab, a good friend of mine, his name, by the way, is Jason Barile, and he runs the Thereminworld website. Thereminworld.com.

Roland Meertens: If they are looking to sponsor a podcast, our podcast would be a great one.

Anthony Alford: He started this website in 1994 or '95, I forget.

Roland Meertens: Oh, interesting.

Anthony Alford: So anyway, he was a theremin enthusiast. And so he said, "Wouldn't it be cool if we could make the robot play the theremin?" And I said, "That would be extremely cool." I said, "Here's what we do. So we plug the theremin into the computer, so the audio goes into the computer. It's sampled. We do an FFT. The FFT gives us the frequency component.

And it's very simple waveform. Basically you get big spike, little spike, smaller spikes. So we just find that biggest spike. That's the frequency of the note that we're playing. And we know if we move the robot arm closer, it goes up. We move it away, it goes down. It's servoing. So we have a target frequency, which we had a MIDI keyboard. You press a button and it tells you here's the frequency we want.

And you just servo. We used a proportional controller at first, but you can do PID control to make it a little smoother. But literally just move closer, move further away. And we wrote a paper on that and that was one of my favorite papers to do in grad school.

Roland Meertens: Nice. That's quite fun. We once tried to build a theremin ourselves. But then instead of using an antenna, we used, how do you call this, infrared sensors for distance.

Anthony Alford: So proximity.

Roland Meertens: Proximity. But then you could use the entire room for your theremin. So we did that at some art festival. We put that down. People had fun with it.

Anthony Alford: Oh, that's cool.

Roland Meertens: I don't know if I'm proud of it or not, but we had a lot of fun building it.

Anthony Alford: Well, the same with this paper. Even more fun was, we had the basic servo, and one of the other researchers said, "Well, you need some vibrato." And we literally just added sinusoids to the joints to make the hand go "wobble wobble." And it was comical. People would get a kick out of that because the arm would just go like that. And it sounded pretty cool actually.

Roland Meertens: I know that sometimes at IROS, this conference for robotics, they have a music and robots competition. So sometimes people are indeed handing in things like this. I was really impressed with some of the performances some people put down with their robots.

Anthony Alford: Where is the next IROS?

Roland Meertens: I don't know.

Anthony Alford: I have to check that out. May have to show up for that. That'd be cool. We'll do a live podcast from there. Didn't you do a podcast from there?

Roland Meertens: Yes, I recorded a podcast at IROS. Or no, ICRA this year.

Anthony Alford: Oh, it was ICRA. Yes. Right, right.

Roland Meertens: And the next one seems to be in Abu Dhabi, October 14 till 18 2024.

Anthony Alford: It'll be a little out of my way.

Roland Meertens: It is quite far.

Anthony Alford: You get around a lot.

Roland Meertens: But naturally, Anthony, you and I are going to meet each other at QCon in London, right?

Anthony Alford: Is it London? I don't know if I can get to London. I was thinking I might go to New York. You'll be in London.

Roland Meertens: You would totally have to go to QCon London, right?

Anthony Alford: If I have a chance, I'll go.

Roland Meertens: If any listener of this podcast was listening to this right now, you would totally have to go to QCon London, April 8 to 10 in London.

Anthony Alford: It's on the calendar.

Conclusion [51:27]

Roland Meertens: It's on the calendar. Anyways, let's go to words of wisdom. What did we learn in this podcast? And did you learn anything yourself recently?

Anthony Alford: Well, I mean, obviously I learned a lot about Napoleon and Fourier. I had forgotten that they worked together. I was not aware that Google had this app that could do the song identification on device. That's pretty remarkable.

Roland Meertens: In the background continuously, that's the insane thing, with all the database stored on your phone. It's really cool.

Anthony Alford: I don't know how they do it.

Roland Meertens: I completely forgot that these Fourier transformations are entirely based on e to the power of imaginary numbers. That's really cool.

Anthony Alford: For those who have not learned this, Euler, mathematician, he had a formula, e to the ix is equal to cosine of x plus i times the sine of x. That's how that works. So imaginary exponents or complex exponents on e give you sinusoids.

Roland Meertens: Yes. Really interesting. Anything you learned in your personal life last week?

Anthony Alford: Oh, no. Gosh. It was Thanksgiving here in the United States, so it should be a time for reflection, but unfortunately probably didn't do as much of that as I should have.

Roland Meertens: Okay. I already started preparing for the next episode, which we'll tell you later about. But I went to Greenwich, which is an area here of London, and they have the prime meridian line, the prime meridian of the world, which is longitude zero. So every place on the earth is measured in terms of its angle east or west from this specific line.

And since 1884, the prime meridian has served as a reference point for Greenwich Mean Time. And the thing I was very confused about is that I came to this place and I saw all these signs for, oh, this is this line we used in this year. Oh, this is the line we used in that year.

Anthony Alford: Really?

Roland Meertens: Yes, because I was very confused. I found five different lines of where 0.0 is, longitude zero. I asked them. It turns out that these were all the places where people set up their telescopes. So really, the line where someone sets up the telescope to do all the measurements from was the line. So there is actually a telescope five meters away from this prime meridian, from where they do all the measurements.

Anthony Alford: I thought there was an observatory there. I guess I didn't consider that they might be moving the telescope.

Roland Meertens: Well, every time someone makes a better telescope, they put it a couple of meters to the west. So that's why the line always moves a couple of meters to the west.

Anthony Alford: This is going to mess things up quite a bit.

Roland Meertens: Well, Yes. I can tell you next time that it's even more confusing because if you check your GPS at the Prime Meridian, it is not at longitude zero, but 112.5 meters away from it.

Anthony Alford: Interesting.

Roland Meertens: But if people want to know more about that.

Anthony Alford: I can't wait.

Roland Meertens: We can tell them in the next episode, which will be called Lost and Found.

Anthony Alford: Awesome.

Roland Meertens: Anyways, if people enjoyed this podcast, the way I get all my podcast information is from friends telling me about it. So why not make someone happy and make their day by recommending them this podcast? You can also make our day by liking this podcast. Thank you, Anthony Alford, for joining.

Anthony Alford: Thanks, Roland. This is fun, as always.

Roland Meertens: Yes. And thank you, listeners, for listening to this episode. This was Generally AI, an InfoQ podcast.

Anthony Alford: I wonder if Google recognizes this tune.

About the Authors

Anthony Alford

Show moreShow less

Roland Meertens

Show moreShow less

More about our podcasts

You can keep up-to-date with the podcasts via our RSS Feed, and they are available via SoundCloud, Apple Podcasts, Spotify, Overcast and YouTube. From this page you also have access to our recorded show notes. They all have clickable links that will take you directly to that part of the audio.