Facebook Research Develops AI System for Music Source Separation

Facebook Research recently released Demucs, a new deep-learning-powered system for music source separation. In human evaluations of overall sound quality after separation, Demucs outperforms previously reported results.

Music source separation is one application of a heavily researched process called blind source separation. This process involves separating a set of source signals from a set of mixed signals without the aid of meta-information. For music, the individual components could include vocals or individual instrument tracks. The source separation domain first received major attention when air traffic controllers started having trouble hearing the intermixed voices of multiple pilots over a single loudspeaker. This led British scientist Colin Cherry to name the effect the "cocktail party problem" in 1953.

Building on existing research in the source separation domain, research scientists began using AI to separate sounds in music in the early 2000s. Today, spectrograms generated by the short-time Fourier transform (STFT) are the centerpiece of state-of-the-art music source separation. These systems produce a mask on the magnitude spectra for each frame and each source, and the output audio is generated by running an inverse STFT on the masked spectrograms while reusing the phase of the input mixture. Systems built around analyzing spectrograms excel at source separation for sounds like mezzo-piano or legato violin, because those sounds have a consistent frequency and ring. However, these systems struggle to isolate percussive sounds: the residual noise created by percussive instruments spans a broader range of frequencies, and when it overlaps with multiple other instruments, information is lost that a masking operation cannot recover. Demucs tackles this issue by learning more about the individual sources in the context of the whole track, rather than analyzing a single structure like a spectrogram.
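The masking pipeline described above can be sketched in a few lines of standard-library Python. This is a toy illustration under stated assumptions, not the method of any particular system: the frames are non-overlapping and unwindowed (a real STFT uses overlapping windowed frames), the masks are "oracle" ratio masks computed from the known sources (a real system predicts them with a neural network), and all function names are hypothetical.

```python
import cmath
import math

FRAME = 16  # samples per frame; tiny on purpose (assumption for this sketch)

def dft(frame):
    """Naive discrete Fourier transform of one frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    """Naive inverse DFT, keeping the real part of each sample."""
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]

def stft(signal):
    """Simplified STFT: non-overlapping rectangular frames."""
    return [dft(signal[i:i + FRAME]) for i in range(0, len(signal) - FRAME + 1, FRAME)]

def istft(frames):
    """Inverse of the simplified STFT: concatenate the inverted frames."""
    out = []
    for spec in frames:
        out.extend(idft(spec))
    return out

def ratio_masks(sources):
    """Oracle masks: each source's share of the summed magnitudes per bin."""
    specs = [stft(s) for s in sources]
    masks = []
    for spec in specs:
        mask = []
        for f, frame in enumerate(spec):
            row = []
            for k, x in enumerate(frame):
                total = sum(abs(other[f][k]) for other in specs)
                row.append(abs(x) / total if total > 1e-9 else 0.0)
            mask.append(row)
        masks.append(mask)
    return masks

def separate(mixture, masks):
    """Mask the mixture magnitudes, reuse the mixture phase, then invert."""
    mix_spec = stft(mixture)
    estimates = []
    for mask in masks:
        masked = [[cmath.rect(m * abs(x), cmath.phase(x)) for m, x in zip(mrow, xrow)]
                  for mrow, xrow in zip(mask, mix_spec)]
        estimates.append(istft(masked))
    return estimates

# Two synthetic "instruments" that occupy different frequency bins.
N = 4 * FRAME
bass = [math.sin(2 * math.pi * 1 * t / FRAME) for t in range(N)]
lead = [0.5 * math.sin(2 * math.pi * 5 * t / FRAME) for t in range(N)]
mixture = [b + l for b, l in zip(bass, lead)]

estimates = separate(mixture, ratio_masks([bass, lead]))
worst = max(abs(e - b) for e, b in zip(estimates[0], bass))
print("largest error on the bass estimate:", worst)
```

The two synthetic sources sit in separate frequency bins, so the masks are effectively binary and the mixture phase is the correct phase for each source. When several sources share the same bins, as happens with percussion, neither property holds and a mask alone cannot restore the original signals.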

Demucs is a deep learning model that directly operates on the raw input waveform and generates a waveform for each source. The U-net architecture uses a convolutional encoder and a decoder based on wide transposed convolutions with large strides. Waveform models work in a similar way to common computer vision models, as they both use neural networks to detect basic patterns before inferring higher-level patterns.
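To get a feel for what "large strides" means for a waveform, the following sketch traces how the sequence length shrinks through a stack of unpadded strided convolutions and grows back through the matching transposed convolutions. The length formulas are the standard ones for unpadded 1-D convolutions; the kernel size, stride, and depth are illustrative assumptions rather than figures taken from this article, and the function names are hypothetical.

```python
# Illustrative hyperparameters (assumptions, not quoted from the article).
KERNEL, STRIDE, DEPTH = 8, 4, 6

def conv_out_len(length, kernel, stride):
    """Output length of an unpadded 1-D strided convolution."""
    return (length - kernel) // stride + 1

def transposed_conv_out_len(length, kernel, stride):
    """Output length of an unpadded 1-D transposed convolution."""
    return (length - 1) * stride + kernel

def trace(length):
    """Return the lengths seen through the encoder and then the decoder."""
    enc = [length]
    for _ in range(DEPTH):
        enc.append(conv_out_len(enc[-1], KERNEL, STRIDE))
    dec = [enc[-1]]
    for _ in range(DEPTH):
        dec.append(transposed_conv_out_len(dec[-1], KERNEL, STRIDE))
    return enc, dec

enc, dec = trace(46420)  # a length chosen so that every layer divides evenly
print("encoder lengths:", enc)
print("decoder lengths:", dec)
```

With these assumed values, each encoder layer shortens the sequence by roughly a factor of four, so a handful of layers compress tens of thousands of samples into a few time steps. The decoder only returns the exact input length when every intermediate length divides evenly, which is why waveform models typically pad or crop their input to a compatible length.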

Spectrogram-based models outperformed Wave-U-Net, which was the most advanced waveform-based model before Demucs. Demucs builds on the Wave-U-Net architecture with tuned hyperparameters and a long short-term memory (LSTM) network, which allows the model to process entire sequences of data rather than a single data point. These improvements help the system deal with the problem of one sound overpowering another, because the decoder can fill in the subdued notes.

Demucs was evaluated by human listeners on the MusDB dataset and compared with other state-of-the-art source separation systems. The tables below come from the paper released by Facebook Research. Mean Opinion Scores are first obtained on the quality and absence of artifacts from other sources (1: many artifacts and distortion, 5: perfect quality, no artifacts):

Then on overall contamination (1: frequent and loud contamination, 5: no contamination):

Thirty-eight people each rated 20 samples, where each sample was an 8-second excerpt of a song from MusDB.
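A Mean Opinion Score is the arithmetic mean of the listeners' ratings on the 1-to-5 scale. The sketch below shows that calculation, along with a rough 95% interval based on a normal approximation. The interval is an addition for illustration and is not described in the article; the model names and ratings are hypothetical placeholders, not results from the study.

```python
import math
from statistics import mean, stdev

def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% interval)."""
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-to-5 scale")
    mos = mean(ratings)
    # Normal approximation; assumes independent ratings (a simplification).
    half_width = 1.96 * stdev(ratings) / math.sqrt(len(ratings)) if len(ratings) > 1 else 0.0
    return mos, half_width

# Hypothetical placeholder ratings, NOT data from the paper.
ratings_by_model = {
    "model_a": [4, 3, 5, 4, 4],
    "model_b": [3, 3, 2, 4, 3],
}

scores = {name: mean_opinion_score(r) for name, r in ratings_by_model.items()}
for name, (mos, half_width) in scores.items():
    print(f"{name}: MOS = {mos:.2f} +/- {half_width:.2f}")
```

In a study like the one described, every model would be scored this way once for quality and once for contamination, using all ratings collected for that model.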

The code, methods for replicating the study’s results, and models can be found on GitHub.
