OPAW: Real-Time Target Sound Extraction

In this instalment of “One Paper a Week”, we’re looking at Waveformer, a neural network for extracting specific waveforms from a sound mix in real time. If you’re thinking “Independent Component Analysis”, you’re not alone: ICA can also extract a desired signal from a mix of signals (similarly to how we are able to understand a particular speaker in a mix of conversations). The key difference: ICA is a blind, statistical separator, while Waveformer is a supervised, semantic extractor.

The abstract already dips into two concepts: dilated and causal. “Dilated” means the convolution processes samples that are spread out, with gaps between them, which widens the context window without increasing the amount of processing. “Causal” means we can’t use a popular trick from signal processing, namely looking at “future” samples (a trick that only works when one is processing pre-recorded signals). This is a consequence of wanting to process streaming sound in real time. The transformer is “causally” tamed by maintaining a matrix of attention scores in which all “future” positions are set to −∞. This ensures the transformer is incapable of latching onto any data that occurs after the current millisecond. [There actually is a tiny bit of lookahead for math reasons, which translates to a tiny bit of lag.]
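The −∞ masking trick is easy to demonstrate in a few lines. This is a generic sketch of causal attention masking, not Waveformer’s actual code: setting future scores to −∞ makes them vanish after the softmax, so the attention weights for “future” positions are exactly zero.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

T = 4  # toy sequence length
scores = np.random.default_rng(0).normal(size=(T, T))  # raw attention scores

# Causal mask: query position i may only attend to key positions j <= i,
# so every "future" entry (j > i) is set to -inf before the softmax.
future = np.triu(np.ones((T, T), dtype=bool), k=1)
scores[future] = -np.inf

weights = softmax(scores)

print(np.allclose(weights[future], 0.0))       # True: future gets zero weight
print(np.allclose(weights.sum(axis=-1), 1.0))  # True: rows still sum to one
```

Since exp(−∞) = 0, the mask costs nothing extra: the softmax simply redistributes all the attention onto past and present positions.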

The high-level picture of the Waveformer architecture: it uses a learned/adapted convolution (yep, the one from Signal Processing 101) to keep a memory of what has been heard so that it can recognise patterns, and transformers to focus on and isolate the desired signal component. In theory one could implement signal separation with only convolutions or only transformers; the issue is that convolutions alone would require a huge sample window and very well-tuned coefficients, while transformers alone would be too slow. By using convolution for the heavy lifting (memory) and a transformer for the smart part (focus), the authors built a system that is both smart enough and fast enough to run in real time.

The convolution implements an “encoder”: Waveformer uses a stack of 10 layers to build a large “receptive field” (a sound window of about 1.5 seconds of context). Because it uses dilated convolutions, the receptive field grows exponentially with depth, so the computational complexity is O(log R) in the receptive field size R, which is more efficient than the O(R) complexity found in chunk-based Transformers.
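The O(log R) claim follows from how dilated convolutions stack. A toy calculation makes it concrete; the kernel size and dilation schedule below (kernel 3, dilation doubling each layer) are assumptions for illustration and may differ from Waveformer’s actual hyperparameters:

```python
def receptive_field(layers, kernel_size=3, dilation_growth=2):
    """Receptive field (in samples) of a stack of causal dilated convolutions.

    Each layer with dilation d and kernel size k extends the receptive
    field by (k - 1) * d samples; doubling the dilation per layer makes
    the total grow exponentially with depth.
    """
    rf = 1
    for i in range(layers):
        dilation = dilation_growth ** i  # dilations 1, 2, 4, 8, ...
        rf += (kernel_size - 1) * dilation
    return rf

print(receptive_field(10))  # 2047 samples from only 10 layers
print(receptive_field(11))  # 4095 -- one more layer doubles the reach
```

Reading it the other way round: covering a receptive field of R samples needs only about log2(R) layers, which is exactly why the encoder can afford 1.5 seconds of context in real time.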

The transformer implements a “decoder”: a transformer layer queries the encoded features for the “target” sound. It uses a specialised streaming attention mechanism that only looks at the current and one previous chunk, keeping latency low and fixed. Apropos latency: the model operates on 10 ms chunks with a lookahead of only 1.45 ms, meeting the strict requirements for real-time human listening applications.
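The streaming constraint can be pictured as an attention mask over chunked frames. The sketch below is my own illustration (toy chunk size, boolean mask) of the “current plus one previous chunk” idea, not the paper’s implementation:

```python
import numpy as np

chunk = 3        # frames per chunk (toy size; the real model uses 10 ms chunks)
n_chunks = 4
T = chunk * n_chunks

# Streaming attention mask: a query frame may attend to any frame in its
# own chunk or in the immediately preceding chunk, and nothing else.
allowed = np.zeros((T, T), dtype=bool)
for q in range(T):
    c = q // chunk                     # which chunk this query frame is in
    lo = max(0, (c - 1) * chunk)       # start of the previous chunk
    hi = (c + 1) * chunk               # end of the current chunk
    allowed[q, lo:hi] = True

# Every query sees at most two chunks' worth of keys, so the per-frame
# attention cost stays constant no matter how long the stream runs.
print(allowed.sum(axis=1).max())  # 6 == 2 * chunk
```

That bounded window is what makes the latency “low and fixed”: unlike full self-attention, the cost does not grow with the length of the audio heard so far.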

I was really curious to know how one tells the model which sounds to extract. To stretch the ICA analogy: ICA isolates sounds on its own; you just have to tell it how many different sources to extract (much like choosing K in K-means clustering). Waveformer, by contrast, is trained on a fixed set of sound categories, so it can only recognise sounds in those categories. You submit a query vector to the model telling it which sounds to listen for, then feed it the mix.
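One plausible shape for such a query vector: a multi-hot selection over the trained categories, mapped through a learned embedding table. The class names, embedding dimension, and conditioning mechanism below are all illustrative assumptions, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

CLASSES = ["speech", "dog_bark", "siren"]  # toy label set; the real model is
                                           # trained on many more categories
D = 8                                      # embedding dimension (assumed)

# A learned embedding table maps each sound class to a D-dimensional vector.
embedding = rng.normal(size=(len(CLASSES), D))

def query_vector(labels):
    # Multi-hot selection: sum the embeddings of all requested classes,
    # so several target sounds can be asked for at once.
    onehot = np.zeros(len(CLASSES))
    for name in labels:
        onehot[CLASSES.index(name)] = 1.0
    return onehot @ embedding

q = query_vector(["siren"])
print(q.shape)  # (8,)
```

The resulting vector then conditions the decoder, for instance by modulating its intermediate features; the exact conditioning mechanism in Waveformer may differ from this sketch.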
