Weakly-supervised Audio Separation via Bi-modal Semantic Similarity (ICLR 2024)
We propose a novel weakly supervised learning framework for
conditional audio separation that significantly outperforms the
baselines in unsupervised and semi-supervised settings.
Overview figure – (Left) The proposed conditional audio separation framework. (Right) Comparison of our framework with the mix-and-separate baseline in unsupervised and semi-supervised settings.
Abstract
Conditional sound separation in multi-source audio mixtures without access to single-source sound data during training is a long-standing challenge. Existing mix-and-separate methods suffer a significant performance drop on multi-source training mixtures because no supervision signal is available for the single-source separation cases during training. However, in language-conditional audio separation, we do have access to a corresponding text description for each audio mixture in the training data, which can be seen as a (rough) representation of the audio sample in the language modality. This raises the question of how to generate a supervision signal for single-source audio extraction by leveraging the fact that single-source sounding language entities can be easily extracted from the text description. To this end, we propose a generic bi-modal separation framework that enhances existing unsupervised frameworks to separate single-source signals in a target modality (i.e., audio) using the easily separable corresponding signals in the conditioning modality (i.e., language), without access to single-source samples in the target modality during training. We empirically show that this is well within reach if we have access to a pretrained joint embedding model between the two modalities (i.e., CLAP). Furthermore, we incorporate our framework into two fundamental scenarios to enhance separation performance. First, our proposed methodology significantly improves the performance of purely unsupervised baselines by reducing the distribution shift between training and test samples; in particular, it achieves a 71% boost in Signal-to-Distortion Ratio (SDR) over the baseline, reaching 97.5% of the supervised learning performance. Second, we can further improve supervised learning itself by 17% by augmenting it with our proposed weakly supervised framework. Our framework achieves this by making large corpora of unsupervised data available to the supervised learning model and by utilizing a natural, robust regularization mechanism through weak supervision from the language modality, thereby enabling a powerful semi-supervised framework for audio separation.
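To make the bi-modal weak supervision concrete, the sketch below shows one way a language-side supervision signal for single-source extraction could be computed; it is a minimal illustration, not the paper's exact training objective. The separator interface, the CLAP-style encoders (embed_audio, embed_text), and the temperature value are hypothetical placeholders.

import torch
import torch.nn.functional as F

def weak_separation_loss(separator, embed_audio, embed_text, mixture, source_prompts):
    """Weakly supervised loss sketch: each source predicted from the mixture
    should be semantically close (in the joint audio-text space) to the text
    of its own sounding entity and far from the other entities' text.

    separator(mixture, prompt) -> predicted waveform   (hypothetical interface)
    embed_audio(waveform)      -> (d,) embedding       (hypothetical CLAP audio tower)
    embed_text(prompt)         -> (d,) embedding       (hypothetical CLAP text tower)
    """
    text_emb = torch.stack([F.normalize(embed_text(p), dim=-1) for p in source_prompts])

    preds, losses = [], []
    for i, prompt in enumerate(source_prompts):
        pred = separator(mixture, prompt)                # conditional prediction for entity i
        preds.append(pred)
        audio_emb = F.normalize(embed_audio(pred), dim=-1)
        sims = text_emb @ audio_emb                      # similarity to every entity's text
        # Contrastive-style weak supervision: the i-th prediction should match the i-th prompt.
        # The 0.07 temperature is an arbitrary placeholder.
        losses.append(F.cross_entropy(sims.unsqueeze(0) / 0.07,
                                      torch.tensor([i], device=sims.device)))

    # Unsupervised consistency: the predicted sources should still add up to the mixture.
    recon = F.l1_loss(torch.stack(preds).sum(dim=0), mixture)
    return recon + sum(losses) / len(losses)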
Data: We present analysis on the MUSIC [1], VGGSound [2], and AudioCaps [3] datasets.
Synthetic Mixture Training: We carried out training on synthetic multi-source mixtures from the MUSIC and VGGSound datasets.
We use the corresponding class labels as text conditioning for audio separation on the input mixtures.
Unsupervised Training: Only mixtures of two sounding sources are used; no single-source sounds are used.
Semi-supervised Training: For 95% of the training samples, only two-source mixtures are used without any single-source sounds. The remaining 5% of training samples are used for supervised training with single-source sounds (see the combined-loss sketch after this list).
Supervised Training: Single-source sounds are used directly for training.
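As a rough illustration of how such a 5%/95% split and the combined objective could be wired together (the split helper, loss names, and weighting below are placeholders rather than the paper's exact recipe):

import random

def split_semi_supervised(samples, supervised_fraction=0.05, seed=0):
    """Partition training samples: a small supervised subset keeps its
    single-source sounds; the rest is used only as unlabeled mixtures."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * supervised_fraction)
    return shuffled[:cut], shuffled[cut:]              # (supervised, mixture-only)

def semi_supervised_loss(sup_loss, weak_loss, sup_batch, mix_batch, lam=1.0):
    """Total objective sketch: supervised loss on the 5% single-source data
    plus the language-based weak loss on the 95% mixture-only data."""
    return sup_loss(sup_batch) + lam * weak_loss(mix_batch)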
Synthetic Mixture Evaluation: Two-source mixtures are built from the test set, and the class label of each sounding source is used for conditional audio separation (a small mixing and SDR sketch follows below).
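Below is a minimal sketch of how such two-source synthetic mixtures can be formed and how separation quality can be scored with SDR; the plain summation and the model.separate call are illustrative assumptions, and the paper's exact mixing and evaluation protocol may differ (e.g., in loudness normalization).

import numpy as np

def make_mixture(s1: np.ndarray, s2: np.ndarray) -> np.ndarray:
    """Create a synthetic two-source mixture by summing equal-length waveforms."""
    n = min(len(s1), len(s2))
    return s1[:n] + s2[:n]

def sdr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-8) -> float:
    """Signal-to-Distortion Ratio in dB: 10 * log10(||s||^2 / ||s - s_hat||^2)."""
    n = min(len(reference), len(estimate))
    ref, est = reference[:n], estimate[:n]
    return float(10.0 * np.log10((np.sum(ref ** 2) + eps) / (np.sum((ref - est) ** 2) + eps)))

# Usage sketch: mix two test clips, separate with class-label queries, score each source.
# mix = make_mixture(accordion_wav, erhu_wav)
# est1 = model.separate(mix, "accordion")   # hypothetical model interface
# print(sdr(accordion_wav, est1))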
Natural Mixture Training: We carried out training on natural multi-source mixtures from the AudioCaps dataset.
We use the corresponding captions, which describe the sounding events, for text conditioning on mixtures.
Unsupervised Training: Natural mixtures of multiple sounding sources (1–6) are used directly.
Semi-supervised Training: All natural mixture samples from the AudioCaps training set are used, together with 100% of the single-source sounds from the VGGSound dataset.
Natural Mixture Qualitative Evaluation: Fine-grained text prompting is used to separate the
sounds from input natural mixtures.
Natural Mixture Quantitative Evaluation: A mixture of mixtures is produced from two natural mixtures, and the whole caption of each mixture is used to perform text-conditional separation of the corresponding mixture from the input mixture-of-mixtures (see the sketch below).
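A small sketch of this mixture-of-mixtures protocol is given below; separate and sdr are passed-in callables standing in for the trained separation model and the SDR metric, so the exact setup in the paper may differ.

import numpy as np

def mixture_of_mixtures_eval(separate, sdr, mix_a, caption_a, mix_b, caption_b):
    """Combine two natural mixtures, then recover each one by conditioning
    on its whole caption; each original mixture serves as the reference."""
    n = min(len(mix_a), len(mix_b))
    mom = mix_a[:n] + mix_b[:n]                # mixture of mixtures
    est_a = separate(mom, caption_a)           # text-conditional separation
    est_b = separate(mom, caption_b)
    return sdr(mix_a[:n], est_a), sdr(mix_b[:n], est_b)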
Model Architecture: We use the same improved conditional U-Net architecture for all methods, as described in the paper (a generic text-conditioned masking sketch follows below).
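The improved conditional U-Net itself is described in the paper and is not reproduced here; the sketch below only illustrates the generic text-conditioned masking pipeline such a separator follows (STFT, a mask network modulated by the text embedding, mask application, inverse STFT). The tiny convolutional network and FiLM-style conditioning are illustrative stand-ins, not the actual architecture.

import torch
import torch.nn as nn

class TextConditionedMasker(nn.Module):
    """Illustrative stand-in for a conditional U-Net: predicts a magnitude mask
    for the mixture spectrogram, modulated by a text embedding (FiLM-style)."""
    def __init__(self, text_dim=512, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
        )
        self.film = nn.Linear(text_dim, 2 * hidden)   # per-channel scale and shift
        self.head = nn.Conv2d(hidden, 1, 1)

    def forward(self, mag, text_emb):
        # mag: (B, 1, F, T) mixture magnitude; text_emb: (B, text_dim)
        h = self.net(mag)
        gamma, beta = self.film(text_emb).chunk(2, dim=-1)
        h = h * gamma[..., None, None] + beta[..., None, None]
        return torch.sigmoid(self.head(h))            # mask in [0, 1]

def separate(model, mixture, text_emb, n_fft=1024, hop=256):
    """Apply the predicted mask to the mixture STFT and invert back to a waveform."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(mixture, n_fft, hop, window=window, return_complex=True)
    mask = model(spec.abs().unsqueeze(1), text_emb).squeeze(1)
    return torch.istft(spec * mask, n_fft, hop, window=window, length=mixture.shape[-1])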
Summary of the compared models
Supervised (CLAPSep): Supervised training on single-source sounds, with the CLAP text encoder used for conditioning.
Baseline Unsupervised (CLIPSep [4]): The CLIPSep model is used as the baseline, trained with multi-source separation on mixture samples. The CLIP text encoder is used for conditioning.
Proposed Unsupervised: Our proposed multi-source separation training on mixture samples. The CLAP text encoder is used for conditioning.
Proposed Semi-supervised: Our proposed semi-supervised separation training on both single- and multi-source samples. The CLAP text encoder is used for conditioning.
Notes
All the examples presented below use
text conditioning for source separation.
For CLAP text encoding, we use the template “The sound of [user input query]”, whereas for CLIP text encoding, we use “A photo of [user input query]”.
We present all spectrograms on a log frequency scale.
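For concreteness, the snippet below shows how the two prompt templates above could be applied before text encoding and one common way to render spectrograms on a log frequency axis; clap_text_encoder and clip_text_encoder are placeholders for whichever CLAP/CLIP text towers are loaded, and the sample rate is arbitrary.

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def build_prompt(query: str, encoder_type: str) -> str:
    """Apply the conditioning templates described above."""
    if encoder_type == "clap":
        return f"The sound of {query}"      # CLAP text template
    return f"A photo of {query}"            # CLIP text template

# cond_clap = clap_text_encoder(build_prompt("accordion", "clap"))   # placeholder encoders
# cond_clip = clip_text_encoder(build_prompt("accordion", "clip"))

def plot_log_spectrogram(wav: np.ndarray, sr: int = 16000) -> None:
    """Render a spectrogram on a log frequency axis, as in the demo figures."""
    db = librosa.amplitude_to_db(np.abs(librosa.stft(wav)), ref=np.max)
    librosa.display.specshow(db, sr=sr, x_axis="time", y_axis="log")
    plt.colorbar(format="%+2.0f dB")
    plt.show()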
Example results on “MUSIC” Dataset
Settings: We take two audio samples from the MUSIC dataset and mix them to produce a synthetic mixture for evaluation.
We carried out supervised (using single-source sounds), unsupervised (using two-source mixture sounds), and semi-supervised (using 5% single-source and 95% two-source mixture sounds) training on this data.
Example 1 – "accordion" + “erhu”
Source1: accordion
Source2: erhu
Query1: “accordion”
Query2: “erhu”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* Both sounds have large spectral overlap. Our proposed method significantly reduces cross-interference in the predictions compared to the baseline.
Example 2 – "bagpipe" + "xylophone"
Source1: bagpipe
Source2: xylophone
Query1: “bagpipe”
Query2: “xylophone”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The bagpipe sound becomes much cleaner, with reduced interference noise, in both of our method's predictions.
Example 3 – "basson" + "violin"
Source1: basson
Source2: violin
Query1: “basson”
Query2: “violin”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* In the violin separation, our proposed method significantly reduces leakage of the bassoon sound.
Example 4 – "cello" + "tuba"
Source1: cello
Source2: tuba
Query1: “cello”
Query2: “tuba”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The baseline unsupervised method cannot distinguish between cello and tuba; our method discriminates between the two much better.
Example 5 – "drum" + "flute"
Source1: drum
Source2: flute
Query1: “drum”
Query2: “flute”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The baseline unsupervised drum (source 1) prediction has audible spectral content from the flute. This leakage is greatly reduced by the proposed method.
Example 6 – "electric bass" + "congas"
Source1: electric bass
Source2: congas
Query1: “electric bass”
Query2: “congas”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The baseline unsupervised electric bass prediction is not satisfactory. Our method greatly reduces the noise.
Example 7 – "flute" + “erhu”
Source1: flute
Source2: erhu
Query1: “flute”
Query2: “erhu”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The baseline method cannot distinguish between flute and erhu, whereas our method produces satisfactory results for both predictions. Notably, the semi-supervised erhu (source 2) prediction recovers challenging high-frequency content that the supervised method misses.
Example 8 – "flute" + "violin"
Source1: flute
Source2: violin
Query1: “flute”
Query2: “violin”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The baseline unsupervised method fails to separate the violin (source 2) from the mixture, while our method produces a clearly distinguishable violin prediction. The semi-supervised violin prediction also contains more high-frequency content than the supervised one.
Example 9 – "guzheng" + "electric bass"
Source1: guzheng
Source2: electric bass
Query1: “guzheng”
Query2: “electric bass”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The baseline unsupervised method introduces significant spectral leakage in both predictions. Our proposed method largely removes this noise.
Example 10 – "piano" + "violin"
Source1: piano
Source2: violin
Query1: “piano”
Query2: “violin”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The violin spectrum is clearly visible in the piano prediction of the baseline unsupervised method. Our method significantly reduces this spectral overlap.
Example 11 – "ukulele" + "piano"
Source1: ukulele
Source2: piano
Query1: “ukulele”
Query2: “piano”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* With the baseline unsupervised method, the piano (source 2) prediction suffers significant spectral loss and is barely audible. Our method produces clearly distinguishable predictions for both sources.
Example 12 – "violin" + “accordion”
Source1: violin
Source2: accordion
Query1: “violin”
Query2: “accordion”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The baseline unsupervised method cannot distinguish between the violin and accordion sounds in this mixture. The baseline supervised and proposed unsupervised methods also struggle to separate the challenging violin sound. However, the semi-supervised method produces a notably improved source 1 prediction.
Example results on “VGGSound” Dataset
Settings: We take two audio samples from the VGGSound dataset and mix them to produce a synthetic mixture for evaluation.
We carried out supervised (using single-source sounds), unsupervised (using two-source mixture sounds), and semi-supervised (using 5% single-source and 95% two-source mixture sounds) training on this data.
Example 1 – "people coughing" + “cell phone buzzing”
Source1: people coughing
Source2: cell phone buzzing
Query1: “people coughing”
Query2: “cell phone buzzing”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The proposed method generates lower cross-interference noise than the baseline unsupervised method. The semi-supervised prediction for source 2 has lower noise than the supervised one.
Example 2 – "people battle cry" + “police radio chatter”
Source1: people battle cry
Source2: police radio chatter
Query1: “people battle cry”
Query2: “police radio chatter”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The baseline unsupervised method cannot discriminate between these two sounds. The police radio chatter (source 2) prediction is much clearer with the proposed method.
Example 3 – "pumping water" + “playing bongo”
Source1: pumping water
Source2: playing bongo
Query1: “pumping water”
Query2: “playing bongo”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The source 1 prediction suffers significant spectral loss with the baseline unsupervised method. Our method achieves consistent results for both predictions.
Example 4 – "thunder" + “people cheering”
Source1: thunder
Source2: people cheering
Query1: “thunder”
Query2: “people cheering”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The proposed method generates a notably better thunder (source 1) prediction. Also, the semi-supervised source 2 prediction sounds better than the supervised one.
Example 5 – "air conditioning noise" + “people crowd”
Source1: air conditioning noise
Source2: people crowd
Query1: “air conditioning noise”
Query2: “people crowd”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The baseline unsupervised method cannot isolate the low-magnitude air conditioning noise and also loses spectral content in the source 2 prediction. Our method performs well on source 2; however, all methods struggle to separate the source 1 sound.
Example 6 – "playing table tennis" + “playing accordion”
Source1: playing table tennis
Source2: playing accordion
Query1: “playing table tennis”
Query2: “playing accordion”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The baseline unsupervised method loses significant parts of the source 1 sound; our method produces a much cleaner source 1 prediction.
Example 7 – "child speech, kid speaking" + “fox barking”
Source1: child speech, kid speaking
Source2: fox barking
Query1: “child speech, kid speaking”
Query2: “fox barking”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The baseline method introduces more interference noise in both predictions; the proposed method largely reduces it. Moreover, the semi-supervised source 2 prediction has visibly lower noise than the supervised one.
Example 8 – "people marching" + “lathe spinning”
Source1: people marching
Source2: lathe spinning
Query1: “people marching”
Query2: “lathe spinning”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The lathe spinning (source 2) prediction is notably better with the proposed method than with the unsupervised baseline.
Example 9 – "child singing" + “mouse squeaking”
Source1: child singing
Source2: mouse squeaking
Query1: “child singing”
Query2: “mouse squeaking”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The mouse squeaking sound (source 2) is mostly suppressed in the unsupervised baseline prediction, and the source 1 prediction is very noisy. Our method achieves considerably better performance on both.
Example 10 – "cat meowing" + “pigeon, dove cooing”
Source1: cat meowing
Source2: pigeon, dove cooing
Query1: “cat meowing”
Query2: “pigeon, dove cooing”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The proposed method achieves lower interference noise in the source 2 prediction than the unsupervised baseline.
Example 11 – "hair dryer drying" + “people slurping”
Source1: hair dryer drying
Source2: people slurping
Query1: “hair dryer drying”
Query2: “people slurping”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The baseline unsupervised method cannot distinguish between these two sounds. The proposed semi-supervised source 2 prediction has lower noise than the supervised one.
Example 12 – "playing electronic organ" + “sharpen knife”
Source1: playing electronic organ
Source2: sharpen knife
Query1: “playing electronic organ”
Query2: “sharpen knife”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The baseline unsupervised method produces a barely audible source 2 prediction; the proposed method improves the predictions considerably. Notably, the proposed semi-supervised source 2 prediction has lower noise than the supervised one.
Example 13 – "playing bass drum" + “airplane flyby”
Source1: playing bass drum
Source2: airplane flyby
Query1: “playing bass drum”
Query2: “airplane flyby”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The proposed method significantly improves the source 2 separation performance over the unsupervised baseline.
Example 14 – "heart sounds, heartbeat" + “warbler chirping”
Source1: heart sounds, heartbeat
Source2: warbler chirping
Query1: “heart sounds, heartbeat”
Query2: “warbler chirping”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The source 2 prediction is completely suppressed by the baseline unsupervised method; the proposed method achieves a satisfactory result.
Example 15 – "elephant trumpeting" + “rope skipping”
Source1: elephant trumpeting
Source2: rope skipping
Query1: “elephant trumpeting”
Query2: “rope skipping”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* Separation performance on the rope skipping sound is better with the proposed method than with the unsupervised baseline.
Example 16 – "male singing" + “people burping”
Source1: male singing
Source2: people burping
Query1: “male singing”
Query2: “people burping”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* All the methods mostly suffer from spectral interference in this challenging example. However, the proposed method's source 2 prediction is sharper than the baseline's.
Example 17 – "lathe spinning" + “air conditioning noise”
Source1: lathe spinning
Source2: air conditioning noise
Query1: “lathe spinning”
Query2: “air conditioning noise”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* In this very low-magnitude example, the proposed method achieves comparatively better performance. The low-frequency content of source 2, which the baseline misses, is properly recovered by the proposed method.
Example 18 – "dog growling" + “car engine starting”
Source1: dog growling
Source2: car engine starting
Query1: “dog growling”
Query2: “car engine starting”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The baseline unsupervised method cannot properly distinguish between these two sounds; the proposed methods achieve comparatively better predictions.
* All the methods struggle in this challenging example. However, the skateboarding prediction is comparatively better with the proposed method.
Example 20 – "playing zither" + “cutting hair with electric trimmers”
Source1: playing zither
Source2: cutting hair with electric trimmers
Query1: “playing zither”
Query2: “cutting hair with electric trimmers”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The baseline unsupervised method loses most of the spectral content in both predictions. The proposed semi-supervised source 2 prediction is slightly better than the supervised one.
Example results on “AudioCaps” Dataset
Settings: We take natural multi-source audio samples from the AudioCaps dataset for evaluation, and target sounds are separated from each natural mixture using fine-grained text prompts derived from its caption.
We carried out unsupervised (using natural multi-source mixtures) and semi-supervised (additionally using single-source sounds from VGGSound) training as described above.
Example 1 – "A woman speaks followed by laughter and a cat crying”
Query1: “A woman speaks”
Query2: “A cat crying”
[Audio and spectrogram panels: Mixture; Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each query.]
* The baseline method's predictions contain significant overlap due to under-separation in both queries. Our method sharply discriminates between the two sounds.
Example 2 – "A flushing of water and people talking”
Query1: “A flushing of water”
Query2: “People are talking”
[Audio and spectrogram panels: Mixture; Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each query.]
* The baseline method generates a low-magnitude prediction for query 1 and significant spectral overlap for query 2. Our method greatly improves both predictions with reduced overlap.
Example 3 – "Metal clashes and a man speaks”
Query1: “Metal clashes”
Query2: “A man speaks”
[Audio and spectrogram panels: Mixture; Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each query.]
* The baseline method generates a low-magnitude prediction for query 1 and significant spectral overlap for query 2. Our method greatly improves both predictions with reduced overlap. Also, the semi-supervised prediction for query 2 shows better noise suppression than the unsupervised predictions.
Example 4 – "A man speeches while a crowd whispers”
Query1: “A man speeches”
Query2: “A crowd whispers”
[Audio and spectrogram panels: Mixture; Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each query.]
* Our proposed method generates clearly distinguishable sounds for both queries, whereas the baseline method produces a low-magnitude prediction for query 2 and spectral overlap in query 1.
Example 5 – "A woman talks followed by sizzling and frying of food”
Query1: “A woman talks”
Query2: “Sizzling and frying of food”
[Audio and spectrogram panels: Mixture; Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each query.]
* The separation for query 2 is noticeably better with our method than with the baseline.
Example 6 – "An engine works while some frogs croak”
Query1: “An engine works”
Query2: “Some frogs croak”
[Audio and spectrogram panels: Mixture; Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each query.]
* Our method generates significantly better separation of the challenging query 2 sound than the baseline.
Example 7 – "A train horn sounds then someone laughs”
Query1: “A train horn sounds”
Query2: “Someone laughs”
[Audio and spectrogram panels: Mixture; Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each query.]
* The baseline method cannot properly separate either query's sound. In contrast, our method generates significantly better predictions with reduced overlap.
Example 8 – "International music plays as water pours into a pot and finally some splashes”
Query1: “International music plays”
Query2: “Water pours into a pot”
[Audio and spectrogram panels: Mixture; Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each query.]
* This is a challenging multi-source mixture example. Our proposed method shows better qualitative results for both queries.
Example 9 – "An infant crying followed by a woman giggling”
Query1: “An infant is crying”
Query2: “A woman is giggling”
[Audio and spectrogram panels: Mixture; Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each query.]
* This is also a very challenging example: separating the woman giggling from the dominant infant crying. Nevertheless, the proposed method generates noticeably better separation than the baseline.
Example 10 – "A man speaks as birds chirp and dogs bark”
Query1: “A man speaks”
Query2: “Birds chirp”
[Audio and spectrogram panels: Mixture; Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each query.]
* The baseline predictions contain significant overlap and reduced magnitude. The proposed method significantly improves the separation performance.
References
1. Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, and Antonio Torralba. The sound of pixels. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 570–586, 2018.
2. Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. VGGSound: A large-scale audio-visual dataset. In ICASSP 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 721–725. IEEE, 2020.
3. Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. AudioCaps: Generating captions for audios in the wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 119–132, 2019.
4. Hao-Wen Dong, Naoya Takahashi, Yuki Mitsufuji, Julian McAuley, and Taylor Berg-Kirkpatrick. CLIPSep: Learning text-queried sound separation with noisy unlabeled videos. In The Eleventh International Conference on Learning Representations (ICLR), 2023.