Weakly-supervised Audio Separation via Bi-modal Semantic Similarity (ICLR 2024)
We propose a novel weakly supervised learning framework for
conditional audio separation that significantly outperforms the
baselines in unsupervised and semi-supervised settings.
Overview figure – (Left) The proposed conditional audio separation framework. (Right) Comparison of our framework with the mix-and-separate baseline in unsupervised and semi-supervised settings.
Abstract
Conditional sound separation in multi-source audio mixtures without access to single-source sound data during training is a long-standing challenge. Existing mix-and-separate methods suffer a significant performance drop on multi-source training mixtures because no supervision signal is available for the single-source separation cases during training. However, in language-conditional audio separation, we do have access to a corresponding text description for each audio mixture in the training data, which can be seen as a (rough) representation of the audio sample in the language modality. This raises the question of how to generate a supervision signal for single-source audio extraction by leveraging the fact that single-source sounding language entities can be easily extracted from the text description. To this end, we propose a generic bi-modal separation framework that enhances existing unsupervised frameworks to separate single-source signals in a target modality (i.e., audio) using the easily separable corresponding signals in the conditioning modality (i.e., language), without access to single-source samples in the target modality during training. We empirically show that this is well within reach if we have access to a pretrained joint embedding model between the two modalities (i.e., CLAP). Furthermore, we incorporate our framework into two fundamental scenarios to enhance separation performance. First, our proposed methodology significantly improves the performance of purely unsupervised baselines by reducing the distribution shift between training and test samples; in particular, it achieves a 71% boost in Signal-to-Distortion Ratio (SDR) over the baseline, reaching 97.5% of the supervised learning performance. Second, we can further improve supervised learning itself by 17% by augmenting it with our proposed weakly supervised framework. Our framework achieves this by making large corpora of unsupervised data available to the supervised learning model and by utilizing a natural, robust regularization mechanism through weak supervision from the language modality, thereby enabling a powerful semi-supervised framework for audio separation.
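To make the bi-modal weak supervision concrete, the sketch below shows one way a language-side supervision signal for single-source extraction could be computed; it is a minimal illustration, not the paper's exact training objective. The separator interface, the CLAP-style encoders (embed_audio, embed_text), and the temperature value are hypothetical placeholders.

import torch
import torch.nn.functional as F

def weak_separation_loss(separator, embed_audio, embed_text, mixture, source_prompts):
    """Weakly supervised loss sketch: each source predicted from the mixture
    should be semantically close (in the joint audio-text space) to the text
    of its own sounding entity and far from the other entities' text.

    separator(mixture, prompt) -> predicted waveform   (hypothetical interface)
    embed_audio(waveform)      -> (d,) embedding       (hypothetical CLAP audio tower)
    embed_text(prompt)         -> (d,) embedding       (hypothetical CLAP text tower)
    """
    text_emb = torch.stack([F.normalize(embed_text(p), dim=-1) for p in source_prompts])

    preds, losses = [], []
    for i, prompt in enumerate(source_prompts):
        pred = separator(mixture, prompt)                # conditional prediction for entity i
        preds.append(pred)
        audio_emb = F.normalize(embed_audio(pred), dim=-1)
        sims = text_emb @ audio_emb                      # similarity to every entity's text
        # Contrastive-style weak supervision: the i-th prediction should match the i-th prompt.
        # The 0.07 temperature is an arbitrary placeholder.
        losses.append(F.cross_entropy(sims.unsqueeze(0) / 0.07,
                                      torch.tensor([i], device=sims.device)))

    # Unsupervised consistency: the predicted sources should still add up to the mixture.
    recon = F.l1_loss(torch.stack(preds).sum(dim=0), mixture)
    return recon + sum(losses) / len(losses)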
Data: We present analysis on the MUSIC [1], VGGSound [2], and AudioCaps [3] datasets.
Synthetic Mixture Training: We carried out training on synthetic multi-source mixtures from the MUSIC and VGGSound datasets.
We use the corresponding class labels as text conditioning for audio separation on the input mixtures.
Unsupervised Training: Only mixtures of two sounding sources are used; no single-source sounds are used.
Semi-supervised Training: For 95% of the training samples, only two-source mixtures are used without any single-source sounds. The remaining 5% of training samples are used for supervised training with single-source sounds (see the combined-loss sketch after this list).
Supervised Training: Single-source sounds are used directly for training.
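As a rough illustration of how such a 5%/95% split and the combined objective could be wired together (the split helper, loss names, and weighting below are placeholders rather than the paper's exact recipe):

import random

def split_semi_supervised(samples, supervised_fraction=0.05, seed=0):
    """Partition training samples: a small supervised subset keeps its
    single-source sounds; the rest is used only as unlabeled mixtures."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * supervised_fraction)
    return shuffled[:cut], shuffled[cut:]              # (supervised, mixture-only)

def semi_supervised_loss(sup_loss, weak_loss, sup_batch, mix_batch, lam=1.0):
    """Total objective sketch: supervised loss on the 5% single-source data
    plus the language-based weak loss on the 95% mixture-only data."""
    return sup_loss(sup_batch) + lam * weak_loss(mix_batch)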
Synthetic Mixture Evaluation: Two-source mixtures are built from the test set, and the class label of each sounding source is used for conditional audio separation (a small mixing and SDR sketch follows below).
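Below is a minimal sketch of how such two-source synthetic mixtures can be formed and how separation quality can be scored with SDR; the plain summation and the model.separate call are illustrative assumptions, and the paper's exact mixing and evaluation protocol may differ (e.g., in loudness normalization).

import numpy as np

def make_mixture(s1: np.ndarray, s2: np.ndarray) -> np.ndarray:
    """Create a synthetic two-source mixture by summing equal-length waveforms."""
    n = min(len(s1), len(s2))
    return s1[:n] + s2[:n]

def sdr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-8) -> float:
    """Signal-to-Distortion Ratio in dB: 10 * log10(||s||^2 / ||s - s_hat||^2)."""
    n = min(len(reference), len(estimate))
    ref, est = reference[:n], estimate[:n]
    return float(10.0 * np.log10((np.sum(ref ** 2) + eps) / (np.sum((ref - est) ** 2) + eps)))

# Usage sketch: mix two test clips, separate with class-label queries, score each source.
# mix = make_mixture(accordion_wav, erhu_wav)
# est1 = model.separate(mix, "accordion")   # hypothetical model interface
# print(sdr(accordion_wav, est1))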
Natural Mixture Training: We carried out training on natural multi-source mixtures from the AudioCaps dataset.
We use the corresponding captions, which describe the sounding events, for text conditioning on mixtures.
Unsupervised Training: Natural mixtures of multiple sounding sources (1–6) are used directly.
Semi-supervised Training: All natural mixture samples from the AudioCaps training set are used, together with 100% of the single-source sounds from the VGGSound dataset.
Natural Mixture Qualitative Evaluation: Fine-grained text prompting is used to separate the
sounds from input natural mixtures.
Natural Mixture Quantitative Evaluation: A mixture of mixtures is produced from two natural mixtures, and the whole caption of each mixture is used to perform text-conditional separation of the corresponding mixture from the input mixture-of-mixtures (see the sketch below).
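A small sketch of this mixture-of-mixtures protocol is given below; separate and sdr are passed-in callables standing in for the trained separation model and the SDR metric, so the exact setup in the paper may differ.

import numpy as np

def mixture_of_mixtures_eval(separate, sdr, mix_a, caption_a, mix_b, caption_b):
    """Combine two natural mixtures, then recover each one by conditioning
    on its whole caption; each original mixture serves as the reference."""
    n = min(len(mix_a), len(mix_b))
    mom = mix_a[:n] + mix_b[:n]                # mixture of mixtures
    est_a = separate(mom, caption_a)           # text-conditional separation
    est_b = separate(mom, caption_b)
    return sdr(mix_a[:n], est_a), sdr(mix_b[:n], est_b)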
Model Architecture: We use the same improved conditional U-Net architecture for all methods, as described in the paper (a generic text-conditioned masking sketch follows below).
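The improved conditional U-Net itself is described in the paper and is not reproduced here; the sketch below only illustrates the generic text-conditioned masking pipeline such a separator follows (STFT, a mask network modulated by the text embedding, mask application, inverse STFT). The tiny convolutional network and FiLM-style conditioning are illustrative stand-ins, not the actual architecture.

import torch
import torch.nn as nn

class TextConditionedMasker(nn.Module):
    """Illustrative stand-in for a conditional U-Net: predicts a magnitude mask
    for the mixture spectrogram, modulated by a text embedding (FiLM-style)."""
    def __init__(self, text_dim=512, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
        )
        self.film = nn.Linear(text_dim, 2 * hidden)   # per-channel scale and shift
        self.head = nn.Conv2d(hidden, 1, 1)

    def forward(self, mag, text_emb):
        # mag: (B, 1, F, T) mixture magnitude; text_emb: (B, text_dim)
        h = self.net(mag)
        gamma, beta = self.film(text_emb).chunk(2, dim=-1)
        h = h * gamma[..., None, None] + beta[..., None, None]
        return torch.sigmoid(self.head(h))            # mask in [0, 1]

def separate(model, mixture, text_emb, n_fft=1024, hop=256):
    """Apply the predicted mask to the mixture STFT and invert back to a waveform."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(mixture, n_fft, hop, window=window, return_complex=True)
    mask = model(spec.abs().unsqueeze(1), text_emb).squeeze(1)
    return torch.istft(spec * mask, n_fft, hop, window=window, length=mixture.shape[-1])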
Summary of the compared models
Supervised (CLAPSep): Supervised training on single-source sounds, with the CLAP text encoder used for conditioning.
Baseline Unsupervised (CLIPSep [4]): The CLIPSep model is used as the baseline, trained with multi-source separation on mixture samples. The CLIP text encoder is used for conditioning.
Proposed Unsupervised: Our proposed multi-source separation training on mixture samples. The CLAP text encoder is used for conditioning.
Proposed Semi-supervised: Our proposed semi-supervised separation training on both single- and multi-source samples. The CLAP text encoder is used for conditioning.
Notes
All the examples presented below use
text conditioning for source separation.
For CLAP text encoding, we use the template “The sound of [user input query]”, whereas for CLIP text encoding, we use “A photo of [user input query]”.
We present all spectrograms on a log frequency scale.
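For concreteness, the snippet below shows how the two prompt templates above could be applied before text encoding and one common way to render spectrograms on a log frequency axis; clap_text_encoder and clip_text_encoder are placeholders for whichever CLAP/CLIP text towers are loaded, and the sample rate is arbitrary.

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def build_prompt(query: str, encoder_type: str) -> str:
    """Apply the conditioning templates described above."""
    if encoder_type == "clap":
        return f"The sound of {query}"      # CLAP text template
    return f"A photo of {query}"            # CLIP text template

# cond_clap = clap_text_encoder(build_prompt("accordion", "clap"))   # placeholder encoders
# cond_clip = clip_text_encoder(build_prompt("accordion", "clip"))

def plot_log_spectrogram(wav: np.ndarray, sr: int = 16000) -> None:
    """Render a spectrogram on a log frequency axis, as in the demo figures."""
    db = librosa.amplitude_to_db(np.abs(librosa.stft(wav)), ref=np.max)
    librosa.display.specshow(db, sr=sr, x_axis="time", y_axis="log")
    plt.colorbar(format="%+2.0f dB")
    plt.show()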
Example results on “MUSIC” Dataset
Settings: We take two audio samples from the MUSIC dataset and mix them to produce a synthetic mixture for evaluation.
We carried out supervised (using single-source sounds), unsupervised (using two-source mixture sounds), and semi-supervised (using 5% single-source and 95% two-source mixture sounds) training on this data.
Example 1 – "accordion" + “erhu”
Source1: accordion
Source2: erhu
Query1: “accordion”
Query2: “erhu”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* Both sounds have large spectral overlap. Our proposed method significantly reduces cross-interference in the predictions compared to the baseline.
Example 2 – "bagpipe" + "xylophone"
Source1: bagpipe
Source2: xylophone
Query1: “bagpipe”
Query2: “xylophone”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The bagpipe sound becomes much cleaner, with reduced interference noise, in both of our method's predictions.
Example 3 – "basson" + "violin"
Source1: basson
Source2: violin
Query1: “basson”
Query2: “violin”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* In the violin separation, our proposed method significantly reduces leakage of the bassoon sound.
Example 4 – "cello" + "tuba"
Source1: cello
Source2: tuba
Query1: “cello”
Query2: “tuba”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The baseline unsupervised method cannot distinguish between cello and tuba; our method discriminates between the two much better.
Example 5 – "drum" + "flute"
Source1: drum
Source2: flute
Query1: “drum”
Query2: “flute”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The baseline unsupervised drum (source 1) prediction has audible spectral content from the flute. This leakage is greatly reduced by the proposed method.
Example 6 – "electric bass" + "congas"
Source1: electric bass
Source2: congas
Query1: “electric bass”
Query2: “congas”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The baseline unsupervised electric bass prediction is not satisfactory. Our method greatly reduces the noise.
Example 7 – "flute" + “erhu”
Source1: flute
Source2: erhu
Query1: “flute”
Query2: “erhu”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The baseline method cannot distinguish between flute and erhu, whereas our method produces satisfactory results for both predictions. Notably, the semi-supervised erhu (source 2) prediction recovers challenging high-frequency content that the supervised method misses.
Example 8 – "flute" + "violin"
Source1: flute
Source2: violin
Query1: “flute”
Query2: “violin”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The baseline unsupervised method fails to separate the violin (source 2) from the mixture, while our method produces a clearly distinguishable violin prediction. The semi-supervised violin prediction also contains more high-frequency content than the supervised one.
Example 9 – "guzheng" + "electric bass"
Source1: guzheng
Source2: electric bass
Query1: “guzheng”
Query2: “electric bass”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The baseline unsupervised method introduces significant spectral leakage in both predictions. Our proposed method largely removes this noise.
Example 10 – "piano" + "violin"
Source1: piano
Source2: violin
Query1: “piano”
Query2: “violin”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The violin spectrum is clearly visible in the piano prediction of the baseline unsupervised method. Our method significantly reduces this spectral overlap.
Example 11 – "ukulele" + "piano"
Source1: ukulele
Source2: piano
Query1: “ukulele”
Query2: “piano”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* With the baseline unsupervised method, the piano (source 2) prediction suffers significant spectral loss and is barely audible. Our method produces clearly distinguishable predictions for both sources.
Example 12 – "violin" + “accordion”
Source1: violin
Source2: accordion
Query1: “violin”
Query2: “accordion”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The baseline unsupervised method cannot distinguish between the violin and accordion sounds in this mixture. The baseline supervised and proposed unsupervised methods also struggle to separate the challenging violin sound. However, the semi-supervised method produces a notably improved source 1 prediction.
Example results on “VGGSound” Dataset
Settings: We take two audio samples from the VGGSound dataset and mix them to produce a synthetic mixture for evaluation.
We carried out supervised (using single-source sounds), unsupervised (using two-source mixture sounds), and semi-supervised (using 5% single-source and 95% two-source mixture sounds) training on this data.
Example 1 – "people coughing" + “cell phone buzzing”
Source1: people coughing
Source2: cell phone buzzing
Query1: “people coughing”
Query2: “cell phone buzzing”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The proposed method generates lower cross-interference noise than the baseline unsupervised method. The semi-supervised prediction for source 2 has lower noise than the supervised one.
Example 2 – "people battle cry" + “police radio chatter”
Source1: people battle cry
Source2: police radio chatter
Query1: “people battle cry”
Query2: “police radio chatter”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The baseline unsupervised method cannot discriminate between these two sounds. The police radio chatter (source 2) prediction is much clearer with the proposed method.
Example 3 – "pumping water" + “playing bongo”
Source1: pumping water
Source2: playing bongo
Query1: “pumping water”
Query2: “playing bongo”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The source 1 prediction suffers significant spectral loss with the baseline unsupervised method. Our method achieves consistent results for both predictions.
Example 4 – "thunder" + “people cheering”
Source1: thunder
Source2: people cheering
Query1: “thunder”
Query2: “people cheering”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The proposed method generates a notably better thunder (source 1) prediction. Also, the semi-supervised source 2 prediction sounds better than the supervised one.
Example 5 – "air conditioning noise" + “people crowd”
Source1: air conditioning noise
Source2: people crowd
Query1: “air conditioning noise”
Query2: “people crowd”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The baseline unsupervised method cannot isolate the low-magnitude air conditioning noise and also loses spectral content in the source 2 prediction. Our method performs well on source 2; however, all methods struggle to separate the source 1 sound.
Example 6 – "playing table tennis" + “playing accordion”
Source1: playing table tennis
Source2: playing accordion
Query1: “playing table tennis”
Query2: “playing accordion”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The baseline unsupervised method loses significant parts of the source 1 sound; our method produces a much cleaner source 1 prediction.
Example 7 – "child speech, kid speaking" + “fox barking”
Source1: child speech, kid speaking
Source2: fox barking
Query1: “child speech, kid speaking”
Query2: “fox barking”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The baseline method introduces more interference noise in both predictions; the proposed method largely reduces it. Moreover, the semi-supervised source 2 prediction has visibly lower noise than the supervised one.
Example 8 – "people marching" + “lathe spinning”
Source1: people marching
Source2: lathe spinning
Query1: “people marching”
Query2: “lathe spinning”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The lathe spinning (source 2) prediction is notably better with the proposed method than with the unsupervised baseline.
Example 9 – "child singing" + “mouse squeaking”
Source1: child singing
Source2: mouse squeaking
Query1: “child singing”
Query2: “mouse squeaking”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The mouse squeaking sound (source 2) is mostly suppressed in the unsupervised baseline prediction, and the source 1 prediction is very noisy. Our method achieves considerably better performance on both.
Example 10 – "cat meowing" + “pigeon, dove cooing”
Source1: cat meowing
Source2: pigeon, dove cooing
Query1: “cat meowing”
Query2: “pigeon, dove cooing”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The proposed method achieves lower interference noise in the source 2 prediction than the unsupervised baseline.
Example 11 – "hair dryer drying" + “people slurping”
Source1: hair dryer drying
Source2: people slurping
Query1: “hair dryer drying”
Query2: “people slurping”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The baseline unsupervised method cannot distinguish between these two sounds. The proposed semi-supervised source 2 prediction has lower noise than the supervised one.
Example 12 – "playing electronic organ" + “sharpen knife”
Source1: playing electronic organ
Source2: sharpen knife
Query1: “playing electronic organ”
Query2: “sharpen knife”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The baseline unsupervised method produces a barely audible source 2 prediction; the proposed method improves the predictions considerably. Notably, the proposed semi-supervised source 2 prediction has lower noise than the supervised one.
Example 13 – "playing bass drum" + “airplane flyby”
Source1: playing bass drum
Source2: airplane flyby
Query1: “playing bass drum”
Query2: “airplane flyby”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The proposed method significantly improves the source 2 separation performance over the unsupervised baseline.
Example 14 – "heart sounds, heartbeat" + “warbler chirping”
Source1: heart sounds, heartbeat
Source2: warbler chirping
Query1: “heart sounds, heartbeat”
Query2: “warbler chirping”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The source 2 prediction is completely suppressed by the baseline unsupervised method; the proposed method achieves a satisfactory result.
Example 15 – "elephant trumpeting" + “rope skipping”
Source1: elephant trumpeting
Source2: rope skipping
Query1: “elephant trumpeting”
Query2: “rope skipping”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* Separation performance on the rope skipping sound is better with the proposed method than with the unsupervised baseline.
Example 16 – "male singing" + “people burping”
Source1: male singing
Source2: people burping
Query1: “male singing”
Query2: “people burping”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* All the methods mostly suffer from spectral interference in this challenging example. However, the proposed method's source 2 prediction is sharper than the baseline's.
Example 17 – "lathe spinning" + “air conditioning noise”
Source1: lathe spinning
Source2: air conditioning noise
Query1: “lathe spinning”
Query2: “air conditioning noise”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* In this very low-magnitude example, the proposed method achieves comparatively better performance. The low-frequency content of source 2, which the baseline misses, is properly recovered by the proposed method.
Example 18 – "dog growling" + “car engine starting”
Source1: dog growling
Source2: car engine starting
Query1: “dog growling”
Query2: “car engine starting”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The baseline unsupervised method cannot properly distinguish between these two sounds; the proposed methods achieve comparatively better predictions.
* All the methods struggle in this challenging example. However, the skateboarding prediction is comparatively better with the proposed method.
Example 20 – "playing zither" + “cutting hair with electric trimmers”
Source1: playing zither
Source2: cutting hair with electric trimmers
Query1: “playing zither”
Query2: “cutting hair with electric trimmers”
[Audio and spectrogram panels: Mixture; Ground truth (Source 1, Source 2); Supervised, Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each source.]
* The baseline unsupervised method loses most of the spectral content in both predictions. The proposed semi-supervised source 2 prediction is slightly better than the supervised one.
Example results on “AudioCaps” Dataset
Settings: We take natural multi-source audio samples from the AudioCaps dataset for evaluation, and target sounds are separated from each natural mixture using fine-grained text prompts derived from its caption.
We carried out unsupervised (using natural multi-source mixtures) and semi-supervised (additionally using single-source sounds from VGGSound) training as described above.
Example 1 – "A woman speaks followed by laughter and a cat crying”
Query1: “A woman speaks”
Query2: “A cat crying”
[Audio and spectrogram panels: Mixture; Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each query.]
* The baseline method's predictions contain significant overlap due to under-separation in both queries. Our method sharply discriminates between the two sounds.
Example 2 – "A flushing of water and people talking”
Query1: “A flushing of water”
Query2: “People are talking”
[Audio and spectrogram panels: Mixture; Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each query.]
* The baseline method generates a low-magnitude prediction for query 1 and significant spectral overlap for query 2. Our method greatly improves both predictions with reduced overlap.
Example 3 – "Metal clashes and a man speaks”
Query1: “Metal clashes”
Query2: “A man speaks”
[Audio and spectrogram panels: Mixture; Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each query.]
* The baseline method generates a low-magnitude prediction for query 1 and significant spectral overlap for query 2. Our method greatly improves both predictions with reduced overlap. Also, the semi-supervised prediction for query 2 shows better noise suppression than the unsupervised predictions.
Example 4 – "A man speeches while a crowd whispers”
Query1: “A man speeches”
Query2: “A crowd whispers”
[Audio and spectrogram panels: Mixture; Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each query.]
* Our proposed method generates clearly distinguishable sounds for both queries, whereas the baseline method produces a low-magnitude prediction for query 2 and spectral overlap in query 1.
Example 5 – "A woman talks followed by sizzling and frying of food”
Query1: “A woman talks”
Query2: “Sizzling and frying of food”
[Audio and spectrogram panels: Mixture; Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each query.]
* The separation for query 2 is noticeably better with our method than with the baseline.
Example 6 – "An engine works while some frogs croak”
Query1: “An engine works”
Query2: “Some frogs croak”
[Audio and spectrogram panels: Mixture; Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each query.]
* Our method generates significantly better separation of the challenging query 2 sound than the baseline.
Example 7 – "A train horn sounds then someone laughs”
Query1: “A train horn sounds”
Query2: “Someone laughs”
[Audio and spectrogram panels: Mixture; Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each query.]
* The baseline method cannot properly separate either query's sound. In contrast, our method generates significantly better predictions with reduced overlap.
Example 8 – "International music plays as water pours into a pot and finally some splashes”
Query1: “International music plays”
Query2: “Water pours into a pot”
[Audio and spectrogram panels: Mixture; Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each query.]
* This is a challenging multi-source mixture example. Our proposed method shows better qualitative results for both queries.
Example 9 – "An infant crying followed by a woman giggling”
Query1: “An infant is crying”
Query2: “A woman is giggling”
[Audio and spectrogram panels: Mixture; Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each query.]
* This is also a very challenging example: separating the woman giggling from the dominant infant crying. Nevertheless, the proposed method generates noticeably better separation than the baseline.
Example 10 – "A man speaks as birds chirp and dogs bark”
Query1: “A man speaks”
Query2: “Birds chirp”
[Audio and spectrogram panels: Mixture; Unsupervised (CLIPSep), Unsupervised (Proposed), and Semi-supervised (Proposed) predictions for each query.]
* The baseline predictions contain significant overlap and reduced magnitude. The proposed method significantly improves the separation performance.
References
1. Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, and Antonio Torralba. The sound of pixels. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 570–586, 2018.
2. Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. VGGSound: A large-scale audio-visual dataset. In ICASSP 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 721–725. IEEE, 2020.
3. Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. AudioCaps: Generating captions for audios in the wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 119–132, 2019.
4. Hao-Wen Dong, Naoya Takahashi, Yuki Mitsufuji, Julian McAuley, and Taylor Berg-Kirkpatrick. CLIPSep: Learning text-queried sound separation with noisy unlabeled videos. In The Eleventh International Conference on Learning Representations (ICLR), 2023.