SeparateLeadStereo, with Time-Frequency choice

Provides a class (SeparateLeadProcess) that runs several processing steps on an audio file in order to extract the lead instrument/main voice from a (stereophonic) audio mixture.

copyright (C) 2011 - 2013 Jean-Louis Durrieu

class pyfasst.SeparateLeadStereo.SeparateLeadStereoTF.SeparateLeadProcess(inputAudioFilename, windowSize=0.0464, hopsize=None, NFT=None, nbIter=10, numCompAccomp=40, minF0=39, maxF0=2000, stepNotes=16, chirpPerF0=1, K_numFilters=4, P_numAtomFilters=30, imageCanvas=None, wavCanvas=None, progressBar=None, verbose=True, outputDirSuffix='/', minF0search=None, maxF0search=None, tfrepresentation='stft', cqtfmax=4000, cqtfmin=50, cqtbins=48, cqtWinFunc=sqrt_blackmanharris, cqtAtomHopFactor=0.25, initHF00='random', freeMemory=True)


Class which implements the source separation algorithm, separating the ‘lead’ voice from the ‘accompaniment’. It can handle the task automatically (the ‘lead’ voice is then taken to be the most energetic one), or it can be told manually what the ‘lead’ is (through the melody line).

dataType : dtype
this is the input data type (usually the same as the audio encoding)
displayEvolution : boolean
display the evolution of the arrays (notably HF0)
F, N : integer, integer
F the number of frequency bins in the time-frequency representation
(this is half the Fourier bins, + 1)

N the number of analysis input frames

files :

dictionary containing the filenames of the output files for the separated signals, with the following keys (after initialization)

‘inputAudioFilename’ : input filename

‘mus_output_file’ : output filename for the estimated ‘accompaniment’, appending ‘_acc.wav’ to the radical.

‘outputDirSuffix’ : the subfolder name to be appended to the path of the directory of the input file, the output files will be written in that subfolder

‘outputDir’ : the full path of the output files directory

‘pathBaseName’ : base name for the output files (full path + radical for all output files)

‘pitch_output_file’ : output filename for the estimated melody line appending ‘_pitches.txt’ to the radical.

‘voc_output_file’ : output filename for the estimated ‘lead instrument’, appending ‘_voc.wav’ to the radical.

Additionally, the estimates of the ‘accompaniment’ and ‘lead’ that include the unvoiced parts are written to files whose names are those of the corresponding outputs without unvoiced parts, with ‘_VUIMM.wav’ appended.

imageCanvas : instance from MplCanvas or MplCanvas3Axes
canvas used to draw the image of HF0
scaleData : double
maximum value of the input data array. The data array read from the audio file is of integer type, which does not fit well with the algorithm, so this scaleData parameter is needed to convert back and forth between the double and integer representations.
scopeAllowedHF0 : double
scope of allowed F0s around the estimated/given melody line

stftParams : dictionary with the parameters for the time-frequency representation (Short-Time Fourier Transform - STFT), with the keys:

‘hopsize’ : the step, in number of samples, between analysis frames for the STFT

‘NFT’ : the number of Fourier bins on which the Fourier transforms are computed.

‘windowSizeInSamples’ : analysis frame length, in samples

SIMMParams : dictionary with the parameters of the SIMM model (Smoothed Instantaneous Mixture Model [DRDF2010]), with following keys:

‘alphaL’, ‘alphaR’ : double
stereo model, panoramic parameters for the lead part
‘betaL’, ‘betaR’ : (R,) ndarray
stereo model, panoramic parameters for each of the component of the accompaniment part.
‘chirpPerF0’ : integer
number of F0s between two ‘stable’ F0s, modelled as chirps.
‘F0Table’ : (NF0,) ndarray
frequency in Hz for each of the F0s appearing in WF0
‘HF0’ : (NF0*chirpPerF0, N) ndarray, estimated
amplitude array corresponding to the different F0s (this is the array to use to visualise the pitch saliences).
‘HF00’ : (NF0*chirpPerF0, N) ndarray, estimated
amplitude array HF0, after being zeroed everywhere outside the given scope from the estimated melody
‘HGAMMA’ : (P, K) ndarray, estimated
amplitude array corresponding to the different smooth shapes, decomposition of the filters on the smooth shapes in WGAMMA
‘HM’ : (R, N) ndarray, estimated
amplitude array corresponding to the decomposition of the accompaniment on the spectral shapes in WM
‘HPHI’ : (K, N) ndarray, estimated
amplitude array corresponding to the decomposition of the filter part on the filter spectral shapes in WPHI, defined as np.dot(WGAMMA, HGAMMA)
‘K’ : integer
number of filters for the filter part decomposition
‘maxF0’ : double
the highest F0 candidate
‘minF0’ : double
the lowest F0 candidate
‘NF0’ : integer
number of F0s in total
‘niter’ : integer
number of iterations for the estimation algorithm
‘P’ : integer
number of smooth spectral shapes for the filter part (in WGAMMA)
‘R’ : integer
number of spectral shapes for the accompaniment part (in WM)
‘stepNotes’ : integer
number of F0s between two semitones
‘WF0’ : (F, NF0*chirpPerF0) ndarray, fixed
‘dictionary’ of harmonic spectral shapes for the F0 candidates, generated with the KLGLOTT88 model [DRDF2010]
‘WGAMMA’ : (F, P) ndarray, fixed
‘dictionary’ of smooth spectral shapes for the filter part
‘WM’ : (F, R) ndarray, estimated
array of spectral shapes that are directly estimated on the signal
verbose : boolean
if True, the program writes some information about what is happening
wavCanvas : instance from MplCanvas or MplCanvas3Axes
the canvas that is going to be used to draw the input audio waveform
XL, XR : (F, N) ndarray
resp. left and right channel STFT arrays
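
To make the array shapes listed above concrete, the following sketch builds random placeholder arrays with the documented shapes and combines them into a model power spectrum, following the source/filter plus accompaniment structure of [DRDF2010]. It does not call pyfasst. The sizes, the random values and the way the stereo gains are applied (squared gains on each channel) are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (not the library defaults).
NFT = 64
F = NFT // 2 + 1          # half the Fourier bins, + 1
N = 5                     # number of analysis frames
NF0, chirpPerF0 = 6, 1    # F0 candidates
P, K, R = 4, 2, 3         # smooth shapes, filters, accompaniment shapes

WF0 = rng.random((F, NF0 * chirpPerF0))   # fixed harmonic dictionary
HF0 = rng.random((NF0 * chirpPerF0, N))   # estimated F0 amplitudes
WGAMMA = rng.random((F, P))               # fixed smooth spectral shapes
HGAMMA = rng.random((P, K))               # estimated
HPHI = rng.random((K, N))                 # estimated
WM = rng.random((F, R))                   # estimated
HM = rng.random((R, N))                   # estimated

WPHI = WGAMMA @ HGAMMA    # (F, K) filter spectral shapes
SF0 = WF0 @ HF0           # (F, N) source part of the lead
SPHI = WPHI @ HPHI        # (F, N) filter part of the lead
SM = WM @ HM              # (F, N) accompaniment
SX = SF0 * SPHI + SM      # mono model of the mixture power spectrum

# Assumed stereo extension: squared panning gains on each channel.
alphaL, alphaR = 0.8, 0.6
betaL = rng.random(R)
betaR = rng.random(R)
SXL = alphaL**2 * SF0 * SPHI + (WM * betaL**2) @ HM
SXR = alphaR**2 * SF0 * SPHI + (WM * betaR**2) @ HM
```

The element-wise product SF0 * SPHI is what makes the lead a source/filter model, while the accompaniment is a plain non-negative factorisation WM HM.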


Constructor : reads the input audio file, computes the STFT,
generates the different dictionaries (for the source part, harmonic patterns WF0, and for the filter part, smooth patterns WGAMMA).
automaticMelodyAndSeparation :
launches the sequence of methods that estimates the parameters, estimates the melody, re-estimates the parameters and finally separates the lead from the rest, assuming the lead is the most energetic source of the mixture (with some continuity regularity)
estimSIMMParams :
estimates the parameters of the SIMM, i.e. HF0, HPHI, HGAMMA, HM and WM
estimStereoSIMMParams :
estimates the parameters of the stereo version of the SIMM, i.e. same parameters as estimSIMMParams, with the alphas and betas
estimStereoSUIMMParams :
same as above, but first adds ‘noise’ components to the source part
initiateHF0WithIndexBestPath :
computes the initial HF0, before the estimation, given the melody line (estimated or not)
runViterbi :
estimates the melody line from HF0, the energies of each F0 candidate
setOutputFileNames :
triggered when the text fields are changed, changing the output filenames
writeSeparatedSignals :
computes and writes the adaptive Wiener filtered separated files
writeSeparatedSignalsWithUnvoice :
computes and writes the adaptive Wiener filtered separated files, including the unvoiced parts
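
As an illustration of the step performed by initiateHF0WithIndexBestPath (producing HF00 from a melody line), here is a minimal sketch that zeroes an HF0-like array outside a band around the melody. The helper name restrict_hf0_to_melody is hypothetical, and expressing the scope as a number of F0 bins is an assumption; the unit of scopeAllowedHF0 is not stated on this page.

```python
import numpy as np

def restrict_hf0_to_melody(HF0, melody_index, scope_bins):
    """Return a copy of HF0 zeroed outside +/- scope_bins around the melody.

    HF0          : (NF0, N) non-negative amplitude array
    melody_index : length-N sequence, F0 bin index of the melody per frame
    scope_bins   : allowed distance from the melody, in F0 bins (assumed unit)
    """
    n_f0, _ = HF0.shape
    rows = np.arange(n_f0)[:, None]               # (NF0, 1)
    center = np.asarray(melody_index)[None, :]    # (1, N)
    keep = np.abs(rows - center) <= scope_bins    # (NF0, N) boolean mask
    return np.where(keep, HF0, 0.0)

HF0 = np.ones((7, 3))
HF00 = restrict_hf0_to_melody(HF0, melody_index=[0, 3, 6], scope_bins=1)
```

Because the estimation updates are multiplicative in this family of models, entries initialised to zero typically stay at zero, which is why such a mask constrains the re-estimation to the neighbourhood of the melody.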


This is a class that encapsulates our work on source separation, published as:

[DDR2011] J.-L. Durrieu, B. David and G. Richard, A Musically Motivated Mid-Level Representation For Pitch Estimation And Musical Audio Source Separation, IEEE Journal of Selected Topics on Signal Processing, October 2011, Vol. 5 (6), pp. 1180 - 1191.


[DRDF2010] J.-L. Durrieu, G. Richard, B. David and C. Févotte, Source/Filter Model for Main Melody Extraction From Polyphonic Audio Signals, IEEE Transactions on Audio, Speech and Language Processing, special issue on Signal Models and Representations of Musical and Environmental Sounds, March 2010, Vol. 18 (3), pp. 564 - 575.

As of 3/1/2012, available at


Fully automated estimation of melody and separation of signals.


Computes the number of chunks of size maxFrames, and changes maxFrames in case it does not provide long enough chunks (especially the last chunk).
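
A possible reading of this chunking logic is sketched below. The function name chunk_layout, the min_last_frames threshold and the even redistribution rule are all assumptions made for illustration; the actual rule used by the class is not described on this page.

```python
import math

def chunk_layout(total_frames, max_frames, min_last_frames):
    """Return (n_chunks, max_frames), shrinking max_frames when the last
    chunk would otherwise be shorter than min_last_frames (assumed rule)."""
    n_chunks = math.ceil(total_frames / max_frames)
    last = total_frames - (n_chunks - 1) * max_frames
    if n_chunks > 1 and last < min_last_frames:
        # Spread the frames evenly over the same number of chunks.
        max_frames = math.ceil(total_frames / n_chunks)
    return n_chunks, max_frames

print(chunk_layout(2500, 1000, 100))  # last chunk has 500 frames, kept as is
print(chunk_layout(2050, 1000, 100))  # last chunk would have 50 frames
```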


Compute the chroma matrix.

computeMonoX(start=0, stop=None)

Computes and returns SX, the power spectrum of the mono channel or, for a multichannel signal, the mean over the channels of the power spectrum of the signal
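
The quantity described here can be sketched with NumPy as follows. The helper mono_power_spectrum is hypothetical and takes the complex time-frequency arrays directly, whereas the method itself works on the object's own data between the start and stop frames.

```python
import numpy as np

def mono_power_spectrum(XL, XR=None):
    """Power spectrum of one channel, or mean of the power spectra of two."""
    if XR is None:
        return np.abs(XL) ** 2
    return 0.5 * (np.abs(XL) ** 2 + np.abs(XR) ** 2)

XL = np.array([[1 + 0j, 2j]])
XR = np.array([[3 + 0j, 0j]])
SX = mono_power_spectrum(XL, XR)
```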


Computes the number of frames.

computeStereoX(start=0, stop=None)

Compute the transform on each of the channels.

TODO: this function should be modified so that it only uses the pyfasst.tftransforms.tft.TFTransform framework. This could prove complicated though, especially for multiple-chunk processing. Current state (20130820): a hack mainly focussed on the STFT as the TF representation.


Computes the frequency basis for the source part of SIMM. If tfrepresentation is a CQT, it also computes the cqt/hybridcqt transform object.
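
The F0 candidates underlying such a basis can be sketched from the minF0, maxF0 and stepNotes parameters. The helper f0_table is hypothetical, and the log-spaced grid with stepNotes candidates per semitone is an assumption consistent with the parameter descriptions, not a statement of the library's exact formula.

```python
import numpy as np

def f0_table(minF0=39.0, maxF0=2000.0, stepNotes=16):
    """Log-spaced F0 grid with stepNotes candidates per semitone (assumed)."""
    steps_per_octave = 12 * stepNotes
    n_f0 = int(np.ceil(steps_per_octave * np.log2(maxF0 / minF0))) + 1
    return minF0 * 2.0 ** (np.arange(n_f0) / steps_per_octave)

F0Table = f0_table()
NF0 = F0Table.size
```

With this layout, moving stepNotes entries up the table raises the frequency by exactly one semitone.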


Determine Tuning by checking the peaks corresponding to all possible patterns

estimHF0(R=1, maxFrames=1000)

estimating and storing only HF0 for the whole excerpt, with only


Estimates the parameters chunk by chunk and sequentially writes the signals. In the end, concatenates all these separated signals into the desired output files.


same as estimStereoSIMMParamsWriteSeps, but adds the unvoiced element in HF0


If a wav file has already been loaded, the location where the output files are written can be redefined at this point.

Could be used, for instance, between the first estimation or the Viterbi smooth estimation of the melody, and the re-estimation of the parameters.
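
The naming scheme documented for the files dictionary can be sketched with the standard library alone. The function output_file_names is a hypothetical stand-in, not the method itself, and the way the suffix is joined to the input directory is an assumption.

```python
import os

def output_file_names(inputAudioFilename, outputDirSuffix='/'):
    """Build a dictionary with the keys documented for the files attribute."""
    in_dir, base = os.path.split(inputAudioFilename)
    radical = os.path.splitext(base)[0]
    # Subfolder appended to the directory of the input file (assumed join).
    outputDir = os.path.join(in_dir, outputDirSuffix.strip('/'))
    pathBaseName = os.path.join(outputDir, radical)
    return {
        'inputAudioFilename': inputAudioFilename,
        'outputDirSuffix': outputDirSuffix,
        'outputDir': outputDir,
        'pathBaseName': pathBaseName,
        'mus_output_file': pathBaseName + '_acc.wav',
        'voc_output_file': pathBaseName + '_voc.wav',
        'pitch_output_file': pathBaseName + '_pitches.txt',
    }

files = output_file_names(os.path.join('data', 'song.wav'), 'sep/')
```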


Writes the separated signals to the files in self.files. If suffix contains ‘VUIMM’, then this method will take the WF0 and HF0 that contain the estimated unvoiced elements.


A wrapper to give a decent name to the function: simply calling self.writeSeparatedSignals with the ‘_VUIMM.wav’ suffix.
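
The Wiener filtering behind these separated signals can be sketched as a pair of soft masks applied to a complex time-frequency array. The helper wiener_separate and its eps guard are assumptions for illustration; in the SIMM setting, S_lead would be the lead model (source part times filter part) and S_acc the accompaniment model, and the class additionally handles the stereo channels, the inverse transform and the file writing.

```python
import numpy as np

def wiener_separate(X, S_lead, S_acc, eps=1e-12):
    """Split the complex TF array X using two power-spectrum estimates."""
    total = np.maximum(S_lead + S_acc, eps)  # avoid division by zero
    X_lead = (S_lead / total) * X
    X_acc = (S_acc / total) * X
    return X_lead, X_acc

X = np.array([[1 + 1j, 2 - 1j]])
S_lead = np.array([[3.0, 0.0]])
S_acc = np.array([[1.0, 2.0]])
X_lead, X_acc = wiener_separate(X, S_lead, S_acc)
```

Since the two masks sum to one wherever the total model energy is non-zero, the two separated arrays add back up to the mixture.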
