SeparateLeadStereo, with Time-Frequency choice
Provides a class (SeparateLeadProcess) within which several processing steps can be run on an audio file, in order to extract the lead instrument/main voice from a (stereophonic) audio mixture.
copyright (C) 2011 - 2013 Jean-Louis Durrieu
SeparateLeadProcess
Class which implements the source separation algorithm, separating the ‘lead’ voice from the ‘accompaniment’. It can handle the task automatically (the ‘lead’ voice is taken to be the most energetic one), or it can be told explicitly what the ‘lead’ is (through the melody line).
N : the number of analysis input frames
Dictionary containing the filenames of the output files for the separated signals, with the following keys (available after initialization):
‘inputAudioFilename’ : input filename
‘mus_output_file’ : output filename for the estimated ‘accompaniment’, obtained by appending ‘_acc.wav’ to the radical
‘outputDirSuffix’ : the subfolder name appended to the path of the directory of the input file; the output files are written in that subfolder
‘outputDir’ : the full path of the output files directory
‘pathBaseName’ : base name for the output files (full path + radical for all output files)
‘pitch_output_file’ : output filename for the estimated melody line, obtained by appending ‘_pitches.txt’ to the radical
‘voc_output_file’ : output filename for the estimated ‘lead instrument’, obtained by appending ‘_voc.wav’ to the radical
Additionally, when the unvoiced parts are also estimated, the corresponding estimated ‘accompaniment’ and ‘lead’ signals are written to the above filenames with ‘_VUIMM.wav’ appended to the radical.
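A minimal sketch, for illustration, of how the output names above can be derived from the input filename; the helper name and the exact directory handling are assumptions, not the class’s actual code:

    import os

    def deriveOutputNames(inputAudioFilename, outputDirSuffix='_output'):
        # radical = input file name without its directory and extension
        radical = os.path.splitext(os.path.basename(inputAudioFilename))[0]
        outputDir = os.path.join(os.path.dirname(inputAudioFilename),
                                 outputDirSuffix)
        pathBaseName = os.path.join(outputDir, radical)
        return {
            'inputAudioFilename': inputAudioFilename,
            'outputDirSuffix': outputDirSuffix,
            'outputDir': outputDir,
            'pathBaseName': pathBaseName,
            'mus_output_file': pathBaseName + '_acc.wav',
            'voc_output_file': pathBaseName + '_voc.wav',
            'pitch_output_file': pathBaseName + '_pitches.txt',
        }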
stftParams : dictionary with the parameters for the time-frequency representation (Short-Time Fourier Transform - STFT), with the keys:
‘hopsize’ : the step, in number of samples, between analysis frames for the STFT
‘NFT’ : the number of Fourier bins on which the Fourier transforms are computed.
‘windowSizeInSamples’ : analysis frame length, in samples
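For example, this dictionary could be filled as follows; the values are illustrative choices, not the defaults of the class (a 2048-sample window is roughly 46 ms at a 44.1 kHz sampling rate):

    stftParams = {
        'windowSizeInSamples': 2048,  # analysis frame length, in samples
        'hopsize': 256,               # step between analysis frames, in samples
        'NFT': 2048,                  # number of Fourier bins
    }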
SIMMParams : dictionary with the parameters of the SIMM model (Smoothed Instantaneous Mixture Model [DRDF2010]), with the following keys:
- ‘alphaL’, ‘alphaR’ : double
  stereo model, panoramic parameters for the lead part
- ‘betaL’, ‘betaR’ : (R,) ndarray
  stereo model, panoramic parameters for each component of the accompaniment part
- ‘chirpPerF0’ : integer
  number of F0s between two ‘stable’ F0s, modelled as chirps
- ‘F0Table’ : (NF0,) ndarray
  frequency, in Hz, of each of the F0s appearing in WF0
- ‘HF0’ : (NF0*chirpPerF0, N) ndarray, estimated
  amplitude array corresponding to the different F0s (this is what you want for a visual representation of the pitch saliences)
- ‘HF00’ : (NF0*chirpPerF0, N) ndarray, estimated
  the amplitude array HF0, zeroed everywhere outside the scope given by the estimated melody
- ‘HGAMMA’ : (P, K) ndarray, estimated
  amplitude array corresponding to the different smooth shapes: the decomposition of the filters on the smooth shapes in WGAMMA
- ‘HM’ : (R, N) ndarray, estimated
  amplitude array corresponding to the decomposition of the accompaniment on the spectral shapes in WM
- ‘HPHI’ : (K, N) ndarray, estimated
  amplitude array corresponding to the decomposition of the filter part on the filter spectral shapes in WPHI, defined as np.dot(WGAMMA, HGAMMA) (see the sketch after this list)
- ‘K’ : integer
  number of filters for the filter part decomposition
- ‘maxF0’ : double
  the highest F0 candidate
- ‘minF0’ : double
  the lowest F0 candidate
- ‘NF0’ : integer
  total number of F0s
- ‘niter’ : integer
  number of iterations for the estimation algorithm
- ‘P’ : integer
  number of smooth spectral shapes for the filter part (in WGAMMA)
- ‘R’ : integer
  number of spectral shapes for the accompaniment part (in WM)
- ‘stepNotes’ : integer
  number of F0s between two semitones
- ‘WF0’ : (F, NF0*chirpPerF0) ndarray, fixed
  ‘dictionary’ of harmonic spectral shapes for the F0 candidates, generated with the KLGLOTT88 model [DRDF2010]
- ‘WGAMMA’ : (F, P) ndarray, fixed
  ‘dictionary’ of smooth spectral shapes for the filter part
- ‘WM’ : (F, R) ndarray, estimated
  array of spectral shapes estimated directly on the signal
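To make the roles of these arrays concrete, here is a minimal sketch of how they combine into the SIMM model power spectra. Only the relation WPHI = np.dot(WGAMMA, HGAMMA) is stated above; the rest follows the source/filter model of [DRDF2010], and the function name is illustrative:

    import numpy as np

    def simmPowerSpectra(WF0, HF0, WGAMMA, HGAMMA, HPHI, WM, HM):
        """Returns the model power spectrograms (lead, accompaniment)."""
        WPHI = np.dot(WGAMMA, HGAMMA)  # filter dictionary, (F, K)
        SPHI = np.dot(WPHI, HPHI)      # filter part, (F, N)
        SF0 = np.dot(WF0, HF0)         # source part, (F, N)
        SV = SF0 * SPHI                # lead: source/filter product, (F, N)
        SM = np.dot(WM, HM)            # accompaniment part, (F, N)
        return SV, SM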
Methods
- Constructor : reads the input audio file, computes the STFT, and generates the different dictionaries (for the source part, the harmonic patterns WF0, and for the filter part, the smooth patterns WGAMMA)
- automaticMelodyAndSeparation : launches the sequence of methods that estimate the parameters, estimate the melody, re-estimate the parameters and, at last, separate the lead from the rest, considering the lead to be the most energetic source of the mixture (with some continuity regularity); a usage sketch is given after this list
- estimSIMMParams : estimates the parameters of the SIMM, i.e. HF0, HPHI, HGAMMA, HM and WM
- estimStereoSIMMParams : estimates the parameters of the stereo version of the SIMM, i.e. the same parameters as estimSIMMParams, plus the alphas and betas
- estimStereoSUIMMParams : same as above, but first adds ‘noise’ components to the source part
- initiateHF0WithIndexBestPath : computes the initial HF0, before the estimation, given the melody line (estimated or provided)
- runViterbi : estimates the melody line from HF0, the energies of the F0 candidates
- setOutputFileNames : triggered when the text fields are changed, updating the output filenames
- writeSeparatedSignals : computes and writes the adaptive Wiener filtered separated files
- writeSeparatedSignalsWithUnvoice : computes and writes the adaptive Wiener filtered separated files, including the unvoiced parts
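A hypothetical usage sketch for the automatic mode described above; the module name and the constructor arguments are assumptions, to be checked against the actual package:

    from separateLeadStereoTF import SeparateLeadProcess  # module name assumed

    # Constructor: reads the audio file, computes the STFT, builds WF0/WGAMMA
    separator = SeparateLeadProcess('mixture.wav')
    # Fully automatic: estimate the melody, re-estimate the parameters,
    # then write the separated 'lead' and 'accompaniment' files:
    separator.automaticMelodyAndSeparation()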
References
This class encapsulates our work on source separation, published as:
[DDR2011] J.-L. Durrieu, B. David and G. Richard, “A Musically Motivated Mid-Level Representation for Pitch Estimation and Musical Audio Source Separation”, IEEE Journal of Selected Topics in Signal Processing, October 2011, Vol. 5 (6), pp. 1180-1191.
and
[DRDF2010] J.-L. Durrieu, G. Richard, B. David and C. Févotte, “Source/Filter Model for Main Melody Extraction From Polyphonic Audio Signals”, IEEE Transactions on Audio, Speech and Language Processing, special issue on Signal Models and Representations of Musical and Environmental Sounds, March 2010, Vol. 18 (3), pp. 564-575.
As of 3/1/2012, available at http://www.durrieu.ch/research
Fully automated estimation of melody and separation of signals.
Computes the number of chunks of size maxFrames, and adjusts maxFrames in case it would not provide long enough chunks (especially for the last chunk).
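A minimal sketch of that chunking logic; the exact adjustment rule in the class may differ:

    import math

    def computeChunks(nFrames, maxFrames):
        nChunks = int(math.ceil(nFrames / float(maxFrames)))
        # rebalance the chunk size so that the last chunk is not too short:
        maxFrames = int(math.ceil(nFrames / float(nChunks)))
        return nChunks, maxFrames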
Computes and returns SX, the power spectrum of the signal: the mono channel, or the mean over the channels.
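A sketch of that computation, assuming X holds the per-channel transforms with shape (F, N, nChannels):

    import numpy as np

    def computeSX(X):
        """Mean over the channels of the power spectrum of X."""
        return np.mean(np.abs(X) ** 2, axis=-1)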
Computes the transform on each of the channels.
TODO: this function should be modified so that we only use the pyfasst.tftransforms.tft.TFTransform framework. This could prove complicated, though (especially for multiple-chunk processing). Current state (20130820): a hack mainly focused on the STFT as TF representation.
Computes the frequency basis for the source part of the SIMM. If tfrepresentation is a CQT, it also computes the cqt/hybridcqt transform object.
Determines the tuning by checking the peaks corresponding to all possible patterns.
Estimates and stores only HF0 for the whole excerpt, with only …
Estimates the parameters chunk by chunk and writes the corresponding separated signals sequentially. At the end, concatenates all these separated signals into the desired output files.
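A sketch of the final concatenation step, assuming each chunk was written as a WAV file; the helper and the file handling are illustrative:

    import numpy as np
    import scipy.io.wavfile as wav

    def concatenateChunks(chunkFilenames, outputFilename):
        fs, pieces = None, []
        for name in chunkFilenames:
            fs, data = wav.read(name)   # assumes all chunks share one rate
            pieces.append(data)
        wav.write(outputFilename, fs, np.concatenate(pieces, axis=0))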
Same as estimStereoSIMMParamsWriteSeps, but first adds the unvoiced element to HF0.
If a WAV file has already been loaded, this allows redefining, at this point, where the output files should be written.
It could be used, for instance, between the first estimation (or the Viterbi-smoothed estimation of the melody) and the re-estimation of the parameters.