Jean-Louis DURRIEU
PhD Candidate at the Ecole Nationale Supérieure des Télécommunications (ENST)



I am doing my PhD research under the supervision of Gaël Richard and Bertrand David, at Télécom ParisTech.
We are working on the following subject:

Automatic Extraction of the Main Melody In Music Signals

This research is funded partly by the European K-Space project and partly by the OSEO Quaero project.


Some links to demo pages related to our papers:


Some milestones of our research, as well as some insights about our techniques (for more details, see the publications page):

***
03/2008
Our accepted paper at ICASSP 2008 on singer melody estimation using source separation techniques can be found on my publications page, along with the poster Gaël and I will be presenting in Las Vegas.
Our next step is to write an article on the GMM framework for source separation and main melody estimation, and to compare it with our instantaneous mixture model for the singing voice, which is essentially an extension of the GMM model. Source separation examples are still available on the results page.

***
02/2008
The ICASSP 2008 conference paper has been accepted for poster presentation at the Las Vegas conference, in the poster session entitled "Source Separation II". I will personally present the poster at that session, so see you on Wednesday, April 2nd, in Las Vegas!
We are working on a journal article describing in more detail the methods we use and the results we obtain. We are also working on a demo web page and program.


***
10/2007
We have submitted an article to ICASSP 2008, in which we explain our present model. The model is based on NMF techniques for source separation, to which we added a source-filter model to fit the vocal source. Some examples of the source separation can be listened to here.
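
For readers curious about the NMF building block itself, here is a minimal Python sketch of the classical multiplicative updates for the Euclidean cost, applied to a non-negative magnitude spectrogram. This is a generic textbook version, not our actual source-filter formulation; the function and parameter names are only illustrative.

    # Minimal NMF sketch: factorize a non-negative magnitude spectrogram V
    # (frequency x time) into a dictionary W of spectral shapes and a matrix H
    # of activations, with the classical multiplicative updates (Euclidean cost).
    import numpy as np

    def nmf(V, n_components=20, n_iter=100, eps=1e-12):
        n_freq, n_frames = V.shape
        rng = np.random.default_rng(0)
        W = rng.random((n_freq, n_components)) + eps
        H = rng.random((n_components, n_frames)) + eps
        for _ in range(n_iter):
            H *= (W.T @ V) / (W.T @ W @ H + eps)   # update the activations
            W *= (V @ H.T) / (W @ H @ H.T + eps)   # update the dictionary
        return W, H

In our setting, V would typically be the magnitude of the STFT of the (estimated) music part.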

***
07/2007

Music Information Retrieval (MIR) is a field that aims at mining music databases in the smartest way possible. To draw a parallel with text information retrieval, it is as if you could search the Internet with your favourite search engine using complete sentences, even questions, instead of restrictive keywords, and somehow get better results (however well some search engines might already work...).
Our field is more challenging. In text IR, we already have the material that allows us to mine the database, namely the words themselves, so we need not guess the low-level content of the document. For music, the situation is closer to a speech-to-text task: we first need to find the "words", that is to say, the notes themselves. Hence our work.
What is more, in order to build an effective IR database, one needs relevant metadata. For texts, this would be, for example, the style, the language, the subject and so on. For music, it can be the genre, the artist, the instruments and so on. However, for this kind of media, data mining nowadays relies essentially on manually submitted metadata, which is subjective, and not even with respect to the user, but with respect to some other person with a potentially different culture, hence another way of classifying the data (the genre, for example). We therefore need automatic annotation schemes that would give more accurate and user-related results. Current research on genre classification and the like is usually based on global timbre information, which would be equivalent to categorizing texts only by their shape, without looking at the words. That would work up to a certain extent, for example to tell poems from novels, but not biographies from fiction. What we are investigating are ways to retrieve the notes in a musical piece. Notes are one step higher in level than the raw sound waveform, but they constitute the low-level feature on which written (Western) music is based.

We decided to focus on rather specific pieces, in which one can hear a singer over some background music. This is quite restrictive compared to the whole variability of music, but it seems to cover a good share of what people might want to search for in databases, from traditional songs to pop music. We model such songs through their Short-Time Fourier Transform (STFT), considering it as a mix of two sub-signals (sub-STFTs): one for the voice and one for the music.
The model for the vocal part is a spectral Gaussian Mixture Model (GMM), while the music part is regarded as a decomposition on the elements of a dictionary of spectral Gaussians. With this model, we end up alternately computing the Wiener estimator of the voice, in order to infer the GMM parameters, and then the Wiener estimator of the music, followed by a Non-negative Matrix Factorization (NMF) of the spectrogram (or, equivalently, of the STFT amplitude) of the obtained estimate.
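
To give a flavour of this alternation, here is a schematic Python sketch of one iteration: Wiener-filter the mixture STFT with the current voice and music power spectral densities, then refit the music part by NMF of the magnitude of its estimate (reusing the nmf() sketch above). The variable names and the squared-magnitude shortcut for the PSD are simplifying assumptions; the actual parameter updates are more involved.

    # Sketch of one alternation step: Wiener-filter the mixture STFT with the
    # current voice/music power spectral densities (PSDs), then refit the music
    # PSD by NMF of the magnitude of its Wiener estimate (nmf() as sketched above).
    import numpy as np

    def wiener_split(X, S_voice, S_music, eps=1e-12):
        # X: mixture STFT (frequency x time); S_voice, S_music: current PSDs
        gain_voice = S_voice / (S_voice + S_music + eps)
        X_voice = gain_voice * X           # Wiener estimate of the voice STFT
        X_music = (1.0 - gain_voice) * X   # Wiener estimate of the music STFT
        return X_voice, X_music

    def refit_music(X_music, n_components=20):
        W, H = nmf(np.abs(X_music), n_components=n_components)
        return (W @ H) ** 2                # refreshed (approximate) music PSD
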
This is an audio source separation scheme that allows us to physically extract the desired melody and transcribe it into a music score. We also introduced a pitch-dependent parameterization of the vocal model, which directly gives us the music score once the parameters are well estimated (see the sketch after the list below). We believe that this way of dealing with vocal + background music pieces is relevant for several reasons:
  • Human voices, because of their rather large pitch instability, are not easily estimated or captured by NMF decompositions;
  • the spectral Gaussian representation of the power spectral density of the signals is well founded, if we consider the temporal signals to be second-order stationary;
  • conceptually, this model allows the vocal part to have only one state per frame, since it uses a GMM, while the music part is allowed to be the sum of several contributions from different sources (such as instruments), thanks to the NMF model.
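
As announced above, here is a toy Python sketch of how a pitch-indexed vocal model can be read out as a melody line, assuming the vocal activations come as a pitch x time matrix on a semitone grid; this is not exactly our parameterization, just an illustration of the read-out step.

    # Toy melody read-out: given per-frame activations of pitch-labelled vocal
    # states (pitch x time), pick the strongest pitch in each frame and map the
    # bin index to a frequency in Hz on an assumed semitone grid starting at f_min.
    import numpy as np

    def read_melody(vocal_activations, f_min=100.0, bins_per_semitone=1):
        best_bins = np.argmax(vocal_activations, axis=0)     # one state per frame
        semitones = best_bins / bins_per_semitone
        f0_track = f_min * 2.0 ** (semitones / 12.0)
        voiced = vocal_activations.max(axis=0) > 0.0         # very crude voicing decision
        return np.where(voiced, f0_track, 0.0)               # 0 Hz marks unvoiced frames
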
Our source separation system is able to separate the vocal signal with quite a good Signal-to-Interference Ratio (SIR) of about 10 to 20 dB. Unfortunately, the Signal-to-Artifacts Ratio (SAR) is around 0 dB.
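
For reference, such figures can be computed with the BSS Eval metrics; below is a small Python sketch using the mir_eval package (one possible tool among others; the file names are placeholders, and mono signals of equal length are assumed).

    # Computing SDR/SIR/SAR with the BSS Eval metrics through the mir_eval
    # package; mono reference and estimated signals of equal length are assumed,
    # and the file names are placeholders.
    import numpy as np
    import soundfile as sf
    import mir_eval

    voice_ref, sr = sf.read("voice_reference.wav")
    music_ref, _ = sf.read("music_reference.wav")
    voice_est, _ = sf.read("voice_estimate.wav")
    music_est, _ = sf.read("music_estimate.wav")

    references = np.vstack([voice_ref, music_ref])   # (n_sources, n_samples)
    estimates = np.vstack([voice_est, music_est])
    sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(references, estimates)
    print("SIR (dB):", sir, "SAR (dB):", sar)
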
Here are some files we produced; the "remix" files usually contain the estimated vocal part on the left channel and the estimated music part on the right channel. One will notice at once that the vocal part is still present in the music part, and that the vocal part estimated by our algorithm contains quite a lot of artifacts, even if the singer remains rather intelligible. We still need to evaluate the vocal model itself before further testing the whole system.
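
For completeness, here is how such a "remix" file can be put together in Python with the soundfile package, voice on the left and music on the right (the file names are placeholders, and mono input files are assumed).

    # Building a stereo "remix" file: estimated voice on the left channel,
    # estimated music on the right (placeholder file names, mono inputs assumed).
    import numpy as np
    import soundfile as sf

    voice_est, sr = sf.read("voice_estimate.wav")
    music_est, _ = sf.read("music_estimate.wav")
    n = min(len(voice_est), len(music_est))
    sf.write("remix.wav", np.stack([voice_est[:n], music_est[:n]], axis=1), sr)
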
Some files to listen to:
These are not the most up-to-date files, but they give a good insight into what we are doing at the moment. More soon, and hopefully there will be a paper on the subject some day...
