OutputDescriptor attributes for matrix processing Vamp plug-in

Some time ago, I started to develop a plug-in to visualize the fundamental frequency (F0) salience representation I developed during my PhD thesis: https://github.com/wslihgt/IMMF0salience. I’d like to share with my fellow plug-in developers one part of my experience that took me some time to understand: getting the right attributes for the OutputDescriptor of a visualization plug-in.

The algorithm at the core of my plug-in, which computes the F0 salience, is based on **non-negative matrix factorization (NMF)**. For this to work in an “optimal” way, it is better to load all the frequency-domain frames and process them at once. In fact, it is “necessary” to some extent, in order to capture long-term tendencies. Using BLAS or other libraries, processing in batches can also be computationally a bit more efficient.

I actually managed to make this work the way I wanted! It was however not so straightforward to work out from the documentation, so I thought I would post a few words on my experience…

Implementation, failures, workaround
At first, my implementation only worked as follows: at each frame, the host calls the process method, in which I compute the updates for the matrices in my model for that frame only. In short: I had to do the processing steps in process, and each time output the features through the FeatureSet returned at the end.

As mentioned above, I want to load several frames into a “matrix” in process, and then, as soon as the matrix is filled (for now, with a fixed number of frames to process each time), update the model matrices for all the stored frames at once. Once this is done, I put the columns of the matrix, each of which corresponds to a frame, into the FeatureSet map, and return that FeatureSet and…
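The buffering idea can be sketched as follows. This is a minimal, self-contained illustration with stand-in types rather than the actual Vamp SDK classes; the member names m_numberFrames and the matrix buffer mirror those used in this post, and the NMF update itself is elided:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of the batching pattern: accumulate one column per incoming
// frame, and only run the (elided) NMF updates once a full batch of
// m_numberFrames columns is available.
class FrameBuffer {
public:
    explicit FrameBuffer(std::size_t numberFrames)
        : m_numberFrames(numberFrames) {}

    // Called once per input frame, like Plugin::process(). Returns true
    // when a full matrix of m_numberFrames columns has been processed.
    bool process(const std::vector<float>& frame) {
        m_matrix.push_back(frame); // store one column per frame
        if (m_matrix.size() < m_numberFrames) return false;
        // ... here the NMF updates would run on all stored columns ...
        m_matrix.clear();          // start filling the next batch
        return true;
    }

    std::size_t pending() const { return m_matrix.size(); }

private:
    std::size_t m_numberFrames;               // frames per batch
    std::vector<std::vector<float>> m_matrix; // one column per frame
};
```

Note that with this scheme, any frames buffered when the input ends still need a final processing pass; more on that below.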

… and in my first attempts, I got nothing. Well, not exactly: the plug-in seemed to return only the last feature that was computed, the last column of the processed matrix. So, for instance, if I decided to process 2 frames in one go, I only got the result for one frame out of two. For 3 frames, only one out of 3, and so forth…

At first, I thought I had not implemented it correctly, especially with respect to the intended use of the process and getRemainingFeatures methods. But in the end I found out, thanks to reading through the code of NNLSchroma (http://www.isophonics.net/nnls-chroma), that I had given the wrong attributes to the corresponding OutputDescriptor. In the getOutputDescriptors method, I indeed had something like:

OutputDescriptor d;
d.identifier = "f0salience";
d.name = "Salience of F0s";
d.description = "This representation should emphasize the salience of the different F0s in the signal.";
d.unit = "";
d.hasFixedBinCount = true;
if (m_blockSize == 0) {
    // Just so as not to return "1". This is the bin count that
    // would result from a block size of 1024, which is a likely
    // default -- but the host should always set the block size
    // before querying the bin count for certain.
    d.binCount = 513;
} else {
    d.binCount = m_numberOfF0;
}
d.binNames = f0values;
d.hasKnownExtents = false;
d.isQuantized = false;
d.sampleType = OutputDescriptor::OneSamplePerStep;
d.hasDuration = false;

The key issue seems to be the sampleType attribute, along with the associated sampleRate. There are several possibilities, whose differences, to my taste, were difficult to grasp from the provided guide alone… The documentation of Plugin.h is not too helpful on this aspect either.

The original attempt used OneSamplePerStep. The name might make the behaviour seem obvious, but not quite: such an output is supposed to be returned by process for each frame. In that case, it would seem that the host only reads the last computed feature that was push_back'ed into the corresponding FeatureList.

Since that did not work, I figured I could try the VariableSampleRate sample type: with that one, according to the documentation, one has to provide the timestamp attribute for each Feature. I did so, with something like:

Feature f0salienceFeat;
f0salienceFeat.hasTimestamp = true;
f0salienceFeat.timestamp = timestamp
    - Vamp::RealTime::frame2RealTime((m_numberFrames-1-nf)*m_stepSize,
                                     lrintf(m_inputSampleRate));

With this, however, for some reason, the output was no longer rendered as an image in SV, but rather as a sequence of values. All in all, not what was desired.

I might have tried more things here and there, but let’s jump to the solution I got from the NNLSchroma plug-in: the sampleType of the OutputDescriptor needs to be declared as FixedSampleRate. The code above therefore needs to be modified as follows:

d.sampleType = OutputDescriptor::FixedSampleRate;
d.sampleRate = (m_stepSize == 0)?
    m_inputSampleRate / 2048 :
    m_inputSampleRate / m_stepSize;
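As I understand it (this is my reading of the behaviour, not SDK-documented code), with FixedSampleRate the host places the i-th feature of the list at i / sampleRate seconds, so choosing sampleRate = m_inputSampleRate / m_stepSize makes feature i land exactly on hop i. A tiny stand-alone check of that arithmetic, with a hypothetical helper:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper, not part of the Vamp SDK: the time at which the
// host should place the i-th feature of a FixedSampleRate output.
double featureTimeSeconds(int featureIndex,
                          double inputSampleRate, int stepSize) {
    // output sample rate, as set on the descriptor above
    double outputSampleRate = inputSampleRate / stepSize;
    return featureIndex / outputSampleRate;
}
```

For instance, with a 44100 Hz input and a step size of 1024, the 4th feature falls at 4 × 1024 / 44100 seconds, i.e. exactly on the 4th hop.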

Furthermore, I added the output feature set as a member, m_featureSet, of the plug-in class. This way, in my process method, I can run my NMF algorithm every m_numberFrames frames, then add (push_back) the generated features to the current m_featureSet:

Feature f0salienceFeat;
// put interesting stuff in f0salienceFeat
// assuming the desired output is the first one:
m_featureSet[0].push_back(f0salienceFeat);

The process method can then return any feature set you like, as long as it does not contain the f0 salience features. Actually, what worked for me was to simply return an empty feature set:

return FeatureSet();

At last, m_featureSet is returned by the getRemainingFeatures method:

return m_featureSet;
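Putting the pieces together, the end-of-stream step can be sketched as follows. Feature and FeatureSet here are simplified stand-ins for the Vamp SDK classes, the member names are mine, and the NMF update is again elided:

```cpp
#include <cassert>
#include <map>
#include <vector>

// Stand-in types so the end-of-stream logic can be shown without the
// Vamp SDK; in the real plug-in these are Vamp::Plugin::Feature and
// Vamp::Plugin::FeatureSet.
struct Feature { std::vector<float> values; };
typedef std::map<int, std::vector<Feature>> FeatureSet;

struct Flusher {
    std::vector<std::vector<float>> m_matrix; // frames not yet processed
    FeatureSet m_featureSet;                  // features accumulated so far

    // Mirrors getRemainingFeatures(): finish the last, possibly partial,
    // batch and return everything collected during the run.
    FeatureSet getRemainingFeatures() {
        for (const auto& column : m_matrix) {
            // ... real code would run the NMF updates here first ...
            Feature f;
            f.values = column;        // one feature per remaining frame
            m_featureSet[0].push_back(f);
        }
        m_matrix.clear();
        return m_featureSet;
    }
};
```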

With this, I am now able to compute features and output them with the correct timing, resulting in an image in SV, well synchronized with the audio.

To sum up, here is what I learned from this experience:

  • About the possible values of the sampleType attribute of the output descriptor:
    • OneSamplePerStep means that the feature returned by process is used for the current frame only; no timestamp information can change that. For example, this is suitable for a plug-in computing a power spectrum, which only requires data from the current frame. It can also be used if we use information from previous frames, but not if we need data from following ones.
    • VariableSampleRate is appropriate for detection tasks, but does not seem designed for visualization plug-ins. It is good for, say, onset detection: even if the whole signal’s data is required, we can still return the feature at any time, and it will be aligned with the correct frame (this is more of a guess, as I did not check further: would it, for example, still work as expected if the plug-in returned features in non-chronological order?).
    • FixedSampleRate means that the host takes the features in the feature list in the given order and assumes they are aligned at fixed timestamps, whose spacing is given by the sampleRate attribute. This is suitable for NMF algorithms that need to compute several frames at once, potentially needing frames from both the past and the future of the current frame in process. Processing the data at some other rate (every 1000 frames, for instance), the feature set can be returned at the end by the getRemainingFeatures method.
  • About how process and getRemainingFeatures work: it was somehow not very clear to me when the latter method was called. Now it seems clearer: it is called once there are no more frames to be processed. That is the right moment to finish the estimation on the spectra that were stored but not yet processed.

Finally, it is important to state that these conclusions mostly result from some reverse engineering, and it would be very useful if the developers of the framework could comment on them, correcting or confirming my experiences! While a closer look at the documentation (http://www.vamp-plugins.org/guide.pdf, https://code.soundsoftware.ac.uk/projects/vamp-plugin-sdk/wiki/SampleType) reveals that I could have found that piece of information earlier, I am still confident that providing more examples and reporting hands-on experience with a specific feature will help anyone interested in developing this kind of plug-in!
