Multi-label prediction using Gibbs sampling (& classifier chains) is a wrapper that takes any binary classification algorithm as a base classifier, constructs an ensemble of classifier chains, and uses Gibbs sampling to make the multi-label predictions. In short, the steps are:

  1. Train an ECC using any base classifier that can predict classes and probabilities. This is done by training a binary classifier for each label on a feature space extended with the other labels, which yields a classifier chain. Repeat this multiple times on different subsets of the training data to create an ensemble of chains.
  2. Use the ECC to make predictions (with Gibbs sampling).
  3. Collapse multiple iterations and models into a final set of predictions.

MLPUGS is an open-source, MIT-licensed R package available on CRAN. The development version is available on GitHub.


Suppose we have a dataset $D$ of $n$ observations and a label set $L$. The $i$-th instance can have one or more labels ($S_i \subseteq L$). In the multi-label classification problem, we are given a new observation $x_*$ and want to predict its label set $\hat{S}_*$. The best-known approach to this problem is binary relevance (BR), which transforms the problem into one binary classification problem for each label. The drawback of this approach is that it ignores the relationships between the response variables, akin to fitting separate multiple regression models when one should instead be using multivariate regression.
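The BR transformation is easy to see in code. The package itself is written in R; the following is a minimal pure-Python sketch with an illustrative toy dataset and a simple nearest-neighbour base learner (none of these names come from MLPUGS):

```python
def knn_predict_proba(X_train, y_train, x, k=3):
    """P(y = 1 | x) as the fraction of 1s among the k nearest training points."""
    nearest = sorted(range(len(X_train)),
                     key=lambda i: sum((a - b) ** 2 for a, b in zip(X_train[i], x)))
    return sum(y_train[i] for i in nearest[:k]) / k

# Toy multi-label data: 2 features, 3 labels per instance.
X = [[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8], [0.1, 0.9], [0.9, 0.1]]
Y = [[1, 0, 0], [1, 0, 0], [0, 1, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]]

def br_predict(X_train, Y_train, x_new):
    """Binary relevance: one independent binary problem per label, all on the
    same feature space; correlations between the labels are ignored."""
    n_labels = len(Y_train[0])
    return [knn_predict_proba(X_train, [row[j] for row in Y_train], x_new)
            for j in range(n_labels)]

probs = br_predict(X, Y, [0.05, 0.05])
print(probs)
```

Each of the three classifiers here is fit and queried in complete isolation, which is exactly the independence assumption that the chain-based approaches below are designed to relax.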

Classifier chains (CC; Read et al., 2009) are a type of BR that makes use of the correlations between the labels. Each link in the chain is trained to predict label $l_j \in L$ using a feature space extended to include $(l_1,\ldots,l_{j-1})$. An Ensemble of Classifier Chains (ECC) trains $m$ CC classifiers, where each $C_k$ is trained on a random subset of $D$; the chains are therefore likely to be unique and to give different multi-label predictions.
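The chain construction and the ensemble can be sketched as follows. Again this is illustrative Python, not the package's R implementation; the "model" for each link is simply its stored (extended) training set, to be consumed by a nearest-neighbour base learner at prediction time:

```python
import random

def fit_chain(X, Y):
    """One classifier chain: the model for label j is trained on the features
    extended with labels 1..j-1."""
    n_labels = len(Y[0])
    chain = []
    for j in range(n_labels):
        Xj = [x + row[:j] for x, row in zip(X, Y)]  # extended feature space
        yj = [row[j] for row in Y]
        chain.append((Xj, yj))
    return chain

def fit_ecc(X, Y, m=3, seed=0):
    """Ensemble of Classifier Chains: m chains, each on a bootstrap resample of D."""
    rng = random.Random(seed)
    ensemble = []
    for _ in range(m):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        ensemble.append(fit_chain([X[i] for i in idx], [Y[i] for i in idx]))
    return ensemble

X = [[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]]
Y = [[1, 0], [1, 0], [0, 1], [0, 1]]
ecc = fit_ecc(X, Y, m=3)
print(len(ecc), len(ecc[0]))  # 3 chains, one model per label in each
```

Because each chain sees a different bootstrap sample, the ensemble members can disagree, and their combined votes are what get collapsed into the final predictions in step 3.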

Of course, the problem with this approach is that to make a prediction for any one label, we need to know the other labels, which are exactly what we are trying to predict. This is where the idea of Gibbs sampling comes in. We start with random predictions, then proceed label by label, conditioning on the most recent predictions within each iteration. After the burn-in, we should have a stable multivariate distribution of the labels for each observation.
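The sampling loop described above can be sketched in a few lines. This follows the step-1 formulation where each label's classifier conditions on all the other labels (which is what makes Gibbs sampling necessary). As before, this is an illustrative pure-Python sketch with a nearest-neighbour base learner and made-up toy data, not the package's R code:

```python
import random

def knn_proba(Xtr, ytr, x, k=3):
    """Base learner: P(y = 1 | x) = fraction of 1s among the k nearest points."""
    nearest = sorted(range(len(Xtr)),
                     key=lambda i: sum((a - b) ** 2 for a, b in zip(Xtr[i], x)))
    return sum(ytr[i] for i in nearest[:k]) / k

def gibbs_predict(models, x, burn_in=20, n_samples=50, seed=1):
    """models[j] = (Xj, yj): training set for label j with the OTHER labels
    appended to the features. Start from random labels, then repeatedly
    resample each label conditioned on the current values of the rest."""
    rng = random.Random(seed)
    n_labels = len(models)
    labels = [rng.randint(0, 1) for _ in range(n_labels)]   # random start
    counts = [0] * n_labels
    for sweep in range(burn_in + n_samples):
        for j in range(n_labels):
            Xj, yj = models[j]
            others = labels[:j] + labels[j + 1:]            # condition on the rest
            p = knn_proba(Xj, yj, x + others)
            labels[j] = 1 if rng.random() < p else 0
        if sweep >= burn_in:                                # discard burn-in sweeps
            for j in range(n_labels):
                counts[j] += labels[j]
    return [c / n_samples for c in counts]                  # marginal P(l_j = 1 | x)

# Toy training data: 1 feature, 2 (mutually exclusive) labels.
X = [[0.0], [0.1], [0.2], [0.9], [1.0]]
Y = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]]
models = []
for j in range(len(Y[0])):
    Xj = [x + row[:j] + row[j + 1:] for x, row in zip(X, Y)]
    yj = [row[j] for row in Y]
    models.append((Xj, yj))

probs = gibbs_predict(models, [0.05])
print(probs)  # x = 0.05 lies deep in the 'label 0' region of the toy data
```

Averaging the post-burn-in samples gives the marginal probability of each label; running this for every chain in the ensemble and pooling the results corresponds to the collapse in step 3.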