Introduction
Supervised learning often comes with an implicit assumption: you need a large amount of labeled data.
At the same time, many models are capable of discovering structure in data without any labels at all.
Generative models, in particular, often organize data into meaningful clusters during unsupervised training. When trained on images, they may naturally separate digits, objects, or styles in their latent representations.
This raises a simple but important question:
If a model has already discovered the structure of the data without labels, how much supervision is actually needed to turn it into a classifier?
In this article, we explore this question using a Gaussian Mixture Variational Autoencoder (GMVAE) (Dilokthanakul et al., 2016).
Dataset
We use the EMNIST Letters dataset introduced by Cohen et al. (2017), which is an extension of the original MNIST dataset.
- Source: NIST Special Database 19
- Processed by: Cohen et al. (2017)
- Size: 145 600 images (26 balanced classes)
- Ownership: U.S. National Institute of Standards and Technology (NIST)
- License: Public domain (U.S. government work)
Disclaimer
The code provided in this article is intended for research and reproducibility purposes only.
It is currently tailored to the MNIST and EMNIST datasets and is not designed as a general-purpose framework.
Extending it to other datasets requires adaptations (data preprocessing, architecture tuning, and hyperparameter selection). Code and experiments are available on GitHub: https://github.com/murex/gmvae-label-decoding
This choice is not arbitrary: EMNIST is far more ambiguous than the classical MNIST dataset, which makes it a better benchmark to highlight the importance of probabilistic representations (Figure 1).
The GMVAE: Learning Structure in an Unsupervised Way
A standard Variational Autoencoder (VAE) is a generative model that learns a continuous latent representation of the data.
More precisely, each data point $x$ is mapped to a multivariate normal distribution $q(z \mid x) = \mathcal{N}(\mu(x), \Sigma(x))$, called the posterior.
However, this is not sufficient if we want to perform clustering. With a standard Gaussian prior, the latent space tends to remain continuous and does not naturally separate into distinct groups.
This is where GMVAEs come into play.
A GMVAE extends the VAE by replacing the prior with a mixture of $K$ components, where $K$ is chosen beforehand.
To achieve this, a new discrete latent variable $c \in \{1, \dots, K\}$ is introduced:

$$c \sim \mathrm{Cat}(\pi), \qquad z \mid c \sim \mathcal{N}(\mu_c, \Sigma_c)$$
This allows the model to learn a posterior distribution over clusters:

$$q(c \mid x), \quad c = 1, \dots, K$$
Each component of the mixture can then be interpreted as a cluster.
In other words, GMVAEs intrinsically learn clusters during training.
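The generative side of this prior can be sketched in a few lines of NumPy. The mixing weights, means, and scales below are arbitrary placeholders, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

K, D = 100, 16                      # number of mixture components, latent dimension
pi = np.full(K, 1.0 / K)            # uniform mixing weights (an assumption)
mu = rng.normal(size=(K, D))        # per-component means (placeholders)
sigma = np.ones((K, D))             # per-component diagonal standard deviations

def sample_latent():
    """Draw (c, z) from the mixture prior: c ~ Cat(pi), z | c ~ N(mu_c, diag(sigma_c^2))."""
    c = rng.choice(K, p=pi)
    z = rng.normal(mu[c], sigma[c])
    return c, z

c, z = sample_latent()
print(c, z.shape)  # a cluster index in [0, K) and a D-dimensional latent vector
```

In the full model, a decoder network would then map $z$ to an image; here we only illustrate the discrete-plus-continuous latent structure.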
The choice of $K$ controls a trade-off between expressivity and reliability.
- If $K$ is too small, clusters tend to merge distinct styles or even different letters, limiting the model's ability to capture fine-grained structure.
- If $K$ is too large, clusters become too fragmented, making it harder to estimate reliable label–cluster relationships from a limited labeled subset.
We choose $K = 100$ as a compromise: large enough to capture stylistic variations within each class, yet small enough to ensure that each cluster is sufficiently represented in the labeled data (Figure 1).

Different stylistic variants of the same letter are captured, such as an uppercase F (c=36) and a lowercase f (c=0).
However, clusters are not pure: for instance, component c=73 predominantly represents the letter "T", but also includes samples of "J".
Turning Clusters Right into a Classifier
As soon as the GMVAE is skilled, every picture is related to a posterior distribution over clusters: .
In observe, when the variety of clusters is unknown, it may be handled as a hyperparameter and tuned by way of grid search.
A pure thought is to assign every information level to a single cluster.
Nevertheless, clusters themselves don’t but have semantic which means. To attach clusters to labels, we’d like a labeled subset.
A pure baseline for this job is the classical cluster-then-label method: information are first clustered utilizing an unsupervised methodology (e.g. k-means or GMM), and every cluster is assigned a label primarily based on the labeled subset, sometimes by way of majority voting.
This corresponds to a tough project technique, the place every information level is mapped to a single cluster earlier than labeling.
In distinction, our method doesn’t depend on a single cluster project.
As a substitute, it leverages the total posterior distribution over clusters, permitting every information level to be represented as a mix of clusters somewhat than a single discrete project.
This may be seen as a probabilistic generalization of the cluster-then-label paradigm.
How many labels are theoretically required?
In an ideal scenario, clusters are perfectly pure: each cluster corresponds to a single class. In such a case, clusters would also have equal sizes.
Still in this ideal setting, suppose we can choose which data points to label.
Then, a single labeled example per cluster would be sufficient: only $K$ labels in total.
In our setting (N = 145 600, K = 100), this corresponds to only 0.07% of labeled data.
However, in practice, we assume that labeled samples are drawn at random.
Under this assumption, and still assuming equal cluster sizes, we can derive an approximate lower bound on the amount of supervision needed to cover all clusters with a given level of confidence.
In our case (K = 100), we obtain a minimum of roughly 0.6% labeled data to cover all clusters with 95% confidence.
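One way to reproduce a bound of this kind is a union-bound, coupon-collector-style argument: with equal cluster sizes, the probability that some cluster receives no label among $n$ random draws is at most $K(1 - 1/K)^n$, which can be inverted for the smallest sufficient $n$. This is our own derivation; the article's exact computation may differ slightly:

```python
import math

def min_labels_for_coverage(K: int, confidence: float) -> int:
    """Smallest n such that K * (1 - 1/K)**n <= 1 - confidence (union bound)."""
    miss = 1.0 - confidence
    return math.ceil(math.log(K / miss) / -math.log(1.0 - 1.0 / K))

N, K = 145_600, 100
n = min_labels_for_coverage(K, 0.95)
print(n, f"{n / N:.2%}")  # 757 labels, i.e. 0.52% of the data
```

With $K = 100$ and 95% confidence this gives 757 labels, about 0.52% of the 145 600 images, the same order of magnitude as the roughly 0.6% quoted above.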
We can relax the equal-size assumption and derive a more general inequality, although it does not admit a closed-form solution.
Unfortunately, all these calculations are optimistic:
in practice, clusters are not perfectly pure. A single cluster may, for example, contain both "i" and "l" in similar proportions.
And now, how do we assign labels to the remaining data?
We compare two different ways of assigning labels to the remaining (unlabeled) data:
- Hard decoding: we ignore the probability distributions provided by the model
- Soft decoding: we fully exploit them
Hard decoding
The idea is straightforward.
First, we assign a unique label to each cluster using the labeled subset.
More precisely, we associate each cluster with the most frequent label among the labeled points assigned to it.
Now, given an unlabeled image $x$, we assign it to its most likely cluster:

$$\hat{c}(x) = \arg\max_{c} \, q(c \mid x)$$

We then assign to $x$ the label $\ell(\hat{c}(x))$ associated with this cluster:

$$\hat{y}_{\mathrm{hard}}(x) = \ell(\hat{c}(x))$$
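A minimal NumPy sketch of this rule, assuming the posteriors $q(c \mid x)$ are available as a matrix with one row per image (all array and function names here are our own):

```python
import numpy as np

def fit_cluster_labels(q_lab, y_lab, K):
    """Majority-vote label for each cluster, estimated from the labeled subset.

    q_lab: (n_lab, K) posteriors q(c|x); y_lab: (n_lab,) integer labels.
    Returns an array mapping cluster index -> label (-1 for unseen clusters).
    """
    hard = q_lab.argmax(axis=1)            # hard cluster assignment per labeled point
    n_classes = y_lab.max() + 1
    votes = np.zeros((K, n_classes))
    np.add.at(votes, (hard, y_lab), 1)     # count label occurrences per cluster
    labels = votes.argmax(axis=1)
    labels[votes.sum(axis=1) == 0] = -1    # clusters with no labeled point
    return labels

def hard_decode(q, cluster_labels):
    """Predict by mapping each image's most likely cluster to that cluster's label."""
    return cluster_labels[q.argmax(axis=1)]

# toy usage: 3 clusters, 2 classes
q_lab = np.array([[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.2, 0.1, 0.7]])
y_lab = np.array([0, 1, 1])
cl = fit_cluster_labels(q_lab, y_lab, K=3)
print(hard_decode(np.array([[0.6, 0.3, 0.1]]), cl))  # -> [0]
```

Clusters that receive no labeled point get the sentinel label -1; in practice one would either relabel them or fall back to soft decoding for those clusters.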
However, this approach suffers from two major limitations:
1. It ignores the model's uncertainty for a given input (the GMVAE may "hesitate" between several clusters)
2. It assumes that clusters are pure, i.e. that each cluster corresponds to a single label, which is often not true
This is precisely what soft decoding aims to address.
Soft decoding
Instead of assuming that each cluster corresponds to a single label, we use the labeled subset to estimate, for each label $y$, a probability vector $p_y$ of dimension $K$:

$$(p_y)_c = \hat{P}(c \mid y), \quad c = 1, \dots, K$$

This vector empirically represents the probability of belonging to each cluster $c$, given that the true label is $y$: in other words, it is an empirical estimate of $P(c \mid y)$!
At the same time, the GMVAE provides, for each image $x$, a posterior probability vector:

$$q_x = \big(q(c = 1 \mid x), \dots, q(c = K \mid x)\big)$$

We then assign to $x$ the label that maximizes the similarity between $q_x$ and $p_y$:

$$\hat{y}_{\mathrm{soft}}(x) = \arg\max_{y} \, \mathrm{sim}(q_x, p_y)$$

This soft decision rule naturally takes into account:
- The model's uncertainty about $x$, by using the full posterior rather than only its maximum
- The fact that clusters are not perfectly pure, by allowing each label to be associated with several clusters
This can be interpreted as comparing $q(c \mid x)$ with $P(c \mid y)$, and selecting the label whose cluster distribution best matches the posterior of $x$!
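The soft rule can be sketched analogously. Here we estimate $p_y$ as the average posterior over the labeled points of class $y$, and use a plain dot product as the similarity; the article does not pin down either choice, so both are assumptions:

```python
import numpy as np

def fit_label_profiles(q_lab, y_lab, n_classes):
    """Empirical P(c | y): mean posterior q(c|x) over labeled points of class y."""
    return np.stack([q_lab[y_lab == y].mean(axis=0) for y in range(n_classes)])

def soft_decode(q, profiles):
    """Assign each image the label whose cluster profile best matches its posterior."""
    scores = q @ profiles.T        # dot-product similarity, shape (n, n_classes)
    return scores.argmax(axis=1)

# toy usage: label 1 is spread over clusters 1 and 2, label 0 lives in cluster 0
q_lab = np.array([[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]])
y_lab = np.array([0, 1, 1])
p = fit_label_profiles(q_lab, y_lab, n_classes=2)
print(soft_decode(np.array([[0.2, 0.4, 0.4]]), p))  # -> [1]
```

Other similarities (cosine, negative KL divergence) would fit the same template; the dot product keeps the example minimal.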
A concrete example where soft decoding helps
To better understand why soft decoding can outperform the hard rule, let's look at a concrete example (Figure 2).

In this case, the true label is e. The model produces the cluster posterior distribution shown in the center of Figure 2, with the probability mass concentrated on clusters 76, 40, 35, 81, and 61.
The hard rule only considers the most probable cluster:

$$\hat{c} = 76$$

Since cluster 76 is mostly associated with the label c, the hard prediction becomes

$$\hat{y}_{\mathrm{hard}} = c,$$

which is incorrect.
Soft decoding instead aggregates information from all plausible clusters.
Intuitively, this computes a weighted vote of clusters using their posterior probabilities.
In this example, several clusters strongly correspond to the correct label e.
Approximating the vote, the total posterior mass on the clusters associated with e exceeds the mass on the clusters associated with c.
Although cluster 76 clearly dominates the posterior, most of the probability mass actually lies on clusters associated with the correct label. By aggregating these signals, the soft rule correctly predicts

$$\hat{y}_{\mathrm{soft}} = e.$$

This illustrates the key limitation of hard decoding: it discards most of the information contained in the posterior distribution $q(c \mid x)$. Soft decoding, on the other hand, leverages the full uncertainty of the generative model.
How Much Supervision Do We Need in Practice?
Theory aside, let's see how this works on real data.
The goal here is twofold:
- to understand how many labeled samples are needed to achieve good accuracy
- to determine when soft decoding is useful
To this end, we progressively increase the number of labeled samples and evaluate accuracy on the remaining data.
We compare our approach against standard baselines: logistic regression, an MLP, and XGBoost.
Results are reported as mean accuracy with 95% confidence intervals over 5 random seeds (Figure 3).

Even with extremely small labeled subsets, the classifier already performs surprisingly well.
Most notably, soft decoding significantly improves performance when supervision is scarce.
With only 73 labeled samples, meaning that several clusters are not represented at all, soft decoding achieves an absolute accuracy gain of around 18 percentage points over hard decoding.
Moreover, with 0.2% labeled data (291 samples out of 145 600, roughly 3 labeled examples per cluster), the GMVAE-based classifier already reaches 80% accuracy.
In comparison, XGBoost requires around 7% labeled data (35 times more supervision) to achieve similar performance.
This striking gap highlights a key point:
Most of the structure required for classification is already learned during the unsupervised phase; labels are only needed to interpret it.
Conclusion
Using a GMVAE trained entirely without labels, we have seen that a classifier can be built with as little as 0.2% labeled data.
The key observation is that the unsupervised model already learns a large part of the structure required for classification.
Labels are not used to build the representation from scratch.
Instead, they are only used to interpret clusters that the model has already discovered.
A simple hard decoding rule already performs well, but leveraging the full posterior distribution over clusters provides a small yet consistent improvement, especially when the model is uncertain.
More broadly, this experiment highlights a promising paradigm for label-efficient machine learning:
- learn structure first
- add labels later
- use supervision primarily to interpret representations rather than to construct them
This suggests that, in many cases, labels are not needed to learn, only to name what has already been learned.
All experiments were carried out using our own implementation of the GMVAE and evaluation pipeline.
References
- Cohen, G., Afshar, S., Tapson, J., & van Schaik, A. (2017). EMNIST: Extending MNIST to handwritten letters.
- Dilokthanakul, N., Mediano, P. A., Garnelo, M., Lee, M. C., Salimbeni, H., Arulkumaran, K., & Shanahan, M. (2016). Deep Unsupervised Clustering with Gaussian Mixture Variational Autoencoders.
© 2026 MUREX S.A.S. and Université Paris Dauphine — PSL
This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/