Semi-supervised learning (SSL) is one of the artificial intelligence (AI) methods that have become popular in the last few months. Companies such as Google have been advancing the tools and frameworks relevant for building semi-supervised learning applications. Google Expander is a great example of a tool that reflects these advancements.
Conceptually, semi-supervised learning can be positioned halfway between unsupervised and supervised learning models. A semi-supervised learning problem starts with a series of labeled data points as well as some data points for which the labels are not known. The goal of a semi-supervised model is to classify some of the unlabeled data using the labeled information set.
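As a quick illustration of that setup, here is a minimal sketch assuming scikit-learn (a library choice for this example, not something prescribed above), which follows the common convention of marking unlabeled points with -1:

```python
# Minimal SSL setup sketch using scikit-learn (an assumption of this example).
# Unlabeled points are marked with -1.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelPropagation

# Generate a toy dataset: 500 points, 2 classes.
X, y = make_classification(n_samples=500, n_features=4, random_state=42)

# Keep labels for only ~5% of the points; hide the rest behind -1.
rng = np.random.RandomState(42)
y_partial = np.copy(y)
unlabeled_mask = rng.rand(len(y)) > 0.05
y_partial[unlabeled_mask] = -1

# Fit on the mixed labeled/unlabeled data and recover labels for the hidden points.
model = LabelPropagation()
model.fit(X, y_partial)
inferred = model.transduction_[unlabeled_mask]
print("Accuracy on originally unlabeled points:",
      (inferred == y[unlabeled_mask]).mean())
```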
Some AI practitioners see semi-supervised learning as a form of supervised learning with additional information. In the end, the goal of semi-supervised learning models is the same as that of supervised ones: to predict a target value for a specific input dataset. Alternatively, other segments of the AI community see semi-supervised learning as a form of unsupervised learning with constraints. You can pick your favorite school of thought ;)
There are plenty of scenarios for SSL models. However, not all AI scenarios can be directly tackled using SSL. A few essential characteristics should be present in a problem for it to be effectively solvable using SSL.
1 — Sizable Unlabeled Dataset: In SSL scenarios, the size of the unlabeled dataset should be substantially larger than that of the labeled dataset. Otherwise, the problem could simply be addressed using supervised algorithms.
2 — Input-Output Proximity Symmetry: SSL operates by inferring the classification of unlabeled data based on proximity to labeled data points. Inverting that reasoning, SSL scenarios entail that if two data points are part of the same cluster (determined by a k-means algorithm or similar), their outputs are likely to be in close proximity as well. Complementarily, if two data points are separated by a low-density area, their outputs should not be close (see the sketch after this list).
3 — Relatively Simple Labeling & Low-Dimensional Nature of the Problem: In SSL scenarios, it is important that inferring the labels of the unlabeled data doesn't become a problem more complicated than the original one. This is known in AI circles as the “Vapnik Principle,” which essentially states that in order to solve a problem we should not pick an intermediate problem of a higher order of complexity. Also, problems whose datasets have many dimensions or attributes are likely to become really challenging for SSL algorithms, as the labeling task becomes very complex.
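To make points 1 and 2 concrete, here is a hedged sketch, again assuming scikit-learn: on the classic two-moons toy dataset, a handful of labels per cluster is enough for a label-spreading model to propagate labels along each dense cluster while respecting the low-density gap between them.

```python
# Sketch of the cluster / low-density assumption, assuming scikit-learn
# (an illustrative example, not taken from the article). Two crescent-shaped
# clusters, only a few labeled points each.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

X, y = make_moons(n_samples=300, noise=0.08, random_state=0)

# Point 1: the unlabeled set dwarfs the labeled one -- keep only 3 labels per cluster.
y_partial = np.full_like(y, -1)
rng = np.random.RandomState(0)
for cls in (0, 1):
    idx = rng.choice(np.where(y == cls)[0], size=3, replace=False)
    y_partial[idx] = cls

# Point 2: an RBF kernel ties each point to its near neighbors, so labels
# spread along each dense cluster but not across the gap separating them.
model = LabelSpreading(kernel="rbf", gamma=20)
model.fit(X, y_partial)
print("Agreement with true labels:", (model.transduction_ == y).mean())
```

With only six labeled points out of 300, the propagated labels typically match the true classes almost perfectly, precisely because the two clusters are separated by a low-density region.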