Skip to main content

Multiclass Classification

ReductionGistTraining TimePrediction TimeRemark
OvAIs it kk or not?CNCNCCNot robust
OvOIs it kk or kโ€ฒk'?(Cโˆ’1)N(C-1)NO(C2)\mathcal{O}(C^2)can achieve very small training error
ECOCAssign bits per class. Is this specific bit on or off?LNLNLLNeeds diversity when designing code
TreeWhich half does it belong to?O(Nlogโก2C)\mathcal{O}(N \log_2 C)O(logโก2C)\mathcal{O}(\log_2 C)Good for extreme classifications

OvA (One-vs-All or One-vs-Rest)โ€‹

  • A binary classifier is trained against each category's combined classes. So, if there are KK classes, KK binary classifiers are trained.
  • For prediction, all KK classifiers predict the class. The class with the highest confidence score is selected as the final prediction.

OvO (One-vs-One)โ€‹

  • A binary classifier is trained for every pair of classes. So, if there are KK classes, K(Kโˆ’1)2\frac{K(K-1)}{2} classifiers are trained.
  • For prediction, the class that gets voted for the most across all classifiers is chosen.

ECOC (Error-Correcting Output Codes)โ€‹

  • This method represents each class with a unique binary length code LL. The aim is to train binary classifiers not just based on separating types but also based on distinguishing these binary codes.
  • During prediction, the class code closest to the predicted code (in terms of Hamming distance or other metrics) is selected.

Decision Treesโ€‹

  • Instead of converting the multiclass problem to multiple binary problems, decision trees handle multiclass classification directly by splitting data points at each node based on feature values, eventually leading them to a leaf node representing a class.
  • Random Forests, gradient-boosted trees, and other ensemble tree methods can also be used for multiclass classification.

Comparisonโ€‹

Efficiencyโ€‹

  • OvA โ€” Requires training KK classifiers.
  • OvO โ€” Requires training K(Kโˆ’1)2\frac{K(K-1)}{2} classifiers, which can be computationally expensive as KK grows.
  • ECOC โ€” The number of classifiers is determined by the length LL of the code, which is generally less than KK and K(Kโˆ’1)2\frac{K(K-1)}{2}.
  • Tree โ€” A single tree is trained, but ensemble methods like Random Forests would involve teaching multiple trees.

Scalabilityโ€‹

  • OvA โ€” Scales linearly with the number of classes.
  • OvO โ€” Quadratic scalability with the number of classes can be inefficient for an enormous KK.
  • ECOC โ€” Scalability can vary depending on the coding design.
  • Tree โ€” Scales well with the number of classes, but tree depth might increase.

Decision Boundaryโ€‹

  • OvA โ€” This can lead to imbalanced decision boundaries since one class is always compared against all others combined.
  • OvO โ€” Often results in more balanced decision boundaries since it considers every pair of classes separately.
  • ECOC โ€” The coding design determines the decision boundary.
  • Tree โ€” Produces orthogonal decision boundaries based on feature splits.

Complexityโ€‹

  • OvA and OvO โ€” Relatively straightforward to understand.
  • ECOC โ€” Requires an additional step of designing and decoding the binary codes for classes.
  • Tree โ€” Interpretable as it offers hierarchical decisions based on features.