Hinge loss

Figure: The vertical axis represents the value of the hinge loss (blue) and the zero-one loss (green) for fixed t = 1, while the horizontal axis represents the value of the prediction y. The plot shows that the hinge loss penalizes predictions y < 1, corresponding to the notion of a margin in a support vector machine.

In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs).[1]

For an intended output t = ±1 and a classifier score y, the hinge loss of the prediction y is defined as

\ell(y) = \max(0, 1 - ty)

Note that y should be the "raw" output of the classifier's decision function, not the predicted class label. For instance, in linear SVMs, $y = \mathbf{w} \cdot \mathbf{x} + b$, where $(\mathbf{w}, b)$ are the parameters of the hyperplane and $\mathbf{x}$ is the input variable(s).

When t and y have the same sign (meaning y predicts the right class) and $|y| \ge 1$, the hinge loss $\ell(y) = 0$. When they have opposite signs, $\ell(y)$ increases linearly as ty decreases, and similarly when $|y| < 1$ even if t and y have the same sign (a correct prediction, but not by a sufficient margin).
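As a concrete illustration, the following is a minimal NumPy sketch of the binary hinge loss defined above (the function name and vectorized interface are illustrative, not part of any particular library):

```python
import numpy as np

def hinge_loss(y_score, t):
    """Binary hinge loss max(0, 1 - t*y) for labels t in {-1, +1}
    and raw decision-function scores y_score."""
    return np.maximum(0.0, 1.0 - t * y_score)

# A confident correct score incurs zero loss, a correct but low-margin
# score incurs a small loss, and a wrong-sign score is penalized linearly.
print(hinge_loss(np.array([2.0, 0.5, -1.0]), t=1))  # [0.  0.5 2. ]
```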

Extensions

While binary SVMs are commonly extended to multiclass classification in a one-vs.-all or one-vs.-one fashion,[2] it is also possible to extend the hinge loss itself for such an end. Several different variations of multiclass hinge loss have been proposed.[3] For example, Crammer and Singer[4] defined it for a linear classifier as[5]

\ell(y) = \max(0,\, 1 + \max_{y \ne t} \mathbf{w}_y \cdot \mathbf{x} - \mathbf{w}_t \cdot \mathbf{x}),

where t is the target label, and $\mathbf{w}_t$ and $\mathbf{w}_y$ are the model parameters.
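For illustration, here is a small sketch of this loss for a linear model, assuming W is an array whose rows are the per-class weight vectors $\mathbf{w}_y$ (the names and interface are illustrative):

```python
import numpy as np

def crammer_singer_hinge(W, x, t):
    """Crammer and Singer's multiclass hinge loss for a linear classifier.
    W: (n_classes, n_features) array of per-class weights, x: feature vector,
    t: integer index of the target class."""
    scores = W @ x                             # w_y . x for every class y
    best_other = np.max(np.delete(scores, t))  # strongest competing class
    return max(0.0, 1.0 + best_other - scores[t])
```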

Weston and Watkins provided a similar definition, but with a sum rather than a max:[6][3]

\ell(y) = \sum_{y \ne t} \max(0,\, 1 + \mathbf{w}_y \cdot \mathbf{x} - \mathbf{w}_t \cdot \mathbf{x}).
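Under the same illustrative conventions as above, the Weston and Watkins variant sums one hinge term per competing class instead of keeping only the worst violator:

```python
import numpy as np

def weston_watkins_hinge(W, x, t):
    """Weston and Watkins' multiclass hinge loss: a hinge term for each
    competing class, summed."""
    scores = W @ x
    margins = 1.0 + scores - scores[t]   # 1 + w_y . x - w_t . x for every y
    margins[t] = 0.0                     # the target class itself contributes nothing
    return np.sum(np.maximum(0.0, margins))
```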

In structured prediction, the hinge loss can be further extended to structured output spaces. Structured SVMs with margin rescaling use the following variant, where w denotes the SVM's parameters, y the SVM's predictions, φ the joint feature function, and Δ the Hamming loss:

\ell(\mathbf{y}) = \max(0,\, \Delta(\mathbf{y}, \mathbf{t}) + \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{y}) \rangle - \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{t}) \rangle) = \max\left(0,\, \max_{\mathbf{y} \in \mathcal{Y}} \bigl( \Delta(\mathbf{y}, \mathbf{t}) + \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{y}) \rangle \bigr) - \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{t}) \rangle \right).
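As a rough sketch only, the margin-rescaled loss can be written generically as below, assuming the output space can be enumerated as a small candidate list; phi, delta, and the candidate enumeration are placeholders, and practical structured SVMs instead solve the inner maximization by loss-augmented inference:

```python
import numpy as np

def structured_hinge(w, phi, x, t, candidates, delta):
    """Margin-rescaled structured hinge loss.
    w: parameter vector, phi(x, y): joint feature map returning a vector,
    t: gold-standard output, candidates: iterable over the output space,
    delta(y, t): task loss such as the Hamming loss."""
    gold_score = w @ phi(x, t)
    loss_augmented = max(delta(y, t) + w @ phi(x, y) for y in candidates)
    return max(0.0, loss_augmented - gold_score)
```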

Optimization

The hinge loss is a convex function, so many of the usual convex optimizers used in machine learning can work with it. It is not differentiable, but has a subgradient with respect to model parameters $\mathbf{w}$ of a linear SVM with score function $y = \mathbf{w} \cdot \mathbf{x}$ that is given by

\frac{\partial \ell}{\partial w_i} = \begin{cases} -t \cdot x_i & \text{if } ty < 1, \\ 0 & \text{otherwise.} \end{cases}
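A minimal sketch of a single stochastic subgradient step built on this expression (the learning rate and the omission of a bias term and regularizer are illustrative simplifications):

```python
import numpy as np

def hinge_subgradient_step(w, x, t, lr=0.1):
    """One subgradient-descent step on the hinge loss of example (x, t)
    for a linear score y = w . x."""
    if t * (w @ x) < 1:        # margin violated: the subgradient is -t * x
        w = w + lr * t * x     # step against the subgradient
    return w                   # otherwise the subgradient is 0 and w is unchanged
```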
Figure: Plot of three variants of the hinge loss as a function of z = ty: the "ordinary" variant (blue), its square (green), and the piecewise-smooth version by Rennie and Srebro (red).

However, since the derivative of the hinge loss at ty=1 is undefined, smoothed versions may be preferred for optimization, such as Rennie and Srebro's[7]

\ell(y) = \begin{cases} \frac{1}{2} - ty & \text{if } ty \le 0, \\ \frac{1}{2}(1 - ty)^2 & \text{if } 0 < ty < 1, \\ 0 & \text{if } 1 \le ty \end{cases}

or the quadratically smoothed

\ell_\gamma(y) = \begin{cases} \frac{1}{2\gamma} \max(0, 1 - ty)^2 & \text{if } ty \ge 1 - \gamma, \\ 1 - \frac{\gamma}{2} - ty & \text{otherwise} \end{cases}

suggested by Zhang.[8] The modified Huber loss L is a special case of this loss function with $\gamma = 2$, specifically $L(t,y) = 4\ell_2(y)$.
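The following NumPy sketch illustrates both smoothed variants, written in terms of z = ty (function names and interfaces are illustrative, not from any particular library):

```python
import numpy as np

def smooth_hinge(y_score, t):
    """Rennie and Srebro's piecewise-smooth hinge loss:
    linear for z <= 0, quadratic for 0 < z < 1, zero for z >= 1."""
    z = t * y_score
    return np.where(z <= 0, 0.5 - z,
                    np.where(z < 1, 0.5 * (1.0 - z) ** 2, 0.0))

def quad_smoothed_hinge(y_score, t, gamma):
    """Zhang's quadratically smoothed hinge loss with smoothing parameter gamma > 0."""
    z = t * y_score
    return np.where(z >= 1.0 - gamma,
                    np.maximum(0.0, 1.0 - z) ** 2 / (2.0 * gamma),
                    1.0 - gamma / 2.0 - z)

# With gamma = 2, multiplying by 4 recovers the modified Huber loss:
# L(t, y) = 4 * quad_smoothed_hinge(y, t, gamma=2).
```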

References

  1. Rosasco, L.; De Vito, E. D.; Caponnetto, A.; Piana, M.; Verri, A. (2004). "Are Loss Functions All the Same?". Neural Computation 16 (5): 1063–1076. doi:10.1162/089976604773135104. PMID 15070510. http://web.mit.edu/lrosasco/www/publications/loss.pdf. 
  2. Duan, K. B.; Keerthi, S. S. (2005). "Which Is the Best Multiclass SVM Method? An Empirical Study". Multiple Classifier Systems. LNCS. 3541. pp. 278–285. doi:10.1007/11494683_28. ISBN 978-3-540-26306-7. http://www.keerthis.com/multiclass_mcs_kaibo_05.pdf. 
  3. Doğan, Ürün; Glasmachers, Tobias; Igel, Christian (2016). "A Unified View on Multi-class Support Vector Classification". Journal of Machine Learning Research 17: 1–32. http://www.jmlr.org/papers/volume17/11-229/11-229.pdf. 
  4. Crammer, Koby; Singer, Yoram (2001). "On the algorithmic implementation of multiclass kernel-based vector machines". Journal of Machine Learning Research 2: 265–292. http://jmlr.csail.mit.edu/papers/volume2/crammer01a/crammer01a.pdf. 
  5. Moore, Robert C.; DeNero, John (2011). "L1 and L2 regularization for multiclass hinge loss models". http://www.ttic.edu/sigml/symposium2011/papers/Moore+DeNero_Regularization.pdf. 
  6. Weston, Jason; Watkins, Chris (1999). "Support Vector Machines for Multi-Class Pattern Recognition". https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es1999-461.pdf. 
  7. Rennie, Jason D. M.; Srebro, Nathan (2005). "Loss Functions for Preference Levels: Regression with Discrete Ordered Labels". Proc. IJCAI Multidisciplinary Workshop on Advances in Preference Handling. http://ttic.uchicago.edu/~nati/Publications/RennieSrebroIJCAI05.pdf. 
  8. Zhang, Tong (2004). "Solving large scale linear prediction problems using stochastic gradient descent algorithms". ICML. http://tongzhang-ml.org/papers/icml04-stograd.pdf.