Bayes classifier


In statistical classification, the Bayes classifier is the classifier having the smallest probability of misclassification of all classifiers using the same set of features.[1]

Definition

Suppose a pair $(X,Y)$ takes values in $\mathbb{R}^d \times \{1, 2, \dots, K\}$, where $Y$ is the class label of an element whose features are given by $X$. Assume that the conditional distribution of $X$, given that the label $Y$ takes the value $r$, is given by

$$(X \mid Y = r) \sim P_r \quad \text{for} \quad r = 1, 2, \dots, K,$$

where "" means "is distributed as", and where Pr denotes a probability distribution.

A classifier is a rule that assigns to an observation $X = x$ a guess or estimate of what the unobserved label $Y = r$ actually was. In theoretical terms, a classifier is a measurable function $C : \mathbb{R}^d \to \{1, 2, \dots, K\}$, with the interpretation that $C$ classifies the point $x$ to the class $C(x)$. The probability of misclassification, or risk, of a classifier $C$ is defined as

$$\mathcal{R}(C) = \operatorname{P}\{C(X) \neq Y\}.$$

The Bayes classifier is

$$C^{\text{Bayes}}(x) = \underset{r \in \{1, 2, \dots, K\}}{\operatorname{argmax}} \operatorname{P}(Y = r \mid X = x).$$

In practice, as in most of statistics, the difficulties and subtleties are associated with modeling the probability distributions effectively, in this case $\operatorname{P}(Y = r \mid X = x)$. The Bayes classifier is a useful benchmark in statistical classification.
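When the priors $\operatorname{P}(Y = r)$ and the class-conditional distributions $P_r$ are known exactly, the posterior can be evaluated directly via Bayes' theorem, since $\operatorname{P}(Y = r \mid X = x) \propto \operatorname{P}(Y = r)\,p_r(x)$ for densities $p_r$. The following is a minimal sketch of this rule, assuming two illustrative one-dimensional Gaussian class conditionals and priors (all numerical values are hypothetical):

```python
# Minimal sketch of the Bayes rule with fully known priors and
# class-conditional densities.  The Gaussian parameters and priors
# below are illustrative assumptions.
import numpy as np
from scipy.stats import norm

priors = np.array([0.6, 0.4])                     # P(Y=1), P(Y=2)
class_conditionals = [norm(loc=0.0, scale=1.0),   # P_1 = N(0, 1)
                      norm(loc=2.0, scale=1.0)]   # P_2 = N(2, 1)

def bayes_classify(x):
    """Return the class r maximising P(Y=r | X=x), i.e. P(Y=r) * p_r(x)."""
    joint = np.array([p * dist.pdf(x) for p, dist in zip(priors, class_conditionals)])
    return int(np.argmax(joint)) + 1              # classes are labelled 1 and 2

print(bayes_classify(0.3))   # -> 1 (closer to the class-1 mean)
print(bayes_classify(1.8))   # -> 2
```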

The excess risk of a general classifier $C$ (possibly depending on some training data) is defined as $\mathcal{R}(C) - \mathcal{R}(C^{\text{Bayes}})$. This non-negative quantity is thus important for assessing the performance of different classification techniques. A classifier is said to be consistent if the excess risk converges to zero as the size of the training data set tends to infinity.[2]

Taking the components $x_i$ of $x$ to be conditionally independent given the class, we obtain the naive Bayes classifier, where $$C^{\text{Bayes}}(x) = \underset{r \in \{1, 2, \dots, K\}}{\operatorname{argmax}} \operatorname{P}(Y = r) \prod_{i=1}^{d} P_r(x_i).$$
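Under this conditional-independence assumption, each class score is simply the prior multiplied by a product of per-feature densities. A minimal sketch, assuming illustrative per-feature Gaussian parameters for two classes and two features (all values hypothetical):

```python
# Minimal naive Bayes sketch: score each class by its prior times the
# product of per-feature Gaussian densities.  All parameters are
# illustrative assumptions.
import numpy as np
from scipy.stats import norm

priors = {1: 0.5, 2: 0.5}                 # P(Y=r)
feature_params = {                        # (mean, std) for each feature, per class
    1: [(0.0, 1.0), (5.0, 2.0)],
    2: [(3.0, 1.0), (1.0, 2.0)],
}

def naive_bayes_classify(x):
    """Return argmax_r P(Y=r) * prod_i p_r(x_i), with Gaussian p_r(x_i)."""
    scores = {}
    for r, params in feature_params.items():
        likelihood = np.prod([norm(mu, sd).pdf(xi) for xi, (mu, sd) in zip(x, params)])
        scores[r] = priors[r] * likelihood
    return max(scores, key=scores.get)

print(naive_bayes_classify([0.2, 4.5]))   # -> 1
print(naive_bayes_classify([2.8, 1.5]))   # -> 2
```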

Properties

A proof that the Bayes classifier is optimal, and that the Bayes error rate is therefore minimal, proceeds as follows.

Define the following: the risk $R(h)$ of a classifier $h$, the Bayes risk $R^*$, and the set of possible classes $\mathcal{Y} = \{0, 1\}$. Let the posterior probability of a point belonging to class 1 be $\eta(x) = \Pr(Y = 1 \mid X = x)$. Define the classifier $h^*$ as

$$h^*(x) = \begin{cases} 1 & \text{if } \eta(x) \geq 0.5, \\ 0 & \text{otherwise.} \end{cases}$$

Then we have the following results:

(a) $R(h^*) = R^*$, i.e. $h^*$ is a Bayes classifier;

(b) for any classifier $h$, the excess risk satisfies $R(h) - R^* = 2\,\mathbb{E}_X\!\left[\,|\eta(X) - 0.5|\;\mathbb{I}\{h(X) \neq h^*(X)\}\,\right]$;

(c) $R^* = \mathbb{E}_X\!\left[\min(\eta(X), 1 - \eta(X))\right]$;

(d) $R^* = \tfrac{1}{2} - \tfrac{1}{2}\,\mathbb{E}\!\left[\,|2\eta(X) - 1|\,\right]$.
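As a hypothetical illustration of (c) and (d), suppose $X$ is uniform on $[0,1]$ and $\eta(x) = x$. Then both expressions give the same Bayes risk:

$$R^* = \mathbb{E}_X\!\left[\min(X, 1 - X)\right] = 2\int_0^{1/2} x \, dx = \tfrac{1}{4}, \qquad \tfrac{1}{2} - \tfrac{1}{2}\,\mathbb{E}\!\left[\,|2X - 1|\,\right] = \tfrac{1}{2} - \tfrac{1}{2}\cdot\tfrac{1}{2} = \tfrac{1}{4}.$$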


Proof of (a): For any classifier $h$, we have

$$\begin{aligned} R(h) &= \mathbb{E}_{XY}\left[\mathbb{I}\{h(X) \neq Y\}\right] \\ &= \mathbb{E}_X \mathbb{E}_{Y \mid X}\left[\mathbb{I}\{h(X) \neq Y\}\right] \quad \text{(due to Fubini's theorem)} \\ &= \mathbb{E}_X\left[\eta(X)\,\mathbb{I}\{h(X) = 0\} + (1 - \eta(X))\,\mathbb{I}\{h(X) = 1\}\right] \end{aligned}$$

Notice that $R(h)$ is minimised by taking, for every $x$,

$$h(x) = \begin{cases} 1 & \text{if } \eta(x) \geq 1 - \eta(x), \\ 0 & \text{otherwise.} \end{cases}$$

Therefore the minimum possible risk is the Bayes risk, $R^* = R(h^*)$.


Proof of (b):

$$\begin{aligned} R(h) - R^* &= R(h) - R(h^*) \\ &= \mathbb{E}_X\left[\eta(X)\,\mathbb{I}\{h(X) = 0\} + (1 - \eta(X))\,\mathbb{I}\{h(X) = 1\} - \eta(X)\,\mathbb{I}\{h^*(X) = 0\} - (1 - \eta(X))\,\mathbb{I}\{h^*(X) = 1\}\right] \\ &= \mathbb{E}_X\left[\,|2\eta(X) - 1|\;\mathbb{I}\{h(X) \neq h^*(X)\}\,\right] \\ &= 2\,\mathbb{E}_X\left[\,|\eta(X) - 0.5|\;\mathbb{I}\{h(X) \neq h^*(X)\}\,\right] \end{aligned}$$


Proof of (c):

$$\begin{aligned} R(h^*) &= \mathbb{E}_X\left[\eta(X)\,\mathbb{I}\{h^*(X) = 0\} + (1 - \eta(X))\,\mathbb{I}\{h^*(X) = 1\}\right] \\ &= \mathbb{E}_X\left[\min(\eta(X), 1 - \eta(X))\right] \end{aligned}$$

Proof of (d):

$$\begin{aligned} R(h^*) &= \mathbb{E}_X\left[\min(\eta(X), 1 - \eta(X))\right] \\ &= \tfrac{1}{2} - \mathbb{E}_X\left[\max\left(\eta(X) - \tfrac{1}{2},\; \tfrac{1}{2} - \eta(X)\right)\right] \\ &= \tfrac{1}{2} - \tfrac{1}{2}\,\mathbb{E}\left[\,|2\eta(X) - 1|\,\right] \end{aligned}$$
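The four results above can also be checked numerically. Below is a minimal Monte Carlo sketch under the same assumed model as the worked example ($X \sim \mathrm{Uniform}(0,1)$, $\eta(x) = x$, so $R^* = 1/4$), comparing the Bayes classifier $h^*$ with an arbitrary competing threshold classifier; the sample size and the competing threshold of $0.7$ are arbitrary illustrative choices:

```python
# Monte Carlo sanity check of results (a)-(d) under an assumed model:
# X ~ Uniform(0, 1), eta(x) = P(Y=1 | X=x) = x, so the exact Bayes risk is 1/4.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
X = rng.uniform(0.0, 1.0, size=n)
eta = X                                    # posterior probability of class 1
Y = (rng.uniform(size=n) < eta).astype(int)

h_star = (eta >= 0.5).astype(int)          # the Bayes classifier h*
h_other = (X >= 0.7).astype(int)           # an arbitrary competing classifier

risk_star = np.mean(h_star != Y)           # (a): empirical risk of h*
risk_other = np.mean(h_other != Y)

print(risk_star)                                               # ~0.25
print(np.mean(np.minimum(eta, 1.0 - eta)))                     # (c): ~0.25
print(0.5 - 0.5 * np.mean(np.abs(2.0 * eta - 1.0)))            # (d): ~0.25
print(risk_other - risk_star)                                  # empirical excess risk, ~0.04
print(2.0 * np.mean(np.abs(eta - 0.5) * (h_other != h_star)))  # (b): ~0.04
```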

General case

In the general case, where each element can belong to any of $n$ categories, the proof that the Bayes classifier minimises the classification error proceeds by the tower property of expectations, as follows.

$$\begin{aligned} \mathbb{E}_Y\left[\mathbb{I}\{y \neq \hat{y}\}\right] &= \mathbb{E}_X \mathbb{E}_{Y \mid X}\left[\mathbb{I}\{y \neq \hat{y}\} \mid X = x\right] \\ &= \mathbb{E}\Big[\Pr(Y = 1 \mid X = x)\,\mathbb{I}\{\hat{y} \in \{2, 3, \dots, n\}\} + \Pr(Y = 2 \mid X = x)\,\mathbb{I}\{\hat{y} \in \{1, 3, \dots, n\}\} \\ &\qquad + \dots + \Pr(Y = n \mid X = x)\,\mathbb{I}\{\hat{y} \in \{1, 2, 3, \dots, n-1\}\}\Big] \end{aligned}$$

This is minimised by simultaneously minimising all the terms of the expectation, which is achieved by the classifier $h(x) = \underset{k}{\operatorname{argmax}}\, \Pr(Y = k \mid X = x)$

for each observation x.
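Operationally, once the posterior probabilities $\Pr(Y = k \mid X = x)$ are available for a given observation, this rule is a single argmax over the classes. A minimal sketch (the posterior vector shown is a hypothetical example):

```python
# Multiclass Bayes rule: pick the class with the largest posterior
# probability.  The posterior vector is an illustrative assumption.
import numpy as np

def bayes_classify_multiclass(posterior):
    """posterior[k-1] = Pr(Y=k | X=x); return the argmax class, labelled 1..n."""
    return int(np.argmax(posterior)) + 1

posterior = np.array([0.1, 0.25, 0.6, 0.05])   # hypothetical Pr(Y=k | X=x)
print(bayes_classify_multiclass(posterior))    # -> 3
```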


References

  1. Devroye, L.; Györfi, L.; Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer. ISBN 0-387-94618-7.
  2. Faragó, A.; Lugosi, G. (1993). "Strong universal consistency of neural network classifiers". IEEE Transactions on Information Theory 39 (4): 1146–1151. doi:10.1109/18.243433. https://dl.acm.org/doi/abs/10.1109/18.243433.