Probably one of the nicest explanations of the bias/variance tradeoff is the one I found in the book Introduction to Information Retrieval (full book available online). The tradeoff can be explained mathematically, and also more intuitively. The mathematical explanation is as follows:
if we have a learning method whose prediction (call it $x$) depends on the set of input data it was given, and a “real” underlying value that we are trying to approximate (call it $\alpha$, treated as a constant), then the expected (squared) error, taken over those input datasets, is:
$$
\begin{aligned}
E[x-\alpha]^2 &= Ex^2 - 2Ex\alpha + \alpha^2\\
&= (Ex)^2 - 2Ex\alpha + \alpha^2 + Ex^2 - 2(Ex)^2 + (Ex)^2\\
&= [Ex-\alpha]^2 + Ex^2 - E2x(Ex) + E(Ex)^2\\
&= [Ex-\alpha]^2 + E[x-Ex]^2
\end{aligned}
$$
Taking advantage of the linearity of expectation and adding a few extra cancelling terms, we end up with the representation:
Error = bias ($[Ex-\alpha]^2$) + variance ($E[x-Ex]^2$)
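As a quick sanity check on this decomposition, here is a minimal Python sketch. The function name `decompose`, the Gaussian stand-in for the predictions, and the value chosen for `alpha` are illustrative assumptions and do not come from the book. The idea is simply that the average squared error and the sum of the two terms should agree up to floating-point rounding.

```python
import random


def decompose(predictions, alpha):
    """Return (error, bias, variance) for a list of predictions x and a target alpha."""
    n = len(predictions)
    mean_pred = sum(predictions) / n                            # Ex
    error = sum((x - alpha) ** 2 for x in predictions) / n      # E[x - alpha]^2
    bias = (mean_pred - alpha) ** 2                             # [Ex - alpha]^2
    variance = sum((x - mean_pred) ** 2 for x in predictions) / n  # E[x - Ex]^2
    return error, bias, variance


# Hypothetical setup: the true value is 2.0, and the learner's predictions
# scatter around 2.5 depending on which training set it happened to see.
rng = random.Random(0)
alpha = 2.0
predictions = [rng.gauss(2.5, 1.0) for _ in range(10_000)]

error, bias, variance = decompose(predictions, alpha)
print("error          :", error)
print("bias + variance:", bias + variance)
```

Note that the variance here divides by `n` rather than `n - 1`; with that choice the identity holds exactly for the empirical averages, not just in expectation.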
That's the mathematical equivalence. However, a more descriptive approach is as follows:
Bias is the squared difference between the true underlying distribution and the prediction of the learning process, averaged over our training sets. Consistently wrong predictions mean large bias. Bias is small when the predictions are consistently right, or when the errors made on different training sets roughly cancel out on average. Linear models generally have high bias on nonlinear problems. Bias can represent the domain knowledge that we have built into the learning process: a linear assumption may be unsuitable for a nonlinear problem, and thus result in high bias.
Variance is the variation in prediction (or the lack of consistency) across training sets: it is large if different training sets result in different learning models. Linear models will generally have lower variance. High variance generally results in overfitting: in effect, the learning model is learning from noise, and will not generalize well.
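To make the two descriptions above concrete, here is a small simulation sketch in Python. Everything in it is an assumption chosen for illustration: the nonlinear target `true_f(t) = t * t`, the noise level, the training-set size, and the two learners (a least-squares line and a 1-nearest-neighbour rule). It draws many training sets, records each learner's prediction at one query point, and estimates bias and variance from those predictions.

```python
import random


def true_f(t):
    # Hypothetical nonlinear "real" process.
    return t * t


def sample_training_set(rng, n=20, noise_sd=0.3):
    # Noisy observations of true_f at random inputs in [-1, 1].
    ts = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(t, true_f(t) + rng.gauss(0.0, noise_sd)) for t in ts]


def fit_linear(data):
    # Ordinary least-squares line through the training points.
    n = len(data)
    mt = sum(t for t, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((t - mt) ** 2 for t, _ in data)
    sxy = sum((t - mt) * (y - my) for t, y in data)
    slope = sxy / sxx
    intercept = my - slope * mt
    return lambda q: intercept + slope * q


def fit_1nn(data):
    # Predict the label of the single closest training point.
    return lambda q: min(data, key=lambda p: abs(p[0] - q))[1]


def bias_variance(fit, query, runs=500, seed=0):
    # Train on many independent training sets and compare predictions at `query`.
    rng = random.Random(seed)
    preds = [fit(sample_training_set(rng))(query) for _ in range(runs)]
    mean_pred = sum(preds) / runs
    bias = (mean_pred - true_f(query)) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / runs
    return bias, variance


lin_bias, lin_var = bias_variance(fit_linear, query=0.0)
nn_bias, nn_var = bias_variance(fit_1nn, query=0.0)
print(f"linear: bias={lin_bias:.4f} variance={lin_var:.4f}")
print(f"1-NN  : bias={nn_bias:.4f} variance={nn_var:.4f}")
```

Under these assumptions the line should come out with the larger bias (it cannot bend to follow the curve) and the nearest-neighbour rule with the larger variance (it copies the noise of whichever point happens to be closest).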
It's a useful analogy to think of most learning models as a box with two dials, bias and variance, where the setting of one affects the other. We can only try to find the “right” setting for the situation we are working with. Hence the bias-variance tradeoff.
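One way to see the dials move is to pick a learner with a single tunable knob. The sketch below uses k-nearest-neighbour averaging, where `k` plays that role; this choice, along with the quadratic target, the noise level, and the function names `knn_predict` and `sweep`, is an assumption made for illustration only.

```python
import random


def knn_predict(data, query, k):
    # Average the labels of the k training points closest to the query.
    nearest = sorted(data, key=lambda p: abs(p[0] - query))[:k]
    return sum(y for _, y in nearest) / k


def sweep(ks=(1, 5, 20), n=20, runs=500, noise_sd=0.3, query=0.0, seed=0):
    # For each k, estimate bias and variance at `query` over many training sets.
    rng = random.Random(seed)
    preds = {k: [] for k in ks}
    for _ in range(runs):
        # Hypothetical process: y = t^2 plus Gaussian noise, t uniform in [-1, 1].
        data = [(t, t * t + rng.gauss(0.0, noise_sd))
                for t in (rng.uniform(-1.0, 1.0) for _ in range(n))]
        for k in ks:
            preds[k].append(knn_predict(data, query, k))
    results = {}
    for k, ps in preds.items():
        mean_pred = sum(ps) / runs
        bias = (mean_pred - query * query) ** 2
        variance = sum((p - mean_pred) ** 2 for p in ps) / runs
        results[k] = (bias, variance)
    return results


results = sweep()
for k, (bias, variance) in results.items():
    print(f"k={k:2d}: bias={bias:.4f} variance={variance:.4f}")
```

With these settings, a small `k` should give low bias and high variance, while `k` equal to the training-set size (a plain average of all labels) should give the reverse. The exact numbers depend on the seed and the assumed setup.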