Loss and Optimization

Published: October 24, 2022

A loss function is a way of evaluating how well a machine learning algorithm models a given data set. If the loss value is low, the model fits the data well. To improve the model's performance, we minimize the loss function used to evaluate it.

Broadly speaking, loss functions fall into two major categories, mirroring the two kinds of problems we come across in the real world: CLASSIFICATION and REGRESSION. In classification problems, the task is to predict the respective probabilities of all the classes the problem deals with. In regression, the task is to predict a continuous value from a given set of independent features supplied to the learning algorithm.

Loss Functions for Regression

1. Mean Absolute Error Loss

We define the MAE loss function as the average of the absolute differences between the actual and the predicted values. It is the second most commonly used regression loss function. It measures the average magnitude of the errors in a set of predictions, without considering their direction.

\[MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y_i}|\]

where \(y_i\) is the actual value and \(\hat{y_i}\) is the predicted value.

The corresponding cost function is the mean of these absolute errors (MAE). It is also known as the L1 loss function.
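As a quick sanity check (not part of the experiment below), the same quantity is exposed by PyTorch as torch.nn.L1Loss; a minimal sketch with made-up numbers:

# hand-rolled MAE vs. PyTorch's built-in L1 loss
import torch

y_true = torch.tensor([1.0, 2.0, 3.0])
y_hat = torch.tensor([1.5, 1.5, 2.0])

mae_manual = torch.mean(torch.abs(y_true - y_hat))  # (0.5 + 0.5 + 1.0) / 3 = 0.6667
mae_builtin = torch.nn.L1Loss()(y_hat, y_true)      # same value

print(mae_manual.item(), mae_builtin.item())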

import numpy as np
import plotly.graph_objects as go
import torch

from sklearn.metrics import mean_squared_error, accuracy_score

np.random.seed(0)
torch.manual_seed(0)
# generate data
x = np.random.uniform(-1, 1, (500, 1))
y = 2 * x + 3 + np.random.normal(0, 0.5, (500, 1))

# plot data
fig = go.Figure()
fig.add_trace(go.Scatter(x=x.flatten(), y=y.flatten(), mode='markers', name='data'))
fig.update_layout(title='Data', xaxis_title='x', yaxis_title='y')
fig.show()
# MAE loss
def mae(y, y_pred):
    return torch.mean(torch.abs(y - y_pred))
# add bias term
X = np.concatenate([x, np.ones((500, 1))], axis=1)

# convert to tensors
X = torch.from_numpy(X).float()
Y = torch.from_numpy(y).float()

# initialize weights
w = torch.randn(2, 1, requires_grad=True)

lr = 0.1
rmse = []
# gradient descent
for i in range(100):
    y_pred = torch.matmul(X, w)
    loss = mae(Y, y_pred)
    loss.backward()
    with torch.no_grad():
        w -= lr * w.grad
        w.grad.zero_()
    # track RMSE on the training data (reported as the loss below)
    rmse.append(mean_squared_error(y, y_pred.detach().numpy(), squared=False))
    
    if i % 10 == 0:
        print(f'Epoch {i}, loss {rmse[-1]:.4f}')
Epoch 0, loss 3.2887
Epoch 10, loss 2.3101
Epoch 20, loss 1.3679
Epoch 30, loss 0.6885
Epoch 40, loss 0.5224
Epoch 50, loss 0.4988
Epoch 60, loss 0.4934
Epoch 70, loss 0.4924
Epoch 80, loss 0.4924
Epoch 90, loss 0.4924
# plot loss
fig = go.Figure()
fig.add_trace(go.Scatter(x=np.arange(len(rmse)), y=rmse, mode='lines', name='loss'))
fig.update_layout(title='Loss', xaxis_title='epoch', yaxis_title='loss')
fig.show()

# plot data with regression line
fig = go.Figure()
fig.add_trace(go.Scatter(x=x[:, 0], y=y.flatten(), mode='markers', name='data'))
fig.add_trace(go.Scatter(x=x[:, 0], y=2 * x[:, 0] + 3, mode='lines', name='true line', line=dict(color='green')))
fig.add_trace(go.Scatter(x=x[:, 0], y=y_pred.detach().numpy().flatten(), mode='lines', name='regression line', line=dict(color='red')))
fig.show()

2. Mean Squared Error Loss

We define the MSE loss function as the average of the squared differences between the actual and the predicted values. It is the most commonly used regression loss function.

\[MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y_i})^2\]

where \(y_i\) is the actual value and \(\hat{y_i}\) is the predicted value.

The corresponding cost function is the mean of these squared errors (MSE). It is also known as the L2 loss function. The MSE loss function penalizes the model for making large errors by squaring them.
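For reference, PyTorch provides this as torch.nn.MSELoss; a minimal sketch with made-up numbers, also showing how a single large error inflates MSE far more than MAE:

# hand-rolled MSE vs. torch.nn.MSELoss, with one outlier-like error
import torch

y_true = torch.tensor([1.0, 2.0, 3.0, 10.0])  # the last target plays the role of an outlier
y_hat = torch.tensor([1.0, 2.0, 3.0, 4.0])

mse_manual = torch.mean((y_true - y_hat) ** 2)              # (0 + 0 + 0 + 36) / 4 = 9.0
mse_builtin = torch.nn.MSELoss()(y_hat, y_true)             # same value
mae_for_comparison = torch.mean(torch.abs(y_true - y_hat))  # only 1.5

print(mse_manual.item(), mse_builtin.item(), mae_for_comparison.item())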

# MSE loss
def mse(y, y_pred):
    return torch.mean((y - y_pred) ** 2)
# add bias term
X = np.concatenate([x, np.ones((500, 1))], axis=1)

# convert to tensors
X = torch.from_numpy(X).float()
Y = torch.from_numpy(y).float()

# initialize weights
w = torch.randn(2, 1, requires_grad=True)

lr = 0.1
rmse = []
# gradient descent
for i in range(100):
    y_pred = torch.matmul(X, w)
    loss = mse(Y, y_pred)
    loss.backward()
    with torch.no_grad():
        w -= lr * w.grad
        w.grad.zero_()
    rmse.append(mean_squared_error(y, y_pred.detach().numpy(), squared=False))
    
    if i % 10 == 0:
        print(f'Epoch {i}, loss {rmse[-1]:.4f}')
Epoch 0, loss 3.4254
Epoch 10, loss 1.3340
Epoch 20, loss 0.7770
Epoch 30, loss 0.5749
Epoch 40, loss 0.5136
Epoch 50, loss 0.4975
Epoch 60, loss 0.4935
Epoch 70, loss 0.4925
Epoch 80, loss 0.4922
Epoch 90, loss 0.4922
# plot loss
fig = go.Figure()
fig.add_trace(go.Scatter(x=np.arange(len(rmse)), y=rmse, mode='lines', name='loss'))
fig.update_layout(title='Loss', xaxis_title='epoch', yaxis_title='loss')
fig.show()

# plot data with regression line
fig = go.Figure()
fig.add_trace(go.Scatter(x=x[:, 0], y=y.flatten(), mode='markers', name='data'))
fig.add_trace(go.Scatter(x=x[:, 0], y=2 * x[:, 0] + 3, mode='lines', name='true line', line=dict(color='green')))
fig.add_trace(go.Scatter(x=x[:, 0], y=y_pred.detach().numpy().flatten(), mode='lines', name='regression line', line=dict(color='red')))
fig.update_layout(title='Data', xaxis_title='x', yaxis_title='y')
fig.show()

3. Huber Loss

We define the Huber loss function as a combination of MSE and MAE. It is less sensitive to outliers than the MSE loss function and, unlike MAE, it is differentiable at 0.

\[Huber = \frac{1}{n}\sum_{i=1}^{n}L_{\delta}(y_i - \hat{y_i})\]

\[L_{\delta}(y_i - \hat{y_i}) = \begin{cases} \frac{1}{2}(y_i - \hat{y_i})^2 & \text{for } |y_i - \hat{y_i}| \leq \delta \\ \delta|y_i - \hat{y_i}| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}\]

where \(y_i\) is the actual value and \(\hat{y_i}\) is the predicted value.

The corresponding cost function is the mean of these Huber errors. The Huber loss function is more robust to outliers compared to the MSE loss function.
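For reference, recent PyTorch versions (1.9 and later) expose this as torch.nn.HuberLoss with a delta argument; a minimal sketch with made-up numbers that matches the piecewise definition above:

# Huber loss by hand vs. torch.nn.HuberLoss (assumes PyTorch >= 1.9)
import torch

delta = 1.0
y_true = torch.tensor([0.0, 0.0, 0.0])
y_hat = torch.tensor([0.2, 0.8, 5.0])  # small, borderline and large errors

abs_diff = torch.abs(y_true - y_hat)
huber_manual = torch.mean(torch.where(abs_diff <= delta,
                                      0.5 * abs_diff ** 2,                   # quadratic region
                                      delta * abs_diff - 0.5 * delta ** 2))  # linear region
huber_builtin = torch.nn.HuberLoss(delta=delta)(y_hat, y_true)

print(huber_manual.item(), huber_builtin.item())  # both ~1.6133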

# Huber loss
def huber(y, y_pred, delta=1.0):
    abs_diff = torch.abs(y - y_pred)
    # quadratic for small errors (|error| <= delta), linear beyond that
    return torch.mean(torch.where(abs_diff <= delta,
                                  0.5 * abs_diff ** 2,
                                  delta * abs_diff - 0.5 * delta ** 2))
# add bias term
X = np.concatenate([x, np.ones((500, 1))], axis=1)

# convert to tensors
X = torch.from_numpy(X).float()
Y = torch.from_numpy(y).float()

# initialize weights
w = torch.randn(2, 1, requires_grad=True)

lr = 0.1
rmse = []
# gradient descent
for i in range(100):
    y_pred = torch.matmul(X, w)
    loss = huber(Y, y_pred)
    loss.backward()
    with torch.no_grad():
        w -= lr * w.grad
        w.grad.zero_()
    rmse.append(mean_squared_error(y, y_pred.detach().numpy(), squared=False))
    
    if i % 10 == 0:
        print(f'Epoch {i}, loss {rmse[-1]:.4f}')
Epoch 0, loss 4.7136
Epoch 10, loss 3.8319
Epoch 20, loss 3.0589
Epoch 30, loss 2.4339
Epoch 40, loss 1.9426
Epoch 50, loss 1.5518
Epoch 60, loss 1.2385
Epoch 70, loss 0.9916
Epoch 80, loss 0.8063
Epoch 90, loss 0.6778
# plot loss
fig = go.Figure()
fig.add_trace(go.Scatter(x=np.arange(len(rmse)), y=rmse, mode='lines', name='loss'))
fig.update_layout(title='Loss', xaxis_title='epoch', yaxis_title='loss')
fig.show()

# plot data with regression line
fig = go.Figure()
fig.add_trace(go.Scatter(x=x[:, 0], y=y.flatten(), mode='markers', name='data'))
fig.add_trace(go.Scatter(x=x[:, 0], y=2 * x[:, 0] + 3, mode='lines', name='true line', line=dict(color='green')))
fig.add_trace(go.Scatter(x=x[:, 0], y=y_pred.detach().numpy().flatten(), mode='lines', name='regression line', line=dict(color='red')))
fig.show()

Loss Functions for Classification

1. Binary Cross-Entropy Loss

This is the most common loss function used in classification problems. The binary cross-entropy loss decreases as the predicted probability converges to the actual label. It measures the performance of a classification model whose predicted output is a probability value between 0 and 1.

\[L = \begin{cases} -\log(\hat{y_i}) & \text{if } y_i = 1 \\ -\log(1-\hat{y_i}) & \text{if } y_i = 0 \end{cases}\]

\[L = - \dfrac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(\hat{y_i}) + (1-y_i) \log(1-\hat{y_i}) \right]\]

where \(y_i\) is the actual value and \(\hat{y_i}\) is the predicted value.
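For reference, PyTorch exposes this as torch.nn.functional.binary_cross_entropy (and as binary_cross_entropy_with_logits for raw scores, which is the numerically safer variant); a minimal sketch with made-up probabilities:

# hand-rolled binary cross-entropy vs. PyTorch's built-in version
import torch
import torch.nn.functional as F

y_true = torch.tensor([1.0, 0.0, 1.0, 0.0])
p_hat = torch.tensor([0.9, 0.1, 0.6, 0.4])  # predicted probabilities of class 1

bce_manual = -torch.mean(y_true * torch.log(p_hat) + (1 - y_true) * torch.log(1 - p_hat))
bce_builtin = F.binary_cross_entropy(p_hat, y_true)  # same value, ~0.3081

print(bce_manual.item(), bce_builtin.item())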

from sklearn.datasets import make_blobs


# generate data blobs
x, y = make_blobs(n_samples=500, centers=2, cluster_std=2, random_state=42)

color = np.where(y == 0.0, 'orange', 'blue')

# plot data
fig = go.Figure()
fig.add_trace(go.Scatter(x=x[:, 0], y=x[:, 1], mode='markers', marker=dict(color=color)))
fig.update_layout(title='Data', xaxis_title='x', yaxis_title='y')
fig.show()
def bce(y, y_pred):
    # per the formula above; assumes y_pred has already been passed through a sigmoid
    ce = -torch.mean(y * torch.log(y_pred) + (1 - y) * torch.log(1 - y_pred))
    return ce
# add bias term
X = np.concatenate([x, np.ones((500, 1))], axis=1)

# convert to tensors (labels reshaped to a column so they broadcast against the (500, 1) predictions)
X = torch.from_numpy(X).float()
Y = torch.from_numpy(y).float().reshape(-1, 1)

# initialize weights
w = torch.randn(3, 1, requires_grad=True)

lr = 0.01
accuracy = []
# gradient descent
for i in range(100):
    y_pred = torch.matmul(X, w)
    y_pred = torch.sigmoid(y_pred)
    loss = bce(Y, y_pred)
    loss.backward()
    with torch.no_grad():
        w -= lr * w.grad
        w.grad.zero_()
    
    y_pred = torch.where(y_pred > 0.5, 1.0, 0.0)
    accuracy.append(accuracy_score(y, y_pred.detach().numpy()))
    
    if i % 10 == 0:
        print(f'Epoch {i}, loss {loss:.4f}, accuracy {accuracy[-1]:.4f}')
Epoch 0, loss 0.9751, accuracy 0.5820
Epoch 10, loss 0.7332, accuracy 0.5260
Epoch 20, loss 0.6994, accuracy 0.5560
Epoch 30, loss 0.6967, accuracy 0.4740
Epoch 40, loss 0.6965, accuracy 0.4380
Epoch 50, loss 0.6964, accuracy 0.4300
Epoch 60, loss 0.6964, accuracy 0.4300
Epoch 70, loss 0.6964, accuracy 0.4280
Epoch 80, loss 0.6964, accuracy 0.4260
Epoch 90, loss 0.6963, accuracy 0.4260
# plot accuracy
fig = go.Figure()
fig.add_trace(go.Scatter(x=np.arange(len(accuracy)), y=accuracy, mode='lines', name='accuracy'))
fig.update_layout(title='Accuracy', xaxis_title='epoch', yaxis_title='accuracy')
fig.show()

# plot data with the learned decision boundary
fig = go.Figure()
fig.add_trace(go.Scatter(x=x[:, 0], y=x[:, 1], mode='markers', marker=dict(color=color)))
fig.add_trace(go.Scatter(x=x[:, 0], y=(-(w[0] * X[:, 0] + w[2]) / w[1]).detach().numpy(), mode='lines', name='decision boundary', line=dict(color='red')))
fig.show()

2. Focal Loss

We define the Focal loss function as binary cross-entropy combined with a modulating factor. The focusing parameter \(\gamma\) reduces the relative loss for well-classified examples and puts more focus on hard, misclassified examples, so that easy examples no longer dominate the training signal.

\[FL = \begin{cases} -(1-\hat{y_i})^{\gamma}\log(\hat{y_i}) & \text{if } y_i = 1 \\ -(\hat{y_i})^{\gamma}\log(1-\hat{y_i}) & \text{if } y_i = 0 \end{cases}\]

\[FL = - \dfrac{1}{m} \sum_{i=1}^{m} \left[ y_i (1 - \hat{y_i})^{\gamma} \log(\hat{y_i}) + (1-y_i) (\hat{y_i})^{\gamma} \log(1-\hat{y_i}) \right]\]

In practice, we use an \(\alpha\)-balanced variant of the focal loss that inherits the characteristics of both the weighting factor \(\alpha\) and the focusing parameter \(\gamma\), yielding slightly better accuracy than the non-balanced form.

\[ FL = \begin{cases} -\alpha(1-\hat{y_i})^{\gamma}\log(\hat{y_i}) & \text{if } y_i = 1 \\ -(1-\alpha)(\hat{y_i})^{\gamma}\log(1-\hat{y_i}) & \text{if } y_i = 0 \end{cases}\]

\[ FL = - \dfrac{1}{m} \sum_{i=1}^{m} \left[ y_i \alpha (1 - \hat{y_i})^{\gamma} \log(\hat{y_i}) + (1-y_i) (1-\alpha) (\hat{y_i})^{\gamma} \log(1-\hat{y_i}) \right]\]

where \(y_i\) is the actual label and \(\hat{y_i}\) is the predicted probability of the label.
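The implementation below follows a common shortcut and applies the modulating factor to the mean BCE; a sketch that follows the per-sample, \(\alpha\)-balanced formula above more literally (with a clamp added for numerical stability) could look like this:

# per-sample alpha-balanced focal loss, following the formula above
import torch

def focal_loss_per_sample(y, p, alpha=0.25, gamma=2.0, eps=1e-7):
    p = torch.clamp(p, eps, 1 - eps)                           # avoid log(0)
    bce_i = -(y * torch.log(p) + (1 - y) * torch.log(1 - p))   # per-sample BCE
    pt = torch.exp(-bce_i)                                     # = p where y = 1, 1 - p where y = 0
    alpha_t = alpha * y + (1 - alpha) * (1 - y)                # class-dependent weight
    return torch.mean(alpha_t * (1 - pt) ** gamma * bce_i)

# an easy example (p = 0.95) contributes almost nothing; the hard one (p = 0.10) dominates
y = torch.tensor([1.0, 1.0])
p = torch.tensor([0.95, 0.10])
print(focal_loss_per_sample(y, p).item())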

# Focal loss (shortcut: the modulating factor is applied to the mean BCE rather than per sample)
def focal_loss(y, y_pred, alpha=1, gamma=2):
    bce_loss = bce(y, y_pred)
    pt = torch.exp(-bce_loss)  # recover the "probability of being correct" from the BCE
    return alpha * (1 - pt) ** gamma * bce_loss
# add bias term
X = np.concatenate([x, np.ones((500, 1))], axis=1)

# convert to tensors (labels reshaped to a column so they broadcast against the (500, 1) predictions)
X = torch.from_numpy(X).float()
Y = torch.from_numpy(y).float().reshape(-1, 1)

# initialize weights
w = torch.randn(3, 1, requires_grad=True)

lr = 0.1
accuracy = []
# gradient descent
for i in range(100):
    y_pred = torch.matmul(X, w)
    loss = focal_loss(Y, torch.sigmoid(y_pred))
    loss.backward()
    with torch.no_grad():
        w -= lr * w.grad
        w.grad.zero_()
    accuracy.append(accuracy_score(y, [1 if p > 0.5 else 0 for p in torch.sigmoid(y_pred).detach().numpy().ravel()]))
    
    if i % 10 == 0:
        print(f'Epoch {i}, accuracy {accuracy[-1]:.4f}')
Epoch 0, accuracy 0.4980
Epoch 10, accuracy 0.3460
Epoch 20, accuracy 0.4220
Epoch 30, accuracy 0.4260
Epoch 40, accuracy 0.4280
Epoch 50, accuracy 0.4280
Epoch 60, accuracy 0.4280
Epoch 70, accuracy 0.4280
Epoch 80, accuracy 0.4280
Epoch 90, accuracy 0.4280
# plot accuracy
fig = go.Figure()
fig.add_trace(go.Scatter(x=np.arange(len(accuracy)), y=accuracy, mode='lines', name='accuracy'))
fig.update_layout(title='Accuracy', xaxis_title='epoch', yaxis_title='accuracy')
fig.show()

# plot data with the learned decision boundary
fig = go.Figure()
fig.add_trace(go.Scatter(x=x[:, 0], y=x[:, 1], mode='markers', marker=dict(color=color)))
fig.add_trace(go.Scatter(x=x[:, 0], y=(-(w[0] * X[:, 0] + w[2]) / w[1]).detach().numpy(), mode='lines', name='decision boundary', line=dict(color='red')))
fig.show()