
Evaluation Metrics

After a machine learning model is built, it is evaluated on a fresh, unseen "test" dataset to quantify its effectiveness. Intuitively, a model that makes fewer mistakes should receive a better score on a performance metric. The AI & Analytics Engine offers several metrics for each problem type.


We begin with the simplest case, binary classification. Here the goal of the model is to predict one of two labels, denoted "Positive" and "Negative". An example for which the model correctly predicts a "Positive" label is referred to as a "True Positive (TP)", and an example for which it incorrectly predicts a "Positive" label is referred to as a "False Positive (FP)". "True Negative (TN)" and "False Negative (FN)" examples are defined analogously. The counts of these four outcomes are summarized in a table known as a confusion matrix, using the following notation:


TP = \text{True positives}\\ TN = \text{True negatives}\\ FP = \text{False positives}\\ FN = \text{False negatives}
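As a minimal sketch of the definitions above, the four counts can be tallied directly from paired true and predicted labels. The function name `confusion_counts` and the example labels are hypothetical and invented purely for illustration; they are not part of the AI & Analytics Engine.

```python
def confusion_counts(y_true, y_pred, positive="Positive"):
    """Count TP, FP, TN and FN for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # predicted Positive, actually Positive
            else:
                fp += 1  # predicted Positive, actually Negative
        else:
            if truth == positive:
                fn += 1  # predicted Negative, actually Positive
            else:
                tn += 1  # predicted Negative, actually Negative
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


# Hypothetical labels, invented for illustration only.
y_true = ["Positive", "Negative", "Positive", "Negative", "Positive", "Negative"]
y_pred = ["Positive", "Positive", "Negative", "Negative", "Positive", "Negative"]

counts = confusion_counts(y_true, y_pred)
# Accuracy is the share of examples on the diagonal of the confusion matrix.
accuracy = (counts["TP"] + counts["TN"]) / len(y_true)
print(counts, accuracy)
```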

Since binary classification is only one kind of classification problem, we define the following additional notation (following the Scikit-Learn notation) in order to extend these ideas to multi-class problems:

  • y is the set of predicted (sample, label) pairs.
  • \hat{y} is the set of true (sample, label) pairs.
  • L is the set of labels.
  • y_{l} is the subset of y with label l.
  • Similarly, \hat{y_{l}} is the subset of \hat{y} with label l.
  • P(A,B):=\frac{|A \cap B|}{|A|} for some sets A and B
  • R(A,B):=\frac{|A \cap B|}{|B|} for some sets A and B
  • F_{\beta}(A,B):=(1+\beta^{2})\frac{P(A,B) \times R(A,B)}{\beta^{2}P(A,B) + R(A,B)}
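The set-based definitions above translate almost literally into Python sets. The following is a sketch only: the helper names and the example pairs are hypothetical, and returning 0.0 when a denominator is empty is an assumed convention that the definitions above do not specify.

```python
def precision(a, b):
    """P(A, B) = |A intersect B| / |A| (0.0 if A is empty, an assumed convention)."""
    return len(a & b) / len(a) if a else 0.0


def recall(a, b):
    """R(A, B) = |A intersect B| / |B| (0.0 if B is empty, an assumed convention)."""
    return len(a & b) / len(b) if b else 0.0


def f_beta(a, b, beta=1.0):
    """F_beta(A, B) = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    p, r = precision(a, b), recall(a, b)
    denom = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denom if denom else 0.0


def with_label(pairs, label):
    """Subset of (sample, label) pairs that carry the given label."""
    return {pair for pair in pairs if pair[1] == label}


# Hypothetical (sample, label) pairs, invented for illustration only.
y = {(0, "cat"), (1, "dog"), (2, "cat"), (3, "bird")}      # predicted pairs
y_hat = {(0, "cat"), (1, "cat"), (2, "cat"), (3, "bird")}  # true pairs

y_cat, y_hat_cat = with_label(y, "cat"), with_label(y_hat, "cat")
p_cat = precision(y_cat, y_hat_cat)   # 2 correct out of 2 predicted "cat"
r_cat = recall(y_cat, y_hat_cat)      # 2 correct out of 3 true "cat"
f1_cat = f_beta(y_cat, y_hat_cat)     # harmonic mean of the two
print(p_cat, r_cat, f1_cat)
```

Note that the first argument is the predicted set and the second the true set, matching the order in which y and \hat{y} are defined above.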

Now that we have these definitions, we can accurately describe the concepts of the metrics we use to evaluate the results of classification models.

Name | Formula | Concept
Accuracy | \frac{TP + TN}{TP + FP + TN + FN} | Ratio of correctly predicted examples (positive or negative) to all examples, as in scikit-learn
Precision-micro | P(y,\hat{y}) | The number of correctly predicted labels out of all predicted labels
Precision-macro | \frac{1}{\mid L\mid}\sum_{l\in L}P(y_l, \hat{y_l}) | The unweighted average of the precision scores of all classes
Precision-weighted | \frac{1}{\sum_{l\in L}\mid\hat{y_l}\mid}\sum_{l\in L}\mid\hat{y_l}\mid P(y_l, \hat{y_l}) | The average of the precision scores of all classes, weighted by the number of true examples of each class
Recall-micro | R(y,\hat{y}) | The number of correctly predicted labels out of all actual labels
Recall-macro | \frac{1}{\mid L\mid}\sum_{l\in L}R(y_l, \hat{y_l}) | The unweighted average of the recall scores of all classes
Recall-weighted | \frac{1}{\sum_{l\in L}\mid\hat{y_l}\mid}\sum_{l\in L}\mid\hat{y_l}\mid R(y_l, \hat{y_l}) | The average of the recall scores of all classes, weighted by the number of true examples of each class
F1-micro | F_{1}(y,\hat{y}) | F_{1} is F_{\beta} with \beta = 1, the harmonic mean of precision and recall. Since precision equals recall in the micro-averaging case, both are also equal to their harmonic mean. In other words, micro-F1 = micro-precision = micro-recall
F1-macro | \frac{1}{\mid L\mid}\sum_{l\in L}F_{1}(y_l, \hat{y_l}) | The unweighted average of the F1 scores of all classes
F1-weighted | \frac{1}{\sum_{l\in L}\mid\hat{y_l}\mid}\sum_{l\in L}\mid\hat{y_l}\mid F_{1}(y_l, \hat{y_l}) | The average of the F1 scores of all classes, weighted by the number of true examples of each class
roc_auc |  | Area under the ROC curve. See the scikit-learn-roc page.
Average precision | AP=\sum_{n}(R_{n}-R_{n-1})P_{n} | P_{n} and R_{n} are the precision and recall at the n-th threshold. See the Wiki-average-precision page.
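To make the difference between the micro, macro and weighted averages in the table concrete, the sketch below computes all three for precision from plain lists of labels. The function names and the example labels are hypothetical, and a class that is never predicted is given a precision of 0.0 here, which is an assumed convention.

```python
from collections import Counter


def per_label_counts(y_true, y_pred):
    """Return Counters of correct, predicted and actual examples per label."""
    correct, predicted, actual = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        predicted[pred] += 1
        actual[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return correct, predicted, actual


def precision_averages(y_true, y_pred):
    """Micro, macro and weighted precision over all labels."""
    correct, predicted, actual = per_label_counts(y_true, y_pred)
    labels = sorted(set(actual) | set(predicted))
    # Per-class precision; 0.0 when a class is never predicted (assumed convention).
    per_label = {
        label: correct[label] / predicted[label] if predicted[label] else 0.0
        for label in labels
    }
    micro = sum(correct.values()) / sum(predicted.values())
    macro = sum(per_label.values()) / len(labels)
    # Weighted average uses the number of true examples of each class as weight.
    weighted = sum(actual[label] * per_label[label] for label in labels) / sum(actual.values())
    return {"micro": micro, "macro": macro, "weighted": weighted}


# Hypothetical labels, invented for illustration only.
y_true = ["cat", "cat", "cat", "cat", "dog", "dog", "bird"]
y_pred = ["cat", "cat", "cat", "dog", "dog", "bird", "bird"]

scores = precision_averages(y_true, y_pred)
print(scores)
```

In this toy example the three averages differ because the classes have different sizes: the macro average treats every class equally, while the micro and weighted averages are pulled towards the larger "cat" class.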


The task of regression is to predict the value of a continuous dependent variable from one or more feature variables (ideally, independent variables). To build intuition, consider the simplest form of regression, linear regression. It is linear because the equation for the prediction of the dependent variable is linear in the parameters of the model. Assume we have n features, denoted x_{1}, ..., x_{n}, and define x_{0} = 1 for convenience, which gives the n+1 entries x_{0}, x_{1}, ..., x_{n}. We denote the model parameters as \theta_{0}, \theta_{1}, ..., \theta_{n}. Lastly, we denote the predicted value of the target variable as \hat{y}=h_{\theta}(x)= \sum_{i=0}^{n}\theta_{i}x_{i}=\boldsymbol{\theta}^{T}\boldsymbol{x} where:

\boldsymbol{x}=\begin{bmatrix}x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} \in \mathbb{R}^{n+1}, \quad \boldsymbol{\theta}=\begin{bmatrix}\theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix} \in \mathbb{R}^{n+1}

This form of linear regression, with several features, is called multiple linear regression. As an example, one can imagine predicting the price of a house based on a set of features such as land area, number of rooms, distance to public transport, date of construction, etc.
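The prediction \hat{y}=\boldsymbol{\theta}^{T}\boldsymbol{x} is simply a dot product once x_{0} = 1 has been prepended to the features. The sketch below illustrates this; the function name `predict`, the parameter values and the feature values are all hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def predict(theta, features):
    """Return theta^T x, where x is the feature list with x_0 = 1 prepended."""
    x = [1.0] + list(features)
    if len(theta) != len(x):
        raise ValueError("theta needs exactly one more entry than features")
    return sum(t * xi for t, xi in zip(theta, x))


# Hypothetical parameters: intercept, weight for land area, weight for rooms.
theta = [50.0, 2.0, 10.0]
# Hypothetical house: land area 120, 3 rooms.
house = [120.0, 3.0]

price = predict(theta, house)  # 50*1 + 2*120 + 10*3 = 320.0
print(price)
```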

In order to view a simple visual example, we demonstrate a one-dimensional linear regression (n=1) where we predict house prices as a function of the average number of rooms. Data was taken from the UCI "Boston house prices dataset". The result of the regression task is shown below.

If we had used two feature variables for the prediction, such as "average number of rooms per dwelling" and "weighted distance to five Boston employment centers" we would have a 3D plot, with the regression result displayed as a 2D plane.
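For the one-feature case, the line can be fitted in closed form. The sketch below assumes ordinary least squares, which is a common choice but is not stated in the text as the method used for the figure. The data points are made up for illustration and are not taken from the Boston dataset.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up values (NOT the Boston data): average rooms and a price.
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
prices = [11.0, 15.0, 22.0, 28.0, 34.0]

intercept, slope = fit_line(rooms, prices)
print(intercept, slope)
```

Here the intercept and slope play the roles of \theta_{0} and \theta_{1} in the notation above.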

Now that the regression concept has been explained, we can move on to the definitions of the evaluation metrics. We denote \hat{y}_i as the predicted value of the i-th sample and y_i as the corresponding true value. Note that this is the opposite of the classification notation above, where \hat{y} denotes the true labels. We assume we have n_{samples} samples.

Name | Formula | Concept
Mean absolute error | MAE(y,\hat{y}) = \frac{1}{n_{samples}} \sum_{i=1}^{n_{samples}} \mid y_i-\hat{y}_i \mid | The MAE computes a risk metric corresponding to the expected value of the absolute error loss, or l_1-norm loss.
Mean squared error | MSE(y,\hat{y}) = \frac{1}{n_{samples}} \sum_{i=1}^{n_{samples}}(y_i-\hat{y}_i)^2 | The MSE computes a risk metric corresponding to the expected value of the squared (quadratic) error, or l_2-norm loss.
Median absolute error | MedAE(y,\hat{y}) = median(\mid y_1-\hat{y}_1 \mid, ..., \mid y_{n_{samples}}-\hat{y}_{n_{samples}} \mid) | The MedAE is particularly interesting because it is robust to outliers. The loss is calculated by taking the median of all absolute differences between the target and the prediction.
R^2 | R^2(y,\hat{y}) = 1 - \frac{\sum_{i=1}^{n_{samples}}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{n_{samples}}(y_i-\bar{y})^2} where \bar{y}=\frac{1}{n_{samples}} \sum_{i=1}^{n_{samples}}y_i | The R^2 is the proportion of variance in the dependent variable that is predictable from the independent variable(s). The best possible score is 1.0, and the score can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an R^2 score of 0.0.
Explained variance | \text{explained\_variance}(y,\hat{y}) = 1 - \frac{Var(y-\hat{y})}{Var(y)} | Also called "explained variation": the proportion to which the model accounts for the variation in the data. The best possible score is 1.0; lower values are worse.
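The regression metrics in the table can be computed with the Python standard library alone, as in the sketch below. The function name `regression_metrics` and the four data points are hypothetical; the population variance (`statistics.pvariance`) is assumed for Var, although the ratio would be the same with the sample variance.

```python
from statistics import mean, median, pvariance


def regression_metrics(y_true, y_pred):
    """MAE, MSE, MedAE, R^2 and explained variance for paired values."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    sq_errors = [e * e for e in errors]
    y_bar = mean(y_true)
    ss_res = sum(sq_errors)                         # residual sum of squares
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)  # total sum of squares
    return {
        "mae": mean(abs_errors),
        "mse": mean(sq_errors),
        "medae": median(abs_errors),
        "r2": 1 - ss_res / ss_tot,
        "explained_variance": 1 - pvariance(errors) / pvariance(y_true),
    }


# Illustrative values only, not taken from any real dataset.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

The two scores R^2 and explained variance differ here because the errors do not average to zero: explained variance ignores a constant bias in the predictions, whereas R^2 penalises it.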

Time Series

The last group of evaluation metrics relates to time series forecasting. A time series is a chronological sequence of observations of some target variable. Usually, the target variable is measured at fixed time intervals (daily, monthly, etc.). A time series plot displays one or more target variables as a function of a date column.


We extracted COVID-19 data from the Johns Hopkins CSSE. Focusing only on Italy, the data is arranged as follows:

Date | Country | Confirmed cases | Fatalities
... | ... | ... | ...
2020-02-20 | Italy | 3 | 0
2020-02-21 | Italy | 20 | 1
2020-02-22 | Italy | 62 | 2
2020-02-23 | Italy | 155 | 3
2020-02-24 | Italy | 229 | 7
2020-02-25 | Italy | 322 | 10
2020-02-26 | Italy | 453 | 12
2020-02-27 | Italy | 655 | 17
2020-02-28 | Italy | 888 | 21
2020-02-29 | Italy | 1128 | 29
... | ... | ... | ...

Plotting the corresponding time series reveals the clear (and horrific) trend in the data: a linear (at least not exponential) increase in the number of infected and deceased people due to the virus.

Forecasting in time series is somewhat similar to regression. We attempt to find a model that will "follow" and "forecast" the trends that exist in the data. In this section we won't explain how this is done. Instead, we'll assume we have some forecasted time series and we shall proceed to the evaluation metrics.

  • Unlike regression, we do NOT want to use "goodness-of-fit" approaches, because they are typically computed from the residuals of the fitted data and do not really reflect the capability of the forecasting technique to successfully predict future observations.

  • We are interested in the accuracy of future forecasts, which is measured by the "out-of-sample" forecast error.
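One common way to obtain an out-of-sample error is to hold out the most recent observations and never let the model see them while fitting. This is a sketch of that idea only; the text does not state how the AI & Analytics Engine splits a series, and the function name `chronological_split` is hypothetical. The values are the confirmed cases for Italy from the table above.

```python
def chronological_split(series, n_holdout):
    """Split a series into an in-sample part and the last n_holdout points."""
    if not 0 < n_holdout < len(series):
        raise ValueError("n_holdout must be between 1 and len(series) - 1")
    # No shuffling: the order of observations must be preserved.
    return series[:-n_holdout], series[-n_holdout:]


# Confirmed cases for Italy, 2020-02-20 to 2020-02-29, from the table above.
confirmed = [3, 20, 62, 155, 229, 322, 453, 655, 888, 1128]

train, test = chronological_split(confirmed, 3)
print(train)  # used to fit the forecasting model
print(test)   # used only to measure out-of-sample forecast error
```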

To formulate the metrics, we define some mathematical notation. We assume we have T periods of data and denote y_t as the observation at time t. Thus, we have a time series: y_1, y_2, ..., y_T. We further define \hat{y}_t(t-l) as the forecast of y at time t, made at time period t-l, where l is called the forecast lead time. The forecast error that results from a forecast made at time t-l for time t is called the l \text{-step-ahead forecast error}:

e_t(l) = y_t - \hat{y}_t(t-l)


It is customary to evaluate forecasting model performance using the one-step-ahead forecast errors (l = 1):

e_t(1) = y_t - \hat{y}_t(t-1)


Lastly, we assume that there are n observations and forecasts and that n one-step-ahead forecast errors were calculated: e_t(1), t=1,2,...,n. Before we present the metrics, we need to remember that some of them are expressed in the units of the data, so they cannot be used to compare models across series of different scales or across different time periods. This means we need to define a relative forecast error (in percent), which mitigates this issue.

re_t(1)=\left(\frac{y_t-\hat{y}_t(t-1)}{y_t}\right) \times 100 = \left(\frac{e_t(1)}{y_t}\right) \times 100
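The sketch below computes one-step-ahead errors and their relative counterparts for the Italian confirmed cases listed earlier. Because no forecasting model is described here, a naive forecast (each forecast equals the previous observation) is used purely as a stand-in; that choice and the function name are assumptions, not the method used by the AI & Analytics Engine.

```python
def one_step_errors(actual, forecast):
    """Return (e_t(1), re_t(1)) lists for paired observations and forecasts."""
    errors = [y - f for y, f in zip(actual, forecast)]
    # The relative error is undefined when an observation is zero.
    if any(y == 0 for y in actual):
        raise ValueError("relative error is undefined for y_t = 0")
    relative = [100 * e / y for e, y in zip(errors, actual)]
    return errors, relative


# Confirmed cases for Italy, 2020-02-20 to 2020-02-29, from the table above.
confirmed = [3, 20, 62, 155, 229, 322, 453, 655, 888, 1128]

# Assumed stand-in forecast: the value observed one period earlier.
actual = confirmed[1:]
forecast = confirmed[:-1]

errors, relative_errors = one_step_errors(actual, forecast)
print(errors)
print(relative_errors)
```

The absolute errors grow with the size of the series, while the relative errors are expressed in percent and can therefore be compared across periods.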

Using this definition, we can now define some of the metrics the AI & Analytics Engine presents:

Name | Formula | Concept
MAE | MAE = \frac{1}{n} \sum_{t=1}^{n} \mid e_t(1) \mid | Mean Absolute Error
MSE | MSE = \frac{1}{n} \sum_{t=1}^{n} e_t(1)^2 | Mean Squared Error
RMSE | RMSE = \sqrt{\frac{1}{n} \sum_{t=1}^{n} e_t(1)^2} | Root Mean Squared Error
MAPE | MAPE = \frac{1}{n} \sum_{t=1}^{n} \mid re_t(1) \mid | Mean Absolute Percent Error
SMAPE | SMAPE = \frac{100 \%}{n} \sum_{t=1}^{n} \frac{\mid y_t-\hat{y}_t \mid}{(\mid y_t \mid + \mid \hat{y}_t \mid) /2} | Symmetric Mean Absolute Percentage Error. Unlike the MAPE, the SMAPE has both a lower (0%) and an upper (200%) bound
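The forecasting metrics in the table can be sketched as follows for paired observations and one-step-ahead forecasts. The function name `forecast_metrics` and the three data points are hypothetical, and the code assumes that no observation is zero (otherwise MAPE is undefined) and that no observation and forecast are both zero (otherwise SMAPE is undefined).

```python
from math import sqrt


def forecast_metrics(actual, forecast):
    """MAE, MSE, RMSE, MAPE and SMAPE for one-step-ahead forecasts."""
    n = len(actual)
    errors = [y - f for y, f in zip(actual, forecast)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    # MAPE averages the absolute relative errors, expressed in percent.
    mape = sum(abs(100 * e / y) for e, y in zip(errors, actual)) / n
    # SMAPE divides by the mean magnitude of observation and forecast.
    smape = 100 / n * sum(
        abs(y - f) / ((abs(y) + abs(f)) / 2) for y, f in zip(actual, forecast)
    )
    return {"mae": mae, "mse": mse, "rmse": sqrt(mse), "mape": mape, "smape": smape}


# Hypothetical observations and forecasts, invented for illustration only.
actual = [100.0, 200.0, 400.0]
forecast = [110.0, 180.0, 400.0]

metrics = forecast_metrics(actual, forecast)
print(metrics)
```

MAE, MSE and RMSE are in the units of the data (or its square), whereas MAPE and SMAPE are percentages and are therefore better suited to comparing series of different scales.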