Evaluation Metrics
After a machine learning model is built, it is evaluated on a fresh, unseen "test" dataset to quantify its effectiveness. Intuitively, the fewer mistakes a model makes, the better its performance metric should be. The AI & Analytics Engine offers several metrics for each problem type.
Classification
We begin with the simplest case, binary classification. Here the goal of the model is to predict one of two labels, denoted "Positive" and "Negative". An example for which the model correctly predicts a "Positive" label is called a "True Positive (TP)"; an example for which it incorrectly predicts a "Positive" label is called a "False Positive (FP)". "True Negative (TN)" and "False Negative (FN)" examples are defined analogously. These counts can be summarized in the following table (called a confusion matrix):
where:
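Since binary classification is only a special case of classification, we define the following additional notation (following the scikit-learn notation) in order to extend the metrics to multiclass problems. Note that in this section y holds the predictions and \hat{y} the true labels, which is the reverse of the convention used in the Regression section below: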
 y is the set of predicted (sample, label) pairs.
 \hat{y} is the set of true (sample, label) pairs.
 L is the set of labels.
 y_{l} is the subset of y with label l.
 Similarly, \hat{y_{l}} is the subset of \hat{y} with label l.
 P(A,B):=\frac{\mid A \cap B \mid}{\mid A \mid} for some sets A and B
 R(A,B):=\frac{\mid A \cap B \mid}{\mid B \mid} for some sets A and B
 F_{\beta}(A,B):=(1+\beta^{2})\frac{P(A,B) \times R(A,B)}{\beta^{2}P(A,B) + R(A,B)}
With these definitions in place, we can precisely describe the metrics we use to evaluate the results of classification models.
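As an illustration of the set-based definitions above, here is a minimal Python sketch. The function names and the toy data are hypothetical, and returning 0.0 for an empty denominator is an assumption made here, not something the definitions specify.

```python
def precision(a, b):
    """P(A, B) = |A ∩ B| / |A|. Returns 0.0 for an empty A (assumed convention)."""
    return len(a & b) / len(a) if a else 0.0


def recall(a, b):
    """R(A, B) = |A ∩ B| / |B|. Returns 0.0 for an empty B (assumed convention)."""
    return len(a & b) / len(b) if b else 0.0


def f_beta(a, b, beta=1.0):
    """F_beta(A, B) = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    p, r = precision(a, b), recall(a, b)
    denom = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denom if denom else 0.0


# Following this section's notation: y holds the predicted (sample, label)
# pairs and y_hat holds the true pairs. The data is made up for illustration.
y = {(0, "cat"), (1, "cat"), (2, "dog"), (3, "dog")}
y_hat = {(0, "cat"), (1, "dog"), (2, "dog"), (3, "dog")}

p = precision(y, y_hat)   # 3 of the 4 predicted pairs are correct
r = recall(y, y_hat)      # 3 of the 4 true pairs were recovered
f1 = f_beta(y, y_hat)     # harmonic mean of p and r (beta = 1)
print(p, r, f1)
```

Because every sample has exactly one predicted and one true label here, both sets have the same size, which is why P and R coincide when computed over all pairs.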
Name  Formula  Concept 

Accuracy  \frac{TP + TN}{TP + FP + TN + FN}  Ratio of correctly predicted examples (positive or negative) to all examples, as in scikit-learn 
Precision-micro  P(y,\hat{y})  The number of correctly predicted labels out of all predicted labels 
Precision-macro  \frac{1}{\mid L\mid}\sum_{l\in L}P(y_l, \hat{y_l})  The unweighted average of precision scores for all classes 
Precision-weighted  \frac{1}{\sum_{l\in L}\mid\hat{y_l}\mid}\sum_{l\in L}\mid\hat{y_l}\mid P(y_l, \hat{y_l})  The average of precision scores for all classes, weighted by the number of true examples of each class 
Recall-micro  R(y,\hat{y})  The number of correctly predicted labels out of all actual labels 
Recall-macro  \frac{1}{\mid L\mid}\sum_{l\in L}R(y_l, \hat{y_l})  The unweighted average of recall scores for all classes 
Recall-weighted  \frac{1}{\sum_{l\in L}\mid\hat{y_l}\mid}\sum_{l\in L}\mid\hat{y_l}\mid R(y_l, \hat{y_l})  The average of recall scores for all classes, weighted by the number of true examples of each class 
F1-micro  F_{\beta}(y,\hat{y}) with \beta=1  Since precision equals recall in the micro-averaging case, both also equal their harmonic mean. In other words, micro-F1, micro-precision and micro-recall coincide 
F1-macro  \frac{1}{\mid L\mid}\sum_{l\in L}F_{\beta}(y_l, \hat{y_l}) with \beta=1  The unweighted average of F1 scores for all classes 
F1-weighted  \frac{1}{\sum_{l\in L}\mid\hat{y_l}\mid}\sum_{l\in L}\mid\hat{y_l}\mid F_{\beta}(y_l, \hat{y_l}) with \beta=1  The average of F1 scores for all classes, weighted by the number of true examples of each class 
roc_auc  Area under the ROC curve. See the scikit-learn ROC page.  
Average precision  AP=\sum_{n}(R_{n} - R_{n-1})P_{n}  P_{n} and R_{n} are the precision and recall at the n-th threshold. See the Wiki-average-precision page. 
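The difference between micro, macro and weighted averaging is easiest to see on a small example. The following Python sketch applies the precision formulas from the table to made-up data; the helper names (`with_label`, `macro_precision`, `weighted_precision`) are hypothetical and are not part of any library.

```python
def precision(a, b):
    """P(A, B) = |A ∩ B| / |A|. Returns 0.0 for an empty A (assumed convention)."""
    return len(a & b) / len(a) if a else 0.0


def with_label(pairs, label):
    """Subset of (sample, label) pairs carrying the given label."""
    return {pair for pair in pairs if pair[1] == label}


def macro_precision(y, y_hat, labels):
    """Unweighted mean of the per-class precision scores."""
    scores = [precision(with_label(y, l), with_label(y_hat, l)) for l in labels]
    return sum(scores) / len(labels)


def weighted_precision(y, y_hat, labels):
    """Per-class precision weighted by the number of true examples per class."""
    support = {l: len(with_label(y_hat, l)) for l in labels}
    total = sum(support.values())
    return sum(
        support[l] * precision(with_label(y, l), with_label(y_hat, l))
        for l in labels
    ) / total


# y: predicted pairs, y_hat: true pairs (this section's convention).
y = {(0, "cat"), (1, "cat"), (2, "dog"), (3, "dog")}
y_hat = {(0, "cat"), (1, "dog"), (2, "dog"), (3, "dog")}
labels = ["cat", "dog"]

micro = precision(y, y_hat)                     # pooled over all pairs
macro = macro_precision(y, y_hat, labels)       # mean of 1/2 (cat) and 1 (dog)
weighted = weighted_precision(y, y_hat, labels)  # (1 * 1/2 + 3 * 1) / 4
print(micro, macro, weighted)
```

In this toy data the "dog" class has three true examples and a perfect precision, so the weighted average is pulled above the macro average. The same averaging pattern applies to recall and F1 by swapping the per-class score.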
Regression
The task of regression is to predict the value of a continuous dependent variable from one or more feature variables (ideally, independent variables). The simplest form of regression is linear regression. It is called linear because the equation for the prediction of the dependent variable is linear in the parameters of the model. Assume we have n feature variables x_{1}, ..., x_{n}, and define x_{0} = 1 for convenience, which gives n+1 inputs x_{0}, x_{1}, ..., x_{n}. We denote the model parameters by \theta_{0}, \theta_{1}, ..., \theta_{n}. Lastly, we denote the predicted value of the target variable as \hat{y}=h_{\theta}(x)= \sum_{i=0}^{n}\theta_{i}x_{i}=\boldsymbol{\theta}^{T}\boldsymbol{x} where:
\boldsymbol{x}=\begin{bmatrix}x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} \in \mathbb{R}^{n+1}, \quad \boldsymbol{\theta}=\begin{bmatrix}\theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix} \in \mathbb{R}^{n+1}
This form of linear regression is called multiple linear regression. As an example, one can imagine predicting the price of a house based on a set of features such as land area, number of rooms, distance to public transport, date of construction, etc.
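The prediction itself is just an inner product between the parameter vector and the feature vector with a leading 1. A minimal Python sketch follows; the function name `predict` and the parameter values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def predict(theta, features):
    """Linear prediction: sum of theta_i * x_i for i = 0..n, with x_0 = 1."""
    x = [1.0] + list(features)  # prepend the constant input x_0 = 1
    if len(theta) != len(x):
        raise ValueError("theta must have one more entry than features")
    return sum(t * xi for t, xi in zip(theta, x))


# Hypothetical parameters: intercept, weight per unit of land area,
# weight per room. These values are illustrative, not fitted to any data.
theta = [50.0, 2.0, 10.0]
price = predict(theta, [100.0, 3.0])  # 50 + 2 * 100 + 10 * 3
print(price)
```

Treating the intercept as \theta_{0} with x_{0} = 1 is what lets the whole model be written as a single inner product.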
As a simple visual example, we demonstrate a one-dimensional linear regression (n=1) where we predict house prices as a function of the average number of rooms. Data was taken from the UCI "Boston house prices dataset". The result of the regression task is shown below.
If we had used two feature variables for the prediction, such as "average number of rooms per dwelling" and "weighted distance to five Boston employment centers" we would have a 3D plot, with the regression result displayed as a 2D plane.
Now that the regression concept has been explained, we can move on to the definitions of the evaluation metrics. We denote \hat{y}_i as the predicted value of the i-th sample and y_i as the corresponding true value. We assume we have n_{samples} samples.
Name  Formula  Concept 

Mean absolute error  MAE(y,\hat{y}) = \frac{1}{n_{samples}} \sum_{i=0}^{n_{samples}-1} \mid y_i - \hat{y}_i \mid  The MAE is a risk metric corresponding to the expected value of the absolute error loss, or \ell_1-norm loss. 
Mean squared error  MSE(y,\hat{y}) = \frac{1}{n_{samples}} \sum_{i=0}^{n_{samples}-1}(y_i - \hat{y}_i)^2  The MSE is a risk metric corresponding to the expected value of the squared (quadratic) error, or \ell_2 loss. 
Median absolute error  MedAE(y,\hat{y}) = median(\mid y_1 - \hat{y}_1 \mid, ..., \mid y_n - \hat{y}_n \mid)  The MedAE is particularly interesting because it is robust to outliers. The loss is calculated by taking the median of all absolute differences between the target and the prediction. 
R^2  R^2(y,\hat{y}) = 1 - \frac{\sum_{i=1}^n(y_i - \hat{y}_i)^2}{\sum_{i=1}^n(y_i - \bar{y})^2} where \bar{y}=\frac{1}{n} \sum_{i=1}^{n}y_i and n = n_{samples}  The R^2 is the proportion of variance in the dependent variable that is predictable from the independent variable(s). The best possible score is 1.0, and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an R^2 score of 0.0. 
Explained variance  explained\_variance(y,\hat{y}) = 1 - \frac{Var(y - \hat{y})}{Var(y)}  Also called "explained variation": the proportion of the variation in the target that the model accounts for. The best possible score is 1.0; lower values are worse. 
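The regression metrics in the table can be sketched in a few lines of standard-library Python. This is an illustration of the formulas only, not the engine's implementation. The function names and sample values are hypothetical, and the population variance is assumed for Var.

```python
from statistics import mean, median, pvariance


def mae(y_true, y_pred):
    """Mean absolute error."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def mse(y_true, y_pred):
    """Mean squared error."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def medae(y_true, y_pred):
    """Median absolute error (robust to outliers)."""
    return median(abs(t - p) for t, p in zip(y_true, y_pred))


def r2(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares)."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot  # undefined if y_true is constant


def explained_variance(y_true, y_pred):
    """1 - Var(y - y_hat) / Var(y), using the population variance."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1 - pvariance(residuals) / pvariance(y_true)


# Made-up true values and predictions, for illustration only.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print(mae(y_true, y_pred), mse(y_true, y_pred), medae(y_true, y_pred))
print(r2(y_true, y_pred), explained_variance(y_true, y_pred))
```

R^2 and explained variance differ only when the residuals have a non-zero mean: explained variance ignores a constant bias in the predictions, whereas R^2 penalizes it. In this example the mean residual is -0.25, so the two scores are close but not equal.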
Time Series
The last group of evaluation metrics relates to time series forecasting. A time series is a chronological sequence of observations of some target variable. Usually, the target variable is measured at fixed time intervals (daily, monthly, etc.). A time series plot can display one or many target variables as a function of a date column.
Example:
We extracted COVID-19 data from the Johns Hopkins CSSE. Focusing only on Italy, the data is arranged as follows:
Date  Country  Confirmed cases  Fatalities 

...  ...  ...  ... 
2020-02-20  Italy  3  0 
2020-02-21  Italy  20  1 
2020-02-22  Italy  62  2 
2020-02-23  Italy  155  3 
2020-02-24  Italy  229  7 
2020-02-25  Italy  322  10 
2020-02-26  Italy  453  12 
2020-02-27  Italy  655  17 
2020-02-28  Italy  888  21 
2020-02-29  Italy  1128  29 
...  ...  ...  ... 
If we plot the corresponding time series, we can observe the data and the clear (and horrific) trend of a linear (at least not exponential) increase in the number of infected and deceased people due to the virus.
Forecasting in time series is somewhat similar to regression. We attempt to find a model that will "follow" and "forecast" the trends that exist in the data. In this section we won't explain how this is done. Instead, we'll assume we have some forecasted time series and we shall proceed to the evaluation metrics.

Unlike regression, we do NOT want to use "goodness-of-fit" approaches, because they often rely on the residuals of the fitted data and do not really reflect the capability of the forecasting technique to successfully predict future observations.

We are interested in the accuracy of future forecasts, which is measured by the "out-of-sample" forecast error.
To formulate the metrics, we define some notation. We assume we have T periods of data and denote y_t as the observation at time t, so the time series is y_1, y_2, ..., y_T. We further define \hat{y}_t(t-l) as the forecast of y_t made at time period t-l, where l is called the forecast lead time. The forecast error that results from a forecast made at time t-l for time t is called the l-step-ahead forecast error: e_t(l) = y_t - \hat{y}_t(t-l).
It is customary to evaluate forecasting model performance using the one-step-ahead forecast errors: e_t(1) = y_t - \hat{y}_t(t-1).
Lastly, we assume that there are n observations and forecasts and that n one-step-ahead forecast errors were calculated: e_t(1), t=1,2,...,n. Before we present the metrics, we need to remember that some of them are absolute in value, so they cannot be used to compare models across series of different scales or across different time periods. To mitigate this issue, we define a relative forecast error (in percent): re_t(1) = 100 \cdot \frac{y_t - \hat{y}_t(t-1)}{y_t} = 100 \cdot \frac{e_t(1)}{y_t}.
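To make the two error types concrete, the sketch below computes one-step-ahead errors and relative errors for a short made-up series. It assumes the relative error is the forecast error divided by the observation, expressed in percent (a common definition, stated here as an assumption). The function names are hypothetical.

```python
def one_step_errors(actual, forecast):
    """e_t(1) = y_t - forecast of y_t made one period earlier."""
    return [y - f for y, f in zip(actual, forecast)]


def relative_errors(actual, forecast):
    """re_t(1) = 100 * e_t(1) / y_t, in percent (assumed definition).

    Undefined when an observation y_t equals zero.
    """
    return [100.0 * (y - f) / y for y, f in zip(actual, forecast)]


# Made-up observations and the one-step-ahead forecasts for the same periods.
actual = [100.0, 200.0, 400.0]
forecast = [90.0, 220.0, 400.0]

errors = one_step_errors(actual, forecast)
rel_errors = relative_errors(actual, forecast)
print(errors, rel_errors)
```

The absolute errors of the first two periods differ by a factor of two, yet both forecasts are off by the same 10% of the observed value, which is the scale independence the relative error is meant to provide.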
Using these definitions, we can now define some of the metrics the AI & Analytics Engine presents:
Name  Formula  Concept 

MAE  MAE = \frac{1}{n} \sum_{t=1}^{n} \mid e_t(1) \mid  Mean Absolute Error 
MSE  MSE = \frac{1}{n} \sum_{t=1}^{n} e_t(1)^2  Mean Squared Error 
RMSE  RMSE = \sqrt{\frac{1}{n} \sum_{t=1}^{n} e_t(1)^2}  Root Mean Squared Error 
MAPE  MAPE = \frac{1}{n} \sum_{t=1}^{n} \mid re_t(1) \mid  Mean Absolute Percent Error 
SMAPE  SMAPE = \frac{100 \%}{n} \sum_{t=1}^{n} \frac{\mid y_t - \hat{y}_t \mid}{(\mid y_t \mid + \mid \hat{y}_t \mid) /2}  Symmetric Mean Absolute Percentage Error. Unlike the MAPE, SMAPE has both a lower (0%) and an upper (200%) bound 
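The forecasting metrics in the table can be sketched together in standard-library Python. This is an illustration of the formulas under stated assumptions, not the engine's implementation. The function name `forecast_metrics` and the data are hypothetical, and MAPE assumes the relative error is 100 times the forecast error divided by the observation.

```python
from math import sqrt


def forecast_metrics(actual, forecast):
    """Compute MAE, MSE, RMSE, MAPE and SMAPE from one-step-ahead forecasts.

    MAPE is undefined if any observation is zero. SMAPE is undefined if an
    observation and its forecast are both zero.
    """
    n = len(actual)
    errors = [y - f for y, f in zip(actual, forecast)]
    mse = sum(e * e for e in errors) / n
    return {
        "MAE": sum(abs(e) for e in errors) / n,
        "MSE": mse,
        "RMSE": sqrt(mse),
        # Assumed relative error: 100 * e_t(1) / y_t, in percent.
        "MAPE": sum(abs(100.0 * e / y) for e, y in zip(errors, actual)) / n,
        "SMAPE": 100.0 / n * sum(
            abs(y - f) / ((abs(y) + abs(f)) / 2)
            for y, f in zip(actual, forecast)
        ),
    }


# Made-up observations and one-step-ahead forecasts, for illustration only.
actual = [100.0, 200.0, 400.0]
forecast = [90.0, 220.0, 400.0]

metrics = forecast_metrics(actual, forecast)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

MAE, MSE and RMSE are expressed in the units of the series (squared units for MSE), so they are only comparable between models evaluated on the same series. MAPE and SMAPE are percentages and can be compared across series of different scales.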