Prediction Tasks - Classification
Introduction
Each app is dedicated to a single prediction task and is trained on one or more datasets. It gives the user a space to define the task and to train and compare several models for it. The "classification" app, as its name suggests, performs classification.
Data Preparation for Classification
In addition to the steps completed in earlier stages of the "data preparation module", the dataset must contain a target column, and that column must be of categorical type.
Example:
Predicting whether a tumour is malignant (M) or benign (B) from its measurements. A sample of the dataset for this task is shown below, with `diagnosis` being the target column:
| | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | fractal_dimension_mean | radius_se | texture_se | perimeter_se | area_se | smoothness_se | compactness_se | concavity_se | concave points_se | symmetry_se | fractal_dimension_se | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | B | 13.62 | 23.23 | 87.19 | 573.2 | 0.09246 | 0.06747 | 0.02974 | 0.02443 | 0.1664 | 0.05801 | 0.346 | 1.336 | 2.066 | 31.24 | 0.005868 | 0.02099 | 0.02021 | 0.009064 | 0.02087 | 0.002583 | 15.35 | 29.09 | 97.58 | 729.8 | 0.1216 | 0.1517 | 0.1049 | 0.07174 | 0.2642 | 0.06953 |
1 | B | 11.66 | 17.07 | 73.7 | 421 | 0.07561 | 0.0363 | 0.008306 | 0.01162 | 0.1671 | 0.05731 | 0.3534 | 0.6724 | 2.225 | 26.03 | 0.006583 | 0.006991 | 0.005949 | 0.006296 | 0.02216 | 0.002668 | 13.28 | 19.74 | 83.61 | 542.5 | 0.09958 | 0.06476 | 0.03046 | 0.04262 | 0.2731 | 0.06825 |
2 | B | 12.76 | 13.37 | 82.29 | 504.1 | 0.08794 | 0.07948 | 0.04052 | 0.02548 | 0.1601 | 0.0614 | 0.3265 | 0.6594 | 2.346 | 25.18 | 0.006494 | 0.02768 | 0.03137 | 0.01069 | 0.01731 | 0.004392 | 14.19 | 16.4 | 92.04 | 618.8 | 0.1194 | 0.2208 | 0.1769 | 0.08411 | 0.2564 | 0.08253 |
3 | M | 21.71 | 17.25 | 140.9 | 1546 | 0.09384 | 0.08562 | 0.1168 | 0.08465 | 0.1717 | 0.05054 | 1.207 | 1.051 | 7.733 | 224.1 | 0.005568 | 0.01112 | 0.02096 | 0.01197 | 0.01263 | 0.001803 | 30.75 | 26.44 | 199.5 | 3143 | 0.1363 | 0.1628 | 0.2861 | 0.182 | 0.251 | 0.06494 |
4 | B | 12.18 | 14.08 | 77.25 | 461.4 | 0.07734 | 0.03212 | 0.01123 | 0.005051 | 0.1673 | 0.05649 | 0.2113 | 0.5996 | 1.438 | 15.82 | 0.005343 | 0.005767 | 0.01123 | 0.005051 | 0.01977 | 0.0009502 | 12.85 | 16.47 | 81.6 | 513.1 | 0.1001 | 0.05332 | 0.04116 | 0.01852 | 0.2293 | 0.06037 |
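Before uploading, it can be worth confirming that the target column exists and holds a small set of category labels. The following is a minimal sketch using only the Python standard library; the helper name `summarize_target` and the trimmed-down CSV excerpt are illustrative assumptions and are not part of the Engine or its SDK.

```python
import csv
import io
from collections import Counter

# A small excerpt of the sample above (only three feature columns kept for brevity).
SAMPLE_CSV = """diagnosis,radius_mean,texture_mean,perimeter_mean
B,13.62,23.23,87.19
B,11.66,17.07,73.7
B,12.76,13.37,82.29
M,21.71,17.25,140.9
B,12.18,14.08,77.25
"""


def summarize_target(csv_text, target_column):
    """Return a Counter of label frequencies for the target column.

    Hypothetical helper: raises ValueError if the column is missing
    or if any row has an empty label.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    if target_column not in (reader.fieldnames or []):
        raise ValueError(f"target column {target_column!r} not found")
    counts = Counter()
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        label = (row[target_column] or "").strip()
        if not label:
            raise ValueError(f"empty target label on line {line_number}")
        counts[label] += 1
    return counts


label_counts = summarize_target(SAMPLE_CSV, "diagnosis")
print(dict(label_counts))  # {'B': 4, 'M': 1}
```

A column with only a handful of distinct labels, as here, is what a categorical target is expected to look like; a column with nearly as many distinct values as rows is more likely numeric or an identifier.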
The Engine's data preparation, combined with the automated feature engineering embedded in model templates, handles many tasks for you, but some tasks must be handled by you before uploading the data to the platform:
Task | Handled by the Engine | Handled by user |
---|---|---|
Categorical columns "most frequent" imputation | Yes | |
Categorical columns one hot encoding | Yes | |
Numerical columns scaling | Yes | |
Text columns "constant" imputation | Yes | |
Text columns TF-IDF vectorizer | Yes | |
Outliers removal | | Yes |
Cleanup of bad target labels | | Yes |
Cleanup of duplicated values | | Yes |
Domain knowledge enabled feature transformations | | Yes |
Cleanup of data leakages | | Yes |
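As an illustration of cleanup that can be done before uploading, the sketch below drops exact duplicate rows and rows whose target label falls outside an expected set. It uses plain Python dictionaries; the function name `clean_rows` and the sample records are assumptions made for this example, and the right cleanup rules depend on your own data.

```python
def clean_rows(rows, target_column, allowed_labels):
    """Drop exact duplicate rows and rows with an unexpected target label.

    `rows` is a list of dicts, one per record. The first occurrence of each
    distinct row is kept, in its original order.
    """
    seen = set()
    cleaned = []
    for row in rows:
        label = str(row.get(target_column, "")).strip()
        if label not in allowed_labels:
            continue  # bad or missing target label
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        cleaned.append(row)
    return cleaned


rows = [
    {"diagnosis": "B", "radius_mean": 13.62},
    {"diagnosis": "B", "radius_mean": 13.62},  # duplicate of the first row
    {"diagnosis": "?", "radius_mean": 11.66},  # label outside the expected set
    {"diagnosis": "M", "radius_mean": 21.71},
]
cleaned = clean_rows(rows, "diagnosis", {"M", "B"})
print(len(cleaned))  # 2
```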
This section shows how to build a classification app on the AI & Analytics Engine:
- Using the GUI
- Using API access through SDK
Using the GUI
An app for a dataset can be created in one of two ways. Either click the "+ New App" rectangle on the dashboard, or, from within the project or dataset page, hover the mouse over the '+' icon in the bottom right corner of the screen and then click the "New App" button:
[Figure: screenshots of the first method and the second method]
Once either option is selected, the "New App" menu appears. The user first assigns an app name, then selects the dataset for the app and the name of the target column. The type of the app (classification/regression) is determined automatically from the type of the target column, unlike in the API, where it is stated explicitly.
In the second step, the app can be created either with the default configuration, which uses an 80/20 train/test split, or with an advanced configuration that lets the user change the train/test split size. In the advanced configuration, a value of 90 represents a 90:10 train/test split.
[Figure: screenshots of Step 1 and Step 2]
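The relationship between the percentage entered in the advanced configuration and the resulting split can be sketched with simple arithmetic. This is an illustration only: the helper `split_sizes`, the 1000-row example, and the rounding rule are assumptions, and how the Engine actually rounds or samples rows is not specified here.

```python
def split_sizes(n_rows, train_percent=80):
    """Return (proportion, n_train, n_test) for a given train percentage.

    Hypothetical helper: the bounds check and rounding are assumptions,
    not documented Engine behaviour.
    """
    if not 0 < train_percent < 100:
        raise ValueError("train_percent must be strictly between 0 and 100")
    proportion = train_percent / 100
    n_train = round(n_rows * proportion)
    return proportion, n_train, n_rows - n_train


print(split_sizes(1000))      # default 80/20 split: (0.8, 800, 200)
print(split_sizes(1000, 90))  # advanced value of 90: (0.9, 900, 100)
```

The first element of the result corresponds to the fraction form of the same setting, which is how the split is expressed when the app is created through the API.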
Once finished, the app is created and the app ID can be found in the browser's address bar.
Using API access through SDK
To access the API functions, you must first authenticate with the platform by creating a client:
from aiaengine import api
client = api.Client()
Importing app
Next, import `app` in order to call the functions in this module.
from aiaengine.api import app
Creating an app for a dataset
Now you can add a new app by specifying the required parameter values as follows.
create_app_response = client.apps.CreateApp(
    app.CreateAppRequest(
        name='App Name',
        description='What is this app about',
        dataset_id='id_of_dataset_app_is_created_for',
        problem_type='classification',
        target_columns=['target_column'],
        extra_columns={},
        training_data_proportion=0.8
    )
)
Apart from the name and description of the application, you need to tell the platform the ID of the dataset that the application is built for. You also need to specify the problem type, which is `'classification'` for this task, and a single target column (`'target_column'` in the above example). For a classification task, you can leave `extra_columns` (only used in a forecasting task) as an empty dictionary `{}`. Finally, you can set the train/test split ratio with `training_data_proportion`, which is the proportion of the whole dataset used for training.
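The constraints described above can be checked before the request is built. The sketch below is a hypothetical helper, `build_create_app_kwargs`, that is not part of the SDK: it validates the inputs in plain Python and returns a dictionary using the same parameter names as the `CreateAppRequest` call shown earlier. The specific checks, such as requiring the proportion to lie strictly between 0 and 1, are assumptions rather than documented rules.

```python
def build_create_app_kwargs(name, description, dataset_id, target_column,
                            training_data_proportion=0.8):
    """Validate inputs and return keyword arguments for a classification app.

    Hypothetical helper, not an SDK function.
    """
    if not name:
        raise ValueError("name must not be empty")
    if not dataset_id:
        raise ValueError("dataset_id must not be empty")
    if not isinstance(target_column, str) or not target_column:
        raise ValueError("a classification app takes a single target column name")
    if not 0.0 < training_data_proportion < 1.0:
        raise ValueError("training_data_proportion must be between 0 and 1")
    return {
        "name": name,
        "description": description,
        "dataset_id": dataset_id,
        "problem_type": "classification",
        "target_columns": [target_column],  # a list holding one column name
        "extra_columns": {},                # only used in forecasting tasks
        "training_data_proportion": training_data_proportion,
    }


kwargs = build_create_app_kwargs(
    name="Tumour diagnosis",
    description="Predict malignant vs. benign",
    dataset_id="id_of_dataset_app_is_created_for",
    target_column="diagnosis",
)
print(kwargs["target_columns"])  # ['diagnosis']

# With the SDK installed and `client` authenticated as shown earlier,
# the dictionary could then be passed on, for example:
# create_app_response = client.apps.CreateApp(app.CreateAppRequest(**kwargs))
```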
app_id = create_app_response.id
Once created, an application is assigned a unique ID, which is used frequently in the related functions.