Prediction Tasks - Classification

Introduction

Each app is a specialized intelligence that performs a single prediction task and is trained on one or more datasets. An app gives the user a space to define the task and to train and compare several models for it. The "classification" app, as its name suggests, performs the classification task.

Data Preparation for Classification

In addition to the steps completed in the earlier stages of the "data preparation module", the dataset must contain a target column, and that column must be of categorical type.

Example:

Predicting whether a tumour is malignant (M) or benign (B) by looking at its measurements. A sample of the dataset for this task is shown below, with diagnosis being the target column:

diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se area_se smoothness_se compactness_se concavity_se concave points_se symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
0 B 13.62 23.23 87.19 573.2 0.09246 0.06747 0.02974 0.02443 0.1664 0.05801 0.346 1.336 2.066 31.24 0.005868 0.02099 0.02021 0.009064 0.02087 0.002583 15.35 29.09 97.58 729.8 0.1216 0.1517 0.1049 0.07174 0.2642 0.06953
1 B 11.66 17.07 73.7 421 0.07561 0.0363 0.008306 0.01162 0.1671 0.05731 0.3534 0.6724 2.225 26.03 0.006583 0.006991 0.005949 0.006296 0.02216 0.002668 13.28 19.74 83.61 542.5 0.09958 0.06476 0.03046 0.04262 0.2731 0.06825
2 B 12.76 13.37 82.29 504.1 0.08794 0.07948 0.04052 0.02548 0.1601 0.0614 0.3265 0.6594 2.346 25.18 0.006494 0.02768 0.03137 0.01069 0.01731 0.004392 14.19 16.4 92.04 618.8 0.1194 0.2208 0.1769 0.08411 0.2564 0.08253
3 M 21.71 17.25 140.9 1546 0.09384 0.08562 0.1168 0.08465 0.1717 0.05054 1.207 1.051 7.733 224.1 0.005568 0.01112 0.02096 0.01197 0.01263 0.001803 30.75 26.44 199.5 3143 0.1363 0.1628 0.2861 0.182 0.251 0.06494
4 B 12.18 14.08 77.25 461.4 0.07734 0.03212 0.01123 0.005051 0.1673 0.05649 0.2113 0.5996 1.438 15.82 0.005343 0.005767 0.01123 0.005051 0.01977 0.0009502 12.85 16.47 81.6 513.1 0.1001 0.05332 0.04116 0.01852 0.2293 0.06037
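Before uploading, it can help to confirm that the target column exists, has no empty values, and contains at least two classes. The following is a minimal sketch using only the Python standard library; the helper name `target_class_counts` is hypothetical and not part of the Engine, and the CSV text holds just three of the columns from the sample rows above.

```python
import csv
import io
from collections import Counter

# Three columns from the sample rows above, written as CSV text for illustration.
SAMPLE_CSV = """diagnosis,radius_mean,texture_mean
B,13.62,23.23
B,11.66,17.07
B,12.76,13.37
M,21.71,17.25
B,12.18,14.08
"""


def target_class_counts(csv_text, target):
    """Count the labels in the target column, raising on obvious problems.

    Hypothetical helper for local checks; it does not call the Engine.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    if target not in (reader.fieldnames or []):
        raise KeyError(f"target column {target!r} not found")

    counts = Counter()
    for row_number, row in enumerate(reader, start=1):
        label = (row[target] or "").strip()
        if not label:
            raise ValueError(f"row {row_number} has an empty target value")
        counts[label] += 1

    if len(counts) < 2:
        raise ValueError("a classification target needs at least two classes")
    return dict(counts)


class_counts = target_class_counts(SAMPLE_CSV, "diagnosis")
print(class_counts)
```

A heavily skewed count, or a label that should not exist, is easier to spot at this point than after an app has been trained.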

The Engine's data preparation, combined with the automated feature engineering embedded within model templates, helps you with many tasks, but some tasks must be handled by you before uploading the data to the platform. See:

Task Handled by the Engine Handled by user
Categorical columns "most frequent" imputation Yes
Categorical columns one hot encoding Yes
Numerical columns scaling Yes
Text columns "constant" imputation Yes
Text columns TF-IDF vectorizer Yes
Outliers removal Yes
Cleanup of bad target labels Yes
Cleanup of duplicated values Yes
Domain knowledge enabled feature transformations Yes
Cleanup of data leakages Yes
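As an illustration of cleanup done before upload, the sketch below drops exact duplicate rows and rows with an unusable target label. It assumes that duplicate and bad-label cleanup are among the tasks left to the user, and that labels can safely be trimmed and upper-cased; the function `clean_rows` and the sample rows are hypothetical.

```python
def clean_rows(rows, target, allowed_labels):
    """Drop rows with a bad target label, then drop exact duplicates.

    Labels are trimmed and upper-cased before comparison, which is an
    assumption that suits single-letter labels such as "B" and "M".
    """
    seen = set()
    cleaned = []
    for row in rows:
        label = str(row.get(target, "")).strip().upper()
        if label not in allowed_labels:
            continue  # bad or missing target label
        normalised = {**row, target: label}
        key = tuple(sorted(normalised.items()))  # hashable fingerprint of the row
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        cleaned.append(normalised)
    return cleaned


rows = [
    {"diagnosis": "B", "radius_mean": 13.62},
    {"diagnosis": "B", "radius_mean": 13.62},   # exact duplicate
    {"diagnosis": "m ", "radius_mean": 21.71},  # label needs normalising
    {"diagnosis": "?", "radius_mean": 12.18},   # bad label
]
cleaned = clean_rows(rows, "diagnosis", {"B", "M"})
print(cleaned)
```

Outliers, data leakage and domain-specific transformations depend on the dataset and are not covered by this sketch.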

This section shows how to build a classification app on the AI & Analytics Engine in two ways:

  • Using the GUI
  • Using API access through SDK

Using the GUI

You can create an app for a dataset in one of two ways: from the dashboard, by clicking the "+ New App" rectangle, or from within the project or dataset page, by hovering the mouse over the '+' icon in the bottom right corner of the screen and then clicking the "New App" button:

[Screenshots: first method, second method]

Once either option is selected, the "New App" menu appears. It first asks the user for an app name, the dataset the app relates to, and the name of the target column. The type of the app (classification or regression) is determined automatically from the type of the target column, unlike in the API, where it is stated explicitly.

In the second step, the app can be created either with the default configuration, which uses an 80/20 train/test split, or with an advanced configuration that lets the user change the train/test split size. In the advanced configuration, a value of 90 represents a 90:10 train/test split.
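To make the split value concrete, the sketch below converts a percentage such as 90 into a proportion and approximate row counts. The function `split_counts` is hypothetical, the row count of 1000 is made up for illustration, and how the Engine itself rounds fractional rows is not stated here, so the counts are an estimate.

```python
def split_counts(n_rows, train_percent=80):
    """Return (training proportion, training rows, test rows) for a split.

    train_percent mirrors the advanced-configuration value, where 90
    means a 90:10 train/test split. Rounding to the nearest row is an
    assumption, not documented Engine behaviour.
    """
    if not 0 < train_percent < 100:
        raise ValueError("train_percent must be between 0 and 100, exclusive")
    n_train = round(n_rows * train_percent / 100)
    return train_percent / 100, n_train, n_rows - n_train


default_split = split_counts(1000)        # default 80/20 configuration
advanced_split = split_counts(1000, 90)   # advanced configuration set to 90
print(default_split, advanced_split)
```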

[Screenshots: step 1, step 2]

Once finished, the app is created and the app ID can be found in the browser's address bar.

Using API access through SDK

To access the API functions, you must first authenticate to the platform as follows:

from aiaengine import api

client = api.Client()

Importing app

Next, import app so that you can call the functions in this module.

from aiaengine.api import app

Creating an app for a dataset

Now you can add a new app by specifying the required parameter values as follows.

create_app_response = client.apps.CreateApp(
    app.CreateAppRequest(
        name='App Name',
        description='What is this app about',
        dataset_id='id_of_dataset_app_is_created_for',
        problem_type='classification',
        target_columns=['target_column'],
        extra_columns={},
        training_data_proportion=0.8
    )
)

Apart from the name and description of the application, you need to give the platform the id of the dataset that the application is built for. You also need to specify the problem type, which is 'classification' for this task, and a single target column ('target_column' in the above example). For a classification task, you can leave extra_columns (only used in a forecasting task) as an empty dictionary {}. Finally, you can set the train/test split with training_data_proportion, which is the proportion of the whole dataset used for training.

app_id = create_app_response.id

Once created, an application is assigned a unique id, which is used frequently in the related functions.
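The parameter rules for a classification app can be checked locally before the request is sent. The sketch below is a hypothetical helper, not part of the SDK: it builds the keyword arguments for the request using the same field names as the CreateAppRequest example, and it runs without the SDK installed.

```python
def build_create_app_kwargs(name, dataset_id, target_column,
                            description="", training_data_proportion=0.8):
    """Validate and assemble the arguments for a classification app.

    Hypothetical helper; field names are copied from the CreateAppRequest
    example in this section.
    """
    if not name or not dataset_id:
        raise ValueError("name and dataset_id are required")
    if not isinstance(target_column, str) or not target_column:
        raise ValueError("classification needs exactly one target column name")
    if not 0.0 < training_data_proportion < 1.0:
        raise ValueError("training_data_proportion must be between 0 and 1")
    return {
        "name": name,
        "description": description,
        "dataset_id": dataset_id,
        "problem_type": "classification",
        "target_columns": [target_column],  # a single target column
        "extra_columns": {},                # only used in a forecasting task
        "training_data_proportion": training_data_proportion,
    }


kwargs = build_create_app_kwargs(
    name="App Name",
    dataset_id="id_of_dataset_app_is_created_for",
    target_column="diagnosis",
    training_data_proportion=0.9,
)
print(kwargs)

# With the SDK imported and an authenticated client, as shown earlier,
# the result could be passed on like this:
# create_app_response = client.apps.CreateApp(app.CreateAppRequest(**kwargs))
```

Keeping validation separate from the API call means that a mistake such as a proportion of 90 instead of 0.9 is caught before any request is made.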