

Data Preparation Overview

The AI & Analytics Engine's data wrangling module aims to:

  1. Build a data processing pipeline that can be applied to new data in a consistent and reproducible manner,
  2. Provide AI-powered insights into the data and make smart recommendations on how to transform the data for better machine learning outcomes,
  3. Provide clear explanations as to why the presented actions were recommended, so that users can trust the recommendations and feel comfortable adopting them.

In this document you will find detailed explanations of all the Engine's data wrangling capabilities. There are instructions for using both the GUI and the API.

Understand

To fully appreciate the material covered in this page, you will need some familiarity with the concepts of:

  • Schema: A schema is attached to a dataset and contains information about the column names and the type of each column. A dataset must have a schema before actions can be applied to it. If the user uploads a dataset without a schema (e.g. a CSV or parquet file), the platform will infer a schema for the dataset and recommend the appropriate casting actions to convert the columns into the right types (see the illustrative sketch after this list).
  • Data wrangling: Prior to using data to train machine learning models, the data must be modified into a more appropriate form. This step is called data wrangling, and involves such things as removing irrelevant data to save training time, and removing or imputing missing values.
  • Recipe: In the Engine, the chain of data transformations applied to a dataset is referred to as a recipe. Once complete, this recipe becomes reusable on other datasets with a compliant schema.
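To make the schema concept concrete, a schema can be thought of as a mapping from column names to column types. The sketch below is purely illustrative: the column names echo the German credit example used later in this guide, and the type labels are assumptions rather than the Engine's exact representation.

# Purely illustrative (hypothetical) schema: a mapping of column names to types.
# The Engine infers an equivalent schema when a raw CSV or parquet file is
# uploaded, and then recommends casting actions to fix any wrongly typed columns.
example_schema = {
    "age": "numeric",
    "credit_amount": "numeric",
    "purpose": "categorical",
    "default_n24m": "categorical",  # the target column used later in this guide
}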

The Data Preparation Flow

The data preparation process on the AI & Analytics Engine follows this structure:

  1. Upload a dataset to a project on the AI & Analytics Engine platform (the Engine)
  2. Create a new recipe tied to the dataset
  3. Confirm/edit the recommended schema
  4. Repeat the following until there are no more recommendations or you are satisfied:
    1. View insights and recommendations from the Engine
    2. Choose from the recommended actions and add manual actions
    3. Commit the actions to produce a new temporary dataset; the committed actions will be added to the recipe
  5. Finalize the recipe, which compiles the committed actions into a reproducible data processing pipeline
  6. The processed dataset produced by applying the recipe becomes available for building machine learning models
  7. Reuse the recipe to conveniently prepare new incoming data for prediction

Note: only the dataset created when the recipe is finalized can be accessed. All datasets created at intermediate steps are temporary and inaccessible.

Data Preparation via API

We can interact with the AI & Analytics Engine via Python.

Initialize

As usual, we import the SDK. Additionally, you will need to save the code from the Code for data preparation page in a file that can be imported as aia_util:

from aiaengine import api
import aia_util

We then establish a connection to the server via a client:

client = api.Client()

# path to the file you wish to upload
data_file = "german_credit.csv"

The client will be used extensively in future steps.

Obtain Project ID

In order to add a dataset to the platform, we first need to nominate the project to which the data should belong. A project is uniquely identified by its project ID, which is different from the project name. We can find the project ID by searching for the project name as below:

project_name_to_search_for = "Project 1"
project_id = aia_util.search_project_by_name(client, project_name_to_search_for)

Uploading Data to the platform

We can upload a dataset to the Engine using the upload_data helper:

from datetime import datetime
time_stamp = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")

new_dataset = aia_util.upload_data(
    client,
    project_id=project_id,
    name='German Credit ' + time_stamp,
    description="",
    path=data_file
)


dataset_id = new_dataset.id

Create a recipe

We can create a recipe as below:

recipe_obj = aia_util.create_new_recipe(
    client,
    dataset_id,
    target_col = "default_n24m",
    name = "Process German Credit Data",
    description = "Creating a Data Preparation Recipe/Pipeline"
)
print(recipe_obj)

Get Insights

We can ask the platform for insights:

# get list of recommended actions of the first iteration
recommended_actions = aia_util.get_insights(client, recipe_obj)

The platform's UI visualizes the recommended actions for you. However, we also provide the data behind the visualizations so that you can create your own visualizations if you wish. Running the following in a Jupyter notebook session will display the visualization of the recommendations:

aia_util.visualize_recommendations(recommended_actions)
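If you prefer to work with the data behind the visualization directly, the recommendations can also be inspected as plain Python objects. This is a minimal sketch, assuming each recommendation is a dict whose 'solution' entry holds a list of proposed actions (the same structure used by the queueing code below):

# Print a quick summary of each recommendation and its proposed actions.
for i, ra in enumerate(recommended_actions):
    proposed = ra['solution']['actions']
    print(f"Recommendation {i}: {len(proposed)} proposed action(s)")
    for action in proposed:
        print("   ", action)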

Pre-commit actions: Queue, Update, and Delete Actions

To apply an action to a dataset, you need to first queue the action. When you queue an action, the platform checks whether the action is valid. At this stage, the action hasn't been applied to the whole dataset yet. Instead, the action has been applied to a smaller sample of the dataset for sense checking.

import json  # needed to encode the actions as a JSON string

iteration = 1

# collect every action proposed across the recommendations
actions = []
for ra in recommended_actions:
    actions = actions + ra['solution']['actions']

# queue the collected actions against the recipe
recipe_obj = aia_util.queue_actions(
    client,
    recipe_obj.id,
    actions=json.dumps(actions))

You can queue multiple actions one after another by calling the aia_util.queue_actions function repeatedly; the actions are executed in the order in which they are queued.
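For example, the actions collected above could be queued in two separate batches; this sketch is for illustration only, and the split point is arbitrary:

# Queue the first two actions, then the remainder; they run in queueing order.
first_batch = actions[:2]
second_batch = actions[2:]

recipe_obj = aia_util.queue_actions(client, recipe_obj.id, actions=json.dumps(first_batch))
recipe_obj = aia_util.queue_actions(client, recipe_obj.id, actions=json.dumps(second_batch))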

You can update an action that has been queued, even if the action is sandwiched between other actions. This can be done as below:

client.recipes.UpdateActions(
    recipe.UpdateActionsRequest(
        id=recipe_id,
        iteration=iteration,
        actions=replacement_actions,
        from_index=from_index
    )
)

The from_index is the action index from which the queued actions will be replaced with the replacement_actions. For example, if actions A, B, C, D are queued and you send from_index = 1 with replacement_actions = [X, Y], the resulting queue is A, X, Y.

To delete actions from the queue, use the same code as above: choose a from_index, and set replacement_actions to the queued actions from from_index onwards with the action you want deleted removed.
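For example, a deletion can be expressed as an update that rewrites the tail of the queue without the unwanted action. This is a minimal sketch, assuming actions is the locally held list queued earlier and index_to_delete is a hypothetical position to remove:

# Remove the queued action at position `index_to_delete` (hypothetical value).
index_to_delete = 2
from_index = index_to_delete
# Tail of the queue from from_index onwards, minus the deleted action.
replacement_actions = actions[from_index + 1:]

client.recipes.UpdateActions(
    recipe.UpdateActionsRequest(
        id=recipe_obj.id,
        iteration=iteration,
        actions=replacement_actions,  # encode (e.g. json.dumps) if the request expects a JSON string
        from_index=from_index
    )
)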

Commit Actions

Once you are done with all the actions that you want queued, you can apply the actions to the dataset with a Commit command. Once an action is committed, it is final. That means you cannot edit a recipe's committed actions the way you can edit queued actions.

Once committed, the action interpreter will perform the actions on the dataset, resulting in a temporary dataset. The code to commit the actions:

aia_util.commit_actions(client, recipe_obj.id, iteration)
aia_util.wait_for_commit_actions(client, recipe_obj.id, iteration)

Show the final result

To obtain the result of the current iteration, the user can use the get_dataframe function as below. The function returns a pandas.DataFrame.

# retrieve the temporary dataset produced by the committed actions as a DataFrame
df = aia_util.get_dataframe(
    client,
    aia_util.get_dataset_id(recipe_obj)
)

df.head()

Finalize the Recipe

Once the user is content with the recipe, or the platform can no longer generate more recommendations, the user can finalize the recipe and package it up as a reproducible data processing pipeline:

completed_dataset_name = "German Credit Data (Prepared) " + time_stamp

complete_recipe_response = aia_util.finalize_recipe(client, recipe_obj, completed_dataset_name)

output_dataset_id = complete_recipe_response.dataset_id

finalize_df = aia_util.get_dataframe(client, output_dataset_id)

print(finalize_df.head())