Using the Data Preparation Module

Data Preparation Overview

The AI & Analytics Engine's data wrangling module aims to:

  1. Build a data processing pipeline that can be applied to new data in a consistent and reproducible manner,
  2. Provide AI-powered insights into the data and make smart recommendations on how to transform the data for better machine learning outcomes,
  3. Provide clear explanations as to why the presented actions were recommended, so that users can trust the recommendations and feel comfortable adopting them.

In this document you will find detailed explanations of all the Engine's data wrangling capabilities. There are instructions for using both the GUI and the API.

Key Concepts

To fully appreciate the material covered on this page, you will need some familiarity with the following concepts:

  • Schema: A schema is attached to a dataset and contains information about the column names and the type of each column. A dataset must have a schema before actions can be applied to it. If the user uploads a dataset without a schema (e.g. a CSV or parquet file), the platform will infer a schema for the dataset and recommend the appropriate casting actions to convert the columns to the right types. (A sketch of what a schema might look like follows this list.)
  • Data wrangling: Prior to using data to train machine learning models, the data must be modified into a more appropriate form. This step is called data wrangling, and involves such things as removing irrelevant data to save training time, and removing or imputing missing values.
  • Recipe: In the Engine, the chain of data transformations applied to a dataset is referred to as a recipe. Once complete, the recipe becomes reusable on other datasets with a compliant schema.
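
To make the schema concept concrete, here is a minimal sketch of what an inferred schema could contain. This is an illustration only; the field names and type labels are assumptions, not the Engine's actual schema format.

# Hypothetical illustration: the Engine's real schema format may differ
inferred_schema = {
    "columns": [
        {"name": "age", "type": "numeric"},
        {"name": "purpose", "type": "categorical"},
        {"name": "default_n24m", "type": "categorical"},
    ]
}

A casting action recommended by the Engine would then convert, say, a column read as text into the numeric type recorded here.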

The Data Preparation Flow

The data preparation process on the AI & Analytics Engine follows this structure:

  1. Upload a dataset to a project on the AI & Analytics Engine platform (the Engine)
  2. Create a new recipe tied to the dataset
  3. Confirm/edit the recommended schema
  4. Repeat the following until there are no more recommendations or you are satisfied:
    1. View insights and recommendations from the Engine
    2. Choose from the recommended actions and add manual actions
    3. Commit the actions to produce a new temporary dataset; the committed actions are added to the recipe
  5. Finalize the recipe, which compiles the committed actions into a reproducible data processing pipeline
  6. The processed dataset produced by applying the recipe becomes available for building machine learning models
  7. Reuse the recipe to conveniently prepare new incoming data for prediction

Note: only the dataset created at the recipe-finalization step can be accessed. All other datasets created in intermediate steps are temporary and cannot be accessed.

Data Preparation via Web Graphical User Interface (GUI)

You can follow the previous sections to create a Project and add a dataset.

Create a Recipe

When a dataset is first imported, the option is given to create a new data wrangling recipe. If a recipe has previously been created for a dataset and you wish to create a new one, you can do so from the dataset's details page:

Data Wrangling Window

When the recipe is first created, the dataset is analyzed. This typically takes less than a minute; the duration depends on the number of columns your tabular dataset contains. Once complete, we see the following screen:

This contains the following:

  1. Search bar. This can be used to find and display particular columns of your choice in the dataset.
  2. Dataset viewport. The first 1000 rows of the tabular dataset are shown, so that users can make informed decisions about the actions they want to apply next. Along the top are the column names, each with an icon indicating the type of data the column contains. The target column is highlighted for convenience. This view is refreshed whenever actions are queued.
  3. Actions dialogue box. This is the main interface to the data wrangling process. In the currently opened "Suggestions" tab, we see:
    1. Field to enter the target column (if provided, better recommendations are given)
    2. Insight generated by the Engine; click to expand or collapse
    3. Recommended actions to address insight
    4. "See analysis" button, click to see a detailed explanation of the provided recommendations

Get Insights

Insights are generated when the recipe is first created and whenever actions are committed. You can see these insights in the "Suggestions" tab of the actions dialogue box. If a target column is chosen, the Engine is able to give better insights tailored to the target.

The first recommended action will always be to cast each column to a particular type, unless the dataset's schema already matches the schema inferred by the Engine.

Upon clicking "see analysis", we can view the Engine's justifications for the suggestions it has provided:

In this case, an analysis of the values in each column is shown, and the Engine has decided whether each column seems more numeric or categorical in nature. See the actions catalogue for when such actions are recommended and why.

Add Actions

Suggested actions can be added to the action queue by clicking the adjacent plus icon. Upon doing so, you will be brought to the "Recipe" tab where you will see the action in the action queue. Actions can similarly be added from the "Add Actions" tab.

Whenever actions in the queue are added, removed, or edited, the Engine generates a preview of the dataset with the action applied. Note that for some actions that act on the dataset globally, this preview on a subset of the data may not be 100% accurate -- see the actions catalogue.

Suggested action and manual action dialogue boxes

From the "Queued actions" drop down list in the "Recipe" tab, you can edit and remove actions currently in the queue by clicking the respective icons. The "edit action" dialogue box (rightmost image above) is also displayed when actions are added manually from the "Add Actions" tab.

Commit Actions

Once satisfied with the actions currently in the queue, click "Commit Action" to apply them to the entire dataset. The queued actions will then appear in the "Committed actions" drop down list with a spinner indicating they are being applied.

The Engine will then analyze the new processed dataset and generate new suggestions tailored to the target column (if provided). This typically takes 1-2 minutes, though it is heavily dependent on the size of the dataset.

Finalize Recipe

When an action is first committed, the "Finalize & End" button is no longer greyed out. When satisfied with the dataset, click it, name the completed dataset, and click "Yes" to confirm. The dataset is then saved and available for training models, and the recipe is saved as a reusable pipeline of actions that can be applied to new incoming datasets of the same shape.

Reuse Recipe

Once a recipe has been finalized, it becomes available for reuse to make predicting on new data convenient. To reuse a recipe, first upload the new dataset as described in the Importing Data page. Upon doing so, you will be prompted to choose whether to create a new recipe or use an existing one. Select "Use an existing recipe" and choose the recipe you wish to reuse from the drop down list.

The data wrangling pipeline in the recipe will then be applied to the dataset, and once it has finished processing, the dataset will become available for use.

Data Preparation via API

We can interact with the AI & Analytics Engine via the Python SDK.

Initialize

As usual, we import the SDK. Additionally, you will need to save the code from the Code for data preparation page in a file that can be imported as aia_util:

from aiaengine import api
import aia_util

We then establish a connection to the server via a client:

client = api.Client()

# path to the file you wish to upload
data_file = "german_credit.csv"

The client will be used extensively in future steps.

Obtain Project ID

To add a dataset to the platform, we first need to nominate the project to which the data should belong. A project is uniquely identified by its project ID, which is different from the project name. We can find the project ID by searching for the project name as follows:

project_name_to_search_for = "Project 1"
project_id = aia_util.search_project_by_name(client, project_name_to_search_for)
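
The behavior when no project matches the name depends on the helper's implementation. Assuming search_project_by_name returns None in that case, a defensive check could look like this:

# Assumption: search_project_by_name returns None when no project matches
if project_id is None:
    raise ValueError(f"No project named '{project_name_to_search_for}' was found")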

Uploading Data to the Platform

We can upload a dataset to the Engine using aia_util.upload_data:

from datetime import datetime
time_stamp = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")

new_dataset = aia_util.upload_data(
    client,
    project_id=project_id,
    name='German Credit ' + time_stamp,
    description="",
    path=data_file)


dataset_id = new_dataset.id

Create a Recipe

We can create a recipe as follows:

recipe_obj = aia_util.create_new_recipe(
    client,
    dataset_id,
    target_col = "default_n24m",
    name = "Process German Credit Data",
    description = "Creating a Data Preparation Recipe/Pipeline"
)
print(recipe_obj)

Get Insights

We can ask the platform for insights:

# get list of recommended actions of the first iteration
recommended_actions = aia_util.get_insights(client, recipe_obj)

The platform's UI visualizes the recommended actions for you. However, we also provide the data behind the visualizations so that you can create your own visualizations if you wish. Running the following in a Jupyter notebook session will display the visualization of the recommendations:

aia_util.visualize_recommendations(recommended_actions)

Pre-commit actions: Queue, Update, and Delete Actions

To apply an action to a dataset, you first need to queue the action. When you queue an action, the platform checks that the action is valid. At this stage, the action has not been applied to the whole dataset yet. Instead, it is applied to a smaller subset of the dataset as a sanity check.

import json

iteration = 1

# gather the actions from each recommendation into a single list
actions = []
for ra in recommended_actions:
    actions = actions + ra['solution']['actions']

recipe_obj = aia_util.queue_actions(
    client,
    recipe_obj.id,
    actions=json.dumps(actions))

You can queue multiple actions one after another by calling the aia_util.queue_actions function repeatedly. The actions are executed in the order in which they are queued.
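
For example, the recommended actions can be queued in two batches; the first batch is executed before the second. (The slicing below is purely illustrative.)

# Illustrative only: split the recommended actions into two batches
recipe_obj = aia_util.queue_actions(
    client,
    recipe_obj.id,
    actions=json.dumps(actions[:1]))

recipe_obj = aia_util.queue_actions(
    client,
    recipe_obj.id,
    actions=json.dumps(actions[1:]))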

You can update an action that has been queued, even if it is sandwiched between other actions. This can be done as below:

# replace the queued actions from position from_index onwards
client.recipes.UpdateActions(
    recipe.UpdateActionsRequest(
        id=recipe_id,
        iteration=iteration,
        actions=replacement_actions,
        from_index=from_index))

Here, from_index is the index of the first queued action to be replaced; the queued actions from that position onwards are replaced with replacement_actions.

To delete actions from the queue, use the same code as above: choose a from_index, and set replacement_actions to the currently queued actions from from_index onwards, with the actions you want to delete removed.
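
For instance, assuming queued_actions is a Python list holding the current queue, deleting the single action at index 2 could look like this (the variable names are illustrative):

# Illustrative sketch: queued_actions is assumed to hold the current queue as a list
from_index = 2
replacement_actions = queued_actions[from_index + 1:]  # everything after the deleted action

client.recipes.UpdateActions(
    recipe.UpdateActionsRequest(
        id=recipe_id,
        iteration=iteration,
        actions=replacement_actions,
        from_index=from_index))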

Commit Actions

Once you are done with all the actions that you want queued, you can apply them to the dataset with a commit command. Once an action is committed, it is final: you cannot edit a recipe's committed actions the way you can edit queued actions.

Once committed, the action interpreter performs the actions on the dataset, which results in a temporary dataset. The code to commit the actions:

aia_util.commit_actions(client, recipe_obj.id, iteration)
aia_util.wait_for_commit_actions(client, recipe_obj.id, iteration)
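
After the commit completes, the data preparation loop from the flow above continues: request fresh insights and queue the next round of actions. A sketch, assuming post-commit insights are fetched the same way as the initial ones and the iteration counter is advanced manually:

# Assumption: insights after a commit are retrieved with the same helper
iteration += 1
recommended_actions = aia_util.get_insights(client, recipe_obj)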

Show the final result

To obtain the result of the latest iteration, the user can use the get_dataframe function as below. The function returns a pandas.DataFrame.

# fetch the processed dataset as a pandas DataFrame
df = aia_util.get_dataframe(
    client,
    aia_util.get_dataset_id(recipe_obj)
)

df.head()

Finalize the Recipe

Once the user is content with the recipe, or the platform can no longer generate more recommendations, the user can finalize the recipe and package it up as a reproducible data processing pipeline:

completed_dataset_name = "German Credit Data (Prepared) " + time_stamp

complete_recipe_response = aia_util.finalize_recipe(client, recipe_obj, completed_dataset_name)

output_dataset_id = complete_recipe_response.dataset_id

finalize_df = aia_util.get_dataframe(client, output_dataset_id)

print(finalize_df.head())