Data Preparation Overview
The AI & Analytics Engine's data wrangling module aims to:
- Build a data processing pipeline that can be applied to new data in a consistent and reproducible manner,
- Provide AI-powered insights into the data and make smart recommendations on how to transform the data for better machine learning outcomes,
- Provide clear explanations as to why the presented actions were recommended, so that users can place trust in the recommendations and feel comfortable adopting them.
In this document you will find detailed explanations of all the Engine's data wrangling capabilities. There are instructions for using both the GUI and the API.
To fully appreciate the material covered in this page, you will need some familiarity with the concepts of:
- Schema: A schema is attached to a dataset and contains information about the column names and the type of each column. A dataset must have a schema before actions can be applied to it. If the user uploads a dataset without a schema (e.g. a CSV or parquet file), then the platform will infer a schema for the dataset and recommend the appropriate casting actions to convert the columns into the right types.
- Data wrangling: Prior to using data to train machine learning models, the data must be modified into a more appropriate form. This step is called data wrangling, and involves such things as removing irrelevant data to save training time, and removing or imputing missing values.
- Recipe: In the Engine, the chain of data transformations applied to a dataset is referred to as a recipe. Once complete, this recipe becomes reusable on other datasets with a compliant schema.
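To make the schema-inference idea above concrete, the sketch below uses pandas' own dtype inference on a small in-memory CSV. This is an illustration only, not the Engine's actual inference mechanism:

```python
import io
import pandas as pd

# A small CSV with no explicit schema, as a stand-in for an uploaded file:
csv_file = io.StringIO("age,income,default\n35,50000,no\n41,72000,yes\n")
df = pd.read_csv(csv_file)

# pandas infers a type for each column from its values, much as the
# Engine infers a schema before recommending casting actions:
print(df.dtypes)  # age and income become integers, default stays as text
```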
The Data Preparation Flow
The data preparation process on the AI & Analytics Engine follows this structure:
- Upload a dataset to a project on the AI & Analytics Engine platform (the Engine)
- Create a new recipe tied to the dataset
- Confirm/edit the recommended schema
- Repeat the following until there are no more recommendations/you are satisfied:
- View insights and recommendations from the Engine
- Choose from the recommended actions and add manual actions
- Commit the actions to produce a new temporary dataset; the committed actions will be added to the recipe
- Finalize the recipe which will compile the committed actions into a reproducible data processing pipeline
- The processed dataset as a result of applying the recipe will become available for building machine learning models
- Reuse the recipe to conveniently prepare new incoming data for prediction
Note: only the dataset created at the recipe-finalization step can be accessed. All other datasets created from intermediate steps are temporary and not accessible.
Data Preparation via API
We can interact with the AI & Analytics Engine via Python. As usual, we import the SDK. Additionally, you will need to save the code from the Code for data preparation page in a file so that it can be imported as `aia_util`. We then establish a connection to the server via a client; `client` will be used extensively in the following steps.
Obtain Project ID
In order to add a dataset to the platform, we first need to nominate the project to which the data should belong. A project is uniquely identified by its project ID, which is different from the project name. We can find the project ID by searching for the project name:
```python
project_name_to_search_for = "Project 1"
project_id = aia_util.search_project_by_name(client, project_name_to_search_for)
```
Uploading Data to the platform
We can upload a dataset to the Engine using
```python
from datetime import datetime

time_stamp = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
new_dataset = aia_util.upload_data(
    client,
    project_id=project_id,
    name='German Credit ' + time_stamp,
    description="",
    path=data_file  # path to your local copy of the data file
)
dataset_id = new_dataset.id
```
Create a recipe
We can create a recipe as below
```python
recipe_obj = aia_util.create_new_recipe(
    client,
    dataset_id,
    target_col="default_n24m",
    name="Process German Credit Data",
    description="Creating a Data Preparation Recipe/Pipeline"
)
print(recipe_obj)
```
We can ask the platform for insights
```python
# get list of recommended actions of the first iteration
recommended_actions = aia_util.get_insights(client, recipe_obj)
```
The platform's UI visualizes the recommended actions for you. However, we also provide the data behind the visualizations so that you can perform your own visualizations if you wish, for example by displaying the recommendations in a Jupyter notebook session.
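A minimal sketch of inspecting the recommendations yourself: the action entries here are hypothetical stand-ins for the structure returned by `aia_util.get_insights` (a list of dicts with `solution` → `actions`, as used in the queueing code below), flattened into a DataFrame that Jupyter renders as a table.

```python
import pandas as pd

# Hypothetical stand-in for the Engine's response:
recommended_actions = [
    {"solution": {"actions": [{"type": "cast", "column": "age"}]}},
    {"solution": {"actions": [{"type": "impute", "column": "income"}]}},
]

# Flatten into one row per recommended action:
rows = [action for ra in recommended_actions
        for action in ra["solution"]["actions"]]
pd.DataFrame(rows)  # displayed as a table in a Jupyter notebook
```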
Pre-commit actions: Queue, Update, and Delete Actions
To apply an action to a dataset you need to first queue the action. When you queue an action, the platform checks whether the action is valid. At this stage, the action hasn't been applied to the whole dataset yet. Instead, it is applied to a smaller sample of the dataset for sense-checking.
```python
iteration = 1
actions = []
for ra in recommended_actions:
    actions = actions + ra['solution']['actions']
recipe_obj = aia_util.queue_actions(
    client, recipe_obj.id, actions=json.dumps(actions))
```
You can queue multiple actions one after another by calling the `aia_util.queue_actions` function repeatedly. The actions are executed in the order in which they are queued.
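A small sketch of that ordering guarantee, using hypothetical action names (the `queue_actions` calls are commented out because they require a live client and recipe):

```python
batch_1 = ["cast_to_numeric", "drop_id_column"]
batch_2 = ["impute_missing"]

# recipe_obj = aia_util.queue_actions(client, recipe_obj.id,
#                                     actions=json.dumps(batch_1))
# recipe_obj = aia_util.queue_actions(client, recipe_obj.id,
#                                     actions=json.dumps(batch_2))

# Queueing in two batches is equivalent to queueing the concatenation once;
# actions run in exactly this order:
execution_order = batch_1 + batch_2
print(execution_order)
```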
You can update an action that has been queued, even if the action is sandwiched between other actions. This can be done as below:
```python
client.recipes.UpdateActions(
    recipe.UpdateActionsRequest(
        id=recipe_id,
        iteration=iteration,
        actions=replacement_actions,
        from_index=from_index
    )
)
```
`from_index` is the action index from which the queued actions will be replaced with `replacement_actions`. To delete actions from the queue, use the same code as above: choose a `from_index`, and set `replacement_actions` to the actions from `from_index` onwards with the action you want deleted removed.
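A minimal sketch of the deletion case, using hypothetical action names for a queue of five actions (the `UpdateActions` call is commented out because it needs a live client and recipe):

```python
# Hypothetical queue; we want to delete the action at index 2 ("impute"):
queued_actions = ["cast", "drop_column", "impute", "one_hot_encode", "scale"]

from_index = 2  # replace the queue from the position of the deleted action
# Everything after the deleted action becomes the replacement list:
replacement_actions = queued_actions[from_index + 1:]

# client.recipes.UpdateActions(recipe.UpdateActionsRequest(
#     id=recipe_id, iteration=iteration,
#     actions=replacement_actions, from_index=from_index))
print(replacement_actions)
```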
Once you are done with all the actions that you want queued, you can apply the actions to the dataset with a Commit command. Once an action is committed, it is final: you cannot edit a recipe's committed actions the way you can edit queued actions.
Once committed, the action interpreter will perform the actions on the dataset, which will result in a temporary dataset. The code to commit actions:
```python
aia_util.commit_actions(client, recipe_obj.id, iteration)
aia_util.wait_for_commit_actions(client, recipe_obj.id, iteration)
```
Show the final result
To obtain the result of the current iteration, the user can use the `get_dataframe` function as below. The function returns a pandas DataFrame:
```python
# fetch the processed dataset as a dataframe
df = aia_util.get_dataframe(
    client,
    aia_util.get_dataset_id(recipe_obj)
)
df.head()
```
Finalize the Recipe
Once the user is content with the recipe, or the platform can no longer generate more recommendations, the user can finalize the recipe and package it up as a reproducible data processing pipeline:
```python
completed_dataset_name = "German Credit Data (Prepared) " + time_stamp
complete_recipe_response = aia_util.finalize_recipe(
    client, recipe_obj, completed_dataset_name)
output_dataset_id = complete_recipe_response.dataset_id
finalize_df = aia_util.get_dataframe(client, output_dataset_id)
print(finalize_df.head())
```