
Data Preparation Module

Intro

"Data scientists spend 60% of their time cleaning data." -- CrowdFlower 2016 Data Science Report


To reduce that time, we developed the Data Preparation Module, which analyzes your data, flags potential issues, and suggests and applies corrections to the dataset in a systematic way.

Usually, you upload the dataset in some format (e.g. CSV, Parquet, or a DBMS such as PostgreSQL), and then the data preparation stages begin. The outcome of these stages is a transformed, cleaned, and feature-engineered dataset that can be used to build machine learning (ML) models immediately.
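For illustration only, loading such a dataset locally with pandas might look like the sketch below; the file names are hypothetical, and on the platform ingestion is handled for you.

```python
import pandas as pd

# Hypothetical file names; the Engine ingests these formats for you.
df = pd.read_csv("transactions.csv")            # comma-separated values
# df = pd.read_parquet("transactions.parquet")  # or a columnar Parquet file

# A quick look at the raw data before any preparation
print(df.shape)
print(df.dtypes)
```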

Why is data preparation important?

A familiar concept in machine learning is "Garbage In, Garbage Out": the quality of a predictive model depends directly on the quality of the data. Unclean data can have many issues, some obvious and some subtle, that can severely degrade the quality of predictive models.

Example problems include (a small detection sketch follows the list):

  • Data (target) Leakage
  • Highly correlated columns
  • Missing values
  • Messy text columns
  • Outlier values
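As an illustration only, a few of these problems can be spotted with pandas as sketched below; the dataset and threshold values are hypothetical, and on the platform this analysis is performed for you automatically.

```python
import pandas as pd

df = pd.read_csv("transactions.csv")  # hypothetical dataset

# Missing values per column
print(df.isna().sum())

# Highly correlated numeric column pairs (absolute correlation above 0.95)
corr = df.corr(numeric_only=True).abs()
high_corr = corr.where(~corr.eq(1.0)).stack().loc[lambda s: s > 0.95]
print(high_corr)

# Simple outlier check: values more than 3 standard deviations from the column mean
numeric = df.select_dtypes("number")
outliers = (numeric - numeric.mean()).abs() > 3 * numeric.std()
print(outliers.sum())
```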

Another issue is feature engineering. While domain knowledge is the best tool for shaping features to benefit a predictive model, the Data Preparation module can suggest various transformations (actions) that may further increase the quality of the models. Example enhancements include (a small sketch follows the list):

  • Historical summarization for date-based datasets, e.g. Average Spending Last 6 Months
  • Dimensionality reduction
  • Extraction of date and time components that are predictive of the target column
  • One-hot encoding of categorical columns
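A rough sketch of such transformations in pandas is shown below; the column names are hypothetical, and on the platform these transformations are offered as recommended actions rather than hand-written code.

```python
import pandas as pd

df = pd.read_csv("transactions.csv", parse_dates=["transaction_date"])  # hypothetical columns

# Extract date/time components that may be predictive of the target
df["transaction_month"] = df["transaction_date"].dt.month
df["transaction_dayofweek"] = df["transaction_date"].dt.dayofweek

# One-hot encode a categorical column
df = pd.get_dummies(df, columns=["payment_method"])

# Historical summarization: average spending over the last 6 months per customer
cutoff = df["transaction_date"].max() - pd.DateOffset(months=6)
avg_spend = (
    df[df["transaction_date"] >= cutoff]
    .groupby("customer_id")["amount"]
    .mean()
    .rename("avg_spending_last_6_months")
    .reset_index()
)
df = df.merge(avg_spend, on="customer_id", how="left")
```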

To summarize, a model will only be as good as its data; hence, data preparation is critical.

The Data Preparation process (Data Wrangling)

After the data is uploaded to the platform, the preparation process begins. We introduce several concepts below and explain them through the description of the process.

Overview

Data preparation begins with an empty recipe. A recipe is a reproducible pipeline of data transformations, where each transformation is called an action. A recipe is built up one iteration at a time. In the first iteration, a schema is recommended for the dataset. Once a schema has been set, the second iteration can begin: the AI & Analytics Engine (the Engine) analyzes the dataset, looking for problems and generating insights about the data, each of which comes with recommended actions. You may choose to commit some or all of the recommended actions and add them to the recipe. Once you have selected the actions to commit, you can end the current iteration, and the Engine will carry out the actions and begin a new iteration. Since actions transform the dataset in some way, the Engine may generate different insights in the next iteration. This cycle continues until the Engine can no longer find actionable insights. You may choose to "Finalize" the recipe at any point. When a recipe is finalized, it is compiled into a reproducible pipeline that can be applied to other datasets with a compliant schema.

Input → Recipe → Output

Recipes

  • In order to prepare the data in a systematic way, you create a recipe

  • A recipe is an ordered chain of data transformations, achieved through various actions applied to the dataset (a minimal sketch of this structure follows the list)

  • Because a recipe is reproducible, a recipe you create can be applied to a new dataset as long as it has a compliant schema
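The Engine's internal representation of recipes is not exposed; purely as a conceptual sketch, a recipe can be thought of as an ordered list of actions applied one after another, as below (action and column names are hypothetical).

```python
from dataclasses import dataclass, field
from typing import Callable, List

import pandas as pd


@dataclass
class Action:
    """A single named data transformation."""
    name: str
    transform: Callable[[pd.DataFrame], pd.DataFrame]


@dataclass
class Recipe:
    """An ordered, reproducible chain of actions."""
    actions: List[Action] = field(default_factory=list)

    def commit(self, action: Action) -> None:
        self.actions.append(action)

    def apply(self, df: pd.DataFrame) -> pd.DataFrame:
        # The output of each action is the input to the next one.
        for action in self.actions:
            df = action.transform(df)
        return df


recipe = Recipe()
recipe.commit(Action("Drop duplicated column",
                     lambda df: df.drop(columns=["Flight destination 2"])))
recipe.commit(Action("Fill missing ages with the median",
                     lambda df: df.fillna({"Age": df["Age"].median()})))
```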

Relationship of Recipes, Iterations and Actions

Iterations

  • The process of building a recipe flows in iterations. An iteration is an "analysis unit". Once it is finished, you decide how to proceed according to the outcome of the analysis

  • The dataset input for each iteration is the dataset output from the previous iteration

One Iteration

Insights & Problems

  • In each iteration, zero or more insights or problems are found. Each insight or problem has a series of associated recommended actions to take advantage of the insight or to fix the problem. For example, a problem could be: Column "Flight destination" is identical to column "Flight destination 2" (a small sketch of this example follows the list).
  • With each problem found, the platform also generates recommended actions designed to solve or mitigate it. For the duplicated-column problem above, the suggested action could be: Drop column "Flight destination 2".

  • You are not forced to accept the recommended actions (if any) at the end of an iteration; you are free to modify or dismiss them. When you commit recommended actions, they become committed actions: they will be applied to the dataset, and the resultant dataset will be fed to the next iteration.
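Continuing the duplicated-column example above, the committed action amounts to something like the following check-and-drop, sketched here with a tiny hypothetical dataset.

```python
import pandas as pd

# Hypothetical flight dataset illustrating the duplicated-column problem
df = pd.DataFrame({
    "Flight destination":   ["MEL", "SYD", "BNE"],
    "Flight destination 2": ["MEL", "SYD", "BNE"],
})

# Committed action: drop the redundant column if the two are identical
if df["Flight destination"].equals(df["Flight destination 2"]):
    df = df.drop(columns=["Flight destination 2"])
```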

Pipeline

  • After several iterations, it is expected that the platform will have no more recommendations. At that stage, you can finalize the recipe. The ordered collection of actions within the recipe forms a pipeline, which can be applied to new data once the recipe is finalized (a rough analogy is sketched below).
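The finalized pipeline lives on the platform, but as a rough analogy it behaves like a fitted preprocessing pipeline in scikit-learn: built once, then reapplied to any new dataset with a compliant schema. The column names below are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Analogy only: a finalized recipe acts like a fitted preprocessing pipeline.
preprocess = ColumnTransformer([
    ("impute_age", SimpleImputer(strategy="median"), ["Age"]),
    ("encode_destination", OneHotEncoder(handle_unknown="ignore"), ["Flight destination"]),
])

train = pd.DataFrame({"Age": [34.0, None, 29.0],
                      "Flight destination": ["MEL", "SYD", "MEL"]})
new_data = pd.DataFrame({"Age": [41.0],
                         "Flight destination": ["BNE"]})

preprocess.fit(train)                      # built once from the original dataset
prepared = preprocess.transform(new_data)  # reapplied to new, schema-compliant data
```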

Summary

The Data Preparation module is a time-saving tool for semi-automatic data cleaning that helps you build better models in a systematic way. It works by building a pipeline of actions through an iterative analysis process carried out by the platform together with you.