Importing Data

This section shows how to import data from various sources into the Engine.

Before you begin — Best practices with column names

Follow these best practices before you import data into the Engine.

Columns whose names begin with double underscores (__) will be ignored when the dataset is read, because the __ prefix is a namespace reserved for the Engine's internal use. The Engine will also automatically:

  1. Assign numbered names starting with Unknown_0 to columns that have no name in your data.
  2. Replace special characters such as [](),.; and spaces in column names with underscores.
  3. De-duplicate columns that have the same name.

You may prefer to control this behaviour and clean the column names yourself.

In summary, the following practices are recommended:

  1. Make column names unique and non-empty. Do not leave any column unnamed or give two columns the same name.
  2. Use only letters, numbers, and underscores (no spaces) in column names, and start each name with a letter or a number. A minimal sketch of cleaning column names in pandas follows this list.
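
If you prepare your data in Python, the sketch below shows one way to apply these recommendations with pandas before exporting. The clean_column_names helper is purely illustrative and is not part of the Engine or of pandas.

import re
import pandas as pd

def clean_column_names(df: pd.DataFrame) -> pd.DataFrame:
    # Illustrative helper: apply the naming recommendations above before export.
    cleaned = []
    seen = {}
    for position, name in enumerate(df.columns):
        name = str(name).strip()
        if not name:
            name = f'column_{position}'                # placeholder for unnamed columns
        name = re.sub(r'[^0-9A-Za-z_]', '_', name)     # keep only letters, numbers, underscores
        count = seen.get(name, 0)
        seen[name] = count + 1
        if count:
            name = f'{name}_{count}'                   # de-duplicate repeated names
        cleaned.append(name)
    return df.set_axis(cleaned, axis=1)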

Python users: Saving dataframes from Pandas and Dask into a dataset

If you intend to load data into pandas or dask, process it, and save it in a tabular file format for later import into the Engine, make sure that row indexes containing useful data are saved and unnecessary indexes are skipped:

import pandas as pd

data: pd.DataFrame = pd.read_csv('my_file.csv') # Or read_json(..., lines=True) or read_parquet(...)

# You process your data further, and finalize

final_data = my_processing_pipeline(data)

# If the index of the final data is not a plain range index and contains useful
# information such as timestamps or group names, use reset_index to convert
# those index levels into columns

final_data = final_data.reset_index(level=names_of_indexes_you_want_to_retain)

# Then discard the row indexes that do not contain useful data

final_data = final_data.reset_index(drop=True)

# The resulting data now has a plain range index, which contains no useful data
# and, if kept, results in an unnamed column. Hence, use the option to skip it
# before saving to the appropriate format(s) for importing:

final_data.to_csv('data_to_import.csv', index=False)
final_data.to_json('data_to_import.jsonl', orient='records', lines=True)  # the records orient never writes the index
final_data.to_parquet('data_to_import.parquet', index=False)

Before you begin — Workarounds

Due to known issues and limitations in the current release, some data sources need a few offline workarounds before you can import them into the Engine. Upcoming releases aim to eliminate this extra work. This section details the specific workaround to use in each case, if your dataset falls into one of these categories.

Tab-separated (.tsv) and Pipe-separated (.psv) files

Simply change the file extension to csv. The Engine's data import module can infer the delimiter character, but it currently expects all delimited files to have the csv extension. This is a known issue and will be fixed in a future release.
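
For example, assuming Python and a hypothetical file name, a simple copy under a new extension is enough; the delimiter inside the file is left untouched:

import shutil

# Copy a tab-separated file under a .csv extension so the Engine will accept it.
# The Engine infers the actual delimiter on import.
shutil.copyfile('my_data.tsv', 'my_data.csv')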

Compressed files (ending in .zip, .gz, .bz2 or .xz)

The Engine currently does not support importing tabular files (csv or jsonl) stored in compressed formats. You will need to decompress them offline before importing.
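
If you work in Python, a minimal sketch of decompressing a gzipped file (hypothetical file name) is shown below; the other formats can be handled with the corresponding standard-library modules (bz2, lzma, zipfile):

import gzip
import shutil

# Decompress my_data.csv.gz to my_data.csv before importing it into the Engine.
with gzip.open('my_data.csv.gz', 'rb') as f_in, open('my_data.csv', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)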

Excel (.xls and .xlsx), SAS (.sas7bdat), STATA (.dta), and SPSS (.sav, .zsav, .por) formats

The Engine currently does not support importing files in Excel, SAS, STATA, or SPSS formats.

Save all such files into .csv format. If you are comfortable writing small scripts in R or Python, you can convert to .csv using one of the following options:

  1. Pandas functions read_sas, read_spss, read_stata, read_excel, ExcelFile.parse
  2. The R packages readr, readxl, and haven

If you have data in Excel format, first save each sheet as a separate .csv file, then upload each sheet as a separate dataset. A pandas sketch of this conversion follows.
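
For example, a minimal pandas sketch (the workbook name is hypothetical, and an Excel engine such as openpyxl must be installed):

import pandas as pd

# Read every sheet of the workbook into a dict of sheet name -> DataFrame.
sheets = pd.read_excel('my_workbook.xlsx', sheet_name=None)

# Write each sheet to its own csv, ready to be uploaded as a separate dataset.
for sheet_name, sheet_df in sheets.items():
    sheet_df.to_csv(f'{sheet_name}.csv', index=False)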

Nested jsonl and jsonlines files

If you intend to ingest nested JSON lines files as tabular data, you will need to unnest them yourself. Use an appropriate tool to perform this offline:

  1. If you are importing from MongoDB, create another collection in your database and import from that: use an aggregation pipeline with the $unwind aggregation stage, coupled with aggregation operators such as $arrayToObject and $objectToArray as needed.
  2. If you have a local jsonlines file with nested data and are familiar with pandas in Python, use the JSON normalization functionality from pandas, as sketched below.
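
A minimal sketch of the pandas route, assuming a hypothetical file named nested_data.jsonl:

import json
import pandas as pd

# Read the nested jsonlines file and flatten nested keys into columns,
# e.g. {'address': {'city': ...}} becomes an 'address_city' column.
with open('nested_data.jsonl') as f:
    records = [json.loads(line) for line in f]

flat = pd.json_normalize(records, sep='_')
flat.to_csv('flat_data.csv', index=False)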

Using the GUI

Starting from the page of a project, hover over the floating action buttons at the bottom right and choose "New Dataset".

You are then taken to the "Choose method" section of the dataset creation dialog, which lists the dataset import options.

Uploading Local Files

From the "Choose method" dialog, click "File Upload":

Now you can drag and drop a file (csv, jsonl, jsonlines, or parquet) into the upload area. You will also need to provide a name for the dataset. Click "Create" to start importing the file into the Engine.

Importing from URLs

The Engine provides a convenient way to import a csv, jsonl, jsonlines, or parquet file directly from the web, so you do not have to download the file to your computer and then upload it to your project.

From the "Choose method" dialog, click "HTTP/HTTPS":

You will then need to provide a name and enter the URL(s) for your dataset in the displayed text area. Entering multiple URLs separated by ";" concatenates the datasets, which is useful when a dataset is spread over multiple files (for example, https://example.com/part1.csv;https://example.com/part2.csv). The datasets behind the URLs must therefore have the same number of columns and the same column names, or an error is returned.

Importing from Database

You can also import a table or a collection from your database as a dataset. Supported databases are:

  1. MySQL
  2. PostgreSQL
  3. SQL Server
  4. MongoDB
  5. Oracle Database (currently in development)
  6. Cassandra (currently in development)
  7. FTP Server (currently in development)

Click on the appropriate database type to begin:

In the next form, you will then need to enter the following details:

  1. Name of dataset to be created
  2. Host address. Do not include the protocol (such as "mongodb://"), port, user ID, password, or database name.
  3. Port
  4. User name
  5. Password
  6. Database name
  7. Name of table or collection

Next Step

Once data has been uploaded to the Engine, you will be prompted to begin a data wrangling session; more details can be found in the next section of the How-To Guides. In the dataset creation dialog, select "Create a new data wrangling recipe" and click "Done" to proceed to the next step.

Using API access through SDK

To access the API functions, you must first authenticate to the Engine by creating an API client:

from aiaengine import api

client = api.Client()

Then import the following modules in order to use the functions shown in the rest of this section.

import time
import json

from aiaengine.api import util, file, dataset

Uploading a file

Now you can upload a file from your local file system to the Engine as a new dataset. You need to specify the project id, the name and description of the dataset, the local path to the file, and the file format. Here is an example of uploading a csv file.

new_dataset = util.create_dataset(
    client,
    project_id=my_project_id, # You can obtain this using the ListUserProjects API call
    name='Dataset Name',
    description='What is your uploaded data about',
    data_files=['/path/to/dataset_file.csv'],
    content_type='text/csv'    # file format
)

The parameter content_type indicates the format of the uploaded file. If your file is not a csv, change content_type according to the table below.

File format    content_type
csv            'text/csv'
parquet        'application/binary+parquet'
json           'application/json'
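
For example, uploading a parquet file (hypothetical path) only changes data_files and content_type:

new_dataset = util.create_dataset(
    client,
    project_id=my_project_id,
    name='Dataset Name',
    description='What is your uploaded data about',
    data_files=['/path/to/dataset_file.parquet'],
    content_type='application/binary+parquet'    # parquet instead of csv
)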

Importing from URLs and Databases

To import from a URL or a Database, first create an "empty" dataset with a name and a description (optional). Store the details of the dataset created:

# Make an empty dataset.
dataset_details = client.datasets.CreateDataset(
    dataset.CreateDatasetRequest(
        project_id=my_project_id,
        name='My Dataset',
        description='This dataset is about...'
    )
)

Now import files into the dataset using the ImportFiles API call. The common template is as follows:

client.files.ImportFiles(
    file.ImportFilesRequest(
        dataset_id=dataset_details.id,
        source=source_name,
        data=additional_info_needed_for_import
    )
)

In the ImportFilesRequest object, specify the source type in the source field and provide additional information such as URLs, host name/port, etc. in the data field as a dictionary. The table below covers all the cases:

Import Type              source      Keys required for data                                      Value of type field in data
From web (HTTP/HTTPS)    url         urls                                                        (none)
S3                       s3          sourceUrl, awsAccessKeyId, awsSecretAccessKey, awsRegion    (none)
MySQL                    database    type, host, port, user, password, database, table           mysql
PostgreSQL               database    type, host, port, user, password, database, table           postgres
SQL Server               database    type, host, port, user, password, database, table           sqlserver
MongoDB                  database    type, host, port, user, password, database, table           mongodb

Regardless of the type of source you import from, every value inside the "data" dictionary must be a string. Hence, in the case of an HTTP/HTTPS import, you will need to encode the array of URL strings into a single JSON string, as in this example:

client.files.ImportFiles(
    file.ImportFilesRequest(
        dataset_id=dataset_details.id,
        source='url',
        data={'urls': json.dumps(['https://www.openml.org/data/get_csv/53923/sylva_prior.arff'])}
    )
)
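
An import from S3 follows the same template. The sketch below is an assumption based on the keys listed in the table above; the bucket path, credentials, and region are hypothetical:

client.files.ImportFiles(
    file.ImportFilesRequest(
        dataset_id=dataset_details.id,
        source='s3',
        data={
            'sourceUrl': 's3://my-bucket/path/to/data.csv',   # hypothetical bucket path
            'awsAccessKeyId': 'AKIA................',
            'awsSecretAccessKey': '********',
            'awsRegion': 'ap-southeast-2'
        }
    )
)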

To import from a database, do as follows:

db_connection_details = {
    'type': 'mysql',
    'host': 'sql12.freemysqlhosting.net',
    'port': '3306',
    'user': 'sql12334825',
    'password': '********',
    'database': 'sql12334825',
    'table': 'customers'
}

client.files.ImportFiles(
    file.ImportFilesRequest(
        dataset_id=dataset_details.id,
        source='database',
        data=db_connection_details
    )
)

After starting the import, you can poll the status of the dataset using the following API calls:

import_complete = False

while not import_complete:
    # Wait five seconds
    time.sleep(5)

    # Then poll for details of the dataset
    dataset_details = client.datasets.GetDataset(
        dataset.GetDatasetRequest(id=dataset_details.id)
    )

    import_complete = dataset_details.step != 'file_processing'

print(f'Import complete for dataset {dataset_details.name}')

When the import process is complete, you can proceed to the data preparation step.