Importing Data

This section shows how to import data from various sources into the Engine.

Before you begin — Best practices with column names

Follow these best practices before you import data into the Engine.

Columns whose names begin with double underscores (__) will be ignored when the dataset is read, because the __ prefix is a namespace reserved for the Engine's internal use. The Engine will also automatically:

  1. Assign numbered names starting with Unknown_0 to columns that have no name in your data.
  2. Replace special characters such as [](),.; and spaces in column names with underscores.
  3. De-duplicate columns that have the same name.

You may prefer to control this behaviour and clean the column names yourself.

In summary, the following practices are recommended:

  1. Make column names unique and non-empty. Do not leave any column unnamed or give two columns the same name.
  2. Use only letters, numbers, and underscores (no spaces) in column names, and start each name with a letter or a number. A minimal sketch of cleaning column names in pandas follows this list.
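
If you prepare your data in Python, the sketch below shows one way to apply these recommendations with pandas before exporting. The clean_column_names helper is purely illustrative and is not part of the Engine or of pandas.

import re
import pandas as pd

def clean_column_names(df: pd.DataFrame) -> pd.DataFrame:
    # Illustrative helper: apply the naming recommendations above before export.
    cleaned = []
    seen = {}
    for position, name in enumerate(df.columns):
        name = str(name).strip()
        if not name:
            name = f'column_{position}'                # placeholder for unnamed columns
        name = re.sub(r'[^0-9A-Za-z_]', '_', name)     # keep only letters, numbers, underscores
        count = seen.get(name, 0)
        seen[name] = count + 1
        if count:
            name = f'{name}_{count}'                   # de-duplicate repeated names
        cleaned.append(name)
    return df.set_axis(cleaned, axis=1)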

Python users: Saving dataframes from Pandas and Dask into a dataset

If you intend to load data into pandas or dask, process it, and save it in a tabular file format for later import into the Engine, make sure that row indexes containing useful data are saved and unnecessary indexes are skipped:

import pandas as pd

data: pd.DataFrame = pd.read_csv('my_file.csv') # Or read_json(..., lines=True) or read_parquet(...)

# You process your data further, and finalize

final_data = my_processing_pipeline(data)

# If the index of the final data is not a plain range index and contains useful
# information such as timestamps or group names, use reset_index to convert
# those index levels into columns

final_data = final_data.reset_index(level=names_of_indexes_you_want_to_retain)

# Then discard the row indexes that do not contain useful data

final_data = final_data.reset_index(drop=True)

# The resulting data now has a plain range index, which contains no useful data
# and, if kept, results in an unnamed column. Hence, use the option to skip it
# before saving to the appropriate format(s) for importing:

final_data.to_csv('data_to_import.csv', index=False)
final_data.to_json('data_to_import.jsonl', orient='records', lines=True)  # the records orient never writes the index
final_data.to_parquet('data_to_import.parquet', index=False)

Before you begin — Workarounds

Due to known issues and limitations in the current release, some data sources need a few offline workarounds before you can import them into the Engine. Upcoming releases aim to eliminate this extra work. This section details the specific workaround to use in each case, if your dataset falls into one of these categories.

Tab-separated (.tsv) and Pipe-separated (.psv) files

Simply change the file extension to csv. The Engine's data import module can infer the delimiter character, but it currently expects all delimited files to have the csv extension. This is a known issue and will be fixed in a future release.
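
For example, assuming Python and a hypothetical file name, a simple copy under a new extension is enough; the delimiter inside the file is left untouched:

import shutil

# Copy a tab-separated file under a .csv extension so the Engine will accept it.
# The Engine infers the actual delimiter on import.
shutil.copyfile('my_data.tsv', 'my_data.csv')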

Compressed files (ending in .zip, .gz, .bz2 or .xz)

The Engine currently does not support importing tabular files (csv or jsonl) stored in compressed formats. You will need to decompress them offline before importing.
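
If you work in Python, a minimal sketch of decompressing a gzipped file (hypothetical file name) is shown below; the other formats can be handled with the corresponding standard-library modules (bz2, lzma, zipfile):

import gzip
import shutil

# Decompress my_data.csv.gz to my_data.csv before importing it into the Engine.
with gzip.open('my_data.csv.gz', 'rb') as f_in, open('my_data.csv', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)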

Excel (.xls and .xlsx), SAS (.sas7bdat), STATA (.dta), and SPSS (.sav, .zsav, .por) formats

The Engine currently does not support importing files in Excel, SAS, STATA, or SPSS formats.

Save all such files into .csv format. If you are comfortable writing small scripts in R or Python, you can convert to .csv using one of the following options:

  1. Pandas functions read_sas, read_spss, read_stata, read_excel, ExcelFile.parse
  2. The R packages readr, readxl, and haven

If you have data in Excel format, first save each sheet as a separate .csv file, then upload each sheet as a separate dataset. A pandas sketch of this conversion follows.
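
For example, a minimal pandas sketch (the workbook name is hypothetical, and an Excel engine such as openpyxl must be installed):

import pandas as pd

# Read every sheet of the workbook into a dict of sheet name -> DataFrame.
sheets = pd.read_excel('my_workbook.xlsx', sheet_name=None)

# Write each sheet to its own csv, ready to be uploaded as a separate dataset.
for sheet_name, sheet_df in sheets.items():
    sheet_df.to_csv(f'{sheet_name}.csv', index=False)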

Nested jsonl and jsonlines files

If you intend to ingest nested JSON lines files as tabular data, you will need to unnest them yourself. Use an appropriate tool to perform this offline:

  1. If you are importing from MongoDB, create another collection in your database and import from that: use an aggregation pipeline with the $unwind aggregation stage, coupled with aggregation operators such as $arrayToObject and $objectToArray as needed.
  2. If you have a local jsonlines file with nested data and are familiar with pandas in Python, use the JSON normalization functionality from pandas, as sketched below.
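
A minimal sketch of the pandas route, assuming a hypothetical file named nested_data.jsonl:

import json
import pandas as pd

# Read the nested jsonlines file and flatten nested keys into columns,
# e.g. {'address': {'city': ...}} becomes an 'address_city' column.
with open('nested_data.jsonl') as f:
    records = [json.loads(line) for line in f]

flat = pd.json_normalize(records, sep='_')
flat.to_csv('flat_data.csv', index=False)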

Using the GUI

Starting from the page of a project, hover over the floating action buttons at the bottom right and choose "New Dataset".

You are then taken to the "Choose method" section of the dataset creation dialog, which lists the dataset import options.

Uploading Local Files

From the "Choose method" dialog, click "File Upload":

Now you can drag and drop a file (csv, jsonl, jsonlines, or parquet) into the upload area. You will also need to provide a name for the dataset. Click "Create" to start importing the file into the Engine.

Importing from URLs

The Engine provides a convenient way to import a csv, jsonl, jsonlines, or parquet file directly from the web, so you do not have to download the file to your computer and then upload it to your project.

From the "Choose method" dialog, click "HTTP/HTTPS":

You will then need to provide a name and enter the URL(s) for your dataset in the displayed text area. Entering multiple URLs separated by ";" concatenates the datasets, which is useful when a dataset is spread over multiple files (for example, https://example.com/part1.csv;https://example.com/part2.csv). The datasets behind the URLs must therefore have the same number of columns and the same column names, or an error is returned.

Importing from Database

You can also import a table or a collection from your database as a dataset. Supported databases are:

  1. MySQL
  2. PostgreSQL
  3. SQL Server
  4. MongoDB
  5. Oracle Database (currently in development)
  6. Cassandra (currently in development)
  7. FTP Server (currently in development)

Click on the appropriate database type to begin:

In the next form, you will then need to enter the following details:

  1. Name of dataset to be created
  2. Host address. Do not include the protocol (such as "mongodb://"), port, user ID, password, or database name.
  3. Port
  4. User name
  5. Password
  6. Database name
  7. Name of table or collection

Next Step

Once data has been uploaded to the Engine, you will be prompted to begin a data wrangling session; more details can be found in the next section of the How-To Guides. In the dataset creation dialog, select "Create a new data wrangling recipe" and click "Done" to proceed to the next step.

Using API access through SDK

To access the API functions, you must first authenticate to the Engine by creating an API client:

from aiaengine import api

client = api.Client()

Then import the following modules in order to use the functions shown in the rest of this section.

import time
import json

from aiaengine.api import util, file, dataset

Uploading a file

Now you can upload a file from your local file system to the Engine as a new dataset. You need to specify the project id, the name and description of the dataset, the local path to the file, and the file format. Here is an example of uploading a csv file.

new_dataset = util.create_dataset(
    client,
    project_id=my_project_id, # You can obtain this using the ListUserProjects API call
    name='Dataset Name',
    description='What is your uploaded data about',
    data_files=['/path/to/dataset_file.csv'],
    content_type='text/csv'    # file format
)

The parameter content_type indicates the format of the uploaded file. If your file is not a csv, change content_type according to the table below.

File format    content_type
csv            'text/csv'
parquet        'application/binary+parquet'
json           'application/json'
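
For example, uploading a parquet file (hypothetical path) only changes data_files and content_type:

new_dataset = util.create_dataset(
    client,
    project_id=my_project_id,
    name='Dataset Name',
    description='What is your uploaded data about',
    data_files=['/path/to/dataset_file.parquet'],
    content_type='application/binary+parquet'    # parquet instead of csv
)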

Importing from URLs and Databases

To import from a URL or a Database, first create an "empty" dataset with a name and a description (optional). Store the details of the dataset created:

# Make an empty dataset.
dataset_details = client.datasets.CreateDataset(
    dataset.CreateDatasetRequest(
        project_id=my_project_id,
        name='My Dataset',
        description='This dataset is about...'
    )
)

Now import files into the dataset using the ImportFiles API call. The common template is as follows:

client.files.ImportFiles(
    file.ImportFilesRequest(
        dataset_id=dataset_details.id,
        source=source_name,
        data=additional_info_needed_for_import
    )
)

In the ImportFilesRequest object, specify the source type in the source field and provide additional information such as URLs, host name/port, etc. in the data field as a dictionary. The table below covers all the cases:

Import Type              source      Keys required for data                                      Value of type field in data
From web (HTTP/HTTPS)    url         urls                                                        (none)
S3                       s3          sourceUrl, awsAccessKeyId, awsSecretAccessKey, awsRegion    (none)
MySQL                    database    type, host, port, user, password, database, table           mysql
PostgreSQL               database    type, host, port, user, password, database, table           postgres
SQL Server               database    type, host, port, user, password, database, table           sqlserver
MongoDB                  database    type, host, port, user, password, database, table           mongodb

Regardless of the type of source you import from, every value inside the "data" dictionary must be a string. Hence, in the case of an HTTP/HTTPS import, you will need to encode the array of URL strings into a single JSON string, as in this example:

client.files.ImportFiles(
    file.ImportFilesRequest(
        dataset_id=dataset_details.id,
        source='url',
        data={'urls': json.dumps(['https://www.openml.org/data/get_csv/53923/sylva_prior.arff'])}
    )
)
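
An import from S3 follows the same template. The sketch below is an assumption based on the keys listed in the table above; the bucket path, credentials, and region are hypothetical:

client.files.ImportFiles(
    file.ImportFilesRequest(
        dataset_id=dataset_details.id,
        source='s3',
        data={
            'sourceUrl': 's3://my-bucket/path/to/data.csv',   # hypothetical bucket path
            'awsAccessKeyId': 'AKIA................',
            'awsSecretAccessKey': '********',
            'awsRegion': 'ap-southeast-2'
        }
    )
)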

To import from a database, do as follows:

db_connection_details = {
    'type': 'mysql',
    'host': 'sql12.freemysqlhosting.net',
    'port': '3306',
    'user': 'sql12334825',
    'password': '********',
    'database': 'sql12334825',
    'table': 'customers'
}

client.files.ImportFiles(
    file.ImportFilesRequest(
        dataset_id=dataset_details.id,
        source='database',
        data=db_connection_details
    )
)

After starting the import, you can poll the status of the dataset using the following API calls:

import_complete = False

while not import_complete:
    # Wait five seconds
    time.sleep(5)

    # Then poll for details of the dataset
    dataset_details = client.datasets.GetDataset(
        dataset.GetDatasetRequest(id=dataset_details.id)
    )

    import_complete = dataset_details.step != 'file_processing'

print(f'Import complete for dataset {dataset_details.name}')

When the import process is complete, you can proceed to the data preparation step.