Documentation for Dataset and related classes

Dataset

Object model representing a Dataset in the Engine

copy(name, project_id=None, description='', version=-2)

Copy this dataset to a project

Parameters:

Name Type Description Default
project_id str

ID of the project to which the dataset will be copied. Defaults to None (same project as current dataset)

None
name str

name of the copied dataset

required
description str

description of the copied dataset. Defaults to ''.

''
version int

version to copy: -2 = all versions, -1 = latest version, otherwise the exact version

-2
type str

dataset type, can be empty or 'intermediate'. Defaults to ''.

''

Returns:

Type Description
Dataset

the new dataset
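The `version` argument mixes two sentinel values with exact version numbers. The sketch below only illustrates that selection rule on a plain list of version numbers. `select_versions`, `COPY_ALL` and `COPY_LATEST` are hypothetical names that are not part of the library, and the Engine may resolve versions differently.

```python
from typing import List

# Illustrative names for the documented sentinel values of `version`.
COPY_ALL = -2      # copy every version
COPY_LATEST = -1   # copy only the latest version


def select_versions(available: List[int], version: int = COPY_ALL) -> List[int]:
    """Return the versions that a copy would include (illustration only)."""
    if not available:
        return []
    if version == COPY_ALL:
        return sorted(available)
    if version == COPY_LATEST:
        return [max(available)]
    # Any other value is treated as an exact version number.
    if version not in available:
        raise ValueError(f"version {version} does not exist")
    return [version]
```

With versions 1, 2 and 3 available, `select_versions([1, 2, 3], -1)` selects only version 3, while the default selects all three.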

delete()

Delete dataset

description() property

Description of the dataset

Returns:

Type Description
str

Description of the dataset

download(output_folder, version=LATEST_VERSION)

Download the dataset as Parquet files into a folder

Parameters:

Name Type Description Default
output_folder str

Folder to save data files

required
version int

dataset version to download. Defaults to LATEST_VERSION.

LATEST_VERSION

Returns:

Type Description
List[str]

List of downloaded file paths
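A minimal sketch of wrapping `download`, assuming only what is documented above: the method takes an output folder and returns a list of file paths. `download_to` is a hypothetical helper, and creating the folder beforehand is a precaution, since this reference does not say whether `download` creates it.

```python
from pathlib import Path
from typing import List


def download_to(dataset, folder: str) -> List[Path]:
    """Download `dataset` into `folder` and return the files as Path objects."""
    target = Path(folder)
    # Create the folder first; harmless if it already exists.
    target.mkdir(parents=True, exist_ok=True)
    # download() is documented to return the list of downloaded file paths.
    paths = [Path(p) for p in dataset.download(str(target))]
    missing = [p for p in paths if not p.exists()]
    if missing:
        raise FileNotFoundError(f"download reported missing files: {missing}")
    return paths
```

The helper accepts any object exposing `download(output_folder)`, so it can be exercised with a stub before pointing it at a real Dataset.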

list_analyses()

List all completed dataset analyses

Returns:

Type Description
List[DatasetAnalysis]

list of all completed dataset analysis results

name() property

Name of the dataset

Returns:

Type Description
str

Name of the dataset

to_pandas(version=LATEST_VERSION)

Read the dataset into a Pandas DataFrame

Parameters:

Name Type Description Default
version int

dataset version. Defaults to LATEST_VERSION.

LATEST_VERSION

Returns:

Type Description
pd.DataFrame

the dataset in the form of a Pandas DataFrame

update(name, description=None)

Update dataset information

Parameters:

Name Type Description Default
name str

updated dataset name

required
description str

updated dataset description. Defaults to None.

None

update_data(data_source, timeout=DEFAULT_TIMEOUT)

Update a dataset with new data

Parameters:

Name Type Description Default
data_source DataSource

Data source from which this dataset is updated

required
timeout int

time to wait for data to be updated. Defaults to DEFAULT_TIMEOUT.

DEFAULT_TIMEOUT

Raises:

Type Description
RuntimeError

runtime error

Returns:

Type Description
int

latest version of the dataset
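Because `update_data` is documented to raise `RuntimeError`, callers may want to handle a failed update explicitly. The following is a sketch under that assumption only. `try_update` is a hypothetical wrapper, and the conditions that trigger the error are not specified in this reference.

```python
from typing import Optional


def try_update(dataset, data_source, timeout: Optional[int] = None) -> Optional[int]:
    """Call update_data and return the new version, or None on RuntimeError."""
    try:
        if timeout is None:
            # Let the library apply its own DEFAULT_TIMEOUT.
            return dataset.update_data(data_source)
        return dataset.update_data(data_source, timeout=timeout)
    except RuntimeError as err:
        print(f"dataset update failed: {err}")
        return None
```

Returning `None` instead of re-raising is a design choice for scripts that should continue after a failed update. Re-raise the error when the failure must stop the pipeline.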

CSVSettings

Settings for CSV files

Column dataclass

Column definition

DatabaseSource

__init__(type, host, port, username, password, database, table=None, query=None, schema=[])

A DataSource from databases (DBMS: MySQL, SQL Server, PostgreSQL, and MongoDB)

Parameters:

Name Type Description Default
type str

type of database (one of the DBMS listed above)

required
host str

database server hostname

required
port int

database server port

required
username str

database username

required
password str

database password

required
database str

database name

required
table str

database table name. Defaults to None.

None
query str

query to select the data. Defaults to None.

None
schema List[Column]

Schema of the data. Defaults to [].

[]
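Since the constructor takes a password, it can help to keep credentials out of source code. The sketch below collects the documented keyword arguments from environment variables. `database_source_kwargs` and the variable names `DB_HOST`, `DB_PORT`, `DB_USER`, `DB_PASSWORD` and `DB_NAME` are assumptions made for illustration, not names defined by the library.

```python
import os
from typing import Any, Dict, Optional


def database_source_kwargs(db_type: str,
                           table: Optional[str] = None,
                           query: Optional[str] = None) -> Dict[str, Any]:
    """Collect DatabaseSource keyword arguments, reading secrets from the environment."""
    kwargs: Dict[str, Any] = {
        "type": db_type,
        "host": os.environ["DB_HOST"],
        "port": int(os.environ["DB_PORT"]),  # port is documented as int
        "username": os.environ["DB_USER"],
        "password": os.environ["DB_PASSWORD"],
        "database": os.environ["DB_NAME"],
    }
    # table and query both default to None, so pass only what was provided.
    if table is not None:
        kwargs["table"] = table
    if query is not None:
        kwargs["query"] = query
    return kwargs
```

The result could then be unpacked into the constructor as `DatabaseSource(**kwargs)`. The accepted strings for `type` are not listed in this reference.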

ExcelSettings dataclass

Settings for Excel files

FileSource

__init__(file_urls, schema=[], file_type=FileType.CSV, file_settings=CSVSettings(), storage_options={})

A DataSource from files (local, HTTP/HTTPS, S3)

Parameters:

Name Type Description Default
file_urls List[str]

URLs of data files. Can be local file paths.

required
schema List[Column]

Schema of the data. Defaults to [].

[]
file_type str

Format of the data files. Can be CSV, JSONLine, Parquet or Excel. Defaults to FileType.CSV.

FileType.CSV
file_settings FileSettings

Settings for reading the data files. Defaults to CSVSettings().

CSVSettings()

FileType

Supported file types

CSV = 'csv' class-attribute

CSV file

Excel = 'excel' class-attribute

Excel file

JSONLine = 'json' class-attribute

JSON lines file

Parquet = 'parquet' class-attribute

Parquet file
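When building a list of file URLs, the matching `file_type` can often be derived from the file extension. The sketch below uses a stand-in `FileType` class carrying the attribute values documented above, so that it runs without the library. `guess_file_type` and the extension mapping are hypothetical and not part of the library.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


class FileType:
    """Stand-in carrying the documented attribute values."""
    CSV = "csv"
    Excel = "excel"
    JSONLine = "json"
    Parquet = "parquet"


# Hypothetical extension mapping; adjust it to the files actually in use.
_SUFFIX_TO_TYPE = {
    ".csv": FileType.CSV,
    ".xlsx": FileType.Excel,
    ".xls": FileType.Excel,
    ".jsonl": FileType.JSONLine,
    ".json": FileType.JSONLine,
    ".parquet": FileType.Parquet,
}


def guess_file_type(url: str) -> str:
    """Map a local path or URL to a FileType value by its file extension."""
    # urlparse drops any query string, leaving only the path component.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    try:
        return _SUFFIX_TO_TYPE[suffix]
    except KeyError:
        raise ValueError(f"unsupported file extension: {suffix!r}") from None
```

For example, `guess_file_type("s3://bucket/data/part.parquet")` yields the same value as `FileType.Parquet`.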

generate_schema(df)

Infer the schema from a Pandas DataFrame

Parameters:

Name Type Description Default
df pandas.DataFrame

Pandas DataFrame to infer the schema from

required

Returns:

Type Description
List[Column]

inferred schema of the DataFrame

print_schema(df)

Print schema of the Pandas DataFrame

Parameters:

Name Type Description Default
df pandas.DataFrame

Pandas DataFrame

required