Documentation for Dataset and related classes
Dataset
Object model representing a dataset in the Engine
copy(name, project_id=None, description='', version=-2)
Copy an existing dataset to a project
Parameters:
Name | Type | Description | Default |
---|---|---|---|
project_id | str | ID of the project to which the dataset will be copied. Defaults to None (same project as the current dataset). | None |
name | str | Name of the copied dataset | required |
description | str | Description of the copied dataset. Defaults to ''. | '' |
version | int | Version to copy: -2 = all versions, -1 = latest version, otherwise the exact version number | -2 |
type | str | Dataset type, either empty or 'intermediate'. Defaults to ''. | '' |
Returns:
Type | Description |
---|---|
Dataset | The new dataset |
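
The `version` argument of `copy` relies on two sentinel values. The sketch below only illustrates how those values are described above; `select_versions` and the two constant names are hypothetical and not part of the SDK, and it assumes versions are integers where a larger number is newer.

```python
from typing import List

ALL_VERSIONS = -2  # hypothetical name for the documented sentinel -2
LATEST = -1        # hypothetical name for the documented sentinel -1


def select_versions(available: List[int], version: int = ALL_VERSIONS) -> List[int]:
    """Return the versions a copy would include for a given `version` value."""
    if not available:
        return []
    if version == ALL_VERSIONS:
        return sorted(available)
    if version == LATEST:
        # Assumption: the highest version number is the latest one.
        return [max(available)]
    if version not in available:
        raise ValueError(f"version {version} does not exist")
    return [version]
```

With versions 1 to 3 available, the default selects all three, `-1` selects only version 3, and `2` selects exactly version 2.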
delete()
Delete dataset
description()
property
Description of the dataset
Returns:
Name | Type | Description |
---|---|---|
str | str | Description of the dataset |
download(output_folder, version=LATEST_VERSION)
Download the dataset as Parquet files into a folder
Parameters:
Name | Type | Description | Default |
---|---|---|---|
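output_folder | str | Folder in which to save the data files | required |
version | int | Dataset version to download. Defaults to LATEST_VERSION. | LATEST_VERSION |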
Returns:
Type | Description |
---|---|
List[str] | List of downloaded file paths |
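
A minimal usage sketch for `download`, assuming `dataset` is an existing `Dataset` instance. This reference does not say whether `download` creates the output folder, so the hypothetical helper `download_parquet` creates it first.

```python
from pathlib import Path
from typing import List


def download_parquet(dataset, output_folder: str) -> List[str]:
    """Create `output_folder` if needed, then download the dataset into it."""
    # Assumption: creating the folder up front is harmless if it already exists.
    Path(output_folder).mkdir(parents=True, exist_ok=True)
    # `download` is documented to return the list of downloaded file paths.
    paths = dataset.download(output_folder)
    return sorted(paths)
```

The helper leaves `version` at its default, so the latest version is downloaded.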
list_analyses()
List all completed dataset analyses
Returns:
Type | Description |
---|---|
List[DatasetAnalysis] | List of all completed dataset analysis results |
name()
property
Name of the dataset
Returns:
Name | Type | Description |
---|---|---|
str | str | Name of the dataset |
to_pandas(version=LATEST_VERSION)
Read the dataset into a pandas DataFrame
Parameters:
Name | Type | Description | Default |
---|---|---|---|
version | int | Dataset version. Defaults to LATEST_VERSION. | LATEST_VERSION |
Returns:
Type | Description |
---|---|
pd.DataFrame | The dataset as a pandas DataFrame |
update(name, description=None)
Update dataset information
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | Updated dataset name | required |
description | str | Updated dataset description. Defaults to None. | None |
update_data(data_source, timeout=DEFAULT_TIMEOUT)
Update a dataset with new data
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data_source | DataSource | Data source from which this dataset is updated | required |
timeout | int | Time to wait for the data to be updated. Defaults to DEFAULT_TIMEOUT. | DEFAULT_TIMEOUT |
Raises:
Type | Description |
---|---|
RuntimeError | Runtime error |
Returns:
Name | Type | Description |
---|---|---|
int | int | Latest version of the dataset |
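
Since `update_data` is documented to raise `RuntimeError` and to return the new latest version, a caller may want to handle both outcomes. The following is a sketch under stated assumptions: `try_update` is a hypothetical wrapper, and `dataset` and `data_source` are assumed to be existing `Dataset` and `DataSource` instances.

```python
from typing import Optional


def try_update(dataset, data_source, timeout: Optional[int] = None) -> Optional[int]:
    """Call `update_data`; return the new latest version, or None on failure."""
    try:
        if timeout is None:
            # Let the method fall back to its own DEFAULT_TIMEOUT.
            return dataset.update_data(data_source)
        return dataset.update_data(data_source, timeout=timeout)
    except RuntimeError as exc:
        # The reference documents RuntimeError without listing its causes.
        print(f"dataset update failed: {exc}")
        return None
```

Returning `None` on failure is only one possible design; re-raising with more context is equally reasonable when the caller cannot continue without the new data.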
CSVSettings
Settings for CSV files
Column
dataclass
Column definition
DatabaseSource
__init__(type, host, port, username, password, database, table=None, query=None, schema=[])
A DataSource backed by a database (supported DBMS: MySQL, SQL Server, PostgreSQL, and MongoDB)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
host | str | Database server hostname | required |
port | int | Database server port | required |
username | str | Database username | required |
password | str | Database password | required |
database | str | Database name | required |
table | str | Database table name. Defaults to None. | None |
query | str | Query to select the data. Defaults to None. | None |
schema | List[Column] | Schema of the data. Defaults to []. | [] |
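
Because the constructor takes a plain-text `password`, it can help to read connection details at runtime rather than hard-coding them. The sketch below builds the documented keyword arguments from environment variables. The helper `database_source_kwargs` and all `DB_*` variable names are hypothetical, and the accepted values for `type` are not listed in this reference.

```python
import os
from typing import Any, Dict, Mapping


def database_source_kwargs(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Collect DatabaseSource keyword arguments from environment variables."""
    return {
        "type": env["DB_TYPE"],          # accepted values are not documented here
        "host": env["DB_HOST"],
        "port": int(env["DB_PORT"]),     # `port` is documented as int
        "username": env["DB_USER"],
        "password": env["DB_PASSWORD"],  # read at runtime, never hard-coded
        "database": env["DB_NAME"],
        "table": env.get("DB_TABLE"),    # optional, None when unset
        "query": env.get("DB_QUERY"),    # optional, None when unset
    }
```

The resulting dictionary matches the parameter names in the signature above, so it could be passed as `DatabaseSource(**database_source_kwargs())`.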
ExcelSettings
dataclass
Settings for Excel files
FileSource
__init__(file_urls, schema=[], file_type=FileType.CSV, file_settings=CSVSettings(), storage_options={})
A DataSource from files (local, HTTP/HTTPS, S3)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_urls | List[str] | URLs of the data files. Can be local file paths. | required |
schema | List[Column] | Schema of the data. Defaults to []. | [] |
file_type | str | Format of the data files. Can be CSV, JSONLine, Parquet or Excel. Defaults to FileType.CSV. | FileType.CSV |
file_settings | FileSettings | Settings for reading the data files. Defaults to CSVSettings(). | CSVSettings() |
FileType
Supported file types
CSV = 'csv'
class-attribute
CSV file
Excel = 'excel'
class-attribute
Excel file
JSONLine = 'json'
class-attribute
JSON lines file
Parquet = 'parquet'
class-attribute
Parquet file
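
When building a `FileSource`, the `file_type` has to match the files being read. One way to choose it is from the file extension, sketched below. The `FileType` class here is a local stand-in that mirrors the documented values (whether the real class is an `Enum` is an assumption), and both `guess_file_type` and the extension table are hypothetical.

```python
from enum import Enum
from pathlib import PurePosixPath
from urllib.parse import urlparse


class FileType(str, Enum):
    """Local stand-in mirroring the documented FileType values."""
    CSV = "csv"
    Excel = "excel"
    JSONLine = "json"
    Parquet = "parquet"


# Hypothetical extension table; this reference does not document one.
_EXTENSIONS = {
    ".csv": FileType.CSV,
    ".xlsx": FileType.Excel,
    ".xls": FileType.Excel,
    ".jsonl": FileType.JSONLine,
    ".json": FileType.JSONLine,
    ".parquet": FileType.Parquet,
}


def guess_file_type(url: str) -> FileType:
    """Pick a FileType from the extension of a local path or URL."""
    # urlparse drops any query string, so "f.csv?x=1" still resolves to ".csv".
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    try:
        return _EXTENSIONS[suffix]
    except KeyError:
        raise ValueError(f"cannot infer file type from {url!r}") from None
```

Raising on an unknown extension, instead of silently falling back to CSV, keeps a mismatch between file and parser visible early.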
generate_schema(df)
Infer the schema from a pandas DataFrame
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | pandas.DataFrame | DataFrame from which to infer the schema | required |
Returns:
Type | Description |
---|---|
List[Column] | Inferred schema of the DataFrame |
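
To give an idea of what inferring a schema from a DataFrame can involve, here is a rough sketch that maps pandas dtype names to coarse column types. It is not the implementation of `generate_schema`: the `Column` dataclass below is a hypothetical stand-in (the real fields are not documented on this page), and the type names `numeric`, `boolean`, `datetime` and `text` are assumptions.

```python
from dataclasses import dataclass
from typing import List, Mapping


@dataclass
class Column:
    """Hypothetical stand-in; the real Column fields are not documented here."""
    name: str
    data_type: str


def columns_from_dtypes(dtypes: Mapping[str, str]) -> List[Column]:
    """Map pandas dtype names to coarse column types, preserving column order."""
    columns = []
    for name, dtype in dtypes.items():
        d = dtype.lower()  # also covers nullable dtypes such as "Int64"
        if d in ("bool", "boolean"):
            kind = "boolean"
        elif d.startswith(("int", "uint", "float")):
            kind = "numeric"
        elif d.startswith("datetime64"):
            kind = "datetime"
        else:
            kind = "text"  # object, string, category and anything unrecognised
        columns.append(Column(name=name, data_type=kind))
    return columns


# With pandas installed, the mapping can be built from a DataFrame `df` as:
#   dtypes = {name: str(dtype) for name, dtype in df.dtypes.items()}
```

Taking dtype names instead of a DataFrame keeps the sketch free of third-party imports; the trailing comment shows how to bridge from pandas.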
print_schema(df)
Print the schema of a pandas DataFrame
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | pandas.DataFrame | Pandas DataFrame | required |