Action Catalogue
Aggregation
Aggregate columns within groups
Unique identifier (API): aggregate
Group by the selected columns and compute aggregations within each group
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
merge_back_to_original | Whether to merge the aggregation(s) back to the original data | False | boolean |
Examples:
Given the dataset
grp | val | val1 | bool1 | datetimes |
---|---|---|---|---|
1 | 0.45 | 1 | True | 2023-11-17T13:08:29 |
1 | 0.45 | 2 | True | 2023-11-17T13:08:29 |
3 | 0.45 | 3 | True | 2023-11-17T13:08:29 |
3 | 0.45 | 3.5 | True | 2023-08-03T08:48:29 |
3 | 1 | 3 | False | 2022-05-29T03:21:59 |
3 | 1 | 4 | False | 2022-05-29T03:21:59 |
2 | 10 | 5 | False | 2022-05-29T03:21:59 |
2 | 100 | 6 | False | 2022-05-29T03:21:59 |
If we group by grp and perform the following aggregations within each group:
mean(val)
sum(val1)
all(bool1)
any(bool1)
approx_count_distinct(val, 0)
approx_count_distinct(bool1, 0)
earliest(datetimes, 0)
latest(datetimes, 0)
first(val, 0)
last(val, 0)
approx_median(val)
approx_quantile(val)
count(bool1, "non_null")
count(bool1, "null")
we get
grp | avg_val | sum_val1 | all_bool1 | any_bool1 | approx_count_distinct_val | approx_count_distinct_bool1 | earliest_datetime | latest_datetime | first_val | last_val | median_val | quantile_val | count_bool1 | count_nulls_bool1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Text | Numeric | Numeric | Boolean | Boolean | Numeric | Numeric | DateTime | DateTime | Numeric | Numeric | Numeric | Numeric | Numeric | Numeric |
1 | 0.45 | 3.0 | True | True | 1 | 1 | 2023-11-17 13:08:29 | 2023-11-17 13:08:29 | 0.45 | 0.45 | 0.45 | 0.45 | 2 | 0 |
2 | 55 | 11 | False | False | 2 | 1 | 2022-05-29 03:21:59 | 2022-05-29 03:21:59 | 10 | 100 | 10.0 | 10 | 2 | 0 |
3 | 0.725 | 13.5 | False | True | 2 | 2 | 2022-05-29 03:21:59 | 2023-11-17 13:08:29 | 0.45 | 1 | 0.45 | 0.45 | 4 | 0 |
The user may optionally specify merge_back_to_original = True, in which case the group-by results are joined with the original table using the group-by columns as the join keys. Note: with merge_back_to_original = True the resultant table will have the same number of rows as the original table.
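As an illustration only, the same grouping and merge-back behaviour can be sketched in pandas (the column names below mirror the example; this is not the action's implementation):

```python
import pandas as pd

df = pd.DataFrame({
    "grp":  [1, 1, 3, 3, 3, 3, 2, 2],
    "val":  [0.45, 0.45, 0.45, 0.45, 1, 1, 10, 100],
    "val1": [1, 2, 3, 3.5, 3, 4, 5, 6],
})

# Group by `grp` and compute a few of the aggregations listed above.
agg = df.groupby("grp").agg(
    avg_val=("val", "mean"),
    sum_val1=("val1", "sum"),
    first_val=("val", "first"),
    last_val=("val", "last"),
).reset_index()

# merge_back_to_original = True: join the aggregates back onto every original
# row, so the result has the same number of rows as the original table.
merged = df.merge(agg, on="grp", how="left")
```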
Casting
Cast columns to categorical type
Unique identifier (API): cast_categorical
Cast the selected columns to categorical type.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
categories_specified | If set to 'true', categories must be specified explicitly. If set to 'false', every unique value in the column will become a category | False | boolean |
categories | Categories in the column | | string-list |
Cast categorical columns to text type
Unique identifier (API): cast_categorical_to_text
Cast selected categorical columns to text type
Cast columns to DateTime type
Unique identifier (API): cast_datetime
Cast the selected columns to DateTime type with specified format string
Parameters | Parameter Description | Default | Parameter Type | Suggested Values |
---|---|---|---|---|
datetime_format | DateTime format string in the strftime format. | %Y-%m-%dT%H:%M:%S.%fZ | string | ['%Y-%m-%dT%H:%M:%S.%f%z', '%Y-%m-%dT%H:%M:%S.%fZ', '%b %d %Y %H:%M:%S.%f', '%Y-%m-%dT%H:%M:%S.%f', '%m-%d-%Y %H:%M:%S.%f', '%d-%m-%Y %H:%M:%S.%f', '%d %b %Y %H:%M:%S.%f', '%Y/%m/%d %H:%M:%S.%f', '%Y-%m-%d %H:%M:%S.%f', '%d/%m/%Y %H:%M:%S.%f', '%Y-%m-%dT%H:%M:%S%z', '%Y-%m-%dT%H:%M:%SZ', '%d%m%Y %H:%M:%S.%f', '%m%d%Y %H:%M:%S.%f', '%Y%m%d %H:%M:%S.%f', '%m-%d-%Y %H:%M:%S', '%Y/%m/%d %H:%M:%S', '%Y-%m-%dT%H:%M:%S', '%b-%d-%Y %H:%M:%S', '%d-%m-%Y %H:%M:%S', '%d-%b-%Y %H:%M:%S', '%d/%m/%Y %H:%M:%S', '%Y-%m-%d %H:%M:%S', '%m/%d/%Y %I:%M:%S %p', '%Y%m%d %H:%M:%S', '%Y-%m-%dT%H:%MZ', '%m%d%Y %H:%M:%S', '%d%m%Y %H:%M:%S', '%m-%d-%Y', '%Y-%m-%d', '%d-%m-%Y', '%m %d %Y', '%d-%b-%Y', '%Y/%m/%d', '%d %m %Y', '%m/%d/%Y', '%Y %m %d', '%d/%m/%Y', '%d %b %Y', '%b-%d-%Y', '%b %d %Y', '%m%d%Y', '%Y%m%d', '%d%m%Y', '%m%Y', '%m %Y', '%m-%Y', '%m/%Y'] |
timezone | Specifies the timezone the DateTimes belong to | UTC | string | ['Africa/Abidjan', 'Africa/Accra', 'Africa/Addis_Ababa', 'Africa/Algiers', 'Africa/Asmara', 'Africa/Bamako', 'Africa/Bangui', 'Africa/Banjul', 'Africa/Bissau', 'Africa/Blantyre', 'Africa/Brazzaville', 'Africa/Bujumbura', 'Africa/Cairo', 'Africa/Casablanca', 'Africa/Ceuta', 'Africa/Conakry', 'Africa/Dakar', 'Africa/Dar_es_Salaam', 'Africa/Djibouti', 'Africa/Douala', 'Africa/El_Aaiun', 'Africa/Freetown', 'Africa/Gaborone', 'Africa/Harare', 'Africa/Johannesburg', 'Africa/Juba', 'Africa/Kampala', 'Africa/Khartoum', 'Africa/Kigali', 'Africa/Kinshasa', 'Africa/Lagos', 'Africa/Libreville', 'Africa/Lome', 'Africa/Luanda', 'Africa/Lubumbashi', 'Africa/Lusaka', 'Africa/Malabo', 'Africa/Maputo', 'Africa/Maseru', 'Africa/Mbabane', 'Africa/Mogadishu', 'Africa/Monrovia', 'Africa/Nairobi', 'Africa/Ndjamena', 'Africa/Niamey', 'Africa/Nouakchott', 'Africa/Ouagadougou', 'Africa/Porto-Novo', 'Africa/Sao_Tome', 'Africa/Tripoli', 'Africa/Tunis', 'Africa/Windhoek', 'America/Adak', 'America/Anchorage', 'America/Anguilla', 'America/Antigua', 'America/Araguaina', 'America/Argentina/Buenos_Aires', 'America/Argentina/Catamarca', 'America/Argentina/Cordoba', 'America/Argentina/Jujuy', 'America/Argentina/La_Rioja', 'America/Argentina/Mendoza', 'America/Argentina/Rio_Gallegos', 'America/Argentina/Salta', 'America/Argentina/San_Juan', 'America/Argentina/San_Luis', 'America/Argentina/Tucuman', 'America/Argentina/Ushuaia', 'America/Aruba', 'America/Asuncion', 'America/Atikokan', 'America/Bahia', 'America/Bahia_Banderas', 'America/Barbados', 'America/Belem', 'America/Belize', 'America/Blanc-Sablon', 'America/Boa_Vista', 'America/Bogota', 'America/Boise', 'America/Cambridge_Bay', 'America/Campo_Grande', 'America/Cancun', 'America/Caracas', 'America/Cayenne', 'America/Cayman', 'America/Chicago', 'America/Chihuahua', 'America/Costa_Rica', 'America/Creston', 'America/Cuiaba', 'America/Curacao', 'America/Danmarkshavn', 'America/Dawson', 'America/Dawson_Creek', 'America/Denver', 'America/Detroit', 'America/Dominica', 'America/Edmonton', 'America/Eirunepe', 'America/El_Salvador', 'America/Fort_Nelson', 'America/Fortaleza', 'America/Glace_Bay', 'America/Goose_Bay', 'America/Grand_Turk', 'America/Grenada', 'America/Guadeloupe', 'America/Guatemala', 'America/Guayaquil', 'America/Guyana', 'America/Halifax', 'America/Havana', 'America/Hermosillo', 'America/Indiana/Indianapolis', 'America/Indiana/Knox', 'America/Indiana/Marengo', 'America/Indiana/Petersburg', 'America/Indiana/Tell_City', 'America/Indiana/Vevay', 'America/Indiana/Vincennes', 'America/Indiana/Winamac', 'America/Inuvik', 'America/Iqaluit', 'America/Jamaica', 'America/Juneau', 'America/Kentucky/Louisville', 'America/Kentucky/Monticello', 'America/Kralendijk', 'America/La_Paz', 'America/Lima', 'America/Los_Angeles', 'America/Lower_Princes', 'America/Maceio', 'America/Managua', 'America/Manaus', 'America/Marigot', 'America/Martinique', 'America/Matamoros', 'America/Mazatlan', 'America/Menominee', 'America/Merida', 'America/Metlakatla', 'America/Mexico_City', 'America/Miquelon', 'America/Moncton', 'America/Monterrey', 'America/Montevideo', 'America/Montserrat', 'America/Nassau', 'America/New_York', 'America/Nipigon', 'America/Nome', 'America/Noronha', 'America/North_Dakota/Beulah', 'America/North_Dakota/Center', 'America/North_Dakota/New_Salem', 'America/Nuuk', 'America/Ojinaga', 'America/Panama', 'America/Pangnirtung', 'America/Paramaribo', 'America/Phoenix', 'America/Port-au-Prince', 
'America/Port_of_Spain', 'America/Porto_Velho', 'America/Puerto_Rico', 'America/Punta_Arenas', 'America/Rainy_River', 'America/Rankin_Inlet', 'America/Recife', 'America/Regina', 'America/Resolute', 'America/Rio_Branco', 'America/Santarem', 'America/Santiago', 'America/Santo_Domingo', 'America/Sao_Paulo', 'America/Scoresbysund', 'America/Sitka', 'America/St_Barthelemy', 'America/St_Johns', 'America/St_Kitts', 'America/St_Lucia', 'America/St_Thomas', 'America/St_Vincent', 'America/Swift_Current', 'America/Tegucigalpa', 'America/Thule', 'America/Thunder_Bay', 'America/Tijuana', 'America/Toronto', 'America/Tortola', 'America/Vancouver', 'America/Whitehorse', 'America/Winnipeg', 'America/Yakutat', 'America/Yellowknife', 'Antarctica/Casey', 'Antarctica/Davis', 'Antarctica/DumontDUrville', 'Antarctica/Macquarie', 'Antarctica/Mawson', 'Antarctica/McMurdo', 'Antarctica/Palmer', 'Antarctica/Rothera', 'Antarctica/Syowa', 'Antarctica/Troll', 'Antarctica/Vostok', 'Arctic/Longyearbyen', 'Asia/Aden', 'Asia/Almaty', 'Asia/Amman', 'Asia/Anadyr', 'Asia/Aqtau', 'Asia/Aqtobe', 'Asia/Ashgabat', 'Asia/Atyrau', 'Asia/Baghdad', 'Asia/Bahrain', 'Asia/Baku', 'Asia/Bangkok', 'Asia/Barnaul', 'Asia/Beirut', 'Asia/Bishkek', 'Asia/Brunei', 'Asia/Chita', 'Asia/Choibalsan', 'Asia/Colombo', 'Asia/Damascus', 'Asia/Dhaka', 'Asia/Dili', 'Asia/Dubai', 'Asia/Dushanbe', 'Asia/Famagusta', 'Asia/Gaza', 'Asia/Hebron', 'Asia/Ho_Chi_Minh', 'Asia/Hong_Kong', 'Asia/Hovd', 'Asia/Irkutsk', 'Asia/Jakarta', 'Asia/Jayapura', 'Asia/Jerusalem', 'Asia/Kabul', 'Asia/Kamchatka', 'Asia/Karachi', 'Asia/Kathmandu', 'Asia/Khandyga', 'Asia/Kolkata', 'Asia/Krasnoyarsk', 'Asia/Kuala_Lumpur', 'Asia/Kuching', 'Asia/Kuwait', 'Asia/Macau', 'Asia/Magadan', 'Asia/Makassar', 'Asia/Manila', 'Asia/Muscat', 'Asia/Nicosia', 'Asia/Novokuznetsk', 'Asia/Novosibirsk', 'Asia/Omsk', 'Asia/Oral', 'Asia/Phnom_Penh', 'Asia/Pontianak', 'Asia/Pyongyang', 'Asia/Qatar', 'Asia/Qostanay', 'Asia/Qyzylorda', 'Asia/Riyadh', 'Asia/Sakhalin', 'Asia/Samarkand', 'Asia/Seoul', 'Asia/Shanghai', 'Asia/Singapore', 'Asia/Srednekolymsk', 'Asia/Taipei', 'Asia/Tashkent', 'Asia/Tbilisi', 'Asia/Tehran', 'Asia/Thimphu', 'Asia/Tokyo', 'Asia/Tomsk', 'Asia/Ulaanbaatar', 'Asia/Urumqi', 'Asia/Ust-Nera', 'Asia/Vientiane', 'Asia/Vladivostok', 'Asia/Yakutsk', 'Asia/Yangon', 'Asia/Yekaterinburg', 'Asia/Yerevan', 'Atlantic/Azores', 'Atlantic/Bermuda', 'Atlantic/Canary', 'Atlantic/Cape_Verde', 'Atlantic/Faroe', 'Atlantic/Madeira', 'Atlantic/Reykjavik', 'Atlantic/South_Georgia', 'Atlantic/St_Helena', 'Atlantic/Stanley', 'Australia/Adelaide', 'Australia/Brisbane', 'Australia/Broken_Hill', 'Australia/Darwin', 'Australia/Eucla', 'Australia/Hobart', 'Australia/Lindeman', 'Australia/Lord_Howe', 'Australia/Melbourne', 'Australia/Perth', 'Australia/Sydney', 'Canada/Atlantic', 'Canada/Central', 'Canada/Eastern', 'Canada/Mountain', 'Canada/Newfoundland', 'Canada/Pacific', 'Europe/Amsterdam', 'Europe/Andorra', 'Europe/Astrakhan', 'Europe/Athens', 'Europe/Belgrade', 'Europe/Berlin', 'Europe/Bratislava', 'Europe/Brussels', 'Europe/Bucharest', 'Europe/Budapest', 'Europe/Busingen', 'Europe/Chisinau', 'Europe/Copenhagen', 'Europe/Dublin', 'Europe/Gibraltar', 'Europe/Guernsey', 'Europe/Helsinki', 'Europe/Isle_of_Man', 'Europe/Istanbul', 'Europe/Jersey', 'Europe/Kaliningrad', 'Europe/Kiev', 'Europe/Kirov', 'Europe/Lisbon', 'Europe/Ljubljana', 'Europe/London', 'Europe/Luxembourg', 'Europe/Madrid', 'Europe/Malta', 'Europe/Mariehamn', 'Europe/Minsk', 'Europe/Monaco', 'Europe/Moscow', 'Europe/Oslo', 'Europe/Paris', 
'Europe/Podgorica', 'Europe/Prague', 'Europe/Riga', 'Europe/Rome', 'Europe/Samara', 'Europe/San_Marino', 'Europe/Sarajevo', 'Europe/Saratov', 'Europe/Simferopol', 'Europe/Skopje', 'Europe/Sofia', 'Europe/Stockholm', 'Europe/Tallinn', 'Europe/Tirane', 'Europe/Ulyanovsk', 'Europe/Uzhgorod', 'Europe/Vaduz', 'Europe/Vatican', 'Europe/Vienna', 'Europe/Vilnius', 'Europe/Volgograd', 'Europe/Warsaw', 'Europe/Zagreb', 'Europe/Zaporozhye', 'Europe/Zurich', 'GMT', 'Indian/Antananarivo', 'Indian/Chagos', 'Indian/Christmas', 'Indian/Cocos', 'Indian/Comoro', 'Indian/Kerguelen', 'Indian/Mahe', 'Indian/Maldives', 'Indian/Mauritius', 'Indian/Mayotte', 'Indian/Reunion', 'Pacific/Apia', 'Pacific/Auckland', 'Pacific/Bougainville', 'Pacific/Chatham', 'Pacific/Chuuk', 'Pacific/Easter', 'Pacific/Efate', 'Pacific/Enderbury', 'Pacific/Fakaofo', 'Pacific/Fiji', 'Pacific/Funafuti', 'Pacific/Galapagos', 'Pacific/Gambier', 'Pacific/Guadalcanal', 'Pacific/Guam', 'Pacific/Honolulu', 'Pacific/Kiritimati', 'Pacific/Kosrae', 'Pacific/Kwajalein', 'Pacific/Majuro', 'Pacific/Marquesas', 'Pacific/Midway', 'Pacific/Nauru', 'Pacific/Niue', 'Pacific/Norfolk', 'Pacific/Noumea', 'Pacific/Pago_Pago', 'Pacific/Palau', 'Pacific/Pitcairn', 'Pacific/Pohnpei', 'Pacific/Port_Moresby', 'Pacific/Rarotonga', 'Pacific/Saipan', 'Pacific/Tahiti', 'Pacific/Tarawa', 'Pacific/Tongatapu', 'Pacific/Wake', 'Pacific/Wallis', 'US/Alaska', 'US/Arizona', 'US/Central', 'US/Eastern', 'US/Hawaii', 'US/Mountain', 'US/Pacific', 'UTC'] |
Cast specified columns to JSON array type
Unique identifier (API): cast_json_array
This action marks input columns as "JSONArray" type, implying that the values in the column are expected to be JSON strings parsable into arrays of elements with uniform type.
Cast specified columns to JSON object type
Unique identifier (API): cast_json_object
This action marks input columns as "JSONObject" type, implying that the values in the column are expected to be JSON strings parsable into objects with attributes and values.
Cast columns to numeric type
Unique identifier (API): cast_numeric
Cast the selected columns to numeric type.
Column Group
Get column group components
Unique identifier (API): get_column_group_components
Extracts specified components from a column group by their suffix. If a specified component suffix does not exist, will return a null column as the corresponding output.
Examples:
For the following input table:
dummy_col | __group_grp_a_0 | __group_grp_a_1 | __group_grp_a_2 |
---|---|---|---|
aaa | 1 | 2 | 1 |
bbb | Sentence A | Sentence B | Sentence C |
with:
* input column: 'grp_a'
* output columns: ['component_0_from_grp_a', 'component_2_from_grp_a']
* suffixes: [0, 2]
The output is:
dummy_col | __group_grp_a_0 | __group_grp_a_1 | __group_grp_a_2 | component_0_from_grp_a | component_2_from_grp_a |
---|---|---|---|---|---|
aaa | 1 | 2 | 1 | 1 | 1 |
bbb | Sentence A | Sentence B | Sentence C | Sentence A | Sentence C |
Conversion
Convert numeric column to DateTime
Unique identifier (API): numeric_to_datetime
Converts the given numeric columns into UTC timestamps, interpreting them as offset values in the given unit since the specified epoch.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
units | Specifies the unit of time that the values in the numeric column represent. | Seconds | single-select |
rel_to | Specifies the "origin" UTC DateTime. The conversion from Numeric to DateTime shall be relative to this "origin". | 1970-01-01 | string |
Examples:
For units == 'seconds', rel_to == '1970-01-01T00:00:00' and the following input table:
numeric timestamps |
---|
0 |
-86400 |
86400 |
The expected output is:
numeric timestamps |
---|
1970-01-01 00:00:00 |
1969-12-31 00:00:00 |
1970-01-02 00:00:00 |
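A minimal pandas sketch of this conversion (illustrative only, not the action's implementation):

```python
import pandas as pd

offsets = pd.Series([0, -86400, 86400])

# Interpret the numbers as seconds ("units") counted from the origin ("rel_to").
timestamps = pd.to_datetime(offsets, unit="s", origin="1970-01-01T00:00:00")
# -> 1970-01-01, 1969-12-31, 1970-01-02
```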
Custom Formula
Create a new column with a formula
Unique identifier (API): create_new_column_with_formula_v2
Creates a new column with any user-entered formula. Common Excel/SQL functions and operators for manipulating Numeric, DateTime, Text, and Boolean values are permitted. For a full list of allowed functions, see documentation.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
expression | A formula involving column names. For example: 'column_B / column_A + log(column_D * 2)' | | formula |
Examples:
Consider the dataset:
manufacturer | model | year | displ | cty | hwy |
---|---|---|---|---|---|
audi | a4 quattro | 1999 | 1.8 | 16 | 25 |
audi | a6 quattro | 1999 | 2.8 | 15 | 24 |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 |
chevrolet | c1500 suburban 2wd | 2008 | 6 | 12 | 17 |
chevrolet | corvette | 1999 | 5.7 | 16 | 26 |
chevrolet | corvette | 1999 | 5.7 | 15 | 23 |
chevrolet | k1500 tahoe 4wd | 1999 | 6.5 | 14 | 17 |
chevrolet | malibu | 1999 | 3.1 | 18 | 26 |
dodge | caravan 2wd | 1999 | 3.8 | 15 | 22 |
where cty, hwy and year are of Numeric type.
Given the formula 'hwy - cty' and the output column name 'hwy_minus_cty', the result will be:
manufacturer | model | year | displ | cty | hwy | hwy_minus_cty |
---|---|---|---|---|---|---|
audi | a4 quattro | 1999 | 1.8 | 16 | 25 | 9 |
audi | a6 quattro | 1999 | 2.8 | 15 | 24 | 9 |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 | 7 |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 | 7 |
chevrolet | c1500 suburban 2wd | 2008 | 6 | 12 | 17 | 5 |
chevrolet | corvette | 1999 | 5.7 | 16 | 26 | 10 |
chevrolet | corvette | 1999 | 5.7 | 15 | 23 | 8 |
chevrolet | k1500 tahoe 4wd | 1999 | 6.5 | 14 | 17 | 3 |
chevrolet | malibu | 1999 | 3.1 | 18 | 26 | 8 |
dodge | caravan 2wd | 1999 | 3.8 | 15 | 22 | 7 |
where the new hwy_minus_cty column will be of Numeric type.
For the same input data, the formula 'IF(year <= 2000, "20th", "21st")' with the output column name 'century' will result in:
manufacturer | model | year | displ | cty | hwy | century |
---|---|---|---|---|---|---|
audi | a4 quattro | 1999 | 1.8 | 16 | 25 | 20th |
audi | a6 quattro | 1999 | 2.8 | 15 | 24 | 20th |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 | 21st |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 | 21st |
chevrolet | c1500 suburban 2wd | 2008 | 6 | 12 | 17 | 21st |
chevrolet | corvette | 1999 | 5.7 | 16 | 26 | 20th |
chevrolet | corvette | 1999 | 5.7 | 15 | 23 | 20th |
chevrolet | k1500 tahoe 4wd | 1999 | 6.5 | 14 | 17 | 20th |
chevrolet | malibu | 1999 | 3.1 | 18 | 26 | 20th |
dodge | caravan 2wd | 1999 | 3.8 | 15 | 22 | 20th |
where the new century column will be of Text type.
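The two formulas above can be sketched in pandas as follows (illustrative only; the action itself accepts the formula text directly):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"year": [1999, 2008], "cty": [16, 16], "hwy": [25, 23]})

# 'hwy - cty' -> a Numeric output column
df["hwy_minus_cty"] = df.eval("hwy - cty")

# 'IF(year <= 2000, "20th", "21st")' -> a Text output column
df["century"] = np.where(df["year"] <= 2000, "20th", "21st")
```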
Transform columns with a formula
Unique identifier (API): transform_columns_with_formula_v2
Transforms one or more column(s) with a user-specified formula. Common Excel/SQL functions and operators for manipulating Numeric, DateTime, Text, Boolean and Categorical values as well as JSON Objects/Arrays are permitted. For a full list of these functions, see documentation.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
formula | Formula where COL/COLUMN refers to the column(s) to be transformed. For example, "col_A + col_B - $COL" | | formula |
Examples:
For example, given the following dataset
qty_1 | qty_2 |
---|---|
10 | null |
20 | 15 |
15 | 6 |
null | 1 |
8 | 8 |
and formula = "$col + (qty_2 + $col) * 10", with input column qty_1, the result will be
qty_1 | qty_2 |
---|---|
null | null |
370 | 15 |
225 | 6 |
null | 1 |
168 | 8 |
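A rough sketch of how the $col placeholder behaves, assuming it simply stands for the column being transformed (here qty_1):

```python
import pandas as pd

df = pd.DataFrame({"qty_1": [10, 20, 15, None, 8],
                   "qty_2": [None, 15, 6, 1, 8]})

# "$col + (qty_2 + $col) * 10" applied to qty_1; nulls propagate.
col = df["qty_1"]
df["qty_1"] = col + (df["qty_2"] + col) * 10
# qty_1 -> NaN, 370, 225, NaN, 168
```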
Data Cleaning/Processing
Concatenate with delimiter
Unique identifier (API): concat_cols_delim
Concatenates one or more text columns / column groups with a specified delimiter.
Parameters | Parameter Description | Default | Parameter Type | Suggested Values |
---|---|---|---|---|
delimiter | Delimiter to be appended between concatenated values | | string | [',', ';', '-', ' |
Examples:
Assuming delimiter == ', ' and the input table is (one column, one group with 2 components):
col_1 | __group_grp_0 | __group_grp_1 |
---|---|---|
Hello | World | A |
Hello | null | B |
The output will be:
col_1 | __group_grp_0 | __group_grp_1 | comb |
---|---|---|---|
Hello | World | A | Hello, World, A |
Hello | null | B | Hello, , B |
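A pandas sketch of the same concatenation, where nulls become empty strings but keep their delimiter slot (as in "Hello, , B" above); illustrative only:

```python
import pandas as pd

df = pd.DataFrame({"col_1": ["Hello", "Hello"],
                   "__group_grp_0": ["World", None],
                   "__group_grp_1": ["A", "B"]})

cols = ["col_1", "__group_grp_0", "__group_grp_1"]
df["comb"] = df[cols].fillna("").apply(", ".join, axis=1)
# ['Hello, World, A', 'Hello, , B']
```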
Copy
Unique identifier (API): copy
Creates a copy of the specified columns OR column groups
Examples:
For the following input table: (2 columns and 1 group with 2 components)
col_1 | col_2 | __group_grp_0 | __group_grp_1 |
---|---|---|---|
0.14667831357791283 | 2194986032 | 0.6765007779041725 | 0.876755413362593 |
0.8424524032441315 | -8793057235 | 0.21887501332663328 | 0.6874699121416429 |
0.08379407453894183 | -1300419439 | 0.2108548261597638 | 0.39335408485745316 |
0.09723766918372134 | 2467206050 | 0.2499576969146754 | 0.783647656429706 |
0.13917572512342702 | 8415264685 | 0.9150713227141526 | 0.071808029978221 |
We select to copy: [col_2, grp] (grp is the name of the group). The output is:
col_1 | col_2 | __group_grp_0 | __group_grp_1 | col_2_copy | __group_grp_copy_0 | __group_grp_copy_1 |
---|---|---|---|---|---|---|
0.14667831357791283 | 2194986032 | 0.6765007779041725 | 0.876755413362593 | 2194986032 | 0.6765007779041725 | 0.876755413362593 |
0.8424524032441315 | -8793057235 | 0.21887501332663328 | 0.6874699121416429 | -8793057235 | 0.21887501332663328 | 0.6874699121416429 |
0.08379407453894183 | -1300419439 | 0.2108548261597638 | 0.39335408485745316 | -1300419439 | 0.2108548261597638 | 0.39335408485745316 |
0.09723766918372134 | 2467206050 | 0.2499576969146754 | 0.783647656429706 | 2467206050 | 0.2499576969146754 | 0.783647656429706 |
0.13917572512342702 | 8415264685 | 0.9150713227141526 | 0.071808029978221 | 8415264685 | 0.9150713227141526 | 0.071808029978221 |
Drop
Unique identifier (API): drop
Drop columns and/or column groups
Examples:
For example, given the following dataset
col_1 | col_2 | __group_A_0 | __group_A_1 | __group_B_mean | __group_B_count |
---|---|---|---|---|---|
0.666099547109415 | cat_1 | 0.8268666013533758 | 0 | 0.9842078018727513 | 0 |
0.293068258318807 | cat_2 | 0.9747909820892654 | 1 | 0.703905046233064 | 1 |
0.2961075040827643 | cat_3 | 0.14699408293664762 | 2 | 0.5417745101099833 | 2 |
0.11255366333183248 | cat_4 | 0.5392931543191276 | 3 | 0.35921739535158015 | 3 |
0.9707308564097861 | cat_5 | 0.23839121304354394 | 4 | 0.8751438273689285 | 4 |
with input column col_1 and column group A, the result is
col_2 | __group_B_mean | __group_B_count |
---|---|---|
cat_1 | 0.9842078018727513 | 0 |
cat_2 | 0.703905046233064 | 1 |
cat_3 | 0.5417745101099833 | 2 |
cat_4 | 0.35921739535158015 | 3 |
cat_5 | 0.8751438273689285 | 4 |
Find and replace
Unique identifier (API): find_replace
Finds and replaces the specified strings in the input column(s) according to the given mapping
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
replacements | Mapping between the search strings and their replacements | | map |
match_case | Whether the matches are case sensitive | False | boolean |
positional_match | Matches the search only if a match is found in the specified position in the string | anywhere | single-select |
Examples:
For the following input table:
col_x | col_y |
---|---|
They won the fifth place | They won the sixth place |
No match here | whole word is th |
and input columns: ['col_x', 'col_y'], with the replacement mapping 'th' -> '##' and parameters (positional_match, match_case) as given below, the output will be:
For ('prefix', True):
col_x | col_y |
---|---|
They won ##e fifth place | They won ##e sixth place |
No match here | whole word is ## |
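A Python sketch of this example, assuming the replacement mapping is 'th' -> '##' and that positional_match == 'prefix' means the search string must occur at the start of a word (both inferred from the output above; not the action's implementation):

```python
import re
import pandas as pd

df = pd.DataFrame({"col_x": ["They won the fifth place", "No match here"],
                   "col_y": ["They won the sixth place", "whole word is th"]})

replacements = {"th": "##"}   # assumed mapping, inferred from the example output
match_case = True
flags = 0 if match_case else re.IGNORECASE

for search, repl in replacements.items():
    pattern = r"\b" + re.escape(search)   # word-start ("prefix") match
    for col in ["col_x", "col_y"]:
        df[col] = df[col].str.replace(pattern, repl, regex=True, flags=flags)
```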
Impute categorical columns
Unique identifier (API): impute_categorical
Fill missing values in categorical input columns by a specified method
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
method | Method for imputing input columns | most_frequent | single-select |
fill_value | Value to fill when method is specified as constant | | string |
Examples:
For example, a categorical column white, black, None, white, yellow is changed to white, black, white, white, yellow with method = "most_frequent" (missing values are filled with the most frequent value "white").
Impute numeric columns
Unique identifier (API): impute_numeric
Fill missing values in numeric input columns by a specified method
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
method | Method for imputing input columns | mean | single-select |
fill_value | Value to fill when method is specified as constant | | float |
Examples:
For example, a numeric column 100, 120, None, 80, 100, None is changed to 100, 120, 100, 80, 100, 100 with method = "mean" (missing values are filled with the mean 100).
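Both imputation examples above can be sketched with plain pandas (illustrative only):

```python
import pandas as pd

colour = pd.Series(["white", "black", None, "white", "yellow"])
amount = pd.Series([100, 120, None, 80, 100, None], dtype="float")

# method = "most_frequent" for a categorical column
colour_filled = colour.fillna(colour.mode().iloc[0])   # fills with "white"

# method = "mean" for a numeric column
amount_filled = amount.fillna(amount.mean())           # fills with 100.0
```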
Rename
Unique identifier (API): rename
Renames specified columns OR column groups
Examples:
For the following input table: (3 columns and 1 group with 2 components)
col_1 | col_2 | col_3 | __group_grp_0 | __group_grp_1 |
---|---|---|---|---|
0.5267075 | 8.24489e+07 | 1.23345 | 0.2635994 | 0.1005395 |
0.5707881 | 2.044422e+09 | -0.1124 | 0.3945866 | 0.4787061 |
0.6478216 | -5.179201e+09 | 0.00011 | 0.1314026 | 0.2335102 |
We select to rename: [col_2, grp] (grp is the name of the group) to [col_A, grp_A]. The output is:
col_1 | col_A | col_3 | __group_grp_A_0 | __group_grp_A_1 |
---|---|---|---|---|
0.5267075 | 8.24489e+07 | 1.23345 | 0.2635994 | 0.1005395 |
0.5707881 | 2.044422e+09 | -0.1124 | 0.3945866 | 0.4787061 |
0.6478216 | -5.179201e+09 | 0.00011 | 0.1314026 | 0.2335102 |
Sort on columns
Unique identifier (API): sort_on_columns
Sort the dataset based on the input columns in a specified order (ascending/descending)
Examples:
For the following input table:
animal | colour | adopted | qty | desexed | note |
---|---|---|---|---|---|
cat | yellow | 2016-07-31T00:00:12 | 10 | True | There is a story |
dog | blue | 2017-08-20T07:23:54 | 4 | True | |
cat | blue | 2017-09-20T00:23:50 | 9 | True | No story |
cat | blue | 2018-05-01T04:40:15 | 8 | True | There is a story |
dog | yellow | 2019-02-05T12:12:02 | 1 | True | There is another story |
dog | black | 2019-08-20T07:23:02 | 2 | False | There is a story |
cat | black | 2020-01-20T00:23:50 | 2 | True | No story |
dog | yellow | 2020-08-02T15:00:08 | 2 | False |
We select sorting by qty in ascending order and adopted in descending order.
The output is:
animal | colour | adopted | qty | desexed | note |
---|---|---|---|---|---|
dog | yellow | 2019-02-05 12:12:02 | 1 | True | There is another story |
dog | yellow | 2020-08-02 15:00:08 | 2 | False | |
cat | black | 2020-01-20 00:23:50 | 2 | True | No story |
dog | black | 2019-08-20 07:23:02 | 2 | False | There is a story |
dog | blue | 2017-08-20 07:23:54 | 4 | True | |
cat | blue | 2018-05-01 04:40:15 | 8 | True | There is a story |
cat | blue | 2017-09-20 00:23:50 | 9 | True | No story |
cat | yellow | 2016-07-31 00:00:12 | 10 | True | There is a story |
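The equivalent pandas call for this example (illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    "animal": ["cat", "dog", "dog", "cat"],
    "adopted": pd.to_datetime(["2016-07-31 00:00:12", "2019-02-05 12:12:02",
                               "2020-08-02 15:00:08", "2020-01-20 00:23:50"]),
    "qty": [10, 1, 2, 2],
})

# qty ascending, then adopted descending breaks ties on qty.
out = df.sort_values(by=["qty", "adopted"], ascending=[True, False])
```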
Strip HTML tags
Unique identifier (API): strip_html
Strip HTML tags from text to convert webpages to text documents, sans HTML formatting
Examples:
Strip HTML tags from text, e.g. "<emp> Text </emp>" becomes "Text".
Fill null with the nearest preceding/following values
Unique identifier (API): window_fill_null_by_nearest
Fill null with the nearest non-null values in either preceding or following rows after ordering, optionally with partitioning
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
method | Whether to use the nearest value in preceding or following rows | preceding | single-select |
Examples:
For example, given the following dataset
animal | colour | adopted | qty |
---|---|---|---|
cat | yellow | 2020-01-01 00:00:00 | null |
cat | yellow | 2020-01-02 00:00:00 | null |
cat | yellow | 2020-01-03 00:00:00 | 5.0 |
cat | yellow | 2020-01-04 00:00:00 | 7.0 |
cat | yellow | 2020-01-05 00:00:00 | 10.0 |
cat | white | 2020-01-06 00:00:00 | 1.0 |
cat | white | 2020-01-07 00:00:00 | 4.0 |
cat | white | 2020-01-08 00:00:00 | null |
cat | white | 2020-01-09 00:00:00 | 7.0 |
cat | white | 2020-01-10 00:00:00 | 8.0 |
dog | yellow | 2020-01-11 00:00:00 | 9.0 |
dog | yellow | 2020-01-12 00:00:00 | 7.0 |
dog | yellow | 2020-01-13 00:00:00 | 6.0 |
dog | yellow | 2020-01-14 00:00:00 | 5.0 |
dog | yellow | 2020-01-15 00:00:00 | 10.0 |
dog | white | 2020-01-16 00:00:00 | 6.0 |
dog | white | 2020-01-17 00:00:00 | 5.0 |
dog | white | 2020-01-18 00:00:00 | 8.0 |
dog | white | 2020-01-19 00:00:00 | null |
dog | white | 2020-01-20 00:00:00 | null |
with input column qty
, partitioned by animal
and colour
and ordered by adopted
for method preceding
, the result is
animal | colour | adopted | qty |
---|---|---|---|
cat | white | 2020-01-06 00:00:00 | 1.0 |
cat | white | 2020-01-07 00:00:00 | 4.0 |
cat | white | 2020-01-08 00:00:00 | 4.0 |
cat | white | 2020-01-09 00:00:00 | 7.0 |
cat | white | 2020-01-10 00:00:00 | 8.0 |
cat | yellow | 2020-01-01 00:00:00 | null |
cat | yellow | 2020-01-02 00:00:00 | null |
cat | yellow | 2020-01-03 00:00:00 | 5.0 |
cat | yellow | 2020-01-04 00:00:00 | 7.0 |
cat | yellow | 2020-01-05 00:00:00 | 10.0 |
dog | white | 2020-01-16 00:00:00 | 6.0 |
dog | white | 2020-01-17 00:00:00 | 5.0 |
dog | white | 2020-01-18 00:00:00 | 8.0 |
dog | white | 2020-01-19 00:00:00 | 8.0 |
dog | white | 2020-01-20 00:00:00 | 8.0 |
dog | yellow | 2020-01-11 00:00:00 | 9.0 |
dog | yellow | 2020-01-12 00:00:00 | 7.0 |
dog | yellow | 2020-01-13 00:00:00 | 6.0 |
dog | yellow | 2020-01-14 00:00:00 | 5.0 |
dog | yellow | 2020-01-15 00:00:00 | 10.0 |
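A pandas sketch of method == 'preceding' with partitioning and ordering (illustrative only, not the action's implementation):

```python
import pandas as pd

df = pd.DataFrame({
    "animal":  ["cat", "cat", "cat", "dog", "dog", "dog"],
    "colour":  ["white"] * 6,
    "adopted": pd.to_datetime(["2020-01-06", "2020-01-07", "2020-01-08",
                               "2020-01-18", "2020-01-19", "2020-01-20"]),
    "qty": [1.0, 4.0, None, 8.0, None, None],
})

# Order within each partition, then forward-fill from preceding rows.
df = df.sort_values("adopted")
df["qty"] = df.groupby(["animal", "colour"])["qty"].ffill()
# cat: 1, 4, 4   dog: 8, 8, 8   (leading nulls with no preceding value stay null)
```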
Date/DateTime
Add time delta
Unique identifier (API): add_timedelta
Adds a time delta (in seconds) to one or more DateTime columns. Changes the columns in place.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
time_delta_sec | This is the time-duration in seconds to be added to one or more DateTime columns. | 0 | integer |
Examples:
For time_delta_sec == -100 and the given DateTime 2022-06-30 23:51:08.825072, the output is 2022-06-30 23:49:28.825072.
Create cyclic encoding features
Unique identifier (API): cyclic_encode
A directive that receives one or more columns that are cyclical in nature. Denote such a column as 'x' and its period as 'p'. For each such column, creates 2 new columns: x_sin = sin(2πx/p) and x_cos = cos(2πx/p).
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
period | The period of the cyclic feature | 1 | float |
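A short sketch of the two derived columns for an hour-of-day feature with period p = 24 (illustrative only):

```python
import numpy as np
import pandas as pd

hour = pd.Series([0, 6, 12, 18])   # a cyclic feature
p = 24                             # its period

hour_sin = np.sin(2 * np.pi * hour / p)
hour_cos = np.cos(2 * np.pi * hour / p)
```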
Compute time elapsed
Unique identifier (API): diff_datetime
Calculates the time difference in seconds between two DateTime columns
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
units | Specifies the units of the time difference result | Seconds | single-select |
Examples:
If the time difference units are Hours, for the following input:
col_1 | col_2 |
---|---|
2020-07-23 11:26:04.812256 | 2020-07-31 11:26:04.812256 |
We expect the following output:
col_1 | col_2 | diff_1 |
---|---|---|
2020-07-23 11:26:04.812256 | 2020-07-31 11:26:04.812256 | -192.0 |
Extract AM/PM
Unique identifier (API): extract_am_pm
Extracts the AM/PM value for rows of a DateTime column
Examples:
2020-04-15T15:53:44 -> will return: PM
Extract day of month
Unique identifier (API): extract_day_of_month_number
Extract the day of month as a number from 1 to 31
Examples:
2020-04-15T15:53:44 -> will return: 15
Extract day of week name
Unique identifier (API): extract_day_of_week_name
Extract the name of the day in week ('Monday' to 'Sunday')
Examples:
2020-04-15T15:53:44 -> will return: Wednesday
Extract day of week number
Unique identifier (API): extract_day_of_week_number
Extract the day of week as a number ('Sunday' = 1, 'Saturday' = 7)
Examples:
2020-04-15T15:53:44, with day coding Monday=0, Tuesday=1, ..., Sunday=6 -> will return: 02.
Extract hours
Unique identifier (API): extract_hour
Extract hour of the day as a number from 0 to 23
Examples:
2020-04-15T15:53:44 -> will return: 15
Extract minutes
Unique identifier (API): extract_minutes
Extract minutes after the hour as a number from 0 to 59
Examples:
2020-04-15T15:53:44 -> will return: 53
Extract month name
Unique identifier (API): extract_month_name
Extracts the month names of a DateTime column
Examples:
2020-04-15T15:53:44 -> will return: April
Extract month number
Unique identifier (API): extract_month_number
Extract the month as a number from 1 to 12
Extract quarter
Unique identifier (API): extract_quarter
Extract the quarter (3-month period) of the year in which the dates fall as an integer from 1 to 4.
Examples:
2020-04-15T15:53:44 -> will return: 02
Extract seconds
Unique identifier (API): extract_seconds
Extract seconds after the minute as a number from 0 to 59
Examples:
2020-04-15T15:53:44 -> will return: 44
Extract timestamp
Unique identifier (API): extract_timestamp
Extracts the POSIX timestamp (number of seconds since January 1st, 1970).
Examples:
Returns a POSIX timestamp as float. (Number of seconds since January 1st, 1970).
2020-04-15T15:53:44 -> will return: 1586930024.0.
Extract UTC offset
Unique identifier (API): extract_utc_offset
Extracts the UTC offset (in hours) from timezone aware string timestamps
Parameters | Parameter Description | Default | Parameter Type | Suggested Values |
---|---|---|---|---|
date_format | The date_format of the column | %Y-%m-%dT%H:%M:%S%z | string | ['%Y-%m-%dT%H:%M:%S.%fZ%z', '%b %d %Y %H:%M:%S.%f%z', '%Y-%m-%dT%H:%M:%S.%f%z', '%m-%d-%Y %H:%M:%S.%f%z', '%d-%m-%Y %H:%M:%S.%f%z', '%d %b %Y %H:%M:%S.%f%z', '%Y/%m/%d %H:%M:%S.%f%z', '%Y-%m-%d %H:%M:%S.%f%z', '%d/%m/%Y %H:%M:%S.%f%z', '%Y-%m-%dT%H:%M:%SZ%z', '%d%m%Y %H:%M:%S.%f%z', '%m%d%Y %H:%M:%S.%f%z', '%Y%m%d %H:%M:%S.%f%z', '%m-%d-%Y %H:%M:%S%z', '%Y/%m/%d %H:%M:%S%z', '%Y-%m-%dT%H:%M:%S%z', '%b-%d-%Y %H:%M:%S%z', '%d-%m-%Y %H:%M:%S%z', '%d-%b-%Y %H:%M:%S%z', '%d/%m/%Y %H:%M:%S%z', '%Y-%m-%d %H:%M:%S%z', '%m/%d/%Y %I:%M:%S %p%z', '%Y%m%d %H:%M:%S%z', '%Y-%m-%dT%H:%MZ%z', '%m%d%Y %H:%M:%S%z', '%d%m%Y %H:%M:%S%z', '%m-%d-%Y%z', '%Y-%m-%d%z', '%d-%m-%Y%z', '%m %d %Y%z', '%d-%b-%Y%z', '%Y/%m/%d%z', '%d %m %Y%z', '%m/%d/%Y%z', '%Y %m %d%z', '%d/%m/%Y%z', '%d %b %Y%z', '%b-%d-%Y%z', '%b %d %Y%z', '%m%d%Y%z', '%Y%m%d%z', '%d%m%Y%z', '%m%Y%z', '%m %Y%z', '%m-%Y%z', '%m/%Y%z'] |
Examples:
For date_format == '%Y-%m-%dT%H:%M:%S%z' and the following input:
col_1 |
---|
1970-01-01T23:23:14+0430 |
2000-10-10T10:14:42-0515 |
1983-02-03T03:00:51 |
null |
2020-07-06T18:52:36-0000 |
1948-05-14T02:23:14-0845 |
The expected output is:
col_1 | utc_offset_hrs |
---|---|
1970-01-01T23:23:14+0430 | 4.5 |
2000-10-10T10:14:42-0515 | -5.25 |
1983-02-03T03:00:51 | null |
null | null |
2020-07-06T18:52:36-0000 | 0.0 |
1948-05-14T02:23:14-0845 | -8.75 |
Extract UTC offset from UTC timestamp and timezone
Unique identifier (API): extract_utc_offset_tz
Extracts the offset (in hours) from UTC, given the UTC DateTime column and the timezone name column.
Examples:
For the following input:
datetimes | timezones |
---|---|
1970-01-01 23:23:14 | Pacific/Pago_Pago |
1990-05-20 09:09:09 | Asia/Jerusalem |
2000-10-10 10:14:42 | pacific/pago_pago |
1983-02-03 03:00:51 | null |
null | Arctic/Longyearbyen |
2020-07-06 18:52:36 | Mars/Jezero crater |
The expected output is:
datetimes | timezones | utc_offset_hrs |
---|---|---|
1970-01-01 23:23:14 | Pacific/Pago_Pago | -11.0 |
1990-05-20 09:09:09 | Asia/Jerusalem | 3.0 |
2000-10-10 10:14:42 | pacific/pago_pago | null |
1983-02-03 03:00:51 | null | null |
null | Arctic/Longyearbyen | null |
2020-07-06 18:52:36 | Mars/Jezero crater | null |
The reasons for the null values are:
* pacific/pago_pago doesn't exist in our timezone list (Pacific/Pago_Pago does!)
* null timezone value
* null datetime value
* Mars/Jezero crater doesn't exist in our timezone list (maybe in the future)
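The same lookup can be sketched with Python's standard-library zoneinfo module (a behavioural sketch only, not the action's implementation):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

def utc_offset_hours(dt_utc, tz_name):
    """Offset in hours of tz_name at the given UTC instant; None when either
    value is missing or the name is not a recognised IANA timezone."""
    if dt_utc is None or tz_name is None:
        return None
    try:
        tz = ZoneInfo(tz_name)
    except (ZoneInfoNotFoundError, ValueError):
        return None   # e.g. 'Mars/Jezero crater'
    local = dt_utc.replace(tzinfo=timezone.utc).astimezone(tz)
    return local.utcoffset().total_seconds() / 3600

utc_offset_hours(datetime(1970, 1, 1, 23, 23, 14), "Pacific/Pago_Pago")  # -11.0
```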
Extract week number in year
Unique identifier (API): extract_week_number_in_year
Extracts the week number (between 1-52)
Extract year
Unique identifier (API): extract_year
Extract the year as a number
Examples:
2020-04-15T15:53:44 -> will return: 2020
Get daylight savings indicator
Unique identifier (API): get_dst_indicator
Generates a boolean column which indicates if a localized timestamp column (represented by the combination of two columns: A UTC DateTime column and an IANA timezone name column) has an active daylight saving.
Examples:
For the following input:
time_utc | tz_name |
---|---|
2020-08-20 13:13:14 | Australia/Melbourne |
2020-02-20 13:13:14 | Australia/Melbourne |
2020-02-20 13:13:14 | null |
2020-02-20 13:13:14 | Mars/Jezero Crater |
null | Australia/Melbourne |
The expected output is:
time_utc | tz_name | dst_active |
---|---|---|
2020-08-20 13:13:14 | Australia/Melbourne | false |
2020-02-20 13:13:14 | Australia/Melbourne | true |
2020-02-20 13:13:14 | null | null |
2020-02-20 13:13:14 | Mars/Jezero Crater | null |
null | Australia/Melbourne | null |
The reasons for the null values are:
* null timezone value
* Mars/Jezero Crater doesn't exist in our timezone list (maybe in the future)
* null time_utc value
Get local time
Unique identifier (API): get_local_time
Generates local time given the UTC timestamp and the timezone's offset in hours.
Examples:
For the following input: (2 pairs of DateTime | offset, working on both):
datetime_1 | offset_1 | datetime_2 | offset_2 |
---|---|---|---|
1970-01-01 23:23:14 | 4.5 | null | null |
2000-10-10 10:14:42 | -5.25 | 2020-07-06 18:52:36 | 0.0 |
1983-02-03 03:00:51 | 2.75 | 1948-05-14 02:23:14 | null |
The expected output is:
datetime_1 | offset_1 | datetime_2 | offset_2 | local_time_1 | local_time_2 |
---|---|---|---|---|---|
1970-01-01 23:23:14 | 4.5 | null | null | 1970-01-02 03:53:14 | null |
2000-10-10 10:14:42 | -5.25 | 2020-07-06 18:52:36 | 0.0 | 2000-10-10 04:59:42 | 2020-07-06 18:52:36 |
1983-02-03 03:00:51 | 2.75 | 1948-05-14 02:23:14 | null | 1983-02-03 05:45:51 | null |
Get timezone name
Unique identifier (API): get_tz_name
Generates a list of timezone names matching the pair of a UTC DateTime column and a numeric UTC offset (in hours) column. The list is in a text format with values separated by a pipe
Examples:
For the following input: (2 pairs of DateTime | offset, working on both)
datetime_1 | offset_1 | datetime_2 | offset_2 |
---|---|---|---|
1970-01-01 23:23:14 | 4.5 | null | null |
2000-10-10 10:14:42 | -5.25 | 2020-07-06 18:52:36 | 0.0 |
1983-02-03 03:00:51 | 2.75 | 1948-05-14 02:23:14 | null |
The expected output is:
datetime_1 | offset_1 | datetime_2 | offset_2 | timezones_1 | timezones_2 |
---|---|---|---|---|---|
1970-01-01 23:23:14 | 4.5 | null | null | Asia/Kabul | |
2000-10-10 10:14:42 | -5.25 | 2020-07-06 18:52:36 | 0.0 | | Africa/Abidjan, Africa/Accra, Africa/Bamako, Africa/Banjul, Africa/Bissau, Africa/Conakry, Africa/Dakar, Africa/Freetown, Africa/Lome, Africa/Monrovia, Africa/Nouakchott, Africa/Ouagadougou, Africa/Sao_Tome, America/Danmarkshavn, America/Scoresbysund, Atlantic/Azores, Atlantic/Reykjavik, Atlantic/St_Helena, GMT, UTC |
1983-02-03 03:00:51 | 2.75 | 1948-05-14 02:23:14 | null | | |
Get timezone name from geolocation
Unique identifier (API): get_tz_name_from_geolocation
Given the geolocation as a pair of numeric columns (latitude and longitude), creates a column of timezone names of each geolocation. Returns null values if no matching timezone is found, or if the geo coordinates are invalid.
Examples:
For the following input:
lat | lng |
---|---|
50.45817318865369 | 35.99253168643526 |
30.56916325861897 | 65.90262778021655 |
51.936292660801115 | 26.071955568518586 |
21.180638320519932 | 151.80832165098863 |
67.04735445996346 | 120.61522491454524 |
null | 97.50232778607207 |
51.07973011677217 | null |
The expected output is:
lat | lng | tz_names |
---|---|---|
50.45817318865369 | 35.99253168643526 | Europe/Moscow |
30.56916325861897 | 65.90262778021655 | Asia/Kabul |
51.936292660801115 | 26.071955568518586 | Europe/Minsk |
21.180638320519932 | 151.80832165098863 | null |
67.04735445996346 | 120.61522491454524 | Asia/Yakutsk |
null | 97.50232778607207 | null |
51.07973011677217 | null | null |
Round DateTime
Unique identifier (API): round_datetime
Rounds a DateTime column (in place) to 15, 30 or 60 minutes
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
round_to | Time rounding resolution. E.g. for "half_hour", 14:39:11 --> 14:30:00 | | single-select |
Examples:
""2020-04-15T15:53:44"" with rounding to 30 minutes. Output ""2020-04-15T16:00:00""
Feature Engineering
Create features using autoencoder neural network
Unique identifier (API): autoencoder
Create new numerical features by combining existing numerical features in novel and non-linear ways using an autoencoder
Parameters | Parameter Description | Default | Parameter Type | Parameter Min | Parameter Max | Parameter Step |
---|---|---|---|---|---|---|
num_epochs | The number of iterations or epochs for optimizing the loss function over the training set. | 288 | range | 1.0 | 1000.0 | 1.0 |
batch_size | The size of a batch of examples used to compute the Stochastic Gradient updates. | 128 | range | 1.0 | 8192.0 | 1.0 |
learning_rate | Learning rates control the size of the update to parameters during each step. | 0.001 | range | 0.001 | 0.1 | 0.001 |
impute | Strategy for imputing missing values | mean | single-select | |||
create_missing_flag | Create a new feature based on missing values | True | boolean | |||
standardize | Standardize, if true, will transform each feature value by subtracting the mean-value and dividing the result by the standard deviation of the feature | True | boolean |
Examples:
Using autoencoder to generate novel and non-linear features based on existing numerical features
Bin columns
Unique identifier (API): bin_columns
This action bins one or more numeric columns by specifying the number of equally-spaced bins or a list of bin edges
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
method | Whether to use equally-spaced bins or custom cut positions | | single-select |
n_bins | Number of equally-spaced bins. This creates n equally-spaced bins from the minimum to maximum in the column. | | integer |
cut_positions | Custom cut positions in ascending order | | float-list |
output_format | Format for output labels | [lower, upper) | single-select |
Examples:
Discretize a numeric column into categorical ranges, e.g. given cut positions 100, 150, 200, the values 30, 110, 180, 220 can be binned as "x < 100", "100 <= x < 150", "150 <= x < 200", "x >= 200".
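A pandas sketch of binning with the custom cut positions from this example, using the default [lower, upper) output format (illustrative only):

```python
import pandas as pd

values = pd.Series([30, 110, 180, 220])
cuts = [100, 150, 200]

edges = [float("-inf")] + cuts + [float("inf")]
labels = ["x < 100", "100 <= x < 150", "150 <= x < 200", "x >= 200"]

# right=False gives [lower, upper) intervals.
binned = pd.cut(values, bins=edges, labels=labels, right=False)
```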
Bin columns by quantiles
Unique identifier (API): bin_columns_by_quantiles
This action bins one or more numeric columns by specifying the number of uniformly-split quantiles or a list of custom quantiles.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
method | Whether to use uniformly-split quantiles or custom split quantiles | | single-select |
n_bins | Number of uniformly-split quantiles. This creates n bins based on uniformly-split quantiles from 0 to 1 | | integer |
split_quantiles | Custom split quantiles in ascending order | | float-list |
output_format | Format for output labels | [lower, upper) | single-select |
Examples:
For example, 20, 50, 80, 100, 200 is converted to "0 < x < 0.25", "0.25 <= x < 0.5", "0.5 <= x < 0.75", "0.75 <= x < 1", "0.75 <= x < 1" with split quantiles 0.25, 0.5, 0.75.
Bin DateTime
Unique identifier (API): bin_datetime
Bins a DateTime column (creates 2 new columns) to 15, 30 or 60 minutes
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
bin_res | Upper/lower bounds to round towards in order to bin the time. E.g.: 14:53:15 --> [14:45:00, 15:00:00] when using "quarter_hour" resolution | | single-select |
Examples:
Creates 2 DateTime output columns per input column which "bin" the input column between them. The binning resolution is either hourly, half-hourly, or 15 minutes. Example: "2020-04-15T15:53:44" with half-hourly resolution gives output columns ["2020-04-15T15:30:00", "2020-04-15T16:00:00"].
Bin columns such that the target mean follows upward or downward trend
Unique identifier (API): bin_trend
This action bins one or more numeric columns such that the target mean follows an upward or downward trend
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
trend | Up or Down Trend | | single-select |
algo | Algorithm for trend-binning | | single-select |
Create missing flag
Unique identifier (API): create_missing_flag
For every input column, create a boolean column flagging the locations of missing values.
Extract PCA Components
Unique identifier (API): extract_pca_v2
Create new numerical features by combining existing numerical features in linear ways using PCA
Parameters | Parameter Description | Default | Parameter Type | Parameter Min | Parameter Max | Parameter Step |
---|---|---|---|---|---|---|
limit_components_by | Criterion for selecting the principal components. If the minimum explained variance is used, the components that contribute most to the variance are selected until the cumulative variance is not less than the specified ratio of the total variance. | min_explained_variance | single-select | |||
num_components | The number of principal components to extract from the input data | | integer | | | |
min_explained_variance | This is the threshold on the ratio of explained variance of a component to the total variance. Only the components with highest variance whose cumulative variance is at least the specified ratio of total variance will be selected | 0.8 | range | 0.01 | 1.0 | 0.01 |
numeric_impute_method | Method for imputing Numeric columns | mean | single-select |
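A scikit-learn sketch of selecting components by minimum explained variance (illustrative only; imputation of missing values is assumed to happen beforehand):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(200, 10))

# Keep the smallest number of components whose cumulative explained variance
# ratio reaches 0.8, analogous to min_explained_variance = 0.8.
pca = PCA(n_components=0.8, svd_solver="full")
components = pca.fit_transform(X)
```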
Apply UMAP
Unique identifier (API): extract_umap_components
Apply the Uniform Manifold Approximation and Projection (UMAP) algorithm to the input columns to achieve dimensionality reduction. This is a manifold-learning algorithm based on neighbours graphs.
Parameters | Parameter Description | Default | Parameter Type | Parameter Min | Parameter Max | Parameter Step |
---|---|---|---|---|---|---|
num_output_dimensions | The dimensionality of the space the input will be embedded into. | 2 | range | 1.0 | 100.0 | 1.0 |
seed | The random seed for the UMAP algorithm. By setting it, the output will be reproducible. | 2 | integer | |||
n_neighbors | This parameter controls how UMAP balances local versus global structure in the data. See documentation of this action in the knowledge hub for details. NOTE: If larger than the number of records in the data, this parameter will be truncated to the number of records. | 15 | range | 2.0 | 100.0 | 1.0 |
metric | Defines the metric of the ambient space of the input data. | euclidean | single-select | |||
min_dist | Controls how tightly UMAP is allowed to pack points together. It, quite literally, provides the minimum distance apart that points are allowed to be in the low dimensional representation. This means that low values of the parameter will result in clumpier embeddings. This can be useful if you are interested in clustering, or in finer topological structure. Larger values of the parameter will prevent UMAP from packing points together and will focus instead on the preservation of the broad topological structure. | 0.1 | range | 0.0 | 1.0 | 0.01 |
Get vector representation from text
Unique identifier (API): get_feature_vectors_from_text
Obtain numerical features from text using word embedding model
Examples:
Generates a vector representation for the given text column. For the following input table, with input field 'col_text' and output field 'col_text_vector':
col_text |
---|
null |
Test 54 45 Test |
Dummy sentence |
Just some number 00 |
The output contains text vectors of dimensionality 96:
col_text | __group_col_text_vector__0 | __group_col_text_vector__1 | ....................... | __group_col_text_vector__95 |
---|---|---|---|---|
null | 0.123 | 0.786 | ....................... | 0.313 |
Test 54 45 Test | -0.92 | 0.576 | ....................... | 0.415 |
Dummy sentence | 0.567 | 0.321 | ....................... | -0.28 |
Just some number 00 | 0.876 | 0.476 | ....................... | -0.47 |
Create historical summary aggregation features
Unique identifier (API): historical_summ_agg
Create historical summary aggregation features, e.g. the average purchase amount over the last 6 months
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
aggregation_functions | Aggregation methods such as "min" or "max" are functions that reduce a column of values to a single value | ['sum'] | multi-select |
timeframe | Time frame for windowing | 3 | integer |
Examples:
For example, creating columns like average spending last 6 months for each customer, by using the date column, monthly spending and customer ID
Label encode categorical columns
Unique identifier (API): label_encode
Encode each input categorical column into a new column of non-negative integers, following either alphabetical (by default) or customised order. Null values and unseen labels (when a finalized recipe fitted by a dataset is applied on new data) remain as null.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
order_of_categories | Category list that determines the order of labels | [] | string-list |
Examples:
E.g. convert the categories "A", "B", "C", NA to their numerical encoding 0, 1, 2, NA.
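A pandas sketch of alphabetical label encoding where nulls remain null (illustrative only, not the action's implementation):

```python
import pandas as pd

s = pd.Series(["B", "A", None, "C"])

# Alphabetical category order; Categorical.codes uses -1 for missing values,
# which is mapped back to <NA> here so nulls stay null.
cat = pd.Categorical(s, categories=sorted(s.dropna().unique()))
encoded = pd.Series(cat.codes).replace(-1, pd.NA)
# [1, 0, <NA>, 2]
```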
Scale numeric column(s)
Unique identifier (API): scale_numeric_columns
Scale numeric column(s) by a specified method
Parameters | Parameter Description | Default | Parameter Type | Parameter Min | Parameter Max | Parameter Step |
---|---|---|---|---|---|---|
method | Method for scaling including: standard (mean removed, scaled to unit variance), min-max (scaled to [0, 1]), max-abs (scaled to [-1, 1]) and robust (median removed, scaled to a specified quantile range) | | single-select | | | |
lower_quantile | Lower value for quantile range used in scaling | 0.25 | range | 0.0 | 1.0 | 0.05 |
upper_quantile | Upper value for quantile range used in scaling | 0.75 | range | 0.0 | 1.0 | 0.05 |
relative_error | Relative error for approximating median or a specific quantile. Must be no less than 0.001 for computation efficiency | 0.001 | range | 0.001 | 1.0 | 0.001 |
Examples:
For example, a numeric column 100, 120, 80, 10000, 100 is scaled to 0.01, 0.012, 0.008, 1.0, 0.01 with method = "max_abs" (all values are divided by the absolute maximum 10000).
Compute Weight of Evidence (WOE) encoding for categorical columns
Unique identifier (API): weight_of_evidence
This action converts categorical columns by Weight of Evidence encoding.
Feature Engineering/Text
Get vector representation from text
Unique identifier (API): get_feature_vectors_from_text
Obtain numerical features from text using word embedding model
Examples:
Generates a vector representation for the given text column. For the following input table, with input field 'col_text' and output field 'col_text_vector':
col_text |
---|
null |
Test 54 45 Test |
Dummy sentence |
Just some number 00 |
The output contains text vectors of dimensionality 96:
col_text | __group_col_text_vector__0 | __group_col_text_vector__1 | ....................... | __group_col_text_vector__95 |
---|---|---|---|---|
null | 0.123 | 0.786 | ....................... | 0.313 |
Test 54 45 Test | -0.92 | 0.576 | ....................... | 0.415 |
Dummy sentence | 0.567 | 0.321 | ....................... | -0.28 |
Just some number 00 | 0.876 | 0.476 | ....................... | -0.47 |
Filtering/Custom Formula
Filter rows
Unique identifier (API): filter_rows_with_formula_v2
Filter rows in data by keeping only rows that satisfy the condition specified in user-entered formula. Common Excel/SQL functions and operators for manipulating Numeric, DateTime, Text, and Boolean values are permitted. For a full list of allowed functions, see documentation.
NOTE: The formula entered must evaluate to a boolean value.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
condition | The condition under which the rows are to be kept, entered as a formula. | | formula |
Examples:
Consider the dataset:
manufacturer | model | year | displ | cty | hwy |
---|---|---|---|---|---|
audi | a4 quattro | 1999 | 1.8 | 16 | 25 |
audi | a6 quattro | 1999 | 2.8 | 15 | 24 |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 |
chevrolet | c1500 suburban 2wd | 2008 | 6 | 12 | 17 |
chevrolet | corvette | 1999 | 5.7 | 16 | 26 |
chevrolet | corvette | 1999 | 5.7 | 15 | 23 |
chevrolet | k1500 tahoe 4wd | 1999 | 6.5 | 14 | 17 |
chevrolet | malibu | 1999 | 3.1 | 18 | 26 |
dodge | caravan 2wd | 1999 | 3.8 | 15 | 22 |
dodge | caravan 2wd | 2008 | 4 | 16 | 23 |
dodge | durango 4wd | 2008 | 5.7 | 13 | 18 |
dodge | durango 4wd | 2008 | 5.7 | 13 | 18 |
dodge | ram 1500 pickup 4wd | 1999 | 5.9 | 11 | 15 |
ford | expedition 2wd | 1999 | 4.6 | 11 | 17 |
ford | explorer 4wd | 1999 | 5 | 13 | 17 |
ford | f150 pickup 4wd | 1999 | 4.2 | 14 | 17 |
ford | f150 pickup 4wd | 1999 | 4.2 | 14 | 17 |
ford | f150 pickup 4wd | 2008 | 4.6 | 13 | 17 |
ford | mustang | 2008 | 4 | 17 | 26 |
where cty and year are of Numeric type.
Given the formula 'AND(year > 2000, cty >= 15)', the result will be:
manufacturer | model | year | displ | cty | hwy |
---|---|---|---|---|---|
audi | a6 quattro | 2008 | 4.2 | 16 | 23 |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 |
dodge | caravan 2wd | 2008 | 4 | 16 | 23 |
ford | mustang | 2008 | 4 | 17 | 26 |
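The condition 'AND(year > 2000, cty >= 15)' corresponds to a simple pandas query (illustrative only):

```python
import pandas as pd

df = pd.DataFrame({"model": ["a4 quattro", "a6 quattro", "caravan 2wd", "mustang"],
                   "year": [1999, 2008, 2008, 2008],
                   "cty": [16, 16, 16, 17]})

# Keep only rows satisfying the condition.
kept = df.query("year > 2000 and cty >= 15")
```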
Join
Look up aggregated columns from another dataset
Unique identifier (API): aggregate_lookup
Perform aggregation on another dataset and lookup the content of the summarized table
Examples:
Given the primary dataset
grp |
---|
1 |
3 |
3 |
and a secondary dataset that we wish to summarize and then perform a lookup of the summarized info:
grp | val | val1 | bool1 | datetimes |
---|---|---|---|---|
1 | 0.45 | 1 | True | 2023-11-17T13:08:29 |
1 | 0.45 | 2 | True | 2023-11-17T13:08:29 |
3 | 0.45 | 3 | True | 2023-11-17T13:08:29 |
3 | 0.45 | 3.5 | True | 2023-08-03T08:48:29 |
3 | 1 | 3 | False | 2022-05-29T03:21:59 |
3 | 1 | 4 | False | 2022-05-29T03:21:59 |
2 | 10 | 5 | False | 2022-05-29T03:21:59 |
2 | 100 | 6 | False | 2022-05-29T03:21:59 |
If we perform the below aggregations on the secondary table and then look up the results using grp as the lookup key:
mean(val)
sum(val1)
all(bool1)
any(bool1)
approx_count_distinct(val, 0)
approx_count_distinct(bool1, 0)
earliest(datetimes, 0)
latest(datetimes, 0)
approx_median(val)
approx_quantile(val)
count(bool1, "non_null")
count(bool1, "null")
we get
grp | avg_val | sum_val1 | all_bool1 | any_bool1 | approx_count_distinct_val | approx_count_distinct_bool1 | earliest_datetime | latest_datetime | median_val | quantile_val | count_bool1 | count_nulls_bool1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0.45 | 3.0 | True | True | 1 | 1 | 2023-11-17 13:08:29 | 2023-11-17 13:08:29 | 0.45 | 0.45 | 2 | 0 |
3 | 0.725 | 13.5 | False | True | 2 | 2 | 2022-05-29T03:21:59 | 2023-08-03T08:48:29 | 0.45 | 0.45 | 4 | 0 |
3 | 0.725 | 13.5 | False | True | 2 | 2 | 2022-05-29T03:21:59 | 2023-08-03T08:48:29 | 0.45 | 0.45 | 4 | 0 |
Join
Unique identifier (API): join
Joins another dataset to current dataset given one or more join keys, retaining the specified selection of columns from both datasets.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
join_type | Specifies how to perform the join. | | single-select |
Examples:
For the following input tables:
Primary Table: (2 regular columns, 1 column group with 2 components)
Id | Name | __group_Like_0 | __group_Like_1 |
---|---|---|---|
1 | Patrick | Apple | Milk |
2 | Albert | Pear | Cheese |
3 | Maria | Watermelon | Chickpeas |
4 | Darwin | Grapes | Beans |
5 | Elizabeth | Kiwi | Softpeas |
Secondary Table: (2 regular columns)
User_ID | Like |
---|---|
3 | Stars |
1 | Climbing |
1 | Code |
6 | Rugby |
4 | Apples |
We perform a Join with the following parameters:
* join_keys == 'Id' for the primary table
* join_keys == 'User_ID' for the secondary table
* selected_columns == ['Id', 'Name', 'Like'] for the primary table
* selected_columns == 'Like' for the secondary table
* join_type == 'Outer Join'
The expected output is:
Id | Name | __group_primary_Like_0 | __group_primary_Like_1 | secondary_Like |
---|---|---|---|---|
null | null | null | null | Rugby |
5 | Elizabeth | Kiwi | Softpeas | null |
1 | Patrick | Apple | Milk | Climbing |
1 | Patrick | Apple | Milk | Code |
3 | Maria | Watermelon | Chickpeas | Stars |
2 | Albert | Pear | Cheese | null |
4 | Darwin | Grapes | Beans | Apples |
Notice that in this example the user performed an Outer Join but opted to drop the User_ID column of the secondary table. As a result, the first row of the output shows an Id of null (User_ID == 6 for that row; this value did not exist in the primary table), so the result for this row isn't very usable.
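A pandas sketch of this outer join, including the dropped key column that produces the null-Id row (illustrative only, with the group columns omitted):

```python
import pandas as pd

primary = pd.DataFrame({"Id": [1, 2, 3, 4, 5],
                        "Name": ["Patrick", "Albert", "Maria", "Darwin", "Elizabeth"]})
secondary = pd.DataFrame({"User_ID": [3, 1, 1, 6, 4],
                          "Like": ["Stars", "Climbing", "Code", "Rugby", "Apples"]})

# Outer join on Id == User_ID; dropping User_ID afterwards is why the
# unmatched secondary row (User_ID == 6) shows a null Id.
out = (primary.merge(secondary, left_on="Id", right_on="User_ID", how="outer")
              .drop(columns="User_ID"))
```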
Look up columns from another dataset
Unique identifier (API): lookup_from
Creates specified columns by looking them up from a second dataset. Results in null values if there is no match in the second dataset for a particular record.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
multi_match_resolution | Which row to use when multiple rows in the other dataset match under the specified criteria. | last | single-select |
Examples:
For the following input tables:
Primary Table: (2 regular columns, 1 column group with 2 components)
Id | Name | __group_Like_0 | __group_Like_1 |
---|---|---|---|
1 | Patrick | Apple | Milk |
2 | Albert | Pear | Cheese |
3 | Maria | Watermelon | Chickpeas |
4 | Darwin | Grapes | Beans |
5 | Elizabeth | Kiwi | Softpeas |
Secondary Table: (2 regular columns)
User_ID | Like |
---|---|
3 | Stars |
1 | Climbing |
1 | Code |
6 | Rugby |
4 | Apples |
We perform a Lookup with the following parameters:
* join_criteria == 'Id' for the primary table
* join_criteria == 'User_ID' for the secondary table
* lookup == 'Like' for the secondary table
* order_secondary_by == 'Like' for the secondary table
* name prefixes for the current and other datasets: ['left_', 'right_']
The expected output is:
Id | Name | __group_left_Like_0 | __group_left_Like_1 | right_Like |
---|---|---|---|---|
null | null | null | null | Rugby |
5 | Elizabeth | Kiwi | Softpeas | null |
1 | Patrick | Apple | Milk | Climbing |
3 | Maria | Watermelon | Chickpeas | Stars |
2 | Albert | Pear | Cheese | null |
4 | Darwin | Grapes | Beans | Apples |
Note that Lookup
returns only one match per row in the primary table. If there are multiple matches in the secondary table, only the first one (under the specified ordering) will be returned, so it is important to choose an order_secondary_by
column in the secondary table.
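A minimal pandas sketch of the same idea, purely for illustration (not the action's implementation): order the secondary table, keep a single match per key, and left-join so primary rows without a match receive nulls.

```python
import pandas as pd

primary = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "Name": ["Patrick", "Albert", "Maria", "Darwin", "Elizabeth"],
})
secondary = pd.DataFrame({
    "User_ID": [3, 1, 1, 6, 4],
    "Like": ["Stars", "Climbing", "Code", "Rugby", "Apples"],
})

# order_secondary_by == 'Like'; keep one match per key; left-join onto primary.
matches = (secondary
           .sort_values("Like")
           .drop_duplicates(subset="User_ID", keep="first")
           .rename(columns={"Like": "right_Like"}))
result = (primary
          .merge(matches, how="left", left_on="Id", right_on="User_ID")
          .drop(columns="User_ID"))
print(result)
```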
Outlier Detection
Get anomaly score of records
Unique identifier (API): get_anomaly_scores_from_records
Computes the anomaly score of input data using unsupervised learning. The anomaly score of a record is computed as a measure of the information lost when the record is subjected to dimensionality reduction by PCA.
Parameters | Parameter Description | Default | Parameter Type | Parameter Min | Parameter Max | Parameter Step |
---|---|---|---|---|---|---|
method | The method used for anomaly detection such as dimensionality reduction using PCA. | pca | single-select | |||
pca_limit_components_by | Criterion for selecting the principal components. If the minimum explained variance is used, the components that contribute most to the variance are selected until the cumulative variance is not less than the specified ratio of the total variance. | specified_explained_var | single-select | |||
pca_num_of_components | The number of principal components to extract from the input data | integer | ||||
pca_minimum_explained_variance_needed | This is the threshold on the ratio of cumulative explained variance of the selected principal components to the total variance. Only the components with highest variance whose cumulative variance is at least the specified ratio of total variance will be selected | 0.8 | range | 0.01 | 1.0 | 0.01 |
isolation_forest_num_estimators | The number of trees in the ensemble. | 100 | integer | |||
max_samples_specification_method | Specifies how max samples for isolation forest is set. Fractional implies that the maximum sample size is a fraction relative to the total number of rows. Count implies that the specified number is treated as an absolute count. | count | single-select |
isolation_forest_max_samples_fraction | The fraction of the total number of samples used to train each tree. | 1.0 | range | 0.01 | 1.0 | 0.01 |
isolation_forest_max_samples_count | The number of samples used to train each tree. The value should be greater than 1. | 256 | integer |
seed | Positive integer. Random seed for isolation forest algorithm. Used for obtaining repeatable results. | 1 | integer |
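The PCA-based score can be illustrated with a small scikit-learn sketch (an assumption-laden illustration, not the action's code): reconstruct each record from the retained principal components and use the squared reconstruction error as the anomaly score.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(seed=1)
X = rng.normal(size=(200, 5))
X[0] += 10.0                      # plant one obvious outlier

# Keep components explaining ~80% of the variance (the default
# pca_minimum_explained_variance_needed), then score each record by the
# information lost when it is projected and reconstructed.
Z = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.8, svd_solver="full").fit(Z)
reconstruction = pca.inverse_transform(pca.transform(Z))
anomaly_score = np.square(Z - reconstruction).sum(axis=1)
print(anomaly_score.argmax())     # row 0 gets the highest score
```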
Indicate anomalous values
Unique identifier (API): indicate_anomalous_values
Creates boolean flag columns indicating outliers in the corresponding input columns, using the interpercentile range method. E.g. denoting Q1 as the 1st quartile, Q3 as the 3rd quartile and IQR = Q3 - Q1, values smaller than Q1 - 1.5*IQR or larger than Q3 + 1.5*IQR will be flagged as outliers.
Parameters | Parameter Description | Default | Parameter Type | Parameter Min | Parameter Max | Parameter Step |
---|---|---|---|---|---|---|
decision_mode | How the decision is made for declaring an anomaly: For "Upper threshold": only values larger than the upper threshold will be declared as anomalies. For "Lower threshold": only values lower than the lower threshold will be declared as anomalies. For "Lower and upper threshold": values below the lower threshold and above the upper threshold will be declared as anomalies. | lower_and_upper_thresholds | single-select | |||
low_percentile | The percentile P1 such that all values smaller than P1 - alpha*IPR will be flagged as outliers. By default, P1 equals Q1 (the 1st quartile). | 25 | range | 0.0 | 100.0 | 1.0 |
high_percentile | The percentile P2 such that all values larger than P2 + alpha*IPR will be flagged as outliers. By default, P2 equals Q3 (the 3rd quartile). | 75 | range | 0.0 | 100.0 | 1.0 |
ipr_factor | The interpercentile range (denoted as IPR) equals P2-P1, where P2 and P1 are correspondingly the high and low percentiles. The factor is defined as 'alpha' in the formulas P1 - alpha * IPR and P2 + alpha * IPR, which determine the low and high thresholds for outlier detection. Values above/below these thresholds will be flagged as outliers. | 1.5 | float |
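The thresholding rule can be sketched in a few lines of pandas (illustrative only; the data values below are invented):

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 2.5, 3.0, 2.8, 2.2, 50.0])

low_percentile, high_percentile, ipr_factor = 25, 75, 1.5
p1 = values.quantile(low_percentile / 100)
p2 = values.quantile(high_percentile / 100)
ipr = p2 - p1

# decision_mode == 'lower_and_upper_thresholds'
is_outlier = (values < p1 - ipr_factor * ipr) | (values > p2 + ipr_factor * ipr)
print(is_outlier)   # only the 50.0 record is flagged
```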
Reshaping
Reshape dataset into a pivot table
Unique identifier (API): pivot_v2
Creates a pivot table with specified group-by (index) columns and a pivot column. Column groups with an equal number of components (one per pivoted value) are created, one per specified column/aggregation-function pair.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
pivot_values | List of values that are converted to columns in the output | [] | string-list |
Examples:
For example, given the following dataset
animal | size | qty | desexed |
---|---|---|---|
cat | small | 95 | True |
cat | large | 39 | False |
cat | small | 50 | False |
cat | large | 37 | False |
cat | small | 12 | True |
dog | large | 84 | False |
dog | small | 71 | True |
dog | large | 87 | True |
dog | small | 7 | True |
dog | large | 42 | False |
with input pivot column size
, grouped by column animal
, values column qty
with aggregation function sum
and value column desexed
with aggregation function all
, the result is
animal | __group_qty_sum_small | __group_qty_sum_large | __group_desexed_all_small | __group_desexed_all_large |
---|---|---|---|---|
dog | 78 | 213 | True | False |
cat | 157 | 76 | False | False |
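The same result can be sketched with pandas (an illustration under assumptions, not the action's API), using a group-by plus unstack of the pivot column:

```python
import pandas as pd

df = pd.DataFrame({
    "animal":  ["cat", "cat", "cat", "cat", "cat", "dog", "dog", "dog", "dog", "dog"],
    "size":    ["small", "large", "small", "large", "small",
                "large", "small", "large", "small", "large"],
    "qty":     [95, 39, 50, 37, 12, 84, 71, 87, 7, 42],
    "desexed": [True, False, False, False, True, False, True, True, True, False],
})

# Group by animal and size, sum qty and AND-reduce desexed, then move the
# pivot column ('size') into the column axis, one block per value/aggregation pair.
pivoted = (df.groupby(["animal", "size"])
             .agg(qty_sum=("qty", "sum"), desexed_all=("desexed", "all"))
             .unstack("size"))
print(pivoted)
```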
Unpack/Explode JSON arrays into rows
Unique identifier (API): unpack_v2
Reshapes a dataset by unpacking one or more JSON array columns and adding a new column for the position index
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
allow_single_quotes | Whether to allow single quoted JSONs | False | boolean |
Examples:
For example, given the following dataset
paper_id | author | affilliation |
---|---|---|
1 | ["Sara", "Tom", "Max"] | ["rmit", "monash", "rmit"] |
2 | ["Sara", "Max", "Lee"] | ["rmit", "rmit"] |
with input columns "author"
and "affilliation"
, output column "position"
for position index, the result is
paper_id | author | affilliation | position |
---|---|---|---|
1 | Sara | rmit | 0 |
1 | Tom | monash | 1 |
1 | Max | rmit | 2 |
2 | Sara | rmit | 0 |
2 | Max | rmit | 1 |
2 | Lee | null | 2 |
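As a rough sketch of the same behaviour (not the action's implementation), the arrays can be parsed and zipped so that the shorter array is padded with nulls, as in the last row above:

```python
import json
from itertools import zip_longest
import pandas as pd

df = pd.DataFrame({
    "paper_id": [1, 2],
    "author": ['["Sara", "Tom", "Max"]', '["Sara", "Max", "Lee"]'],
    "affilliation": ['["rmit", "monash", "rmit"]', '["rmit", "rmit"]'],
})

rows = []
for _, r in df.iterrows():
    authors = json.loads(r["author"])
    affils = json.loads(r["affilliation"])
    # zip_longest pads the shorter array with None, mirroring the null above
    for pos, (a, f) in enumerate(zip_longest(authors, affils)):
        rows.append({"paper_id": r["paper_id"], "author": a,
                     "affilliation": f, "position": pos})
result = pd.DataFrame(rows)
print(result)
```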
Unpivot a dataset from wide to long format
Unique identifier (API): unpivot
Unpivot a dataset from wide to long format. Input columns are converted to a single keys column and one or more values column(s) in the output. Columns other than input columns remain.
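Since no worked example is given for this action, here is a minimal, hypothetical pandas sketch of the same wide-to-long reshaping; the column names (id, jan_sales, feb_sales, key, value) are invented for illustration and are not part of the action.

```python
import pandas as pd

# 'id' is kept as-is; the two monthly columns are the input columns to unpivot.
wide = pd.DataFrame({"id": [1, 2],
                     "jan_sales": [100, 80],
                     "feb_sales": [120, 95]})

long = wide.melt(id_vars="id",
                 value_vars=["jan_sales", "feb_sales"],
                 var_name="key", value_name="value")
print(long)
```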
Text
Change text case
Unique identifier (API): change_text_case
Changes text columns to selected text case (uppercase, lowercase, etc.)
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
case | The required change in case of values where "Upper" converts all to uppercase, "Lower" converts all to lowercase and "Capitalize" converts first character to uppercase and remaining to lowercase. | single-select |
Examples:
For case == 'upper'
and the following input table:
col_1 |
---|
null |
a brown fox quick jumps over the lazy dog |
1234 |
Hello World |
The expected output is:
col_1 |
---|
null |
A BROWN FOX QUICK JUMPS OVER THE LAZY DOG |
1234 |
HELLO WORLD |
Decode base64-encoded text
Unique identifier (API): decode_base64
Every input column is assumed to be encoded using Base64 encoding. Decodes it into the character set (e.g. UTF8) specified and stores the decoded values in the output column
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
charset | The encoding to convert the decoded bytes to. This is the charset encoding of the original strings before they were base64 encoded | UTF8 | single-select |
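For reference, the decoding step itself is what Python's standard library does below (the encoded value is a hypothetical example, not taken from the catalogue):

```python
import base64

encoded = "SGVsbG8sIHdvcmxkIQ=="                       # hypothetical input value
decoded = base64.b64decode(encoded).decode("utf-8")    # charset == UTF8
print(decoded)                                         # Hello, world!
```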
Extract all matches of a regular expression pattern
Unique identifier (API): extract_all_regex_matches
For each input column, creates a new column of JSONArray type containing all the sub-strings in the corresponding input matching the specified regular expression.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
pattern | The pattern to be found, as a Perl-Compatible Regular Expression. Note that at present, only whole matches are extracted and capturing groups are not supported. | string |
Examples:
Given the following input dataset with column types as indicated:
('Description', 'Text') |
---|
The IP address of the router: 192.168.1.1 |
Go to 201 Mills St or call (404) 456-0861 |
The final amount of the transaction is 25,382.55 |
Spend $100 or more to enter a draw sponsored by hotels123.com to win a $20,500.00 |
There is no match 456here |
Applying the action with
- Input columns =
['Description']
- Output columns =
['Matches']
- Parameters =
{"pattern": r"\b\d+\b(?!$)"}
produces the following dataset with the column types as indicated:
('Description', 'Text') | ('Matches', 'JSONArray') |
---|---|
The IP address of the router: 192.168.1.1 | ["192","168","1"] |
Go to 201 Mills St or call (404) 456-0861 | ["201","404","456"] |
The final amount of the transaction is 25,382.55 | ["25","382"] |
Spend $100 or more to enter a draw sponsored by hotels123.com to win a $20,500.00 | ["100","20","500"] |
There is no match 456here | [] |
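The matching rule can be reproduced, for illustration only, with Python's re module (two of the rows above are reused here; the JSON serialisation mirrors the JSONArray output type):

```python
import json
import re

pattern = r"\b\d+\b(?!$)"
texts = [
    "The IP address of the router: 192.168.1.1",
    "There is no match 456here",
]
# One JSON-array string per input value, mirroring the JSONArray output column.
matches = [json.dumps(re.findall(pattern, t)) for t in texts]
print(matches)   # ['["192", "168", "1"]', '[]']
```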
Extract capture groups
Unique identifier (API): extract_capture_groups_to_multicol
Extracts one or more regex capture groups into multiple columns.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
pattern | A regular expression pattern containing one or more capture groups | string |
Examples:
For the following pattern == '([aeiou].)([aeiou].)'
, the following input table:
col_1 | col_2 |
---|---|
New_York | bc |
Lisbon | df |
Tokyo | gh |
Paris | jk |
null | lm |
Munich | np |
And assuming we apply the action to column (col_1
), with output columns
[out_a, out_b]
, the expected output is:
col_1 | col_2 | out_a | out_b |
---|---|---|---|
New_York | bc | null | null |
Lisbon | df | null | null |
Tokyo | gh | null | null |
Paris | jk | ar | is |
null | lm | null | null |
Munich | np | un | ic |
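The same per-group extraction can be sketched with pandas (illustrative only); each capture group becomes its own output column, and rows without a match get nulls:

```python
import pandas as pd

col_1 = pd.Series(["New_York", "Lisbon", "Tokyo", "Paris", None, "Munich"])

out = col_1.str.extract(r"([aeiou].)([aeiou].)")
out.columns = ["out_a", "out_b"]
print(out)   # only 'Paris' -> (ar, is) and 'Munich' -> (un, ic) match
```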
Get named entities from text
Unique identifier (API): get_named_entities_from_text
Extracts named entities mentioned in the input text column into the corresponding output text column as a JSON string. The JSON string represents a dictionary mapping entity types to the corresponding lists of entity mentions in the text.
Examples:
Extracts named entities from a given text column. See the following illustration for an input table with a single text column named col_text
and an input field col_text
and output field col_entities_from_text
:
col_text |
---|
Bill Gates addressed Microsoft. |
Jack drove to Washington. |
Decisions are driven by data. |
null |
The output dataframe contains the text column col_entities_from_text
:
col_text | col_entities_from_text |
---|---|
Bill Gates addressed Microsoft. | {"People": ["Bill Gates"], "Organizations": ["Microsoft"]} |
Jack drove to Washington. | {"People": ["Jack"], "Places": ["Washington"]} |
Decisions are driven by data. | {} |
'' | {} |
null | {} |
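A rough open-source analogue, for illustration only: the sketch below uses spaCy (assuming the en_core_web_sm model is installed), whose entity-type labels (PERSON, ORG, GPE, ...) differ from the "People"/"Organizations"/"Places" labels shown above; the action's actual NER engine is not specified here.

```python
import json
import spacy   # assumes spaCy and the en_core_web_sm model are installed

nlp = spacy.load("en_core_web_sm")
doc = nlp("Bill Gates addressed Microsoft.")

# Build a dictionary mapping entity types to the entity mentions found.
entities = {}
for ent in doc.ents:
    entities.setdefault(ent.label_, []).append(ent.text)
print(json.dumps(entities))   # e.g. {"PERSON": ["Bill Gates"], "ORG": ["Microsoft"]}
```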
Get sentiment scores from text
Unique identifier (API): get_sentiment_scores_from_text
Analyses the sentiment expressed in the input text columns and creates 3 output columns that reflect the strength (a number between 0 and 1, where 1 is the strongest) of positive, neutral and negative sentiment expressed in the input text.
Examples:
Extracts sentiment polarity scores from a given text column into three numeric output columns. See the following illustration for an input table with a single text column named 'tweets' as the input field, and output fields 'positive_sentiment', 'neutral_sentiment' and 'negative_sentiment':
tweets |
---|
Thanks for the follow-up. That's why you're the best! |
It was a horrible experience, something I would never recommend |
'' |
null |
The output dataframe contains the text columns 'positive_sentiment', 'neutral_sentiment' and 'negative_sentiment':
tweets | positive_sentiment | neutral_sentiment | negative_sentiment |
---|---|---|---|
Thanks for the follow-up. That's why you're the best! | 0.514 | 0.486 | 0.0 |
It was a horrible experience, something I would never recommend | 0.0 | 0.588 | 0.412 |
'' | 0.0 | 0.0 | 0.0 |
null | 0.0 | 0.0 | 0.0 |
Normalize text content
Unique identifier (API): normalize_text
Normalizes a text column by performing one or more normalization steps, such as removing non-text content or stemming the words.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
steps_to_execute | The required normalization steps | ['remove_html_tags', 'replace_urls'] | sequence-select |
Examples:
If the input data is:
col_1 |
---|
@JetBlue I was not allowed to drop my bag off for my flight that I was running 10min Late Flight for. Turns out it left over an hour Late Flight. |
@VirginAmerica momma I made it! 😁😁😁 |
@JetBlue bag just delivered and items have been stolen. At first look - my Kate spade bag and bottle of spiced rum!!! |
@united then why have I not received my call back its been 3 days...it's an infant car seat how am I supposed to go anywhere with my child |
@JetBlue just a heads up, this page no longer exists http://t.co/NsJWVTTjGo |
@AmericanAir Thank you, you too! |
“@JetBlue: Our fleet's on fleek. http://t.co/vw2v8GVnGq” 😐😑 |
@AmericanAir yes I did but no response. |
@SouthwestAir when is the next flight with free cocktails ? |
@AmericanAir have been waiting for a call back for over two hours jan to die tomorrow at 6am Cancelled Flightled what are my options? |
@AmericanAir that doesn't really answer my question. Maybe if I provide more details you can give me clarification... |
@JetBlue where is our luggage?? Flight 424 has been waiting an hour for bags! #jetblue #poorservice |
@united will I make it with the delay? |
@SouthwestAir why can't you take me to Knoxville?? yall are my fav #help |
applying the action on col_1
with only the remove_stopwords
and lemmatize_words
stages disabled will result in:
col_1 |
---|
mention i was not allowed to drop my bag off for my flight that i was running 10min late flight for turns out it left over an hour late flight |
mention momma i made it emoji emoji emoji |
mention bag just delivered and items have been stolen at first look my kate spade bag and bottle of spiced rum |
mention then why have i not received my call back its been number daysit s an infant car seat how am i supposed to go anywhere with my child |
mention just a heads up this page no longer exists url |
mention thank you you too |
mention our fleet s on fleek url emoji emoji |
mention yes i did but no response |
mention when is the next flight with free cocktails |
mention have been waiting for a call back for over two hours jan to die tomorrow at 6am cancelled flightled what are my options |
mention that doesn t really answer my question maybe if i provide more details you can give me clarification |
mention where is our luggage flight number has been waiting an hour for bags hashtag hashtag |
mention will i make it with the delay |
mention why can t you take me to knoxville yall are my fav hashtag |
For the same data, if all stages including remove_stopwords
and lemmatize_words
are enabled, the result will be:
col_1 |
---|
mention allow drop bag flight running 10min late flight turn left hour late flight |
mention momma emoji emoji emoji |
mention bag deliver item steal look kate spade bag bottle spice rum |
mention receive number day infant car seat suppose child |
mention head page long exist url |
mention thank |
mention fleet fleek url emoji emoji |
mention yes response |
mention flight free cocktail |
mention wait hour jan die tomorrow 6 cancel flightled option |
mention answer question maybe provide detail clarification |
mention luggage flight number wait hour bag hashtag hashtag |
mention delay |
mention knoxville y fav hashtag |
Create flag for regex match
Unique identifier (API): regex_match
Creates an output boolean column which is the result of matching the input column against a regex pattern
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
pattern | A regular expression pattern for the matching | string |
Examples:
For the following table:
col_1 |
---|
null |
That's my phone number: 054-555-111 |
Monday is the 1st day of the week |
This should match 0490-000-111 |
This should not match 0490-abc-def |
we apply the regex pattern: '0490-\d{3}-\d{3}'
Resulting in the following output:
col_1 | col_1_regex_match |
---|---|
null | null |
That's my phone number: 054-555-111 | false |
Monday is the 1st day of the week | false |
This should match 0490-000-111 | true |
This should not match 0490-abc-def | false |
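For illustration only (not the action's API), pandas produces the same flag column; nulls stay null rather than being coerced to False:

```python
import pandas as pd

col_1 = pd.Series([None,
                   "That's my phone number: 054-555-111",
                   "This should match 0490-000-111",
                   "This should not match 0490-abc-def"])

flag = col_1.str.contains(r"0490-\d{3}-\d{3}", regex=True)
print(flag)   # null, False, True, False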
Substitute using regex
Unique identifier (API): regex_sub
Finds and replaces string patterns in a column via regex. An additional pattern to filter by can optionally be specified, in which case the scope of the find/replace will be restricted to records that match the specified pattern in the column.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
match_pattern | Specify this to restrict the scope of replacement | .* | string |
sub_pattern | The pattern that needs to be substituted | string | |
sub_string | String used to replace values in places where sub_pattern matched the values. | string |
Examples:
For the following set of parameters and the following input table:
col_1 |
---|
null |
That's my phone number: 054-555-111 |
Monday is the 1st day of the week |
This should be substituted: 0490-000-111, 054-555-111 |
This should not be substituted: 0490-000-111, 054-abc-cdf |
The output is:
col_1 |
---|
null |
That's my phone number: 054-555-111 |
Monday is the 1st day of the week |
This should be substituted: 0490-000-111, I WAS REPLACED |
This should not be substituted: 0490-000-111, 054-abc-cdf |
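The example's exact parameter values are not shown above, so the sketch below uses hypothetical values for match_pattern, sub_pattern and sub_string that happen to reproduce the output; it is an illustration of the filter-then-substitute behaviour, not the action's implementation.

```python
import re
import pandas as pd

# Hypothetical parameter values chosen to reproduce the output above.
match_pattern = r"should be substituted"
sub_pattern = r"054-\d{3}-\d{3}"
sub_string = "I WAS REPLACED"

col_1 = pd.Series([None,
                   "That's my phone number: 054-555-111",
                   "Monday is the 1st day of the week",
                   "This should be substituted: 0490-000-111, 054-555-111",
                   "This should not be substituted: 0490-000-111, 054-abc-cdf"])

def substitute(value):
    # Records outside the match scope (or nulls) are left unchanged.
    if value is None or not re.search(match_pattern, value):
        return value
    return re.sub(sub_pattern, sub_string, value)

print(col_1.map(substitute))
```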
Remove prefix
Unique identifier (API): remove_prefix
Removes a string prefix from the rows of a chosen string column, if the string is found.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
prefix_string | The prefix string to be stripped | string |
Examples:
For the prefix_string == 'prefix_'
and the following input table:
col_1 |
---|
null |
prefix_to remove |
other_not to remove |
The expected output is:
col_1 |
---|
null |
to remove |
other_not to remove |
Remove punctuation
Unique identifier (API): remove_punctuation
Removes punctuation from the rows of a chosen string column. The punctuation symbols to remove are given by the user as a string
Parameters | Parameter Description | Default | Parameter Type | Suggested Values |
---|---|---|---|---|
punc_symbols | The punctuation symbols to strip | !"#$%&'()*+,-./:;<=>?@[]^_{\|}~ | string | ['!"#$%&\'()*+,-./:;<=>?@[\\]^_{\|}~'] |
Examples:
For the following input table:
col_1 |
---|
null |
This. Sentence! Has? Lots of, umm.. punctuation<> |
Monday is the 1st day of the week |
abc.edf@gmail.com |
The output with stripped punctuation is:
col_1 |
---|
null |
This Sentence Has Lots of umm punctuation |
Monday is the 1st day of the week |
abcedfgmailcom |
Remove suffix
Unique identifier (API): remove_suffix
Removes a string suffix from the rows of a chosen string column, if the string is found.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
suffix_string | The suffix string to strip | string |
Examples:
For the suffix_string == '_suffix'
and the following input table:
col_1 |
---|
null |
remove_suffix |
other_not to remove |
The expected output is:
col_1 |
---|
null |
remove |
other_not to remove |
Split on delimiter to multiple columns
Unique identifier (API): split_on_delimiter_to_multicol
For a given column, extracts text components between the delimiters into multiple columns. The delimiter can be one or more specific characters, or a regex pattern to match. The number of splits is determined by the number of provided output column names. Trailing null values are created for records containing fewer components.
Parameters | Parameter Description | Default | Parameter Type | Suggested Values |
---|---|---|---|---|
delimiter | Delimiter to be used for splitting the string | string | [',', ';', '-', '\|'] |
use_regex | Boolean flag indicating whether the specified delimiter should be treated as a regex pattern or an exact match. | False | boolean | |
include_tail | Boolean flag indicating whether the remainder (tail) of the string after the final split must be included in the last column | True | boolean |
Examples:
For the following set of parameters, with output columns 'split_0', 'split_1', 'split_2'
, and the following input table:
col_1 |
---|
one:two:three:four |
no_split |
The expected output is:
col_1 | split_0 | split_1 | split_2 |
---|---|---|---|
one:two:three:four | one | two | three:four |
no_split | no_split | null | null |
Whereas for include_tail = False
, the expected output is:
col_1 | split_0 | split_1 | split_2 |
---|---|---|---|
one:two:three:four | one | two | three |
no_split | no_split | null | null |
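For illustration only, the same two behaviours can be sketched with pandas, assuming (as the example data suggests) that the delimiter is ':':

```python
import pandas as pd

col_1 = pd.Series(["one:two:three:four", "no_split"])

# include_tail == True: at most three parts, with the remainder kept in the last one.
with_tail = col_1.str.split(":", n=2, expand=True)
with_tail.columns = ["split_0", "split_1", "split_2"]

# include_tail == False: split fully, then keep only the first three parts.
without_tail = col_1.str.split(":", expand=True).iloc[:, :3]

print(with_tail, without_tail, sep="\n")
```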
Split to array using a delimiter or pattern
Unique identifier (API): split_to_array
Obtains array (JSON) columns by splitting the selected text columns using a delimiter or a regex pattern. The number of splits can be limited to a specified maximum; otherwise the text is split until no more occurrences of the delimiter are found.
Parameters | Parameter Description | Default | Parameter Type | Suggested Values |
---|---|---|---|---|
delimiter | The delimiter to be used for splitting the string. | string | [',', ';', '-', '\|'] |
use_regex | To split by a regex pattern, set this value to "True". | False | boolean | |
limit_splits | Set this to "True" and set the number of splits to the desired number, if not intending to split indefinitely. | False | boolean | |
max_number_of_splits | How many splits to obtain at most, if the splits are being limited. | 2 | integer |
Examples:
Given the following input dataset with column types as indicated:
('col_1', 'Text') | ('col_2', 'Numeric') | ('col_3', 'Text') |
---|---|---|
1,2,3,4,5 | 1 | a |
a,b,c,d | 3.14159 | d |
e,f,g | 2.23606 | x |
one, two , three | 1.732 | no_split |
no_split | nan |
with:
- Input columns as ['col_1', 'col_3'],
- Parameters delimiter=',', use_regex=False, limit_splits=True, max_number_of_splits=3,
the result will be the following dataset with column types as indicated:
('col_1', 'JSONArray') | ('col_2', 'Numeric') | ('col_3', 'JSONArray') |
---|---|---|
["1","2","3,4,5"] | 1 | ["a |
["a","b","c,d"] | 3.14159 | ["d |
["e","f","g"] | 2.23606 | ["x |
["one"," two "," three"] | 1.732 | ["no_split"] |
["no_split"] | nan |
Strip whitespaces
Unique identifier (API): strip_whitespaces
Strips whitespaces from rows of chosen Text columns.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
mode | The method for stripping the whitespaces. Stripping can be done at the beginning, at the end, or at both ends of the strings, or ALL of the whitespaces can be stripped. | single-select |
Examples:
For the following table:
+-----------------------+
|col_1 |
+-----------------------+
| start_whitespace |
|end_whitespace |
| start_end_whitespaces |
|test tab |
+-----------------------+
and mode == 'End'
, the expected output is:
+----------------------+
|col_1 |
+----------------------+
| start_whitespace |
|end_whitespace |
| start_end_whitespaces|
|test tab |
+----------------------+
Text/Data Cleaning
Map to Text
Unique identifier (API): map_to_string
Replace values of input column(s) using a given mapping.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
mapping | Value that will be replaced and the replacement. | map |
Examples:
Assume we have the following single column table:
col_1 |
---|
dog |
cat |
null |
horse |
dogdog |
and the following mapping:
After running the action, the expected output is:
col_1 |
---|
doggy |
kitten |
null |
horse |
dogdog |
where dog
-> doggy
, cat
-> kitten
and the rest of the values were unchanged
since they didn't appear in the mapping dictionary.
Parse strings as null
Unique identifier (API): parse_strings_as_null
Parses the specified strings in given column(s) as null values
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
strings_to_parse | The list of strings that will be parsed as null | string-list |
Examples:
For the following input dataset:
col_x | col_y |
---|---|
11 | n/a |
45 | 38 |
null | 31 |
25 | 1.#QNAN |
49 | 20 |
where both columns are of text type, applying the action with the parameter strings_to_parse = ['n/a', '', 'NA', 'N/A', '<NA>', 'NULL', 'nan', '1.#IND', '#NA', '-1.#QNAN', '-nan', 'NaN', '1.#QNAN', '-1.#IND', '-NaN', 'null', '#N/A', '#N/A N/A']
and both columns selected in the input will give the following result:
col_x | col_y |
---|---|
11 | |
45 | 38 |
 | 31 |
25 | |
49 | 20 |
Text/JSON
Extract from JSON by path
Unique identifier (API): extract_from_json_by_path
Extracts a value from JSON using the specified path. Returns null
if the path isn't found.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
path | The path of the value to extract | string |
Get occurrence-count features
Unique identifier (API): get_occurrence_count_features
Counts the occurrences of each element of a JSON array. This achieves the same functionality as CountVectorizer
in the scikit-learn package.
Examples:
For example, given the following dataset
customer_id | events_A | events_B | events_C |
---|---|---|---|
1 | ["login", "logout"] | ["open", "close"] | ["log in", "log out"] |
2 | ["login", "home", "logout"] | ["open", "idle", "close"] | ["log in", "home", "log out"] |
3 | ["login", "home", "logout", "login", "logout"] | ["open", "idle", "close", "open", "close"] | ["log in", "home", "log out", "log in", "log out"] |
4 | ["login", "home", "app", "home", "logout"] | ["open", "idle", "active", "idle", "close"] | ["log in", "home", "app", "home", "log out"] |
5 | {"open": 0, "close": 1} | ["log in", "home", "app", "home", "log out"] |
with inputs as events_A
, events_B
and events_C
, outputs as A
, B
and C
, the result is
customer_id | events_A | events_B | events_C | __group_A_logout | __group_A_login | __group_A_home | __group_A_app | __group_B_open | __group_B_close | __group_B_idle | __group_B_active | __group_C_log00out | __group_C_log00in | __group_C_00home00 | __group_C_home | __group_C_app | __group_C_00app00 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | ["login", "logout"] | ["open", "close"] | ["log in", "log out"] | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
2 | ["login", "home", "logout"] | ["open", "idle", "close"] | ["log in", "home", "log out"] | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
3 | ["login", "home", "logout", "login", "logout"] | ["open", "idle", "close", "open", "close"] | ["log in", "home", "log out", "log in", "log out"] | 2 | 2 | 1 | 0 | 2 | 2 | 1 | 0 | 2 | 2 | 0 | 1 | 0 | 0 |
4 | ["login", "home", "app", "home", "logout"] | ["open", "idle", "active", "idle", "close"] | ["log in", "home", "app", "home", "log out"] | 1 | 1 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
5 | {"open": 0, "close": 1} | ["log in", "home", "app", "home", "log out"] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 |
Here, in input column events_C
, there are whitespace and special characters such as $
and _
in the elements of the JSON arrays. Characters other than numbers or letters are replaced with 00
in the output column names.
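A small, purely illustrative sketch of the counting step using Python's Counter (the column-name sanitisation rule is simplified and the prefix __group_A_ is assumed to follow the naming convention above):

```python
import json
import re
from collections import Counter
import pandas as pd

events = pd.Series(['["login", "logout"]',
                    '["login", "home", "logout", "login", "logout"]'])

# One occurrence count per distinct element, one row per input record.
counts = pd.DataFrame([Counter(json.loads(v)) for v in events]).fillna(0).astype(int)

# Mirror the naming rule described above: non-alphanumeric characters become "00".
counts.columns = ["__group_A_" + re.sub(r"[^0-9A-Za-z]", "00", c) for c in counts.columns]
print(counts)
```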
Parse JSON objects in column
Unique identifier (API): parse_json_objects
Parses each value in the input column as a JSON object, whose schema is specified by the user. Each attribute is extracted as one output column. All output columns will contain null values wherever the Text string is not parsable as a JSON object. If parsable but one or more of the specified attributes is missing or of incompatible type, their corresponding output columns will contain null values.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
allow_single_quotes | Whether to allow single quoted JSONs | False | boolean |
Examples:
For the input dataset with column names and data types as indicated below:
movie_id | title | movie_cast |
---|---|---|
1 | Avatar | {"character": "Neytiri", "gender": 1, "name": "Zoe Saldana"} |
2 | Pirates of... | {"character": "Will T", "gender": 2, "name": "Orlando Bloom"} |
3 | Spectre | {"character": "Blofeld", "gender": 2, "name": "C Waltz"} |
4 | Spider-Man 3 | {"character": "MJ", "gender": 1, "name": "Kirsten Dunst"} |
5 | X-Men | |
6 | Avengers | {"broken_json": "3", |
7 | Captain America | {"character": "Nick Fury", "gender": "M", "name_missing": true} |
8 | The Hobbit | ["Array instead of object 1", "Array instead of object 2"] |
Applying this action with input column = 'movie_cast'
and attribute/data type/output column trios [("character", "string", "character_name"), ("gender", "integer", "character_gender"), ("name", "string", "character_acted_by")]
produces the output:
movie_id | title | movie_cast | character_name | character_gender | character_acted_by |
---|---|---|---|---|---|
1 | Avatar | {"character": "Neytiri", "gender": 1, "name": "Zoe Saldana"} | Neytiri | 1 | Zoe Saldana |
2 | Pirates of... | {"character": "Will T", "gender": 2, "name": "Orlando Bloom"} | Will T | 2 | Orlando Bloom |
3 | Spectre | {"character": "Blofeld", "gender": 2, "name": "C Waltz"} | Blofeld | 2 | C Waltz |
4 | Spider-Man 3 | {"character": "MJ", "gender": 1, "name": "Kirsten Dunst"} | MJ | 1 | Kirsten Dunst |
5 | X-Men | nan | |||
6 | Avengers | {"broken_json": "3", | nan | ||
7 | Captain America | {"character": "Nick Fury", "gender": "M", "name_missing": true} | Nick Fury | nan | |
8 | The Hobbit | ["Array instead of object 1", "Array instead of object 2"] | nan |
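The null-handling rules above (unparsable strings, non-object values, and missing attributes all yield nulls) can be sketched as follows; this is an illustration only, not the action's implementation:

```python
import json
import pandas as pd

movie_cast = pd.Series([
    '{"character": "Neytiri", "gender": 1, "name": "Zoe Saldana"}',
    '{"broken_json": "3",',                     # not parsable -> all nulls
    '["Array instead of object 1"]',            # parsable but not an object -> all nulls
])

def extract(value, attribute):
    try:
        obj = json.loads(value)
    except (TypeError, json.JSONDecodeError):
        return None
    if not isinstance(obj, dict):
        return None
    return obj.get(attribute)                    # missing attribute -> null

result = pd.DataFrame({
    "character_name": movie_cast.map(lambda v: extract(v, "character")),
    "character_gender": movie_cast.map(lambda v: extract(v, "gender")),
    "character_acted_by": movie_cast.map(lambda v: extract(v, "name")),
})
print(result)
```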
Text/NLP
Get topic distribution probabilities from text
Unique identifier (API): get_topic_distribution_from_text
Analyses the input text columns and extracts underlying topics in the text. Each record is considered a text document that has originated from the underlying topics. The output numeric column group contains the probability distribution over the identified topics for each input record (document)
Parameters | Parameter Description | Default | Parameter Type | Parameter Min | Parameter Max | Parameter Step |
---|---|---|---|---|---|---|
num_topics | The number of different topics that are assumed to be present in the input text column. | 5 | range | 2.0 | 500.0 | 1.0 |
max_iterations | Sets the maximum number of iterations that can be run by the algorithm for extracting the underlying topics | 20 | range | 15.0 | 30.0 | 1.0 |
seed | Random seed for topic modeling algorithms. Used for obtaining repeatable results. | integer |
Examples:
For the following input:
text |
---|
you ever wanted to try some yourself |
not only a fun read but also a pretty infographic |
something not understandable by humans |
text of all jane austens works if |
not only a fun read but also a pretty infographic |
and the parameter values:
num_topics == 2
(Number of topics to be extracted)
We get the following output:
text | __group_text_topics_0 | __group_text_topics_1 |
---|---|---|
you ever wanted to try some yourself | 0.400954 | 0.599046 |
not only a fun read but also a pretty infographic | 0.239050 | 0.760950 |
something not understandable by humans | 0.211497 | 0.788503 |
text of all jane austens works if | 0.435378 | 0.564622 |
not only a fun read but also a pretty infographic | 0.434958 | 0.565042 |
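For orientation only, a comparable topic-distribution output can be produced with scikit-learn's LDA; this is a generic sketch, not the action's implementation, and the exact probabilities will differ from the table above:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

texts = ["you ever wanted to try some yourself",
         "not only a fun read but also a pretty infographic",
         "something not understandable by humans",
         "text of all jane austens works if",
         "not only a fun read but also a pretty infographic"]

counts = CountVectorizer().fit_transform(texts)
lda = LatentDirichletAllocation(n_components=2, max_iter=20, random_state=0)
topic_probabilities = lda.fit_transform(counts)   # one probability row per document
print(topic_probabilities)
```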
Get TF-IDF features from text
Unique identifier (API): tf_idf
Applies TF-IDF + dimensionality reduction
Parameters | Parameter Description | Default | Parameter Type | Parameter Min | Parameter Max | Parameter Step |
---|---|---|---|---|---|---|
post_processing_method | Algorithm for reducing the dimensionality of the TF-IDF output | truncated_svd | single-select | |||
num_output_dimensions | The number of dimensions after post processing of TF-IDF output. For example, if the post-processing method is truncated SVD, and 100 output dimensions are chosen, then the dimensions retained after SVD truncation correspond to 100 largest singular values. | 50 | range | 2.0 | 100.0 | 1.0 |
hashing_tf_num_features_method | How the number of features produced by the Hashing TF stage is determined. The larger the vocabulary size of the corpus, the more the number of hashing features needed for fidelity. If set to 'Automatic', the number of features to be produced by the Hashing TF stage is set to the smallest power of 2 that is closest to the estimated vocabulary size. | auto | single-select | |||
hashing_tf_num_features | The cardinality of the set of possible hash values output by the hash function. The hash function is used to convert a document (text string) into sparse vectors after tokenization. This can be used to control the number of dimensions output by TF-IDF before dimensionality reduction. Smaller numbers result in higher chance of two terms being confused, while larger numbers result in longer computation times. It is advisable to use a power of two for the parameter, otherwise the features will not be mapped evenly to the columns. | 4096 | range | 32.0 | 65536.0 | 32.0 |
Examples:
For the following input:
x |
---|
cant say for sure what they used but |
there is an r library that includes the |
text of all jane austens works if |
you ever wanted to try some yourself |
they probably got the idea |
from |
which uses jane austens work extensively |
as an ongoing example when |
showcasing the tidytext library |
the specific library of jane |
austens work is library |
im guessing they used tidytext |
and janeaustenr in r, then used a slightly |
different analysis |
using the results from r |
in tableau or another data-viz |
heavy program to make it palatable for |
readers |
pun intended sorry who has already |
done a lot of |
really cool analysis like this |
they claimed in the text that |
they used pca |
but what they dont say is how they |
created the matrix that was fed into the pca |
and the parameter values:
num_components == 2
(Number of PCA components --> Reducing to 2D)
num_features == 1024
(Number of hashing features the text is converted into)
We get the following output:
x | __group_x_encoded_0 | __group_x_encoded_1 |
---|---|---|
cant say for sure what they used but | -3.6428402533142896 | 0.4975849683664332 |
there is an r library that includes the | 0.7746395720660133 | 0.5766808957643564 |
text of all jane austens works if | 0.8119351913774921 | 0.317909114966179 |
you ever wanted to try some yourself | 0.7487119280397619 | -0.8721857902857719 |
they probably got the idea | -0.14175253549746752 | 0.5342944067879885 |
from | 0.03192967086734439 | 0.06422168591488635 |
which uses jane austens work extensively | 0.29369225493121526 | 0.8832400029096568 |
as an ongoing example when | 0.7196285830258167 | -0.7732919475505083 |
showcasing the tidytext library | 0.29799339616128284 | 0.26371593073556526 |
the specific library of jane | 0.7639567537897723 | -0.1252943669742009 |
austens work is library | 0.024263102404871856 | 0.8407417872671388 |
im guessing they used tidytext | -0.5380315745685673 | 0.8071242590633404 |
and janeaustenr in r, then used a slightly | -0.7396546964226968 | -1.5252802271552306 |
different analysis | -0.0791096097809773 | 0.016454344219511016 |
using the results from r | 0.39190114778641166 | 0.2030087287822549 |
in tableau or another data-viz | 0.08255810541303524 | 0.28434896277132304 |
heavy program to make it palatable for | -1.0379890382908399 | -0.9359229625044558 |
readers | 0.022451788887236538 | 0.15649840774931126 |
pun intended sorry who has already | 0.12645774034218657 | 0.5764433381841557 |
done a lot of | 0.0645538301832265 | -1.8895773134920169 |
really cool analysis like this | -0.8499956685146961 | -0.6456009071301254 |
they claimed in the text that | 0.31310522183662226 | 0.566252028711451 |
they used pca | -0.26125074206706816 | 0.5402513332430933 |
but what they dont say is how they | -3.1418969765868843 | 1.9805533718325532 |
created the matrix that was fed into the pca | 1.3754363219912489 | 0.2072323270271102 |
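A comparable "hashed term frequencies, then TF-IDF, then dimensionality reduction" pipeline can be sketched with scikit-learn (illustrative only; the action's actual engine and the exact numbers above may differ):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

texts = ["cant say for sure what they used but",
         "there is an r library that includes the",
         "text of all jane austens works if",
         "they used pca"]

# Hash each document into 1024 term-frequency features, apply IDF weighting,
# then reduce to 2 output dimensions with truncated SVD.
pipeline = make_pipeline(
    HashingVectorizer(n_features=1024, alternate_sign=False),
    TfidfTransformer(),
    TruncatedSVD(n_components=2, random_state=0),
)
encoded = pipeline.fit_transform(texts)
print(encoded.shape)   # (4, 2)
```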
Text/Reshaping
Split by delimiter and unpack into rows
Unique identifier (API): split_and_unpack
Splits the text content in the input column by the specified delimiter and unpacks the result into multiple rows, creating an additional column for position index. The delimiter can be one or more specific characters, or a regex pattern to match.
Parameters | Parameter Description | Default | Parameter Type | Suggested Values |
---|---|---|---|---|
delimiter | Delimiter to be used for splitting the string | string | [',', ';', '-', '\|'] |
use_regex | Boolean flag indicating whether the specified delimiter should be treated as a regex pattern or an exact match. | False | boolean |
Examples:
Given this input dataset:
('col_1', 'Text') | ('col_2', 'Numeric') |
---|---|
star | trek |
game | of |
sein,field | 2.23606 |
the | simpsons |
the | big |
with the column to split as "col_1"
and the additional split position column name provided as "split_position"
and use_regex=True
, the output dataset will be:
('col_1', 'Text') | ('col_2', 'Numeric') | ('split_position', 'Numeric') |
---|---|---|
star | 2.71828 | 0 |
trek | 2.71828 | 1 |
game | 3.14159 | 0 |
of | 3.14159 | 1 |
thrones | 3.14159 | 2 |
sein | 2.23606 | 0 |
field | 2.23606 | 1 |
the | 1.732 | 0 |
simpsons | 1.732 | 1 |
the | nan | 0 |
big | nan | 1 |
bang | nan | 2 |
theory | nan | 3 |
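The example's exact regex delimiter is not shown above, so the sketch below uses a hypothetical pattern (commas or whitespace) purely to illustrate the split-then-unpack-with-position behaviour; it is not the action's implementation:

```python
import re
import pandas as pd

df = pd.DataFrame({"col_1": ["sein,field", "the big bang theory"],
                   "col_2": [2.23606, float("nan")]})

# Hypothetical regex delimiter for illustration only.
parts = df["col_1"].map(lambda s: re.split(r"[,\s]+", s))

unpacked = (df.assign(col_1=parts)
              .explode("col_1")
              .assign(split_position=lambda d: d.groupby(level=0).cumcount())
              .reset_index(drop=True))
print(unpacked)
```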
Text/Web
Extract URL parts from URL Text string
Unique identifier (API): extract_url_parts
Extracts URL components from a given column containing URL strings
Time Series
Resample data into a regular time series
Unique identifier (API): resample_time_series_v2
Resample data with a time-index column into a regular time series, with a specified aggregation function for each input column. Null-value interpolation is applied.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
sampling_interval | Interval/Duration between successive records | integer | |
time_unit | Unit for the time interval | day | single-select |
Examples:
For example, given the following dataset
receipt_time | qty | sku | expected_receipt_time |
---|---|---|---|
null | 30 | A231 | 2012-06-12 02:00:00 |
2012-06-13 00:10:00 | 40 | A351 | 2012-06-13 00:00:00 |
2012-06-13 00:30:00 | 100 | A351 | 2012-06-13 00:30:00 |
2012-06-13 20:45:20 | 29 | A456 | 2012-06-13 00:30:00 |
2012-06-15 21:40:03 | 50 | A451 | 2012-06-16 01:30:00 |
2012-06-17 05:41:23 | 80 | A221 | 2012-06-20 08:30:00 |
2012-06-17 11:10:53 | 70 | A221 | 2012-06-16 20:00:00 |
with time-index column receipt_time
, for parameters sampling_interval = 1
, time_unit = "day"
, input columns qty
, sku
, expected_receipt_time
and the corresponding aggregation functions "sum"
, "topk"
(k_most_frequent = 3
), "latest"
, the result is as follows
receipt_time | qty | sku | expected_receipt_time |
---|---|---|---|
2012-06-14 00:00:00 | 169 | ["A351","A456"] | 2012-06-13 00:30:00 |
2012-06-15 00:00:00 | null | null | null |
2012-06-16 00:00:00 | 50 | ["A451"] | 2012-06-16 01:30:00 |
2012-06-17 00:00:00 | null | null | null |
2012-06-18 00:00:00 | 150 | ["A221"] | 2012-06-20 08:30:00 |
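For illustration only, the core resampling idea (a regular daily grid with nulls for empty intervals) looks like this in pandas, here limited to the "sum" aggregation of qty on a small subset of the data:

```python
import pandas as pd

df = pd.DataFrame({
    "receipt_time": pd.to_datetime(["2012-06-13 00:10:00",
                                    "2012-06-13 20:45:20",
                                    "2012-06-15 21:40:03"]),
    "qty": [40, 29, 50],
})

# sampling_interval = 1, time_unit = "day"; days with no records yield nulls
# (min_count=1) rather than zeros or being dropped.
regular = (df.set_index("receipt_time")
             .resample("1D")["qty"]
             .sum(min_count=1))
print(regular)
```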
Window Function
Compute window functions
Unique identifier (API): compute_window_function
Compute functions such as cumulative sum, rolling mean, etc. by specifying a window and desired functions with their inputs. Unbounded, shrinking, expanding and rolling windows are supported, with optional partitioning and ordering
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
window_start | Whether to set window start unbounded or a specified position relative to current row, with unbounded by default | unbounded | single-select |
start_position | Position from which window starts, relative to the current row (0). Negative and positive numbers indicate the number of rows that precede and follow the current row respectively. | integer | |
window_end | Whether to set window end unbounded or a specified position relative to current row, with unbounded by default | unbounded | single-select |
end_position | Position from which window ends, relative to the current row (0). Negative and positive numbers indicate the number of rows that precede and follow the current row respectively. | integer |
Examples:
Given the dataset
animal | colour | birth | qty | desexed | note |
---|---|---|---|---|---|
cat | yellow | 2019-01-01 00:00:00 | 10 | True | There is a story |
cat | blue | 2019-01-03 00:00:00 | 9 | True | No story |
dog | blue | 2019-01-02 00:00:00 | 4 | True | |
cat | black | 2019-01-07 00:00:00 | 2 | True | No story |
cat | blue | 2019-01-04 00:00:00 | 8 | True | There is a story |
dog | yellow | 2019-01-05 00:00:00 | 1 | True | There is another story |
dog | black | 2019-01-06 00:00:00 | 2 | False | There is a story |
dog | yellow | 2019-01-08 00:00:00 | 2 | False |
for input columns qty
, colour
, birth
, desexed
, note
and the corresponding aggregation functions
sum
, most_frequent
, latest
, all
, count (non-null)
, the result is as follows
animal | colour | birth | qty | desexed | note | qty_sum | colour_most_frequent | birth_latest | desexed_all | note_count |
---|---|---|---|---|---|---|---|---|---|---|
dog | blue | 2019-01-02 00:00:00 | 4 | True | 4 | blue | 2019-01-02 00:00:00 | True | 0 | |
dog | yellow | 2019-01-05 00:00:00 | 1 | True | There is another story | 5 | blue | 2019-01-05 00:00:00 | True | 1 |
dog | black | 2019-01-06 00:00:00 | 2 | False | There is a story | 7 | black | 2019-01-06 00:00:00 | False | 2 |
dog | yellow | 2019-01-08 00:00:00 | 2 | False | 9 | yellow | 2019-01-08 00:00:00 | False | 2 | |
cat | yellow | 2019-01-01 00:00:00 | 10 | True | There is a story | 10 | yellow | 2019-01-01 00:00:00 | True | 1 |
cat | blue | 2019-01-03 00:00:00 | 9 | True | No story | 19 | blue | 2019-01-03 00:00:00 | True | 2 |
cat | blue | 2019-01-04 00:00:00 | 8 | True | There is a story | 27 | blue | 2019-01-04 00:00:00 | True | 3 |
cat | black | 2019-01-07 00:00:00 | 2 | True | No story | 29 | blue | 2019-01-07 00:00:00 | True | 4 |
with window of unbounded start and end position 0, partitioned by animal
and ordered by birth
.
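As a minimal pandas sketch of one of the functions above (illustrative only, not the action's API): an unbounded-start, current-row-end window partitioned by animal and ordered by birth is simply a running sum per partition.

```python
import pandas as pd

df = pd.DataFrame({
    "animal": ["cat", "cat", "dog", "cat", "dog"],
    "birth": pd.to_datetime(["2019-01-01", "2019-01-03", "2019-01-02",
                             "2019-01-07", "2019-01-05"]),
    "qty": [10, 9, 4, 2, 1],
})

# Window with unbounded start and end position 0 (the current row),
# partitioned by animal and ordered by birth: a cumulative sum per animal.
df = df.sort_values(["animal", "birth"])
df["qty_sum"] = df.groupby("animal")["qty"].cumsum()
print(df)
```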
Rank rows
Unique identifier (API): window_rank
Rank rows after ordering and optional partitioning
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
rank_method | Method used to get the rank | rank | single-select |
Examples:
For example, the following dataset
+------+------+------+
|animal|colour|credit|
+------+------+------+
|cat |blue |-3.0 |
|cat |blue |-1.0 |
|cat |blue |5.0 |
|cat |blue |7.0 |
|cat |blue |10.0 |
|cat |yellow|1.0 |
|cat |yellow|4.0 |
|cat |yellow|4.0 |
|cat |yellow|4.0 |
|cat |yellow|8.0 |
|dog |blue |9.0 |
|dog |blue |7.0 |
|dog |blue |6.0 |
|dog |blue |5.0 |
|dog |blue |-10.0 |
|dog |yellow|null |
|dog |yellow|5.0 |
|dog |yellow|8.0 |
|dog |yellow|null |
|dog |yellow|1.0 |
+------+------+------+
is converted to
+------+------+------+----+
|animal|colour|credit|rank|
+------+------+------+----+
|dog |blue |-10.0 |1 |
|dog |blue |5.0 |2 |
|dog |blue |6.0 |3 |
|dog |blue |7.0 |4 |
|dog |blue |9.0 |5 |
|cat |yellow|1.0 |1 |
|cat |yellow|4.0 |2 |
|cat |yellow|4.0 |2 |
|cat |yellow|4.0 |2 |
|cat |yellow|8.0 |5 |
|cat |blue |-3.0 |1 |
|cat |blue |-1.0 |2 |
|cat |blue |5.0 |3 |
|cat |blue |7.0 |4 |
|cat |blue |10.0 |5 |
|dog |yellow|null |1 |
|dog |yellow|null |1 |
|dog |yellow|1.0 |3 |
|dog |yellow|5.0 |4 |
|dog |yellow|8.0 |5 |
+------+------+------+----+
after applying window_rank
, partitioned by animal
and colour
and ordered by credit
for option rank
.
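For illustration only, the 'rank' option (ties share a rank and the following rank is skipped) corresponds to pandas' method="min", as in this small sketch on a subset of the data:

```python
import pandas as pd

df = pd.DataFrame({
    "animal": ["dog", "dog", "dog", "cat", "cat", "cat"],
    "colour": ["blue", "blue", "blue", "yellow", "yellow", "yellow"],
    "credit": [9.0, 7.0, -10.0, 4.0, 4.0, 8.0],
})

# rank_method == 'rank': ties share a rank and the next rank is skipped.
df["rank"] = (df.groupby(["animal", "colour"])["credit"]
                .rank(method="min")
                .astype(int))
print(df.sort_values(["animal", "colour", "credit"]))
```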
Shift rows after ordering optionally with partitioning
Unique identifier (API): window_shift
Shift rows for input columns after ordering, optionally with partitioning, by a specified value indicating the number of rows to shift and the shift direction
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
n_rows | Number of rows to shift, where positive values indicate moving downwards (lag) and negative values indicate moving upwards (lead) | 1 | integer |
Examples:
For example, given the following dataset
animal | colour | adopted | qty | vehicle |
---|---|---|---|---|
cat | yellow | 2020-01-01 00:00:00 | 3.0 | car |
cat | yellow | 2020-01-02 00:00:00 | 1.0 | truck |
cat | yellow | 2020-01-03 00:00:00 | 5.0 | car |
cat | yellow | 2020-01-04 00:00:00 | 7.0 | van |
cat | yellow | 2020-01-05 00:00:00 | 10.0 | van |
cat | white | 2020-01-06 00:00:00 | 1.0 | car |
cat | white | 2020-01-07 00:00:00 | 4.0 | van |
cat | white | 2020-01-08 00:00:00 | 4.0 | car |
cat | white | 2020-01-09 00:00:00 | 4.0 | truck |
cat | white | 2020-01-10 00:00:00 | 8.0 | car |
dog | yellow | 2020-01-11 00:00:00 | 9.0 | car |
dog | yellow | 2020-01-12 00:00:00 | 7.0 | van |
dog | yellow | 2020-01-13 00:00:00 | 6.0 | car |
dog | yellow | 2020-01-14 00:00:00 | 5.0 | van |
dog | yellow | 2020-01-15 00:00:00 | 10.0 | car |
dog | white | 2020-01-16 00:00:00 | null | van |
dog | white | 2020-01-17 00:00:00 | 5.0 | car |
dog | white | 2020-01-18 00:00:00 | 8.0 | car |
dog | white | 2020-01-19 00:00:00 | null | truck |
dog | white | 2020-01-20 00:00:00 | 1.0 | car |
the result is
animal | colour | adopted | qty | vehicle | qty_adjusted | vehicle_adjusted |
---|---|---|---|---|---|---|
cat | white | 2020-01-06 00:00:00 | 1.0 | car | 4.0 | van |
cat | white | 2020-01-07 00:00:00 | 4.0 | van | 4.0 | car |
cat | white | 2020-01-08 00:00:00 | 4.0 | car | 4.0 | truck |
cat | white | 2020-01-09 00:00:00 | 4.0 | truck | 8.0 | car |
cat | white | 2020-01-10 00:00:00 | 8.0 | car | null | null |
cat | yellow | 2020-01-01 00:00:00 | 3.0 | car | 1.0 | truck |
cat | yellow | 2020-01-02 00:00:00 | 1.0 | truck | 5.0 | car |
cat | yellow | 2020-01-03 00:00:00 | 5.0 | car | 7.0 | van |
cat | yellow | 2020-01-04 00:00:00 | 7.0 | van | 10.0 | van |
cat | yellow | 2020-01-05 00:00:00 | 10.0 | van | null | null |
dog | white | 2020-01-16 00:00:00 | null | van | 5.0 | car |
dog | white | 2020-01-17 00:00:00 | 5.0 | car | 8.0 | car |
dog | white | 2020-01-18 00:00:00 | 8.0 | car | null | truck |
dog | white | 2020-01-19 00:00:00 | null | truck | 1.0 | car |
dog | white | 2020-01-20 00:00:00 | 1.0 | car | null | null |
dog | yellow | 2020-01-11 00:00:00 | 9.0 | car | 7.0 | van |
dog | yellow | 2020-01-12 00:00:00 | 7.0 | van | 6.0 | car |
dog | yellow | 2020-01-13 00:00:00 | 6.0 | car | 5.0 | van |
dog | yellow | 2020-01-14 00:00:00 | 5.0 | van | 10.0 | car |
dog | yellow | 2020-01-15 00:00:00 | 10.0 | car | null | null |
with input columns qty
and vehicle
, partitioned by animal
and colour
and ordered by adopted
, for shifted n_rows = -1
(lead).
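The lead/lag behaviour can be sketched with pandas for illustration only (not the action's implementation), here on a subset of the data with n_rows = -1:

```python
import pandas as pd

df = pd.DataFrame({
    "animal": ["cat", "cat", "cat", "dog", "dog"],
    "adopted": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03",
                               "2020-01-11", "2020-01-12"]),
    "qty": [3.0, 1.0, 5.0, 9.0, 7.0],
})

# n_rows = -1 (lead): each row receives the next row's value within its partition,
# ordered by the adoption date; the last row of each partition becomes null.
df = df.sort_values(["animal", "adopted"])
df["qty_adjusted"] = df.groupby("animal")["qty"].shift(-1)
print(df)
```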
Deprecated actions
The following actions, which were available previously, have now been deprecated. This means that they can no longer be found in the recipe UI, as they have been replaced by newer actions that are better in terms of usability or performance. However, existing recipes that contain these actions will continue to be supported.
Column Group
Take
Unique identifier (API): take
Please use the newer version get_column_group_components
Extracts specified components from a column group. If the component is not found, returns a null column.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
address_type | Method of selection of group column component to take | single-select | |
index | The component index (zero based) to take from the group. E.g. if the group columns are ["__group_grp_a", "__group_grp_b"] and index==1, the taken column will be "__group_grp_b". | 0 | integer |
suffix | The component suffix to take from the group. E.g. if the group columns are ["__group_grp_a", "__group_grp_b"] and suffix=="a", the taken column will be "__group_grp_a". | string |
Examples:
For the following set of parameters and the following input table:
__group_grp_a_0 | __group_grp_a_1 | __group_grp_b_0 | __group_grp_b_1 |
---|---|---|---|
1 | 2 | 1 | null |
Sentence A | Sentence B | Sentence C | Sentence D |
The output is:
__group_grp_a_0 | __group_grp_a_1 | __group_grp_b_0 | __group_grp_b_1 | col_from_grp_a | col_from_grp_b |
---|---|---|---|---|---|
1 | 2 | 1 | null | 2 | null |
Sentence A | Sentence B | Sentence C | Sentence D | Sentence B | Sentence D |
Custom Formula
Create a new column with a formula
Unique identifier (API): create_new_column_with_formula
Please use the new version create_new_column_with_formula_v2
Creates a new column with any user-entered formula. Common Excel/SQL functions and operators for manipulating Numeric, DateTime, Text, and Boolean values are permitted. For a full list of allowed functions, see documentation.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
expression | Any formula where symbols correspond to the input columns in lexical order. For example, if the input selection has column_B then column_A and the formula is y / x, column_B is mapped to x and column_A is mapped to y. | string |
Examples:
Consider the dataset:
manufacturer | model | year | displ | cty | hwy |
---|---|---|---|---|---|
audi | a4 quattro | 1999 | 1.8 | 16 | 25 |
audi | a6 quattro | 1999 | 2.8 | 15 | 24 |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 |
chevrolet | c1500 suburban 2wd | 2008 | 6 | 12 | 17 |
chevrolet | corvette | 1999 | 5.7 | 16 | 26 |
chevrolet | corvette | 1999 | 5.7 | 15 | 23 |
chevrolet | k1500 tahoe 4wd | 1999 | 6.5 | 14 | 17 |
chevrolet | malibu | 1999 | 3.1 | 18 | 26 |
dodge | caravan 2wd | 1999 | 3.8 | 15 | 22 |
where cty
, hwy
and year
are of Numeric
type.
Given the input columns selected as ['cty', 'hwy']
, the output column name 'hwy_minus_cty'
, the formula 'y - x'
, the result will be:
manufacturer | model | year | displ | cty | hwy | hwy_minus_cty |
---|---|---|---|---|---|---|
audi | a4 quattro | 1999 | 1.8 | 16 | 25 | 9 |
audi | a6 quattro | 1999 | 2.8 | 15 | 24 | 9 |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 | 7 |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 | 7 |
chevrolet | c1500 suburban 2wd | 2008 | 6 | 12 | 17 | 5 |
chevrolet | corvette | 1999 | 5.7 | 16 | 26 | 10 |
chevrolet | corvette | 1999 | 5.7 | 15 | 23 | 8 |
chevrolet | k1500 tahoe 4wd | 1999 | 6.5 | 14 | 17 | 3 |
chevrolet | malibu | 1999 | 3.1 | 18 | 26 | 8 |
dodge | caravan 2wd | 1999 | 3.8 | 15 | 22 | 7 |
where the new hwy_minus_cty
column will be of Numeric
type.
For the same input data, given the input columns selected as ['year']
, the output column name 'century'
the formula 'IF(x <= 2000, "20th", "21st")'
the result will be:
manufacturer | model | year | displ | cty | hwy | century |
---|---|---|---|---|---|---|
audi | a4 quattro | 1999 | 1.8 | 16 | 25 | 20th |
audi | a6 quattro | 1999 | 2.8 | 15 | 24 | 20th |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 | 21st |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 | 21st |
chevrolet | c1500 suburban 2wd | 2008 | 6 | 12 | 17 | 21st |
chevrolet | corvette | 1999 | 5.7 | 16 | 26 | 20th |
chevrolet | corvette | 1999 | 5.7 | 15 | 23 | 20th |
chevrolet | k1500 tahoe 4wd | 1999 | 6.5 | 14 | 17 | 20th |
chevrolet | malibu | 1999 | 3.1 | 18 | 26 | 20th |
dodge | caravan 2wd | 1999 | 3.8 | 15 | 22 | 20th |
where the new century
column will be of Text
type.
Transform columns with a formula
Unique identifier (API): transform_columns_with_formula
Please use the new version transform_columns_with_formula_v2
Transform one or more column(s) with a user-specified formula. Common Excel/SQL functions and operators for manipulating Numeric, DateTime, Text and Boolean values are permitted. For a full list of these functions, see documentation.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
formula | Formula where the symbol SELF refers to the column(s) to be transformed, and the rest of the symbols in lexical order are mapped to the other input column(s), if any. For example, if the columns to be transformed are col_1 and col_2, the other input columns are col_A and col_B, and the formula is SELF + z - y, then col_i is modified as col_i + col_B - col_A for i = 1, 2. | string |
Examples:
For example, given the following dataset
qty_1 | qty_2 |
---|---|
10 | nan |
20 | 15 |
15 | 6 |
nan | 1 |
8 | 8 |
and formula = "SELF + (y + SELF) * 10"
, with qty_1
as column to be transformed and qty_2
as the other input column, the result will be
qty_1 | qty_2 |
---|---|
nan | nan |
370 | 15 |
225 | 6 |
nan | 1 |
168 | 8 |
Data Cleaning/Processing
Remove duplicated rows within group
Unique identifier (API): group_by_remove_duplicates
Please use the newer version group_by
Remove duplicated rows for one or more input columns grouped by other selected column(s).
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
keep | First or last row to keep | first | single-select |
Examples:
For example, given the following dataset
animal | colour | carried | description | qty | time | desexed |
---|---|---|---|---|---|---|
cat | yellow | van | There is a story. | 0.5416980271485203 | 2020-10-10 00:00:00 | true |
cat | yellow | van | There is a story. | 0.09442987061798414 | 2020-10-10 00:00:00 | false |
cat | yellow | car | There is a story. | 0.5766948082733302 | 2020-10-10 00:00:00 | true |
cat | yellow | van | There is a story. | 0.8015245589198161 | 2020-01-01 00:00:00 | false |
cat | yellow | van | There is a story. | 0.2523252855972161 | 2019-01-01 00:00:00 | true |
cat | white | van | There is a story. | 0.044270295367698886 | 2020-01-01 00:00:00 | true |
cat | white | car | There is another story. | 0.34340627533339274 | 2020-01-01 00:00:00 | true |
cat | white | van | There is another story. | 0.243885710885979 | 2019-01-10 00:00:00 | true |
cat | white | car | There is another story. | 0.9875301137422845 | 2020-01-01 00:00:00 | false |
cat | white | van | There is a story. | 0.4416194595201147 | 2019-01-10 00:00:00 | false |
dog | yellow | van | There is another story. | 0.18251646225487406 | 2019-01-10 00:00:00 | false |
dog | yellow | car | There is a story. | 0.9073423733435338 | 2019-01-10 00:00:00 | false |
dog | yellow | car | There is another story. | 0.5242232544087159 | 2020-01-01 00:00:00 | false |
dog | yellow | van | There is a story. | 0.09555633797329921 | 2019-01-10 00:00:00 | true |
dog | yellow | car | There is a story. | 0.13287387217823854 | 2020-10-10 00:00:00 | true |
dog | white | car | There is another story. | 0.8686814568933312 | 2020-10-02 00:00:00 | true |
dog | white | car | There is another story. | 0.5376552548800148 | 2019-01-10 00:00:00 | true |
dog | white | car | There is another story. | 0.6979028231254814 | 2019-01-10 00:00:00 | false |
dog | white | van | There is a story. | 0.19022329903980462 | 2020-01-01 00:00:00 | true |
dog | white | car | There is another story. | 0.856685303343519 | 2020-01-01 00:00:00 | false |
with input columns animal
and colour
, keep="first"
, the result is
animal | colour | carried | description | qty | time | desexed |
---|---|---|---|---|---|---|
cat | yellow | van | There is a story. | 0.5416980271485203 | 2020-10-10 00:00:00 | true |
cat | white | van | There is a story. | 0.044270295367698886 | 2020-01-01 00:00:00 | true |
dog | yellow | van | There is another story. | 0.18251646225487406 | 2019-01-10 00:00:00 | false |
dog | white | car | There is another story. | 0.8686814568933312 | 2020-10-02 00:00:00 | true |
Impute columns
Unique identifier (API): impute_column
Please use the newer version impute_numeric or impute_categorical
Impute the missing values in your data
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
categorical_strategy | Strategy for imputing Categorical columns | most_frequent | single-select |
categorical_fill_value | Filler value for Categoric columns when strategy is constant | string | |
numeric_strategy | Strategy for imputing Numeric columns | mean | single-select |
numeric_fill_value | Filler value for Numeric columns when strategy is constant | float |
Feature Engineering
Calculate math function
Unique identifier (API): calculate_math_function
Please use the action create_new_column_with_formula_v2
Calculate a math function on selected columns. Invalid values such as negative input for a logarithmic function result in null outputs.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
math_function | Math function applied to columns | single-select |
Examples:
For example, compute the absolute value abs(x).
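A small numpy/pandas sketch of the null-on-invalid-input behaviour described above (the series and variable names are illustrative):
```python
import numpy as np
import pandas as pd

s = pd.Series([4.0, -2.0, 9.0])
abs_result = s.abs()                 # abs(x) is defined everywhere
log_result = np.log(s.where(s > 0))  # ln(x); non-positive inputs become null
```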
Calculate math function (2)
Unique identifier (API): calculate_math_function2
Please use the action create_new_column_with_formula_v2
Calculate a math function that takes two inputs. If invalid inputs are encountered (such as division by zero), the outputs will be null.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
math_function | Math function applied to input columns | single-select |
Examples:
For example, compute the difference (c = a - b) or the ratio (c = a/b).
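Similarly, a sketch of the two-input case where division by zero yields null rather than an error or infinity (series names assumed):
```python
import numpy as np
import pandas as pd

a = pd.Series([10.0, 4.0, 7.0])
b = pd.Series([2.0, 0.0, 7.0])
difference = a - b                                   # c = a - b
ratio = (a / b).replace([np.inf, -np.inf], np.nan)   # division by zero -> null
```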
Cluster and extract features
Unique identifier (API): cluster
Please use the newer version get_cluster_features_v2
Cluster the input data using unsupervised learning algorithms. Outputs three features: the cluster ID, the cluster size, and the Euclidean distance from the centroid of the cluster.
Parameters | Parameter Description | Default | Parameter Type | Parameter Min | Parameter Max | Parameter Step |
---|---|---|---|---|---|---|
algorithm | The algorithm used for clustering, such as k-means or Gaussian Mixture Model. | single-select | ||||
num_clusters_method | The method that is utilized for calculating the number of clusters | inferred | single-select | |||
num_clusters | The numbers of clusters to generate. | 2 | range | 2.0 | 100.0 | 1.0 |
max_num_clusters | This defines the search space for the algorithm inferring the best number of clusters automatically. Setting it to a value of K will make the algorithm look for the best number of clusters in the range [2, 3, ..., K] | 10 | range | 2.0 | 100.0 | 1.0 |
seed | Nonnegative integer. Random seed for clustering algorithms. Used for obtaining repeatable results. | 0 | integer | |||
tolerance | The minimum change in the squared distances between cluster centroids between two successive iterations, for the algorithm to continue to the next iteration. If the change is below the threshold specified by this parameter, the algorithm is assumed to have converged before reaching the maximum number of iterations. A lower value for this parameter results in higher fidelity in the assignment of cluster IDs, but can make the action run slower especially on large datasets. | 0.0001 | float | |||
max_iter | The maximum number of iterations regardless of convergence. The higher it is the more accurate the clustering will be, but comes at the cost of a longer runtime. | 20 | integer | |||
min_cluster_size | The smallest size grouping that HDBSCAN considers as a cluster. Single linkage splits that contain fewer points than this will be considered points "falling out" of a cluster rather than a cluster splitting into two new clusters | 5 | integer | |||
min_samples_specified | If enabled, allows specifying the minimum number of samples parameter. Otherwise, will be determined automatically. | False | boolean | |||
min_samples | The number of samples in a neighbourhood for a point to be considered a core point. Providing a larger value for this parameter results in a more conservative clustering, which means more points will be declared as noise and clusters will be restricted to progressively more dense areas. | 1 | integer | |||
cluster_selection_epsilon | A distance threshold. Clusters below this value will be merged. The choice of cluster_selection_epsilon depends on the given distances between your data points. For example, set the value to 0.5 if you don't want to separate clusters that are less than 0.5 units apart. | 0 | float | |||
allow_single_cluster | By default HDBSCAN* will not produce a single cluster; setting this to True overrides that behaviour and allows a single-cluster result when that is a valid outcome for your dataset. | False | boolean
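To illustrate the three output features (cluster ID, cluster size, distance to the centroid), a scikit-learn k-means sketch; the platform's algorithms, defaults, and output column names will differ:
```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({"x": [0.1, 0.2, 5.0, 5.1], "y": [0.0, 0.1, 5.2, 5.3]})
km = KMeans(n_clusters=2, random_state=0, n_init=10).fit(df)

df["cluster_id"] = km.labels_
df["cluster_size"] = df.groupby("cluster_id")["cluster_id"].transform("size")
centroids = km.cluster_centers_[km.labels_]          # per-row centroid
df["distance_to_centroid"] = np.linalg.norm(
    df[["x", "y"]].to_numpy() - centroids, axis=1)   # Euclidean distance
```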
Extract PCA Components
Unique identifier (API): extract_pca
Please use the newer version extract_pca_v2
Create new numerical features by combining existing numerical features in linear ways using PCA
Parameters | Parameter Description | Default | Parameter Type | Parameter Min | Parameter Max | Parameter Step |
---|---|---|---|---|---|---|
limit_components_by | Criterion for selecting the principal components. If the minimum explained variance is used, the components that contribute most to the variance are selected until the cumulative variance is not less than the specified ratio of the total variance. | min_explained_variance | single-select |
num_components | The number of principal components to extract from the input data | integer | ||||
min_explained_variance | This is the threshold on the ratio of explained variance of a component to the total variance. Only the components with highest variance whose cumulative variance is at least the specified ratio of total variance will be selected | 0.8 | range | 0.01 | 1.0 | 0.01 |
categorical_impute_method | Method for imputing Categorical columns | most_frequent | single-select | |||
categorical_fill_value | Filler value for categorical columns when strategy is constant | string | ||||
numeric_impute_method | Method for imputing Numeric columns | mean | single-select | |||
numeric_fill_value | Filler value for Numeric columns when strategy is constant | float |
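The min_explained_variance criterion behaves like passing a float to scikit-learn's PCA, which keeps the smallest number of components whose cumulative explained-variance ratio reaches the threshold; imputation and the platform's output naming are omitted in this sketch:
```python
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                   "b": [1.1, 1.9, 3.2, 3.9],
                   "c": [0.5, 0.4, 0.6, 0.5]})
pca = PCA(n_components=0.8)          # min_explained_variance = 0.8
components = pca.fit_transform(df)   # one column per retained component
```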
One hot encode categorical columns
Unique identifier (API): one_hot_encode
Please use the newer version one_hot_encode_v2
This action converts categorical columns by one-hot encoding. Missing values remain as null. Unseen values are encoded as all 0s when a finalized recipe with this action is applied to new data.
Examples:
For example, given the following dataset
col_1 | col_2 |
---|---|
A | yellow |
B | null |
B | blue |
A | blue |
E | red |
the result is
col_1 | col_2 | __group_encoded_1_0 | __group_encoded_1_1 | __group_encoded_1_2 | __group_encoded_2_0 | __group_encoded_2_1 | __group_encoded_2_2 |
---|---|---|---|---|---|---|---|
A | yellow | 1 | 0 | 0 | 0 | 0 | 1 |
B | null | 0 | 1 | 0 | null | null | null |
B | blue | 0 | 1 | 0 | 1 | 0 | 0 |
A | blue | 1 | 0 | 0 | 1 | 0 | 0 |
E | red | 0 | 0 | 1 | 0 | 1 | 0 |
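A pandas approximation of the col_2 part of the example: get_dummies does the encoding, and rows with a null source value are masked back to null as described above. Column names are illustrative, not the platform's __group_encoded_* names:
```python
import pandas as pd

df = pd.DataFrame({"col_1": ["A", "B", "B", "A", "E"],
                   "col_2": ["yellow", None, "blue", "blue", "red"]})

encoded = pd.get_dummies(df["col_2"], prefix="col_2").astype(float)
encoded[df["col_2"].isna()] = float("nan")   # missing values stay null
result = pd.concat([df, encoded], axis=1)
```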
Filtering/Custom Formula
Filter rows
Unique identifier (API): filter_rows_with_formula
Please use the new version filter_rows_with_formula_v2
Filter rows in data by keeping only rows that satisfy the condition specified in user-entered formula. Common Excel/SQL functions and operators for manipulating Numeric, DateTime, Text, and Boolean values are permitted. For a full list of allowed functions, see documentation.
NOTE: The formula entered must evaluate to a boolean value.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
condition | The condition under which the rows are to be kept, entered as a formula. Symbols must correspond to the input columns in lexical order. E.g., if the input selection has column_B then column_A and the formula is y >= x, column_B is mapped to x and column_A is mapped to y. | string |
Examples:
Consider the dataset:
manufacturer | model | year | displ | cty | hwy |
---|---|---|---|---|---|
audi | a4 quattro | 1999 | 1.8 | 16 | 25 |
audi | a6 quattro | 1999 | 2.8 | 15 | 24 |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 |
chevrolet | c1500 suburban 2wd | 2008 | 6 | 12 | 17 |
chevrolet | corvette | 1999 | 5.7 | 16 | 26 |
chevrolet | corvette | 1999 | 5.7 | 15 | 23 |
chevrolet | k1500 tahoe 4wd | 1999 | 6.5 | 14 | 17 |
chevrolet | malibu | 1999 | 3.1 | 18 | 26 |
dodge | caravan 2wd | 1999 | 3.8 | 15 | 22 |
dodge | caravan 2wd | 2008 | 4 | 16 | 23 |
dodge | durango 4wd | 2008 | 5.7 | 13 | 18 |
dodge | durango 4wd | 2008 | 5.7 | 13 | 18 |
dodge | ram 1500 pickup 4wd | 1999 | 5.9 | 11 | 15 |
ford | expedition 2wd | 1999 | 4.6 | 11 | 17 |
ford | explorer 4wd | 1999 | 5 | 13 | 17 |
ford | f150 pickup 4wd | 1999 | 4.2 | 14 | 17 |
ford | f150 pickup 4wd | 1999 | 4.2 | 14 | 17 |
ford | f150 pickup 4wd | 2008 | 4.6 | 13 | 17 |
ford | mustang | 2008 | 4 | 17 | 26 |
where cty
and year
are of Numeric
type.
Given the input columns selected as ['year', 'cty']
and the formula 'AND(x > 2000, y >= 15)'
, the result will be:
manufacturer | model | year | displ | cty | hwy |
---|---|---|---|---|---|
audi | a6 quattro | 2008 | 4.2 | 16 | 23 |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 |
dodge | caravan 2wd | 2008 | 4 | 16 | 23 |
ford | mustang | 2008 | 4 | 17 | 26 |
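The same condition expressed directly as a pandas boolean mask (a sketch of the formula's semantics on a few of the example rows, not the platform's formula engine):
```python
import pandas as pd

df = pd.DataFrame({
    "manufacturer": ["audi", "audi", "ford"],
    "year": [1999, 2008, 2008],
    "cty":  [16, 16, 17],
})
# AND(x > 2000, y >= 15) with x mapped to year and y mapped to cty
filtered = df[(df["year"] > 2000) & (df["cty"] >= 15)]
```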
Reshaping
Pivot a column with a specified aggregation
Unique identifier (API): pivot
Please use the newer version pivot_v2
Pivot a column and perform a specified aggregation on one or more numeric columns, grouped by other selected columns
Parameters | Parameter Description | Default | Parameter Type | Parameter Min | Parameter Max | Parameter Step |
---|---|---|---|---|---|---|
aggregation_function | Aggregation function applied to input column(s) | mean | single-select | |||
median_relative_error | Relative error for approximating median. Must be no less than 0.001 for computational efficiency. | 0.001 | range | 0.001 | 1.0 | 0.001 |
count_distinct_maximum_error | Maximum estimation error allowed for approximating the number of distinct values. Must be no less than 0.01 for computational efficiency. | 0.01 | range | 0.01 | 1.0 | 0.01 |
pivot_values | List of values that are converted to columns in the output | [] | string-list |
Examples:
For example, given the following dataset
animal | size | qty |
---|---|---|
cat | small | 95 |
cat | medium | 67 |
cat | large | 39 |
cat | huge | 45 |
cat | small | 50 |
cat | medium | 18 |
cat | large | 37 |
cat | huge | 22 |
cat | small | 12 |
cat | medium | 45 |
dog | large | 84 |
dog | huge | 48 |
dog | small | 71 |
dog | medium | 95 |
dog | large | 87 |
dog | huge | 35 |
dog | small | 7 |
dog | medium | 25 |
dog | large | 42 |
dog | huge | 45 |
with input pivot column size
, values column qty
and aggregation function sum
, grouped by column animal
, the result is
animal | __group_qty_agg_huge | __group_qty_agg_large | __group_qty_agg_medium | __group_qty_agg_small |
---|---|---|---|---|
dog | 128 | 213 | 120 | 78 |
cat | 67 | 76 | 130 | 157 |
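The grouped pivot above maps closely onto pandas pivot_table; output column names will differ from the __group_qty_agg_* convention:
```python
import pandas as pd

df = pd.DataFrame({
    "animal": ["cat", "cat", "dog", "dog"],
    "size":   ["small", "small", "large", "large"],
    "qty":    [95, 50, 84, 87],
})
pivoted = df.pivot_table(index="animal", columns="size",
                         values="qty", aggfunc="sum")
```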
Unpack values to multiple rows
Unique identifier (API): unpack
Please use the newer version unpack_v2
Reshape a dataset by unpacking one or more column group(s) and create a position indication column
Text
Extract capture groups
Unique identifier (API): extract_capture_groups
Please use the newer versions of extract_capture_groups
Extracts one or more regex capture groups into a column group
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
pattern | A regular expression pattern containing one or more capture groups | string |
Examples:
For pattern == '([aeiou].)([aeiou].)', applied over both input columns ([col_1, col_2]), given the following input table:
col_1 | col_2 |
---|---|
New_York | bc |
Lisbon | df |
Tokyo | gh |
Paris | jk |
null | lm |
Munich | np |
The expected output is:
col_1 | col_2 | __group_grp_a_0 | __group_grp_a_1 | __group_grp_b_0 | __group_grp_b_1 |
---|---|---|---|---|---|
New_York | bc | null | null | null | null |
Lisbon | df | null | null | null | null |
Tokyo | gh | null | null | null | null |
Paris | jk | ar | is | null | null |
null | lm | null | null | null | null |
Munich | np | un | ic | null | null |
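In pandas terms, Series.str.extract with the same pattern produces one output column per capture group, with nulls where the pattern does not match (shown here for col_1 only):
```python
import pandas as pd

col_1 = pd.Series(["New_York", "Lisbon", "Tokyo", "Paris", None, "Munich"])
groups = col_1.str.extract(r"([aeiou].)([aeiou].)")  # two columns, NaN if no match
```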
Find all matches of a regular expression
Unique identifier (API): regex_findall
Please use the new action extract_all_regex_matches
Creates an output column group containing all the strings that match the regex pattern.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
pattern | A regular expression pattern for the matching | string |
Examples:
For pattern == '\d\d'
(match non-overlapping double digits) and the following input table:
col_1 |
---|
null |
Test 54 45 Test |
Dummy sentence |
Just some number 00 |
The output is:
col_1 | __group_col_1_re_findall_0 | __group_col_1_re_findall_1 |
---|---|---|
null | null | null |
Test 54 45 Test | 54 | 45 |
Dummy sentence | null | null |
Just some number 00 | 00 | null |
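A pandas sketch of the same idea using str.findall and padding the per-row match lists into columns; the platform's column-group naming is not reproduced:
```python
import pandas as pd

col_1 = pd.Series([None, "Test 54 45 Test", "Dummy sentence", "Just some number 00"])
matches = col_1.str.findall(r"\d\d")   # non-overlapping double digits per row
expanded = matches.apply(pd.Series)    # one column per match, NaN-padded
```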
Split text column on delimiter
Unique identifier (API): split_on_delimiter
Please use either split_on_delimiter_to_multicol or split_to_array.
For a given column, extracts text components between the delimiters into a column group. The size of the column group corresponds to the largest number of components found in the column or by the limit on the number of splits. Trailing null values are created for records containing fewer components.
Parameters | Parameter Description | Default | Parameter Type | Suggested Values |
---|---|---|---|---|
delimiter | Delimiter to be used for splitting the string. | string | [',', ';', '-', ' | |
delimiter_type | Type of delimiter | Verbatim | single-select | |
limit_num_splits | Boolean flag indicating whether you wish to limit the number of splits or not. | False | boolean |
num_of_splits | An integer which controls the number of times the delimiter regex is applied. | 2 | integer |
Examples:
For delimiter ':' with the number of splits limited, and the following input table:
col_1 |
---|
one:two:three |
no_split |
The expected output is:
col_1 | __group_col_1_split_grp_0 | __group_col_1_split_grp_1 |
---|---|---|
one:two:three | one | two:three |
no_split | no_split | null |
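The example corresponds to a string split on ':' with at most one split applied; in pandas (naming here is illustrative, not the platform's __group_* convention):
```python
import pandas as pd

col_1 = pd.Series(["one:two:three", "no_split"])
parts = col_1.str.split(":", n=1, expand=True)  # "one" | "two:three", "no_split" | None
```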
Text/JSON
Split JSON array column into a column group
Unique identifier (API): split_json_array
Please use unpack_v2
Every row (or cell) of the input column is assumed to be a JSON array stored as a string. Extract each element in the array into a separate column component in a column group.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
path | Path to the JSON array component via (Spark-flavor) JSON path. Defaults to $ (the top-level). | $ | string |
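For the default path $ (the whole value), the behaviour can be sketched by parsing each cell as a JSON array and spreading its elements across columns; the variable names below are assumptions:
```python
import json
import pandas as pd

col = pd.Series(['[1, 2, 3]', '["a", "b"]', None])
parsed = col.dropna().apply(json.loads)                 # each cell -> Python list
expanded = parsed.apply(pd.Series).reindex(col.index)   # null cells stay null
```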
Time Series
Resample data into a regular time series
Unique identifier (API): resample_time_series
Please use the newer version resample_time_series_v2
Resample data with a time-index column into a regular time series, with aggregation for all columns except the nominated time index. Null-value interpolation is applied.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
sampling_interval | Length of intervals between successive records | integer | |
time_unit | Time unit for intervals between successive records | day | single-select |
aggregation_function | Aggregation function to be applied | last | single-select |
Examples:
For example, given the following dataset
receipt_time | qty | sku | expected_receipt_time |
---|---|---|---|
null | 30 | A231 | 2012-06-12 02:00:00 |
2012-06-13 00:00:00 | 40 | A351 | 2012-06-15 10:00:00 |
2012-06-13 20:45:20 | 29 | A456 | 2012-06-13 00:30:00 |
2012-06-15 21:40:03 | 50 | A451 | 2012-06-16 01:30:00 |
2012-06-17 05:41:23 | 80 | A223 | 2012-06-16 20:30:00 |
null | 100 | A221 | 2012-06-20 08:30:00 |
2012-06-17 11:10:53 | 70 | A264 | 2012-06-16 20:00:00 |
with time-index column receipt_time
for parameters sampling_interval = 12
, time_unit = ""hour""
and aggregation_function = ""last""
, the result is
receipt_time | qty | sku | expected_receipt_time |
---|---|---|---|
2012-06-13 00:00:00 | 40 | A351 | 2012-06-15 10:00:00 |
2012-06-13 12:00:00 | null | null | null |
2012-06-14 00:00:00 | 29 | A456 | 2012-06-13 00:30:00 |
2012-06-14 12:00:00 | null | null | null |
2012-06-15 00:00:00 | null | null | null |
2012-06-15 12:00:00 | null | null | null |
2012-06-16 00:00:00 | 50 | A451 | 2012-06-16 01:30:00 |
2012-06-16 12:00:00 | null | null | null |
2012-06-17 00:00:00 | null | null | null |
2012-06-17 12:00:00 | 70 | A264 | 2012-06-16 20:00:00 |
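The example roughly corresponds to a pandas resample on the time-index column: rows with a null index are dropped, the last aggregation keeps the final value in each 12-hour bin, and empty bins stay null. Bin boundaries may differ slightly from the platform's:
```python
import pandas as pd

df = pd.DataFrame({
    "receipt_time": pd.to_datetime([None, "2012-06-13 00:00:00",
                                    "2012-06-13 20:45:20", "2012-06-15 21:40:03"]),
    "qty": [30, 40, 29, 50],
    "sku": ["A231", "A351", "A456", "A451"],
})

resampled = (
    df.dropna(subset=["receipt_time"])
      .set_index("receipt_time")
      .sort_index()
      .resample("12h")
      .last()
)
```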
Window Function
Count the number of values over partition
Unique identifier (API): window_aggregate_count
Please use the newer version compute_rolling_window_function
This action counts the number of values in one or more input column(s) partitioned by other selected column(s).
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
value_type | What to count. The default is to count only non-null values. For approximately counting distinct values (see https://en.wikipedia.org/wiki/HyperLogLog), null values are ignored and the allowed estimation error is 0.05 at maximum | non_null | single-select
Examples:
For example, given the following dataset
colour | vehicle | animal | description | qty | time | verified |
---|---|---|---|---|---|---|
blue | car | cat | There is another story. | 0.9371423365841736 | null | true |
blue | van | null | null | 0.504013991387411 | 2019-01-02 00:00:00 | true |
yellow | car | cat | There is another story. | 0.7664983791343694 | 2019-01-03 00:00:00 | false |
yellow | van | null | null | 0.391325429830197 | 2019-01-04 00:00:00 | true |
yellow | car | cat | There is another story. | null | 2019-01-05 00:00:00 | true |
blue | car | null | There is another story. | null | 2019-01-06 00:00:00 | true |
yellow | car | dog | There is another story. | null | null | true |
blue | van | dog | There is another story. | 0.8223637606171873 | null | false |
blue | van | cat | There is another story. | 0.9805952019930625 | 2019-01-09 00:00:00 | false |
yellow | van | dog | null | null | 2019-01-10 00:00:00 | true |
blue | van | null | null | null | 2019-01-11 00:00:00 | true |
blue | car | null | There is another story. | 0.010846767436331839 | 2019-01-12 00:00:00 | false |
blue | car | cat | There is a story. | 0.30042645923211464 | 2019-01-13 00:00:00 | true |
yellow | van | dog | There is a story. | 0.7759473050527014 | 2019-01-14 00:00:00 | true |
yellow | van | dog | There is another story. | 0.5250835307237919 | 2019-01-15 00:00:00 | true |
yellow | car | dog | There is another story. | 0.4708643894438649 | null | true |
blue | van | dog | null | 0.8425979127421865 | null | false |
yellow | van | dog | There is another story. | 0.9901884866319394 | 2019-01-18 00:00:00 | true |
blue | car | dog | There is a story. | 0.6662684130667034 | 2019-01-19 00:00:00 | true |
blue | car | cat | There is another story. | null | 2019-01-20 00:00:00 | true |
with input columns animal
, description
, qty
, time
, verified
partitioned by colour
and vehicle
for option non_null
, the result is
colour | vehicle | animal | description | qty | time | verified | animal_count | description_count | qty_count | time_count | verified_count |
---|---|---|---|---|---|---|---|---|---|---|---|
blue | car | cat | There is another story. | 0.9371423365841736 | null | true | 4 | 6 | 4 | 5 | 6 |
blue | van | null | null | 0.504013991387411 | 2019-01-02 00:00:00 | true | 3 | 2 | 4 | 3 | 5 |
yellow | car | cat | There is another story. | 0.7664983791343694 | 2019-01-03 00:00:00 | false | 4 | 4 | 2 | 2 | 4 |
yellow | van | null | null | 0.391325429830197 | 2019-01-04 00:00:00 | true | 4 | 3 | 4 | 5 | 5 |
yellow | car | cat | There is another story. | null | 2019-01-05 00:00:00 | true | 4 | 4 | 2 | 2 | 4 |
blue | car | null | There is another story. | null | 2019-01-06 00:00:00 | true | 4 | 6 | 4 | 5 | 6 |
yellow | car | dog | There is another story. | null | null | true | 4 | 4 | 2 | 2 | 4 |
blue | van | dog | There is another story. | 0.8223637606171873 | null | false | 3 | 2 | 4 | 3 | 5 |
blue | van | cat | There is another story. | 0.9805952019930625 | 2019-01-09 00:00:00 | false | 3 | 2 | 4 | 3 | 5 |
yellow | van | dog | null | null | 2019-01-10 00:00:00 | true | 4 | 3 | 4 | 5 | 5 |
blue | van | null | null | null | 2019-01-11 00:00:00 | true | 3 | 2 | 4 | 3 | 5 |
blue | car | null | There is another story. | 0.010846767436331839 | 2019-01-12 00:00:00 | false | 4 | 6 | 4 | 5 | 6 |
blue | car | cat | There is a story. | 0.30042645923211464 | 2019-01-13 00:00:00 | true | 4 | 6 | 4 | 5 | 6 |
yellow | van | dog | There is a story. | 0.7759473050527014 | 2019-01-14 00:00:00 | true | 4 | 3 | 4 | 5 | 5 |
yellow | van | dog | There is another story. | 0.5250835307237919 | 2019-01-15 00:00:00 | true | 4 | 3 | 4 | 5 | 5 |
yellow | car | dog | There is another story. | 0.4708643894438649 | null | true | 4 | 4 | 2 | 2 | 4 |
blue | van | dog | null | 0.8425979127421865 | null | false | 3 | 2 | 4 | 3 | 5 |
yellow | van | dog | There is another story. | 0.9901884866319394 | 2019-01-18 00:00:00 | true | 4 | 3 | 4 | 5 | 5 |
blue | car | dog | There is a story. | 0.6662684130667034 | 2019-01-19 00:00:00 | true | 4 | 6 | 4 | 5 | 6 |
blue | car | cat | There is another story. | null | 2019-01-20 00:00:00 | true | 4 | 6 | 4 | 5 | 6 |
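These partition-wide counts behave like a grouped transform in pandas: the non-null count is computed once per (colour, vehicle) partition and broadcast back to every row. A minimal sketch on a few made-up rows:
```python
import pandas as pd

df = pd.DataFrame({
    "colour":  ["blue", "blue", "yellow"],
    "vehicle": ["car", "car", "van"],
    "animal":  ["cat", None, "dog"],
    "qty":     [0.93, None, 0.39],
})
counts = df.groupby(["colour", "vehicle"])[["animal", "qty"]].transform("count")
result = df.join(counts.add_suffix("_count"))
```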
Get DateTime value by partition
Unique identifier (API): window_aggregate_datetime
Please use the newer version compute_rolling_window_function
This action applies a window aggregation function on one or more DateTime column(s) partitioned by other selected column(s).
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
agg_function | Aggregation function applied to input column(s) | single-select |
Examples:
For example, given the following dataset
animal | colour | time_born | time_adopted |
---|---|---|---|
cat | blue | 2019-01-01 00:00:00 | null |
cat | yellow | 2019-01-02 00:00:00 | 2019-01-01 00:00:00 |
dog | yellow | 2019-01-03 00:00:00 | 2019-01-01 00:00:00 |
cat | yellow | 2019-01-04 00:00:00 | null |
cat | yellow | 2019-01-05 00:00:00 | null |
dog | blue | 2019-01-06 00:00:00 | 2019-01-02 00:00:00 |
cat | blue | 2019-01-07 00:00:00 | 2019-01-02 00:00:00 |
cat | blue | null | 2019-01-02 00:00:00 |
dog | yellow | 2019-01-09 00:00:00 | 2019-01-02 00:00:00 |
cat | yellow | 2019-01-10 00:00:00 | 2019-01-02 00:00:00 |
dog | blue | null | null |
cat | blue | 2019-01-12 00:00:00 | 2019-01-03 00:00:00 |
dog | blue | null | 2019-01-03 00:00:00 |
cat | blue | 2019-01-14 00:00:00 | 2019-01-03 00:00:00 |
cat | blue | null | 2019-01-03 00:00:00 |
cat | blue | 2019-01-16 00:00:00 | 2019-01-04 00:00:00 |
cat | blue | 2019-01-17 00:00:00 | 2019-01-04 00:00:00 |
cat | yellow | null | 2019-01-04 00:00:00 |
dog | blue | 2019-01-19 00:00:00 | null |
dog | yellow | 2019-01-20 00:00:00 | 2019-01-04 00:00:00 |
with input columns time_born
and time_adopted
, partitioned by animal
and colour
for option earliest
, the result is
animal | colour | time_born | time_adopted | agg_birth | agg_adoption |
---|---|---|---|---|---|
cat | blue | 2019-01-01 00:00:00 | null | 2019-01-01 00:00:00 | 2019-01-02 00:00:00 |
cat | yellow | 2019-01-02 00:00:00 | 2019-01-01 00:00:00 | 2019-01-02 00:00:00 | 2019-01-01 00:00:00 |
dog | yellow | 2019-01-03 00:00:00 | 2019-01-01 00:00:00 | 2019-01-03 00:00:00 | 2019-01-01 00:00:00 |
cat | yellow | 2019-01-04 00:00:00 | null | 2019-01-02 00:00:00 | 2019-01-01 00:00:00 |
cat | yellow | 2019-01-05 00:00:00 | null | 2019-01-02 00:00:00 | 2019-01-01 00:00:00 |
dog | blue | 2019-01-06 00:00:00 | 2019-01-02 00:00:00 | 2019-01-06 00:00:00 | 2019-01-02 00:00:00 |
cat | blue | 2019-01-07 00:00:00 | 2019-01-02 00:00:00 | 2019-01-01 00:00:00 | 2019-01-02 00:00:00 |
cat | blue | null | 2019-01-02 00:00:00 | 2019-01-01 00:00:00 | 2019-01-02 00:00:00 |
dog | yellow | 2019-01-09 00:00:00 | 2019-01-02 00:00:00 | 2019-01-03 00:00:00 | 2019-01-01 00:00:00 |
cat | yellow | 2019-01-10 00:00:00 | 2019-01-02 00:00:00 | 2019-01-02 00:00:00 | 2019-01-01 00:00:00 |
dog | blue | null | null | 2019-01-06 00:00:00 | 2019-01-02 00:00:00 |
cat | blue | 2019-01-12 00:00:00 | 2019-01-03 00:00:00 | 2019-01-01 00:00:00 | 2019-01-02 00:00:00 |
dog | blue | null | 2019-01-03 00:00:00 | 2019-01-06 00:00:00 | 2019-01-02 00:00:00 |
cat | blue | 2019-01-14 00:00:00 | 2019-01-03 00:00:00 | 2019-01-01 00:00:00 | 2019-01-02 00:00:00 |
cat | blue | null | 2019-01-03 00:00:00 | 2019-01-01 00:00:00 | 2019-01-02 00:00:00 |
cat | blue | 2019-01-16 00:00:00 | 2019-01-04 00:00:00 | 2019-01-01 00:00:00 | 2019-01-02 00:00:00 |
cat | blue | 2019-01-17 00:00:00 | 2019-01-04 00:00:00 | 2019-01-01 00:00:00 | 2019-01-02 00:00:00 |
cat | yellow | null | 2019-01-04 00:00:00 | 2019-01-02 00:00:00 | 2019-01-01 00:00:00 |
dog | blue | 2019-01-19 00:00:00 | null | 2019-01-06 00:00:00 | 2019-01-02 00:00:00 |
dog | yellow | 2019-01-20 00:00:00 | 2019-01-04 00:00:00 | 2019-01-03 00:00:00 | 2019-01-01 00:00:00 |
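The earliest option is a grouped minimum over the DateTime columns, broadcast to every row of the partition; a minimal pandas sketch (output column name assumed):
```python
import pandas as pd

df = pd.DataFrame({
    "animal": ["cat", "cat", "dog"],
    "colour": ["blue", "blue", "yellow"],
    "time_born": pd.to_datetime(["2019-01-07", "2019-01-01", "2019-01-03"]),
})
df["agg_birth"] = df.groupby(["animal", "colour"])["time_born"].transform("min")
```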
Make a boolean indicator over partition
Unique identifier (API): window_aggregate_indicate_boolean
Please use the newer version compute_rolling_window_function
This action applies a window aggregation function as an overall indicator on one or more boolean columns partitioned by other selected column(s).
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
indicator_function | Indicator function applied to input column(s) | single-select |
Examples:
For example, given the following dataset
animal | colour | adopted | registered |
---|---|---|---|
cat | blue | true | true |
dog | yellow | true | false |
cat | yellow | false | true |
cat | blue | false | true |
dog | blue | false | true |
dog | blue | false | false |
cat | yellow | true | true |
dog | yellow | true | true |
dog | yellow | true | true |
dog | blue | true | false |
dog | yellow | false | false |
dog | yellow | false | false |
dog | blue | false | true |
cat | yellow | true | false |
cat | yellow | true | true |
dog | blue | false | false |
dog | yellow | true | true |
dog | blue | false | false |
cat | yellow | false | false |
dog | blue | false | false |
with input columns adopted
and registered
, partitioned by animal
and colour
for option any
, the result is
animal | colour | adopted | registered | agg_adopted | agg_registered |
---|---|---|---|---|---|
cat | blue | true | true | true | true |
dog | yellow | true | false | true | true |
cat | yellow | false | true | true | true |
cat | blue | false | true | true | true |
dog | blue | false | true | true | true |
dog | blue | false | false | true | true |
cat | yellow | true | true | true | true |
dog | yellow | true | true | true | true |
dog | yellow | true | true | true | true |
dog | blue | true | false | true | true |
dog | yellow | false | false | true | true |
dog | yellow | false | false | true | true |
dog | blue | false | true | true | true |
cat | yellow | true | false | true | true |
cat | yellow | true | true | true | true |
dog | blue | false | false | true | true |
dog | yellow | true | true | true | true |
dog | blue | false | false | true | true |
cat | yellow | false | false | true | true |
dog | blue | false | false | true | true |
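Likewise, the any indicator is a grouped reduction over the boolean columns, broadcast back to each row; a sketch:
```python
import pandas as pd

df = pd.DataFrame({
    "animal":  ["cat", "cat", "dog"],
    "colour":  ["blue", "blue", "blue"],
    "adopted": [True, False, False],
})
df["agg_adopted"] = df.groupby(["animal", "colour"])["adopted"].transform("any")
```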
Aggregate numeric column(s) over partition
Unique identifier (API): window_aggregate_numeric
Please use the newer version compute_rolling_window_function
This action applies a window aggregation function on one or more numeric columns partitioned by other selected column(s).
Parameters | Parameter Description | Default | Parameter Type | Parameter Min | Parameter Max | Parameter Step |
---|---|---|---|---|---|---|
agg_function | Aggregation function applied to input column(s) | single-select | ||||
relative_error | Relative error for approximating median or a specific quantile. Must be no less than 0.001 for computational efficiency. | range | 0.001 | 1.0 | 0.001 | |
quantile_to_compute | Specify which quantile to compute for the quantile aggregation function. Must be a number between 0 and 1. | range | 0.0 | 1.0 | 0.05 |
Examples:
For example, given the following dataset
colour | vehicle | qty_1 | qty_2 |
---|---|---|---|
yellow | van | 0.11247070608569998 | 5 |
yellow | van | 0.28328160937677216 | -15 |
blue | car | 0.517063017219678 | 19 |
blue | van | 0.5301237478473354 | -9 |
blue | van | 0.3620932896270347 | 10 |
yellow | car | 0.4225691387726025 | 15 |
yellow | car | 0.290901753183477 | -19 |
blue | van | 0.8861529756930526 | 19 |
yellow | van | 0.0663419211386469 | 12 |
blue | car | 0.5966970598200722 | -4 |
blue | car | 0.420710037504774 | -8 |
yellow | van | 0.6275038840320153 | -14 |
yellow | car | 0.9498072805894898 | -16 |
blue | car | 0.9065231957598833 | -3 |
yellow | van | 0.6876214007342292 | -14 |
blue | van | 0.12799068964691773 | -16 |
yellow | car | 0.8456065368763648 | -20 |
yellow | van | 0.7136156920988985 | 10 |
yellow | van | 0.31253912094683933 | -16 |
yellow | van | 0.8326289665837018 | -20 |
with input columns qty_1
and qty_2
and aggregation function mean
partitioned by colour
and vehicle
, the result is
colour | vehicle | qty_1 | qty_2 | aggregation_1 | aggregation_2 |
---|---|---|---|---|---|
yellow | van | 0.11247070608569998 | 5 | 0.4545004126246004 | -6.5 |
yellow | van | 0.28328160937677216 | -15 | 0.4545004126246004 | -6.5 |
blue | car | 0.517063017219678 | 19 | 0.6102483275761019 | 1.0 |
blue | van | 0.5301237478473354 | -9 | 0.4765901757035851 | 1.0 |
blue | van | 0.3620932896270347 | 10 | 0.4765901757035851 | 1.0 |
yellow | car | 0.4225691387726025 | 15 | 0.6272211773554836 | -10.0 |
yellow | car | 0.290901753183477 | -19 | 0.6272211773554836 | -10.0 |
blue | van | 0.8861529756930526 | 19 | 0.4765901757035851 | 1.0 |
yellow | van | 0.0663419211386469 | 12 | 0.4545004126246004 | -6.5 |
blue | car | 0.5966970598200722 | -4 | 0.6102483275761019 | 1.0 |
blue | car | 0.420710037504774 | -8 | 0.6102483275761019 | 1.0 |
yellow | van | 0.6275038840320153 | -14 | 0.4545004126246004 | -6.5 |
yellow | car | 0.9498072805894898 | -16 | 0.6272211773554836 | -10.0 |
blue | car | 0.9065231957598833 | -3 | 0.6102483275761019 | 1.0 |
yellow | van | 0.6876214007342292 | -14 | 0.4545004126246004 | -6.5 |
blue | van | 0.12799068964691773 | -16 | 0.4765901757035851 | 1.0 |
yellow | car | 0.8456065368763648 | -20 | 0.6272211773554836 | -10.0 |
yellow | van | 0.7136156920988985 | 10 | 0.4545004126246004 | -6.5 |
yellow | van | 0.31253912094683933 | -16 | 0.4545004126246004 | -6.5 |
yellow | van | 0.8326289665837018 | -20 | 0.4545004126246004 | -6.5 |
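And the numeric case with aggregation function mean follows the same grouped-transform pattern (column names taken from the example, the rest assumed):
```python
import pandas as pd

df = pd.DataFrame({
    "colour":  ["yellow", "yellow", "blue"],
    "vehicle": ["van", "van", "car"],
    "qty_1":   [0.11, 0.28, 0.52],
})
df["aggregation_1"] = df.groupby(["colour", "vehicle"])["qty_1"].transform("mean")
```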
Window Function/Rolling
Make a rolling boolean indicator with optional partitioning
Unique identifier (API): window_rolling_boolean
Please use the newer version compute_rolling_window_function
Compute an aggregation as a rolling indicator on one or more boolean columns, optionally partitioned by other selected column(s).
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
indicator_function | Indicator function applied to input column(s) | single-select | |
window_start | Whether to set window start unbounded or a specified position relative to current row, with unbounded by default | unbounded | single-select |
start_position | Position from which window starts, relative to the current row (0). Negative and positive numbers indicate the number of rows that precede and follow the current row respectively. | integer | |
window_end | Whether to set window end unbounded or a specified position relative to current row, with unbounded by default | unbounded | single-select |
end_position | Position from which window ends, relative to the current row (0). Negative and positive numbers indicate the number of rows that precede and follow the current row respectively. | integer |
Examples:
For example, given the following dataset
animal | colour | date | cured | adopted |
---|---|---|---|---|
cat | blue | 2019-01-01 00:00:00 | false | true |
cat | blue | 2019-01-02 00:00:00 | true | false |
dog | yellow | 2019-01-03 00:00:00 | true | true |
cat | yellow | 2019-01-04 00:00:00 | true | false |
cat | yellow | 2019-01-05 00:00:00 | true | false |
cat | blue | 2019-01-06 00:00:00 | false | false |
dog | blue | 2019-01-07 00:00:00 | true | true |
dog | blue | 2019-01-08 00:00:00 | false | false |
dog | yellow | 2019-01-09 00:00:00 | false | false |
dog | yellow | 2019-01-10 00:00:00 | false | false |
cat | blue | 2019-01-11 00:00:00 | false | false |
cat | yellow | 2019-01-12 00:00:00 | false | true |
dog | yellow | 2019-01-13 00:00:00 | true | false |
dog | blue | 2019-01-14 00:00:00 | true | false |
cat | blue | 2019-01-15 00:00:00 | false | false |
dog | yellow | 2019-01-16 00:00:00 | false | false |
dog | yellow | 2019-01-17 00:00:00 | false | true |
cat | blue | 2019-01-18 00:00:00 | false | false |
cat | yellow | 2019-01-19 00:00:00 | false | false |
dog | yellow | 2019-01-20 00:00:00 | false | false |
with input columns cured
and adopted
, partitioned by animal
and colour
and ordered by date
for indicator function all
, rolling window with unbounded start and current row as end, the result is
animal | colour | date | cured | adopted | roll_cured | roll_adopted |
---|---|---|---|---|---|---|
dog | blue | 2019-01-07 00:00:00 | true | true | true | true |
dog | blue | 2019-01-08 00:00:00 | false | false | false | false |
dog | blue | 2019-01-14 00:00:00 | true | false | false | false |
cat | yellow | 2019-01-04 00:00:00 | true | false | true | false |
cat | yellow | 2019-01-05 00:00:00 | true | false | true | false |
cat | yellow | 2019-01-12 00:00:00 | false | true | false | false |
cat | yellow | 2019-01-19 00:00:00 | false | false | false | false |
cat | blue | 2019-01-01 00:00:00 | false | true | false | true |
cat | blue | 2019-01-02 00:00:00 | true | false | false | false |
cat | blue | 2019-01-06 00:00:00 | false | false | false | false |
cat | blue | 2019-01-11 00:00:00 | false | false | false | false |
cat | blue | 2019-01-15 00:00:00 | false | false | false | false |
cat | blue | 2019-01-18 00:00:00 | false | false | false | false |
dog | yellow | 2019-01-03 00:00:00 | true | true | true | true |
dog | yellow | 2019-01-09 00:00:00 | false | false | false | false |
dog | yellow | 2019-01-10 00:00:00 | false | false | false | false |
dog | yellow | 2019-01-13 00:00:00 | true | false | false | false |
dog | yellow | 2019-01-16 00:00:00 | false | false | false | false |
dog | yellow | 2019-01-17 00:00:00 | false | true | false | false |
dog | yellow | 2019-01-20 00:00:00 | false | false | false | false |
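With an unbounded start and the current row as the end, the all indicator is an expanding window within each partition ordered by date: once a False appears, every later row in that partition is False. A pandas sketch using a grouped cumulative minimum:
```python
import pandas as pd

df = pd.DataFrame({
    "animal": ["cat", "cat", "cat"],
    "colour": ["blue", "blue", "blue"],
    "date":   pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-06"]),
    "cured":  [False, True, False],
}).sort_values(["animal", "colour", "date"])

df["roll_cured"] = (
    df["cured"].astype(int)
      .groupby([df["animal"], df["colour"]])
      .cummin()
      .astype(bool)
)
```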
Get a DateTime value over rolling windows, with optional partitioning
Unique identifier (API): window_rolling_datetime
Please use the newer version compute_rolling_window_function
Get an aggregation function on one or more DateTime columns over rolling windows, optionally partitioned by other selected column(s).
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
aggregation_function | Aggregation function applied to input column(s) | single-select | |
window_start | Whether to set window start unbounded or a specified position relative to current row, with unbounded by default | unbounded | single-select |
start_position | Position from which window starts, relative to the current row (0). Negative and positive numbers indicate the number of rows that precede and follow the current row respectively. | integer | |
window_end | Whether to set window end unbounded or a specified position relative to current row, with unbounded by default | unbounded | single-select |
end_position | Position from which window ends, relative to the current row (0). Negative and positive numbers indicate the number of rows that precede and follow the current row respectively. | integer |
Examples:
For example, given the following dataset
animal | time_born | time_adopted |
---|---|---|
cat | 2019-01-01 00:00:00 | 2019-02-01 00:00:00 |
cat | 2019-01-02 00:00:00 | 2019-03-01 00:00:00 |
cat | 2019-01-03 00:00:00 | 2019-03-01 00:00:00 |
cat | 2019-01-04 00:00:00 | 2020-04-01 00:00:00 |
cat | 2019-01-05 00:00:00 | 2019-05-01 00:00:00 |
dog | 2019-01-06 00:00:00 | 2020-06-01 00:00:00 |
dog | 2019-01-07 00:00:00 | 2019-05-01 00:00:00 |
dog | 2019-01-08 00:00:00 | 2020-07-01 00:00:00 |
dog | 2019-01-09 00:00:00 | 2020-06-01 00:00:00 |
dog | 2019-01-10 00:00:00 | 2019-03-01 00:00:00 |
the result is
animal | time_born | time_adopted | agg_time_adopted |
---|---|---|---|
dog | 2019-01-06 00:00:00 | 2020-06-01 00:00:00 | 2020-06-01 00:00:00 |
dog | 2019-01-07 00:00:00 | 2019-05-01 00:00:00 | 2020-06-01 00:00:00 |
dog | 2019-01-08 00:00:00 | 2020-07-01 00:00:00 | 2020-07-01 00:00:00 |
dog | 2019-01-09 00:00:00 | 2020-06-01 00:00:00 | 2020-07-01 00:00:00 |
dog | 2019-01-10 00:00:00 | 2019-03-01 00:00:00 | 2020-07-01 00:00:00 |
cat | 2019-01-01 00:00:00 | 2019-02-01 00:00:00 | 2019-02-01 00:00:00 |
cat | 2019-01-02 00:00:00 | 2019-03-01 00:00:00 | 2019-03-01 00:00:00 |
cat | 2019-01-03 00:00:00 | 2019-03-01 00:00:00 | 2019-03-01 00:00:00 |
cat | 2019-01-04 00:00:00 | 2020-04-01 00:00:00 | 2020-04-01 00:00:00 |
cat | 2019-01-05 00:00:00 | 2019-05-01 00:00:00 | 2020-04-01 00:00:00 |
with input column time_adopted
, partitioned by animal
and ordered by time_born
, for aggregation function latest
. The rolling window starts with unbounded
and ends with the current row. For example, the window of the 4th row in the input dataset
is all the rows above it down to and including the current row:
animal | time_born | time_adopted |
---|---|---|
cat | 2019-01-01 00:00:00 | 2019-02-01 00:00:00 |
cat | 2019-01-02 00:00:00 | 2019-03-01 00:00:00 |
cat | 2019-01-03 00:00:00 | 2019-03-01 00:00:00 |
cat | 2019-01-04 00:00:00 | 2020-04-01 00:00:00 |
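For the latest aggregation over the same unbounded-start, current-row-end window, a grouped cumulative maximum gives an equivalent sketch (a cat-only subset of the example):
```python
import pandas as pd

df = pd.DataFrame({
    "animal": ["cat", "cat", "cat"],
    "time_born":    pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03"]),
    "time_adopted": pd.to_datetime(["2019-02-01", "2019-03-01", "2019-03-01"]),
}).sort_values(["animal", "time_born"])

df["agg_time_adopted"] = df.groupby("animal")["time_adopted"].cummax()
```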
Compute a rolling numeric aggregation, with optional partitioning
Unique identifier (API): window_rolling_numeric
Please use the newer version compute_rolling_window_function
Compute an aggregation function on one or more numeric columns over rolling/expanding/shrinking windows, optionally partitioned by other selected column(s). Boolean columns are accepted as well and are converted to numeric first.
Parameters | Parameter Description | Default | Parameter Type | Parameter Min | Parameter Max | Parameter Step |
---|---|---|---|---|---|---|
agg_function | Aggregation function applied to input column(s) | single-select | ||||
relative_error | Relative error for approximating median or a specific quantile. Must be no less than 0.001 for computational efficiency. | range | 0.001 | 1.0 | 0.001 | |
quantile_to_compute | Specify which quantile to compute for the quantile rolling function. Must be a number between 0 and 1. | range | 0.0 | 1.0 | 0.05 | |
window_start | Whether to set window start unbounded or a specified position relative to current row, with unbounded by default | unbounded | single-select | |||
start_position | Position from which window starts, relative to the current row (0). Negative and positive numbers indicate the number of rows that precede and follow the current row respectively. | integer | ||||
window_end | Whether to set window end unbounded or a specified position relative to current row, with unbounded by default | unbounded | single-select | |||
end_position | Position from which window ends, relative to the current row (0). Negative and positive numbers indicate the number of rows that precede and follow the current row respectively. | integer |
Examples:
For example, given the following dataset
colour | animal | birth | qty |
---|---|---|---|
blue | cat | 2019-01-01 00:00:00 | 0.9459212999951968 |
yellow | dog | 2019-01-02 00:00:00 | 0.923807614788432 |
yellow | dog | 2019-01-03 00:00:00 | 0.8160265494527545 |
blue | cat | 2019-01-04 00:00:00 | 0.21414080632280563 |
yellow | dog | 2019-01-05 00:00:00 | 0.31936422789647023 |
blue | dog | 2019-01-06 00:00:00 | 0.8090282829464444 |
yellow | cat | 2019-01-07 00:00:00 | 0.23359299737078676 |
yellow | cat | 2019-01-08 00:00:00 | 0.9244320791985904 |
blue | cat | 2019-01-09 00:00:00 | 0.6161751606853043 |
blue | cat | 2019-01-10 00:00:00 | 0.26992044210702926 |
yellow | dog | 2019-01-11 00:00:00 | 0.40755657119101496 |
yellow | cat | 2019-01-12 00:00:00 | 0.11657357211440877 |
yellow | cat | 2019-01-13 00:00:00 | 0.8757826191122454 |
yellow | cat | 2019-01-14 00:00:00 | 0.3989306451516289 |
blue | dog | 2019-01-15 00:00:00 | 0.09120177599596369 |
blue | dog | 2019-01-16 00:00:00 | 0.8228651830898742 |
yellow | cat | 2019-01-17 00:00:00 | 0.6855467490320383 |
blue | cat | 2019-01-18 00:00:00 | 0.27373326299975465 |
yellow | cat | 2019-01-19 00:00:00 | 0.10849671654816584 |
blue | cat | 2019-01-20 00:00:00 | 0.27289802805739094 |
the result is
colour | animal | birth | qty | roll_qty |
---|---|---|---|---|
yellow | dog | 2019-01-02 00:00:00 | 0.923807614788432 | 0.923807614788432 |
yellow | dog | 2019-01-03 00:00:00 | 0.8160265494527545 | 1.7398341642411865 |
yellow | dog | 2019-01-05 00:00:00 | 0.31936422789647023 | 2.059198392137657 |
yellow | dog | 2019-01-11 00:00:00 | 0.40755657119101496 | 2.466754963328672 |
blue | cat | 2019-01-01 00:00:00 | 0.9459212999951968 | 0.9459212999951968 |
blue | cat | 2019-01-04 00:00:00 | 0.21414080632280563 | 1.1600621063180023 |
blue | cat | 2019-01-09 00:00:00 | 0.6161751606853043 | 1.7762372670033066 |
blue | cat | 2019-01-10 00:00:00 | 0.26992044210702926 | 2.046157709110336 |
blue | cat | 2019-01-18 00:00:00 | 0.27373326299975465 | 2.31989097211009 |
blue | cat | 2019-01-20 00:00:00 | 0.27289802805739094 | 2.592789000167481 |
blue | dog | 2019-01-06 00:00:00 | 0.8090282829464444 | 0.8090282829464444 |
blue | dog | 2019-01-15 00:00:00 | 0.09120177599596369 | 0.900230058942408 |
blue | dog | 2019-01-16 00:00:00 | 0.8228651830898742 | 1.7230952420322823 |
yellow | cat | 2019-01-07 00:00:00 | 0.23359299737078676 | 0.23359299737078676 |
yellow | cat | 2019-01-08 00:00:00 | 0.9244320791985904 | 1.1580250765693771 |
yellow | cat | 2019-01-12 00:00:00 | 0.11657357211440877 | 1.274598648683786 |
yellow | cat | 2019-01-13 00:00:00 | 0.8757826191122454 | 2.1503812677960314 |
yellow | cat | 2019-01-14 00:00:00 | 0.3989306451516289 | 2.5493119129476605 |
yellow | cat | 2019-01-17 00:00:00 | 0.6855467490320383 | 3.2348586619796986 |
yellow | cat | 2019-01-19 00:00:00 | 0.10849671654816584 | 3.3433553785278645 |
with input column qty
, partitioned by colour
and animal
and ordered by birth
for aggregation function sum
, rolling window with unbounded start and current row as end.
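The running sum in this example is then a grouped cumulative sum after ordering by the birth column; a minimal sketch on a yellow/dog subset:
```python
import pandas as pd

df = pd.DataFrame({
    "colour": ["yellow", "yellow", "yellow"],
    "animal": ["dog", "dog", "dog"],
    "birth":  pd.to_datetime(["2019-01-02", "2019-01-03", "2019-01-05"]),
    "qty":    [0.9238, 0.8160, 0.3194],
}).sort_values(["colour", "animal", "birth"])

df["roll_qty"] = df.groupby(["colour", "animal"])["qty"].cumsum()
```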