
Action Catalogue

Aggregation

Aggregate columns within groups

Unique identifier (API): aggregate

Group by the selected columns and compute aggregations within each group

Parameters Parameter Description Default Parameter Type
merge_back_to_original Whether to merge the aggregation(s) back to the original data False boolean
Examples:

Given the dataset

grp val val1 bool1 datetimes
1 0.45 1 True 2023-11-17T13:08:29
1 0.45 2 True 2023-11-17T13:08:29
3 0.45 3 True 2023-11-17T13:08:29
3 0.45 3.5 True 2023-08-03T08:48:29
3 1 3 False 2022-05-29T03:21:59
3 1 4 False 2022-05-29T03:21:59
2 10 5 False 2022-05-29T03:21:59
2 100 6 False 2022-05-29T03:21:59

If we group by grp and perform the following aggregations within each group

  • mean(val)
  • sum(val1)
  • all(bool1)
  • any(bool1)
  • approx_count_distinct(val, 0)
  • approx_count_distinct(bool1, 0)
  • earliest(datetimes, 0)
  • latest(datetimes, 0)
  • first(val, 0)
  • last(val, 0)
  • approx_median(val)
  • approx_quantile(val)
  • count(bool1, "non_null")
  • count(bool1, "null")

we get

grp avg_val sum_val1 all_bool1 any_bool1 approx_count_distinct_val approx_count_distinct_bool1 earliest_datetime latest_datetime first_val last_val median_val quantile_val count_bool1 count_nulls_bool1
Text Numeric Numeric Boolean Boolean Numeric Numeric DateTime DateTime Numeric Numeric Numeric Numeric Numeric Numeric
1 0.45 3.0 True True 1 1 2023-11-17 13:08:29 2023-11-17 13:08:29 0.45 0.45 0.45 0.45 2 0
2 55 11 False False 2 1 2022-05-29 03:21:59 2022-05-29 03:21:59 10 100 10.0 10 2 0
3 0.725 13.5 False True 2 2 2022-05-29 03:21:59 2023-08-03 08:48:29 0.45 1 0.45 0.45 4 0

The user may optionally specify merge_back_to_original = True, in which case the group-by results will be joined with the original table using the group-by columns as the join keys. Note: with merge_back_to_original = True, the resulting table has the same number of rows as the original table.
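
For illustration, a minimal pandas sketch of the grouped aggregation and the merge_back_to_original behaviour (pandas is only an assumed stand-in for the actual engine here, and the approx_* aggregations are replaced by exact counterparts):

import pandas as pd

df = pd.DataFrame({
    "grp":   [1, 1, 3, 3, 3, 3, 2, 2],
    "val":   [0.45, 0.45, 0.45, 0.45, 1, 1, 10, 100],
    "val1":  [1, 2, 3, 3.5, 3, 4, 5, 6],
    "bool1": [True, True, True, True, False, False, False, False],
})

# Group by grp and compute a few of the aggregations listed above
agg = (df.groupby("grp")
         .agg(avg_val=("val", "mean"),
              sum_val1=("val1", "sum"),
              all_bool1=("bool1", "all"),
              any_bool1=("bool1", "any"))
         .reset_index())

# merge_back_to_original = True: join the summary back on the group-by key,
# keeping the same number of rows as the original table
merged = df.merge(agg, on="grp", how="left")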

Casting

Cast columns to categorical type

Unique identifier (API): cast_categorical

Cast the selected columns to categorical type.

Parameters Parameter Description Default Parameter Type
categories_specified If set to 'true', categories must be specified explicitly. If set to 'false', every unique value in the column will become a category False boolean
categories Categories in the column string-list

Cast categorical columns to text type

Unique identifier (API): cast_categorical_to_text

Cast selected categorical columns to text type

Cast columns to DateTime type

Unique identifier (API): cast_datetime

Cast the selected columns to DateTime type with specified format string

Parameters Parameter Description Default Parameter Type Suggested Values
datetime_format DateTime format string in the strftime format. %Y-%m-%dT%H:%M:%S.%fZ string ['%Y-%m-%dT%H:%M:%S.%f%z', '%Y-%m-%dT%H:%M:%S.%fZ', '%b %d %Y %H:%M:%S.%f', '%Y-%m-%dT%H:%M:%S.%f', '%m-%d-%Y %H:%M:%S.%f', '%d-%m-%Y %H:%M:%S.%f', '%d %b %Y %H:%M:%S.%f', '%Y/%m/%d %H:%M:%S.%f', '%Y-%m-%d %H:%M:%S.%f', '%d/%m/%Y %H:%M:%S.%f', '%Y-%m-%dT%H:%M:%S%z', '%Y-%m-%dT%H:%M:%SZ', '%d%m%Y %H:%M:%S.%f', '%m%d%Y %H:%M:%S.%f', '%Y%m%d %H:%M:%S.%f', '%m-%d-%Y %H:%M:%S', '%Y/%m/%d %H:%M:%S', '%Y-%m-%dT%H:%M:%S', '%b-%d-%Y %H:%M:%S', '%d-%m-%Y %H:%M:%S', '%d-%b-%Y %H:%M:%S', '%d/%m/%Y %H:%M:%S', '%Y-%m-%d %H:%M:%S', '%m/%d/%Y %I:%M:%S %p', '%Y%m%d %H:%M:%S', '%Y-%m-%dT%H:%MZ', '%m%d%Y %H:%M:%S', '%d%m%Y %H:%M:%S', '%m-%d-%Y', '%Y-%m-%d', '%d-%m-%Y', '%m %d %Y', '%d-%b-%Y', '%Y/%m/%d', '%d %m %Y', '%m/%d/%Y', '%Y %m %d', '%d/%m/%Y', '%d %b %Y', '%b-%d-%Y', '%b %d %Y', '%m%d%Y', '%Y%m%d', '%d%m%Y', '%m%Y', '%m %Y', '%m-%Y', '%m/%Y']
timezone Specifies the timezone the DateTimes belong to UTC string ['Africa/Abidjan', 'Africa/Accra', 'Africa/Addis_Ababa', 'Africa/Algiers', 'Africa/Asmara', 'Africa/Bamako', 'Africa/Bangui', 'Africa/Banjul', 'Africa/Bissau', 'Africa/Blantyre', 'Africa/Brazzaville', 'Africa/Bujumbura', 'Africa/Cairo', 'Africa/Casablanca', 'Africa/Ceuta', 'Africa/Conakry', 'Africa/Dakar', 'Africa/Dar_es_Salaam', 'Africa/Djibouti', 'Africa/Douala', 'Africa/El_Aaiun', 'Africa/Freetown', 'Africa/Gaborone', 'Africa/Harare', 'Africa/Johannesburg', 'Africa/Juba', 'Africa/Kampala', 'Africa/Khartoum', 'Africa/Kigali', 'Africa/Kinshasa', 'Africa/Lagos', 'Africa/Libreville', 'Africa/Lome', 'Africa/Luanda', 'Africa/Lubumbashi', 'Africa/Lusaka', 'Africa/Malabo', 'Africa/Maputo', 'Africa/Maseru', 'Africa/Mbabane', 'Africa/Mogadishu', 'Africa/Monrovia', 'Africa/Nairobi', 'Africa/Ndjamena', 'Africa/Niamey', 'Africa/Nouakchott', 'Africa/Ouagadougou', 'Africa/Porto-Novo', 'Africa/Sao_Tome', 'Africa/Tripoli', 'Africa/Tunis', 'Africa/Windhoek', 'America/Adak', 'America/Anchorage', 'America/Anguilla', 'America/Antigua', 'America/Araguaina', 'America/Argentina/Buenos_Aires', 'America/Argentina/Catamarca', 'America/Argentina/Cordoba', 'America/Argentina/Jujuy', 'America/Argentina/La_Rioja', 'America/Argentina/Mendoza', 'America/Argentina/Rio_Gallegos', 'America/Argentina/Salta', 'America/Argentina/San_Juan', 'America/Argentina/San_Luis', 'America/Argentina/Tucuman', 'America/Argentina/Ushuaia', 'America/Aruba', 'America/Asuncion', 'America/Atikokan', 'America/Bahia', 'America/Bahia_Banderas', 'America/Barbados', 'America/Belem', 'America/Belize', 'America/Blanc-Sablon', 'America/Boa_Vista', 'America/Bogota', 'America/Boise', 'America/Cambridge_Bay', 'America/Campo_Grande', 'America/Cancun', 'America/Caracas', 'America/Cayenne', 'America/Cayman', 'America/Chicago', 'America/Chihuahua', 'America/Costa_Rica', 'America/Creston', 'America/Cuiaba', 'America/Curacao', 'America/Danmarkshavn', 'America/Dawson', 'America/Dawson_Creek', 'America/Denver', 'America/Detroit', 'America/Dominica', 'America/Edmonton', 'America/Eirunepe', 'America/El_Salvador', 'America/Fort_Nelson', 'America/Fortaleza', 'America/Glace_Bay', 'America/Goose_Bay', 'America/Grand_Turk', 'America/Grenada', 'America/Guadeloupe', 'America/Guatemala', 'America/Guayaquil', 'America/Guyana', 'America/Halifax', 'America/Havana', 'America/Hermosillo', 'America/Indiana/Indianapolis', 'America/Indiana/Knox', 'America/Indiana/Marengo', 'America/Indiana/Petersburg', 'America/Indiana/Tell_City', 'America/Indiana/Vevay', 'America/Indiana/Vincennes', 'America/Indiana/Winamac', 'America/Inuvik', 'America/Iqaluit', 'America/Jamaica', 'America/Juneau', 'America/Kentucky/Louisville', 'America/Kentucky/Monticello', 'America/Kralendijk', 'America/La_Paz', 'America/Lima', 'America/Los_Angeles', 'America/Lower_Princes', 'America/Maceio', 'America/Managua', 'America/Manaus', 'America/Marigot', 'America/Martinique', 'America/Matamoros', 'America/Mazatlan', 'America/Menominee', 'America/Merida', 'America/Metlakatla', 'America/Mexico_City', 'America/Miquelon', 'America/Moncton', 'America/Monterrey', 'America/Montevideo', 'America/Montserrat', 'America/Nassau', 'America/New_York', 'America/Nipigon', 'America/Nome', 'America/Noronha', 'America/North_Dakota/Beulah', 'America/North_Dakota/Center', 'America/North_Dakota/New_Salem', 'America/Nuuk', 'America/Ojinaga', 'America/Panama', 'America/Pangnirtung', 'America/Paramaribo', 'America/Phoenix', 'America/Port-au-Prince', 'America/Port_of_Spain', 
'America/Porto_Velho', 'America/Puerto_Rico', 'America/Punta_Arenas', 'America/Rainy_River', 'America/Rankin_Inlet', 'America/Recife', 'America/Regina', 'America/Resolute', 'America/Rio_Branco', 'America/Santarem', 'America/Santiago', 'America/Santo_Domingo', 'America/Sao_Paulo', 'America/Scoresbysund', 'America/Sitka', 'America/St_Barthelemy', 'America/St_Johns', 'America/St_Kitts', 'America/St_Lucia', 'America/St_Thomas', 'America/St_Vincent', 'America/Swift_Current', 'America/Tegucigalpa', 'America/Thule', 'America/Thunder_Bay', 'America/Tijuana', 'America/Toronto', 'America/Tortola', 'America/Vancouver', 'America/Whitehorse', 'America/Winnipeg', 'America/Yakutat', 'America/Yellowknife', 'Antarctica/Casey', 'Antarctica/Davis', 'Antarctica/DumontDUrville', 'Antarctica/Macquarie', 'Antarctica/Mawson', 'Antarctica/McMurdo', 'Antarctica/Palmer', 'Antarctica/Rothera', 'Antarctica/Syowa', 'Antarctica/Troll', 'Antarctica/Vostok', 'Arctic/Longyearbyen', 'Asia/Aden', 'Asia/Almaty', 'Asia/Amman', 'Asia/Anadyr', 'Asia/Aqtau', 'Asia/Aqtobe', 'Asia/Ashgabat', 'Asia/Atyrau', 'Asia/Baghdad', 'Asia/Bahrain', 'Asia/Baku', 'Asia/Bangkok', 'Asia/Barnaul', 'Asia/Beirut', 'Asia/Bishkek', 'Asia/Brunei', 'Asia/Chita', 'Asia/Choibalsan', 'Asia/Colombo', 'Asia/Damascus', 'Asia/Dhaka', 'Asia/Dili', 'Asia/Dubai', 'Asia/Dushanbe', 'Asia/Famagusta', 'Asia/Gaza', 'Asia/Hebron', 'Asia/Ho_Chi_Minh', 'Asia/Hong_Kong', 'Asia/Hovd', 'Asia/Irkutsk', 'Asia/Jakarta', 'Asia/Jayapura', 'Asia/Jerusalem', 'Asia/Kabul', 'Asia/Kamchatka', 'Asia/Karachi', 'Asia/Kathmandu', 'Asia/Khandyga', 'Asia/Kolkata', 'Asia/Krasnoyarsk', 'Asia/Kuala_Lumpur', 'Asia/Kuching', 'Asia/Kuwait', 'Asia/Macau', 'Asia/Magadan', 'Asia/Makassar', 'Asia/Manila', 'Asia/Muscat', 'Asia/Nicosia', 'Asia/Novokuznetsk', 'Asia/Novosibirsk', 'Asia/Omsk', 'Asia/Oral', 'Asia/Phnom_Penh', 'Asia/Pontianak', 'Asia/Pyongyang', 'Asia/Qatar', 'Asia/Qostanay', 'Asia/Qyzylorda', 'Asia/Riyadh', 'Asia/Sakhalin', 'Asia/Samarkand', 'Asia/Seoul', 'Asia/Shanghai', 'Asia/Singapore', 'Asia/Srednekolymsk', 'Asia/Taipei', 'Asia/Tashkent', 'Asia/Tbilisi', 'Asia/Tehran', 'Asia/Thimphu', 'Asia/Tokyo', 'Asia/Tomsk', 'Asia/Ulaanbaatar', 'Asia/Urumqi', 'Asia/Ust-Nera', 'Asia/Vientiane', 'Asia/Vladivostok', 'Asia/Yakutsk', 'Asia/Yangon', 'Asia/Yekaterinburg', 'Asia/Yerevan', 'Atlantic/Azores', 'Atlantic/Bermuda', 'Atlantic/Canary', 'Atlantic/Cape_Verde', 'Atlantic/Faroe', 'Atlantic/Madeira', 'Atlantic/Reykjavik', 'Atlantic/South_Georgia', 'Atlantic/St_Helena', 'Atlantic/Stanley', 'Australia/Adelaide', 'Australia/Brisbane', 'Australia/Broken_Hill', 'Australia/Darwin', 'Australia/Eucla', 'Australia/Hobart', 'Australia/Lindeman', 'Australia/Lord_Howe', 'Australia/Melbourne', 'Australia/Perth', 'Australia/Sydney', 'Canada/Atlantic', 'Canada/Central', 'Canada/Eastern', 'Canada/Mountain', 'Canada/Newfoundland', 'Canada/Pacific', 'Europe/Amsterdam', 'Europe/Andorra', 'Europe/Astrakhan', 'Europe/Athens', 'Europe/Belgrade', 'Europe/Berlin', 'Europe/Bratislava', 'Europe/Brussels', 'Europe/Bucharest', 'Europe/Budapest', 'Europe/Busingen', 'Europe/Chisinau', 'Europe/Copenhagen', 'Europe/Dublin', 'Europe/Gibraltar', 'Europe/Guernsey', 'Europe/Helsinki', 'Europe/Isle_of_Man', 'Europe/Istanbul', 'Europe/Jersey', 'Europe/Kaliningrad', 'Europe/Kiev', 'Europe/Kirov', 'Europe/Lisbon', 'Europe/Ljubljana', 'Europe/London', 'Europe/Luxembourg', 'Europe/Madrid', 'Europe/Malta', 'Europe/Mariehamn', 'Europe/Minsk', 'Europe/Monaco', 'Europe/Moscow', 'Europe/Oslo', 'Europe/Paris', 'Europe/Podgorica', 'Europe/Prague', 
'Europe/Riga', 'Europe/Rome', 'Europe/Samara', 'Europe/San_Marino', 'Europe/Sarajevo', 'Europe/Saratov', 'Europe/Simferopol', 'Europe/Skopje', 'Europe/Sofia', 'Europe/Stockholm', 'Europe/Tallinn', 'Europe/Tirane', 'Europe/Ulyanovsk', 'Europe/Uzhgorod', 'Europe/Vaduz', 'Europe/Vatican', 'Europe/Vienna', 'Europe/Vilnius', 'Europe/Volgograd', 'Europe/Warsaw', 'Europe/Zagreb', 'Europe/Zaporozhye', 'Europe/Zurich', 'GMT', 'Indian/Antananarivo', 'Indian/Chagos', 'Indian/Christmas', 'Indian/Cocos', 'Indian/Comoro', 'Indian/Kerguelen', 'Indian/Mahe', 'Indian/Maldives', 'Indian/Mauritius', 'Indian/Mayotte', 'Indian/Reunion', 'Pacific/Apia', 'Pacific/Auckland', 'Pacific/Bougainville', 'Pacific/Chatham', 'Pacific/Chuuk', 'Pacific/Easter', 'Pacific/Efate', 'Pacific/Enderbury', 'Pacific/Fakaofo', 'Pacific/Fiji', 'Pacific/Funafuti', 'Pacific/Galapagos', 'Pacific/Gambier', 'Pacific/Guadalcanal', 'Pacific/Guam', 'Pacific/Honolulu', 'Pacific/Kiritimati', 'Pacific/Kosrae', 'Pacific/Kwajalein', 'Pacific/Majuro', 'Pacific/Marquesas', 'Pacific/Midway', 'Pacific/Nauru', 'Pacific/Niue', 'Pacific/Norfolk', 'Pacific/Noumea', 'Pacific/Pago_Pago', 'Pacific/Palau', 'Pacific/Pitcairn', 'Pacific/Pohnpei', 'Pacific/Port_Moresby', 'Pacific/Rarotonga', 'Pacific/Saipan', 'Pacific/Tahiti', 'Pacific/Tarawa', 'Pacific/Tongatapu', 'Pacific/Wake', 'Pacific/Wallis', 'US/Alaska', 'US/Arizona', 'US/Central', 'US/Eastern', 'US/Hawaii', 'US/Mountain', 'US/Pacific', 'UTC']

Cast specified columns to JSON array type

Unique identifier (API): cast_json_array

This action marks input columns as "JSONArray" type, implying that the values in the column are expected to be JSON strings parsable into arrays of elements with uniform type.

Cast specified columns to JSON object type

Unique identifier (API): cast_json_object

This action marks input columns as "JSONObject" type, implying that the values in the column are expected to be JSON strings parsable into objects with attributes and values.

Cast columns to numeric type

Unique identifier (API): cast_numeric

Cast the selected columns to numeric type.

Column Group

Get column group components

Unique identifier (API): get_column_group_components

Extracts specified components from a column group by their suffix. If a specified component suffix does not exist, a null column is returned as the corresponding output.

Examples:

For the following input table:

dummy_col __group_grp_a_0 __group_grp_a_1 __group_grp_a_2
aaa 1 2 1
bbb Sentence A Sentence B Sentence C

with:

  • input column: 'grp_a'
  • output columns: ['component_0_from_grp_a', 'component_2_from_grp_a']
  • suffixes: [0, 2]

The output is:

dummy_col __group_grp_a_0 __group_grp_a_1 __group_grp_a_2 component_0_from_grp_a component_2_from_grp_a
aaa 1 2 1 1 1
bbb Sentence A Sentence B Sentence C Sentence A Sentence C

Conversion

Convert numeric column to DateTime

Unique identifier (API): numeric_to_datetime

Converts the given numeric columns into UTC timestamps, interpreting them as offset values in the given unit since the specified epoch.

Parameters Parameter Description Default Parameter Type
units Specifies the unit of time that the values in the numeric column represent. Seconds single-select
rel_to Specifies the "origin" UTC DateTime. The conversion from Numeric to DateTime shall be relative to this "origin". 1970-01-01 string
Examples:

For units == 'seconds', rel_to="1970-01-01T00:00:00" and for the following input table:

numeric timestamps
0
-86400
86400

The expected output is:

numeric timestamps
1970-01-01 00:00:00
1969-12-31 00:00:00
1970-01-02 00:00:00
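
A minimal pandas sketch of the same conversion, assuming offsets in seconds relative to the 1970-01-01 origin (illustrative only, not the product's implementation):

import pandas as pd

offsets = pd.Series([0, -86400, 86400])

# Interpret the numbers as second offsets from the given origin
timestamps = pd.to_datetime(offsets, unit="s", origin="1970-01-01")
# -> 1970-01-01 00:00:00, 1969-12-31 00:00:00, 1970-01-02 00:00:00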

Custom Formula

Create a new column with a formula

Unique identifier (API): create_new_column_with_formula_v2

Creates a new column with any user-entered formula. Common Excel/SQL functions and operators for manipulating Numeric, DateTime, Text, and Boolean values are permitted. For a full list of allowed functions, see documentation.

Parameters Parameter Description Default Parameter Type
expression A formula involving column names. For example: 'column_B / column_A + log(column_D * 2)' formula
Examples:

Consider the dataset:

manufacturer model year displ cty hwy
audi a4 quattro 1999 1.8 16 25
audi a6 quattro 1999 2.8 15 24
audi a6 quattro 2008 4.2 16 23
audi a6 quattro 2008 4.2 16 23
chevrolet c1500 suburban 2wd 2008 6 12 17
chevrolet corvette 1999 5.7 16 26
chevrolet corvette 1999 5.7 15 23
chevrolet k1500 tahoe 4wd 1999 6.5 14 17
chevrolet malibu 1999 3.1 18 26
dodge caravan 2wd 1999 3.8 15 22

where cty, hwy and year are of Numeric type.

Given the formula 'hwy - cty' and the output column name 'hwy_minus_cty', the result will be:

manufacturer model year displ cty hwy hwy_minus_cty
audi a4 quattro 1999 1.8 16 25 9
audi a6 quattro 1999 2.8 15 24 9
audi a6 quattro 2008 4.2 16 23 7
audi a6 quattro 2008 4.2 16 23 7
chevrolet c1500 suburban 2wd 2008 6 12 17 5
chevrolet corvette 1999 5.7 16 26 10
chevrolet corvette 1999 5.7 15 23 8
chevrolet k1500 tahoe 4wd 1999 6.5 14 17 3
chevrolet malibu 1999 3.1 18 26 8
dodge caravan 2wd 1999 3.8 15 22 7

where the new hwy_minus_cty column will be of Numeric type.

For the same input data, the formula 'IF(year <= 2000, "20th", "21st")' with the output column name 'century' will result in:

manufacturer model year displ cty hwy century
audi a4 quattro 1999 1.8 16 25 20th
audi a6 quattro 1999 2.8 15 24 20th
audi a6 quattro 2008 4.2 16 23 21st
audi a6 quattro 2008 4.2 16 23 21st
chevrolet c1500 suburban 2wd 2008 6 12 17 21st
chevrolet corvette 1999 5.7 16 26 20th
chevrolet corvette 1999 5.7 15 23 20th
chevrolet k1500 tahoe 4wd 1999 6.5 14 17 20th
chevrolet malibu 1999 3.1 18 26 20th
dodge caravan 2wd 1999 3.8 15 22 20th

where the new century column will be of Text type.
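
The two formulas above can be mimicked with plain pandas/numpy expressions; this is an illustrative sketch, not the formula engine itself:

import numpy as np
import pandas as pd

df = pd.DataFrame({"year": [1999, 2008], "cty": [16, 12], "hwy": [25, 17]})

# 'hwy - cty' -> new Numeric column
df["hwy_minus_cty"] = df["hwy"] - df["cty"]

# 'IF(year <= 2000, "20th", "21st")' -> new Text column
df["century"] = np.where(df["year"] <= 2000, "20th", "21st")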

Transform columns with a formula

Unique identifier (API): transform_columns_with_formula_v2

Transforms one or more column(s) with a user-specified formula. Common Excel/SQL functions and operators for manipulating Numeric, DateTime, Text, Boolean and Categorical values as well as JSON Objects/Arrays are permitted. For a full list of these functions, see documentation.

Parameters Parameter Description Default Parameter Type
formula Formula where COL/COLUMN refers to the column(s) to be transformed. For example, "col_A + col_B - $COL" formula
Examples:

For example, given the following dataset

qty_1 qty_2
10 null
20 15
15 6
null 1
8 8

and formula = "$col + (qty_2 + $col) * 10", with input column qty_1, the result will be

qty_1 qty_2
null null
370 15
225 6
null 1
168 8
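
A rough pandas sketch of the same transformation, where $col stands for the column being transformed (qty_1 here); nulls propagate as in the example output:

import pandas as pd

df = pd.DataFrame({"qty_1": [10, 20, 15, None, 8],
                   "qty_2": [None, 15, 6, 1, 8]})

# "$col + (qty_2 + $col) * 10" applied in place to qty_1
df["qty_1"] = df["qty_1"] + (df["qty_2"] + df["qty_1"]) * 10
# -> null, 370, 225, null, 168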

Data Cleaning/Processing

Concatenate with delimiter

Unique identifier (API): concat_cols_delim

Concatenates one or more text columns / column groups with a specified delimiter.

Parameters Parameter Description Default Parameter Type Suggested Values
delimiter Delimiter to be appended between concatenated values string [',', ';', '-', '
Examples:

Assuming delimiter == ', ' and the input table is: (one column, one group with 2 components)

col_1 __group_grp_0 __group_grp_1
Hello World A
Hello null B

The output will be:

col_1 __group_grp_0 __group_grp_1 comb
Hello World A Hello, World, A
Hello null B Hello, , B
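
A minimal pandas sketch of the same concatenation, using simplified column names and treating nulls as empty strings, which is what produces the "Hello, , B" row above:

import pandas as pd

df = pd.DataFrame({"col_1": ["Hello", "Hello"],
                   "grp_0": ["World", None],
                   "grp_1": ["A", "B"]})

# Concatenate with ", " as the delimiter; nulls become empty strings
df["comb"] = (df[["col_1", "grp_0", "grp_1"]]
              .fillna("")
              .agg(", ".join, axis=1))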

Copy

Unique identifier (API): copy

Creates a copy of the specified columns OR column groups

Examples:

For the following input table: (2 columns and 1 group with 2 components)

col_1 col_2 __group_grp_0 __group_grp_1
0.14667831357791283 2194986032 0.6765007779041725 0.876755413362593
0.8424524032441315 -8793057235 0.21887501332663328 0.6874699121416429
0.08379407453894183 -1300419439 0.2108548261597638 0.39335408485745316
0.09723766918372134 2467206050 0.2499576969146754 0.783647656429706
0.13917572512342702 8415264685 0.9150713227141526 0.071808029978221

We select to copy: [col_2, grp]. (grp is the name of the group). The output is:

col_1 col_2 __group_grp_0 __group_grp_1 col_2_copy __group_grp_copy_0 __group_grp_copy_1
0.14667831357791283 2194986032 0.6765007779041725 0.876755413362593 2194986032 0.6765007779041725 0.876755413362593
0.8424524032441315 -8793057235 0.21887501332663328 0.6874699121416429 -8793057235 0.21887501332663328 0.6874699121416429
0.08379407453894183 -1300419439 0.2108548261597638 0.39335408485745316 -1300419439 0.2108548261597638 0.39335408485745316
0.09723766918372134 2467206050 0.2499576969146754 0.783647656429706 2467206050 0.2499576969146754 0.783647656429706
0.13917572512342702 8415264685 0.9150713227141526 0.071808029978221 8415264685 0.9150713227141526 0.071808029978221

Drop

Unique identifier (API): drop

Drop columns and/or column groups

Examples:

For example, given the following dataset

col_1 col_2 __group_A_0 __group_A_1 __group_B_mean __group_B_count
0.666099547109415 cat_1 0.8268666013533758 0 0.9842078018727513 0
0.293068258318807 cat_2 0.9747909820892654 1 0.703905046233064 1
0.2961075040827643 cat_3 0.14699408293664762 2 0.5417745101099833 2
0.11255366333183248 cat_4 0.5392931543191276 3 0.35921739535158015 3
0.9707308564097861 cat_5 0.23839121304354394 4 0.8751438273689285 4

with input column col_1 and column group A, the result is

col_2 __group_B_mean __group_B_count
cat_1 0.9842078018727513 0
cat_2 0.703905046233064 1
cat_3 0.5417745101099833 2
cat_4 0.35921739535158015 3
cat_5 0.8751438273689285 4

Find and replace

Unique identifier (API): find_replace

Finds and replaces the specified strings in the input column(s) according to the given mapping

Parameters Parameter Description Default Parameter Type
replacements Mapping between the search strings and their replacements map
match_case Whether the matches are case sensitive False boolean
positional_match Matches the search only if a match is found in the specified position in the string anywhere single-select
Examples:

For the following input table:

col_x col_y
They won the fifth place They won the sixth place
No match here whole word is th

and input columns: ['col_x', 'col_y'] and parameters:

parameters = {
    'replacements': [['th', '##']],
    'positional_match': 'prefix',
    'match_case': 'True'
}

The output will be:

For ('prefix', True):

col_x col_y
They won ##e fifth place They won ##e sixth place
No match here whole word is ##
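
The 'prefix' positional match appears to anchor the search at the start of a word; a word-boundary regular expression reproduces the output above (a pandas sketch for illustration only):

import pandas as pd

s = pd.Series(["They won the fifth place", "whole word is th"])

# Case-sensitive replacement of "th" only where it starts a word
s.str.replace(r"\bth", "##", regex=True)
# -> "They won ##e fifth place", "whole word is ##"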

Impute categorical columns

Unique identifier (API): impute_categorical

Fill missing values in categorical input columns by a specified method

Parameters Parameter Description Default Parameter Type
method Method for imputing input columns most_frequent single-select
fill_value Value to fill when method is specified as constant string
Examples:

For example, a categorical column white, black, None, white, yellow is changed to white, black, white, white, yellow with method = "most_frequent" (missing values are filled with the most frequent value "white").

Impute numeric columns

Unique identifier (API): impute_numeric

Fill missing values in numeric input columns by a specified method

Parameters Parameter Description Default Parameter Type
method Method for imputing input columns mean single-select
fill_value Value to fill when method is specified as constant float
Examples:

For example, a numeric column 100, 120, None, 80, 100, None is changed to 100, 120, 100, 80, 100, 100 with method = "mean" (missing values are filled with the mean 100).
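
For the mean method, the behaviour corresponds to a simple fill with the column mean; a pandas sketch for illustration:

import pandas as pd

s = pd.Series([100, 120, None, 80, 100, None])

# method = "mean": fill missing values with the column mean (100 here)
s_filled = s.fillna(s.mean())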

Rename

Unique identifier (API): rename

Renames specified columns OR column groups

Examples:

For the following input table: (3 columns and 1 group with 2 components)

col_1 col_2 col_3 __group_grp_0 __group_grp_1
0.5267075 8.24489e+07 1.23345 0.2635994 0.1005395
0.5707881 2.044422e+09 -0.1124 0.3945866 0.4787061
0.6478216 -5.179201e+09 0.00011 0.1314026 0.2335102

We choose to rename [col_2, grp] (grp is the name of the group) to [col_A, grp_A]. The output is:

col_1 col_A col_3 __group_grp_A_0 __group_grp_A_1
0.5267075 8.24489e+07 1.23345 0.2635994 0.1005395
0.5707881 2.044422e+09 -0.1124 0.3945866 0.4787061
0.6478216 -5.179201e+09 0.00011 0.1314026 0.2335102

Sort on columns

Unique identifier (API): sort_on_columns

Sort the dataset based on the input columns in a specified order (ascending/descending)

Examples:

For the following input table:

animal colour adopted qty desexed note
cat yellow 2016-07-31T00:00:12 10 True There is a story
dog blue 2017-08-20T07:23:54 4 True
cat blue 2017-09-20T00:23:50 9 True No story
cat blue 2018-05-01T04:40:15 8 True There is a story
dog yellow 2019-02-05T12:12:02 1 True There is another story
dog black 2019-08-20T07:23:02 2 False There is a story
cat black 2020-01-20T00:23:50 2 True No story
dog yellow 2020-08-02T15:00:08 2 False

We select sorting by: qty in ascending order and adopted in descending order. The output is:

animal colour adopted qty desexed note
dog yellow 2019-02-05 12:12:02 1 True There is another story
dog yellow 2020-08-02 15:00:08 2 False
cat black 2020-01-20 00:23:50 2 True No story
dog black 2019-08-20 07:23:02 2 False There is a story
dog blue 2017-08-20 07:23:54 4 True
cat blue 2018-05-01 04:40:15 8 True There is a story
cat blue 2017-09-20 00:23:50 9 True No story
cat yellow 2016-07-31 00:00:12 10 True There is a story

Strip HTML tags

Unique identifier (API): strip_html

Strip HTML tags from text to convert webpages to text documents, sans HTML formatting

Examples:

Strip HTML tags from text, e.g. "<emp> Text </emp>" becomes "Text"

Fill null with the nearest preceding/following values

Unique identifier (API): window_fill_null_by_nearest

Fill null with the nearest non-null values in either preceding or following rows after ordering, optionally with partitioning

Parameters Parameter Description Default Parameter Type
method Whether to use the nearest value in preceding or following rows preceding single-select
Examples:

For example, given the following dataset

animal colour adopted qty
cat yellow 2020-01-01 00:00:00 null
cat yellow 2020-01-02 00:00:00 null
cat yellow 2020-01-03 00:00:00 5.0
cat yellow 2020-01-04 00:00:00 7.0
cat yellow 2020-01-05 00:00:00 10.0
cat white 2020-01-06 00:00:00 1.0
cat white 2020-01-07 00:00:00 4.0
cat white 2020-01-08 00:00:00 null
cat white 2020-01-09 00:00:00 7.0
cat white 2020-01-10 00:00:00 8.0
dog yellow 2020-01-11 00:00:00 9.0
dog yellow 2020-01-12 00:00:00 7.0
dog yellow 2020-01-13 00:00:00 6.0
dog yellow 2020-01-14 00:00:00 5.0
dog yellow 2020-01-15 00:00:00 10.0
dog white 2020-01-16 00:00:00 6.0
dog white 2020-01-17 00:00:00 5.0
dog white 2020-01-18 00:00:00 8.0
dog white 2020-01-19 00:00:00 null
dog white 2020-01-20 00:00:00 null

with input column qty, partitioned by animal and colour and ordered by adopted for method preceding, the result is

animal colour adopted qty
cat white 2020-01-06 00:00:00 1.0
cat white 2020-01-07 00:00:00 4.0
cat white 2020-01-08 00:00:00 4.0
cat white 2020-01-09 00:00:00 7.0
cat white 2020-01-10 00:00:00 8.0
cat yellow 2020-01-01 00:00:00 null
cat yellow 2020-01-02 00:00:00 null
cat yellow 2020-01-03 00:00:00 5.0
cat yellow 2020-01-04 00:00:00 7.0
cat yellow 2020-01-05 00:00:00 10.0
dog white 2020-01-16 00:00:00 6.0
dog white 2020-01-17 00:00:00 5.0
dog white 2020-01-18 00:00:00 8.0
dog white 2020-01-19 00:00:00 8.0
dog white 2020-01-20 00:00:00 8.0
dog yellow 2020-01-11 00:00:00 9.0
dog yellow 2020-01-12 00:00:00 7.0
dog yellow 2020-01-13 00:00:00 6.0
dog yellow 2020-01-14 00:00:00 5.0
dog yellow 2020-01-15 00:00:00 10.0
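
For method = "preceding", the behaviour matches a partitioned forward fill after ordering; a pandas sketch on a slice of the data above (illustrative only):

import pandas as pd

df = pd.DataFrame({
    "animal":  ["cat"] * 5,
    "colour":  ["white"] * 5,
    "adopted": pd.to_datetime(["2020-01-06", "2020-01-07", "2020-01-08",
                               "2020-01-09", "2020-01-10"]),
    "qty":     [1.0, 4.0, None, 7.0, 8.0],
})

# Order by 'adopted' and carry the last non-null qty forward within each
# (animal, colour) partition; the null on 2020-01-08 becomes 4.0
df = df.sort_values("adopted")
df["qty"] = df.groupby(["animal", "colour"])["qty"].ffill()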

Date/DateTime

Add time delta

Unique identifier (API): add_timedelta

Adds a time delta (in seconds) to one or more DateTime columns. Changes the columns in place.

Parameters Parameter Description Default Parameter Type
time_delta_sec This is the time-duration in seconds to be added to one or more DateTime columns. 0 integer
Examples:

For a time_delta_sec == -100 and the given datetime 2022-06-30 23:51:08.825072 the output is 2022-06-30 23:49:28.825072

Create cyclic encoding features

Unique identifier (API): cyclic_encode

A directive that receives one or more columns that are cyclical in nature. Denote such a column as 'x' and its period as 'p'. For each such column, two new columns are created: x_sin = sin(2*pi*x/p) and x_cos = cos(2*pi*x/p).

Parameters Parameter Description Default Parameter Type
period The period of the cyclic feature 1 float
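
For example, an hour-of-day column with period 24 would be encoded as follows (a small numpy sketch; the column and period are hypothetical):

import numpy as np
import pandas as pd

hours = pd.Series([0, 6, 12, 18])   # cyclic feature x
p = 24                              # its period

hours_sin = np.sin(2 * np.pi * hours / p)
hours_cos = np.cos(2 * np.pi * hours / p)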

Compute time elapsed

Unique identifier (API): diff_datetime

Calculates the time difference in seconds between two DateTime columns

Parameters Parameter Description Default Parameter Type
units Specifies the units of the time difference result Seconds single-select
Examples:

If the time difference units are Hours, for the following input:

col_1 col_2
2020-07-23 11:26:04.812256 2020-07-31 11:26:04.812256

We expect the following output:

col_1 col_2 diff_1
2020-07-23 11:26:04.812256 2020-07-31 11:26:04.812256 -192.0

Extract AM/PM

Unique identifier (API): extract_am_pm

Extracts the AM/PM value for rows of a DateTime column

Examples:

2020-04-15T15:53:44 -> will return: PM

Extract day of month

Unique identifier (API): extract_day_of_month_number

Extract the day of month as a number from 1 to 31

Examples:

2020-04-15T15:53:44 -> will return: 15

Extract day of week name

Unique identifier (API): extract_day_of_week_name

Extract the name of the day in week ('Monday' to 'Sunday')

Examples:

2020-04-15T15:53:44 -> will return: Wednesday

Extract day of week number

Unique identifier (API): extract_day_of_week_number

Extract the day of week as a number ('Sunday' = 1, 'Saturday' = 7)

Examples:

2020-04-15T15:53:44, with day coding Monday=0, Tuesday=1, ..., Sunday=6 -> will return: 2.

Extract hours

Unique identifier (API): extract_hour

Extract hour of the day as a number from 0 to 23

Examples:

2020-04-15T15:53:44 -> will return: 15

Extract minutes

Unique identifier (API): extract_minutes

Extract minutes after the hour as a number from 0 to 59

Examples:

2020-04-15T15:53:44 -> will return: 53

Extract month name

Unique identifier (API): extract_month_name

Extracts the month name of a DateTime column

Examples:

2020-04-15T15:53:44 -> will return: April

Extract month number

Unique identifier (API): extract_month_number

Extract the month as a number from 1 to 12

Extract quarter

Unique identifier (API): extract_quarter

Extract the quarter (3-month period) of the year in which the dates fall as an integer from 1 to 4.

Examples:

2020-04-15T15:53:44 -> will return: 2

Extract seconds

Unique identifier (API): extract_seconds

Extract seconds after the minute as a number from 0 to 59

Examples:

2020-04-15T15:53:44 -> will return: 44

Extract timestamp

Unique identifier (API): extract_timestamp

Extracts the POSIX timestamp (number of seconds since January 1st, 1970).

Examples:

Returns a POSIX timestamp as a float (number of seconds since January 1st, 1970).

2020-04-15T15:53:44 -> will return: 1586930024.0.

Extract UTC offset

Unique identifier (API): extract_utc_offset

Extracts the UTC offset (in hours) from timezone aware string timestamps

Parameters Parameter Description Default Parameter Type Suggested Values
date_format The date_format of the column %Y-%m-%dT%H:%M:%S%z string ['%Y-%m-%dT%H:%M:%S.%fZ%z', '%b %d %Y %H:%M:%S.%f%z', '%Y-%m-%dT%H:%M:%S.%f%z', '%m-%d-%Y %H:%M:%S.%f%z', '%d-%m-%Y %H:%M:%S.%f%z', '%d %b %Y %H:%M:%S.%f%z', '%Y/%m/%d %H:%M:%S.%f%z', '%Y-%m-%d %H:%M:%S.%f%z', '%d/%m/%Y %H:%M:%S.%f%z', '%Y-%m-%dT%H:%M:%SZ%z', '%d%m%Y %H:%M:%S.%f%z', '%m%d%Y %H:%M:%S.%f%z', '%Y%m%d %H:%M:%S.%f%z', '%m-%d-%Y %H:%M:%S%z', '%Y/%m/%d %H:%M:%S%z', '%Y-%m-%dT%H:%M:%S%z', '%b-%d-%Y %H:%M:%S%z', '%d-%m-%Y %H:%M:%S%z', '%d-%b-%Y %H:%M:%S%z', '%d/%m/%Y %H:%M:%S%z', '%Y-%m-%d %H:%M:%S%z', '%m/%d/%Y %I:%M:%S %p%z', '%Y%m%d %H:%M:%S%z', '%Y-%m-%dT%H:%MZ%z', '%m%d%Y %H:%M:%S%z', '%d%m%Y %H:%M:%S%z', '%m-%d-%Y%z', '%Y-%m-%d%z', '%d-%m-%Y%z', '%m %d %Y%z', '%d-%b-%Y%z', '%Y/%m/%d%z', '%d %m %Y%z', '%m/%d/%Y%z', '%Y %m %d%z', '%d/%m/%Y%z', '%d %b %Y%z', '%b-%d-%Y%z', '%b %d %Y%z', '%m%d%Y%z', '%Y%m%d%z', '%d%m%Y%z', '%m%Y%z', '%m %Y%z', '%m-%Y%z', '%m/%Y%z']
Examples:

For a date_format == "%Y-%m-%dT%H:%M:%S%z" and the following input:

col_1
1970-01-01T23:23:14
2000-10-10T10:14:42-0515
1983-02-03T03:00:51
null
2020-07-06T18:52:36-0000
1948-05-14T02:23:14-0845

The expected output is:

col_1 utc_offset_hrs
1970-01-01T23:23:14 0430
2000-10-10T10:14:42-0515 -5.25
1983-02-03T03:00:51 null
null null
2020-07-06T18:52:36-0000 0.0
1948-05-14T02:23:14-0845 -8.75

Extract UTC offset from UTC timestamp and timezone

Unique identifier (API): extract_utc_offset_tz

Extracts the offset (in hours) from UTC, given the UTC DateTime column and the timezone name column.

Examples:

For the following input:

datetimes timezones
1970-01-01 23:23:14 Pacific/Pago_Pago
1990-05-20 09:09:09 Asia/Jerusalem
2000-10-10 10:14:42 pacific/pago_pago
1983-02-03 03:00:51 null
null Arctic/Longyearbyen
2020-07-06 18:52:36 Mars/Jezero crater

The expected output is:

datetimes timezones utc_offset_hrs
1970-01-01 23:23:14 Pacific/Pago_Pago -11.0
1990-05-20 09:09:09 Asia/Jerusalem 3.0
2000-10-10 10:14:42 pacific/pago_pago null
1983-02-03 03:00:51 null null
null Arctic/Longyearbyen null
2020-07-06 18:52:36 Mars/Jezero crater null

The reasons for the null values are:

  1. pacific/pago_pago doesn't exist in our timezone list. (Pacific/Pago_Pago does!)
  2. null timezone value
  3. null datetime value
  4. Mars/Jezero crater doesn’t exist in our timezone list. (Maybe in the future)

Extract week number in year

Unique identifier (API): extract_week_number_in_year

Extracts the week number (between 1-52)

Extract year

Unique identifier (API): extract_year

Extract the year as a number

Examples:

2020-04-15T15:53:44 -> will return: 2020

Get daylight savings indicator

Unique identifier (API): get_dst_indicator

Generates a boolean column which indicates if a localized timestamp column (represented by the combination of two columns: A UTC DateTime column and an IANA timezone name column) has an active daylight saving.

Examples:

For the following input:

time_utc tz_name
2020-08-20 13:13:14 Australia/Melbourne
2020-02-20 13:13:14 Australia/Melbourne
2020-02-20 13:13:14 null
2020-02-20 13:13:14 Mars/Jezero Crater
null Australia/Melbourne

The expected output is:

time_utc tz_name dst_active
2020-08-20 13:13:14 Australia/Melbourne false
2020-02-20 13:13:14 Australia/Melbourne true
2020-02-20 13:13:14 null null
2020-02-20 13:13:14 Mars/Jezero Crater null
null Australia/Melbourne null

The reasons for the null values are:

  1. null timezone value
  2. Mars/Jezero crater doesn’t exist in our timezone list. (Maybe in the future)
  3. null time_utc
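
For a single row, the same check can be sketched with the standard-library zoneinfo module (Python 3.9+), purely for illustration:

from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

utc_time = datetime(2020, 2, 20, 13, 13, 14, tzinfo=ZoneInfo("UTC"))
local = utc_time.astimezone(ZoneInfo("Australia/Melbourne"))

# DST is active if the local offset includes a non-zero DST component
dst_active = local.dst() != timedelta(0)   # True (AEDT in February)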

Get local time

Unique identifier (API): get_local_time

Generates local time given the UTC timestamp and the timezone's offset in hours.

Examples:

For the following input: (2 pairs of DateTime | offset, working on both):

datetime_1 offset_1 datetime_2 offset_2
1970-01-01 23:23:14 4.5 null null
2000-10-10 10:14:42 -5.25 2020-07-06 18:52:36 0.0
1983-02-03 03:00:51 2.75 1948-05-14 02:23:14 null

The expected output is:

datetime_1 offset_1 datetime_2 offset_2 local_time_1 local_time_2
1970-01-01 23:23:14 4.5 null null 1970-01-02 03:53:14 null
2000-10-10 10:14:42 -5.25 2020-07-06 18:52:36 0.0 2000-10-10 04:59:42 2020-07-06 18:52:36
1983-02-03 03:00:51 2.75 1948-05-14 02:23:14 null 1983-02-03 05:45:51 null
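
The computation is simply the UTC timestamp plus the offset expressed as a duration; a pandas sketch of the first two rows (illustrative only):

import pandas as pd

dt = pd.to_datetime(pd.Series(["1970-01-01 23:23:14", "2000-10-10 10:14:42"]))
offset_hours = pd.Series([4.5, -5.25])

# Local time = UTC timestamp + offset in hours
local = dt + pd.to_timedelta(offset_hours, unit="h")
# -> 1970-01-02 03:53:14, 2000-10-10 04:59:42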

Get timezone name

Unique identifier (API): get_tz_name

Generates a list of timezone names matching the pair of a UTC DateTime column and a numeric UTC offset (in hours) column. The list is in a text format with values separated by a pipe

Examples:

For the following input: (2 pairs of DateTime | offset, working on both)

datetime_1 offset_1 datetime_2 offset_2
1970-01-01 23:23:14 4.5 null null
2000-10-10 10:14:42 -5.25 2020-07-06 18:52:36 0.0
1983-02-03 03:00:51 2.75 1948-05-14 02:23:14 null

The expected output is:

datetime_1 offset_1 datetime_2 offset_2 timezones_1 timezones_2
1970-01-01 23:23:14 4.5 null null Asia/Kabul
2000-10-10 10:14:42 -5.25 2020-07-06 18:52:36 0.0 Africa/Abidjan, Africa/Accra, Africa/Bamako, Africa/Banjul, Africa/Bissau, Africa/Conakry, Africa/Dakar, Africa/Freetown, Africa/Lome, Africa/Monrovia, Africa/Nouakchott, Africa/Ouagadougou, Africa/Sao_Tome, America/Danmarkshavn, America/Scoresbysund, Atlantic/Azores, Atlantic/Reykjavik, Atlantic/St_Helena, GMT, UTC
1983-02-03 03:00:51 2.75 1948-05-14 02:23:14 null

Get timezone name from geolocation

Unique identifier (API): get_tz_name_from_geolocation

Given the geolocation as a pair of numeric columns (latitude and longitude), creates a column of timezone names of each geolocation. Returns null values if no matching timezone is found, or if the geo coordinates are invalid.

Examples:

For the following input:

lat lng
50.45817318865369 35.99253168643526
30.56916325861897 65.90262778021655
51.936292660801115 26.071955568518586
21.180638320519932 151.80832165098863
67.04735445996346 120.61522491454524
null 97.50232778607207
51.07973011677217 null

The expected output is:

lat lng tz_names
50.45817318865369 35.99253168643526 Europe/Moscow
30.56916325861897 65.90262778021655 Asia/Kabul
51.936292660801115 26.071955568518586 Europe/Minsk
21.180638320519932 151.80832165098863 null
67.04735445996346 120.61522491454524 Asia/Yakutsk
null 97.50232778607207 null
51.07973011677217 null null

Round DateTime

Unique identifier (API): round_datetime

Rounds a DateTime column (in place) to 15, 30 or 60 minutes

Parameters Parameter Description Default Parameter Type
round_to Time rounding resolution. E.g. for "half_hour", 14:39:11 --> 14:30:00 single-select
Examples:

""2020-04-15T15:53:44"" with rounding to 30 minutes. Output ""2020-04-15T16:00:00""

Feature Engineering

Create features using autoencoder neural network

Unique identifier (API): autoencoder

Create new numerical features by combining existing numerical features in novel and non-linear ways using an autoencoder

Parameters Parameter Description Default Parameter Type Parameter Min Parameter Max Parameter Step
num_epochs The number of iterations or epochs for optimizing the loss function over the training set. 288 range 1.0 1000.0 1.0
batch_size The size of a batch of examples used to compute the Stochastic Gradient updates. 128 range 1.0 8192.0 1.0
learning_rate Learning rates control the size of the update to parameters during each step. 0.001 range 0.001 0.1 0.001
impute Strategy for imputing missing values mean single-select
create_missing_flag Create a new feature based on missing values True boolean
standardize Standardize, if true, will transform each feature value by subtracting the mean-value and dividing the result by the standard deviation of the feature True boolean
Examples:

Using autoencoder to generate novel and non-linear features based on existing numerical features

Bin columns

Unique identifier (API): bin_columns

This action bins one or more numeric columns by specifying the number of equally-spaced bins or a list of bin edges

Parameters Parameter Description Default Parameter Type
method Whether to use equally-spaced bins or custom cut positions single-select
n_bins Number of equally-spaced bins. This creates n equally-spaced bins from the minimum to maximum in the column. integer
cut_positions Custom cut positions in ascending order float-list
output_format Format for output labels [lower, upper) single-select
Examples:

Discretize a numeric column into categorical ranges, e.g. given cut positions 100, 150, 200, the values 30, 110, 180, 220 can be binned as "x < 100", "100 <= x < 150", "150 <= x < 200", "x >= 200"
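
The custom-cut-positions case corresponds to binning against explicit edges; a pandas sketch using pd.cut with hypothetical labels matching the output format above:

import pandas as pd

x = pd.Series([30, 110, 180, 220])
bins = [float("-inf"), 100, 150, 200, float("inf")]
labels = ["x < 100", "100 <= x < 150", "150 <= x < 200", "x >= 200"]

# right=False gives [lower, upper) intervals as in the labels
pd.cut(x, bins=bins, labels=labels, right=False)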

Bin columns by quantiles

Unique identifier (API): bin_columns_by_quantiles

This action bins one or more numeric columns by specifying the number of uniformly-split quantiles or a list of custom quantiles.

Parameters Parameter Description Default Parameter Type
method Whether to use uniformly-split quantiles or custom split quantiles single-select
n_bins Number of uniformly-split quantiles. This creates n bins based on uniformly-split quantiles from 0 to 1 integer
split_quantiles Custom split quantiles in ascending order float-list
output_format Format for output labels [lower, upper) single-select
Examples:

For example, 20, 50, 80, 100, 200 is converted to "0 < x < 0.25", "0.25 <= x < 0.5", "0.5 <= x < 0.75", "0.75 <= x < 1", "0.75 <= x < 1" with split quantiles 0.25, 0.5, 0.75.

Bin DateTime

Unique identifier (API): bin_datetime

Bins a DateTime column (creates 2 new columns) to 15, 30 or 60 minutes

Parameters Parameter Description Default Parameter Type
bin_res Upper/lower bounds to round towards in order to bin the time. E.g.: 14:53:15 --> [14:45:00, 15:00:00] when using "quarter_hour" resolution single-select
Examples:

Creates 2 DateTime output columns for each input column which "bin" the input column between them. The binning resolution is hourly, half-hourly, or 15 minutes. Example: "2020-04-15T15:53:44" with half-hourly resolution. Output columns: ["2020-04-15T15:30:00", "2020-04-15T16:00:00"]

Bin columns such that the target mean follows upward or downward trend

Unique identifier (API): bin_trend

This action bins one or more numeric columns such that the target mean follows an upward or downward trend

Parameters Parameter Description Default Parameter Type
trend Up or Down Trend single-select
algo Algorithm for trend-binning single-select

Create missing flag

Unique identifier (API): create_missing_flag

For every input column, create a boolean column flagging the locations of missing values.

Extract PCA Components

Unique identifier (API): extract_pca_v2

Create new numerical features by combining existing numerical features in linear ways using PCA

Parameters Parameter Description Default Parameter Type Parameter Min Parameter Max Parameter Step
limit_components_by Criterion for selecting the principal components. If the minimum explained variance is used, the components that contribute most to the variance are selected until the cumulative variance is not less than the specified ratio of the total variance. min_explained_variance single-select
num_components The number of principal components to extract from the input data integer
min_explained_variance This is the threshold on the ratio of explained variance of a component to the total variance. Only the components with highest variance whose cumulative variance is at least the specified ratio of total variance will be selected 0.8 range 0.01 1.0 0.01
numeric_impute_method Method for imputing Numeric columns mean single-select

Apply UMAP

Unique identifier (API): extract_umap_components

Apply the Uniform Manifold Approximation and Projection (UMAP) algorithm to the input columns to achieve dimensionality reduction. This is a manifold-learning algorithm based on nearest-neighbour graphs.

Parameters Parameter Description Default Parameter Type Parameter Min Parameter Max Parameter Step
num_output_dimensions The dimensionality of the space the input will be embedded into. 2 range 1.0 100.0 1.0
seed The random seed for the UMAP algorithm. By setting it, the output will be reproducible. 2 integer
n_neighbors This parameter controls how UMAP balances local versus global structure in the data. See documentation of this action in the knowledge hub for details. NOTE: If larger than the number of records in the data, this parameter will be truncated to the number of records. 15 range 2.0 100.0 1.0
metric Defines the metric of the ambient space of the input data. euclidean single-select
min_dist Controls how tightly UMAP is allowed to pack points together. It, quite literally, provides the minimum distance apart that points are allowed to be in the low-dimensional representation. This means that low values of the parameter will result in clumpier embeddings, which can be useful if you are interested in clustering or in finer topological structure. Larger values will prevent UMAP from packing points together and will instead focus on preserving the broad topological structure. 0.1 range 0.0 1.0 0.01

Get vector representation from text

Unique identifier (API): get_feature_vectors_from_text

Obtain numerical features from text using word embedding model

Examples:

Generates a vector representation for the given text column. For the following input table, with input column 'col_text' and output column 'col_text_vector':

col_text
null
Test 54 45 Test
Dummy sentence
Just some number 00

The output contains text vectors of dimensionality 96:

col_text __group_col_text_vector__0 __group_col_text_vector__1 ....................... __group_col_text_vector__95
null 0.123 0.786 ....................... 0.313
Test 54 45 Test -0.92 0.576 ....................... 0.415
Dummy sentence 0.567 0.321 ....................... -0.28
Just some number 00 0.876 0.476 ....................... -0.47

Create historical summary aggregation features

Unique identifier (API): historical_summ_agg

Create historical summary aggregation features, e.g. the average purchase amount over the last 6 months

Parameters Parameter Description Default Parameter Type
aggregation_functions Aggregation methods such as "min" or "max" are functions that reduce a column of values to a single value ['sum'] multi-select
timeframe Time frame for windowing 3 integer
Examples:

For example, creating columns like average spending last 6 months for each customer, by using the date column, monthly spending and customer ID

Label encode categorical columns

Unique identifier (API): label_encode

Encode each input categorical column into a new column of non-negative integers, following either alphabetical (by default) or customised order. Null values and unseen labels (when a finalized recipe fitted on a dataset is applied to new data) remain as null.

Parameters Parameter Description Default Parameter Type
order_of_categories Category list that determines the order of labels [] string-list
Examples:

E.g. convert the categories "A", "B", "C", NA to their numerical encoding 0, 1, 2, NA
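
A pandas sketch of the default alphabetical encoding (illustrative; the actual action also handles custom category order and unseen labels):

import pandas as pd

s = pd.Series(["A", "B", "C", None], dtype="category")

# Alphabetical codes; -1 marks missing values, mapped back to null here
codes = s.cat.codes.replace(-1, pd.NA)
# -> 0, 1, 2, <NA>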

Scale numeric column(s)

Unique identifier (API): scale_numeric_columns

Scale numeric column(s) by a specified method

Parameters Parameter Description Default Parameter Type Parameter Min Parameter Max Parameter Step
method Method for scaling including: standard (mean removed, scaled to unit variance), min-max (scaled to [0, 1]), max-abs (scaled to [-1, 1]) and robust (median removed, scaled to a specified quantile range) single-select
lower_quantile Lower value for quantile range used in scaling 0.25 range 0.0 1.0 0.05
upper_quantile Upper value for quantile range used in scaling 0.75 range 0.0 1.0 0.05
relative_error Relative error for approximating median or a specific quantile. Must be no less than 0.001 for computation efficiency 0.001 range 0.001 1.0 0.001
Examples:

For example, a numeric column 100, 120, 80, 10000, 100 is scaled to 0.01, 0.012, 0.008, 1.0, 0.01 with method = "max_abs" (all values are divided by the absolute maximum 10000).
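
The max-abs case is a single division; a pandas sketch for illustration:

import pandas as pd

s = pd.Series([100, 120, 80, 10000, 100])

# method = "max_abs": divide by the largest absolute value (10000 here)
scaled = s / s.abs().max()
# -> 0.01, 0.012, 0.008, 1.0, 0.01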

Compute Weight of Evidence (WOE) encoding for categorical columns

Unique identifier (API): weight_of_evidence

This action converts categorical columns by Weight of Evidence encoding.

Feature Engineering/Text

Get vector representation from text

Unique identifier (API): get_feature_vectors_from_text

Obtain numerical features from text using word embedding model

Examples:

Generates a vector representation for the given text column. For the following input table, with input column 'col_text' and output column 'col_text_vector':

col_text
null
Test 54 45 Test
Dummy sentence
Just some number 00

The output contains text vectors of dimensionality 96:

col_text __group_col_text_vector__0 __group_col_text_vector__1 ....................... __group_col_text_vector__95
null 0.123 0.786 ....................... 0.313
Test 54 45 Test -0.92 0.576 ....................... 0.415
Dummy sentence 0.567 0.321 ....................... -0.28
Just some number 00 0.876 0.476 ....................... -0.47

Filtering/Custom Formula

Filter rows

Unique identifier (API): filter_rows_with_formula_v2

Filter rows in data by keeping only rows that satisfy the condition specified in user-entered formula. Common Excel/SQL functions and operators for manipulating Numeric, DateTime, Text, and Boolean values are permitted. For a full list of allowed functions, see documentation.

NOTE: The formula entered must evaluate to a boolean value.

Parameters Parameter Description Default Parameter Type
condition The condition under which the rows are to be kept, entered as a formula. formula
Examples:

Consider the dataset:

manufacturer model year displ cty hwy
audi a4 quattro 1999 1.8 16 25
audi a6 quattro 1999 2.8 15 24
audi a6 quattro 2008 4.2 16 23
audi a6 quattro 2008 4.2 16 23
chevrolet c1500 suburban 2wd 2008 6 12 17
chevrolet corvette 1999 5.7 16 26
chevrolet corvette 1999 5.7 15 23
chevrolet k1500 tahoe 4wd 1999 6.5 14 17
chevrolet malibu 1999 3.1 18 26
dodge caravan 2wd 1999 3.8 15 22
dodge caravan 2wd 2008 4 16 23
dodge durango 4wd 2008 5.7 13 18
dodge durango 4wd 2008 5.7 13 18
dodge ram 1500 pickup 4wd 1999 5.9 11 15
ford expedition 2wd 1999 4.6 11 17
ford explorer 4wd 1999 5 13 17
ford f150 pickup 4wd 1999 4.2 14 17
ford f150 pickup 4wd 1999 4.2 14 17
ford f150 pickup 4wd 2008 4.6 13 17
ford mustang 2008 4 17 26

where cty and year are of Numeric type.

Given the formula 'AND(year > 2000, cty >= 15)', the result will be:

manufacturer model year displ cty hwy
audi a6 quattro 2008 4.2 16 23
audi a6 quattro 2008 4.2 16 23
dodge caravan 2wd 2008 4 16 23
ford mustang 2008 4 17 26
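
The formula 'AND(year > 2000, cty >= 15)' corresponds to a plain boolean filter; a pandas sketch on a couple of rows (illustrative only):

import pandas as pd

df = pd.DataFrame({"manufacturer": ["audi", "ford"],
                   "year": [2008, 1999],
                   "cty": [16, 11]})

# Keep only rows where both conditions hold
kept = df[(df["year"] > 2000) & (df["cty"] >= 15)]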

Join

Look up aggregated columns from another dataset

Unique identifier (API): aggregate_lookup

Perform aggregation on another dataset and lookup the content of the summarized table

Examples:

Given the primary dataset

grp
1
3
3

and a secondary dataset that we wish to summarize and then perform a lookup of the summarized info:

grp val val1 bool1 datetimes
1 0.45 1 True 2023-11-17T13:08:29
1 0.45 2 True 2023-11-17T13:08:29
3 0.45 3 True 2023-11-17T13:08:29
3 0.45 3.5 True 2023-08-03T08:48:29
3 1 3 False 2022-05-29T03:21:59
3 1 4 False 2022-05-29T03:21:59
2 10 5 False 2022-05-29T03:21:59
2 100 6 False 2022-05-29T03:21:59

If we perform the below aggregations on the secondary table and then lookup the results using grp as the lookup key

  • mean(val)
  • sum(val1)
  • all(bool1)
  • any(bool1)
  • approx_count_distinct(val, 0)
  • approx_count_distinct(bool1, 0)
  • earliest(datetimes, 0)
  • latest(datetimes, 0)
  • approx_median(val)
  • approx_quantile(val)
  • count(bool1, "non_null")
  • count(bool1, "null")

we get

grp avg_val sum_val1 all_bool1 any_bool1 approx_count_distinct_val approx_count_distinct_bool1 earliest_datetime latest_datetime median_val quantile_val count_bool1 count_nulls_bool1
1 0.45 3.0 True True 1 1 2023-11-17 13:08:29 2023-11-17 13:08:29 0.45 0.45 2 0
3 0.725 13.5 False True 2 2 2022-05-29T03:21:59 2023-08-03T08:48:29 0.45 0.45 4 0
3 0.725 13.5 False True 2 2 2022-05-29T03:21:59 2023-08-03T08:48:29 0.45 0.45 4 0

Join

Unique identifier (API): join

Joins another dataset to current dataset given one or more join keys, retaining the specified selection of columns from both datasets.

Parameters Parameter Description Default Parameter Type
join_type Specifies how to perform the join. single-select
Examples:

For the following input tables:

Primary Table: (2 regular columns, 1 column group with 2 components)

Id Name __group_Like_0 __group_Like_1
1 Patrick Apple Milk
2 Albert Pear Cheese
3 Maria Watermelon Chickpeas
4 Darwin Grapes Beans
5 Elizabeth Kiwi Softpeas

Secondary Table: (2 regular columns)

User_ID Like
3 Stars
1 Climbing
1 Code
6 Rugby
4 Apples

We perform a Join with the following parameters:

  • join_keys == 'Id' key for primary table.
  • join_keys == 'User_ID' key for secondary table
  • selected_columns == ['Id', 'Name', 'Like'] for the primary table.
  • selected_columns == 'Like' for the secondary table
  • join_type == 'Outer Join'

the expected output is:

Id Name __group_primary_Like_0 __group_primary_Like_1 secondary_Like
null null null null Rugby
5 Elizabeth Kiwi Softpeas null
1 Patrick Apple Milk Climbing
1 Patrick Apple Milk Code
3 Maria Watermelon Chickpeas Stars
2 Albert Pear Cheese null
4 Darwin Grapes Beans Apples

Notice that in this example the user performed an Outer Join but opted to drop the User_ID column of the secondary table. This resulted in the first row of the output showing an Id of null (User_ID == 6 for that row; this value did not exist in the primary table), and thus the result for this row isn't very usable.
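
For reference, the join itself behaves like an outer merge on the two keys followed by dropping the secondary key; a pandas sketch with simplified columns (column groups omitted):

import pandas as pd

primary = pd.DataFrame({"Id": [1, 2, 3, 4, 5],
                        "Name": ["Patrick", "Albert", "Maria",
                                 "Darwin", "Elizabeth"]})
secondary = pd.DataFrame({"User_ID": [3, 1, 1, 6, 4],
                          "Like": ["Stars", "Climbing", "Code",
                                   "Rugby", "Apples"]})

# join_type = 'Outer Join' on Id / User_ID; dropping User_ID afterwards
# reproduces the null Id for the unmatched User_ID == 6 row
out = primary.merge(secondary, left_on="Id", right_on="User_ID",
                    how="outer").drop(columns="User_ID")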

Look up columns from another dataset

Unique identifier (API): lookup_from

Creates specified columns by looking them up from a second dataset. Results in null values if there is no match in the second dataset for a particular record.

Parameters Parameter Description Default Parameter Type
multi_match_resolution Which row to use when multiple rows match in the other dataset under the specified criteria. last single-select
Examples:

For the following input tables:

Primary Table: (2 regular columns, 1 column group with 2 components)

Id Name __group_Like_0 __group_Like_1
1 Patrick Apple Milk
2 Albert Pear Cheese
3 Maria Watermelon Chickpeas
4 Darwin Grapes Beans
5 Elizabeth Kiwi Softpeas

Secondary Table: (2 regular columns)

User_ID Like
3 Stars
1 Climbing
1 Code
6 Rugby
4 Apples

We perform a Lookup with the following parameters:

  • join_criteria == 'Id' key for primary table.
  • join_criteria == 'User_ID' key for secondary table
  • lookup == 'Like' for the secondary table
  • order_secondary_by == 'Like' for the secondary table
  • Name prefixes for the current and other datasets as ['left_', 'right_']

the expected output is:

Id Name __group_left_Like_0 __group_left_Like_1 right_Like
null null null null Rugby
5 Elizabeth Kiwi Softpeas null
1 Patrick Apple Milk Climbing
3 Maria Watermelon Chickpeas Stars
2 Albert Pear Cheese null
4 Darwin Grapes Beans Apples

Note that Lookup only returns one match per row in the primary table; if there are multiple matches in the secondary table, only the first one will be returned. Hence, it is important to choose an order_secondary_by column in the secondary table.

Outlier Detection

Get anomaly score of records

Unique identifier (API): get_anomaly_scores_from_records

Computes the anomaly score of input data using unsupervised learning. Anomaly score of a record is computed as a measure of the information lost when the record is subject to dimensionality reduction by PCA.

Parameters Parameter Description Default Parameter Type Parameter Min Parameter Max Parameter Step
method The method used for anomaly detection such as dimensionality reduction using PCA. pca single-select
pca_limit_components_by Criterion for selecting the principal components. If the minimum explained variance is used, the components that contribute most to the variance are selected until the cumulative variance is not less than the specified ratio of the total variance. specified_explained_var single-select
pca_num_of_components The number of principal components to extract from the input data integer
pca_minimum_explained_variance_needed This is the threshold on the ratio of cumulative explained variance of the selected principal components to the total variance. Only the components with highest variance whose cumulative variance is at least the specified ratio of total variance will be selected 0.8 range 0.01 1.0 0.01
isolation_forest_num_estimators The number of trees in the ensemble. 100 integer
max_samples_specification_method Specifies how max samples for isolation forest is set. Fractional implies that the maximum sample size is a fraction relative to the total number of rows. Count implies that the specified number is treated as an absolute count. count single-select
isolation_forest_max_samples_fraction The fraction of the total number of samples used to train each tree. 1.0 range 0.01 1.0 0.01
isolation_forest_max_samples_count The number of samples used to train each tree. The value should be greater than 1. 256 integer
seed Positive integer. Random seed for isolation forest algorithm. Used for obtaining repeatable results. 1 integer

Indicate anomalous values

Unique identifier (API): indicate_anomalous_values

Creates boolean flag columns indicating outliers in the corresponding input columns, using the interpercentile range method. E.g. denoting Q1 as the 1st quartile, Q3 as the 3rd quartile and IQR = Q3 - Q1, values smaller than Q1 - 1.5*IQR or larger than Q3 + 1.5*IQR will be flagged as outliers.

Parameters Parameter Description Default Parameter Type Parameter Min Parameter Max Parameter Step
decision_mode How the decision is made for declaring an anomaly: For "Upper threshold": only values larger than the upper threshold will be declared as anomalies. For "Lower threshold": only values lower than the lower threshold will be declared as anomalies. For "Lower and upper threshold": values below the lower threshold and above the upper threshold will be declared as anomalies. lower_and_upper_thresholds single-select
low_percentile The percentile P1 such that all values smaller than P1 - alpha*IPR will be flagged as outliers. By default, P1 equals Q1 (first quartile). 25 range 0.0 100.0 1.0
high_percentile The percentile P2 such that all values larger than P2 + alpha*IPR will be flagged as outliers. By default, P2 equals Q3 (3rd quartile). 75 range 0.0 100.0 1.0
ipr_factor The interpercentile range (denoted as IPR) equals P2-P1, where P2 and P1 are correspondingly the high and low percentiles. The factor is defined as 'alpha' in the formulas: P1 - alpha * IPR and P2 + alpha * IPR which determine the low and high thresholds for outliers detections. Values above/below these thresholds will be flagged as outliers 1.5 float
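
With the default settings (P1 = 25th percentile, P2 = 75th percentile, alpha = 1.5) the flag reduces to the classic IQR rule; a pandas sketch for illustration:

import pandas as pd

s = pd.Series([100, 120, 80, 10000, 100])

p_low, p_high = s.quantile(0.25), s.quantile(0.75)   # low/high percentiles
ipr = p_high - p_low                                  # interpercentile range
alpha = 1.5                                           # ipr_factor

# True where the value falls outside [P1 - alpha*IPR, P2 + alpha*IPR]
is_outlier = (s < p_low - alpha * ipr) | (s > p_high + alpha * ipr)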

Reshaping

Reshape dataset into a pivot table

Unique identifier (API): pivot_v2

Creates a pivot table with specified group-by (index) columns and pivot column. Column groups with equal number of components (one per pivoted value) are created, one per specified column/aggregation function pair.

Parameters Parameter Description Default Parameter Type
pivot_values List of values that are converted to columns in the output [] string-list
Examples:

For example, given the following dataset

animal size qty desexed
cat small 95 True
cat large 39 False
cat small 50 False
cat large 37 False
cat small 12 True
dog large 84 False
dog small 71 True
dog large 87 True
dog small 7 True
dog large 42 False

with input pivot column size, grouped by column animal, values column qty with aggregation function sum and value column desexed with aggregation function all, the result is

animal __group_qty_sum_small __group_qty_sum_large __group_desexed_all_small __group_desexed_all_large
dog 78 213 True False
cat 157 76 False False
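
The same pivot can be sketched with pandas pivot_table as a rough illustration of what pivot_v2 computes (the __group_ output naming is applied by the action itself and is not reproduced here):

import pandas as pd

df = pd.DataFrame({
    "animal": ["cat"] * 5 + ["dog"] * 5,
    "size": ["small", "large", "small", "large", "small",
             "large", "small", "large", "small", "large"],
    "qty": [95, 39, 50, 37, 12, 84, 71, 87, 7, 42],
    "desexed": [True, False, False, False, True,
                False, True, True, True, False],
})

# One output column per (values column, aggregation, pivot value) combination.
result = df.pivot_table(index="animal", columns="size",
                        values=["qty", "desexed"],
                        aggfunc={"qty": "sum", "desexed": "all"})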

Unpack/Explode JSON arrays into rows

Unique identifier (API): unpack_v2

Reshape a dataset by unpacking one or more JSON array columns and add a new column for the position index

Parameters Parameter Description Default Parameter Type
allow_single_quotes Whether to allow single quoted JSONs False boolean
Examples:

For example, given the following dataset

paper_id author affilliation
1 ["Sara", "Tom", "Max"] ["rmit", "monash", "rmit"]
2 ["Sara", "Max", "Lee"] ["rmit", "rmit"]

with input columns "author" and "affilliation", output column "position" for the position index, the result is

paper_id author affilliation position
1 Sara rmit 0
1 Tom monash 1
1 Max rmit 2
2 Sara rmit 0
2 Max rmit 1
2 Lee null 2
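
A rough pandas equivalent of this unpack, assuming the JSON strings have already been parsed into Python lists; the shorter array is padded with nulls so both columns explode to the same length, mirroring the example:

import pandas as pd

df = pd.DataFrame({
    "paper_id": [1, 2],
    "author": [["Sara", "Tom", "Max"], ["Sara", "Max", "Lee"]],
    "affilliation": [["rmit", "monash", "rmit"], ["rmit", "rmit"]],
})

def pad(row):
    # Pad the shorter list with None so the arrays line up element-wise.
    n = max(len(row["author"]), len(row["affilliation"]))
    for col in ("author", "affilliation"):
        row[col] = row[col] + [None] * (n - len(row[col]))
    return row

out = df.apply(pad, axis=1).explode(["author", "affilliation"]).reset_index(drop=True)
out["position"] = out.groupby("paper_id").cumcount()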

Unpivot a dataset from wide to long format

Unique identifier (API): unpivot

Unpivot a dataset from wide to long format. Input columns are converted to a single keys column and one or more values column(s) in the output. Columns other than the input columns are retained.

Text

Change text case

Unique identifier (API): change_text_case

Changes text columns to selected text case (uppercase, lowercase, etc.)

Parameters Parameter Description Default Parameter Type
case The required change in case of values where "Upper" converts all to uppercase, "Lower" converts all to lowercase and "Capitalize" converts first character to uppercase and remaining to lowercase. single-select
Examples:

For case == 'upper' and the following input table:

col_1
null
a brown fox quick jumps over the lazy dog
1234
Hello World

The expected output is:

col_1
null
A BROWN FOX QUICK JUMPS OVER THE LAZY DOG
1234
HELLO WORLD

Decode base64-encoded text

Unique identifier (API): decode_base64

Every input column is assumed to be encoded using Base64 encoding. Decodes it into the specified character set (e.g. UTF8) and stores the decoded values in the output column

Parameters Parameter Description Default Parameter Type
charset The encoding to convert the decoded bytes to. This is the charset encoding of the original strings before they were base64 encoded UTF8 single-select
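
A minimal Python sketch of the per-value decoding, assuming charset UTF8:

import base64

encoded = "aGVsbG8gd29ybGQ="
decoded = base64.b64decode(encoded).decode("utf-8")  # 'hello world'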

Extract all matches of a regular expression pattern

Unique identifier (API): extract_all_regex_matches

For each input column, creates a new column of JSONArray type containing all the sub-strings in the corresponding input matching the specified regular expression.

Parameters Parameter Description Default Parameter Type
pattern The pattern to be found, as a Perl-Compatible Regular Expression. Note that at present, only whole matches are extracted and capturing groups are not supported. string
Examples:

Given the following input dataset with column types as indicated:

('Description', 'Text')
The IP address of the router: 192.168.1.1
Go to 201 Mills St or call (404) 456-0861
The final amount of the transaction is 25,382.55
 
Spend $100 or more to enter a draw sponsored by hotels123.com to win a $20,500.00
There is no match 456here

Applying the action with

  • Input columns = ['Description']
  • Output columns = ['Matches']
  • Parameters = {"pattern": r"\b\d+\b(?!$)"}

produces the following dataset with the column types as indicated:

('Description', 'Text') ('Matches', 'JSONArray')
The IP address of the router: 192.168.1.1 ["192","168","1"]
Go to 201 Mills St or call (404) 456-0861 ["201","404","456"]
The final amount of the transaction is 25,382.55 ["25","382"]
 
Spend $100 or more to enter a draw sponsored by hotels123.com to win a $20,500.00 ["100","20","500"]
There is no match 456here []
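
Per row, this is roughly what Python's re.findall does with the example pattern above (only whole matches, no capture groups):

import re

pattern = re.compile(r"\b\d+\b(?!$)")

rows = [
    "The IP address of the router: 192.168.1.1",
    "Go to 201 Mills St or call (404) 456-0861",
    "There is no match 456here",
]
matches = [pattern.findall(text) for text in rows]
# [['192', '168', '1'], ['201', '404', '456'], []]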

Extract capture groups

Unique identifier (API): extract_capture_groups_to_multicol

Extracts one or more regex capture groups into multiple columns.

Parameters Parameter Description Default Parameter Type
pattern A regular expression pattern containing one or more capture groups string
Examples:

For the following pattern == '([aeiou].)([aeiou].)' , the following input table:

col_1 col_2
New_York bc
Lisbon df
Tokyo gh
Paris jk
null lm
Munich np

And assuming we apply the action to column (col_1), with output columns [out_a, out_b], the expected output is:

col_1 col_2 out_a out_b
New_York bc null null
Lisbon df null null
Tokyo gh null null
Paris jk ar is
null lm null null
Munich np un ic
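
A pandas sketch of the same extraction using Series.str.extract, which returns one column per capture group; the output column names are the illustrative ones used above:

import pandas as pd

df = pd.DataFrame({"col_1": ["New_York", "Lisbon", "Tokyo", "Paris", None, "Munich"]})

extracted = df["col_1"].str.extract(r"([aeiou].)([aeiou].)")
extracted.columns = ["out_a", "out_b"]
df = df.join(extracted)
# Only "Paris" -> (ar, is) and "Munich" -> (un, ic) match; the other rows are null.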

Get named entities from text

Unique identifier (API): get_named_entities_from_text

Extracts named entities mentioned in the input text column into the corresponding output text column as a JSON string. The JSON string represents a dictionary mapping entity types to the corresponding list of entity mentions in the text.

Examples:

Extracts named entities from a given text column. The following illustrates an input table with a single text column named col_text, used as the input field, and the output field col_entities_from_text:

col_text
Bill Gates addressed Microsoft.
Jack drove to Washington.
Decisions are driven by data.
''
null

The output dataframe contains the text column col_entities_from_text:

col_text col_entities_from_text
Bill Gates addressed Microsoft. {"People": ["Bill Gates"], "Organizations": ["Microsoft"]}
Jack drove to Washington. {"People": ["Jack"], "Places": ["Washington"]}
Decisions are driven by data. {}
{}
null {}

Get sentiment scores from text

Unique identifier (API): get_sentiment_scores_from_text

Analyses the sentiment expressed in the input text columns and creates 3 output columns that reflect the strength (a number between 0 and 1, where 1 is the strongest) of positive, neutral and negative sentiment expressed in the input text.

Examples:

Extracts sentiment polarity scores from a given text column into three numeric output columns. The following illustrates an input table with a single text column named 'tweets' and the output fields 'positive_sentiment', 'neutral_sentiment' and 'negative_sentiment':

tweets
Thanks for the follow-up. That's why you're the best!
It was a horrible experience, something I would never recommend
''
null

The output dataframe contains the text columns 'positive_sentiment', 'neutral_sentiment' and 'negative_sentiment':

tweets positive_sentiment neutral_sentiment negative_sentiment
Thanks for the follow-up. That's why you're the best! 0.514 0.486 0.0
It was a horrible experience, something I would never recommend 0.0 0.588 0.412
'' 0.0 0.0 0.0
null 0.0 0.0 0.0

Normalize text content

Unique identifier (API): normalize_text

Normalizes a text column by performing one or more normalization steps such as removing non text content or stemming the words.

Parameters Parameter Description Default Parameter Type
steps_to_execute The required normalization steps ['remove_html_tags', 'replace_urls'] sequence-select
Examples:

If the input data is:

col_1
@JetBlue I was not allowed to drop my bag off for my flight that I was running 10min Late Flight for. Turns out it left over an hour Late Flight.
@VirginAmerica momma I made it! 😁😁😁
@JetBlue bag just delivered and items have been stolen. At first look - my Kate spade bag and bottle of spiced rum!!!
@united then why have I not received my call back its been 3 days...it's an infant car seat how am I supposed to go anywhere with my child
@JetBlue just a heads up, this page no longer exists http://t.co/NsJWVTTjGo
@AmericanAir Thank you, you too!
“@JetBlue: Our fleet's on fleek. http://t.co/vw2v8GVnGq” 😐😑
@AmericanAir yes I did but no response.
@SouthwestAir when is the next flight with free cocktails ?
@AmericanAir have been waiting for a call back for over two hours jan to die tomorrow at 6am Cancelled Flightled what are my options?
@AmericanAir that doesn't really answer my question. Maybe if I provide more details you can give me clarification...
@JetBlue where is our luggage?? Flight 424 has been waiting an hour for bags! #jetblue #poorservice
@united will I make it with the delay?
@SouthwestAir why can't you take me to Knoxville?? yall are my fav #help

applying the action on col_1 with only the remove_stopwords and lemmatize_words stages disabled will result in:

col_1
mention i was not allowed to drop my bag off for my flight that i was running 10min late flight for turns out it left over an hour late flight
mention momma i made it emoji emoji emoji
mention bag just delivered and items have been stolen at first look my kate spade bag and bottle of spiced rum
mention then why have i not received my call back its been number daysit s an infant car seat how am i supposed to go anywhere with my child
mention just a heads up this page no longer exists url
mention thank you you too
mention our fleet s on fleek url emoji emoji
mention yes i did but no response
mention when is the next flight with free cocktails
mention have been waiting for a call back for over two hours jan to die tomorrow at 6am cancelled flightled what are my options
mention that doesn t really answer my question maybe if i provide more details you can give me clarification
mention where is our luggage flight number has been waiting an hour for bags hashtag hashtag
mention will i make it with the delay
mention why can t you take me to knoxville yall are my fav hashtag

For the same data, if all stages including remove_stopwords and lemmatize_words are enabled, the result will be:

col_1
mention allow drop bag flight running 10min late flight turn left hour late flight
mention momma emoji emoji emoji
mention bag deliver item steal look kate spade bag bottle spice rum
mention receive number day infant car seat suppose child
mention head page long exist url
mention thank
mention fleet fleek url emoji emoji
mention yes response
mention flight free cocktail
mention wait hour jan die tomorrow 6 cancel flightled option
mention answer question maybe provide detail clarification
mention luggage flight number wait hour bag hashtag hashtag
mention delay
mention knoxville y fav hashtag

Create flag for regex match

Unique identifier (API): regex_match

Creates an output boolean column which is the result of matching the input column against a regex pattern

Parameters Parameter Description Default Parameter Type
pattern A regular expression pattern for the matching string
Examples:

For the following table:

col_1
null
That's my phone number: 054-555-111
Monday is the 1st day of the week
This should match 0490-000-111
This should not match 0490-abc-def

we apply the regex pattern '0490-\d{3}-\d{3}', resulting in the following output:

col_1 col_1_regex_match
null null
That's my phone number: 054-555-111 false
Monday is the 1st day of the week false
This should match 0490-000-111 true
This should not match 0490-abc-def false

Substitute using regex

Unique identifier (API): regex_sub

Finds and replaces string patterns in a column via Regex. Can optionally specify additional pattern to filter by, in which case the scope of the find/replace will be restricted to records that match the specified pattern in the column.

Parameters Parameter Description Default Parameter Type
match_pattern Specify this to restrict the scope of replacement .* string
sub_pattern The pattern that needs to be substituted string
sub_string String used to replace values in places where sub_pattern matched the values. string
Examples:

For the following set of parameters:

match_pattern = '0490-\d{3}-\d{3}'
sub_pattern = '054-\d{3}-\d{3}'
sub_string = 'I WAS REPLACED'
and the following input table:

col_1
null
That's my phone number: 054-555-111
Monday is the 1st day of the week
This should be substituted: 0490-000-111, 054-555-111
This should not be substituted: 0490-000-111, 054-abc-cdf

The output is:

col_1
null
That's my phone number: 054-555-111
Monday is the 1st day of the week
This should be substituted: 0490-000-111, I WAS REPLACED
This should not be substituted: 0490-000-111, 054-abc-cdf
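
A rough pandas sketch of this scoped find-and-replace with the parameter values above; the restriction to rows matching match_pattern is expressed as a boolean mask:

import pandas as pd

match_pattern = r"0490-\d{3}-\d{3}"
sub_pattern = r"054-\d{3}-\d{3}"
sub_string = "I WAS REPLACED"

s = pd.Series([
    None,
    "That's my phone number: 054-555-111",
    "This should be substituted: 0490-000-111, 054-555-111",
])

in_scope = s.str.contains(match_pattern, regex=True, na=False)
s.loc[in_scope] = s.loc[in_scope].str.replace(sub_pattern, sub_string, regex=True)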

Remove prefix

Unique identifier (API): remove_prefix

Removes a string prefix from the rows of a chosen string column, if the string is found.

Parameters Parameter Description Default Parameter Type
prefix_string The prefix string to be stripped string
Examples:

For the prefix_string == 'prefix_' and the following input table:

col_1
null
prefix_to remove
other_not to remove

The expected output is:

col_1
null
to remove
other_not to remove

Remove punctuation

Unique identifier (API): remove_punctuation

Removes the punctuation from the rows of a chosen string column. The punctuation symbols to remove are given by the user as a string

Parameters Parameter Description Default Parameter Type Suggested Values
punc_symbols The punctuation symbols to strip !"#$%&'()*+,-./:;<=>?@[]^_{|}~ string ['!"#$%&\'()*+,-./:;<=>?@[\\]^_{|}~']
Examples:

For the following input table:

col_1
null
This. Sentence! Has? Lots of, umm.. punctuation<>
Monday is the 1st day of the week
abc.edf@gmail.com

The output with stripped punctuation is:

col_1
null
This Sentence Has Lots of umm punctuation
Monday is the 1st day of the week
abcedfgmailcom

Remove suffix

Unique identifier (API): remove_suffix

Removes a string suffix from the rows of a chosen string column, if the string is found.

Parameters Parameter Description Default Parameter Type
suffix_string The suffix string to strip string
Examples:

For the suffix_string == '_suffix' and the following input table:

col_1
null
remove_suffix
other_not to remove

The expected output is:

col_1
null
remove
other_not to remove

Split on delimiter to multiple columns

Unique identifier (API): split_on_delimiter_to_multicol

For a given column, extracts text components between the delimiters into multiple columns. The delimiter can be one or more specific characters, or a regex pattern to match. The number of splits is determined by the number of provided output column names. Trailing null values are created for records containing fewer components.

Parameters Parameter Description Default Parameter Type Suggested Values
delimiter Delimiter to be used for splitting the string string [',', ';', '-', '
use_regex Boolean flag indicating whether the specified delimiter should be treated as a regex pattern or an exact match. False boolean
include_tail Boolean flag indicating whether the remainder (tail) of the string after the final split must be included in the last column True boolean
Examples:

For the following set of parameters:

parameters = {'delimiter': ':', 'use_regex': False, 'include_tail': True}

and output columns 'split_0', 'split_1', 'split_2', and the following input table:

col_1
one:two:three:four
no_split

The expected output is:

col_1 split_0 split_1 split_2
one:two:three:four one two three:four
no_split no_split null null

Whereas for include_tail = False, the expected output is:

col_1 split_0 split_1 split_2
one:two:three:four one two three
no_split no_split null null
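
A minimal pandas sketch of the first case (delimiter ':', include_tail = True, three output columns); limiting the number of splits keeps the remainder in the last column:

import pandas as pd

df = pd.DataFrame({"col_1": ["one:two:three:four", "no_split"]})

# n = number of output columns - 1, so the tail stays in the last column.
parts = df["col_1"].str.split(":", n=2, expand=True)
parts.columns = ["split_0", "split_1", "split_2"]
df = df.join(parts)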

Split to array using a delimiter or pattern

Unique identifier (API): split_to_array

Obtain array (JSON) columns resulting from splitting the selected text columns using a delimiter or a regex pattern. The number of splits can be specified to be a certain limit, otherwise the splitting of the text is indefinite until no more occurrences of the delimiter are found.

Parameters Parameter Description Default Parameter Type Suggested Values
delimiter The delimiter to be used for splitting the string. string [',', ';', '-', '
use_regex To split by a regex pattern, set this value to "True". False boolean
limit_splits Set this to "True" and set the number of splits to the desired number, if not intending to split indefinitely. False boolean
max_number_of_splits How many splits to obtain at most, if the splits are being limited. 2 integer
Examples:

Given the following input dataset with column types as indicated:

('col_1', 'Text') ('col_2', 'Numeric') ('col_3', 'Text')
1,2,3,4,5 1 a
a,b,c,d 3.14159 d
e,f,g 2.23606 x
one, two , three 1.732 no_split
no_split nan

with:

  • Input columns as ['col_1', 'col_3'],
  • Parameters delimiter=',', use_regex=False, limit_splits=True, max_number_of_splits=3,

the result will be the following dataset with column types as indicated:

('col_1', 'JSONArray') ('col_2', 'Numeric') ('col_3', 'JSONArray')
["1","2","3,4,5"] 1 ["a
["a","b","c,d"] 3.14159 ["d
["e","f","g"] 2.23606 ["x
["one"," two "," three"] 1.732 ["no_split"]
["no_split"] nan

Strip whitespaces

Unique identifier (API): strip_whitespaces

Strips whitespaces from rows of chosen Text columns.

Parameters Parameter Description Default Parameter Type
mode The method for stripping the whitespaces. Stripping can be done at the beginning, end, both of the strings, and can even strip ALL of the whitespaces. single-select
Examples:

For the following table:

+-----------------------+
|col_1                  |
+-----------------------+
| start_whitespace      |
|end_whitespace         |
| start_end_whitespaces |
|test      tab          |
+-----------------------+

and mode == 'End', the expected output is:

+----------------------+
|col_1                 |
+----------------------+
| start_whitespace     |
|end_whitespace        |
| start_end_whitespaces|
|test      tab         |
+----------------------+

Text/Data Cleaning

Map to Text

Unique identifier (API): map_to_string

Replace values of input column(s) using a given mapping.

Parameters Parameter Description Default Parameter Type
mapping Mapping of the values to be replaced to their replacement values. map
Examples:

Assume we have the following single column table:

col_1
dog
cat
null
horse
dogdog

and the following mapping:

mapping = {
    'dog': 'doggy',
    'cat': 'kitten'
}
After running the action, the expected output is:

col_1
doggy
kitten
null
horse
dogdog

where dog -> doggy, cat -> kitten and the rest of the values were unchanged since they didn't appear in the mapping dictionary.
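
An equivalent pandas sketch of this whole-value replacement (only exact matches in the mapping are replaced):

import pandas as pd

df = pd.DataFrame({"col_1": ["dog", "cat", None, "horse", "dogdog"]})
mapping = {"dog": "doggy", "cat": "kitten"}

# Series.replace substitutes exact values; unmapped values and nulls are left as-is.
df["col_1"] = df["col_1"].replace(mapping)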

Parse strings as null

Unique identifier (API): parse_strings_as_null

Parses the specified strings in given column(s) as null values

Parameters Parameter Description Default Parameter Type
strings_to_parse The list of strings that will be parsed as null string-list
Examples:

For the following input dataset:

col_x col_y
11 n/a
45 38
null 31
25 1.#QNAN
49 20

where both columns are of text type, applying the action with the parameter strings_to_parse = ['n/a', '', 'NA', 'N/A', '<NA>', 'NULL', 'nan', '1.#IND', '#NA', '-1.#QNAN', '-nan', 'NaN', '1.#QNAN', '-1.#IND', '-NaN', 'null', '#N/A', '#N/A N/A'] and both columns selected in the input will give the following result:

col_x col_y
11
45 38
31
25
49 20
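
A pandas sketch of the same parsing, assuming a subset of the strings listed above:

import pandas as pd

strings_to_parse = ["n/a", "", "NA", "N/A", "NULL", "nan", "1.#QNAN"]

df = pd.DataFrame({"col_x": ["11", "45", None, "25", "49"],
                   "col_y": ["n/a", "38", "31", "1.#QNAN", "20"]})

# Replace every listed token with a proper missing value.
df = df.replace(strings_to_parse, pd.NA)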

Text/JSON

Extract from JSON by path

Unique identifier (API): extract_from_json_by_path

Extract from JSON by the specified path. Returns null if the path isn't found.

Parameters Parameter Description Default Parameter Type
path The path of the value to extract string
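
A minimal Python sketch of path-based extraction with a dotted path; this is illustrative only and does not reflect the action's exact path syntax:

import json

record = '{"movie": {"title": "Avatar", "year": 2009}}'
path = "movie.title"

value = json.loads(record)
for key in path.split("."):
    value = value.get(key) if isinstance(value, dict) else None
# value == 'Avatar'; a missing path would leave value as None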

Get occurrence-count features

Unique identifier (API): get_occurrence_count_features

Counts the occurrences of each element of a JSON array. This achieves the same functionality as CountVectorizer in the scikit-learn package.

Examples:

For example, given the following dataset

customer_id events_A events_B events_C
1 ["login", "logout"] ["open", "close"] ["log in", "log out"]
2 ["login", "home", "logout"] ["open", "idle", "close"] ["log in", "home", "log out"]
3 ["login", "home", "logout", "login", "logout"] ["open", "idle", "close", "open", "close"] ["log in", "home", "log out", "log in", "log out"]
4 ["login", "home", "app", "home", "logout"] ["open", "idle", "active", "idle", "close"] ["log in", "home", "app", "home", "log out"]
5 {"open": 0, "close": 1} ["log in", "home", "app", "home", "log out"]

with inputs as events_A, events_B and events_C, outputs as A, B and C, the result is

customer_id events_A events_B events_C __group_A_logout __group_A_login __group_A_home __group_A_app __group_B_open __group_B_close __group_B_idle __group_B_active __group_C_log00out __group_C_log00in __group_C_00home00 __group_C_home __group_C_app __group_C_00app00
1 ["login", "logout"] ["open", "close"] ["log in", "log out"] 1 1 0 0 1 1 0 0 1 1 0 0 0 0
2 ["login", "home", "logout"] ["open", "idle", "close"] ["log in", "home", "log out"] 1 1 1 0 1 1 1 0 1 1 1 0 0 0
3 ["login", "home", "logout", "login", "logout"] ["open", "idle", "close", "open", "close"] ["log in", "home", "log out", "log in", "log out"] 2 2 1 0 2 2 1 0 2 2 0 1 0 0
4 ["login", "home", "app", "home", "logout"] ["open", "idle", "active", "idle", "close"] ["log in", "home", "app", "home", "log out"] 1 1 2 1 1 1 2 1 1 1 1 1 1 0
5 {"open": 0, "close": 1} ["log in", "home", "app", "home", "log out"] 0 0 0 0 0 0 0 0 1 1 1 1 0 1

Note that in input column events_C, the elements of the JSON arrays contain whitespace and special characters; any character other than a number or letter (e.g. spaces, $ or _) is replaced with 00 in the output column names.
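
As the description notes, this mirrors scikit-learn's CountVectorizer; a rough sketch over one of the array columns (the 00 substitution in output column names is not reproduced here):

from sklearn.feature_extraction.text import CountVectorizer

events_A = [["login", "logout"],
            ["login", "home", "logout"],
            ["login", "home", "logout", "login", "logout"]]

# Treat each array as a pre-tokenised document and count element occurrences.
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)
counts = vectorizer.fit_transform(events_A)
# vectorizer.get_feature_names_out() -> ['home', 'login', 'logout']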

Parse JSON objects in column

Unique identifier (API): parse_json_objects

Parses each value in the input column as a JSON object, whose schema is specified by the user. Each attribute is extracted as one output column. All output columns will contain null values wherever the Text string is not parsable as a JSON object. If parsable but one or more of the specified attributes is missing or of incompatible type, their corresponding output columns will contain null values.

Parameters Parameter Description Default Parameter Type
allow_single_quotes Whether to allow single quoted JSONs False boolean
Examples:

For the input dataset with column names and data types as indicated below:

movie_id title movie_cast
1 Avatar {"character": "Neytiri", "gender": 1, "name": "Zoe Saldana"}
2 Pirates of... {"character": "Will T", "gender": 2, "name": "Orlando Bloom"}
3 Spectre {"character": "Blofeld", "gender": 2, "name": "C Waltz"}
4 Spider-Man 3 {"character": "MJ", "gender": 1, "name": "Kirsten Dunst"}
5 X-Men
6 Avengers {"broken_json": "3",
7 Captain America {"character": "Nick Fury", "gender": "M", "name_missing": true}
8 The Hobbit ["Array instead of object 1", "Array instead of object 2"]

Applying this action with input column = 'movie_cast' and attribute/data type/output column trios [("character", "string", "character_name"), ("gender", "integer", "character_gender"), ("name", "string", "character_acted_by")] produces the output:

movie_id title movie_cast character_name character_gender character_acted_by
1 Avatar {"character": "Neytiri", "gender": 1, "name": "Zoe Saldana"} Neytiri 1 Zoe Saldana
2 Pirates of... {"character": "Will T", "gender": 2, "name": "Orlando Bloom"} Will T 2 Orlando Bloom
3 Spectre {"character": "Blofeld", "gender": 2, "name": "C Waltz"} Blofeld 2 C Waltz
4 Spider-Man 3 {"character": "MJ", "gender": 1, "name": "Kirsten Dunst"} MJ 1 Kirsten Dunst
5 X-Men nan
6 Avengers {"broken_json": "3", nan
7 Captain America {"character": "Nick Fury", "gender": "M", "name_missing": true} Nick Fury nan
8 The Hobbit ["Array instead of object 1", "Array instead of object 2"] nan
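
A hedged Python sketch of the per-attribute extraction with type checking; the helper below is illustrative, not the action's implementation:

import json
import pandas as pd

def parse_attr(value, attr, expected_type):
    # Return the attribute if the cell parses as a JSON object and the
    # attribute exists with the expected type; otherwise return None.
    try:
        obj = json.loads(value)
    except (TypeError, ValueError):
        return None
    if not isinstance(obj, dict):
        return None
    out = obj.get(attr)
    return out if isinstance(out, expected_type) else None

df = pd.DataFrame({"movie_cast": [
    '{"character": "Neytiri", "gender": 1, "name": "Zoe Saldana"}',
    '{"broken_json": "3",',
    '["Array instead of object 1", "Array instead of object 2"]',
]})
df["character_name"] = df["movie_cast"].map(lambda v: parse_attr(v, "character", str))
df["character_gender"] = df["movie_cast"].map(lambda v: parse_attr(v, "gender", int))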

Text/NLP

Get topic distribution probabilities from text

Unique identifier (API): get_topic_distribution_from_text

Analyses the input text columns and extracts underlying topics in the text. Each record is considered a text document that has originated from the underlying topics. The output numeric column group contains the probability distribution over the identified topics for each input record (document)

Parameters Parameter Description Default Parameter Type Parameter Min Parameter Max Parameter Step
num_topics The number of different topics that are assumed to be present in the input text column. 5 range 2.0 500.0 1.0
max_iterations Sets the maximum number of iterations that can be run by the algorithm for extracting the underlying topics 20 range 15.0 30.0 1.0
seed Random seed for topic modeling algorithms. Used for obtaining repeatable results. integer
Examples:

For the following input:

text
you ever wanted to try some yourself
not only a fun read but also a pretty infographic
something not understandable by humans
text of all jane austens works if
not only a fun read but also a pretty infographic

and the parameter values: num_topics == 2 (Number of topics to be extracted)

We get the following output:

text __group_text_topics_0 __group_text_topics_1
you ever wanted to try some yourself 0.400954 0.599046
not only a fun read but also a pretty infographic 0.239050 0.760950
something not understandable by humans 0.211497 0.788503
text of all jane austens works if 0.435378 0.564622
not only a fun read but also a pretty infographic 0.434958 0.565042
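
An illustrative scikit-learn sketch of the same idea (count vectorization followed by LDA); the action's own algorithm and the exact probabilities above will differ:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["you ever wanted to try some yourself",
        "not only a fun read but also a pretty infographic",
        "something not understandable by humans",
        "text of all jane austens works if",
        "not only a fun read but also a pretty infographic"]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, max_iter=20, random_state=0)
topic_probs = lda.fit_transform(counts)  # one probability distribution per document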

Get TF-IDF features from text

Unique identifier (API): tf_idf

Applies TF-IDF + dimensionality reduction

Parameters Parameter Description Default Parameter Type Parameter Min Parameter Max Parameter Step
post_processing_method Algorithm for reducing the dimensionality of the TF-IDF output truncated_svd single-select
num_output_dimensions The number of dimensions after post processing of TF-IDF output. For example, if the post-processing method is truncated SVD, and 100 output dimensions are chosen, then the dimensions retained after SVD truncation correspond to 100 largest singular values. 50 range 2.0 100.0 1.0
hashing_tf_num_features_method How the number of features produced by the Hashing TF stage is determined. The larger the vocabulary size of the corpus, the more the number of hashing features needed for fidelity. If set to 'Automatic', the number of features to be produced by the Hashing TF stage is set to the smallest power of 2 that is closest to the estimated vocabulary size. auto single-select
hashing_tf_num_features The cardinality of the set of possible hash values output by the hash function. The hash function is used to convert a document (text string) into sparse vectors after tokenization. This can be used to control the number of dimensions output by TF-IDF before dimensionality reduction. Smaller numbers result in higher chance of two terms being confused, while larger numbers result in longer computation times. It is advisable to use a power of two for the parameter, otherwise the features will not be mapped evenly to the columns. 4096 range 32.0 65536.0 32.0
Examples:

For the following input:

x
cant say for sure what they used but
there is an r library that includes the
text of all jane austens works if
you ever wanted to try some yourself
they probably got the idea
from
which uses jane austens work extensively
as an ongoing example when
showcasing the tidytext library
the specific library of jane
austens work is library
im guessing they used tidytext
and janeaustenr in r, then used a slightly
different analysis
using the results from r
in tableau or another data-viz
heavy program to make it palatable for
readers
pun intended sorry who has already
done a lot of
really cool analysis like this
they claimed in the text that
they used pca
but what they dont say is how they
created the matrix that was fed into the pca

and the parameter values num_output_dimensions == 2 (reducing the output to two dimensions) and hashing_tf_num_features == 1024 (the number of hashing features the text is converted into)

We get the following output:

x __group_x_encoded_0 __group_x_encoded_1
cant say for sure what they used but -3.6428402533142896 0.4975849683664332
there is an r library that includes the 0.7746395720660133 0.5766808957643564
text of all jane austens works if 0.8119351913774921 0.317909114966179
you ever wanted to try some yourself 0.7487119280397619 -0.8721857902857719
they probably got the idea -0.14175253549746752 0.5342944067879885
from 0.03192967086734439 0.06422168591488635
which uses jane austens work extensively 0.29369225493121526 0.8832400029096568
as an ongoing example when 0.7196285830258167 -0.7732919475505083
showcasing the tidytext library 0.29799339616128284 0.26371593073556526
the specific library of jane 0.7639567537897723 -0.1252943669742009
austens work is library 0.024263102404871856 0.8407417872671388
im guessing they used tidytext -0.5380315745685673 0.8071242590633404
and janeaustenr in r, then used a slightly -0.7396546964226968 -1.5252802271552306
different analysis -0.0791096097809773 0.016454344219511016
using the results from r 0.39190114778641166 0.2030087287822549
in tableau or another data-viz 0.08255810541303524 0.28434896277132304
heavy program to make it palatable for -1.0379890382908399 -0.9359229625044558
readers 0.022451788887236538 0.15649840774931126
pun intended sorry who has already 0.12645774034218657 0.5764433381841557
done a lot of 0.0645538301832265 -1.8895773134920169
really cool analysis like this -0.8499956685146961 -0.6456009071301254
they claimed in the text that 0.31310522183662226 0.566252028711451
they used pca -0.26125074206706816 0.5402513332430933
but what they dont say is how they -3.1418969765868843 1.9805533718325532
created the matrix that was fed into the pca 1.3754363219912489 0.2072323270271102
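
A hedged scikit-learn sketch of the same pipeline shape (hashing term frequencies, IDF weighting, truncated SVD); the action runs on its own engine, so the numbers above will not be reproduced exactly:

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.decomposition import TruncatedSVD

docs = ["cant say for sure what they used but",
        "there is an r library that includes the",
        "text of all jane austens works if",
        "you ever wanted to try some yourself",
        "they probably got the idea"]

pipeline = make_pipeline(
    HashingVectorizer(n_features=1024, alternate_sign=False),  # term frequencies
    TfidfTransformer(),                                        # IDF weighting
    TruncatedSVD(n_components=2, random_state=0),              # reduce to 2 dimensions
)
encoded = pipeline.fit_transform(docs)  # shape (5, 2)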

Text/Reshaping

Split by delimiter and unpack into rows

Unique identifier (API): split_and_unpack

Splits the text content in the input column by the specified delimiter and unpacks the result into multiple rows, creating an additional column for position index. The delimiter can be one or more specific characters, or a regex pattern to match.

Parameters Parameter Description Default Parameter Type Suggested Values
delimiter Delimiter to be used for splitting the string string [',', ';', '-', '
use_regex Boolean flag indicating whether the specified delimiter should be treated as a regex pattern or an exact match. False boolean
Examples:

Given this input dataset:

('col_1', 'Text') ('col_2', 'Numeric')
star trek 2.71828
game of thrones 3.14159
sein,field 2.23606
the simpsons 1.732
the big bang theory nan

with the column to split as "col_1" and the additional split position column name provided as "split_position" and use_regex=True, the output dataset will be:

('col_1', 'Text') ('col_2', 'Numeric') ('split_position', 'Numeric')
star 2.71828 0
trek 2.71828 1
game 3.14159 0
of 3.14159 1
thrones 3.14159 2
sein 2.23606 0
field 2.23606 1
the 1.732 0
simpsons 1.732 1
the nan 0
big nan 1
bang nan 2
theory nan 3

Text/Web

Extract URL parts from URL Text string

Unique identifier (API): extract_url_parts

Extracts URL components from a given column containing URL strings

Time Series

Resample data into a regular time series

Unique identifier (API): resample_time_series_v2

Resample data with a time-index column into a regular time series, with a specified aggregation function for each input column. Null-value interpolation is applied.

Parameters Parameter Description Default Parameter Type
sampling_interval Interval/Duration between successive records integer
time_unit Unit for the time interval day single-select
Examples:

For example, given the following dataset

receipt_time qty sku expected_receipt_time
null 30 A231 2012-06-12 02:00:00
2012-06-13 00:10:00 40 A351 2012-06-13 00:00:00
2012-06-13 00:30:00 100 A351 2012-06-13 00:30:00
2012-06-13 20:45:20 29 A456 2012-06-13 00:30:00
2012-06-15 21:40:03 50 A451 2012-06-16 01:30:00
2012-06-17 05:41:23 80 A221 2012-06-20 08:30:00
2012-06-17 11:10:53 70 A221 2012-06-16 20:00:00

with time-index column receipt_time, for parameters sampling_interval = 1, time_unit = "day", input columns qty, sku, expected_receipt_time and the corresponding aggregation functions "sum", "topk" (k_most_frequent = 3), "latest", the result is as follows

receipt_time qty sku expected_receipt_time
2012-06-14 00:00:00 169 ["A351","A456"] 2012-06-13 00:30:00
2012-06-15 00:00:00 null null null
2012-06-16 00:00:00 50 ["A451"] 2012-06-16 01:30:00
2012-06-17 00:00:00 null null null
2012-06-18 00:00:00 150 ["A221"] 2012-06-20 08:30:00
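
A rough pandas analogue of the qty aggregation only (the per-column aggregation functions and null interpolation of the action are not reproduced, and the bucket labelling here is an assumption):

import pandas as pd

df = pd.DataFrame({
    "receipt_time": pd.to_datetime(["2012-06-13 00:10:00", "2012-06-13 00:30:00",
                                    "2012-06-13 20:45:20", "2012-06-15 21:40:03"]),
    "qty": [40, 100, 29, 50],
})

# Aggregate qty into regular 1-day buckets keyed on the time-index column.
daily = df.set_index("receipt_time").resample("1D")["qty"].sum()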

Window Function

Compute window functions

Unique identifier (API): compute_window_function

Compute functions such as cumulative sum, rolling mean, etc. by specifying a window and desired functions with their inputs. Unbounded, shrinking, expanding and rolling windows are supported, with optional partitioning and ordering

Parameters Parameter Description Default Parameter Type
window_start Whether to set window start unbounded or a specified position relative to current row, with unbounded by default unbounded single-select
start_position Position from which window starts, relative to the current row (0). Negative and positive numbers indicate the number of rows that precede and follow the current row respectively. integer
window_end Whether to set window end unbounded or a specified position relative to current row, with unbounded by default unbounded single-select
end_position Position from which window ends, relative to the current row (0). Negative and positive numbers indicate the number of rows that precede and follow the current row respectively. integer
Examples:

Given the dataset

animal colour birth qty desexed note
cat yellow 2019-01-01 00:00:00 10 True There is a story
cat blue 2019-01-03 00:00:00 9 True No story
dog blue 2019-01-02 00:00:00 4 True
cat black 2019-01-07 00:00:00 2 True No story
cat blue 2019-01-04 00:00:00 8 True There is a story
dog yellow 2019-01-05 00:00:00 1 True There is another story
dog black 2019-01-06 00:00:00 2 False There is a story
dog yellow 2019-01-08 00:00:00 2 False

for input columns qty, colour, birth, desexed, note and the corresponding aggregation functions sum, most_frequent, latest, all, count (non-null), the result is as follows

animal colour birth qty desexed note qty_sum colour_most_frequent birth_latest desexed_all note_count
dog blue 2019-01-02 00:00:00 4 True 4 blue 2019-01-02 00:00:00 True 0
dog yellow 2019-01-05 00:00:00 1 True There is another story 5 blue 2019-01-05 00:00:00 True 1
dog black 2019-01-06 00:00:00 2 False There is a story 7 black 2019-01-06 00:00:00 False 2
dog yellow 2019-01-08 00:00:00 2 False 9 yellow 2019-01-08 00:00:00 False 2
cat yellow 2019-01-01 00:00:00 10 True There is a story 10 yellow 2019-01-01 00:00:00 True 1
cat blue 2019-01-03 00:00:00 9 True No story 19 blue 2019-01-03 00:00:00 True 2
cat blue 2019-01-04 00:00:00 8 True There is a story 27 blue 2019-01-04 00:00:00 True 3
cat black 2019-01-07 00:00:00 2 True No story 29 blue 2019-01-07 00:00:00 True 4

with window of unbounded start and end position 0, partitioned by animal and ordered by birth.
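
A hedged pandas sketch of the cumulative window used in the example (unbounded start, end at the current row), partitioned by animal and ordered by birth; only the qty_sum output is reproduced:

import pandas as pd

df = pd.DataFrame({
    "animal": ["cat", "cat", "dog", "cat"],
    "birth": pd.to_datetime(["2019-01-01", "2019-01-03", "2019-01-02", "2019-01-07"]),
    "qty": [10, 9, 4, 2],
})

# Order within each partition, then apply an expanding (cumulative) aggregation.
df = df.sort_values(["animal", "birth"])
df["qty_sum"] = df.groupby("animal")["qty"].cumsum()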

Rank rows

Unique identifier (API): window_rank

Rank rows after ordering and optional partitioning

Parameters Parameter Description Default Parameter Type
rank_method Method used to get the rank rank single-select
Examples:

For example, the following dataset

+------+------+------+
|animal|colour|credit|
+------+------+------+
|cat   |blue  |-3.0  |
|cat   |blue  |-1.0  |
|cat   |blue  |5.0   |
|cat   |blue  |7.0   |
|cat   |blue  |10.0  |
|cat   |yellow|1.0   |
|cat   |yellow|4.0   |
|cat   |yellow|4.0   |
|cat   |yellow|4.0   |
|cat   |yellow|8.0   |
|dog   |blue  |9.0   |
|dog   |blue  |7.0   |
|dog   |blue  |6.0   |
|dog   |blue  |5.0   |
|dog   |blue  |-10.0 |
|dog   |yellow|null  |
|dog   |yellow|5.0   |
|dog   |yellow|8.0   |
|dog   |yellow|null  |
|dog   |yellow|1.0   |
+------+------+------+

is converted to

+------+------+------+----+
|animal|colour|credit|rank|
+------+------+------+----+
|dog   |blue  |-10.0 |1   |
|dog   |blue  |5.0   |2   |
|dog   |blue  |6.0   |3   |
|dog   |blue  |7.0   |4   |
|dog   |blue  |9.0   |5   |
|cat   |yellow|1.0   |1   |
|cat   |yellow|4.0   |2   |
|cat   |yellow|4.0   |2   |
|cat   |yellow|4.0   |2   |
|cat   |yellow|8.0   |5   |
|cat   |blue  |-3.0  |1   |
|cat   |blue  |-1.0  |2   |
|cat   |blue  |5.0   |3   |
|cat   |blue  |7.0   |4   |
|cat   |blue  |10.0  |5   |
|dog   |yellow|null  |1   |
|dog   |yellow|null  |1   |
|dog   |yellow|1.0   |3   |
|dog   |yellow|5.0   |4   |
|dog   |yellow|8.0   |5   |
+------+------+------+----+

after applying window_rank, partitioned by animal and colour and ordered by credit for option rank.
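
A pandas sketch that reproduces the rank option in the example: method="min" gives SQL RANK semantics (ties share the lowest rank and leave gaps), and na_option="top" matches null credits ranking first:

import pandas as pd

df = pd.DataFrame({
    "animal": ["cat"] * 5 + ["dog"] * 5,
    "colour": ["yellow"] * 10,
    "credit": [1.0, 4.0, 4.0, 4.0, 8.0, None, 5.0, 8.0, None, 1.0],
})

df["rank"] = (df.groupby(["animal", "colour"])["credit"]
                .rank(method="min", na_option="top"))
# cat/yellow -> 1, 2, 2, 2, 5; dog/yellow (in row order) -> 1, 4, 5, 1, 3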

Shift rows after ordering optionally with partitioning

Unique identifier (API): window_shift

Shift rows for input columns after ordering, optionally with partitioning, by a specified value indicating the number of rows to shift and the shift direction

Parameters Parameter Description Default Parameter Type
n_rows Number of rows to shift, where positive values indicate moving downwards (lag) and negative values indicate moving upwards (lead) 1 integer
Examples:

For example, given the following dataset

animal colour adopted qty vehicle
cat yellow 2020-01-01 00:00:00 3.0 car
cat yellow 2020-01-02 00:00:00 1.0 truck
cat yellow 2020-01-03 00:00:00 5.0 car
cat yellow 2020-01-04 00:00:00 7.0 van
cat yellow 2020-01-05 00:00:00 10.0 van
cat white 2020-01-06 00:00:00 1.0 car
cat white 2020-01-07 00:00:00 4.0 van
cat white 2020-01-08 00:00:00 4.0 car
cat white 2020-01-09 00:00:00 4.0 truck
cat white 2020-01-10 00:00:00 8.0 car
dog yellow 2020-01-11 00:00:00 9.0 car
dog yellow 2020-01-12 00:00:00 7.0 van
dog yellow 2020-01-13 00:00:00 6.0 car
dog yellow 2020-01-14 00:00:00 5.0 van
dog yellow 2020-01-15 00:00:00 10.0 car
dog white 2020-01-16 00:00:00 null van
dog white 2020-01-17 00:00:00 5.0 car
dog white 2020-01-18 00:00:00 8.0 car
dog white 2020-01-19 00:00:00 null truck
dog white 2020-01-20 00:00:00 1.0 car

the result is

animal colour adopted qty vehicle qty_adjusted vehicle_adjusted
cat white 2020-01-06 00:00:00 1.0 car 4.0 van
cat white 2020-01-07 00:00:00 4.0 van 4.0 car
cat white 2020-01-08 00:00:00 4.0 car 4.0 truck
cat white 2020-01-09 00:00:00 4.0 truck 8.0 car
cat white 2020-01-10 00:00:00 8.0 car null null
cat yellow 2020-01-01 00:00:00 3.0 car 1.0 truck
cat yellow 2020-01-02 00:00:00 1.0 truck 5.0 car
cat yellow 2020-01-03 00:00:00 5.0 car 7.0 van
cat yellow 2020-01-04 00:00:00 7.0 van 10.0 van
cat yellow 2020-01-05 00:00:00 10.0 van null null
dog white 2020-01-16 00:00:00 null van 5.0 car
dog white 2020-01-17 00:00:00 5.0 car 8.0 car
dog white 2020-01-18 00:00:00 8.0 car null truck
dog white 2020-01-19 00:00:00 null truck 1.0 car
dog white 2020-01-20 00:00:00 1.0 car null null
dog yellow 2020-01-11 00:00:00 9.0 car 7.0 van
dog yellow 2020-01-12 00:00:00 7.0 van 6.0 car
dog yellow 2020-01-13 00:00:00 6.0 car 5.0 van
dog yellow 2020-01-14 00:00:00 5.0 van 10.0 car
dog yellow 2020-01-15 00:00:00 10.0 car null null

with input columns qty and vehicle, partitioned by animal and colour and ordered by adopted, for shifted n_rows = -1 (lead).
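
A minimal pandas sketch of this lead operation (n_rows = -1), partitioned by animal and colour and ordered by adopted; the output column name is illustrative:

import pandas as pd

df = pd.DataFrame({
    "animal": ["cat", "cat", "cat", "dog", "dog"],
    "colour": ["yellow", "yellow", "yellow", "white", "white"],
    "adopted": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03",
                               "2020-01-16", "2020-01-17"]),
    "qty": [3.0, 1.0, 5.0, None, 5.0],
})

# shift(-1) pulls the next row's value within each partition (a lead).
df = df.sort_values("adopted")
df["qty_adjusted"] = df.groupby(["animal", "colour"])["qty"].shift(-1)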

Deprecated actions

The following actions that were available previously have now been deprecated. This means that they can no longer be found in the recipe UI, as they have been superseded by newer actions with better usability or performance. However, existing recipes that contain these actions will continue to be supported.

Column Group

Take

Unique identifier (API): take

Please use the newer version get_column_group_components

Extracts specified components from a column group. If the component is not found, returns a null column.

Parameters Parameter Description Default Parameter Type
address_type Method of selection of group column component to take single-select
index The component index (zero based) to take from the group. E.g. if the group columns are ["__group_grp_a", "__group_grp_b"] and index==1, the taken column will be "__group_grp_b". 0 integer
suffix The component suffix to take from the group. E.g. if the group columns are ["__group_grp_a", "__group_grp_b"] and suffix=="a", the taken column will be "__group_grp_a". string
Examples:

For the following set of parameters:

parameters = {'address_type': 'index', 'index': 1}
and the following input table:

__group_grp_a_0 __group_grp_a_1 __group_grp_b_0 __group_grp_b_1
1 2 1 null
Sentence A Sentence B Sentence C Sentence D

The output is:

__group_grp_a_0 __group_grp_a_1 __group_grp_b_0 __group_grp_b_1 col_from_grp_a col_from_grp_b
1 2 1 null 2 null
Sentence A Sentence B Sentence C Sentence D Sentence B Sentence D

Custom Formula

Create a new column with a formula

Unique identifier (API): create_new_column_with_formula

Please use the new version create_new_column_with_formula_v2

Creates a new column with any user-entered formula. Common Excel/SQL functions and operators for manipulating Numeric, DateTime, Text, and Boolean values are permitted. For a full list of allowed functions, see documentation.

Parameters Parameter Description Default Parameter Type
expression Any formula where symbols correspond to the input columns in lexical order. For example, if the input selection has column_B then column_A and if the formula is y / x, column_B is mapped to x and column_A is mapped to y. string
Examples:

Consider the dataset:

manufacturer model year displ cty hwy
audi a4 quattro 1999 1.8 16 25
audi a6 quattro 1999 2.8 15 24
audi a6 quattro 2008 4.2 16 23
audi a6 quattro 2008 4.2 16 23
chevrolet c1500 suburban 2wd 2008 6 12 17
chevrolet corvette 1999 5.7 16 26
chevrolet corvette 1999 5.7 15 23
chevrolet k1500 tahoe 4wd 1999 6.5 14 17
chevrolet malibu 1999 3.1 18 26
dodge caravan 2wd 1999 3.8 15 22

where cty, hwy and year are of Numeric type.

Given the input columns selected as ['cty', 'hwy'], the output column name 'hwy_minus_cty', the formula 'y - x', the result will be:

manufacturer model year displ cty hwy hwy_minus_cty
audi a4 quattro 1999 1.8 16 25 9
audi a6 quattro 1999 2.8 15 24 9
audi a6 quattro 2008 4.2 16 23 7
audi a6 quattro 2008 4.2 16 23 7
chevrolet c1500 suburban 2wd 2008 6 12 17 5
chevrolet corvette 1999 5.7 16 26 10
chevrolet corvette 1999 5.7 15 23 8
chevrolet k1500 tahoe 4wd 1999 6.5 14 17 3
chevrolet malibu 1999 3.1 18 26 8
dodge caravan 2wd 1999 3.8 15 22 7

where the new hwy_minus_cty column will be of Numeric type.

For the same input data, given the input columns selected as ['year'], the output column name 'century' the formula 'IF(x <= 2000, "20th", "21st")' the result will be:

manufacturer model year displ cty hwy century
audi a4 quattro 1999 1.8 16 25 20th
audi a6 quattro 1999 2.8 15 24 20th
audi a6 quattro 2008 4.2 16 23 21st
audi a6 quattro 2008 4.2 16 23 21st
chevrolet c1500 suburban 2wd 2008 6 12 17 21st
chevrolet corvette 1999 5.7 16 26 20th
chevrolet corvette 1999 5.7 15 23 20th
chevrolet k1500 tahoe 4wd 1999 6.5 14 17 20th
chevrolet malibu 1999 3.1 18 26 20th
dodge caravan 2wd 1999 3.8 15 22 20th

where the new century column will be of Text type.

Transform columns with a formula

Unique identifier (API): transform_columns_with_formula

Please use the new version transform_columns_with_formula_v2

Transform one or more column(s) with a user-specified formula. Common Excel/SQL functions and operators for manipulating Numeric, DateTime, Text and Boolean values are permitted. For a full list of these functions, see documentation.

Parameters Parameter Description Default Parameter Type
formula Formula where the symbol SELF refers to the column(s) to be transformed, and the remaining symbols, in lexical order, are mapped to the other input column(s), if any. For example, if the columns to be transformed are col_1 and col_2, the other input columns are col_A and col_B, and the formula is SELF + z - y, then col_i is modified as col_i + col_B - col_A for i = 1, 2 string
Examples:

For example, given the following dataset

qty_1 qty_2
10 nan
20 15
15 6
nan 1
8 8

and formula = "SELF + (y + SELF) * 10", with qty_1 as the column to be transformed and qty_2 as the other input column, the result will be

qty_1 qty_2
nan nan
370 15
225 6
nan 1
168 8

Data Cleaning/Processing

Remove duplicated rows within group

Unique identifier (API): group_by_remove_duplicates

Please use the newer version group_by

Remove duplicated rows for one or more input columns grouped by other selected column(s).

Parameters Parameter Description Default Parameter Type
keep First or last row to keep first single-select
Examples:

For example, given the following dataset

animal colour carried description qty time desexed
cat yellow van There is a story. 0.5416980271485203 2020-10-10 00:00:00 true
cat yellow van There is a story. 0.09442987061798414 2020-10-10 00:00:00 false
cat yellow car There is a story. 0.5766948082733302 2020-10-10 00:00:00 true
cat yellow van There is a story. 0.8015245589198161 2020-01-01 00:00:00 false
cat yellow van There is a story. 0.2523252855972161 2019-01-01 00:00:00 true
cat white van There is a story. 0.044270295367698886 2020-01-01 00:00:00 true
cat white car There is another story. 0.34340627533339274 2020-01-01 00:00:00 true
cat white van There is another story. 0.243885710885979 2019-01-10 00:00:00 true
cat white car There is another story. 0.9875301137422845 2020-01-01 00:00:00 false
cat white van There is a story. 0.4416194595201147 2019-01-10 00:00:00 false
dog yellow van There is another story. 0.18251646225487406 2019-01-10 00:00:00 false
dog yellow car There is a story. 0.9073423733435338 2019-01-10 00:00:00 false
dog yellow car There is another story. 0.5242232544087159 2020-01-01 00:00:00 false
dog yellow van There is a story. 0.09555633797329921 2019-01-10 00:00:00 true
dog yellow car There is a story. 0.13287387217823854 2020-10-10 00:00:00 true
dog white car There is another story. 0.8686814568933312 2020-10-02 00:00:00 true
dog white car There is another story. 0.5376552548800148 2019-01-10 00:00:00 true
dog white car There is another story. 0.6979028231254814 2019-01-10 00:00:00 false
dog white van There is a story. 0.19022329903980462 2020-01-01 00:00:00 true
dog white car There is another story. 0.856685303343519 2020-01-01 00:00:00 false

with input columns animal and colour, keep="first", the result is

animal colour carried description qty time desexed
cat yellow van There is a story. 0.5416980271485203 2020-10-10 00:00:00 true
cat white van There is a story. 0.044270295367698886 2020-01-01 00:00:00 true
dog yellow van There is another story. 0.18251646225487406 2019-01-10 00:00:00 false
dog white car There is another story. 0.8686814568933312 2020-10-02 00:00:00 true

Impute columns

Unique identifier (API): impute_column

Please use the newer version impute_numeric or impute_categorical

Impute the missing values in your data

Parameters Parameter Description Default Parameter Type
categorical_strategy Strategy for imputing Categorical columns most_frequent single-select
categorical_fill_value Filler value for Categoric columns when strategy is constant string
numeric_strategy Strategy for imputing Numeric columns mean single-select
numeric_fill_value Filler value for Numeric columns when strategy is constant float

Feature Engineering

Calculate math function

Unique identifier (API): calculate_math_function

Please use the action create_new_column_with_formula_v2

Calculate a math function on selected columns. Invalid values such as negative input for a logarithmic function result in null outputs.

Parameters Parameter Description Default Parameter Type
math_function Math function applied to columns single-select
Examples:

For example, compute the absolute value abs(x).

Calculate math function (2)

Unique identifier (API): calculate_math_function2

Please use the action create_new_column_with_formula_v2

Calculate a math function that takes two inputs. If invalid inputs are encountered (such as division by zero), the outputs will be null.

Parameters Parameter Description Default Parameter Type
math_function Math function applied to input columns single-select
Examples:

For example, compute the difference (c = a - b) or the ratio (c = a / b).

Cluster and extract features

Unique identifier (API): cluster

Please use the newer version get_cluster_features_v2

Cluster the input data using unsupervised learning algorithms. Outputs three features - the cluster ID, cluster size and Euclidean distance from the centroid of the cluster.

Parameters Parameter Description Default Parameter Type Parameter Min Parameter Max Parameter Step
algorithm The algorithm used for clustering, such as k-means or Gaussian Mixture Model. single-select
num_clusters_method The method that is utilized for calculating the number of clusters inferred single-select
num_clusters The numbers of clusters to generate. 2 range 2.0 100.0 1.0
max_num_clusters This defines the search space for the algorithm inferring the best number of clusters automatically. Setting it to a value of K will make the algorithm look for the best number of clusters in the range [2, 3, ..., K] 10 range 2.0 100.0 1.0
seed Nonnegative integer. Random seed for clustering algorithms. Used for obtaining repeatable results. 0 integer
tolerance The minimum change in the squared distances between cluster centroids between two successive iterations, for the algorithm to continue to the next iteration. If the change is below the threshold specified by this parameter, the algorithm is assumed to have converged before reaching the maximum number of iterations. A lower value for this parameter results in higher fidelity in the assignment of cluster IDs, but can make the action run slower especially on large datasets. 0.0001 float
max_iter The maximum number of iterations regardless of convergence. The higher it is the more accurate the clustering will be, but comes at the cost of a longer runtime. 20 integer
min_cluster_size The smallest size grouping that HDBSCAN considers as a cluster. Single linkage splits that contain fewer points than this will be considered points "falling out" of a cluster rather than a cluster splitting into two new clusters 5 integer
min_samples_specified If enabled, allows specifying the minimum number of samples parameter. Otherwise, will be determined automatically. False boolean
min_samples The number of samples in a neighbourhood for a point to be considered a core point. Providing a larger value for this parameter results in a more conservative clustering, which means more points will be declared as noise and clusters will be restricted to progressively more dense areas. 1 integer
cluster_selection_epsilon A distance threshold. Clusters below this value will be merged. The choice of cluster_selection_epsilon depends on the given distances between your data points. For example, set the value to 0.5 if you don't want to separate clusters that are less than 0.5 units apart. 0 float
allow_single_cluster By default HDBSCAN* will not produce a single cluster, setting this to True will override this and allow single cluster results in the case that you feel this is a valid result for your dataset. False boolean

Extract PCA Components

Unique identifier (API): extract_pca

Please use the newer version extract_pca_v2

Create new numerical features by combining existing numerical features in linear ways using PCA

Parameters Parameter Description Default Parameter Type Parameter Min Parameter Max Parameter Step
limit_components_by Criterion for selecting the principal components. If the minimum explained variance is used, the components that contribute most to the variance are selected until the cumulative variance is not less than the specified ratio of the total variance. min_explained_variance single-select
num_components The number of principal components to extract from the input data integer
min_explained_variance This is the threshold on the ratio of explained variance of a component to the total variance. Only the components with highest variance whose cumulative variance is at least the specified ratio of total variance will be selected 0.8 range 0.01 1.0 0.01
categorical_impute_method Method for imputing Categorical columns most_frequent single-select
categorical_fill_value Filler value for categorical columns when strategy is constant string
numeric_impute_method Method for imputing Numeric columns mean single-select
numeric_fill_value Filler value for Numeric columns when strategy is constant float

One hot encode categorical columns

Unique identifier (API): one_hot_encode

Please use the newer version one_hot_encode_v2

This action converts categorical columns by one-hot encoding. Missing values remain as null. Unseen values are encoded as all 0s, when a finalized recipe with this action is applied on new data

Examples:

For example, given the following dataset

col_1 col_2
A yellow
B null
B blue
A blue
E red

the result is

col_1 col_2 __group_encoded_1_0 __group_encoded_1_1 __group_encoded_1_2 __group_encoded_2_0 __group_encoded_2_1 __group_encoded_2_2
A yellow 1 0 0 0 0 1
B null 0 1 0 null null null
B blue 0 1 0 1 0 0
A blue 1 0 0 1 0 0
E red 0 0 1 0 1 0

Filtering/Custom Formula

Filter rows

Unique identifier (API): filter_rows_with_formula

Please use the new version filter_rows_with_formula_v2

Filter rows in data by keeping only rows that satisfy the condition specified in user-entered formula. Common Excel/SQL functions and operators for manipulating Numeric, DateTime, Text, and Boolean values are permitted. For a full list of allowed functions, see documentation.

NOTE: The formula entered must evaluate to a boolean value.

Parameters Parameter Description Default Parameter Type
condition The condition under which the rows are to be kept, entered as a formula. Symbols must correspond to the input columns in lexical order. For eg., if the input selection has column_B then column_A and if the formula is y >= x, column_B is mapped to x and column_A is mapped to y. string
Examples:

Consider the dataset:

manufacturer model year displ cty hwy
audi a4 quattro 1999 1.8 16 25
audi a6 quattro 1999 2.8 15 24
audi a6 quattro 2008 4.2 16 23
audi a6 quattro 2008 4.2 16 23
chevrolet c1500 suburban 2wd 2008 6 12 17
chevrolet corvette 1999 5.7 16 26
chevrolet corvette 1999 5.7 15 23
chevrolet k1500 tahoe 4wd 1999 6.5 14 17
chevrolet malibu 1999 3.1 18 26
dodge caravan 2wd 1999 3.8 15 22
dodge caravan 2wd 2008 4 16 23
dodge durango 4wd 2008 5.7 13 18
dodge durango 4wd 2008 5.7 13 18
dodge ram 1500 pickup 4wd 1999 5.9 11 15
ford expedition 2wd 1999 4.6 11 17
ford explorer 4wd 1999 5 13 17
ford f150 pickup 4wd 1999 4.2 14 17
ford f150 pickup 4wd 1999 4.2 14 17
ford f150 pickup 4wd 2008 4.6 13 17
ford mustang 2008 4 17 26

where cty and year are of Numeric type.

Given the input columns selected as ['year', 'cty'] and the formula 'AND(x > 2000, y >= 15)', the result will be:

manufacturer model year displ cty hwy
audi a6 quattro 2008 4.2 16 23
audi a6 quattro 2008 4.2 16 23
dodge caravan 2wd 2008 4 16 23
ford mustang 2008 4 17 26

Reshaping

Pivot a column with a specified aggregation

Unique identifier (API): pivot

Please use the newer version pivot_v2

Pivot a column and perform a specified aggregation on one or more numeric columns, grouped by other selected columns

Parameters Parameter Description Default Parameter Type Parameter Min Parameter Max Parameter Step
aggregation_function Aggregation function applied to input column(s) mean single-select
median_relative_error Relative error for approximating median. Must be no less than 0.001 for computational efficiency. 0.001 range 0.001 1.0 0.001
count_distinct_maximum_error Maximum estimation error allowed for approximating the number of distinct values. Must be no less than 0.01 for computational efficiency. 0.01 range 0.01 1.0 0.01
pivot_values List of values that are converted to columns in the output [] string-list
Examples:

For example, given the following dataset

animal size qty
cat small 95
cat medium 67
cat large 39
cat huge 45
cat small 50
cat medium 18
cat large 37
cat huge 22
cat small 12
cat medium 45
dog large 84
dog huge 48
dog small 71
dog medium 95
dog large 87
dog huge 35
dog small 7
dog medium 25
dog large 42
dog huge 45

with input pivot column size, values column qty and aggregation function sum, grouped by column animal, the result is

animal __group_qty_agg_huge __group_qty_agg_large __group_qty_agg_medium __group_qty_agg_small
dog 128 213 120 78
cat 67 76 130 157

Unpack values to multiple rows

Unique identifier (API): unpack

Please use the newer version unpack_v2

Reshape a dataset by unpacking one or more column group(s) and create a position indication column

Text

Extract capture groups

Unique identifier (API): extract_capture_groups

Please use the newer version extract_capture_groups_to_multicol

Extracts one or more regex capture groups into a column group

Parameters Parameter Description Default Parameter Type
pattern A regular expression pattern containing one or more capture groups string
Examples:

For the following pattern == '([aeiou].)([aeiou].)' , the following input table and assuming we apply the action over both columns ([col_1, col_2]):

col_1 col_2
New_York bc
Lisbon df
Tokyo gh
Paris jk
null lm
Munich np

The expected output is:

col_1 col_2 __group_grp_a_0 __group_grp_a_1 __group_grp_b_0 __group_grp_b_1
New_York bc null null null null
Lisbon df null null null null
Tokyo gh null null null null
Paris jk ar is null null
null lm null null null null
Munich np un ic null null
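
For illustration, a comparable extraction in pandas (a sketch, not the action's implementation) applies the pattern to a single column and returns one output column per capture group:

import pandas as pd

s = pd.Series(["New_York", "Lisbon", "Tokyo", "Paris", None, "Munich"], name="col_1")
# Two capture groups -> two extracted columns; rows without a match become null.
extracted = s.str.extract(r"([aeiou].)([aeiou].)")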

Find all matches of a regular expression

Unique identifier (API): regex_findall

Please use the new action extract_all_regex_matches

Creates an output column group containing all the strings that match the regex pattern.

Parameters Parameter Description Default Parameter Type
pattern A regular expression pattern for the matching string
Examples:

For the pattern '\d\d' (matching non-overlapping pairs of digits) and the following input table:

col_1
null
Test 54 45 Test
Dummy sentence
Just some number 00

The output is:

col_1 __group_col_1_re_findall_0 __group_col_1_re_findall_1
null null null
Test 54 45 Test 54 45
Dummy sentence null null
Just some number 00 00 null
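
A minimal pandas sketch of the same idea (illustration only; the action itself spreads the matches across the __group_col_1_re_findall_* columns shown above):

import pandas as pd

s = pd.Series([None, "Test 54 45 Test", "Dummy sentence", "Just some number 00"], name="col_1")
# Each row yields the list of non-overlapping matches, e.g. ["54", "45"];
# rows with no match yield an empty list and null input stays null.
matches = s.str.findall(r"\d\d")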

Split text column on delimiter

Unique identifier (API): split_on_delimiter

Please use either split_on_delimiter_to_multicol or split_to_array.

For a given column, extracts the text components between delimiters into a column group. The size of the column group corresponds to the largest number of components found in the column, or to the limit on the number of splits if one is set. Trailing null values are created for records containing fewer components.

Parameters Parameter Description Default Parameter Type Suggested Values
delimiter Delimiter to be used for splitting the string. string [',', ';', '-', '
delimiter_type Type of delimiter Verbatim single-select
limit_num_splits Boolean flag indicating whether to limit the number of splits. False boolean
num_of_splits An integer which controls the number of times the delimiter regex is applied. 2 integer
Examples:

For the following set of parameters:

parameters = {'delimiter': ':', 'limit_num_splits': True, 'num_of_splits': 2}

and the following input table:

col_1
one:two:three
no_split

The expected output is:

col_1 __group_col_1_split_grp_0 __group_col_1_split_grp_1
one:two:three one two:three
no_split no_split null
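
For comparison, a hedged pandas sketch (not the action's implementation): num_of_splits = 2 appears to cap the output at two components, which corresponds to n=1 in pandas.

import pandas as pd

s = pd.Series(["one:two:three", "no_split"], name="col_1")
# Split on ':' at most once, producing up to two components per row;
# missing trailing components become null.
parts = s.str.split(":", n=1, expand=True)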

Text/JSON

Split JSON array column into a column group

Unique identifier (API): split_json_array

Please use unpack_v2

Every row (or cell) of the input column is assumed to be a JSON array stored as a string. Each element of the array is extracted into a separate column component in a column group.

Parameters Parameter Description Default Parameter Type
path Path to the JSON array component via (Spark-flavor) JSON path. Defaults to $ (the top-level). $ string
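
A minimal sketch of the idea in pandas (illustration only; the input values below are hypothetical and the sketch ignores the path parameter, i.e. it assumes the default top-level $):

import json
import pandas as pd

s = pd.Series(['["a", "b", "c"]', '["x"]'], name="col_1")
# Parse each cell as a JSON array and spread its elements into separate columns,
# padding shorter arrays with null.
parsed = s.apply(json.loads)
wide = pd.DataFrame(parsed.tolist(), index=s.index)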

Time Series

Resample data into a regular time series

Unique identifier (API): resample_time_series

Please use the newer version resample_time_series_v2

Resample data with a time-index column into a regular time series, with aggregation for all columns except the nominated time index. Null-value interpolation is applied.

Parameters Parameter Description Default Parameter Type
sampling_interval Length of intervals between successive records integer
time_unit Time unit for intervals between successive records day single-select
aggregation_function Aggregation function to be applied last single-select
Examples:

For example, given the following dataset

receipt_time qty sku expected_receipt_time
null 30 A231 2012-06-12 02:00:00
2012-06-13 00:00:00 40 A351 2012-06-15 10:00:00
2012-06-13 20:45:20 29 A456 2012-06-13 00:30:00
2012-06-15 21:40:03 50 A451 2012-06-16 01:30:00
2012-06-17 05:41:23 80 A223 2012-06-16 20:30:00
null 100 A221 2012-06-20 08:30:00
2012-06-17 11:10:53 70 A264 2012-06-16 20:00:00

with time-index column receipt_time for parameters sampling_interval = 12, time_unit = "hour" and aggregation_function = "last", the result is

receipt_time qty sku expected_receipt_time
2012-06-13 00:00:00 40 A351 2012-06-15 10:00:00
2012-06-13 12:00:00 null null null
2012-06-14 00:00:00 29 A456 2012-06-13 00:30:00
2012-06-14 12:00:00 null null null
2012-06-15 00:00:00 null null null
2012-06-15 12:00:00 null null null
2012-06-16 00:00:00 50 A451 2012-06-16 01:30:00
2012-06-16 12:00:00 null null null
2012-06-17 00:00:00 null null null
2012-06-17 12:00:00 70 A264 2012-06-16 20:00:00
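
A rough pandas analogue (a sketch under the assumption that buckets are closed and labelled on the right, which reproduces the bucket labels above; it is not the action's implementation and uses only a two-row subset of the data):

import pandas as pd

df = pd.DataFrame({
    "receipt_time": pd.to_datetime(["2012-06-13 00:00:00", "2012-06-13 20:45:20"]),
    "qty": [40, 29],
    "sku": ["A351", "A456"],
})
# Drop rows with a null time index, bucket the rest into 12-hour intervals
# and take the last value in each bucket.
out = (df.dropna(subset=["receipt_time"])
         .set_index("receipt_time")
         .sort_index()
         .resample("12h", closed="right", label="right")
         .last())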

Window Function

Count the number of values over partition

Unique identifier (API): window_aggregate_count

Please use the newer version compute_rolling_window_function

This action counts the number of values in one or more input column(s) partitioned by other selected column(s).

Parameters Parameter Description Default Parameter Type
value_type What to count. The default is to count only non-null values. For approximately counting distinct values (see https://en.wikipedia.org/wiki/HyperLogLog), null values are ignored and the allowed estimation error is at most 0.05. non_null single-select
Examples:

For example, given the following dataset

colour vehicle animal description qty time verified
blue car cat There is another story. 0.9371423365841736 null true
blue van null null 0.504013991387411 2019-01-02 00:00:00 true
yellow car cat There is another story. 0.7664983791343694 2019-01-03 00:00:00 false
yellow van null null 0.391325429830197 2019-01-04 00:00:00 true
yellow car cat There is another story. null 2019-01-05 00:00:00 true
blue car null There is another story. null 2019-01-06 00:00:00 true
yellow car dog There is another story. null null true
blue van dog There is another story. 0.8223637606171873 null false
blue van cat There is another story. 0.9805952019930625 2019-01-09 00:00:00 false
yellow van dog null null 2019-01-10 00:00:00 true
blue van null null null 2019-01-11 00:00:00 true
blue car null There is another story. 0.010846767436331839 2019-01-12 00:00:00 false
blue car cat There is a story. 0.30042645923211464 2019-01-13 00:00:00 true
yellow van dog There is a story. 0.7759473050527014 2019-01-14 00:00:00 true
yellow van dog There is another story. 0.5250835307237919 2019-01-15 00:00:00 true
yellow car dog There is another story. 0.4708643894438649 null true
blue van dog null 0.8425979127421865 null false
yellow van dog There is another story. 0.9901884866319394 2019-01-18 00:00:00 true
blue car dog There is a story. 0.6662684130667034 2019-01-19 00:00:00 true
blue car cat There is another story. null 2019-01-20 00:00:00 true

with input columns animal, description, qty, time, verified partitioned by colour and vehicle for option non_null, the result is

colour vehicle animal description qty time verified animal_count description_count qty_count time_count verified_count
blue car cat There is another story. 0.9371423365841736 null true 4 6 4 5 6
blue van null null 0.504013991387411 2019-01-02 00:00:00 true 3 2 4 3 5
yellow car cat There is another story. 0.7664983791343694 2019-01-03 00:00:00 false 4 4 2 2 4
yellow van null null 0.391325429830197 2019-01-04 00:00:00 true 4 3 4 5 5
yellow car cat There is another story. null 2019-01-05 00:00:00 true 4 4 2 2 4
blue car null There is another story. null 2019-01-06 00:00:00 true 4 6 4 5 6
yellow car dog There is another story. null null true 4 4 2 2 4
blue van dog There is another story. 0.8223637606171873 null false 3 2 4 3 5
blue van cat There is another story. 0.9805952019930625 2019-01-09 00:00:00 false 3 2 4 3 5
yellow van dog null null 2019-01-10 00:00:00 true 4 3 4 5 5
blue van null null null 2019-01-11 00:00:00 true 3 2 4 3 5
blue car null There is another story. 0.010846767436331839 2019-01-12 00:00:00 false 4 6 4 5 6
blue car cat There is a story. 0.30042645923211464 2019-01-13 00:00:00 true 4 6 4 5 6
yellow van dog There is a story. 0.7759473050527014 2019-01-14 00:00:00 true 4 3 4 5 5
yellow van dog There is another story. 0.5250835307237919 2019-01-15 00:00:00 true 4 3 4 5 5
yellow car dog There is another story. 0.4708643894438649 null true 4 4 2 2 4
blue van dog null 0.8425979127421865 null false 3 2 4 3 5
yellow van dog There is another story. 0.9901884866319394 2019-01-18 00:00:00 true 4 3 4 5 5
blue car dog There is a story. 0.6662684130667034 2019-01-19 00:00:00 true 4 6 4 5 6
blue car cat There is another story. null 2019-01-20 00:00:00 true 4 6 4 5 6
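
The effect is analogous to a COUNT(...) OVER (PARTITION BY colour, vehicle) window function. A minimal pandas sketch (illustration only, with a hypothetical three-row subset):

import pandas as pd

df = pd.DataFrame({
    "colour":  ["blue", "blue", "yellow"],
    "vehicle": ["car", "van", "car"],
    "animal":  ["cat", None, "cat"],
})
# Count non-null values per (colour, vehicle) partition and attach the count to every row.
df["animal_count"] = df.groupby(["colour", "vehicle"])["animal"].transform("count")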

Get DateTime value by partition

Unique identifier (API): window_aggregate_datetime

Please use the newer version compute_rolling_window_function

This action applies a window aggregation function on one or more DateTime column(s) partitioned by other selected column(s).

Parameters Parameter Description Default Parameter Type
agg_function Aggregation function applied to input column(s) single-select
Examples:

For example, given the following dataset

animal colour time_born time_adopted
cat blue 2019-01-01 00:00:00 null
cat yellow 2019-01-02 00:00:00 2019-01-01 00:00:00
dog yellow 2019-01-03 00:00:00 2019-01-01 00:00:00
cat yellow 2019-01-04 00:00:00 null
cat yellow 2019-01-05 00:00:00 null
dog blue 2019-01-06 00:00:00 2019-01-02 00:00:00
cat blue 2019-01-07 00:00:00 2019-01-02 00:00:00
cat blue null 2019-01-02 00:00:00
dog yellow 2019-01-09 00:00:00 2019-01-02 00:00:00
cat yellow 2019-01-10 00:00:00 2019-01-02 00:00:00
dog blue null null
cat blue 2019-01-12 00:00:00 2019-01-03 00:00:00
dog blue null 2019-01-03 00:00:00
cat blue 2019-01-14 00:00:00 2019-01-03 00:00:00
cat blue null 2019-01-03 00:00:00
cat blue 2019-01-16 00:00:00 2019-01-04 00:00:00
cat blue 2019-01-17 00:00:00 2019-01-04 00:00:00
cat yellow null 2019-01-04 00:00:00
dog blue 2019-01-19 00:00:00 null
dog yellow 2019-01-20 00:00:00 2019-01-04 00:00:00

with input columns time_born and time_adopted, partitioned by animal and colour for option earliest, the result is

animal colour time_born time_adopted agg_birth agg_adoption
cat blue 2019-01-01 00:00:00 null 2019-01-01 00:00:00 2019-01-02 00:00:00
cat yellow 2019-01-02 00:00:00 2019-01-01 00:00:00 2019-01-02 00:00:00 2019-01-01 00:00:00
dog yellow 2019-01-03 00:00:00 2019-01-01 00:00:00 2019-01-03 00:00:00 2019-01-01 00:00:00
cat yellow 2019-01-04 00:00:00 null 2019-01-02 00:00:00 2019-01-01 00:00:00
cat yellow 2019-01-05 00:00:00 null 2019-01-02 00:00:00 2019-01-01 00:00:00
dog blue 2019-01-06 00:00:00 2019-01-02 00:00:00 2019-01-06 00:00:00 2019-01-02 00:00:00
cat blue 2019-01-07 00:00:00 2019-01-02 00:00:00 2019-01-01 00:00:00 2019-01-02 00:00:00
cat blue null 2019-01-02 00:00:00 2019-01-01 00:00:00 2019-01-02 00:00:00
dog yellow 2019-01-09 00:00:00 2019-01-02 00:00:00 2019-01-03 00:00:00 2019-01-01 00:00:00
cat yellow 2019-01-10 00:00:00 2019-01-02 00:00:00 2019-01-02 00:00:00 2019-01-01 00:00:00
dog blue null null 2019-01-06 00:00:00 2019-01-02 00:00:00
cat blue 2019-01-12 00:00:00 2019-01-03 00:00:00 2019-01-01 00:00:00 2019-01-02 00:00:00
dog blue null 2019-01-03 00:00:00 2019-01-06 00:00:00 2019-01-02 00:00:00
cat blue 2019-01-14 00:00:00 2019-01-03 00:00:00 2019-01-01 00:00:00 2019-01-02 00:00:00
cat blue null 2019-01-03 00:00:00 2019-01-01 00:00:00 2019-01-02 00:00:00
cat blue 2019-01-16 00:00:00 2019-01-04 00:00:00 2019-01-01 00:00:00 2019-01-02 00:00:00
cat blue 2019-01-17 00:00:00 2019-01-04 00:00:00 2019-01-01 00:00:00 2019-01-02 00:00:00
cat yellow null 2019-01-04 00:00:00 2019-01-02 00:00:00 2019-01-01 00:00:00
dog blue 2019-01-19 00:00:00 null 2019-01-06 00:00:00 2019-01-02 00:00:00
dog yellow 2019-01-20 00:00:00 2019-01-04 00:00:00 2019-01-03 00:00:00 2019-01-01 00:00:00
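
A minimal pandas sketch of the same idea (illustration only; "earliest" is taken to be the minimum DateTime within each partition):

import pandas as pd

df = pd.DataFrame({
    "animal": ["cat", "cat", "dog"],
    "colour": ["blue", "blue", "yellow"],
    "time_born": pd.to_datetime(["2019-01-07", "2019-01-01", "2019-01-03"]),
})
# Earliest time_born per (animal, colour) partition, broadcast back to every row.
df["agg_birth"] = df.groupby(["animal", "colour"])["time_born"].transform("min")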

Make a boolean indicator over partition

Unique identifier (API): window_aggregate_indicate_boolean

Please use the newer version compute_rolling_window_function

This action applies a window aggregation function as an overall indicator on one or more boolean columns partitioned by other selected column(s).

Parameters Parameter Description Default Parameter Type
indicator_function Indicator function applied to input column(s) single-select
Examples:

For example, given the following dataset

animal colour adopted registered
cat blue true true
dog yellow true false
cat yellow false true
cat blue false true
dog blue false true
dog blue false false
cat yellow true true
dog yellow true true
dog yellow true true
dog blue true false
dog yellow false false
dog yellow false false
dog blue false true
cat yellow true false
cat yellow true true
dog blue false false
dog yellow true true
dog blue false false
cat yellow false false
dog blue false false

with input columns adopted and registered, partitioned by animal and colour for option any, the result is

animal colour adopted registered agg_adopted agg_registered
cat blue true true true true
dog yellow true false true true
cat yellow false true true true
cat blue false true true true
dog blue false true true true
dog blue false false true true
cat yellow true true true true
dog yellow true true true true
dog yellow true true true true
dog blue true false true true
dog yellow false false true true
dog yellow false false true true
dog blue false true true true
cat yellow true false true true
cat yellow true true true true
dog blue false false true true
dog yellow true true true true
dog blue false false true true
cat yellow false false true true
dog blue false false true true
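
A minimal pandas sketch (illustration only): "any" is true for every row of a partition that contains at least one true value.

import pandas as pd

df = pd.DataFrame({
    "animal":  ["cat", "dog", "cat"],
    "colour":  ["blue", "yellow", "blue"],
    "adopted": [True, True, False],
})
# Partition-wide OR of the boolean column, broadcast back to every row.
df["agg_adopted"] = df.groupby(["animal", "colour"])["adopted"].transform("any")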

Aggregate numeric column(s) over partition

Unique identifier (API): window_aggregate_numeric

Please use the newer version compute_rolling_window_function

This action applies a window aggregation function on one or more numeric columns partitioned by other selected column(s).

Parameters Parameter Description Default Parameter Type Parameter Min Parameter Max Parameter Step
agg_function Aggregation function applied to input column(s) single-select
relative_error Relative error for approximating median or a specific quantile. Must be no less than 0.001 for computational efficiency. range 0.001 1.0 0.001
quantile_to_compute Specify which quantile to compute for the quantile aggregation function. Must be a number between 0 and 1. range 0.0 1.0 0.05
Examples:

For example, given the following dataset

colour vehicle qty_1 qty_2
yellow van 0.11247070608569998 5
yellow van 0.28328160937677216 -15
blue car 0.517063017219678 19
blue van 0.5301237478473354 -9
blue van 0.3620932896270347 10
yellow car 0.4225691387726025 15
yellow car 0.290901753183477 -19
blue van 0.8861529756930526 19
yellow van 0.0663419211386469 12
blue car 0.5966970598200722 -4
blue car 0.420710037504774 -8
yellow van 0.6275038840320153 -14
yellow car 0.9498072805894898 -16
blue car 0.9065231957598833 -3
yellow van 0.6876214007342292 -14
blue van 0.12799068964691773 -16
yellow car 0.8456065368763648 -20
yellow van 0.7136156920988985 10
yellow van 0.31253912094683933 -16
yellow van 0.8326289665837018 -20

with input columns qty_1 and qty_2 and aggregation function mean partitioned by colour and vehicle, the result is

colour vehicle qty_1 qty_2 aggregation_1 aggregation_2
yellow van 0.11247070608569998 5 0.4545004126246004 -6.5
yellow van 0.28328160937677216 -15 0.4545004126246004 -6.5
blue car 0.517063017219678 19 0.6102483275761019 1.0
blue van 0.5301237478473354 -9 0.4765901757035851 1.0
blue van 0.3620932896270347 10 0.4765901757035851 1.0
yellow car 0.4225691387726025 15 0.6272211773554836 -10.0
yellow car 0.290901753183477 -19 0.6272211773554836 -10.0
blue van 0.8861529756930526 19 0.4765901757035851 1.0
yellow van 0.0663419211386469 12 0.4545004126246004 -6.5
blue car 0.5966970598200722 -4 0.6102483275761019 1.0
blue car 0.420710037504774 -8 0.6102483275761019 1.0
yellow van 0.6275038840320153 -14 0.4545004126246004 -6.5
yellow car 0.9498072805894898 -16 0.6272211773554836 -10.0
blue car 0.9065231957598833 -3 0.6102483275761019 1.0
yellow van 0.6876214007342292 -14 0.4545004126246004 -6.5
blue van 0.12799068964691773 -16 0.4765901757035851 1.0
yellow car 0.8456065368763648 -20 0.6272211773554836 -10.0
yellow van 0.7136156920988985 10 0.4545004126246004 -6.5
yellow van 0.31253912094683933 -16 0.4545004126246004 -6.5
yellow van 0.8326289665837018 -20 0.4545004126246004 -6.5
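
A minimal pandas sketch of the same computation (illustration only, on a hypothetical subset):

import pandas as pd

df = pd.DataFrame({
    "colour":  ["yellow", "yellow", "blue"],
    "vehicle": ["van", "van", "car"],
    "qty_1":   [0.11, 0.28, 0.52],
})
# Mean of qty_1 over each (colour, vehicle) partition, broadcast back to every row.
df["aggregation_1"] = df.groupby(["colour", "vehicle"])["qty_1"].transform("mean")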

Window Function/Rolling

Make a rolling boolean indicator with optional partitioning

Unique identifier (API): window_rolling_boolean

Please use the newer version compute_rolling_window_function

Compute an aggregation as a rolling indicator on one or more boolean columns, optionally partitioned by other selected column(s).

Parameters Parameter Description Default Parameter Type
indicator_function Indicator function applied to input column(s) single-select
window_start Whether to set window start unbounded or a specified position relative to current row, with unbounded by default unbounded single-select
start_position Position from which window starts, relative to the current row (0). Negative and positive numbers indicate the number of rows that precede and follow the current row respectively. integer
window_end Whether to set window end unbounded or a specified position relative to current row, with unbounded by default unbounded single-select
end_position Position from which window ends, relative to the current row (0). Negative and positive numbers indicate the number of rows that precede and follow the current row respectively. integer
Examples:

For example, given the following dataset

animal colour date cured adopted
cat blue 2019-01-01 00:00:00 false true
cat blue 2019-01-02 00:00:00 true false
dog yellow 2019-01-03 00:00:00 true true
cat yellow 2019-01-04 00:00:00 true false
cat yellow 2019-01-05 00:00:00 true false
cat blue 2019-01-06 00:00:00 false false
dog blue 2019-01-07 00:00:00 true true
dog blue 2019-01-08 00:00:00 false false
dog yellow 2019-01-09 00:00:00 false false
dog yellow 2019-01-10 00:00:00 false false
cat blue 2019-01-11 00:00:00 false false
cat yellow 2019-01-12 00:00:00 false true
dog yellow 2019-01-13 00:00:00 true false
dog blue 2019-01-14 00:00:00 true false
cat blue 2019-01-15 00:00:00 false false
dog yellow 2019-01-16 00:00:00 false false
dog yellow 2019-01-17 00:00:00 false true
cat blue 2019-01-18 00:00:00 false false
cat yellow 2019-01-19 00:00:00 false false
dog yellow 2019-01-20 00:00:00 false false

with input columns cured and adopted, partitioned by animal and colour and ordered by date for indicator function all, rolling window with unbounded start and current row as end, the result is

animal colour date cured adopted roll_cured roll_adopted
dog blue 2019-01-07 00:00:00 true true true true
dog blue 2019-01-08 00:00:00 false false false false
dog blue 2019-01-14 00:00:00 true false false false
cat yellow 2019-01-04 00:00:00 true false true false
cat yellow 2019-01-05 00:00:00 true false true false
cat yellow 2019-01-12 00:00:00 false true false false
cat yellow 2019-01-19 00:00:00 false false false false
cat blue 2019-01-01 00:00:00 false true false true
cat blue 2019-01-02 00:00:00 true false false false
cat blue 2019-01-06 00:00:00 false false false false
cat blue 2019-01-11 00:00:00 false false false false
cat blue 2019-01-15 00:00:00 false false false false
cat blue 2019-01-18 00:00:00 false false false false
dog yellow 2019-01-03 00:00:00 true true true true
dog yellow 2019-01-09 00:00:00 false false false false
dog yellow 2019-01-10 00:00:00 false false false false
dog yellow 2019-01-13 00:00:00 true false false false
dog yellow 2019-01-16 00:00:00 false false false false
dog yellow 2019-01-17 00:00:00 false true false false
dog yellow 2019-01-20 00:00:00 false false false false
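
With an unbounded start and the current row as the end, "all" behaves like a cumulative AND within each ordered partition. A minimal pandas sketch (illustration only):

import pandas as pd

df = pd.DataFrame({
    "animal": ["cat", "cat", "cat"],
    "colour": ["blue", "blue", "blue"],
    "date": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-06"]),
    "cured": [False, True, False],
})
# Once a false value appears in the partition, the rolling indicator stays false.
df = df.sort_values("date")
df["roll_cured"] = (
    df.groupby(["animal", "colour"])["cured"]
      .transform(lambda s: s.astype(int).cummin().astype(bool))
)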

Get a DateTime value over rolling windows, with optional partitioning

Unique identifier (API): window_rolling_datetime

Please use the newer version compute_rolling_window_function

Get an aggregation function on one or more DateTime columns over rolling windows, optionally partitioned by other selected column(s).

Parameters Parameter Description Default Parameter Type
aggregation_function Aggregation function applied to input column(s) single-select
window_start Whether to set window start unbounded or a specified position relative to current row, with unbounded by default unbounded single-select
start_position Position from which window starts, relative to the current row (0). Negative and positive numbers indicate the number of rows that precede and follow the current row respectively. integer
window_end Whether to set window end unbounded or a specified position relative to current row, with unbounded by default unbounded single-select
end_position Position from which window ends, relative to the current row (0). Negative and positive numbers indicate the number of rows that precede and follow the current row respectively. integer
Examples:

For example, given the following dataset

animal time_born time_adopted
cat 2019-01-01 00:00:00 2019-02-01 00:00:00
cat 2019-01-02 00:00:00 2019-03-01 00:00:00
cat 2019-01-03 00:00:00 2019-03-01 00:00:00
cat 2019-01-04 00:00:00 2020-04-01 00:00:00
cat 2019-01-05 00:00:00 2019-05-01 00:00:00
dog 2019-01-06 00:00:00 2020-06-01 00:00:00
dog 2019-01-07 00:00:00 2019-05-01 00:00:00
dog 2019-01-08 00:00:00 2020-07-01 00:00:00
dog 2019-01-09 00:00:00 2020-06-01 00:00:00
dog 2019-01-10 00:00:00 2019-03-01 00:00:00

the result is

animal time_born time_adopted agg_time_adopted
dog 2019-01-06 00:00:00 2020-06-01 00:00:00 2020-06-01 00:00:00
dog 2019-01-07 00:00:00 2019-05-01 00:00:00 2020-06-01 00:00:00
dog 2019-01-08 00:00:00 2020-07-01 00:00:00 2020-07-01 00:00:00
dog 2019-01-09 00:00:00 2020-06-01 00:00:00 2020-07-01 00:00:00
dog 2019-01-10 00:00:00 2019-03-01 00:00:00 2020-07-01 00:00:00
cat 2019-01-01 00:00:00 2019-02-01 00:00:00 2019-02-01 00:00:00
cat 2019-01-02 00:00:00 2019-03-01 00:00:00 2019-03-01 00:00:00
cat 2019-01-03 00:00:00 2019-03-01 00:00:00 2019-03-01 00:00:00
cat 2019-01-04 00:00:00 2020-04-01 00:00:00 2020-04-01 00:00:00
cat 2019-01-05 00:00:00 2019-05-01 00:00:00 2020-04-01 00:00:00

with input column time_adopted, partitioned by animal and ordered by time_born, for aggregation function latest. The rolling window has an unbounded start and ends at the current row; for example, the window for the 4th row in the input dataset

|cat   |2019-01-04 00:00:00|2020-04-01 00:00:00|

is every row from the top of the partition down to and including the current row:

|cat   |2019-01-01 00:00:00|2019-02-01 00:00:00|
|cat   |2019-01-02 00:00:00|2019-03-01 00:00:00|
|cat   |2019-01-03 00:00:00|2019-03-01 00:00:00|
|cat   |2019-01-04 00:00:00|2020-04-01 00:00:00|
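
With this expanding window, "latest" behaves like a cumulative maximum within each ordered partition. A minimal pandas sketch (illustration only, on the cat rows above):

import pandas as pd

df = pd.DataFrame({
    "animal": ["cat", "cat", "cat", "cat"],
    "time_born": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04"]),
    "time_adopted": pd.to_datetime(["2019-02-01", "2019-03-01", "2019-03-01", "2020-04-01"]),
})
# Latest adoption time seen so far within each partition, ordered by time_born.
df = df.sort_values("time_born")
df["agg_time_adopted"] = df.groupby("animal")["time_adopted"].cummax()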

Compute a rolling numeric aggregation, with optional partitioning

Unique identifier (API): window_rolling_numeric

Please use the newer version compute_rolling_window_function

Compute an aggregation function on one or more numeric columns over rolling/expanding/shrinking windows, optionally partitioned by other selected column(s). Boolean columns are accepted as well and are converted to numeric first.

Parameters Parameter Description Default Parameter Type Parameter Min Parameter Max Parameter Step
agg_function Aggregation function applied to input column(s) single-select
relative_error Relative error for approximating median or a specific quantile. Must be no less than 0.001 for computational efficiency. range 0.001 1.0 0.001
quantile_to_compute Specify which quantile to compute for the quantile rolling function. Must be a number between 0 and 1. range 0.0 1.0 0.05
window_start Whether to set window start unbounded or a specified position relative to current row, with unbounded by default unbounded single-select
start_position Position from which window starts, relative to the current row (0). Negative and positive numbers indicate the number of rows that precede and follow the current row respectively. integer
window_end Whether to set window end unbounded or a specified position relative to current row, with unbounded by default unbounded single-select
end_position Position from which window ends, relative to the current row (0). Negative and positive numbers indicate the number of rows that precede and follow the current row respectively. integer
Examples:

For example, given the following dataset

colour animal birth qty
blue cat 2019-01-01 00:00:00 0.9459212999951968
yellow dog 2019-01-02 00:00:00 0.923807614788432
yellow dog 2019-01-03 00:00:00 0.8160265494527545
blue cat 2019-01-04 00:00:00 0.21414080632280563
yellow dog 2019-01-05 00:00:00 0.31936422789647023
blue dog 2019-01-06 00:00:00 0.8090282829464444
yellow cat 2019-01-07 00:00:00 0.23359299737078676
yellow cat 2019-01-08 00:00:00 0.9244320791985904
blue cat 2019-01-09 00:00:00 0.6161751606853043
blue cat 2019-01-10 00:00:00 0.26992044210702926
yellow dog 2019-01-11 00:00:00 0.40755657119101496
yellow cat 2019-01-12 00:00:00 0.11657357211440877
yellow cat 2019-01-13 00:00:00 0.8757826191122454
yellow cat 2019-01-14 00:00:00 0.3989306451516289
blue dog 2019-01-15 00:00:00 0.09120177599596369
blue dog 2019-01-16 00:00:00 0.8228651830898742
yellow cat 2019-01-17 00:00:00 0.6855467490320383
blue cat 2019-01-18 00:00:00 0.27373326299975465
yellow cat 2019-01-19 00:00:00 0.10849671654816584
blue cat 2019-01-20 00:00:00 0.27289802805739094

the result is

colour animal birth qty roll_qty
yellow dog 2019-01-02 00:00:00 0.923807614788432 0.923807614788432
yellow dog 2019-01-03 00:00:00 0.8160265494527545 1.7398341642411865
yellow dog 2019-01-05 00:00:00 0.31936422789647023 2.059198392137657
yellow dog 2019-01-11 00:00:00 0.40755657119101496 2.466754963328672
blue cat 2019-01-01 00:00:00 0.9459212999951968 0.9459212999951968
blue cat 2019-01-04 00:00:00 0.21414080632280563 1.1600621063180023
blue cat 2019-01-09 00:00:00 0.6161751606853043 1.7762372670033066
blue cat 2019-01-10 00:00:00 0.26992044210702926 2.046157709110336
blue cat 2019-01-18 00:00:00 0.27373326299975465 2.31989097211009
blue cat 2019-01-20 00:00:00 0.27289802805739094 2.592789000167481
blue dog 2019-01-06 00:00:00 0.8090282829464444 0.8090282829464444
blue dog 2019-01-15 00:00:00 0.09120177599596369 0.900230058942408
blue dog 2019-01-16 00:00:00 0.8228651830898742 1.7230952420322823
yellow cat 2019-01-07 00:00:00 0.23359299737078676 0.23359299737078676
yellow cat 2019-01-08 00:00:00 0.9244320791985904 1.1580250765693771
yellow cat 2019-01-12 00:00:00 0.11657357211440877 1.274598648683786
yellow cat 2019-01-13 00:00:00 0.8757826191122454 2.1503812677960314
yellow cat 2019-01-14 00:00:00 0.3989306451516289 2.5493119129476605
yellow cat 2019-01-17 00:00:00 0.6855467490320383 3.2348586619796986
yellow cat 2019-01-19 00:00:00 0.10849671654816584 3.3433553785278645

with input column qty, partitioned by colour and animal, ordered by birth, for aggregation function sum, over a rolling window with an unbounded start and the current row as the end.
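
With an unbounded start and the current row as the end, "sum" reduces to a cumulative sum within each ordered partition. A minimal pandas sketch (illustration only, on a few of the yellow/dog rows):

import pandas as pd

df = pd.DataFrame({
    "colour": ["yellow", "yellow", "yellow"],
    "animal": ["dog", "dog", "dog"],
    "birth": pd.to_datetime(["2019-01-02", "2019-01-03", "2019-01-05"]),
    "qty": [0.9238, 0.8160, 0.3194],
})
# Running total of qty within each (colour, animal) partition, ordered by birth.
df = df.sort_values("birth")
df["roll_qty"] = df.groupby(["colour", "animal"])["qty"].cumsum()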