Action Catalogue
Aggregation
Aggregate columns within groups
Unique identifier (API): aggregate
Group by the selected columns and compute aggregations within each group
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
merge_back_to_original | Whether to merge the aggregation(s) back to the original data | False | boolean |
Examples:
Given the dataset
grp | val | val1 | bool1 | datetimes |
---|---|---|---|---|
1 | 0.45 | 1 | True | 2023-11-17T13:08:29 |
1 | 0.45 | 2 | True | 2023-11-17T13:08:29 |
3 | 0.45 | 3 | True | 2023-11-17T13:08:29 |
3 | 0.45 | 3.5 | True | 2023-08-03T08:48:29 |
3 | 1 | 3 | False | 2022-05-29T03:21:59 |
3 | 1 | 4 | False | 2022-05-29T03:21:59 |
2 | 10 | 5 | False | 2022-05-29T03:21:59 |
2 | 100 | 6 | False | 2022-05-29T03:21:59 |
If we group by grp and perform the following aggregations within each group:
mean(val)
sum(val1)
all(bool1)
any(bool1)
approx_count_distinct(val, 0)
approx_count_distinct(bool1, 0)
earliest(datetimes, 0)
latest(datetimes, 0)
first(val, 0)
last(val, 0)
approx_median(val)
approx_quantile(val)
count(bool1, "non_null")
count(bool1, "null")
we get
grp | avg_val | sum_val1 | all_bool1 | any_bool1 | approx_count_distinct_val | approx_count_distinct_bool1 | earliest_datetime | latest_datetime | first_val | last_val | median_val | quantile_val | count_bool1 | count_nulls_bool1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Text | Numeric | Numeric | Boolean | Boolean | Numeric | Numeric | DateTime | DateTime | Numeric | Numeric | Numeric | Numeric | Numeric | Numeric |
1 | 0.45 | 3.0 | True | True | 1 | 1 | 2023-11-17 13:08:29 | 2023-11-17 13:08:29 | 0.45 | 0.45 | 0.45 | 0.45 | 2 | 0 |
2 | 55 | 11 | False | False | 2 | 1 | 2022-05-29 03:21:59 | 2022-05-29 03:21:59 | 10 | 100 | 10.0 | 10 | 2 | 0 |
3 | 0.725 | 13.5 | False | True | 2 | 2 | 2022-05-29 03:21:59 | 2023-11-17 13:08:29 | 0.45 | 1 | 0.45 | 0.45 | 4 | 0 |
The user may optionally specify merge_back_to_original = True, in which case the group-by results are joined with the original table using the group-by columns as the join keys. Note: with merge_back_to_original = True the resultant table will have the same number of rows as the original table.
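As an illustration only, the same grouping and merge-back behaviour can be sketched in pandas (the column names below mirror the example; this is not the action's implementation):

```python
import pandas as pd

df = pd.DataFrame({
    "grp":  [1, 1, 3, 3, 3, 3, 2, 2],
    "val":  [0.45, 0.45, 0.45, 0.45, 1, 1, 10, 100],
    "val1": [1, 2, 3, 3.5, 3, 4, 5, 6],
})

# Group by `grp` and compute a few of the aggregations listed above.
agg = df.groupby("grp").agg(
    avg_val=("val", "mean"),
    sum_val1=("val1", "sum"),
    first_val=("val", "first"),
    last_val=("val", "last"),
).reset_index()

# merge_back_to_original = True: join the aggregates back onto every original
# row, so the result has the same number of rows as the original table.
merged = df.merge(agg, on="grp", how="left")
```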
Casting
Cast columns to categorical type
Unique identifier (API): cast_categorical
Cast the selected columns to categorical type.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
categories_specified | If set to 'true', categories must be specified explicitly. If set to 'false', every unique value in the column will become a category | False | boolean |
categories | Categories in the column | | string-list |
Cast categorical columns to text type
Unique identifier (API): cast_categorical_to_text
Cast selected categorical columns to text type
Cast columns to DateTime type
Unique identifier (API): cast_datetime
Cast the selected columns to DateTime type with specified format string
Parameters | Parameter Description | Default | Parameter Type | Suggested Values |
---|---|---|---|---|
datetime_format | DateTime format string in the strftime format. | %Y-%m-%dT%H:%M:%S.%fZ | string | ['%Y-%m-%dT%H:%M:%S.%f%z', '%Y-%m-%dT%H:%M:%S.%fZ', '%b %d %Y %H:%M:%S.%f', '%Y-%m-%dT%H:%M:%S.%f', '%m-%d-%Y %H:%M:%S.%f', '%d-%m-%Y %H:%M:%S.%f', '%d %b %Y %H:%M:%S.%f', '%Y/%m/%d %H:%M:%S.%f', '%Y-%m-%d %H:%M:%S.%f', '%d/%m/%Y %H:%M:%S.%f', '%Y-%m-%dT%H:%M:%S%z', '%Y-%m-%dT%H:%M:%SZ', '%d%m%Y %H:%M:%S.%f', '%m%d%Y %H:%M:%S.%f', '%Y%m%d %H:%M:%S.%f', '%m-%d-%Y %H:%M:%S', '%Y/%m/%d %H:%M:%S', '%Y-%m-%dT%H:%M:%S', '%b-%d-%Y %H:%M:%S', '%d-%m-%Y %H:%M:%S', '%d-%b-%Y %H:%M:%S', '%d/%m/%Y %H:%M:%S', '%Y-%m-%d %H:%M:%S', '%m/%d/%Y %I:%M:%S %p', '%Y%m%d %H:%M:%S', '%Y-%m-%dT%H:%MZ', '%m%d%Y %H:%M:%S', '%d%m%Y %H:%M:%S', '%m-%d-%Y', '%Y-%m-%d', '%d-%m-%Y', '%m %d %Y', '%d-%b-%Y', '%Y/%m/%d', '%d %m %Y', '%m/%d/%Y', '%Y %m %d', '%d/%m/%Y', '%d %b %Y', '%b-%d-%Y', '%b %d %Y', '%m%d%Y', '%Y%m%d', '%d%m%Y', '%m%Y', '%m %Y', '%m-%Y', '%m/%Y'] |
timezone | Specifies the timezone the DateTimes belong to | UTC | string | ['Africa/Abidjan', 'Africa/Accra', 'Africa/Addis_Ababa', 'Africa/Algiers', 'Africa/Asmara', 'Africa/Bamako', 'Africa/Bangui', 'Africa/Banjul', 'Africa/Bissau', 'Africa/Blantyre', 'Africa/Brazzaville', 'Africa/Bujumbura', 'Africa/Cairo', 'Africa/Casablanca', 'Africa/Ceuta', 'Africa/Conakry', 'Africa/Dakar', 'Africa/Dar_es_Salaam', 'Africa/Djibouti', 'Africa/Douala', 'Africa/El_Aaiun', 'Africa/Freetown', 'Africa/Gaborone', 'Africa/Harare', 'Africa/Johannesburg', 'Africa/Juba', 'Africa/Kampala', 'Africa/Khartoum', 'Africa/Kigali', 'Africa/Kinshasa', 'Africa/Lagos', 'Africa/Libreville', 'Africa/Lome', 'Africa/Luanda', 'Africa/Lubumbashi', 'Africa/Lusaka', 'Africa/Malabo', 'Africa/Maputo', 'Africa/Maseru', 'Africa/Mbabane', 'Africa/Mogadishu', 'Africa/Monrovia', 'Africa/Nairobi', 'Africa/Ndjamena', 'Africa/Niamey', 'Africa/Nouakchott', 'Africa/Ouagadougou', 'Africa/Porto-Novo', 'Africa/Sao_Tome', 'Africa/Tripoli', 'Africa/Tunis', 'Africa/Windhoek', 'America/Adak', 'America/Anchorage', 'America/Anguilla', 'America/Antigua', 'America/Araguaina', 'America/Argentina/Buenos_Aires', 'America/Argentina/Catamarca', 'America/Argentina/Cordoba', 'America/Argentina/Jujuy', 'America/Argentina/La_Rioja', 'America/Argentina/Mendoza', 'America/Argentina/Rio_Gallegos', 'America/Argentina/Salta', 'America/Argentina/San_Juan', 'America/Argentina/San_Luis', 'America/Argentina/Tucuman', 'America/Argentina/Ushuaia', 'America/Aruba', 'America/Asuncion', 'America/Atikokan', 'America/Bahia', 'America/Bahia_Banderas', 'America/Barbados', 'America/Belem', 'America/Belize', 'America/Blanc-Sablon', 'America/Boa_Vista', 'America/Bogota', 'America/Boise', 'America/Cambridge_Bay', 'America/Campo_Grande', 'America/Cancun', 'America/Caracas', 'America/Cayenne', 'America/Cayman', 'America/Chicago', 'America/Chihuahua', 'America/Costa_Rica', 'America/Creston', 'America/Cuiaba', 'America/Curacao', 'America/Danmarkshavn', 'America/Dawson', 'America/Dawson_Creek', 'America/Denver', 'America/Detroit', 'America/Dominica', 'America/Edmonton', 'America/Eirunepe', 'America/El_Salvador', 'America/Fort_Nelson', 'America/Fortaleza', 'America/Glace_Bay', 'America/Goose_Bay', 'America/Grand_Turk', 'America/Grenada', 'America/Guadeloupe', 'America/Guatemala', 'America/Guayaquil', 'America/Guyana', 'America/Halifax', 'America/Havana', 'America/Hermosillo', 'America/Indiana/Indianapolis', 'America/Indiana/Knox', 'America/Indiana/Marengo', 'America/Indiana/Petersburg', 'America/Indiana/Tell_City', 'America/Indiana/Vevay', 'America/Indiana/Vincennes', 'America/Indiana/Winamac', 'America/Inuvik', 'America/Iqaluit', 'America/Jamaica', 'America/Juneau', 'America/Kentucky/Louisville', 'America/Kentucky/Monticello', 'America/Kralendijk', 'America/La_Paz', 'America/Lima', 'America/Los_Angeles', 'America/Lower_Princes', 'America/Maceio', 'America/Managua', 'America/Manaus', 'America/Marigot', 'America/Martinique', 'America/Matamoros', 'America/Mazatlan', 'America/Menominee', 'America/Merida', 'America/Metlakatla', 'America/Mexico_City', 'America/Miquelon', 'America/Moncton', 'America/Monterrey', 'America/Montevideo', 'America/Montserrat', 'America/Nassau', 'America/New_York', 'America/Nipigon', 'America/Nome', 'America/Noronha', 'America/North_Dakota/Beulah', 'America/North_Dakota/Center', 'America/North_Dakota/New_Salem', 'America/Nuuk', 'America/Ojinaga', 'America/Panama', 'America/Pangnirtung', 'America/Paramaribo', 'America/Phoenix', 'America/Port-au-Prince', 
'America/Port_of_Spain', 'America/Porto_Velho', 'America/Puerto_Rico', 'America/Punta_Arenas', 'America/Rainy_River', 'America/Rankin_Inlet', 'America/Recife', 'America/Regina', 'America/Resolute', 'America/Rio_Branco', 'America/Santarem', 'America/Santiago', 'America/Santo_Domingo', 'America/Sao_Paulo', 'America/Scoresbysund', 'America/Sitka', 'America/St_Barthelemy', 'America/St_Johns', 'America/St_Kitts', 'America/St_Lucia', 'America/St_Thomas', 'America/St_Vincent', 'America/Swift_Current', 'America/Tegucigalpa', 'America/Thule', 'America/Thunder_Bay', 'America/Tijuana', 'America/Toronto', 'America/Tortola', 'America/Vancouver', 'America/Whitehorse', 'America/Winnipeg', 'America/Yakutat', 'America/Yellowknife', 'Antarctica/Casey', 'Antarctica/Davis', 'Antarctica/DumontDUrville', 'Antarctica/Macquarie', 'Antarctica/Mawson', 'Antarctica/McMurdo', 'Antarctica/Palmer', 'Antarctica/Rothera', 'Antarctica/Syowa', 'Antarctica/Troll', 'Antarctica/Vostok', 'Arctic/Longyearbyen', 'Asia/Aden', 'Asia/Almaty', 'Asia/Amman', 'Asia/Anadyr', 'Asia/Aqtau', 'Asia/Aqtobe', 'Asia/Ashgabat', 'Asia/Atyrau', 'Asia/Baghdad', 'Asia/Bahrain', 'Asia/Baku', 'Asia/Bangkok', 'Asia/Barnaul', 'Asia/Beirut', 'Asia/Bishkek', 'Asia/Brunei', 'Asia/Chita', 'Asia/Choibalsan', 'Asia/Colombo', 'Asia/Damascus', 'Asia/Dhaka', 'Asia/Dili', 'Asia/Dubai', 'Asia/Dushanbe', 'Asia/Famagusta', 'Asia/Gaza', 'Asia/Hebron', 'Asia/Ho_Chi_Minh', 'Asia/Hong_Kong', 'Asia/Hovd', 'Asia/Irkutsk', 'Asia/Jakarta', 'Asia/Jayapura', 'Asia/Jerusalem', 'Asia/Kabul', 'Asia/Kamchatka', 'Asia/Karachi', 'Asia/Kathmandu', 'Asia/Khandyga', 'Asia/Kolkata', 'Asia/Krasnoyarsk', 'Asia/Kuala_Lumpur', 'Asia/Kuching', 'Asia/Kuwait', 'Asia/Macau', 'Asia/Magadan', 'Asia/Makassar', 'Asia/Manila', 'Asia/Muscat', 'Asia/Nicosia', 'Asia/Novokuznetsk', 'Asia/Novosibirsk', 'Asia/Omsk', 'Asia/Oral', 'Asia/Phnom_Penh', 'Asia/Pontianak', 'Asia/Pyongyang', 'Asia/Qatar', 'Asia/Qostanay', 'Asia/Qyzylorda', 'Asia/Riyadh', 'Asia/Sakhalin', 'Asia/Samarkand', 'Asia/Seoul', 'Asia/Shanghai', 'Asia/Singapore', 'Asia/Srednekolymsk', 'Asia/Taipei', 'Asia/Tashkent', 'Asia/Tbilisi', 'Asia/Tehran', 'Asia/Thimphu', 'Asia/Tokyo', 'Asia/Tomsk', 'Asia/Ulaanbaatar', 'Asia/Urumqi', 'Asia/Ust-Nera', 'Asia/Vientiane', 'Asia/Vladivostok', 'Asia/Yakutsk', 'Asia/Yangon', 'Asia/Yekaterinburg', 'Asia/Yerevan', 'Atlantic/Azores', 'Atlantic/Bermuda', 'Atlantic/Canary', 'Atlantic/Cape_Verde', 'Atlantic/Faroe', 'Atlantic/Madeira', 'Atlantic/Reykjavik', 'Atlantic/South_Georgia', 'Atlantic/St_Helena', 'Atlantic/Stanley', 'Australia/Adelaide', 'Australia/Brisbane', 'Australia/Broken_Hill', 'Australia/Darwin', 'Australia/Eucla', 'Australia/Hobart', 'Australia/Lindeman', 'Australia/Lord_Howe', 'Australia/Melbourne', 'Australia/Perth', 'Australia/Sydney', 'Canada/Atlantic', 'Canada/Central', 'Canada/Eastern', 'Canada/Mountain', 'Canada/Newfoundland', 'Canada/Pacific', 'Europe/Amsterdam', 'Europe/Andorra', 'Europe/Astrakhan', 'Europe/Athens', 'Europe/Belgrade', 'Europe/Berlin', 'Europe/Bratislava', 'Europe/Brussels', 'Europe/Bucharest', 'Europe/Budapest', 'Europe/Busingen', 'Europe/Chisinau', 'Europe/Copenhagen', 'Europe/Dublin', 'Europe/Gibraltar', 'Europe/Guernsey', 'Europe/Helsinki', 'Europe/Isle_of_Man', 'Europe/Istanbul', 'Europe/Jersey', 'Europe/Kaliningrad', 'Europe/Kiev', 'Europe/Kirov', 'Europe/Lisbon', 'Europe/Ljubljana', 'Europe/London', 'Europe/Luxembourg', 'Europe/Madrid', 'Europe/Malta', 'Europe/Mariehamn', 'Europe/Minsk', 'Europe/Monaco', 'Europe/Moscow', 'Europe/Oslo', 'Europe/Paris', 
'Europe/Podgorica', 'Europe/Prague', 'Europe/Riga', 'Europe/Rome', 'Europe/Samara', 'Europe/San_Marino', 'Europe/Sarajevo', 'Europe/Saratov', 'Europe/Simferopol', 'Europe/Skopje', 'Europe/Sofia', 'Europe/Stockholm', 'Europe/Tallinn', 'Europe/Tirane', 'Europe/Ulyanovsk', 'Europe/Uzhgorod', 'Europe/Vaduz', 'Europe/Vatican', 'Europe/Vienna', 'Europe/Vilnius', 'Europe/Volgograd', 'Europe/Warsaw', 'Europe/Zagreb', 'Europe/Zaporozhye', 'Europe/Zurich', 'GMT', 'Indian/Antananarivo', 'Indian/Chagos', 'Indian/Christmas', 'Indian/Cocos', 'Indian/Comoro', 'Indian/Kerguelen', 'Indian/Mahe', 'Indian/Maldives', 'Indian/Mauritius', 'Indian/Mayotte', 'Indian/Reunion', 'Pacific/Apia', 'Pacific/Auckland', 'Pacific/Bougainville', 'Pacific/Chatham', 'Pacific/Chuuk', 'Pacific/Easter', 'Pacific/Efate', 'Pacific/Enderbury', 'Pacific/Fakaofo', 'Pacific/Fiji', 'Pacific/Funafuti', 'Pacific/Galapagos', 'Pacific/Gambier', 'Pacific/Guadalcanal', 'Pacific/Guam', 'Pacific/Honolulu', 'Pacific/Kiritimati', 'Pacific/Kosrae', 'Pacific/Kwajalein', 'Pacific/Majuro', 'Pacific/Marquesas', 'Pacific/Midway', 'Pacific/Nauru', 'Pacific/Niue', 'Pacific/Norfolk', 'Pacific/Noumea', 'Pacific/Pago_Pago', 'Pacific/Palau', 'Pacific/Pitcairn', 'Pacific/Pohnpei', 'Pacific/Port_Moresby', 'Pacific/Rarotonga', 'Pacific/Saipan', 'Pacific/Tahiti', 'Pacific/Tarawa', 'Pacific/Tongatapu', 'Pacific/Wake', 'Pacific/Wallis', 'US/Alaska', 'US/Arizona', 'US/Central', 'US/Eastern', 'US/Hawaii', 'US/Mountain', 'US/Pacific', 'UTC'] |
Cast specified columns to JSON array type
Unique identifier (API): cast_json_array
This action marks input columns as "JSONArray" type, implying that the values in the column are expected to be JSON strings parsable into arrays of elements with uniform type.
Cast specified columns to JSON object type
Unique identifier (API): cast_json_object
This action marks input columns as "JSONObject" type, implying that the values in the column are expected to be JSON strings parsable into objects with attributes and values.
Cast columns to numeric type
Unique identifier (API): cast_numeric
Cast the selected columns to numeric type.
Column Group
Get column group components
Unique identifier (API): get_column_group_components
Extracts specified components from a column group by their suffix. If a specified component suffix does not exist, will return a null column as the corresponding output.
Examples:
For the following input table:
dummy_col | __group_grp_a_0 | __group_grp_a_1 | __group_grp_a_2 |
---|---|---|---|
aaa | 1 | 2 | 1 |
bbb | Sentence A | Sentence B | Sentence C |
with:
* input column: 'grp_a'
* output columns: ['component_0_from_grp_a', 'component_2_from_grp_a']
* suffixes: [0, 2]
The output is:
dummy_col | __group_grp_a_0 | __group_grp_a_1 | __group_grp_a_2 | component_0_from_grp_a | component_2_from_grp_a |
---|---|---|---|---|---|
aaa | 1 | 2 | 1 | 1 | 1 |
bbb | Sentence A | Sentence B | Sentence C | Sentence A | Sentence C |
Conversion
Convert numeric column to DateTime
Unique identifier (API): numeric_to_datetime
Converts the given numeric columns into UTC timestamps, interpreting them as offset values in the given unit since the specified epoch.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
units | Specifies the unit of time that the values in the numeric column represent. | Seconds | single-select |
rel_to | Specifies the "origin" UTC DateTime. The conversion from Numeric to DateTime shall be relative to this "origin". | 1970-01-01 | string |
Examples:
For units == 'seconds', rel_to == '1970-01-01T00:00:00' and the following input table:
numeric timestamps |
---|
0 |
-86400 |
86400 |
The expected output is:
numeric timestamps |
---|
1970-01-01 00:00:00 |
1969-12-31 00:00:00 |
1970-01-02 00:00:00 |
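A minimal pandas sketch of this conversion (illustrative only, not the action's implementation):

```python
import pandas as pd

offsets = pd.Series([0, -86400, 86400])

# Interpret the numbers as seconds ("units") counted from the origin ("rel_to").
timestamps = pd.to_datetime(offsets, unit="s", origin="1970-01-01T00:00:00")
# -> 1970-01-01, 1969-12-31, 1970-01-02
```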
Custom Formula
Create a new column with a formula
Unique identifier (API): create_new_column_with_formula_v2
Creates a new column with any user-entered formula. Common Excel/SQL functions and operators for manipulating Numeric, DateTime, Text, and Boolean values are permitted. For a full list of allowed functions, see documentation.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
expression | A formula involving column names. For example: 'column_B / column_A + log(column_D * 2)' | | formula |
Examples:
Consider the dataset:
manufacturer | model | year | displ | cty | hwy |
---|---|---|---|---|---|
audi | a4 quattro | 1999 | 1.8 | 16 | 25 |
audi | a6 quattro | 1999 | 2.8 | 15 | 24 |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 |
chevrolet | c1500 suburban 2wd | 2008 | 6 | 12 | 17 |
chevrolet | corvette | 1999 | 5.7 | 16 | 26 |
chevrolet | corvette | 1999 | 5.7 | 15 | 23 |
chevrolet | k1500 tahoe 4wd | 1999 | 6.5 | 14 | 17 |
chevrolet | malibu | 1999 | 3.1 | 18 | 26 |
dodge | caravan 2wd | 1999 | 3.8 | 15 | 22 |
where cty, hwy and year are of Numeric type.
Given the formula 'hwy - cty' and the output column name 'hwy_minus_cty', the result will be:
manufacturer | model | year | displ | cty | hwy | hwy_minus_cty |
---|---|---|---|---|---|---|
audi | a4 quattro | 1999 | 1.8 | 16 | 25 | 9 |
audi | a6 quattro | 1999 | 2.8 | 15 | 24 | 9 |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 | 7 |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 | 7 |
chevrolet | c1500 suburban 2wd | 2008 | 6 | 12 | 17 | 5 |
chevrolet | corvette | 1999 | 5.7 | 16 | 26 | 10 |
chevrolet | corvette | 1999 | 5.7 | 15 | 23 | 8 |
chevrolet | k1500 tahoe 4wd | 1999 | 6.5 | 14 | 17 | 3 |
chevrolet | malibu | 1999 | 3.1 | 18 | 26 | 8 |
dodge | caravan 2wd | 1999 | 3.8 | 15 | 22 | 7 |
where the new hwy_minus_cty column will be of Numeric type.
For the same input data, the formula 'IF(year <= 2000, "20th", "21st")' with the output column name 'century' will result in:
manufacturer | model | year | displ | cty | hwy | century |
---|---|---|---|---|---|---|
audi | a4 quattro | 1999 | 1.8 | 16 | 25 | 20th |
audi | a6 quattro | 1999 | 2.8 | 15 | 24 | 20th |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 | 21st |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 | 21st |
chevrolet | c1500 suburban 2wd | 2008 | 6 | 12 | 17 | 21st |
chevrolet | corvette | 1999 | 5.7 | 16 | 26 | 20th |
chevrolet | corvette | 1999 | 5.7 | 15 | 23 | 20th |
chevrolet | k1500 tahoe 4wd | 1999 | 6.5 | 14 | 17 | 20th |
chevrolet | malibu | 1999 | 3.1 | 18 | 26 | 20th |
dodge | caravan 2wd | 1999 | 3.8 | 15 | 22 | 20th |
where the new century column will be of Text type.
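The two formulas above can be sketched in pandas as follows (illustrative only; the action itself accepts the formula text directly):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"year": [1999, 2008], "cty": [16, 16], "hwy": [25, 23]})

# 'hwy - cty' -> a Numeric output column
df["hwy_minus_cty"] = df.eval("hwy - cty")

# 'IF(year <= 2000, "20th", "21st")' -> a Text output column
df["century"] = np.where(df["year"] <= 2000, "20th", "21st")
```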
Transform columns with a formula
Unique identifier (API): transform_columns_with_formula_v2
Transforms one or more column(s) with a user-specified formula. Common Excel/SQL functions and operators for manipulating Numeric, DateTime, Text, Boolean and Categorical values as well as JSON Objects/Arrays are permitted. For a full list of these functions, see documentation.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
formula | Formula where COL/COLUMN refers to the column(s) to be transformed. For example, "col_A + col_B - $COL" | | formula |
Examples:
For example, given the following dataset
qty_1 | qty_2 |
---|---|
10 | null |
20 | 15 |
15 | 6 |
null | 1 |
8 | 8 |
and formula = "$col + (qty_2 + $col) * 10", with input column qty_1, the result will be
qty_1 | qty_2 |
---|---|
null | null |
370 | 15 |
225 | 6 |
null | 1 |
168 | 8 |
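A rough sketch of how the $col placeholder behaves, assuming it simply stands for the column being transformed (here qty_1):

```python
import pandas as pd

df = pd.DataFrame({"qty_1": [10, 20, 15, None, 8],
                   "qty_2": [None, 15, 6, 1, 8]})

# "$col + (qty_2 + $col) * 10" applied to qty_1; nulls propagate.
col = df["qty_1"]
df["qty_1"] = col + (df["qty_2"] + col) * 10
# qty_1 -> NaN, 370, 225, NaN, 168
```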
Data Cleaning/Processing
Concatenate with delimiter
Unique identifier (API): concat_cols_delim
Concatenates one or more text columns / column groups with a specified delimiter.
Parameters | Parameter Description | Default | Parameter Type | Suggested Values |
---|---|---|---|---|
delimiter | Delimiter to be appended between concatenated values | | string | [',', ';', '-', ' |
Examples:
Assuming delimiter == ', ' and the input table is (one column, one group with 2 components):
col_1 | __group_grp_0 | __group_grp_1 |
---|---|---|
Hello | World | A |
Hello | null | B |
The output will be:
col_1 | __group_grp_0 | __group_grp_1 | comb |
---|---|---|---|
Hello | World | A | Hello, World, A |
Hello | null | B | Hello, , B |
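A pandas sketch of the same concatenation, where nulls become empty strings but keep their delimiter slot (as in "Hello, , B" above); illustrative only:

```python
import pandas as pd

df = pd.DataFrame({"col_1": ["Hello", "Hello"],
                   "__group_grp_0": ["World", None],
                   "__group_grp_1": ["A", "B"]})

cols = ["col_1", "__group_grp_0", "__group_grp_1"]
df["comb"] = df[cols].fillna("").apply(", ".join, axis=1)
# ['Hello, World, A', 'Hello, , B']
```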
Copy
Unique identifier (API): copy
Creates a copy of the specified columns OR column groups
Examples:
For the following input table: (2 columns and 1 group with 2 components)
col_1 | col_2 | __group_grp_0 | __group_grp_1 |
---|---|---|---|
0.14667831357791283 | 2194986032 | 0.6765007779041725 | 0.876755413362593 |
0.8424524032441315 | -8793057235 | 0.21887501332663328 | 0.6874699121416429 |
0.08379407453894183 | -1300419439 | 0.2108548261597638 | 0.39335408485745316 |
0.09723766918372134 | 2467206050 | 0.2499576969146754 | 0.783647656429706 |
0.13917572512342702 | 8415264685 | 0.9150713227141526 | 0.071808029978221 |
We select to copy: [col_2, grp] (grp is the name of the group). The output is:
col_1 | col_2 | __group_grp_0 | __group_grp_1 | col_2_copy | __group_grp_copy_0 | __group_grp_copy_1 |
---|---|---|---|---|---|---|
0.14667831357791283 | 2194986032 | 0.6765007779041725 | 0.876755413362593 | 2194986032 | 0.6765007779041725 | 0.876755413362593 |
0.8424524032441315 | -8793057235 | 0.21887501332663328 | 0.6874699121416429 | -8793057235 | 0.21887501332663328 | 0.6874699121416429 |
0.08379407453894183 | -1300419439 | 0.2108548261597638 | 0.39335408485745316 | -1300419439 | 0.2108548261597638 | 0.39335408485745316 |
0.09723766918372134 | 2467206050 | 0.2499576969146754 | 0.783647656429706 | 2467206050 | 0.2499576969146754 | 0.783647656429706 |
0.13917572512342702 | 8415264685 | 0.9150713227141526 | 0.071808029978221 | 8415264685 | 0.9150713227141526 | 0.071808029978221 |
Drop
Unique identifier (API): drop
Drop columns and/or column groups
Examples:
For example, given the following dataset
col_1 | col_2 | __group_A_0 | __group_A_1 | __group_B_mean | __group_B_count |
---|---|---|---|---|---|
0.666099547109415 | cat_1 | 0.8268666013533758 | 0 | 0.9842078018727513 | 0 |
0.293068258318807 | cat_2 | 0.9747909820892654 | 1 | 0.703905046233064 | 1 |
0.2961075040827643 | cat_3 | 0.14699408293664762 | 2 | 0.5417745101099833 | 2 |
0.11255366333183248 | cat_4 | 0.5392931543191276 | 3 | 0.35921739535158015 | 3 |
0.9707308564097861 | cat_5 | 0.23839121304354394 | 4 | 0.8751438273689285 | 4 |
with input column col_1 and column group A, the result is
col_2 | __group_B_mean | __group_B_count |
---|---|---|
cat_1 | 0.9842078018727513 | 0 |
cat_2 | 0.703905046233064 | 1 |
cat_3 | 0.5417745101099833 | 2 |
cat_4 | 0.35921739535158015 | 3 |
cat_5 | 0.8751438273689285 | 4 |
Find and replace
Unique identifier (API): find_replace
Finds and replaces the specified strings in the input column(s) according to the given mapping
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
replacements | Mapping between the search strings and their replacements | | map |
match_case | Whether the matches are case sensitive | False | boolean |
positional_match | Matches the search only if a match is found in the specified position in the string | anywhere | single-select |
Examples:
For the following input table:
col_x | col_y |
---|---|
They won the fifth place | They won the sixth place |
No match here | whole word is th |
and input columns: ['col_x', 'col_y'], with the replacement mapping 'th' -> '##' and parameters (positional_match, match_case) as given below, the output will be:
For ('prefix', True):
col_x | col_y |
---|---|
They won ##e fifth place | They won ##e sixth place |
No match here | whole word is ## |
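A Python sketch of this example, assuming the replacement mapping is 'th' -> '##' and that positional_match == 'prefix' means the search string must occur at the start of a word (both inferred from the output above; not the action's implementation):

```python
import re
import pandas as pd

df = pd.DataFrame({"col_x": ["They won the fifth place", "No match here"],
                   "col_y": ["They won the sixth place", "whole word is th"]})

replacements = {"th": "##"}   # assumed mapping, inferred from the example output
match_case = True
flags = 0 if match_case else re.IGNORECASE

for search, repl in replacements.items():
    pattern = r"\b" + re.escape(search)   # word-start ("prefix") match
    for col in ["col_x", "col_y"]:
        df[col] = df[col].str.replace(pattern, repl, regex=True, flags=flags)
```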
Impute categorical columns
Unique identifier (API): impute_categorical
Fill missing values in categorical input columns by a specified method
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
method | Method for imputing input columns | most_frequent | single-select |
fill_value | Value to fill when method is specified as constant | | string |
Examples:
For example, a categorical column white, black, None, white, yellow is changed to white, black, white, white, yellow with method = "most_frequent" (missing values are filled with the most frequent value "white").
Impute numeric columns
Unique identifier (API): impute_numeric
Fill missing values in numeric input columns by a specified method
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
method | Method for imputing input columns | mean | single-select |
fill_value | Value to fill when method is specified as constant | | float |
Examples:
For example, a numeric column 100, 120, None, 80, 100, None is changed to 100, 120, 100, 80, 100, 100 with method = "mean" (missing values are filled with the mean 100).
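Both imputation examples above can be sketched with plain pandas (illustrative only):

```python
import pandas as pd

colour = pd.Series(["white", "black", None, "white", "yellow"])
amount = pd.Series([100, 120, None, 80, 100, None], dtype="float")

# method = "most_frequent" for a categorical column
colour_filled = colour.fillna(colour.mode().iloc[0])   # fills with "white"

# method = "mean" for a numeric column
amount_filled = amount.fillna(amount.mean())           # fills with 100.0
```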
Rename
Unique identifier (API): rename
Renames specified columns OR column groups
Examples:
For the following input table: (3 columns and 1 group with 2 components)
col_1 | col_2 | col_3 | __group_grp_0 | __group_grp_1 |
---|---|---|---|---|
0.5267075 | 8.24489e+07 | 1.23345 | 0.2635994 | 0.1005395 |
0.5707881 | 2.044422e+09 | -0.1124 | 0.3945866 | 0.4787061 |
0.6478216 | -5.179201e+09 | 0.00011 | 0.1314026 | 0.2335102 |
We select to rename: [col_2, grp] (grp is the name of the group) to [col_A, grp_A]. The output is:
col_1 | col_A | col_3 | __group_grp_A_0 | __group_grp_A_1 |
---|---|---|---|---|
0.5267075 | 8.24489e+07 | 1.23345 | 0.2635994 | 0.1005395 |
0.5707881 | 2.044422e+09 | -0.1124 | 0.3945866 | 0.4787061 |
0.6478216 | -5.179201e+09 | 0.00011 | 0.1314026 | 0.2335102 |
Sort on columns
Unique identifier (API): sort_on_columns
Sort the dataset based on the input columns in a specified order (ascending/descending)
Examples:
For the following input table:
animal | colour | adopted | qty | desexed | note |
---|---|---|---|---|---|
cat | yellow | 2016-07-31T00:00:12 | 10 | True | There is a story |
dog | blue | 2017-08-20T07:23:54 | 4 | True | |
cat | blue | 2017-09-20T00:23:50 | 9 | True | No story |
cat | blue | 2018-05-01T04:40:15 | 8 | True | There is a story |
dog | yellow | 2019-02-05T12:12:02 | 1 | True | There is another story |
dog | black | 2019-08-20T07:23:02 | 2 | False | There is a story |
cat | black | 2020-01-20T00:23:50 | 2 | True | No story |
dog | yellow | 2020-08-02T15:00:08 | 2 | False |
We select sorting by qty in ascending order and adopted in descending order.
The output is:
animal | colour | adopted | qty | desexed | note |
---|---|---|---|---|---|
dog | yellow | 2019-02-05 12:12:02 | 1 | True | There is another story |
dog | yellow | 2020-08-02 15:00:08 | 2 | False | |
cat | black | 2020-01-20 00:23:50 | 2 | True | No story |
dog | black | 2019-08-20 07:23:02 | 2 | False | There is a story |
dog | blue | 2017-08-20 07:23:54 | 4 | True | |
cat | blue | 2018-05-01 04:40:15 | 8 | True | There is a story |
cat | blue | 2017-09-20 00:23:50 | 9 | True | No story |
cat | yellow | 2016-07-31 00:00:12 | 10 | True | There is a story |
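The equivalent pandas call for this example (illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    "animal": ["cat", "dog", "dog", "cat"],
    "adopted": pd.to_datetime(["2016-07-31 00:00:12", "2019-02-05 12:12:02",
                               "2020-08-02 15:00:08", "2020-01-20 00:23:50"]),
    "qty": [10, 1, 2, 2],
})

# qty ascending, then adopted descending breaks ties on qty.
out = df.sort_values(by=["qty", "adopted"], ascending=[True, False])
```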
Strip HTML tags
Unique identifier (API): strip_html
Strip HTML tags from text to convert webpages to text documents, sans HTML formatting
Examples:
Strip HTML tags from text, e.g. "<emp> Text </emp>" becomes "Text".
Fill null with the nearest preceding/following values
Unique identifier (API): window_fill_null_by_nearest
Fill null with the nearest non-null values in either preceding or following rows after ordering, optionally with partitioning
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
method | Whether to use the nearest value in preceding or following rows | preceding | single-select |
Examples:
For example, given the following dataset
animal | colour | adopted | qty |
---|---|---|---|
cat | yellow | 2020-01-01 00:00:00 | null |
cat | yellow | 2020-01-02 00:00:00 | null |
cat | yellow | 2020-01-03 00:00:00 | 5.0 |
cat | yellow | 2020-01-04 00:00:00 | 7.0 |
cat | yellow | 2020-01-05 00:00:00 | 10.0 |
cat | white | 2020-01-06 00:00:00 | 1.0 |
cat | white | 2020-01-07 00:00:00 | 4.0 |
cat | white | 2020-01-08 00:00:00 | null |
cat | white | 2020-01-09 00:00:00 | 7.0 |
cat | white | 2020-01-10 00:00:00 | 8.0 |
dog | yellow | 2020-01-11 00:00:00 | 9.0 |
dog | yellow | 2020-01-12 00:00:00 | 7.0 |
dog | yellow | 2020-01-13 00:00:00 | 6.0 |
dog | yellow | 2020-01-14 00:00:00 | 5.0 |
dog | yellow | 2020-01-15 00:00:00 | 10.0 |
dog | white | 2020-01-16 00:00:00 | 6.0 |
dog | white | 2020-01-17 00:00:00 | 5.0 |
dog | white | 2020-01-18 00:00:00 | 8.0 |
dog | white | 2020-01-19 00:00:00 | null |
dog | white | 2020-01-20 00:00:00 | null |
with input column qty
, partitioned by animal
and colour
and ordered by adopted
for method preceding
, the result is
animal | colour | adopted | qty |
---|---|---|---|
cat | white | 2020-01-06 00:00:00 | 1.0 |
cat | white | 2020-01-07 00:00:00 | 4.0 |
cat | white | 2020-01-08 00:00:00 | 4.0 |
cat | white | 2020-01-09 00:00:00 | 7.0 |
cat | white | 2020-01-10 00:00:00 | 8.0 |
cat | yellow | 2020-01-01 00:00:00 | null |
cat | yellow | 2020-01-02 00:00:00 | null |
cat | yellow | 2020-01-03 00:00:00 | 5.0 |
cat | yellow | 2020-01-04 00:00:00 | 7.0 |
cat | yellow | 2020-01-05 00:00:00 | 10.0 |
dog | white | 2020-01-16 00:00:00 | 6.0 |
dog | white | 2020-01-17 00:00:00 | 5.0 |
dog | white | 2020-01-18 00:00:00 | 8.0 |
dog | white | 2020-01-19 00:00:00 | 8.0 |
dog | white | 2020-01-20 00:00:00 | 8.0 |
dog | yellow | 2020-01-11 00:00:00 | 9.0 |
dog | yellow | 2020-01-12 00:00:00 | 7.0 |
dog | yellow | 2020-01-13 00:00:00 | 6.0 |
dog | yellow | 2020-01-14 00:00:00 | 5.0 |
dog | yellow | 2020-01-15 00:00:00 | 10.0 |
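A pandas sketch of method == 'preceding' with partitioning and ordering (illustrative only, not the action's implementation):

```python
import pandas as pd

df = pd.DataFrame({
    "animal":  ["cat", "cat", "cat", "dog", "dog", "dog"],
    "colour":  ["white"] * 6,
    "adopted": pd.to_datetime(["2020-01-06", "2020-01-07", "2020-01-08",
                               "2020-01-18", "2020-01-19", "2020-01-20"]),
    "qty": [1.0, 4.0, None, 8.0, None, None],
})

# Order within each partition, then forward-fill from preceding rows.
df = df.sort_values("adopted")
df["qty"] = df.groupby(["animal", "colour"])["qty"].ffill()
# cat: 1, 4, 4   dog: 8, 8, 8   (leading nulls with no preceding value stay null)
```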
Date/DateTime
Add time delta
Unique identifier (API): add_timedelta
Adds a time delta (in seconds) to one or more DateTime columns. Changes the columns in place.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
time_delta_sec | This is the time-duration in seconds to be added to one or more DateTime columns. | 0 | integer |
Examples:
For time_delta_sec == -100 and the given DateTime 2022-06-30 23:51:08.825072, the output is 2022-06-30 23:49:28.825072.
Create cyclic encoding features
Unique identifier (API): cyclic_encode
A directive that receives one or more columns that are cyclical in nature. Denote such a column as 'x' and its period as 'p'. For each such column, creates 2 new columns: x_sin = sin(2πx/p) and x_cos = cos(2πx/p).
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
period | The period of the cyclic feature | 1 | float |
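A short sketch of the two derived columns for an hour-of-day feature with period p = 24 (illustrative only):

```python
import numpy as np
import pandas as pd

hour = pd.Series([0, 6, 12, 18])   # a cyclic feature
p = 24                             # its period

hour_sin = np.sin(2 * np.pi * hour / p)
hour_cos = np.cos(2 * np.pi * hour / p)
```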
Compute time elapsed
Unique identifier (API): diff_datetime
Calculates the time difference in seconds between two DateTime columns
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
units | Specifies the units of the time difference result | Seconds | single-select |
Examples:
If the time difference units are Hours, for the following input:
col_1 | col_2 |
---|---|
2020-07-23 11:26:04.812256 | 2020-07-31 11:26:04.812256 |
We expect the following output:
col_1 | col_2 | diff_1 |
---|---|---|
2020-07-23 11:26:04.812256 | 2020-07-31 11:26:04.812256 | -192.0 |
Extract AM/PM
Unique identifier (API): extract_am_pm
Extracts the AM/PM value for rows of a DateTime column
Examples:
2020-04-15T15:53:44 -> will return: PM
Extract day of month
Unique identifier (API): extract_day_of_month_number
Extract the day of month as a number from 1 to 31
Examples:
2020-04-15T15:53:44 -> will return: 15
Extract day of week name
Unique identifier (API): extract_day_of_week_name
Extract the name of the day in week ('Monday' to 'Sunday')
Examples:
2020-04-15T15:53:44 -> will return: Wednesday
Extract day of week number
Unique identifier (API): extract_day_of_week_number
Extract the day of week as a number ('Sunday' = 1, 'Saturday' = 7)
Examples:
2020-04-15T15:53:44, with day coding Monday=0, Tuesday=1, ..., Sunday=6 -> will return: 02.
Extract hours
Unique identifier (API): extract_hour
Extract hour of the day as a number from 0 to 23
Examples:
2020-04-15T15:53:44 -> will return: 15
Extract minutes
Unique identifier (API): extract_minutes
Extract minutes after the hour as a number from 0 to 59
Examples:
2020-04-15T15:53:44 -> will return: 53
Extract month name
Unique identifier (API): extract_month_name
Extracts the month names of a DateTime column
Examples:
2020-04-15T15:53:44 -> will return: April
Extract month number
Unique identifier (API): extract_month_number
Extract the month as a number from 1 to 12
Extract quarter
Unique identifier (API): extract_quarter
Extract the quarter (3-month period) of the year in which the dates fall as an integer from 1 to 4.
Examples:
2020-04-15T15:53:44 -> will return: 02
Extract seconds
Unique identifier (API): extract_seconds
Extract seconds after the minute as a number from 0 to 59
Examples:
2020-04-15T15:53:44 -> will return: 44
Extract timestamp
Unique identifier (API): extract_timestamp
Extracts the POSIX timestamp (number of seconds since January 1st, 1970).
Examples:
Returns a POSIX timestamp as float. (Number of seconds since January 1st, 1970).
2020-04-15T15:53:44 -> will return: 1586930024.0.
Extract UTC offset
Unique identifier (API): extract_utc_offset
Extracts the UTC offset (in hours) from timezone aware string timestamps
Parameters | Parameter Description | Default | Parameter Type | Suggested Values |
---|---|---|---|---|
date_format | The date_format of the column | %Y-%m-%dT%H:%M:%S%z | string | ['%Y-%m-%dT%H:%M:%S.%fZ%z', '%b %d %Y %H:%M:%S.%f%z', '%Y-%m-%dT%H:%M:%S.%f%z', '%m-%d-%Y %H:%M:%S.%f%z', '%d-%m-%Y %H:%M:%S.%f%z', '%d %b %Y %H:%M:%S.%f%z', '%Y/%m/%d %H:%M:%S.%f%z', '%Y-%m-%d %H:%M:%S.%f%z', '%d/%m/%Y %H:%M:%S.%f%z', '%Y-%m-%dT%H:%M:%SZ%z', '%d%m%Y %H:%M:%S.%f%z', '%m%d%Y %H:%M:%S.%f%z', '%Y%m%d %H:%M:%S.%f%z', '%m-%d-%Y %H:%M:%S%z', '%Y/%m/%d %H:%M:%S%z', '%Y-%m-%dT%H:%M:%S%z', '%b-%d-%Y %H:%M:%S%z', '%d-%m-%Y %H:%M:%S%z', '%d-%b-%Y %H:%M:%S%z', '%d/%m/%Y %H:%M:%S%z', '%Y-%m-%d %H:%M:%S%z', '%m/%d/%Y %I:%M:%S %p%z', '%Y%m%d %H:%M:%S%z', '%Y-%m-%dT%H:%MZ%z', '%m%d%Y %H:%M:%S%z', '%d%m%Y %H:%M:%S%z', '%m-%d-%Y%z', '%Y-%m-%d%z', '%d-%m-%Y%z', '%m %d %Y%z', '%d-%b-%Y%z', '%Y/%m/%d%z', '%d %m %Y%z', '%m/%d/%Y%z', '%Y %m %d%z', '%d/%m/%Y%z', '%d %b %Y%z', '%b-%d-%Y%z', '%b %d %Y%z', '%m%d%Y%z', '%Y%m%d%z', '%d%m%Y%z', '%m%Y%z', '%m %Y%z', '%m-%Y%z', '%m/%Y%z'] |
Examples:
For date_format == '%Y-%m-%dT%H:%M:%S%z' and the following input:
col_1 |
---|
1970-01-01T23:23:14+0430 |
2000-10-10T10:14:42-0515 |
1983-02-03T03:00:51 |
null |
2020-07-06T18:52:36-0000 |
1948-05-14T02:23:14-0845 |
The expected output is:
col_1 | utc_offset_hrs |
---|---|
1970-01-01T23:23:14+0430 | 4.5 |
2000-10-10T10:14:42-0515 | -5.25 |
1983-02-03T03:00:51 | null |
null | null |
2020-07-06T18:52:36-0000 | 0.0 |
1948-05-14T02:23:14-0845 | -8.75 |
Extract UTC offset from UTC timestamp and timezone
Unique identifier (API): extract_utc_offset_tz
Extracts the offset (in hours) from UTC, given the UTC DateTime column and the timezone name column.
Examples:
For the following input:
datetimes | timezones |
---|---|
1970-01-01 23:23:14 | Pacific/Pago_Pago |
1990-05-20 09:09:09 | Asia/Jerusalem |
2000-10-10 10:14:42 | pacific/pago_pago |
1983-02-03 03:00:51 | null |
null | Arctic/Longyearbyen |
2020-07-06 18:52:36 | Mars/Jezero crater |
The expected output is:
datetimes | timezones | utc_offset_hrs |
---|---|---|
1970-01-01 23:23:14 | Pacific/Pago_Pago | -11.0 |
1990-05-20 09:09:09 | Asia/Jerusalem | 3.0 |
2000-10-10 10:14:42 | pacific/pago_pago | null |
1983-02-03 03:00:51 | null | null |
null | Arctic/Longyearbyen | null |
2020-07-06 18:52:36 | Mars/Jezero crater | null |
The reasons for the null values are:
* pacific/pago_pago doesn't exist in our timezone list (Pacific/Pago_Pago does!)
* null timezone value
* null datetime value
* Mars/Jezero crater doesn't exist in our timezone list (maybe in the future)
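The same lookup can be sketched with Python's standard-library zoneinfo module (a behavioural sketch only, not the action's implementation):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

def utc_offset_hours(dt_utc, tz_name):
    """Offset in hours of tz_name at the given UTC instant; None when either
    value is missing or the name is not a recognised IANA timezone."""
    if dt_utc is None or tz_name is None:
        return None
    try:
        tz = ZoneInfo(tz_name)
    except (ZoneInfoNotFoundError, ValueError):
        return None   # e.g. 'Mars/Jezero crater'
    local = dt_utc.replace(tzinfo=timezone.utc).astimezone(tz)
    return local.utcoffset().total_seconds() / 3600

utc_offset_hours(datetime(1970, 1, 1, 23, 23, 14), "Pacific/Pago_Pago")  # -11.0
```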
Extract week number in year
Unique identifier (API): extract_week_number_in_year
Extracts the week number (between 1-52)
Extract year
Unique identifier (API): extract_year
Extract the year as a number
Examples:
2020-04-15T15:53:44 -> will return: 2020
Get daylight savings indicator
Unique identifier (API): get_dst_indicator
Generates a boolean column which indicates if a localized timestamp column (represented by the combination of two columns: A UTC DateTime column and an IANA timezone name column) has an active daylight saving.
Examples:
For the following input:
time_utc | tz_name |
---|---|
2020-08-20 13:13:14 | Australia/Melbourne |
2020-02-20 13:13:14 | Australia/Melbourne |
2020-02-20 13:13:14 | null |
2020-02-20 13:13:14 | Mars/Jezero Crater |
null | Australia/Melbourne |
The expected output is:
time_utc | tz_name | dst_active |
---|---|---|
2020-08-20 13:13:14 | Australia/Melbourne | false |
2020-02-20 13:13:14 | Australia/Melbourne | true |
2020-02-20 13:13:14 | null | null |
2020-02-20 13:13:14 | Mars/Jezero Crater | null |
null | Australia/Melbourne | null |
The reasons for the null values are:
* null timezone value
* Mars/Jezero Crater doesn't exist in our timezone list (maybe in the future)
* null time_utc value
Get local time
Unique identifier (API): get_local_time
Generates local time given the UTC timestamp and the timezone's offset in hours.
Examples:
For the following input: (2 pairs of DateTime | offset, working on both):
datetime_1 | offset_1 | datetime_2 | offset_2 |
---|---|---|---|
1970-01-01 23:23:14 | 4.5 | null | null |
2000-10-10 10:14:42 | -5.25 | 2020-07-06 18:52:36 | 0.0 |
1983-02-03 03:00:51 | 2.75 | 1948-05-14 02:23:14 | null |
The expected output is:
datetime_1 | offset_1 | datetime_2 | offset_2 | local_time_1 | local_time_2 |
---|---|---|---|---|---|
1970-01-01 23:23:14 | 4.5 | null | null | 1970-01-02 03:53:14 | null |
2000-10-10 10:14:42 | -5.25 | 2020-07-06 18:52:36 | 0.0 | 2000-10-10 04:59:42 | 2020-07-06 18:52:36 |
1983-02-03 03:00:51 | 2.75 | 1948-05-14 02:23:14 | null | 1983-02-03 05:45:51 | null |
Get timezone name
Unique identifier (API): get_tz_name
Generates a list of timezone names matching the pair of a UTC DateTime column and a numeric UTC offset (in hours) column. The list is in a text format with values separated by a pipe
Examples:
For the following input: (2 pairs of DateTime | offset, working on both)
datetime_1 | offset_1 | datetime_2 | offset_2 |
---|---|---|---|
1970-01-01 23:23:14 | 4.5 | null | null |
2000-10-10 10:14:42 | -5.25 | 2020-07-06 18:52:36 | 0.0 |
1983-02-03 03:00:51 | 2.75 | 1948-05-14 02:23:14 | null |
The expected output is:
datetime_1 | offset_1 | datetime_2 | offset_2 | timezones_1 | timezones_2 |
---|---|---|---|---|---|
1970-01-01 23:23:14 | 4.5 | null | null | Asia/Kabul | |
2000-10-10 10:14:42 | -5.25 | 2020-07-06 18:52:36 | 0.0 | | Africa/Abidjan, Africa/Accra, Africa/Bamako, Africa/Banjul, Africa/Bissau, Africa/Conakry, Africa/Dakar, Africa/Freetown, Africa/Lome, Africa/Monrovia, Africa/Nouakchott, Africa/Ouagadougou, Africa/Sao_Tome, America/Danmarkshavn, America/Scoresbysund, Atlantic/Azores, Atlantic/Reykjavik, Atlantic/St_Helena, GMT, UTC |
1983-02-03 03:00:51 | 2.75 | 1948-05-14 02:23:14 | null | | |
Get timezone name from geolocation
Unique identifier (API): get_tz_name_from_geolocation
Given the geolocation as a pair of numeric columns (latitude and longitude), creates a column of timezone names of each geolocation. Returns null values if no matching timezone is found, or if the geo coordinates are invalid.
Examples:
For the following input:
lat | lng |
---|---|
50.45817318865369 | 35.99253168643526 |
30.56916325861897 | 65.90262778021655 |
51.936292660801115 | 26.071955568518586 |
21.180638320519932 | 151.80832165098863 |
67.04735445996346 | 120.61522491454524 |
null | 97.50232778607207 |
51.07973011677217 | null |
The expected output is:
lat | lng | tz_names |
---|---|---|
50.45817318865369 | 35.99253168643526 | Europe/Moscow |
30.56916325861897 | 65.90262778021655 | Asia/Kabul |
51.936292660801115 | 26.071955568518586 | Europe/Minsk |
21.180638320519932 | 151.80832165098863 | null |
67.04735445996346 | 120.61522491454524 | Asia/Yakutsk |
null | 97.50232778607207 | null |
51.07973011677217 | null | null |
Round DateTime
Unique identifier (API): round_datetime
Rounds a DateTime column (in place) to 15, 30 or 60 minutes
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
round_to | Time rounding resolution. E.g. for "half_hour", 14:39:11 --> 14:30:00 | | single-select |
Examples:
""2020-04-15T15:53:44"" with rounding to 30 minutes. Output ""2020-04-15T16:00:00""
Feature Engineering
Create features using autoencoder neural network
Unique identifier (API): autoencoder
Create new numerical features by combining existing numerical features in novel and non-linear ways using an autoencoder
Parameters | Parameter Description | Default | Parameter Type | Parameter Min | Parameter Max | Parameter Step |
---|---|---|---|---|---|---|
num_epochs | The number of iterations or epochs for optimizing the loss function over the training set. | 288 | range | 1.0 | 1000.0 | 1.0 |
batch_size | The size of a batch of examples used to compute the Stochastic Gradient updates. | 128 | range | 1.0 | 8192.0 | 1.0 |
learning_rate | Learning rates control the size of the update to parameters during each step. | 0.001 | range | 0.001 | 0.1 | 0.001 |
impute | Strategy for imputing missing values | mean | single-select | |||
create_missing_flag | Create a new feature based on missing values | True | boolean | |||
standardize | Standardize, if true, will transform each feature value by subtracting the mean-value and dividing the result by the standard deviation of the feature | True | boolean |
Examples:
Using autoencoder to generate novel and non-linear features based on existing numerical features
Bin columns
Unique identifier (API): bin_columns
This action bins one or more numeric columns by specifying the number of equally-spaced bins or a list of bin edges
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
method | Whether to use equally-spaced bins or custom cut positions | | single-select |
n_bins | Number of equally-spaced bins. This creates n equally-spaced bins from the minimum to maximum in the column. | | integer |
cut_positions | Custom cut positions in ascending order | | float-list |
output_format | Format for output labels | [lower, upper) | single-select |
Examples:
Discretize a numeric column into categorical ranges, e.g. given cut positions 100, 150, 200, the values 30, 110, 180, 220 can be binned as "x < 100", "100 <= x < 150", "150 <= x < 200", "x >= 200".
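A pandas sketch of binning with the custom cut positions from this example, using the default [lower, upper) output format (illustrative only):

```python
import pandas as pd

values = pd.Series([30, 110, 180, 220])
cuts = [100, 150, 200]

edges = [float("-inf")] + cuts + [float("inf")]
labels = ["x < 100", "100 <= x < 150", "150 <= x < 200", "x >= 200"]

# right=False gives [lower, upper) intervals.
binned = pd.cut(values, bins=edges, labels=labels, right=False)
```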
Bin columns by quantiles
Unique identifier (API): bin_columns_by_quantiles
This action bins one or more numeric columns by specifying the number of uniformly-split quantiles or a list of custom quantiles.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
method | Whether to use uniformly-split quantiles or custom split quantiles | | single-select |
n_bins | Number of uniformly-split quantiles. This creates n bins based on uniformly-split quantiles from 0 to 1 | | integer |
split_quantiles | Custom split quantiles in ascending order | | float-list |
output_format | Format for output labels | [lower, upper) | single-select |
Examples:
For example, 20, 50, 80, 100, 200 is converted to "0 < x < 0.25", "0.25 <= x < 0.5", "0.5 <= x < 0.75", "0.75 <= x < 1", "0.75 <= x < 1" with split quantiles 0.25, 0.5, 0.75.
Bin DateTime
Unique identifier (API): bin_datetime
Bins a DateTime column (creates 2 new columns) to 15, 30 or 60 minutes
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
bin_res | Upper/lower bounds to round towards in order to bin the time. E.g.: 14:53:15 --> [14:45:00, 15:00:00] when using "quarter_hour" resolution | | single-select |
Examples:
Creates 2 DateTime output columns per input column which "bin" the input column between them. The binning resolution is either hourly, half-hourly, or 15 minutes. Example: "2020-04-15T15:53:44" with half-hourly resolution gives output columns ["2020-04-15T15:30:00", "2020-04-15T16:00:00"].
Bin columns such that the target mean follows upward or downward trend
Unique identifier (API): bin_trend
This action bins one or more numeric columns such that the target mean follows an upward or downward trend
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
trend | Up or Down Trend | | single-select |
algo | Algorithm for trend-binning | | single-select |
Create missing flag
Unique identifier (API): create_missing_flag
For every input column, create a boolean column flagging the locations of missing values.
Extract PCA Components
Unique identifier (API): extract_pca_v2
Create new numerical features by combining existing numerical features in linear ways using PCA
Parameters | Parameter Description | Default | Parameter Type | Parameter Min | Parameter Max | Parameter Step |
---|---|---|---|---|---|---|
limit_components_by | Criterion for selecting the principal components. If the minimum explained variance is used, the components that contribute most to the variance are selected until the cumulative variance is not less than the specified ratio of the total variance. | min_explained_variance | single-select | |||
num_components | The number of principal components to extract from the input data | | integer | | | |
min_explained_variance | This is the threshold on the ratio of explained variance of a component to the total variance. Only the components with highest variance whose cumulative variance is at least the specified ratio of total variance will be selected | 0.8 | range | 0.01 | 1.0 | 0.01 |
numeric_impute_method | Method for imputing Numeric columns | mean | single-select |
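A scikit-learn sketch of selecting components by minimum explained variance (illustrative only; imputation of missing values is assumed to happen beforehand):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(200, 10))

# Keep the smallest number of components whose cumulative explained variance
# ratio reaches 0.8, analogous to min_explained_variance = 0.8.
pca = PCA(n_components=0.8, svd_solver="full")
components = pca.fit_transform(X)
```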
Apply UMAP
Unique identifier (API): extract_umap_components
Apply the Uniform Manifold Approximation and Projection (UMAP) algorithm to the input columns to achieve dimensionality reduction. This is a manifold-learning algorithm based on neighbours graphs.
Parameters | Parameter Description | Default | Parameter Type | Parameter Min | Parameter Max | Parameter Step |
---|---|---|---|---|---|---|
num_output_dimensions | The dimensionality of the space the input will be embedded into. | 2 | range | 1.0 | 100.0 | 1.0 |
seed | The random seed for the UMAP algorithm. By setting it, the output will be reproducible. | 2 | integer | |||
n_neighbors | This parameter controls how UMAP balances local versus global structure in the data. See documentation of this action in the knowledge hub for details. NOTE: If larger than the number of records in the data, this parameter will be truncated to the number of records. | 15 | range | 2.0 | 100.0 | 1.0 |
metric | Defines the metric of the ambient space of the input data. | euclidean | single-select | |||
min_dist | Controls how tightly UMAP is allowed to pack points together. It, quite literally, provides the minimum distance apart that points are allowed to be in the low dimensional representation. This means that low values of the parameter will result in clumpier embeddings. This can be useful if you are interested in clustering, or in finer topological structure. Larger values of the parameter will prevent UMAP from packing points together and will focus instead on the preservation of the broad topological structure. | 0.1 | range | 0.0 | 1.0 | 0.01 |
Get vector representation from text
Unique identifier (API): get_feature_vectors_from_text
Obtain numerical features from text using word embedding model
Examples:
Generates a vector representation for the given text column. For the following input table, with input field 'col_text' and output field 'col_text_vector':
col_text |
---|
null |
Test 54 45 Test |
Dummy sentence |
Just some number 00 |
The output contains text vectors of dimensionality 96:
col_text | __group_col_text_vector__0 | __group_col_text_vector__1 | ....................... | __group_col_text_vector__95 |
---|---|---|---|---|
null | 0.123 | 0.786 | ....................... | 0.313 |
Test 54 45 Test | -0.92 | 0.576 | ....................... | 0.415 |
Dummy sentence | 0.567 | 0.321 | ....................... | -0.28 |
Just some number 00 | 0.876 | 0.476 | ....................... | -0.47 |
Create historical summary aggregation features
Unique identifier (API): historical_summ_agg
Create historical summary aggregation features, e.g. the average purchase amount over the last 6 months
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
aggregation_functions | Aggregation methods such as "min" or "max" are functions that reduce a column of values to a single value | ['sum'] | multi-select |
timeframe | Time frame for windowing | 3 | integer |
Examples:
For example, creating columns like average spending last 6 months for each customer, by using the date column, monthly spending and customer ID
Label encode categorical columns
Unique identifier (API): label_encode
Encode each input categorical column into a new column of non-negative integers, following either alphabetical (by default) or customised order. Null values and unseen labels (when a finalized recipe fitted by a dataset is applied on new data) remain as null.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
order_of_categories | Category list that determines the order of labels | [] | string-list |
Examples:
E.g. convert the categories "A", "B", "C", NA to their numerical encoding 0, 1, 2, NA.
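A pandas sketch of alphabetical label encoding where nulls remain null (illustrative only, not the action's implementation):

```python
import pandas as pd

s = pd.Series(["B", "A", None, "C"])

# Alphabetical category order; Categorical.codes uses -1 for missing values,
# which is mapped back to <NA> here so nulls stay null.
cat = pd.Categorical(s, categories=sorted(s.dropna().unique()))
encoded = pd.Series(cat.codes).replace(-1, pd.NA)
# [1, 0, <NA>, 2]
```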
Scale numeric column(s)
Unique identifier (API): scale_numeric_columns
Scale numeric column(s) by a specified method
Parameters | Parameter Description | Default | Parameter Type | Parameter Min | Parameter Max | Parameter Step |
---|---|---|---|---|---|---|
method | Method for scaling including: standard (mean removed, scaled to unit variance), min-max (scaled to [0, 1]), max-abs (scaled to [-1, 1]) and robust (median removed, scaled to a specified quantile range) | | single-select | | | |
lower_quantile | Lower value for quantile range used in scaling | 0.25 | range | 0.0 | 1.0 | 0.05 |
upper_quantile | Upper value for quantile range used in scaling | 0.75 | range | 0.0 | 1.0 | 0.05 |
relative_error | Relative error for approximating median or a specific quantile. Must be no less than 0.001 for computation efficiency | 0.001 | range | 0.001 | 1.0 | 0.001 |
Examples:
For example, a numeric column 100, 120, 80, 10000, 100 is scaled to 0.01, 0.012, 0.008, 1.0, 0.01 with method = "max_abs" (all values are divided by the absolute maximum 10000).
Compute Weight of Evidence (WOE) encoding for categorical columns
Unique identifier (API): weight_of_evidence
This action converts categorical columns by Weight of Evidence encoding.
Feature Engineering/Text
Get vector representation from text
Unique identifier (API): get_feature_vectors_from_text
Obtain numerical features from text using word embedding model
Examples:
Generates a vector representation for the given text column. For the following input table, with input field 'col_text' and output field 'col_text_vector':
col_text |
---|
null |
Test 54 45 Test |
Dummy sentence |
Just some number 00 |
The output contains text vectors of dimensionality 96:
col_text | __group_col_text_vector__0 | __group_col_text_vector__1 | ....................... | __group_col_text_vector__95 |
---|---|---|---|---|
null | 0.123 | 0.786 | ....................... | 0.313 |
Test 54 45 Test | -0.92 | 0.576 | ....................... | 0.415 |
Dummy sentence | 0.567 | 0.321 | ....................... | -0.28 |
Just some number 00 | 0.876 | 0.476 | ....................... | -0.47 |
Filtering/Custom Formula
Filter rows
Unique identifier (API): filter_rows_with_formula_v2
Filter rows in data by keeping only rows that satisfy the condition specified in user-entered formula. Common Excel/SQL functions and operators for manipulating Numeric, DateTime, Text, and Boolean values are permitted. For a full list of allowed functions, see documentation.
NOTE: The formula entered must evaluate to a boolean value.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
condition | The condition under which the rows are to be kept, entered as a formula. | | formula |
Examples:
Consider the dataset:
manufacturer | model | year | displ | cty | hwy |
---|---|---|---|---|---|
audi | a4 quattro | 1999 | 1.8 | 16 | 25 |
audi | a6 quattro | 1999 | 2.8 | 15 | 24 |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 |
chevrolet | c1500 suburban 2wd | 2008 | 6 | 12 | 17 |
chevrolet | corvette | 1999 | 5.7 | 16 | 26 |
chevrolet | corvette | 1999 | 5.7 | 15 | 23 |
chevrolet | k1500 tahoe 4wd | 1999 | 6.5 | 14 | 17 |
chevrolet | malibu | 1999 | 3.1 | 18 | 26 |
dodge | caravan 2wd | 1999 | 3.8 | 15 | 22 |
dodge | caravan 2wd | 2008 | 4 | 16 | 23 |
dodge | durango 4wd | 2008 | 5.7 | 13 | 18 |
dodge | durango 4wd | 2008 | 5.7 | 13 | 18 |
dodge | ram 1500 pickup 4wd | 1999 | 5.9 | 11 | 15 |
ford | expedition 2wd | 1999 | 4.6 | 11 | 17 |
ford | explorer 4wd | 1999 | 5 | 13 | 17 |
ford | f150 pickup 4wd | 1999 | 4.2 | 14 | 17 |
ford | f150 pickup 4wd | 1999 | 4.2 | 14 | 17 |
ford | f150 pickup 4wd | 2008 | 4.6 | 13 | 17 |
ford | mustang | 2008 | 4 | 17 | 26 |
where cty and year are of Numeric type.
Given the formula 'AND(year > 2000, cty >= 15)', the result will be:
manufacturer | model | year | displ | cty | hwy |
---|---|---|---|---|---|
audi | a6 quattro | 2008 | 4.2 | 16 | 23 |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 |
dodge | caravan 2wd | 2008 | 4 | 16 | 23 |
ford | mustang | 2008 | 4 | 17 | 26 |
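The condition 'AND(year > 2000, cty >= 15)' corresponds to a simple pandas query (illustrative only):

```python
import pandas as pd

df = pd.DataFrame({"model": ["a4 quattro", "a6 quattro", "caravan 2wd", "mustang"],
                   "year": [1999, 2008, 2008, 2008],
                   "cty": [16, 16, 16, 17]})

# Keep only rows satisfying the condition.
kept = df.query("year > 2000 and cty >= 15")
```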
Join
Look up aggregated columns from another dataset
Unique identifier (API): aggregate_lookup
Perform aggregation on another dataset and lookup the content of the summarized table
Examples:
Given the primary dataset
grp |
---|
1 |
3 |
3 |
and a secondary dataset that we wish to summarize and then perform a lookup of the summarized info:
grp | val | val1 | bool1 | datetimes |
---|---|---|---|---|
1 | 0.45 | 1 | True | 2023-11-17T13:08:29 |
1 | 0.45 | 2 | True | 2023-11-17T13:08:29 |
3 | 0.45 | 3 | True | 2023-11-17T13:08:29 |
3 | 0.45 | 3.5 | True | 2023-08-03T08:48:29 |
3 | 1 | 3 | False | 2022-05-29T03:21:59 |
3 | 1 | 4 | False | 2022-05-29T03:21:59 |
2 | 10 | 5 | False | 2022-05-29T03:21:59 |
2 | 100 | 6 | False | 2022-05-29T03:21:59 |
If we perform the below aggregations on the secondary table and then look up the results using grp as the lookup key:
mean(val)
sum(val1)
all(bool1)
any(bool1)
approx_count_distinct(val, 0)
approx_count_distinct(bool1, 0)
earliest(datetimes, 0)
latest(datetimes, 0)
approx_median(val)
approx_quantile(val)
count(bool1, "non_null")
count(bool1, "null")
we get
grp | avg_val | sum_val1 | all_bool1 | any_bool1 | approx_count_distinct_val | approx_count_distinct_bool1 | earliest_datetime | latest_datetime | median_val | quantile_val | count_bool1 | count_nulls_bool1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0.45 | 3.0 | True | True | 1 | 1 | 2023-11-17 13:08:29 | 2023-11-17 13:08:29 | 0.45 | 0.45 | 2 | 0 |
3 | 0.725 | 13.5 | False | True | 2 | 2 | 2022-05-29T03:21:59 | 2023-08-03T08:48:29 | 0.45 | 0.45 | 4 | 0 |
3 | 0.725 | 13.5 | False | True | 2 | 2 | 2022-05-29T03:21:59 | 2023-08-03T08:48:29 | 0.45 | 0.45 | 4 | 0 |
Join
Unique identifier (API): join
Joins another dataset to current dataset given one or more join keys, retaining the specified selection of columns from both datasets.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
join_type | Specifies how to perform the join. | | single-select |
Examples:
For the following input tables:
Primary Table: (2 regular columns, 1 column group with 2 components)
Id | Name | __group_Like_0 | __group_Like_1 |
---|---|---|---|
1 | Patrick | Apple | Milk |
2 | Albert | Pear | Cheese |
3 | Maria | Watermelon | Chickpeas |
4 | Darwin | Grapes | Beans |
5 | Elizabeth | Kiwi | Softpeas |
Secondary Table: (2 regular columns)
User_ID | Like |
---|---|
3 | Stars |
1 | Climbing |
1 | Code |
6 | Rugby |
4 | Apples |
We perform a Join with the following parameters:
* join_keys == 'Id' for the primary table
* join_keys == 'User_ID' for the secondary table
* selected_columns == ['Id', 'Name', 'Like'] for the primary table
* selected_columns == 'Like' for the secondary table
* join_type == 'Outer Join'
The expected output is:
Id | Name | __group_primary_Like_0 | __group_primary_Like_1 | secondary_Like |
---|---|---|---|---|
null | null | null | null | Rugby |
5 | Elizabeth | Kiwi | Softpeas | null |
1 | Patrick | Apple | Milk | Climbing |
1 | Patrick | Apple | Milk | Code |
3 | Maria | Watermelon | Chickpeas | Stars |
2 | Albert | Pear | Cheese | null |
4 | Darwin | Grapes | Beans | Apples |
Notice that in this example the user performed an Outer Join but opted to drop the User_ID column of the secondary table. As a result, the first row of the output shows an Id of null (User_ID == 6 for that row; this value did not exist in the primary table), so the result for this row isn't very usable.
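A pandas sketch of this outer join, including the dropped key column that produces the null-Id row (illustrative only, with the group columns omitted):

```python
import pandas as pd

primary = pd.DataFrame({"Id": [1, 2, 3, 4, 5],
                        "Name": ["Patrick", "Albert", "Maria", "Darwin", "Elizabeth"]})
secondary = pd.DataFrame({"User_ID": [3, 1, 1, 6, 4],
                          "Like": ["Stars", "Climbing", "Code", "Rugby", "Apples"]})

# Outer join on Id == User_ID; dropping User_ID afterwards is why the
# unmatched secondary row (User_ID == 6) shows a null Id.
out = (primary.merge(secondary, left_on="Id", right_on="User_ID", how="outer")
              .drop(columns="User_ID"))
```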
Look up columns from another dataset
Unique identifier (API): lookup_from
Creates specified columns by looking them up from a second dataset. Results in null values if there is no match in the second dataset for a particular record.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
multi_match_resolution | Which row to use when multiple rows in the other dataset match under the specified criteria. | last | single-select |
Examples:
For the following input tables:
Primary Table: (2 regular columns, 1 column group with 2 components)
Id | Name | __group_Like_0 | __group_Like_1 |
---|---|---|---|
1 | Patrick | Apple | Milk |
2 | Albert | Pear | Cheese |
3 | Maria | Watermelon | Chickpeas |
4 | Darwin | Grapes | Beans |
5 | Elizabeth | Kiwi | Softpeas |
Secondary Table: (2 regular columns)
User_ID | Like |
---|---|
3 | Stars |
1 | Climbing |
1 | Code |
6 | Rugby |
4 | Apples |
We perform a Lookup with the following parameters:
* join_criteria == 'Id' for the primary table
* join_criteria == 'User_ID' for the secondary table
* lookup == 'Like' for the secondary table
* order_secondary_by == 'Like' for the secondary table
* name prefixes for the current and other datasets: ['left_', 'right_']
The expected output is:
Id | Name | __group_left_Like_0 | __group_left_Like_1 | right_Like |
---|---|---|---|---|
null | null | null | null | Rugby |
5 | Elizabeth | Kiwi | Softpeas | null |
1 | Patrick | Apple | Milk | Climbing |
3 | Maria | Watermelon | Chickpeas | Stars |
2 | Albert | Pear | Cheese | null |
4 | Darwin | Grapes | Beans | Apples |
Note that Lookup
returns only one match per row in the primary table. If there are multiple matches in the secondary table, only the first one (under the specified ordering) will be returned, so it is important to choose an order_secondary_by
column in the secondary table.
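A minimal pandas sketch of the same idea, purely for illustration (not the action's implementation): order the secondary table, keep a single match per key, and left-join so primary rows without a match receive nulls.

```python
import pandas as pd

primary = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "Name": ["Patrick", "Albert", "Maria", "Darwin", "Elizabeth"],
})
secondary = pd.DataFrame({
    "User_ID": [3, 1, 1, 6, 4],
    "Like": ["Stars", "Climbing", "Code", "Rugby", "Apples"],
})

# order_secondary_by == 'Like'; keep one match per key; left-join onto primary.
matches = (secondary
           .sort_values("Like")
           .drop_duplicates(subset="User_ID", keep="first")
           .rename(columns={"Like": "right_Like"}))
result = (primary
          .merge(matches, how="left", left_on="Id", right_on="User_ID")
          .drop(columns="User_ID"))
print(result)
```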
Outlier Detection
Get anomaly score of records
Unique identifier (API): get_anomaly_scores_from_records
Computes the anomaly score of input data using unsupervised learning. The anomaly score of a record is computed as a measure of the information lost when the record is subjected to dimensionality reduction by PCA.
Parameters | Parameter Description | Default | Parameter Type | Parameter Min | Parameter Max | Parameter Step |
---|---|---|---|---|---|---|
method | The method used for anomaly detection such as dimensionality reduction using PCA. | pca | single-select | |||
pca_limit_components_by | Criterion for selecting the principal components. If the minimum explained variance is used, the components that contribute most to the variance are selected until the cumulative variance is not less than the specified ratio of the total variance. | specified_explained_var | single-select | |||
pca_num_of_components | The number of principal components to extract from the input data | integer | ||||
pca_minimum_explained_variance_needed | This is the threshold on the ratio of cumulative explained variance of the selected principal components to the total variance. Only the components with highest variance whose cumulative variance is at least the specified ratio of total variance will be selected | 0.8 | range | 0.01 | 1.0 | 0.01 |
isolation_forest_num_estimators | The number of trees in the ensemble. | 100 | integer | |||
max_samples_specification_method | Specifies how max samples for isolation forest is set. Fractional implies that the maximum sample size is a fraction relative to the total number of rows. Count implies that the specified number is treated as an absolute count. | count | single-select |
isolation_forest_max_samples_fraction | The fraction of the total number of samples used to train each tree. | 1.0 | range | 0.01 | 1.0 | 0.01 |
isolation_forest_max_samples_count | The number of samples used to train each tree. The value should be greater than 1. | 256 | integer |
seed | Positive integer. Random seed for isolation forest algorithm. Used for obtaining repeatable results. | 1 | integer |
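The PCA-based score can be illustrated with a small scikit-learn sketch (an assumption-laden illustration, not the action's code): reconstruct each record from the retained principal components and use the squared reconstruction error as the anomaly score.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(seed=1)
X = rng.normal(size=(200, 5))
X[0] += 10.0                      # plant one obvious outlier

# Keep components explaining ~80% of the variance (the default
# pca_minimum_explained_variance_needed), then score each record by the
# information lost when it is projected and reconstructed.
Z = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.8, svd_solver="full").fit(Z)
reconstruction = pca.inverse_transform(pca.transform(Z))
anomaly_score = np.square(Z - reconstruction).sum(axis=1)
print(anomaly_score.argmax())     # row 0 gets the highest score
```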
Indicate anomalous values
Unique identifier (API): indicate_anomalous_values
Creates boolean flag columns indicating outliers in the corresponding input columns, using the interpercentile range method. E.g. denoting Q1 as the 1st quartile, Q3 as the 3rd quartile and IQR = Q3 - Q1, values smaller than Q1 - 1.5*IQR or larger than Q3 + 1.5*IQR will be flagged as outliers.
Parameters | Parameter Description | Default | Parameter Type | Parameter Min | Parameter Max | Parameter Step |
---|---|---|---|---|---|---|
decision_mode | How the decision is made for declaring an anomaly: For "Upper threshold": only values larger than the upper threshold will be declared as anomalies. For "Lower threshold": only values lower than the lower threshold will be declared as anomalies. For "Lower and upper threshold": values below the lower threshold and above the upper threshold will be declared as anomalies. | lower_and_upper_thresholds | single-select | |||
low_percentile | The percentile P1 such that all values smaller than P1 - alpha*IPR will be flagged as outliers. By default, P1 equals Q1 (the 1st quartile). | 25 | range | 0.0 | 100.0 | 1.0 |
high_percentile | The percentile P2 such that all values larger than P2 + alpha*IPR will be flagged as outliers. By default, P2 equals Q3 (the 3rd quartile). | 75 | range | 0.0 | 100.0 | 1.0 |
ipr_factor | The interpercentile range (denoted as IPR) equals P2-P1, where P2 and P1 are correspondingly the high and low percentiles. The factor is defined as 'alpha' in the formulas P1 - alpha * IPR and P2 + alpha * IPR, which determine the low and high thresholds for outlier detection. Values above/below these thresholds will be flagged as outliers. | 1.5 | float |
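The thresholding rule can be sketched in a few lines of pandas (illustrative only; the data values below are invented):

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 2.5, 3.0, 2.8, 2.2, 50.0])

low_percentile, high_percentile, ipr_factor = 25, 75, 1.5
p1 = values.quantile(low_percentile / 100)
p2 = values.quantile(high_percentile / 100)
ipr = p2 - p1

# decision_mode == 'lower_and_upper_thresholds'
is_outlier = (values < p1 - ipr_factor * ipr) | (values > p2 + ipr_factor * ipr)
print(is_outlier)   # only the 50.0 record is flagged
```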
Reshaping
Reshape dataset into a pivot table
Unique identifier (API): pivot_v2
Creates a pivot table with specified group-by (index) columns and a pivot column. Column groups with an equal number of components (one per pivoted value) are created, one per specified column/aggregation-function pair.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
pivot_values | List of values that are converted to columns in the output | [] | string-list |
Examples:
For example, given the following dataset
animal | size | qty | desexed |
---|---|---|---|
cat | small | 95 | True |
cat | large | 39 | False |
cat | small | 50 | False |
cat | large | 37 | False |
cat | small | 12 | True |
dog | large | 84 | False |
dog | small | 71 | True |
dog | large | 87 | True |
dog | small | 7 | True |
dog | large | 42 | False |
with input pivot column size
, grouped by column animal
, values column qty
with aggregation function sum
and value column desexed
with aggregation function all
, the result is
animal | __group_qty_sum_small | __group_qty_sum_large | __group_desexed_all_small | __group_desexed_all_large |
---|---|---|---|---|
dog | 78 | 213 | True | False |
cat | 157 | 76 | False | False |
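The same result can be sketched with pandas (an illustration under assumptions, not the action's API), using a group-by plus unstack of the pivot column:

```python
import pandas as pd

df = pd.DataFrame({
    "animal":  ["cat", "cat", "cat", "cat", "cat", "dog", "dog", "dog", "dog", "dog"],
    "size":    ["small", "large", "small", "large", "small",
                "large", "small", "large", "small", "large"],
    "qty":     [95, 39, 50, 37, 12, 84, 71, 87, 7, 42],
    "desexed": [True, False, False, False, True, False, True, True, True, False],
})

# Group by animal and size, sum qty and AND-reduce desexed, then move the
# pivot column ('size') into the column axis, one block per value/aggregation pair.
pivoted = (df.groupby(["animal", "size"])
             .agg(qty_sum=("qty", "sum"), desexed_all=("desexed", "all"))
             .unstack("size"))
print(pivoted)
```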
Unpack/Explode JSON arrays into rows
Unique identifier (API): unpack_v2
Reshapes a dataset by unpacking one or more JSON array columns and adding a new column for the position index
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
allow_single_quotes | Whether to allow single quoted JSONs | False | boolean |
Examples:
For example, given the following dataset
paper_id | author | affilliation |
---|---|---|
1 | ["Sara", "Tom", "Max"] | ["rmit", "monash", "rmit"] |
2 | ["Sara", "Max", "Lee"] | ["rmit", "rmit"] |
with input columns "author"
and "affilliation"
, output column "position"
for position index, the result is
paper_id | author | affilliation | position |
---|---|---|---|
1 | Sara | rmit | 0 |
1 | Tom | monash | 1 |
1 | Max | rmit | 2 |
2 | Sara | rmit | 0 |
2 | Max | rmit | 1 |
2 | Lee | null | 2 |
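As a rough sketch of the same behaviour (not the action's implementation), the arrays can be parsed and zipped so that the shorter array is padded with nulls, as in the last row above:

```python
import json
from itertools import zip_longest
import pandas as pd

df = pd.DataFrame({
    "paper_id": [1, 2],
    "author": ['["Sara", "Tom", "Max"]', '["Sara", "Max", "Lee"]'],
    "affilliation": ['["rmit", "monash", "rmit"]', '["rmit", "rmit"]'],
})

rows = []
for _, r in df.iterrows():
    authors = json.loads(r["author"])
    affils = json.loads(r["affilliation"])
    # zip_longest pads the shorter array with None, mirroring the null above
    for pos, (a, f) in enumerate(zip_longest(authors, affils)):
        rows.append({"paper_id": r["paper_id"], "author": a,
                     "affilliation": f, "position": pos})
result = pd.DataFrame(rows)
print(result)
```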
Unpivot a dataset from wide to long format
Unique identifier (API): unpivot
Unpivot a dataset from wide to long format. Input columns are converted to a single keys column and one or more values column(s) in the output. Columns other than input columns remain.
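Since no worked example is given for this action, here is a minimal, hypothetical pandas sketch of the same wide-to-long reshaping; the column names (id, jan_sales, feb_sales, key, value) are invented for illustration and are not part of the action.

```python
import pandas as pd

# 'id' is kept as-is; the two monthly columns are the input columns to unpivot.
wide = pd.DataFrame({"id": [1, 2],
                     "jan_sales": [100, 80],
                     "feb_sales": [120, 95]})

long = wide.melt(id_vars="id",
                 value_vars=["jan_sales", "feb_sales"],
                 var_name="key", value_name="value")
print(long)
```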
Text
Change text case
Unique identifier (API): change_text_case
Changes text columns to selected text case (uppercase, lowercase, etc.)
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
case | The required change in case of values where "Upper" converts all to uppercase, "Lower" converts all to lowercase and "Capitalize" converts first character to uppercase and remaining to lowercase. | single-select |
Examples:
For case == 'upper'
and the following input table:
col_1 |
---|
null |
a brown fox quick jumps over the lazy dog |
1234 |
Hello World |
The expected output is:
col_1 |
---|
null |
A BROWN FOX QUICK JUMPS OVER THE LAZY DOG |
1234 |
HELLO WORLD |
Decode base64-encoded text
Unique identifier (API): decode_base64
Every input column is assumed to be encoded using Base64 encoding. Decodes it into the character set (e.g. UTF8) specified and stores the decoded values in the output column
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
charset | The encoding to convert the decoded bytes to. This is the charset encoding of the original strings before they were base64 encoded | UTF8 | single-select |
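For reference, the decoding step itself is what Python's standard library does below (the encoded value is a hypothetical example, not taken from the catalogue):

```python
import base64

encoded = "SGVsbG8sIHdvcmxkIQ=="                       # hypothetical input value
decoded = base64.b64decode(encoded).decode("utf-8")    # charset == UTF8
print(decoded)                                         # Hello, world!
```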
Extract all matches of a regular expression pattern
Unique identifier (API): extract_all_regex_matches
For each input column, creates a new column of JSONArray type containing all the sub-strings in the corresponding input matching the specified regular expression.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
pattern | The pattern to be found, as a Perl-Compatible Regular Expression. Note that at present, only whole matches are extracted and capturing groups are not supported. | string |
Examples:
Given the following input dataset with column types as indicated:
('Description', 'Text') |
---|
The IP address of the router: 192.168.1.1 |
Go to 201 Mills St or call (404) 456-0861 |
The final amount of the transaction is 25,382.55 |
Spend $100 or more to enter a draw sponsored by hotels123.com to win a $20,500.00 |
There is no match 456here |
Applying the action with
- Input columns =
['Description']
- Output columns =
['Matches']
- Parameters =
{"pattern": r"\b\d+\b(?!$)"}
produces the following dataset with the column types as indicated:
('Description', 'Text') | ('Matches', 'JSONArray') |
---|---|
The IP address of the router: 192.168.1.1 | ["192","168","1"] |
Go to 201 Mills St or call (404) 456-0861 | ["201","404","456"] |
The final amount of the transaction is 25,382.55 | ["25","382"] |
Spend $100 or more to enter a draw sponsored by hotels123.com to win a $20,500.00 | ["100","20","500"] |
There is no match 456here | [] |
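The matching rule can be reproduced, for illustration only, with Python's re module (two of the rows above are reused here; the JSON serialisation mirrors the JSONArray output type):

```python
import json
import re

pattern = r"\b\d+\b(?!$)"
texts = [
    "The IP address of the router: 192.168.1.1",
    "There is no match 456here",
]
# One JSON-array string per input value, mirroring the JSONArray output column.
matches = [json.dumps(re.findall(pattern, t)) for t in texts]
print(matches)   # ['["192", "168", "1"]', '[]']
```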
Extract capture groups
Unique identifier (API): extract_capture_groups_to_multicol
Extracts one or more regex capture groups into multiple columns.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
pattern | A regular expression pattern containing one or more capture groups | string |
Examples:
For the following pattern == '([aeiou].)([aeiou].)'
, the following input table:
col_1 | col_2 |
---|---|
New_York | bc |
Lisbon | df |
Tokyo | gh |
Paris | jk |
null | lm |
Munich | np |
And assuming we apply the action to column (col_1
), with output columns
[out_a, out_b]
, the expected output is:
col_1 | col_2 | out_a | out_b |
---|---|---|---|
New_York | bc | null | null |
Lisbon | df | null | null |
Tokyo | gh | null | null |
Paris | jk | ar | is |
null | lm | null | null |
Munich | np | un | ic |
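The same per-group extraction can be sketched with pandas (illustrative only); each capture group becomes its own output column, and rows without a match get nulls:

```python
import pandas as pd

col_1 = pd.Series(["New_York", "Lisbon", "Tokyo", "Paris", None, "Munich"])

out = col_1.str.extract(r"([aeiou].)([aeiou].)")
out.columns = ["out_a", "out_b"]
print(out)   # only 'Paris' -> (ar, is) and 'Munich' -> (un, ic) match
```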
Get named entities from text
Unique identifier (API): get_named_entities_from_text
Extracts named entities mentioned in the input text column into the corresponding output text column as a JSON string. The JSON string represents a dictionary mapping entity types to the corresponding lists of entity mentions in the text.
Examples:
Extracts named entities from a given text column. See the following illustration for an input table with a single text column named col_text
and an input field col_text
and output field col_entities_from_text
:
col_text |
---|
Bill Gates addressed Microsoft. |
Jack drove to Washington. |
Decisions are driven by data. |
null |
The output dataframe contains the text column col_entities_from_text
:
col_text | col_entities_from_text |
---|---|
Bill Gates addressed Microsoft. | {"People": ["Bill Gates"], "Organizations": ["Microsoft"]} |
Jack drove to Washington. | {"People": ["Jack"], "Places": ["Washington"]} |
Decisions are driven by data. | {} |
'' | {} |
null | {} |
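A rough open-source analogue, for illustration only: the sketch below uses spaCy (assuming the en_core_web_sm model is installed), whose entity-type labels (PERSON, ORG, GPE, ...) differ from the "People"/"Organizations"/"Places" labels shown above; the action's actual NER engine is not specified here.

```python
import json
import spacy   # assumes spaCy and the en_core_web_sm model are installed

nlp = spacy.load("en_core_web_sm")
doc = nlp("Bill Gates addressed Microsoft.")

# Build a dictionary mapping entity types to the entity mentions found.
entities = {}
for ent in doc.ents:
    entities.setdefault(ent.label_, []).append(ent.text)
print(json.dumps(entities))   # e.g. {"PERSON": ["Bill Gates"], "ORG": ["Microsoft"]}
```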
Get sentiment scores from text
Unique identifier (API): get_sentiment_scores_from_text
Analyses the sentiment expressed in the input text columns and creates 3 output columns that reflect the strength (a number between 0 and 1, where 1 is the strongest) of positive, neutral and negative sentiment expressed in the input text.
Examples:
Extracts sentiment polarity scores from a given text column into three numeric output columns. See the following illustration for an input table with a single text column named 'tweets' as the input field, and output fields 'positive_sentiment', 'neutral_sentiment' and 'negative_sentiment':
tweets |
---|
Thanks for the follow-up. That's why you're the best! |
It was a horrible experience, something I would never recommend |
'' |
null |
The output dataframe contains the text columns 'positive_sentiment', 'neutral_sentiment' and 'negative_sentiment':
tweets | positive_sentiment | neutral_sentiment | negative_sentiment |
---|---|---|---|
Thanks for the follow-up. That's why you're the best! | 0.514 | 0.486 | 0.0 |
It was a horrible experience, something I would never recommend | 0.0 | 0.588 | 0.412 |
'' | 0.0 | 0.0 | 0.0 |
null | 0.0 | 0.0 | 0.0 |
Normalize text content
Unique identifier (API): normalize_text
Normalizes a text column by performing one or more normalization steps, such as removing non-text content or stemming the words.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
steps_to_execute | The required normalization steps | ['remove_html_tags', 'replace_urls'] | sequence-select |
Examples:
If the input data is:
col_1 |
---|
@JetBlue I was not allowed to drop my bag off for my flight that I was running 10min Late Flight for. Turns out it left over an hour Late Flight. |
@VirginAmerica momma I made it! 😁😁😁 |
@JetBlue bag just delivered and items have been stolen. At first look - my Kate spade bag and bottle of spiced rum!!! |
@united then why have I not received my call back its been 3 days...it's an infant car seat how am I supposed to go anywhere with my child |
@JetBlue just a heads up, this page no longer exists http://t.co/NsJWVTTjGo |
@AmericanAir Thank you, you too! |
“@JetBlue: Our fleet's on fleek. http://t.co/vw2v8GVnGq” 😐😑 |
@AmericanAir yes I did but no response. |
@SouthwestAir when is the next flight with free cocktails ? |
@AmericanAir have been waiting for a call back for over two hours jan to die tomorrow at 6am Cancelled Flightled what are my options? |
@AmericanAir that doesn't really answer my question. Maybe if I provide more details you can give me clarification... |
@JetBlue where is our luggage?? Flight 424 has been waiting an hour for bags! #jetblue #poorservice |
@united will I make it with the delay? |
@SouthwestAir why can't you take me to Knoxville?? yall are my fav #help |
applying the action on col_1
with only the remove_stopwords
and lemmatize_words
stages disabled will result in:
col_1 |
---|
mention i was not allowed to drop my bag off for my flight that i was running 10min late flight for turns out it left over an hour late flight |
mention momma i made it emoji emoji emoji |
mention bag just delivered and items have been stolen at first look my kate spade bag and bottle of spiced rum |
mention then why have i not received my call back its been number daysit s an infant car seat how am i supposed to go anywhere with my child |
mention just a heads up this page no longer exists url |
mention thank you you too |
mention our fleet s on fleek url emoji emoji |
mention yes i did but no response |
mention when is the next flight with free cocktails |
mention have been waiting for a call back for over two hours jan to die tomorrow at 6am cancelled flightled what are my options |
mention that doesn t really answer my question maybe if i provide more details you can give me clarification |
mention where is our luggage flight number has been waiting an hour for bags hashtag hashtag |
mention will i make it with the delay |
mention why can t you take me to knoxville yall are my fav hashtag |
For the same data, if all stages including remove_stopwords
and lemmatize_words
are enabled, the result will be:
col_1 |
---|
mention allow drop bag flight running 10min late flight turn left hour late flight |
mention momma emoji emoji emoji |
mention bag deliver item steal look kate spade bag bottle spice rum |
mention receive number day infant car seat suppose child |
mention head page long exist url |
mention thank |
mention fleet fleek url emoji emoji |
mention yes response |
mention flight free cocktail |
mention wait hour jan die tomorrow 6 cancel flightled option |
mention answer question maybe provide detail clarification |
mention luggage flight number wait hour bag hashtag hashtag |
mention delay |
mention knoxville y fav hashtag |
Create flag for regex match
Unique identifier (API): regex_match
Creates an output boolean column which is the result of matching the input column against a regex pattern
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
pattern | A regular expression pattern for the matching | string |
Examples:
For the following table:
col_1 |
---|
null |
That's my phone number: 054-555-111 |
Monday is the 1st day of the week |
This should match 0490-000-111 |
This should not match 0490-abc-def |
we apply the regex pattern: '0490-\d{3}-\d{3}'
Resulting in the following output:
col_1 | col_1_regex_match |
---|---|
null | null |
That's my phone number: 054-555-111 | false |
Monday is the 1st day of the week | false |
This should match 0490-000-111 | true |
This should not match 0490-abc-def | false |
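For illustration only (not the action's API), pandas produces the same flag column; nulls stay null rather than being coerced to False:

```python
import pandas as pd

col_1 = pd.Series([None,
                   "That's my phone number: 054-555-111",
                   "This should match 0490-000-111",
                   "This should not match 0490-abc-def"])

flag = col_1.str.contains(r"0490-\d{3}-\d{3}", regex=True)
print(flag)   # null, False, True, False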
Substitute using regex
Unique identifier (API): regex_sub
Finds and replaces string patterns in a column via regex. An additional pattern to filter by can optionally be specified, in which case the scope of the find/replace will be restricted to records that match the specified pattern in the column.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
match_pattern | Specify this to restrict the scope of replacement | .* | string |
sub_pattern | The pattern that needs to be substituted | string | |
sub_string | String used to replace values in places where sub_pattern matched the values. | string |
Examples:
For the following set of parameters and the following input table:
col_1 |
---|
null |
That's my phone number: 054-555-111 |
Monday is the 1st day of the week |
This should be substituted: 0490-000-111, 054-555-111 |
This should not be substituted: 0490-000-111, 054-abc-cdf |
The output is:
col_1 |
---|
null |
That's my phone number: 054-555-111 |
Monday is the 1st day of the week |
This should be substituted: 0490-000-111, I WAS REPLACED |
This should not be substituted: 0490-000-111, 054-abc-cdf |
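The example's exact parameter values are not shown above, so the sketch below uses hypothetical values for match_pattern, sub_pattern and sub_string that happen to reproduce the output; it is an illustration of the filter-then-substitute behaviour, not the action's implementation.

```python
import re
import pandas as pd

# Hypothetical parameter values chosen to reproduce the output above.
match_pattern = r"should be substituted"
sub_pattern = r"054-\d{3}-\d{3}"
sub_string = "I WAS REPLACED"

col_1 = pd.Series([None,
                   "That's my phone number: 054-555-111",
                   "Monday is the 1st day of the week",
                   "This should be substituted: 0490-000-111, 054-555-111",
                   "This should not be substituted: 0490-000-111, 054-abc-cdf"])

def substitute(value):
    # Records outside the match scope (or nulls) are left unchanged.
    if value is None or not re.search(match_pattern, value):
        return value
    return re.sub(sub_pattern, sub_string, value)

print(col_1.map(substitute))
```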
Remove prefix
Unique identifier (API): remove_prefix
Removes a string prefix from the rows of a chosen string column, if the string is found.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
prefix_string | The prefix string to be stripped | string |
Examples:
For the prefix_string == 'prefix_'
and the following input table:
col_1 |
---|
null |
prefix_to remove |
other_not to remove |
The expected output is:
col_1 |
---|
null |
to remove |
other_not to remove |
Remove punctuation
Unique identifier (API): remove_punctuation
Removes punctuation from the rows of a chosen string column. The punctuation symbols to remove are given by the user as a string
Parameters | Parameter Description | Default | Parameter Type | Suggested Values |
---|---|---|---|---|
punc_symbols | The punctuation symbols to strip | !"#$%&'()*+,-./:;<=>?@[]^_{\|}~ | string | ['!"#$%&\'()*+,-./:;<=>?@[\\]^_{\|}~'] |
Examples:
For the following input table:
col_1 |
---|
null |
This. Sentence! Has? Lots of, umm.. punctuation<> |
Monday is the 1st day of the week |
abc.edf@gmail.com |
The output with stripped punctuation is:
col_1 |
---|
null |
This Sentence Has Lots of umm punctuation |
Monday is the 1st day of the week |
abcedfgmailcom |
Remove suffix
Unique identifier (API): remove_suffix
Removes a string suffix from the rows of a chosen string column, if the string is found.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
suffix_string | The suffix string to strip | string |
Examples:
For the suffix_string == '_suffix'
and the following input table:
col_1 |
---|
null |
remove_suffix |
other_not to remove |
The expected output is:
col_1 |
---|
null |
remove |
other_not to remove |
Split on delimiter to multiple columns
Unique identifier (API): split_on_delimiter_to_multicol
For a given column, extracts text components between the delimiters into multiple columns. The delimiter can be one or more specific characters, or a regex pattern to match. The number of splits is determined by the number of provided output column names. Trailing null values are created for records containing fewer components.
Parameters | Parameter Description | Default | Parameter Type | Suggested Values |
---|---|---|---|---|
delimiter | Delimiter to be used for splitting the string | string | [',', ';', '-', '\|'] |
use_regex | Boolean flag indicating whether the specified delimiter should be treated as a regex pattern or an exact match. | False | boolean | |
include_tail | Boolean flag indicating whether the remainder (tail) of the string after the final split must be included in the last column | True | boolean |
Examples:
For the following set of parameters, with output columns 'split_0', 'split_1', 'split_2'
, and the following input table:
col_1 |
---|
one:two:three:four |
no_split |
The expected output is:
col_1 | split_0 | split_1 | split_2 |
---|---|---|---|
one:two:three:four | one | two | three:four |
no_split | no_split | null | null |
Whereas for include_tail = False
, the expected output is:
col_1 | split_0 | split_1 | split_2 |
---|---|---|---|
one:two:three:four | one | two | three |
no_split | no_split | null | null |
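For illustration only, the same two behaviours can be sketched with pandas, assuming (as the example data suggests) that the delimiter is ':':

```python
import pandas as pd

col_1 = pd.Series(["one:two:three:four", "no_split"])

# include_tail == True: at most three parts, with the remainder kept in the last one.
with_tail = col_1.str.split(":", n=2, expand=True)
with_tail.columns = ["split_0", "split_1", "split_2"]

# include_tail == False: split fully, then keep only the first three parts.
without_tail = col_1.str.split(":", expand=True).iloc[:, :3]

print(with_tail, without_tail, sep="\n")
```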
Split to array using a delimiter or pattern
Unique identifier (API): split_to_array
Obtains array (JSON) columns by splitting the selected text columns using a delimiter or a regex pattern. The number of splits can be limited to a specified maximum; otherwise the text is split until no more occurrences of the delimiter are found.
Parameters | Parameter Description | Default | Parameter Type | Suggested Values |
---|---|---|---|---|
delimiter | The delimiter to be used for splitting the string. | string | [',', ';', '-', '\|'] |
use_regex | To split by a regex pattern, set this value to "True". | False | boolean | |
limit_splits | Set this to "True" and set the number of splits to the desired number, if not intending to split indefinitely. | False | boolean | |
max_number_of_splits | How many splits to obtain at most, if the splits are being limited. | 2 | integer |
Examples:
Given the following input dataset with column types as indicated:
('col_1', 'Text') | ('col_2', 'Numeric') | ('col_3', 'Text') |
---|---|---|
1,2,3,4,5 | 1 | a |
a,b,c,d | 3.14159 | d |
e,f,g | 2.23606 | x |
one, two , three | 1.732 | no_split |
no_split | nan |
with:
- Input columns as ['col_1', 'col_3'],
- Parameters delimiter=',', use_regex=False, limit_splits=True, max_number_of_splits=3,
the result will be the following dataset with column types as indicated:
('col_1', 'JSONArray') | ('col_2', 'Numeric') | ('col_3', 'JSONArray') |
---|---|---|
["1","2","3,4,5"] | 1 | ["a |
["a","b","c,d"] | 3.14159 | ["d |
["e","f","g"] | 2.23606 | ["x |
["one"," two "," three"] | 1.732 | ["no_split"] |
["no_split"] | nan |
Strip whitespaces
Unique identifier (API): strip_whitespaces
Strips whitespaces from rows of chosen Text columns.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
mode | The method for stripping the whitespaces. Stripping can be done at the beginning, at the end, or at both ends of the strings, or ALL of the whitespaces can be stripped. | single-select |
Examples:
For the following table:
+-----------------------+
|col_1 |
+-----------------------+
| start_whitespace |
|end_whitespace |
| start_end_whitespaces |
|test tab |
+-----------------------+
and mode == 'End'
, the expected output is:
+----------------------+
|col_1 |
+----------------------+
| start_whitespace |
|end_whitespace |
| start_end_whitespaces|
|test tab |
+----------------------+
Text/Data Cleaning
Map to Text
Unique identifier (API): map_to_string
Replace values of input column(s) using a given mapping.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
mapping | Value that will be replaced and the replacement. | map |
Examples:
Assume we have the following single column table:
col_1 |
---|
dog |
cat |
null |
horse |
dogdog |
and the following mapping:
After running the action, the expected output is:
col_1 |
---|
doggy |
kitten |
null |
horse |
dogdog |
where dog
-> doggy
, cat
-> kitten
and the rest of the values were unchanged
since they didn't appear in the mapping dictionary.
Parse strings as null
Unique identifier (API): parse_strings_as_null
Parses the specified strings in given column(s) as null values
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
strings_to_parse | The list of strings that will be parsed as null | string-list |
Examples:
For the following input dataset:
col_x | col_y |
---|---|
11 | n/a |
45 | 38 |
null | 31 |
25 | 1.#QNAN |
49 | 20 |
where both columns are of text type, applying the action with the parameter strings_to_parse = ['n/a', '', 'NA', 'N/A', '<NA>', 'NULL', 'nan', '1.#IND', '#NA', '-1.#QNAN', '-nan', 'NaN', '1.#QNAN', '-1.#IND', '-NaN', 'null', '#N/A', '#N/A N/A']
and both columns selected in the input will give the following result:
col_x | col_y |
---|---|
11 | |
45 | 38 |
 | 31 |
25 | |
49 | 20 |
Text/JSON
Extract from JSON by path
Unique identifier (API): extract_from_json_by_path
Extracts a value from JSON using the specified path. Returns null
if the path isn't found.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
path | The path of the value to extract | string |
Get occurrence-count features
Unique identifier (API): get_occurrence_count_features
Counts the occurrences of each element of a JSON array. This achieves the same functionality as CountVectorizer
in the scikit-learn package.
Examples:
For example, given the following dataset
customer_id | events_A | events_B | events_C |
---|---|---|---|
1 | ["login", "logout"] | ["open", "close"] | ["log in", "log out"] |
2 | ["login", "home", "logout"] | ["open", "idle", "close"] | ["log in", "home", "log out"] |
3 | ["login", "home", "logout", "login", "logout"] | ["open", "idle", "close", "open", "close"] | ["log in", "home", "log out", "log in", "log out"] |
4 | ["login", "home", "app", "home", "logout"] | ["open", "idle", "active", "idle", "close"] | ["log in", "home", "app", "home", "log out"] |
5 | {"open": 0, "close": 1} | ["log in", "home", "app", "home", "log out"] |
with inputs as events_A
, events_B
and events_C
, outputs as A
, B
and C
, the result is
customer_id | events_A | events_B | events_C | __group_A_logout | __group_A_login | __group_A_home | __group_A_app | __group_B_open | __group_B_close | __group_B_idle | __group_B_active | __group_C_log00out | __group_C_log00in | __group_C_00home00 | __group_C_home | __group_C_app | __group_C_00app00 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | ["login", "logout"] | ["open", "close"] | ["log in", "log out"] | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
2 | ["login", "home", "logout"] | ["open", "idle", "close"] | ["log in", "home", "log out"] | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
3 | ["login", "home", "logout", "login", "logout"] | ["open", "idle", "close", "open", "close"] | ["log in", "home", "log out", "log in", "log out"] | 2 | 2 | 1 | 0 | 2 | 2 | 1 | 0 | 2 | 2 | 0 | 1 | 0 | 0 |
4 | ["login", "home", "app", "home", "logout"] | ["open", "idle", "active", "idle", "close"] | ["log in", "home", "app", "home", "log out"] | 1 | 1 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
5 | {"open": 0, "close": 1} | ["log in", "home", "app", "home", "log out"] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 |
Here, in input column events_C
, there are whitespace and special characters such as $
and _
in the elements of the JSON arrays. Characters other than numbers or letters are replaced with 00
in the output column names.
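A small, purely illustrative sketch of the counting step using Python's Counter (the column-name sanitisation rule is simplified and the prefix __group_A_ is assumed to follow the naming convention above):

```python
import json
import re
from collections import Counter
import pandas as pd

events = pd.Series(['["login", "logout"]',
                    '["login", "home", "logout", "login", "logout"]'])

# One occurrence count per distinct element, one row per input record.
counts = pd.DataFrame([Counter(json.loads(v)) for v in events]).fillna(0).astype(int)

# Mirror the naming rule described above: non-alphanumeric characters become "00".
counts.columns = ["__group_A_" + re.sub(r"[^0-9A-Za-z]", "00", c) for c in counts.columns]
print(counts)
```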
Parse JSON objects in column
Unique identifier (API): parse_json_objects
Parses each value in the input column as a JSON object, whose schema is specified by the user. Each attribute is extracted as one output column. All output columns will contain null values wherever the Text string is not parsable as a JSON object. If parsable but one or more of the specified attributes is missing or of incompatible type, their corresponding output columns will contain null values.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
allow_single_quotes | Whether to allow single quoted JSONs | False | boolean |
Examples:
For the input dataset with column names and data types as indicated below:
movie_id | title | movie_cast |
---|---|---|
1 | Avatar | {"character": "Neytiri", "gender": 1, "name": "Zoe Saldana"} |
2 | Pirates of... | {"character": "Will T", "gender": 2, "name": "Orlando Bloom"} |
3 | Spectre | {"character": "Blofeld", "gender": 2, "name": "C Waltz"} |
4 | Spider-Man 3 | {"character": "MJ", "gender": 1, "name": "Kirsten Dunst"} |
5 | X-Men | |
6 | Avengers | {"broken_json": "3", |
7 | Captain America | {"character": "Nick Fury", "gender": "M", "name_missing": true} |
8 | The Hobbit | ["Array instead of object 1", "Array instead of object 2"] |
Applying this action with input column = 'movie_cast'
and attribute/data type/output column trios [("character", "string", "character_name"), ("gender", "integer", "character_gender"), ("name", "string", "character_acted_by")]
produces the output:
movie_id | title | movie_cast | character_name | character_gender | character_acted_by |
---|---|---|---|---|---|
1 | Avatar | {"character": "Neytiri", "gender": 1, "name": "Zoe Saldana"} | Neytiri | 1 | Zoe Saldana |
2 | Pirates of... | {"character": "Will T", "gender": 2, "name": "Orlando Bloom"} | Will T | 2 | Orlando Bloom |
3 | Spectre | {"character": "Blofeld", "gender": 2, "name": "C Waltz"} | Blofeld | 2 | C Waltz |
4 | Spider-Man 3 | {"character": "MJ", "gender": 1, "name": "Kirsten Dunst"} | MJ | 1 | Kirsten Dunst |
5 | X-Men | nan | |||
6 | Avengers | {"broken_json": "3", | nan | ||
7 | Captain America | {"character": "Nick Fury", "gender": "M", "name_missing": true} | Nick Fury | nan | |
8 | The Hobbit | ["Array instead of object 1", "Array instead of object 2"] | nan |
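The null-handling rules above (unparsable strings, non-object values, and missing attributes all yield nulls) can be sketched as follows; this is an illustration only, not the action's implementation:

```python
import json
import pandas as pd

movie_cast = pd.Series([
    '{"character": "Neytiri", "gender": 1, "name": "Zoe Saldana"}',
    '{"broken_json": "3",',                     # not parsable -> all nulls
    '["Array instead of object 1"]',            # parsable but not an object -> all nulls
])

def extract(value, attribute):
    try:
        obj = json.loads(value)
    except (TypeError, json.JSONDecodeError):
        return None
    if not isinstance(obj, dict):
        return None
    return obj.get(attribute)                    # missing attribute -> null

result = pd.DataFrame({
    "character_name": movie_cast.map(lambda v: extract(v, "character")),
    "character_gender": movie_cast.map(lambda v: extract(v, "gender")),
    "character_acted_by": movie_cast.map(lambda v: extract(v, "name")),
})
print(result)
```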
Text/NLP
Get topic distribution probabilities from text
Unique identifier (API): get_topic_distribution_from_text
Analyses the input text columns and extracts underlying topics in the text. Each record is considered a text document that has originated from the underlying topics. The output numeric column group contains the probability distribution over the identified topics for each input record (document)
Parameters | Parameter Description | Default | Parameter Type | Parameter Min | Parameter Max | Parameter Step |
---|---|---|---|---|---|---|
num_topics | The number of different topics that are assumed to be present in the input text column. | 5 | range | 2.0 | 500.0 | 1.0 |
max_iterations | Sets the maximum number of iterations that can be run by the algorithm for extracting the underlying topics | 20 | range | 15.0 | 30.0 | 1.0 |
seed | Random seed for topic modeling algorithms. Used for obtaining repeatable results. | integer |
Examples:
For the following input:
text |
---|
you ever wanted to try some yourself |
not only a fun read but also a pretty infographic |
something not understandable by humans |
text of all jane austens works if |
not only a fun read but also a pretty infographic |
and the parameter values:
num_topics == 2
(Number of topics to be extracted)
We get the following output:
text | __group_text_topics_0 | __group_text_topics_1 |
---|---|---|
you ever wanted to try some yourself | 0.400954 | 0.599046 |
not only a fun read but also a pretty infographic | 0.239050 | 0.760950 |
something not understandable by humans | 0.211497 | 0.788503 |
text of all jane austens works if | 0.435378 | 0.564622 |
not only a fun read but also a pretty infographic | 0.434958 | 0.565042 |
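For orientation only, a comparable topic-distribution output can be produced with scikit-learn's LDA; this is a generic sketch, not the action's implementation, and the exact probabilities will differ from the table above:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

texts = ["you ever wanted to try some yourself",
         "not only a fun read but also a pretty infographic",
         "something not understandable by humans",
         "text of all jane austens works if",
         "not only a fun read but also a pretty infographic"]

counts = CountVectorizer().fit_transform(texts)
lda = LatentDirichletAllocation(n_components=2, max_iter=20, random_state=0)
topic_probabilities = lda.fit_transform(counts)   # one probability row per document
print(topic_probabilities)
```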
Get TF-IDF features from text
Unique identifier (API): tf_idf
Applies TF-IDF + dimensionality reduction
Parameters | Parameter Description | Default | Parameter Type | Parameter Min | Parameter Max | Parameter Step |
---|---|---|---|---|---|---|
post_processing_method | Algorithm for reducing the dimensionality of the TF-IDF output | truncated_svd | single-select | |||
num_output_dimensions | The number of dimensions after post processing of TF-IDF output. For example, if the post-processing method is truncated SVD, and 100 output dimensions are chosen, then the dimensions retained after SVD truncation correspond to 100 largest singular values. | 50 | range | 2.0 | 100.0 | 1.0 |
hashing_tf_num_features_method | How the number of features produced by the Hashing TF stage is determined. The larger the vocabulary size of the corpus, the more the number of hashing features needed for fidelity. If set to 'Automatic', the number of features to be produced by the Hashing TF stage is set to the smallest power of 2 that is closest to the estimated vocabulary size. | auto | single-select | |||
hashing_tf_num_features | The cardinality of the set of possible hash values output by the hash function. The hash function is used to convert a document (text string) into sparse vectors after tokenization. This can be used to control the number of dimensions output by TF-IDF before dimensionality reduction. Smaller numbers result in higher chance of two terms being confused, while larger numbers result in longer computation times. It is advisable to use a power of two for the parameter, otherwise the features will not be mapped evenly to the columns. | 4096 | range | 32.0 | 65536.0 | 32.0 |
Examples:
For the following input:
x |
---|
cant say for sure what they used but |
there is an r library that includes the |
text of all jane austens works if |
you ever wanted to try some yourself |
they probably got the idea |
from |
which uses jane austens work extensively |
as an ongoing example when |
showcasing the tidytext library |
the specific library of jane |
austens work is library |
im guessing they used tidytext |
and janeaustenr in r, then used a slightly |
different analysis |
using the results from r |
in tableau or another data-viz |
heavy program to make it palatable for |
readers |
pun intended sorry who has already |
done a lot of |
really cool analysis like this |
they claimed in the text that |
they used pca |
but what they dont say is how they |
created the matrix that was fed into the pca |
and the parameter values:
num_components == 2
(Number of PCA components --> Reducing to 2D)
num_features == 1024
(Number of hashing features the text is converted into)
We get the following output:
x | __group_x_encoded_0 | __group_x_encoded_1 |
---|---|---|
cant say for sure what they used but | -3.6428402533142896 | 0.4975849683664332 |
there is an r library that includes the | 0.7746395720660133 | 0.5766808957643564 |
text of all jane austens works if | 0.8119351913774921 | 0.317909114966179 |
you ever wanted to try some yourself | 0.7487119280397619 | -0.8721857902857719 |
they probably got the idea | -0.14175253549746752 | 0.5342944067879885 |
from | 0.03192967086734439 | 0.06422168591488635 |
which uses jane austens work extensively | 0.29369225493121526 | 0.8832400029096568 |
as an ongoing example when | 0.7196285830258167 | -0.7732919475505083 |
showcasing the tidytext library | 0.29799339616128284 | 0.26371593073556526 |
the specific library of jane | 0.7639567537897723 | -0.1252943669742009 |
austens work is library | 0.024263102404871856 | 0.8407417872671388 |
im guessing they used tidytext | -0.5380315745685673 | 0.8071242590633404 |
and janeaustenr in r, then used a slightly | -0.7396546964226968 | -1.5252802271552306 |
different analysis | -0.0791096097809773 | 0.016454344219511016 |
using the results from r | 0.39190114778641166 | 0.2030087287822549 |
in tableau or another data-viz | 0.08255810541303524 | 0.28434896277132304 |
heavy program to make it palatable for | -1.0379890382908399 | -0.9359229625044558 |
readers | 0.022451788887236538 | 0.15649840774931126 |
pun intended sorry who has already | 0.12645774034218657 | 0.5764433381841557 |
done a lot of | 0.0645538301832265 | -1.8895773134920169 |
really cool analysis like this | -0.8499956685146961 | -0.6456009071301254 |
they claimed in the text that | 0.31310522183662226 | 0.566252028711451 |
they used pca | -0.26125074206706816 | 0.5402513332430933 |
but what they dont say is how they | -3.1418969765868843 | 1.9805533718325532 |
created the matrix that was fed into the pca | 1.3754363219912489 | 0.2072323270271102 |
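A comparable "hashed term frequencies, then TF-IDF, then dimensionality reduction" pipeline can be sketched with scikit-learn (illustrative only; the action's actual engine and the exact numbers above may differ):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

texts = ["cant say for sure what they used but",
         "there is an r library that includes the",
         "text of all jane austens works if",
         "they used pca"]

# Hash each document into 1024 term-frequency features, apply IDF weighting,
# then reduce to 2 output dimensions with truncated SVD.
pipeline = make_pipeline(
    HashingVectorizer(n_features=1024, alternate_sign=False),
    TfidfTransformer(),
    TruncatedSVD(n_components=2, random_state=0),
)
encoded = pipeline.fit_transform(texts)
print(encoded.shape)   # (4, 2)
```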
Text/Reshaping
Split by delimiter and unpack into rows
Unique identifier (API): split_and_unpack
Splits the text content in the input column by the specified delimiter and unpacks the result into multiple rows, creating an additional column for position index. The delimiter can be one or more specific characters, or a regex pattern to match.
Parameters | Parameter Description | Default | Parameter Type | Suggested Values |
---|---|---|---|---|
delimiter | Delimiter to be used for splitting the string | string | [',', ';', '-', '\|'] |
use_regex | Boolean flag indicating whether the specified delimiter should be treated as a regex pattern or an exact match. | False | boolean |
Examples:
Given this input dataset:
('col_1', 'Text') | ('col_2', 'Numeric') |
---|---|
star | trek |
game | of |
sein,field | 2.23606 |
the | simpsons |
the | big |
with the column to split as "col_1"
and the additional split position column name provided as "split_position"
and use_regex=True
, the output dataset will be:
('col_1', 'Text') | ('col_2', 'Numeric') | ('split_position', 'Numeric') |
---|---|---|
star | 2.71828 | 0 |
trek | 2.71828 | 1 |
game | 3.14159 | 0 |
of | 3.14159 | 1 |
thrones | 3.14159 | 2 |
sein | 2.23606 | 0 |
field | 2.23606 | 1 |
the | 1.732 | 0 |
simpsons | 1.732 | 1 |
the | nan | 0 |
big | nan | 1 |
bang | nan | 2 |
theory | nan | 3 |
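The example's exact regex delimiter is not shown above, so the sketch below uses a hypothetical pattern (commas or whitespace) purely to illustrate the split-then-unpack-with-position behaviour; it is not the action's implementation:

```python
import re
import pandas as pd

df = pd.DataFrame({"col_1": ["sein,field", "the big bang theory"],
                   "col_2": [2.23606, float("nan")]})

# Hypothetical regex delimiter for illustration only.
parts = df["col_1"].map(lambda s: re.split(r"[,\s]+", s))

unpacked = (df.assign(col_1=parts)
              .explode("col_1")
              .assign(split_position=lambda d: d.groupby(level=0).cumcount())
              .reset_index(drop=True))
print(unpacked)
```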
Text/Web
Extract URL parts from URL Text string
Unique identifier (API): extract_url_parts
Extracts URL components from a given column containing URL strings
Time Series
Resample data into a regular time series
Unique identifier (API): resample_time_series_v2
Resample data with a time-index column into a regular time series, with a specified aggregation function for each input column. Null-value interpolation is applied.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
sampling_interval | Interval/Duration between successive records | integer | |
time_unit | Unit for the time interval | day | single-select |
Examples:
For example, given the following dataset
receipt_time | qty | sku | expected_receipt_time |
---|---|---|---|
null | 30 | A231 | 2012-06-12 02:00:00 |
2012-06-13 00:10:00 | 40 | A351 | 2012-06-13 00:00:00 |
2012-06-13 00:30:00 | 100 | A351 | 2012-06-13 00:30:00 |
2012-06-13 20:45:20 | 29 | A456 | 2012-06-13 00:30:00 |
2012-06-15 21:40:03 | 50 | A451 | 2012-06-16 01:30:00 |
2012-06-17 05:41:23 | 80 | A221 | 2012-06-20 08:30:00 |
2012-06-17 11:10:53 | 70 | A221 | 2012-06-16 20:00:00 |
with time-index column receipt_time
, for parameters sampling_interval = 1
, time_unit = "day"
, input columns qty
, sku
, expected_receipt_time
and the corresponding aggregation functions "sum"
, "topk"
(k_most_frequent = 3
), "latest"
, the result is as follows
receipt_time | qty | sku | expected_receipt_time |
---|---|---|---|
2012-06-14 00:00:00 | 169 | ["A351","A456"] | 2012-06-13 00:30:00 |
2012-06-15 00:00:00 | null | null | null |
2012-06-16 00:00:00 | 50 | ["A451"] | 2012-06-16 01:30:00 |
2012-06-17 00:00:00 | null | null | null |
2012-06-18 00:00:00 | 150 | ["A221"] | 2012-06-20 08:30:00 |
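For illustration only, the core resampling idea (a regular daily grid with nulls for empty intervals) looks like this in pandas, here limited to the "sum" aggregation of qty on a small subset of the data:

```python
import pandas as pd

df = pd.DataFrame({
    "receipt_time": pd.to_datetime(["2012-06-13 00:10:00",
                                    "2012-06-13 20:45:20",
                                    "2012-06-15 21:40:03"]),
    "qty": [40, 29, 50],
})

# sampling_interval = 1, time_unit = "day"; days with no records yield nulls
# (min_count=1) rather than zeros or being dropped.
regular = (df.set_index("receipt_time")
             .resample("1D")["qty"]
             .sum(min_count=1))
print(regular)
```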
Window Function
Compute window functions
Unique identifier (API): compute_window_function
Compute functions such as cumulative sum, rolling mean, etc. by specifying a window and desired functions with their inputs. Unbounded, shrinking, expanding and rolling windows are supported, with optional partitioning and ordering
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
window_start | Whether to set window start unbounded or a specified position relative to current row, with unbounded by default | unbounded | single-select |
start_position | Position from which window starts, relative to the current row (0). Negative and positive numbers indicate the number of rows that precede and follow the current row respectively. | integer | |
window_end | Whether to set window end unbounded or a specified position relative to current row, with unbounded by default | unbounded | single-select |
end_position | Position from which window ends, relative to the current row (0). Negative and positive numbers indicate the number of rows that precede and follow the current row respectively. | integer |
Examples:
Given the dataset
animal | colour | birth | qty | desexed | note |
---|---|---|---|---|---|
cat | yellow | 2019-01-01 00:00:00 | 10 | True | There is a story |
cat | blue | 2019-01-03 00:00:00 | 9 | True | No story |
dog | blue | 2019-01-02 00:00:00 | 4 | True | |
cat | black | 2019-01-07 00:00:00 | 2 | True | No story |
cat | blue | 2019-01-04 00:00:00 | 8 | True | There is a story |
dog | yellow | 2019-01-05 00:00:00 | 1 | True | There is another story |
dog | black | 2019-01-06 00:00:00 | 2 | False | There is a story |
dog | yellow | 2019-01-08 00:00:00 | 2 | False |
for input columns qty
, colour
, birth
, desexed
, note
and the corresponding aggregation functions
sum
, most_frequent
, latest
, all
, count (non-null)
, the result is as follows
animal | colour | birth | qty | desexed | note | qty_sum | colour_most_frequent | birth_latest | desexed_all | note_count |
---|---|---|---|---|---|---|---|---|---|---|
dog | blue | 2019-01-02 00:00:00 | 4 | True | 4 | blue | 2019-01-02 00:00:00 | True | 0 | |
dog | yellow | 2019-01-05 00:00:00 | 1 | True | There is another story | 5 | blue | 2019-01-05 00:00:00 | True | 1 |
dog | black | 2019-01-06 00:00:00 | 2 | False | There is a story | 7 | black | 2019-01-06 00:00:00 | False | 2 |
dog | yellow | 2019-01-08 00:00:00 | 2 | False | 9 | yellow | 2019-01-08 00:00:00 | False | 2 | |
cat | yellow | 2019-01-01 00:00:00 | 10 | True | There is a story | 10 | yellow | 2019-01-01 00:00:00 | True | 1 |
cat | blue | 2019-01-03 00:00:00 | 9 | True | No story | 19 | blue | 2019-01-03 00:00:00 | True | 2 |
cat | blue | 2019-01-04 00:00:00 | 8 | True | There is a story | 27 | blue | 2019-01-04 00:00:00 | True | 3 |
cat | black | 2019-01-07 00:00:00 | 2 | True | No story | 29 | blue | 2019-01-07 00:00:00 | True | 4 |
with window of unbounded start and end position 0, partitioned by animal
and ordered by birth
.
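As a minimal pandas sketch of one of the functions above (illustrative only, not the action's API): an unbounded-start, current-row-end window partitioned by animal and ordered by birth is simply a running sum per partition.

```python
import pandas as pd

df = pd.DataFrame({
    "animal": ["cat", "cat", "dog", "cat", "dog"],
    "birth": pd.to_datetime(["2019-01-01", "2019-01-03", "2019-01-02",
                             "2019-01-07", "2019-01-05"]),
    "qty": [10, 9, 4, 2, 1],
})

# Window with unbounded start and end position 0 (the current row),
# partitioned by animal and ordered by birth: a cumulative sum per animal.
df = df.sort_values(["animal", "birth"])
df["qty_sum"] = df.groupby("animal")["qty"].cumsum()
print(df)
```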
Rank rows
Unique identifier (API): window_rank
Rank rows after ordering and optional partitioning
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
rank_method | Method used to get the rank | rank | single-select |
Examples:
For example, the following dataset
+------+------+------+
|animal|colour|credit|
+------+------+------+
|cat |blue |-3.0 |
|cat |blue |-1.0 |
|cat |blue |5.0 |
|cat |blue |7.0 |
|cat |blue |10.0 |
|cat |yellow|1.0 |
|cat |yellow|4.0 |
|cat |yellow|4.0 |
|cat |yellow|4.0 |
|cat |yellow|8.0 |
|dog |blue |9.0 |
|dog |blue |7.0 |
|dog |blue |6.0 |
|dog |blue |5.0 |
|dog |blue |-10.0 |
|dog |yellow|null |
|dog |yellow|5.0 |
|dog |yellow|8.0 |
|dog |yellow|null |
|dog |yellow|1.0 |
+------+------+------+
is converted to
+------+------+------+----+
|animal|colour|credit|rank|
+------+------+------+----+
|dog |blue |-10.0 |1 |
|dog |blue |5.0 |2 |
|dog |blue |6.0 |3 |
|dog |blue |7.0 |4 |
|dog |blue |9.0 |5 |
|cat |yellow|1.0 |1 |
|cat |yellow|4.0 |2 |
|cat |yellow|4.0 |2 |
|cat |yellow|4.0 |2 |
|cat |yellow|8.0 |5 |
|cat |blue |-3.0 |1 |
|cat |blue |-1.0 |2 |
|cat |blue |5.0 |3 |
|cat |blue |7.0 |4 |
|cat |blue |10.0 |5 |
|dog |yellow|null |1 |
|dog |yellow|null |1 |
|dog |yellow|1.0 |3 |
|dog |yellow|5.0 |4 |
|dog |yellow|8.0 |5 |
+------+------+------+----+
after applying window_rank
, partitioned by animal
and colour
and ordered by credit
for option rank
.
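For illustration only, the 'rank' option (ties share a rank and the following rank is skipped) corresponds to pandas' method="min", as in this small sketch on a subset of the data:

```python
import pandas as pd

df = pd.DataFrame({
    "animal": ["dog", "dog", "dog", "cat", "cat", "cat"],
    "colour": ["blue", "blue", "blue", "yellow", "yellow", "yellow"],
    "credit": [9.0, 7.0, -10.0, 4.0, 4.0, 8.0],
})

# rank_method == 'rank': ties share a rank and the next rank is skipped.
df["rank"] = (df.groupby(["animal", "colour"])["credit"]
                .rank(method="min")
                .astype(int))
print(df.sort_values(["animal", "colour", "credit"]))
```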
Shift rows after ordering optionally with partitioning
Unique identifier (API): window_shift
Shift rows for input columns after ordering, optionally with partitioning, by a specified value indicating the number of rows to shift and the shift direction
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
n_rows | Number of rows to shift, where positive values indicate moving downwards (lag) and negative values indicate moving upwards (lead) | 1 | integer |
Examples:
For example, given the following dataset
animal | colour | adopted | qty | vehicle |
---|---|---|---|---|
cat | yellow | 2020-01-01 00:00:00 | 3.0 | car |
cat | yellow | 2020-01-02 00:00:00 | 1.0 | truck |
cat | yellow | 2020-01-03 00:00:00 | 5.0 | car |
cat | yellow | 2020-01-04 00:00:00 | 7.0 | van |
cat | yellow | 2020-01-05 00:00:00 | 10.0 | van |
cat | white | 2020-01-06 00:00:00 | 1.0 | car |
cat | white | 2020-01-07 00:00:00 | 4.0 | van |
cat | white | 2020-01-08 00:00:00 | 4.0 | car |
cat | white | 2020-01-09 00:00:00 | 4.0 | truck |
cat | white | 2020-01-10 00:00:00 | 8.0 | car |
dog | yellow | 2020-01-11 00:00:00 | 9.0 | car |
dog | yellow | 2020-01-12 00:00:00 | 7.0 | van |
dog | yellow | 2020-01-13 00:00:00 | 6.0 | car |
dog | yellow | 2020-01-14 00:00:00 | 5.0 | van |
dog | yellow | 2020-01-15 00:00:00 | 10.0 | car |
dog | white | 2020-01-16 00:00:00 | null | van |
dog | white | 2020-01-17 00:00:00 | 5.0 | car |
dog | white | 2020-01-18 00:00:00 | 8.0 | car |
dog | white | 2020-01-19 00:00:00 | null | truck |
dog | white | 2020-01-20 00:00:00 | 1.0 | car |
the result is
animal | colour | adopted | qty | vehicle | qty_adjusted | vehicle_adjusted |
---|---|---|---|---|---|---|
cat | white | 2020-01-06 00:00:00 | 1.0 | car | 4.0 | van |
cat | white | 2020-01-07 00:00:00 | 4.0 | van | 4.0 | car |
cat | white | 2020-01-08 00:00:00 | 4.0 | car | 4.0 | truck |
cat | white | 2020-01-09 00:00:00 | 4.0 | truck | 8.0 | car |
cat | white | 2020-01-10 00:00:00 | 8.0 | car | null | null |
cat | yellow | 2020-01-01 00:00:00 | 3.0 | car | 1.0 | truck |
cat | yellow | 2020-01-02 00:00:00 | 1.0 | truck | 5.0 | car |
cat | yellow | 2020-01-03 00:00:00 | 5.0 | car | 7.0 | van |
cat | yellow | 2020-01-04 00:00:00 | 7.0 | van | 10.0 | van |
cat | yellow | 2020-01-05 00:00:00 | 10.0 | van | null | null |
dog | white | 2020-01-16 00:00:00 | null | van | 5.0 | car |
dog | white | 2020-01-17 00:00:00 | 5.0 | car | 8.0 | car |
dog | white | 2020-01-18 00:00:00 | 8.0 | car | null | truck |
dog | white | 2020-01-19 00:00:00 | null | truck | 1.0 | car |
dog | white | 2020-01-20 00:00:00 | 1.0 | car | null | null |
dog | yellow | 2020-01-11 00:00:00 | 9.0 | car | 7.0 | van |
dog | yellow | 2020-01-12 00:00:00 | 7.0 | van | 6.0 | car |
dog | yellow | 2020-01-13 00:00:00 | 6.0 | car | 5.0 | van |
dog | yellow | 2020-01-14 00:00:00 | 5.0 | van | 10.0 | car |
dog | yellow | 2020-01-15 00:00:00 | 10.0 | car | null | null |
with input columns qty
and vehicle
, partitioned by animal
and colour
and ordered by adopted
, for shifted n_rows = -1
(lead).
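The lead/lag behaviour can be sketched with pandas for illustration only (not the action's implementation), here on a subset of the data with n_rows = -1:

```python
import pandas as pd

df = pd.DataFrame({
    "animal": ["cat", "cat", "cat", "dog", "dog"],
    "adopted": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03",
                               "2020-01-11", "2020-01-12"]),
    "qty": [3.0, 1.0, 5.0, 9.0, 7.0],
})

# n_rows = -1 (lead): each row receives the next row's value within its partition,
# ordered by the adoption date; the last row of each partition becomes null.
df = df.sort_values(["animal", "adopted"])
df["qty_adjusted"] = df.groupby("animal")["qty"].shift(-1)
print(df)
```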
Deprecated actions
The following actions, which were available previously, have now been deprecated. This means that they can no longer be found in the recipe UI, as they have been replaced by newer actions that are better in terms of usability or performance. However, existing recipes that contain these actions will continue to be supported.
Column Group
Take
Unique identifier (API): take
Please use the newer version get_column_group_components
Extracts specified components from a column group. If the component is not found, returns a null column.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
address_type | Method of selection of group column component to take | single-select | |
index | The component index (zero based) to take from the group. E.g. if the group columns are ["__group_grp_a", "__group_grp_b"] and index==1, the taken column will be "__group_grp_b". | 0 | integer |
suffix | The component suffix to take from the group. E.g. if the group columns are ["__group_grp_a", "__group_grp_b"] and suffix=="a", the taken column will be "__group_grp_a". | string |
Examples:
For the following set of parameters and the following input table:
__group_grp_a_0 | __group_grp_a_1 | __group_grp_b_0 | __group_grp_b_1 |
---|---|---|---|
1 | 2 | 1 | null |
Sentence A | Sentence B | Sentence C | Sentence D |
The output is:
__group_grp_a_0 | __group_grp_a_1 | __group_grp_b_0 | __group_grp_b_1 | col_from_grp_a | col_from_grp_b |
---|---|---|---|---|---|
1 | 2 | 1 | null | 2 | null |
Sentence A | Sentence B | Sentence C | Sentence D | Sentence B | Sentence D |
Custom Formula
Create a new column with a formula
Unique identifier (API): create_new_column_with_formula
Please use the new version create_new_column_with_formula_v2
Creates a new column with any user-entered formula. Common Excel/SQL functions and operators for manipulating Numeric, DateTime, Text, and Boolean values are permitted. For a full list of allowed functions, see documentation.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
expression | Any formula where symbols correspond to the input columns in lexical order. For example, if the input selection has column_B then column_A and the formula is y / x, column_B is mapped to x and column_A is mapped to y. | string |
Examples:
Consider the dataset:
manufacturer | model | year | displ | cty | hwy |
---|---|---|---|---|---|
audi | a4 quattro | 1999 | 1.8 | 16 | 25 |
audi | a6 quattro | 1999 | 2.8 | 15 | 24 |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 |
chevrolet | c1500 suburban 2wd | 2008 | 6 | 12 | 17 |
chevrolet | corvette | 1999 | 5.7 | 16 | 26 |
chevrolet | corvette | 1999 | 5.7 | 15 | 23 |
chevrolet | k1500 tahoe 4wd | 1999 | 6.5 | 14 | 17 |
chevrolet | malibu | 1999 | 3.1 | 18 | 26 |
dodge | caravan 2wd | 1999 | 3.8 | 15 | 22 |
where cty
, hwy
and year
are of Numeric
type.
Given the input columns selected as ['cty', 'hwy']
, the output column name 'hwy_minus_cty'
, the formula 'y - x'
, the result will be:
manufacturer | model | year | displ | cty | hwy | hwy_minus_cty |
---|---|---|---|---|---|---|
audi | a4 quattro | 1999 | 1.8 | 16 | 25 | 9 |
audi | a6 quattro | 1999 | 2.8 | 15 | 24 | 9 |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 | 7 |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 | 7 |
chevrolet | c1500 suburban 2wd | 2008 | 6 | 12 | 17 | 5 |
chevrolet | corvette | 1999 | 5.7 | 16 | 26 | 10 |
chevrolet | corvette | 1999 | 5.7 | 15 | 23 | 8 |
chevrolet | k1500 tahoe 4wd | 1999 | 6.5 | 14 | 17 | 3 |
chevrolet | malibu | 1999 | 3.1 | 18 | 26 | 8 |
dodge | caravan 2wd | 1999 | 3.8 | 15 | 22 | 7 |
where the new hwy_minus_cty
column will be of Numeric
type.
For the same input data, given the input columns selected as ['year']
, the output column name 'century'
the formula 'IF(x <= 2000, "20th", "21st")'
the result will be:
manufacturer | model | year | displ | cty | hwy | century |
---|---|---|---|---|---|---|
audi | a4 quattro | 1999 | 1.8 | 16 | 25 | 20th |
audi | a6 quattro | 1999 | 2.8 | 15 | 24 | 20th |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 | 21st |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 | 21st |
chevrolet | c1500 suburban 2wd | 2008 | 6 | 12 | 17 | 21st |
chevrolet | corvette | 1999 | 5.7 | 16 | 26 | 20th |
chevrolet | corvette | 1999 | 5.7 | 15 | 23 | 20th |
chevrolet | k1500 tahoe 4wd | 1999 | 6.5 | 14 | 17 | 20th |
chevrolet | malibu | 1999 | 3.1 | 18 | 26 | 20th |
dodge | caravan 2wd | 1999 | 3.8 | 15 | 22 | 20th |
where the new century
column will be of Text
type.
Transform columns with a formula
Unique identifier (API): transform_columns_with_formula
Please use the new version transform_columns_with_formula_v2
Transform one or more column(s) with a user-specified formula. Common Excel/SQL functions and operators for manipulating Numeric, DateTime, Text and Boolean values are permitted. For a full list of these functions, see documentation.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
formula | Formula where the symbol SELF refers to the column(s) to be transformed, and the rest of the symbols in lexical order are mapped to the other input column(s), if any. For example, if the columns to be transformed are col_1 and col_2, the other input columns are col_A and col_B, and the formula is SELF + z - y, then col_i is modified as col_i + col_B - col_A for i = 1, 2. | string |
Examples:
For example, given the following dataset
qty_1 | qty_2 |
---|---|
10 | nan |
20 | 15 |
15 | 6 |
nan | 1 |
8 | 8 |
and formula = "SELF + (y + SELF) * 10"
, with qty_1
as column to be transformed and qty_2
as the other input column, the result will be
qty_1 | qty_2 |
---|---|
nan | nan |
370 | 15 |
225 | 6 |
nan | 1 |
168 | 8 |
Data Cleaning/Processing
Remove duplicated rows within group
Unique identifier (API): group_by_remove_duplicates
Please use the newer version group_by
Remove duplicated rows for one or more input columns grouped by other selected column(s).
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
keep | First or last row to keep | first | single-select |
Examples:
For example, given the following dataset
animal | colour | carried | description | qty | time | desexed |
---|---|---|---|---|---|---|
cat | yellow | van | There is a story. | 0.5416980271485203 | 2020-10-10 00:00:00 | true |
cat | yellow | van | There is a story. | 0.09442987061798414 | 2020-10-10 00:00:00 | false |
cat | yellow | car | There is a story. | 0.5766948082733302 | 2020-10-10 00:00:00 | true |
cat | yellow | van | There is a story. | 0.8015245589198161 | 2020-01-01 00:00:00 | false |
cat | yellow | van | There is a story. | 0.2523252855972161 | 2019-01-01 00:00:00 | true |
cat | white | van | There is a story. | 0.044270295367698886 | 2020-01-01 00:00:00 | true |
cat | white | car | There is another story. | 0.34340627533339274 | 2020-01-01 00:00:00 | true |
cat | white | van | There is another story. | 0.243885710885979 | 2019-01-10 00:00:00 | true |
cat | white | car | There is another story. | 0.9875301137422845 | 2020-01-01 00:00:00 | false |
cat | white | van | There is a story. | 0.4416194595201147 | 2019-01-10 00:00:00 | false |
dog | yellow | van | There is another story. | 0.18251646225487406 | 2019-01-10 00:00:00 | false |
dog | yellow | car | There is a story. | 0.9073423733435338 | 2019-01-10 00:00:00 | false |
dog | yellow | car | There is another story. | 0.5242232544087159 | 2020-01-01 00:00:00 | false |
dog | yellow | van | There is a story. | 0.09555633797329921 | 2019-01-10 00:00:00 | true |
dog | yellow | car | There is a story. | 0.13287387217823854 | 2020-10-10 00:00:00 | true |
dog | white | car | There is another story. | 0.8686814568933312 | 2020-10-02 00:00:00 | true |
dog | white | car | There is another story. | 0.5376552548800148 | 2019-01-10 00:00:00 | true |
dog | white | car | There is another story. | 0.6979028231254814 | 2019-01-10 00:00:00 | false |
dog | white | van | There is a story. | 0.19022329903980462 | 2020-01-01 00:00:00 | true |
dog | white | car | There is another story. | 0.856685303343519 | 2020-01-01 00:00:00 | false |
with input columns animal
and colour
, keep="first"
, the result is
animal | colour | carried | description | qty | time | desexed |
---|---|---|---|---|---|---|
cat | yellow | van | There is a story. | 0.5416980271485203 | 2020-10-10 00:00:00 | true |
cat | white | van | There is a story. | 0.044270295367698886 | 2020-01-01 00:00:00 | true |
dog | yellow | van | There is another story. | 0.18251646225487406 | 2019-01-10 00:00:00 | false |
dog | white | car | There is another story. | 0.8686814568933312 | 2020-10-02 00:00:00 | true |
Impute columns
Unique identifier (API): impute_column
Please use the newer version impute_numeric or impute_categorical
Impute the missing values in your data
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
categorical_strategy | Strategy for imputing Categorical columns | most_frequent | single-select |
categorical_fill_value | Filler value for Categoric columns when strategy is constant | string | |
numeric_strategy | Strategy for imputing Numeric columns | mean | single-select |
numeric_fill_value | Filler value for Numeric columns when strategy is constant | float |
Feature Engineering
Calculate math function
Unique identifier (API): calculate_math_function
Please use the action create_new_column_with_formula_v2
Calculate a math function on selected columns. Invalid values such as negative input for a logarithmic function result in null outputs.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
math_function | Math function applied to columns | single-select |
Examples:
For example, compute the absolute value abs(x).
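A small numpy/pandas sketch of the null-on-invalid-input behaviour described above (the series and variable names are illustrative):
```python
import numpy as np
import pandas as pd

s = pd.Series([4.0, -2.0, 9.0])
abs_result = s.abs()                 # abs(x) is defined everywhere
log_result = np.log(s.where(s > 0))  # ln(x); non-positive inputs become null
```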
Calculate math function (2)
Unique identifier (API): calculate_math_function2
Please use the action create_new_column_with_formula_v2
Calculate a math function that takes two inputs. If invalid inputs are encountered (such as division by zero), the outputs will be null.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
math_function | Math function applied to input columns | single-select |
Examples:
For example, compute the difference (c = a - b) or the ratio (c = a/b).
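Similarly, a sketch of the two-input case where division by zero yields null rather than an error or infinity (series names assumed):
```python
import numpy as np
import pandas as pd

a = pd.Series([10.0, 4.0, 7.0])
b = pd.Series([2.0, 0.0, 7.0])
difference = a - b                                   # c = a - b
ratio = (a / b).replace([np.inf, -np.inf], np.nan)   # division by zero -> null
```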
Cluster and extract features
Unique identifier (API): cluster
Please use the newer version get_cluster_features_v2
Cluster the input data using unsupervised learning algorithms. Outputs three features: the cluster ID, the cluster size, and the Euclidean distance from the centroid of the cluster.
Parameters | Parameter Description | Default | Parameter Type | Parameter Min | Parameter Max | Parameter Step |
---|---|---|---|---|---|---|
algorithm | The algorithm used for clustering, such as k-means or Gaussian Mixture Model. | single-select | ||||
num_clusters_method | The method that is utilized for calculating the number of clusters | inferred | single-select | |||
num_clusters | The numbers of clusters to generate. | 2 | range | 2.0 | 100.0 | 1.0 |
max_num_clusters | This defines the search space for the algorithm inferring the best number of clusters automatically. Setting it to a value of K will make the algorithm look for the best number of clusters in the range [2, 3, ..., K] | 10 | range | 2.0 | 100.0 | 1.0 |
seed | Nonnegative integer. Random seed for clustering algorithms. Used for obtaining repeatable results. | 0 | integer | |||
tolerance | The minimum change in the squared distances between cluster centroids between two successive iterations, for the algorithm to continue to the next iteration. If the change is below the threshold specified by this parameter, the algorithm is assumed to have converged before reaching the maximum number of iterations. A lower value for this parameter results in higher fidelity in the assignment of cluster IDs, but can make the action run slower especially on large datasets. | 0.0001 | float | |||
max_iter | The maximum number of iterations regardless of convergence. The higher it is the more accurate the clustering will be, but comes at the cost of a longer runtime. | 20 | integer | |||
min_cluster_size | The smallest size grouping that HDBSCAN considers as a cluster. Single linkage splits that contain fewer points than this will be considered points "falling out" of a cluster rather than a cluster splitting into two new clusters | 5 | integer | |||
min_samples_specified | If enabled, allows specifying the minimum number of samples parameter. Otherwise, will be determined automatically. | False | boolean | |||
min_samples | The number of samples in a neighbourhood for a point to be considered a core point. Providing a larger value for this parameter results in a more conservative clustering, which means more points will be declared as noise and clusters will be restricted to progressively more dense areas. | 1 | integer | |||
cluster_selection_epsilon | A distance threshold. Clusters below this value will be merged. The choice of cluster_selection_epsilon depends on the given distances between your data points. For example, set the value to 0.5 if you don't want to separate clusters that are less than 0.5 units apart. | 0 | float | |||
allow_single_cluster | By default HDBSCAN* will not produce a single cluster; setting this to True overrides that behaviour and allows a single-cluster result when that is a valid outcome for your dataset. | False | boolean
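To illustrate the three output features (cluster ID, cluster size, distance to the centroid), a scikit-learn k-means sketch; the platform's algorithms, defaults, and output column names will differ:
```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({"x": [0.1, 0.2, 5.0, 5.1], "y": [0.0, 0.1, 5.2, 5.3]})
km = KMeans(n_clusters=2, random_state=0, n_init=10).fit(df)

df["cluster_id"] = km.labels_
df["cluster_size"] = df.groupby("cluster_id")["cluster_id"].transform("size")
centroids = km.cluster_centers_[km.labels_]          # per-row centroid
df["distance_to_centroid"] = np.linalg.norm(
    df[["x", "y"]].to_numpy() - centroids, axis=1)   # Euclidean distance
```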
Extract PCA Components
Unique identifier (API): extract_pca
Please use the newer version extract_pca_v2
Create new numerical features by combining existing numerical features in linear ways using PCA
Parameters | Parameter Description | Default | Parameter Type | Parameter Min | Parameter Max | Parameter Step |
---|---|---|---|---|---|---|
limit_components_by | Criterion for selecting the principal components. If the minimum explained variance is used, the components that contribute most to the variance are selected until the cumulative variance is not less than the specified ratio of the total variance. | min_explained_variance | single-select |
num_components | The number of principal components to extract from the input data | integer | ||||
min_explained_variance | This is the threshold on the ratio of explained variance of a component to the total variance. Only the components with highest variance whose cumulative variance is at least the specified ratio of total variance will be selected | 0.8 | range | 0.01 | 1.0 | 0.01 |
categorical_impute_method | Method for imputing Categorical columns | most_frequent | single-select | |||
categorical_fill_value | Filler value for categorical columns when strategy is constant | string | ||||
numeric_impute_method | Method for imputing Numeric columns | mean | single-select | |||
numeric_fill_value | Filler value for Numeric columns when strategy is constant | float |
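The min_explained_variance criterion behaves like passing a float to scikit-learn's PCA, which keeps the smallest number of components whose cumulative explained-variance ratio reaches the threshold; imputation and the platform's output naming are omitted in this sketch:
```python
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                   "b": [1.1, 1.9, 3.2, 3.9],
                   "c": [0.5, 0.4, 0.6, 0.5]})
pca = PCA(n_components=0.8)          # min_explained_variance = 0.8
components = pca.fit_transform(df)   # one column per retained component
```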
One hot encode categorical columns
Unique identifier (API): one_hot_encode
Please use the newer version one_hot_encode_v2
This action converts categorical columns by one-hot encoding. Missing values remain as null. Unseen values are encoded as all 0s when a finalized recipe with this action is applied to new data.
Examples:
For example, given the following dataset
col_1 | col_2 |
---|---|
A | yellow |
B | null |
B | blue |
A | blue |
E | red |
the result is
col_1 | col_2 | __group_encoded_1_0 | __group_encoded_1_1 | __group_encoded_1_2 | __group_encoded_2_0 | __group_encoded_2_1 | __group_encoded_2_2 |
---|---|---|---|---|---|---|---|
A | yellow | 1 | 0 | 0 | 0 | 0 | 1 |
B | null | 0 | 1 | 0 | null | null | null |
B | blue | 0 | 1 | 0 | 1 | 0 | 0 |
A | blue | 1 | 0 | 0 | 1 | 0 | 0 |
E | red | 0 | 0 | 1 | 0 | 1 | 0 |
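A pandas approximation of the col_2 part of the example: get_dummies does the encoding, and rows with a null source value are masked back to null as described above. Column names are illustrative, not the platform's __group_encoded_* names:
```python
import pandas as pd

df = pd.DataFrame({"col_1": ["A", "B", "B", "A", "E"],
                   "col_2": ["yellow", None, "blue", "blue", "red"]})

encoded = pd.get_dummies(df["col_2"], prefix="col_2").astype(float)
encoded[df["col_2"].isna()] = float("nan")   # missing values stay null
result = pd.concat([df, encoded], axis=1)
```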
Filtering/Custom Formula
Filter rows
Unique identifier (API): filter_rows_with_formula
Please use the new version filter_rows_with_formula_v2
Filter rows in data by keeping only rows that satisfy the condition specified in user-entered formula. Common Excel/SQL functions and operators for manipulating Numeric, DateTime, Text, and Boolean values are permitted. For a full list of allowed functions, see documentation.
NOTE: The formula entered must evaluate to a boolean value.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
condition | The condition under which the rows are to be kept, entered as a formula. Symbols must correspond to the input columns in lexical order. E.g., if the input selection has column_B then column_A and the formula is y >= x, column_B is mapped to x and column_A is mapped to y. | string |
Examples:
Consider the dataset:
manufacturer | model | year | displ | cty | hwy |
---|---|---|---|---|---|
audi | a4 quattro | 1999 | 1.8 | 16 | 25 |
audi | a6 quattro | 1999 | 2.8 | 15 | 24 |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 |
chevrolet | c1500 suburban 2wd | 2008 | 6 | 12 | 17 |
chevrolet | corvette | 1999 | 5.7 | 16 | 26 |
chevrolet | corvette | 1999 | 5.7 | 15 | 23 |
chevrolet | k1500 tahoe 4wd | 1999 | 6.5 | 14 | 17 |
chevrolet | malibu | 1999 | 3.1 | 18 | 26 |
dodge | caravan 2wd | 1999 | 3.8 | 15 | 22 |
dodge | caravan 2wd | 2008 | 4 | 16 | 23 |
dodge | durango 4wd | 2008 | 5.7 | 13 | 18 |
dodge | durango 4wd | 2008 | 5.7 | 13 | 18 |
dodge | ram 1500 pickup 4wd | 1999 | 5.9 | 11 | 15 |
ford | expedition 2wd | 1999 | 4.6 | 11 | 17 |
ford | explorer 4wd | 1999 | 5 | 13 | 17 |
ford | f150 pickup 4wd | 1999 | 4.2 | 14 | 17 |
ford | f150 pickup 4wd | 1999 | 4.2 | 14 | 17 |
ford | f150 pickup 4wd | 2008 | 4.6 | 13 | 17 |
ford | mustang | 2008 | 4 | 17 | 26 |
where cty
and year
are of Numeric
type.
Given the input columns selected as ['year', 'cty']
and the formula 'AND(x > 2000, y >= 15)'
, the result will be:
manufacturer | model | year | displ | cty | hwy |
---|---|---|---|---|---|
audi | a6 quattro | 2008 | 4.2 | 16 | 23 |
audi | a6 quattro | 2008 | 4.2 | 16 | 23 |
dodge | caravan 2wd | 2008 | 4 | 16 | 23 |
ford | mustang | 2008 | 4 | 17 | 26 |
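The same condition expressed directly as a pandas boolean mask (a sketch of the formula's semantics on a few of the example rows, not the platform's formula engine):
```python
import pandas as pd

df = pd.DataFrame({
    "manufacturer": ["audi", "audi", "ford"],
    "year": [1999, 2008, 2008],
    "cty":  [16, 16, 17],
})
# AND(x > 2000, y >= 15) with x mapped to year and y mapped to cty
filtered = df[(df["year"] > 2000) & (df["cty"] >= 15)]
```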
Reshaping
Pivot a column with a specified aggregation
Unique identifier (API): pivot
Please use the newer version pivot_v2
Pivot a column and perform a specified aggregation on one or more numeric columns, grouped by other selected columns
Parameters | Parameter Description | Default | Parameter Type | Parameter Min | Parameter Max | Parameter Step |
---|---|---|---|---|---|---|
aggregation_function | Aggregation function applied to input column(s) | mean | single-select | |||
median_relative_error | Relative error for approximating median. Must be no less than 0.001 for computational efficiency. | 0.001 | range | 0.001 | 1.0 | 0.001 |
count_distinct_maximum_error | Maximum estimation error allowed for approximating the number of distinct values. Must be no less than 0.01 for computational efficiency. | 0.01 | range | 0.01 | 1.0 | 0.01 |
pivot_values | List of values that are converted to columns in the output | [] | string-list |
Examples:
For example, given the following dataset
animal | size | qty |
---|---|---|
cat | small | 95 |
cat | medium | 67 |
cat | large | 39 |
cat | huge | 45 |
cat | small | 50 |
cat | medium | 18 |
cat | large | 37 |
cat | huge | 22 |
cat | small | 12 |
cat | medium | 45 |
dog | large | 84 |
dog | huge | 48 |
dog | small | 71 |
dog | medium | 95 |
dog | large | 87 |
dog | huge | 35 |
dog | small | 7 |
dog | medium | 25 |
dog | large | 42 |
dog | huge | 45 |
with input pivot column size
, values column qty
and aggregation function sum
, grouped by column animal
, the result is
animal | __group_qty_agg_huge | __group_qty_agg_large | __group_qty_agg_medium | __group_qty_agg_small |
---|---|---|---|---|
dog | 128 | 213 | 120 | 78 |
cat | 67 | 76 | 130 | 157 |
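The grouped pivot above maps closely onto pandas pivot_table; output column names will differ from the __group_qty_agg_* convention:
```python
import pandas as pd

df = pd.DataFrame({
    "animal": ["cat", "cat", "dog", "dog"],
    "size":   ["small", "small", "large", "large"],
    "qty":    [95, 50, 84, 87],
})
pivoted = df.pivot_table(index="animal", columns="size",
                         values="qty", aggfunc="sum")
```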
Unpack values to multiple rows
Unique identifier (API): unpack
Please use the newer version unpack_v2
Reshape a dataset by unpacking one or more column group(s) and create a position indication column
Text
Extract capture groups
Unique identifier (API): extract_capture_groups
Please use the newer versions of extract_capture_groups
Extracts one or more regex capture groups into a column group
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
pattern | A regular expression pattern containing one or more capture groups | string |
Examples:
For pattern == '([aeiou].)([aeiou].)', applied over both input columns ([col_1, col_2]), given the following input table:
col_1 | col_2 |
---|---|
New_York | bc |
Lisbon | df |
Tokyo | gh |
Paris | jk |
null | lm |
Munich | np |
The expected output is:
col_1 | col_2 | __group_grp_a_0 | __group_grp_a_1 | __group_grp_b_0 | __group_grp_b_1 |
---|---|---|---|---|---|
New_York | bc | null | null | null | null |
Lisbon | df | null | null | null | null |
Tokyo | gh | null | null | null | null |
Paris | jk | ar | is | null | null |
null | lm | null | null | null | null |
Munich | np | un | ic | null | null |
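In pandas terms, Series.str.extract with the same pattern produces one output column per capture group, with nulls where the pattern does not match (shown here for col_1 only):
```python
import pandas as pd

col_1 = pd.Series(["New_York", "Lisbon", "Tokyo", "Paris", None, "Munich"])
groups = col_1.str.extract(r"([aeiou].)([aeiou].)")  # two columns, NaN if no match
```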
Find all matches of a regular expression
Unique identifier (API): regex_findall
Please use the new action extract_all_regex_matches
Creates an output column group containing all the strings that match the regex pattern.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
pattern | A regular expression pattern for the matching | string |
Examples:
For pattern == '\d\d'
(match non-overlapping double digits) and the following input table:
col_1 |
---|
null |
Test 54 45 Test |
Dummy sentence |
Just some number 00 |
The output is:
col_1 | __group_col_1_re_findall_0 | __group_col_1_re_findall_1 |
---|---|---|
null | null | null |
Test 54 45 Test | 54 | 45 |
Dummy sentence | null | null |
Just some number 00 | 00 | null |
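A pandas sketch of the same idea using str.findall and padding the per-row match lists into columns; the platform's column-group naming is not reproduced:
```python
import pandas as pd

col_1 = pd.Series([None, "Test 54 45 Test", "Dummy sentence", "Just some number 00"])
matches = col_1.str.findall(r"\d\d")   # non-overlapping double digits per row
expanded = matches.apply(pd.Series)    # one column per match, NaN-padded
```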
Split text column on delimiter
Unique identifier (API): split_on_delimiter
Please use either split_on_delimiter_to_multicol or split_to_array.
For a given column, extracts text components between the delimiters into a column group. The size of the column group corresponds to the largest number of components found in the column or by the limit on the number of splits. Trailing null values are created for records containing fewer components.
Parameters | Parameter Description | Default | Parameter Type | Suggested Values |
---|---|---|---|---|
delimiter | Delimiter to be used for splitting the string. | string | [',', ';', '-', ' | |
delimiter_type | Type of delimiter | Verbatim | single-select | |
limit_num_splits | Boolean flag indicating whether you wish to limit the number of splits or not. | False | boolean |
num_of_splits | An integer which controls the number of times the delimiter regex is applied. | 2 | integer |
Examples:
For delimiter ':' with the number of splits limited, and the following input table:
col_1 |
---|
one:two:three |
no_split |
The expected output is:
col_1 | __group_col_1_split_grp_0 | __group_col_1_split_grp_1 |
---|---|---|
one:two:three | one | two:three |
no_split | no_split | null |
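The example corresponds to a string split on ':' with at most one split applied; in pandas (naming here is illustrative, not the platform's __group_* convention):
```python
import pandas as pd

col_1 = pd.Series(["one:two:three", "no_split"])
parts = col_1.str.split(":", n=1, expand=True)  # "one" | "two:three", "no_split" | None
```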
Text/JSON
Split JSON array column into a column group
Unique identifier (API): split_json_array
Please use unpack_v2
Every row (or cell) of the input column is assumed to be a JSON array stored as a string. Extract each element in the array into a separate column component in a column group.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
path | Path to the JSON array component via (Spark-flavor) JSON path. Defaults to $ (the top-level). | $ | string |
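For the default path $ (the whole value), the behaviour can be sketched by parsing each cell as a JSON array and spreading its elements across columns; the variable names below are assumptions:
```python
import json
import pandas as pd

col = pd.Series(['[1, 2, 3]', '["a", "b"]', None])
parsed = col.dropna().apply(json.loads)                 # each cell -> Python list
expanded = parsed.apply(pd.Series).reindex(col.index)   # null cells stay null
```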
Time Series
Resample data into a regular time series
Unique identifier (API): resample_time_series
Please use the newer version resample_time_series_v2
Resample data with a time-index column into a regular time series, with aggregation for all columns except the nominated time index. Null-value interpolation is applied.
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
sampling_interval | Length of intervals between successive records | integer | |
time_unit | Time unit for intervals between successive records | day | single-select |
aggregation_function | Aggregation function to be applied | last | single-select |
Examples:
For example, given the following dataset
receipt_time | qty | sku | expected_receipt_time |
---|---|---|---|
null | 30 | A231 | 2012-06-12 02:00:00 |
2012-06-13 00:00:00 | 40 | A351 | 2012-06-15 10:00:00 |
2012-06-13 20:45:20 | 29 | A456 | 2012-06-13 00:30:00 |
2012-06-15 21:40:03 | 50 | A451 | 2012-06-16 01:30:00 |
2012-06-17 05:41:23 | 80 | A223 | 2012-06-16 20:30:00 |
null | 100 | A221 | 2012-06-20 08:30:00 |
2012-06-17 11:10:53 | 70 | A264 | 2012-06-16 20:00:00 |
with time-index column receipt_time
for parameters sampling_interval = 12
, time_unit = ""hour""
and aggregation_function = ""last""
, the result is
receipt_time | qty | sku | expected_receipt_time |
---|---|---|---|
2012-06-13 00:00:00 | 40 | A351 | 2012-06-15 10:00:00 |
2012-06-13 12:00:00 | null | null | null |
2012-06-14 00:00:00 | 29 | A456 | 2012-06-13 00:30:00 |
2012-06-14 12:00:00 | null | null | null |
2012-06-15 00:00:00 | null | null | null |
2012-06-15 12:00:00 | null | null | null |
2012-06-16 00:00:00 | 50 | A451 | 2012-06-16 01:30:00 |
2012-06-16 12:00:00 | null | null | null |
2012-06-17 00:00:00 | null | null | null |
2012-06-17 12:00:00 | 70 | A264 | 2012-06-16 20:00:00 |
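The example roughly corresponds to a pandas resample on the time-index column: rows with a null index are dropped, the last aggregation keeps the final value in each 12-hour bin, and empty bins stay null. Bin boundaries may differ slightly from the platform's:
```python
import pandas as pd

df = pd.DataFrame({
    "receipt_time": pd.to_datetime([None, "2012-06-13 00:00:00",
                                    "2012-06-13 20:45:20", "2012-06-15 21:40:03"]),
    "qty": [30, 40, 29, 50],
    "sku": ["A231", "A351", "A456", "A451"],
})

resampled = (
    df.dropna(subset=["receipt_time"])
      .set_index("receipt_time")
      .sort_index()
      .resample("12h")
      .last()
)
```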
Window Function
Count the number of values over partition
Unique identifier (API): window_aggregate_count
Please use the newer version compute_rolling_window_function
This action counts the number of values in one or more input column(s) partitioned by other selected column(s).
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
value_type | What to count. The default is to count only non-null values. For approximately counting distinct values (see https://en.wikipedia.org/wiki/HyperLogLog), null values are ignored and the allowed estimation error is 0.05 at maximum | non_null | single-select
Examples:
For example, given the following dataset
colour | vehicle | animal | description | qty | time | verified |
---|---|---|---|---|---|---|
blue | car | cat | There is another story. | 0.9371423365841736 | null | true |
blue | van | null | null | 0.504013991387411 | 2019-01-02 00:00:00 | true |
yellow | car | cat | There is another story. | 0.7664983791343694 | 2019-01-03 00:00:00 | false |
yellow | van | null | null | 0.391325429830197 | 2019-01-04 00:00:00 | true |
yellow | car | cat | There is another story. | null | 2019-01-05 00:00:00 | true |
blue | car | null | There is another story. | null | 2019-01-06 00:00:00 | true |
yellow | car | dog | There is another story. | null | null | true |
blue | van | dog | There is another story. | 0.8223637606171873 | null | false |
blue | van | cat | There is another story. | 0.9805952019930625 | 2019-01-09 00:00:00 | false |
yellow | van | dog | null | null | 2019-01-10 00:00:00 | true |
blue | van | null | null | null | 2019-01-11 00:00:00 | true |
blue | car | null | There is another story. | 0.010846767436331839 | 2019-01-12 00:00:00 | false |
blue | car | cat | There is a story. | 0.30042645923211464 | 2019-01-13 00:00:00 | true |
yellow | van | dog | There is a story. | 0.7759473050527014 | 2019-01-14 00:00:00 | true |
yellow | van | dog | There is another story. | 0.5250835307237919 | 2019-01-15 00:00:00 | true |
yellow | car | dog | There is another story. | 0.4708643894438649 | null | true |
blue | van | dog | null | 0.8425979127421865 | null | false |
yellow | van | dog | There is another story. | 0.9901884866319394 | 2019-01-18 00:00:00 | true |
blue | car | dog | There is a story. | 0.6662684130667034 | 2019-01-19 00:00:00 | true |
blue | car | cat | There is another story. | null | 2019-01-20 00:00:00 | true |
with input columns animal
, description
, qty
, time
, verified
partitioned by colour
and vehicle
for option non_null
, the result is
colour | vehicle | animal | description | qty | time | verified | animal_count | description_count | qty_count | time_count | verified_count |
---|---|---|---|---|---|---|---|---|---|---|---|
blue | car | cat | There is another story. | 0.9371423365841736 | null | true | 4 | 6 | 4 | 5 | 6 |
blue | van | null | null | 0.504013991387411 | 2019-01-02 00:00:00 | true | 3 | 2 | 4 | 3 | 5 |
yellow | car | cat | There is another story. | 0.7664983791343694 | 2019-01-03 00:00:00 | false | 4 | 4 | 2 | 2 | 4 |
yellow | van | null | null | 0.391325429830197 | 2019-01-04 00:00:00 | true | 4 | 3 | 4 | 5 | 5 |
yellow | car | cat | There is another story. | null | 2019-01-05 00:00:00 | true | 4 | 4 | 2 | 2 | 4 |
blue | car | null | There is another story. | null | 2019-01-06 00:00:00 | true | 4 | 6 | 4 | 5 | 6 |
yellow | car | dog | There is another story. | null | null | true | 4 | 4 | 2 | 2 | 4 |
blue | van | dog | There is another story. | 0.8223637606171873 | null | false | 3 | 2 | 4 | 3 | 5 |
blue | van | cat | There is another story. | 0.9805952019930625 | 2019-01-09 00:00:00 | false | 3 | 2 | 4 | 3 | 5 |
yellow | van | dog | null | null | 2019-01-10 00:00:00 | true | 4 | 3 | 4 | 5 | 5 |
blue | van | null | null | null | 2019-01-11 00:00:00 | true | 3 | 2 | 4 | 3 | 5 |
blue | car | null | There is another story. | 0.010846767436331839 | 2019-01-12 00:00:00 | false | 4 | 6 | 4 | 5 | 6 |
blue | car | cat | There is a story. | 0.30042645923211464 | 2019-01-13 00:00:00 | true | 4 | 6 | 4 | 5 | 6 |
yellow | van | dog | There is a story. | 0.7759473050527014 | 2019-01-14 00:00:00 | true | 4 | 3 | 4 | 5 | 5 |
yellow | van | dog | There is another story. | 0.5250835307237919 | 2019-01-15 00:00:00 | true | 4 | 3 | 4 | 5 | 5 |
yellow | car | dog | There is another story. | 0.4708643894438649 | null | true | 4 | 4 | 2 | 2 | 4 |
blue | van | dog | null | 0.8425979127421865 | null | false | 3 | 2 | 4 | 3 | 5 |
yellow | van | dog | There is another story. | 0.9901884866319394 | 2019-01-18 00:00:00 | true | 4 | 3 | 4 | 5 | 5 |
blue | car | dog | There is a story. | 0.6662684130667034 | 2019-01-19 00:00:00 | true | 4 | 6 | 4 | 5 | 6 |
blue | car | cat | There is another story. | null | 2019-01-20 00:00:00 | true | 4 | 6 | 4 | 5 | 6 |
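These partition-wide counts behave like a grouped transform in pandas: the non-null count is computed once per (colour, vehicle) partition and broadcast back to every row. A minimal sketch on a few made-up rows:
```python
import pandas as pd

df = pd.DataFrame({
    "colour":  ["blue", "blue", "yellow"],
    "vehicle": ["car", "car", "van"],
    "animal":  ["cat", None, "dog"],
    "qty":     [0.93, None, 0.39],
})
counts = df.groupby(["colour", "vehicle"])[["animal", "qty"]].transform("count")
result = df.join(counts.add_suffix("_count"))
```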
Get DateTime value by partition
Unique identifier (API): window_aggregate_datetime
Please use the newer version compute_rolling_window_function
This action applies a window aggregation function on one or more DateTime column(s) partitioned by other selected column(s).
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
agg_function | Aggregation function applied to input column(s) | single-select |
Examples:
For example, given the following dataset
animal | colour | time_born | time_adopted |
---|---|---|---|
cat | blue | 2019-01-01 00:00:00 | null |
cat | yellow | 2019-01-02 00:00:00 | 2019-01-01 00:00:00 |
dog | yellow | 2019-01-03 00:00:00 | 2019-01-01 00:00:00 |
cat | yellow | 2019-01-04 00:00:00 | null |
cat | yellow | 2019-01-05 00:00:00 | null |
dog | blue | 2019-01-06 00:00:00 | 2019-01-02 00:00:00 |
cat | blue | 2019-01-07 00:00:00 | 2019-01-02 00:00:00 |
cat | blue | null | 2019-01-02 00:00:00 |
dog | yellow | 2019-01-09 00:00:00 | 2019-01-02 00:00:00 |
cat | yellow | 2019-01-10 00:00:00 | 2019-01-02 00:00:00 |
dog | blue | null | null |
cat | blue | 2019-01-12 00:00:00 | 2019-01-03 00:00:00 |
dog | blue | null | 2019-01-03 00:00:00 |
cat | blue | 2019-01-14 00:00:00 | 2019-01-03 00:00:00 |
cat | blue | null | 2019-01-03 00:00:00 |
cat | blue | 2019-01-16 00:00:00 | 2019-01-04 00:00:00 |
cat | blue | 2019-01-17 00:00:00 | 2019-01-04 00:00:00 |
cat | yellow | null | 2019-01-04 00:00:00 |
dog | blue | 2019-01-19 00:00:00 | null |
dog | yellow | 2019-01-20 00:00:00 | 2019-01-04 00:00:00 |
with input columns time_born
and time_adopted
, partitioned by animal
and colour
for option earliest
, the result is
animal | colour | time_born | time_adopted | agg_birth | agg_adoption |
---|---|---|---|---|---|
cat | blue | 2019-01-01 00:00:00 | null | 2019-01-01 00:00:00 | 2019-01-02 00:00:00 |
cat | yellow | 2019-01-02 00:00:00 | 2019-01-01 00:00:00 | 2019-01-02 00:00:00 | 2019-01-01 00:00:00 |
dog | yellow | 2019-01-03 00:00:00 | 2019-01-01 00:00:00 | 2019-01-03 00:00:00 | 2019-01-01 00:00:00 |
cat | yellow | 2019-01-04 00:00:00 | null | 2019-01-02 00:00:00 | 2019-01-01 00:00:00 |
cat | yellow | 2019-01-05 00:00:00 | null | 2019-01-02 00:00:00 | 2019-01-01 00:00:00 |
dog | blue | 2019-01-06 00:00:00 | 2019-01-02 00:00:00 | 2019-01-06 00:00:00 | 2019-01-02 00:00:00 |
cat | blue | 2019-01-07 00:00:00 | 2019-01-02 00:00:00 | 2019-01-01 00:00:00 | 2019-01-02 00:00:00 |
cat | blue | null | 2019-01-02 00:00:00 | 2019-01-01 00:00:00 | 2019-01-02 00:00:00 |
dog | yellow | 2019-01-09 00:00:00 | 2019-01-02 00:00:00 | 2019-01-03 00:00:00 | 2019-01-01 00:00:00 |
cat | yellow | 2019-01-10 00:00:00 | 2019-01-02 00:00:00 | 2019-01-02 00:00:00 | 2019-01-01 00:00:00 |
dog | blue | null | null | 2019-01-06 00:00:00 | 2019-01-02 00:00:00 |
cat | blue | 2019-01-12 00:00:00 | 2019-01-03 00:00:00 | 2019-01-01 00:00:00 | 2019-01-02 00:00:00 |
dog | blue | null | 2019-01-03 00:00:00 | 2019-01-06 00:00:00 | 2019-01-02 00:00:00 |
cat | blue | 2019-01-14 00:00:00 | 2019-01-03 00:00:00 | 2019-01-01 00:00:00 | 2019-01-02 00:00:00 |
cat | blue | null | 2019-01-03 00:00:00 | 2019-01-01 00:00:00 | 2019-01-02 00:00:00 |
cat | blue | 2019-01-16 00:00:00 | 2019-01-04 00:00:00 | 2019-01-01 00:00:00 | 2019-01-02 00:00:00 |
cat | blue | 2019-01-17 00:00:00 | 2019-01-04 00:00:00 | 2019-01-01 00:00:00 | 2019-01-02 00:00:00 |
cat | yellow | null | 2019-01-04 00:00:00 | 2019-01-02 00:00:00 | 2019-01-01 00:00:00 |
dog | blue | 2019-01-19 00:00:00 | null | 2019-01-06 00:00:00 | 2019-01-02 00:00:00 |
dog | yellow | 2019-01-20 00:00:00 | 2019-01-04 00:00:00 | 2019-01-03 00:00:00 | 2019-01-01 00:00:00 |
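The earliest option is a grouped minimum over the DateTime columns, broadcast to every row of the partition; a minimal pandas sketch (output column name assumed):
```python
import pandas as pd

df = pd.DataFrame({
    "animal": ["cat", "cat", "dog"],
    "colour": ["blue", "blue", "yellow"],
    "time_born": pd.to_datetime(["2019-01-07", "2019-01-01", "2019-01-03"]),
})
df["agg_birth"] = df.groupby(["animal", "colour"])["time_born"].transform("min")
```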
Make a boolean indicator over partition
Unique identifier (API): window_aggregate_indicate_boolean
Please use the newer version compute_rolling_window_function
This action applies a window aggregation function as an overall indicator on one or more boolean columns partitioned by other selected column(s).
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
indicator_function | Indicator function applied to input column(s) | single-select |
Examples:
For example, given the following dataset
animal | colour | adopted | registered |
---|---|---|---|
cat | blue | true | true |
dog | yellow | true | false |
cat | yellow | false | true |
cat | blue | false | true |
dog | blue | false | true |
dog | blue | false | false |
cat | yellow | true | true |
dog | yellow | true | true |
dog | yellow | true | true |
dog | blue | true | false |
dog | yellow | false | false |
dog | yellow | false | false |
dog | blue | false | true |
cat | yellow | true | false |
cat | yellow | true | true |
dog | blue | false | false |
dog | yellow | true | true |
dog | blue | false | false |
cat | yellow | false | false |
dog | blue | false | false |
with input columns adopted
and registered
, partitioned by animal
and colour
for option any
, the result is
animal | colour | adopted | registered | agg_adopted | agg_registered |
---|---|---|---|---|---|
cat | blue | true | true | true | true |
dog | yellow | true | false | true | true |
cat | yellow | false | true | true | true |
cat | blue | false | true | true | true |
dog | blue | false | true | true | true |
dog | blue | false | false | true | true |
cat | yellow | true | true | true | true |
dog | yellow | true | true | true | true |
dog | yellow | true | true | true | true |
dog | blue | true | false | true | true |
dog | yellow | false | false | true | true |
dog | yellow | false | false | true | true |
dog | blue | false | true | true | true |
cat | yellow | true | false | true | true |
cat | yellow | true | true | true | true |
dog | blue | false | false | true | true |
dog | yellow | true | true | true | true |
dog | blue | false | false | true | true |
cat | yellow | false | false | true | true |
dog | blue | false | false | true | true |
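Likewise, the any indicator is a grouped reduction over the boolean columns, broadcast back to each row; a sketch:
```python
import pandas as pd

df = pd.DataFrame({
    "animal":  ["cat", "cat", "dog"],
    "colour":  ["blue", "blue", "blue"],
    "adopted": [True, False, False],
})
df["agg_adopted"] = df.groupby(["animal", "colour"])["adopted"].transform("any")
```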
Aggregate numeric column(s) over partition
Unique identifier (API): window_aggregate_numeric
Please use the newer version compute_rolling_window_function
This action applies a window aggregation function on one or more numeric columns partitioned by other selected column(s).
Parameters | Parameter Description | Default | Parameter Type | Parameter Min | Parameter Max | Parameter Step |
---|---|---|---|---|---|---|
agg_function | Aggregation function applied to input column(s) | single-select | ||||
relative_error | Relative error for approximating median or a specific quantile. Must be no less than 0.001 for computational efficiency. | range | 0.001 | 1.0 | 0.001 | |
quantile_to_compute | Specify which quantile to compute for the quantile aggregation function. Must be a number between 0 and 1. | range | 0.0 | 1.0 | 0.05 |
Examples:
For example, given the following dataset
colour | vehicle | qty_1 | qty_2 |
---|---|---|---|
yellow | van | 0.11247070608569998 | 5 |
yellow | van | 0.28328160937677216 | -15 |
blue | car | 0.517063017219678 | 19 |
blue | van | 0.5301237478473354 | -9 |
blue | van | 0.3620932896270347 | 10 |
yellow | car | 0.4225691387726025 | 15 |
yellow | car | 0.290901753183477 | -19 |
blue | van | 0.8861529756930526 | 19 |
yellow | van | 0.0663419211386469 | 12 |
blue | car | 0.5966970598200722 | -4 |
blue | car | 0.420710037504774 | -8 |
yellow | van | 0.6275038840320153 | -14 |
yellow | car | 0.9498072805894898 | -16 |
blue | car | 0.9065231957598833 | -3 |
yellow | van | 0.6876214007342292 | -14 |
blue | van | 0.12799068964691773 | -16 |
yellow | car | 0.8456065368763648 | -20 |
yellow | van | 0.7136156920988985 | 10 |
yellow | van | 0.31253912094683933 | -16 |
yellow | van | 0.8326289665837018 | -20 |
with input columns qty_1
and qty_2
and aggregation function mean
partitioned by colour
and vehicle
, the result is
colour | vehicle | qty_1 | qty_2 | aggregation_1 | aggregation_2 |
---|---|---|---|---|---|
yellow | van | 0.11247070608569998 | 5 | 0.4545004126246004 | -6.5 |
yellow | van | 0.28328160937677216 | -15 | 0.4545004126246004 | -6.5 |
blue | car | 0.517063017219678 | 19 | 0.6102483275761019 | 1.0 |
blue | van | 0.5301237478473354 | -9 | 0.4765901757035851 | 1.0 |
blue | van | 0.3620932896270347 | 10 | 0.4765901757035851 | 1.0 |
yellow | car | 0.4225691387726025 | 15 | 0.6272211773554836 | -10.0 |
yellow | car | 0.290901753183477 | -19 | 0.6272211773554836 | -10.0 |
blue | van | 0.8861529756930526 | 19 | 0.4765901757035851 | 1.0 |
yellow | van | 0.0663419211386469 | 12 | 0.4545004126246004 | -6.5 |
blue | car | 0.5966970598200722 | -4 | 0.6102483275761019 | 1.0 |
blue | car | 0.420710037504774 | -8 | 0.6102483275761019 | 1.0 |
yellow | van | 0.6275038840320153 | -14 | 0.4545004126246004 | -6.5 |
yellow | car | 0.9498072805894898 | -16 | 0.6272211773554836 | -10.0 |
blue | car | 0.9065231957598833 | -3 | 0.6102483275761019 | 1.0 |
yellow | van | 0.6876214007342292 | -14 | 0.4545004126246004 | -6.5 |
blue | van | 0.12799068964691773 | -16 | 0.4765901757035851 | 1.0 |
yellow | car | 0.8456065368763648 | -20 | 0.6272211773554836 | -10.0 |
yellow | van | 0.7136156920988985 | 10 | 0.4545004126246004 | -6.5 |
yellow | van | 0.31253912094683933 | -16 | 0.4545004126246004 | -6.5 |
yellow | van | 0.8326289665837018 | -20 | 0.4545004126246004 | -6.5 |
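And the numeric case with aggregation function mean follows the same grouped-transform pattern (column names taken from the example, the rest assumed):
```python
import pandas as pd

df = pd.DataFrame({
    "colour":  ["yellow", "yellow", "blue"],
    "vehicle": ["van", "van", "car"],
    "qty_1":   [0.11, 0.28, 0.52],
})
df["aggregation_1"] = df.groupby(["colour", "vehicle"])["qty_1"].transform("mean")
```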
Window Function/Rolling
Make a rolling boolean indicator with optional partitioning
Unique identifier (API): window_rolling_boolean
Please use the newer version compute_rolling_window_function
Compute an aggregation as a rolling indicator on one or more boolean columns, optionally partitioned by other selected column(s).
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
indicator_function | Indicator function applied to input column(s) | single-select | |
window_start | Whether to set window start unbounded or a specified position relative to current row, with unbounded by default | unbounded | single-select |
start_position | Position from which window starts, relative to the current row (0). Negative and positive numbers indicate the number of rows that precede and follow the current row respectively. | integer | |
window_end | Whether to set window end unbounded or a specified position relative to current row, with unbounded by default | unbounded | single-select |
end_position | Position from which window ends, relative to the current row (0). Negative and positive numbers indicate the number of rows that precede and follow the current row respectively. | integer |
Examples:
For example, given the following dataset
animal | colour | date | cured | adopted |
---|---|---|---|---|
cat | blue | 2019-01-01 00:00:00 | false | true |
cat | blue | 2019-01-02 00:00:00 | true | false |
dog | yellow | 2019-01-03 00:00:00 | true | true |
cat | yellow | 2019-01-04 00:00:00 | true | false |
cat | yellow | 2019-01-05 00:00:00 | true | false |
cat | blue | 2019-01-06 00:00:00 | false | false |
dog | blue | 2019-01-07 00:00:00 | true | true |
dog | blue | 2019-01-08 00:00:00 | false | false |
dog | yellow | 2019-01-09 00:00:00 | false | false |
dog | yellow | 2019-01-10 00:00:00 | false | false |
cat | blue | 2019-01-11 00:00:00 | false | false |
cat | yellow | 2019-01-12 00:00:00 | false | true |
dog | yellow | 2019-01-13 00:00:00 | true | false |
dog | blue | 2019-01-14 00:00:00 | true | false |
cat | blue | 2019-01-15 00:00:00 | false | false |
dog | yellow | 2019-01-16 00:00:00 | false | false |
dog | yellow | 2019-01-17 00:00:00 | false | true |
cat | blue | 2019-01-18 00:00:00 | false | false |
cat | yellow | 2019-01-19 00:00:00 | false | false |
dog | yellow | 2019-01-20 00:00:00 | false | false |
with input columns cured
and adopted
, partitioned by animal
and colour
and ordered by date
for indicator function all
, rolling window with unbounded start and current row as end, the result is
animal | colour | date | cured | adopted | roll_cured | roll_adopted |
---|---|---|---|---|---|---|
dog | blue | 2019-01-07 00:00:00 | true | true | true | true |
dog | blue | 2019-01-08 00:00:00 | false | false | false | false |
dog | blue | 2019-01-14 00:00:00 | true | false | false | false |
cat | yellow | 2019-01-04 00:00:00 | true | false | true | false |
cat | yellow | 2019-01-05 00:00:00 | true | false | true | false |
cat | yellow | 2019-01-12 00:00:00 | false | true | false | false |
cat | yellow | 2019-01-19 00:00:00 | false | false | false | false |
cat | blue | 2019-01-01 00:00:00 | false | true | false | true |
cat | blue | 2019-01-02 00:00:00 | true | false | false | false |
cat | blue | 2019-01-06 00:00:00 | false | false | false | false |
cat | blue | 2019-01-11 00:00:00 | false | false | false | false |
cat | blue | 2019-01-15 00:00:00 | false | false | false | false |
cat | blue | 2019-01-18 00:00:00 | false | false | false | false |
dog | yellow | 2019-01-03 00:00:00 | true | true | true | true |
dog | yellow | 2019-01-09 00:00:00 | false | false | false | false |
dog | yellow | 2019-01-10 00:00:00 | false | false | false | false |
dog | yellow | 2019-01-13 00:00:00 | true | false | false | false |
dog | yellow | 2019-01-16 00:00:00 | false | false | false | false |
dog | yellow | 2019-01-17 00:00:00 | false | true | false | false |
dog | yellow | 2019-01-20 00:00:00 | false | false | false | false |
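With an unbounded start and the current row as the end, the all indicator is an expanding window within each partition ordered by date: once a False appears, every later row in that partition is False. A pandas sketch using a grouped cumulative minimum:
```python
import pandas as pd

df = pd.DataFrame({
    "animal": ["cat", "cat", "cat"],
    "colour": ["blue", "blue", "blue"],
    "date":   pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-06"]),
    "cured":  [False, True, False],
}).sort_values(["animal", "colour", "date"])

df["roll_cured"] = (
    df["cured"].astype(int)
      .groupby([df["animal"], df["colour"]])
      .cummin()
      .astype(bool)
)
```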
Get a DateTime value over rolling windows, with optional partitioning
Unique identifier (API): window_rolling_datetime
Please use the newer version compute_rolling_window_function
Get an aggregation function on one or more DateTime columns over rolling windows, optionally partitioned by other selected column(s).
Parameters | Parameter Description | Default | Parameter Type |
---|---|---|---|
aggregation_function | Aggregation function applied to input column(s) | single-select | |
window_start | Whether to set window start unbounded or a specified position relative to current row, with unbounded by default | unbounded | single-select |
start_position | Position from which window starts, relative to the current row (0). Negative and positive numbers indicate the number of rows that precede and follow the current row respectively. | integer | |
window_end | Whether to set window end unbounded or a specified position relative to current row, with unbounded by default | unbounded | single-select |
end_position | Position from which window ends, relative to the current row (0). Negative and positive numbers indicate the number of rows that precede and follow the current row respectively. | integer |
Examples:
For example, given the following dataset
animal | time_born | time_adopted |
---|---|---|
cat | 2019-01-01 00:00:00 | 2019-02-01 00:00:00 |
cat | 2019-01-02 00:00:00 | 2019-03-01 00:00:00 |
cat | 2019-01-03 00:00:00 | 2019-03-01 00:00:00 |
cat | 2019-01-04 00:00:00 | 2020-04-01 00:00:00 |
cat | 2019-01-05 00:00:00 | 2019-05-01 00:00:00 |
dog | 2019-01-06 00:00:00 | 2020-06-01 00:00:00 |
dog | 2019-01-07 00:00:00 | 2019-05-01 00:00:00 |
dog | 2019-01-08 00:00:00 | 2020-07-01 00:00:00 |
dog | 2019-01-09 00:00:00 | 2020-06-01 00:00:00 |
dog | 2019-01-10 00:00:00 | 2019-03-01 00:00:00 |
the result is
animal | time_born | time_adopted | agg_time_adopted |
---|---|---|---|
dog | 2019-01-06 00:00:00 | 2020-06-01 00:00:00 | 2020-06-01 00:00:00 |
dog | 2019-01-07 00:00:00 | 2019-05-01 00:00:00 | 2020-06-01 00:00:00 |
dog | 2019-01-08 00:00:00 | 2020-07-01 00:00:00 | 2020-07-01 00:00:00 |
dog | 2019-01-09 00:00:00 | 2020-06-01 00:00:00 | 2020-07-01 00:00:00 |
dog | 2019-01-10 00:00:00 | 2019-03-01 00:00:00 | 2020-07-01 00:00:00 |
cat | 2019-01-01 00:00:00 | 2019-02-01 00:00:00 | 2019-02-01 00:00:00 |
cat | 2019-01-02 00:00:00 | 2019-03-01 00:00:00 | 2019-03-01 00:00:00 |
cat | 2019-01-03 00:00:00 | 2019-03-01 00:00:00 | 2019-03-01 00:00:00 |
cat | 2019-01-04 00:00:00 | 2020-04-01 00:00:00 | 2020-04-01 00:00:00 |
cat | 2019-01-05 00:00:00 | 2019-05-01 00:00:00 | 2020-04-01 00:00:00 |
with input column time_adopted
, partitioned by animal
and ordered by time_born
, for aggregation function latest
. The rolling window starts with unbounded
and ends with the current row. For example, the window of the 4th row in the input dataset
is all the rows above it down to and including the current row:
animal | time_born | time_adopted |
---|---|---|
cat | 2019-01-01 00:00:00 | 2019-02-01 00:00:00 |
cat | 2019-01-02 00:00:00 | 2019-03-01 00:00:00 |
cat | 2019-01-03 00:00:00 | 2019-03-01 00:00:00 |
cat | 2019-01-04 00:00:00 | 2020-04-01 00:00:00 |
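For the latest aggregation over the same unbounded-start, current-row-end window, a grouped cumulative maximum gives an equivalent sketch (a cat-only subset of the example):
```python
import pandas as pd

df = pd.DataFrame({
    "animal": ["cat", "cat", "cat"],
    "time_born":    pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03"]),
    "time_adopted": pd.to_datetime(["2019-02-01", "2019-03-01", "2019-03-01"]),
}).sort_values(["animal", "time_born"])

df["agg_time_adopted"] = df.groupby("animal")["time_adopted"].cummax()
```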
Compute a rolling numeric aggregation, with optional partitioning
Unique identifier (API): window_rolling_numeric
Please use the newer version compute_rolling_window_function
Compute an aggregation function on one or more numeric columns over rolling/expanding/shrinking windows, optionally partitioned by other selected column(s). Boolean columns are accepted as well and are converted to numeric first.
Parameters | Parameter Description | Default | Parameter Type | Parameter Min | Parameter Max | Parameter Step |
---|---|---|---|---|---|---|
agg_function | Aggregation function applied to input column(s) | single-select | ||||
relative_error | Relative error for approximating median or a specific quantile. Must be no less than 0.001 for computational efficiency. | range | 0.001 | 1.0 | 0.001 | |
quantile_to_compute | Specify which quantile to compute for the quantile rolling function. Must be a number between 0 and 1. | range | 0.0 | 1.0 | 0.05 | |
window_start | Whether to set window start unbounded or a specified position relative to current row, with unbounded by default | unbounded | single-select | |||
start_position | Position from which window starts, relative to the current row (0). Negative and positive numbers indicate the number of rows that precede and follow the current row respectively. | integer | ||||
window_end | Whether to set window end unbounded or a specified position relative to current row, with unbounded by default | unbounded | single-select | |||
end_position | Position from which window ends, relative to the current row (0). Negative and positive numbers indicate the number of rows that precede and follow the current row respectively. | integer |
Examples:
For example, given the following dataset
colour | animal | birth | qty |
---|---|---|---|
blue | cat | 2019-01-01 00:00:00 | 0.9459212999951968 |
yellow | dog | 2019-01-02 00:00:00 | 0.923807614788432 |
yellow | dog | 2019-01-03 00:00:00 | 0.8160265494527545 |
blue | cat | 2019-01-04 00:00:00 | 0.21414080632280563 |
yellow | dog | 2019-01-05 00:00:00 | 0.31936422789647023 |
blue | dog | 2019-01-06 00:00:00 | 0.8090282829464444 |
yellow | cat | 2019-01-07 00:00:00 | 0.23359299737078676 |
yellow | cat | 2019-01-08 00:00:00 | 0.9244320791985904 |
blue | cat | 2019-01-09 00:00:00 | 0.6161751606853043 |
blue | cat | 2019-01-10 00:00:00 | 0.26992044210702926 |
yellow | dog | 2019-01-11 00:00:00 | 0.40755657119101496 |
yellow | cat | 2019-01-12 00:00:00 | 0.11657357211440877 |
yellow | cat | 2019-01-13 00:00:00 | 0.8757826191122454 |
yellow | cat | 2019-01-14 00:00:00 | 0.3989306451516289 |
blue | dog | 2019-01-15 00:00:00 | 0.09120177599596369 |
blue | dog | 2019-01-16 00:00:00 | 0.8228651830898742 |
yellow | cat | 2019-01-17 00:00:00 | 0.6855467490320383 |
blue | cat | 2019-01-18 00:00:00 | 0.27373326299975465 |
yellow | cat | 2019-01-19 00:00:00 | 0.10849671654816584 |
blue | cat | 2019-01-20 00:00:00 | 0.27289802805739094 |
the result is
colour | animal | birth | qty | roll_qty |
---|---|---|---|---|
yellow | dog | 2019-01-02 00:00:00 | 0.923807614788432 | 0.923807614788432 |
yellow | dog | 2019-01-03 00:00:00 | 0.8160265494527545 | 1.7398341642411865 |
yellow | dog | 2019-01-05 00:00:00 | 0.31936422789647023 | 2.059198392137657 |
yellow | dog | 2019-01-11 00:00:00 | 0.40755657119101496 | 2.466754963328672 |
blue | cat | 2019-01-01 00:00:00 | 0.9459212999951968 | 0.9459212999951968 |
blue | cat | 2019-01-04 00:00:00 | 0.21414080632280563 | 1.1600621063180023 |
blue | cat | 2019-01-09 00:00:00 | 0.6161751606853043 | 1.7762372670033066 |
blue | cat | 2019-01-10 00:00:00 | 0.26992044210702926 | 2.046157709110336 |
blue | cat | 2019-01-18 00:00:00 | 0.27373326299975465 | 2.31989097211009 |
blue | cat | 2019-01-20 00:00:00 | 0.27289802805739094 | 2.592789000167481 |
blue | dog | 2019-01-06 00:00:00 | 0.8090282829464444 | 0.8090282829464444 |
blue | dog | 2019-01-15 00:00:00 | 0.09120177599596369 | 0.900230058942408 |
blue | dog | 2019-01-16 00:00:00 | 0.8228651830898742 | 1.7230952420322823 |
yellow | cat | 2019-01-07 00:00:00 | 0.23359299737078676 | 0.23359299737078676 |
yellow | cat | 2019-01-08 00:00:00 | 0.9244320791985904 | 1.1580250765693771 |
yellow | cat | 2019-01-12 00:00:00 | 0.11657357211440877 | 1.274598648683786 |
yellow | cat | 2019-01-13 00:00:00 | 0.8757826191122454 | 2.1503812677960314 |
yellow | cat | 2019-01-14 00:00:00 | 0.3989306451516289 | 2.5493119129476605 |
yellow | cat | 2019-01-17 00:00:00 | 0.6855467490320383 | 3.2348586619796986 |
yellow | cat | 2019-01-19 00:00:00 | 0.10849671654816584 | 3.3433553785278645 |
with input column qty
, partitioned by colour
and animal
and ordered by birth
for aggregation function sum
, rolling window with unbounded start and current row as end.
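The running sum in this example is then a grouped cumulative sum after ordering by the birth column; a minimal sketch on a yellow/dog subset:
```python
import pandas as pd

df = pd.DataFrame({
    "colour": ["yellow", "yellow", "yellow"],
    "animal": ["dog", "dog", "dog"],
    "birth":  pd.to_datetime(["2019-01-02", "2019-01-03", "2019-01-05"]),
    "qty":    [0.9238, 0.8160, 0.3194],
}).sort_values(["colour", "animal", "birth"])

df["roll_qty"] = df.groupby(["colour", "animal"])["qty"].cumsum()
```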