Skip to content

Aggregation functions

AVG

Since: 0.6.0

1
AVG(col1)

Stream, Table

Return the average value for a given column.

COLLECT_LIST

Since: -

1
COLLECT_LIST(col1)

Stream, Table

Return an array containing all the values of col1 from each input row (for the specified grouping and time window, if any). Currently only works for simple types (not Map, Array, or Struct). This version limits the size of the result Array to a maximum of 1000 entries and any values beyond this limit are silently ignored. When using with a window type of session, it can sometimes happen that two session windows get merged together into one when a out-of-order record with a timestamp between the two windows is processed. In this case the 1000 record limit is calculated by first considering all the records from the first window, then the out-of-order record, then the records from the second window in the order they were originally processed.

COLLECT_SET

Since: -

1
COLLECT_SET(col1)

Stream

Return an array containing the distinct values of col1 from each input row (for the specified grouping and time window, if any). Currently only works for simple types (not Map, Array, or Struct). This version limits the size of the result Array to a maximum of 1000 entries and any values beyond this limit are silently ignored. When using with a window type of session, it can sometimes happen that two session windows get merged together into one when a out-of-order record with a timestamp between the two windows is processed. In this case the 1000 record limit is calculated by first considering all the records from the first window, then the out-of-order record, then the records from the second window in the order they were originally processed.

COUNT

Since: -

1
COUNT(col1)
1
COUNT(*)

Stream, Table

Count the number of rows. When col1 is specified, the count returned will be the number of rows where col1 is non-null. When * is specified, the count returned will be the total number of rows.

COUNT_DISTINCT

Since: 0.7.0

1
COUNT_DISTINCT(col1)

Stream, Table

Returns the approximate number of unique values of col1 in a group. The function implementation uses HyperLogLog to estimate cardinalities of 10^9 with a typical standard error of 2%.

EARLIEST_BY_OFFSET

Since: 0.10.0

1
EARLIEST_BY_OFFSET(col1, [ignoreNulls])

Stream

Return the earliest value for the specified column. The earliest value in the partition

has the lowest offset.

The optional ignoreNulls parameter, available since version 0.13.0, controls whether nulls are ignored. The default

is to ignore null values.

Since: 0.13.0

1
EARLIEST_BY_OFFSET(col1, earliestN, [ignoreNulls])

Stream

Return the earliest N values for the specified column as an ARRAY. The earliest values

in the partition have the lowest offsets.

The optional ignoreNulls parameter controls whether nulls are ignored. The default

is to ignore null values.

HISTOGRAM

Since: -

1
HISTOGRAM(col1)

Stream, Table

Return a map containing the distinct String values of col1 mapped to the number of times each one occurs for the given window. This version limits the number of distinct values which can be counted to 1000, beyond which any additional entries are ignored. When using with a window type of session, it can sometimes happen that two session windows get merged together into one when a out-of-order record with a timestamp between the two windows is processed. In this case the 1000 record limit is calculated by first considering all the records from the first window, then the out-of-order record, then the records from the second window in the order they were originally processed.

LATEST_BY_OFFSET

Since: 0.8.0

1
LATEST_BY_OFFSET(col1, [ignoreNulls])

Stream

Return the latest value for the specified column. The latest value in the partition

has the largest offset.

The optional ignoreNulls parameter, available since version 0.13.0, controls whether nulls are ignored. The default

is to ignore null values.

Since: 0.13.0

1
LATEST_BY_OFFSET(col1, latestN, [ignoreNulls])

Stream

Returns the latest N values for the specified column as an ARRAY. The latest values have

the largest offset.

The optional ignoreNulls parameter controls whether nulls are ignored. The default is to ignore

null values.

MAX

Since: -

1
MAX(col1)

Stream

Return the maximum value for a given column and window. Rows that have col1 set to null are ignored.

MIN

Since: -

1
MIN(col1)

Stream

Return the minimum value for a given column and window. Rows that have col1 set to null are ignored.

SUM

Since: -

1
SUM(col1)

Stream, Table

Sums the column values. Rows that have col1 set to null are ignored.

TOPK

Since: -

1
TOPK(col1, k)

Stream

Return the Top K values for the given column and window Rows that have col1 set to null are ignored.

TOPKDISTINCT

Since: -

1
TOPKDISTINCT(col1, k)

Stream

Return the distinct Top K values for the given column and window Rows that have col1 set to null are ignored.


Last update: 2020-12-15