Skip to content

Synthetic key columns

Some joins have a synthetic key column in their result. This is a column that does not come from any source. Here's an example to help explain what synthetic key columns are and why they are required:

1
2
CREATE TABLE OUTPUT AS
  SELECT * FROM L FULL OUTER JOIN R ON L.ID = R.ID;

The previous statement seems straightforward enough: create a new table that's the result of performing a full outer join of two source tables, joining on their ID columns. You might expect that such a join would result in a table with a compound primary key containing both L.ID and R.ID, and that's what will happen once ksqlDB supports structured keys.

Unfortunately, until ksqlDB supports structured keys, the join must result in a table with a single primary key column. Because the previous is a full-outer join, either L.ID or R.ID may be missing (NULL), or both may have the same value. But, the data produced to Apache Kafka® always has the message key set to the non-null ID column, as shown in the following table:

L.ID R.ID Kafka message key
10 null 10
null 7 7
8 8 8

Clearly, the data stored in the Kafka message's key does not match either of the source ID columns. Instead, it's a new column: a synthetic column, which means a column that doesn't belong to either source table.

What joins result in synthetic key columns?

Any join where the key column in the result does not match any source column is said to have a synthetic key column.

The following types of joins result in a synthetic key column being added to the result schema:

  1. FULL OUTER joins, for example:

    sql CREATE TABLE OUTPUT AS SELECT * FROM L FULL OUTER JOIN R ON L.ID = R.ID;

  2. Any join where all expressions used in the join ON criteria are not simple column references. For example:

    sql -- join on expressions other than column references: CREATE TABLE OUTPUT AS SELECT * FROM L JOIN R ON ABS(L.ID) = ABS(R.ID);

What name is assigned to a Synthetic key column?

The default name of a synthetic key column is ROWKEY. But, if any sources used in the join already contain a column named ROWKEY, the synthetic key column is named ROWKEY_1, or ROWKEY_2 if there exists a source column called ROWKEY_1, etc. For example:

1
2
3
4
5
6
7
8
-- given sources:
CREATE STREAM S1 (ROWKEY INT KEY, V0 STRING) WITH (...);
CREATE TABLE T1 (ID INT KEY, ROWKEY_1 INT) WITH (...);

CREATE STREAM OUTPUT AS
   SELECT * FROM S1 JOIN T1 ON ABS(S1.ROWKEY) = ABS(T1.ID);

-- result in OUTPUT with synthetic key column name: ROWKEY_2

Like any other key column, the synthetic key column must be included in the projection of streaming queries. If you projection is missing the synthetic key, then an error like the one below will be returned, indicating the name of the missing key column:

1
2
3
Key missing from projection. See https://cnfl.io/2LV7ouS.
The query used to build `OUTPUT` must include the join expression ROWKEY in its projection.
ROWKEY was added as a synthetic key column because the join criteria did not match any source column. This expression must be included in the projection and may be aliased. 

Optionally, you may provide an alias for the key column in the projection. This is recommended, as system generated names are not guaranteed to remain consistent between versions. For example:

1
2
CREATE STREAM OUTPUT AS
   SELECT ROWKEY AS ID, S1.C0, S2.C1 FROM S1 FULL OUTER JOIN S2 ON S1.ID = S2.ID;

Will there always be synthetic keys?

No, as stated above, synthetic key columns will no longer be required once ksqlDB supports structured keys.

Suggested Reading


Last update: 2020-09-10