Skip to content

CREATE STREAM AS SELECT

Synopsis

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
CREATE [OR REPLACE] STREAM stream_name
  [WITH ( property_name = expression [, ...] )]
  AS SELECT  select_expr [, ...]
  FROM from_stream
  [[ LEFT | FULL | INNER ]
      JOIN [join_table | join_stream]
          [WITHIN [<size> <timeunit> | (<before_size> <timeunit>, <after_size> <timeunit>)] [GRACE PERIOD <grace_size> <timeunit>]]
      ON join_criteria]*
  [ WHERE condition ]
  [PARTITION BY column_name]
  EMIT CHANGES;

Description

Create a new materialized stream view, along with the corresponding Kafka topic, and stream the result of the query into the topic.

The PARTITION BY clause, if supplied, is applied to the source after any JOIN or WHERE clauses, and before the SELECT clause, in much the same way as GROUP BY.

Joins

Joins to streams can use any stream column. If the join criteria is not the key column of the stream, ksqlDB repartitions the data internally.

Important

Kafka guarantees the relative order of any two messages from one source partition only if they are also both in the same partition after the repartition. Otherwise, Kafka is likely to interleave messages. The use case will determine if these ordering guarantees are acceptable.

Joins to tables must use the table's PRIMARY KEY as the join criteria: non-key joins are not supported. For more information, see Join Event Streams with ksqlDB.

See Partition Data to Enable Joins for more information about how to correctly partition your data for joins.

Note

  • Partitioning streams and tables is especially important for stateful or otherwise intensive queries. For more information, see Parallelization.
  • Once a stream is created, you can't change the number of partitions. To change the partition count, you must drop the stream and create it again.

For stream-stream joins, you must specify a WITHIN clause for matching records that both occur within a specified time interval. For valid time units, see Time Units.

The key of the resulting stream is determined by the following rules, in order of priority:

  1. If the query has a PARTITION BY, then the resulting number of key columns will match the number of expressions in the PARTITION BY clause. For each expression:

    1. If the PARTITION BY expression is a single source-column reference, the corresponding key column matches the name, type, and contents of the source column.

    2. If the PARTITION BY expression is a reference to a field within a STRUCT-type column, the corresponding key column matches the name, type, and contents of the STRUCT field.

    3. If the PARTITION BY expression is any other expression, the key column has a system-generated name, unless you provide an alias in the projection, and matches the type and contents of the result of the expression.

  2. If the query has a join. For more information, see Join Synthetic Key Columns.

  3. Otherwise, the key matches the name, unless you provide an alias in the projection, and type of the source stream's key.

The projection must include all columns required in the result, including any key columns.

Serialization

For supported serialization formats, ksqlDB can integrate with Confluent Schema Registry. ksqlDB registers the key and/or value schema(s) of the new stream with Schema Registry automatically. Key and value schemas are registered under the subjects <topic-name>-key and <topic-name>-value, respectively.

Stream properties

The WITH clause for the SELECT result supports the following properties:

Property Description
KAFKA_TOPIC The name of the Kafka topic that backs this stream. If this property is not set, then the name of the stream in upper case will be used as default.
KEY_FORMAT Specifies the serialization format of the message key in the topic. For supported formats, see Serialization Formats. If this property is not set, the format from the left-most input stream/table is used.
VALUE_FORMAT Specifies the serialization format of the message value in the topic. For supported formats, see Serialization Formats. If this property is not set, the format from the left-most input stream/table is used.
FORMAT Specifies the serialization format of both the message key and value in the topic. It is not valid to supply this property alongside either KEY_FORMAT or VALUE_FORMAT. For supported formats, see Serialization Formats.
VALUE_DELIMITER Used when VALUE_FORMAT='DELIMITED'. Supports single character to be a delimiter, defaults to ','. For space and tab delimited values you must use the special values 'SPACE' or 'TAB', not an actual space or tab character.
PARTITIONS The number of partitions in the backing topic. If this property is not set, then the number of partitions of the input stream/table will be used. In join queries, the property values are taken from the left-most stream or table. You can't change the number of partitions on a stream. To change the partition count, you must drop the stream and create it again.
REPLICAS The replication factor for the topic. If this property is not set, then the number of replicas of the input stream or table will be used. In join queries, the property values are taken from the left-most stream or table.
TIMESTAMP Sets a column within this stream's schema to be used as the default source of ROWTIME for any downstream queries. Downstream queries that use time-based operations, such as windowing, will process records in this stream based on the timestamp in this column. The column will be used to set the timestamp on any records emitted to Kafka. Timestamps have a millisecond accuracy. If not supplied, the ROWTIME of the source stream is used.
Note: This doesn't affect the processing of the query that populates this stream. For example, given the following statement:
CREATE STREAM foo WITH (TIMESTAMP='t2') AS
SELECT * FROM bar
WINDOW TUMBLING (size 10 seconds);
EMIT CHANGES;
The window into which each row of bar is placed is determined by bar's ROWTIME, not t2.
TIMESTAMP_FORMAT Used in conjunction with TIMESTAMP. If not set, ksqlDB timestamp column must be of type bigint or timestamp. When set, the TIMESTAMP column must be of type varchar and have a format that can be parsed with the Java DateTimeFormatter. If your timestamp format has characters requiring single quotes, you can escape them with two successive single quotes, '', for example: 'yyyy-MM-dd''T''HH:mm:ssX'. For more information on timestamp formats, see DateTimeFormatter.
WRAP_SINGLE_VALUE Controls how values are serialized where the values schema contains only a single column. This setting controls how the query serializes values with a single-column schema.
If set to true, ksqlDB serializes the column as a named column within a record.
If set to false, ksqlDB serializes the column as an anonymous value.
If not supplied, the system default, defined by ksql.persistence.wrap.single.values, then the format's default is used.
Note: null values have special meaning in ksqlDB. Care should be taken when dealing with single-column schemas where the value can be null. For more information, see Single column (un)wrapping.
Note: Supplying this property for formats that do not support wrapping, for example DELIMITED, or when the value schema has multiple columns, results in an error.

Note

  • To use Avro or Protobuf, you must have Schema Registry enabled and ksql.schema.registry.url must be set in the ksqlDB Server configuration file. See Configure ksqlDB for Avro, Protobuf, and JSON schemas.
  • Avro and Protobuf field names are not case sensitive in ksqlDB. This matches the ksqlDB column name behavior.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
-- Create a view that filters an existing stream:
CREATE STREAM filtered AS
   SELECT 
     a, 
     few,
     columns 
   FROM source_stream
   EMIT CHANGES;

-- Create a view that enriches a stream with a table lookup:
CREATE STREAM enriched AS
   SELECT
      cs.*,
      u.name,
      u.classification,
      u.level
   FROM clickstream cs
      JOIN users u ON u.id = cs.userId
   EMIT CHANGES;

Last update: 2022-01-19