Overview
This tutorial demonstrates how to integrate ksqlDB with an external data source to power a simple ride sharing app. Our external source will be a PostgreSQL database containing relatively static data describing each driver’s vehicle. By combining this human-friendly static data with a continuous stream of computer-friendly driver and rider location events, we derive an enriched output stream that the ride sharing app may use to facilitate a rendezvous in real time.
When to use embedded Connect
ksqlDB natively integrates with Connect either by communicating with an external Connect cluster or by running Connect embedded within the ksqlDB server process. Each mode is best suited to the following environments:
- Embedded - Suitable for development, testing, and simpler production workloads at lower throughputs when there is no need to scale ksqlDB independently of Connect.
- External - Suitable for all production workloads.
Note
The Connect integration mode is a deployment configuration option. The Connect integration interface is identical in both modes, so your CREATE SOURCE CONNECTOR and CREATE SINK CONNECTOR statements are independent of the integration mode.
1. Get ksqlDB
Since ksqlDB runs natively on Apache Kafka®, you need a running Kafka installation that ksqlDB is configured to use. The following docker-compose file runs everything for you via Docker, including ksqlDB running Kafka Connect in embedded mode. Embedded Connect enables you to leverage the power of Connect without having to manage a separate Connect cluster, because ksqlDB manages one for you. This tutorial also uses PostgreSQL as an external datastore to integrate with ksqlDB.
In an empty local working directory, copy and paste the following docker-compose content into a file named docker-compose.yml. You will create and add a number of other files to this directory during this tutorial.
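A compose file along the following lines works for this setup; the image tags, service and container names, and the postgres password are example choices rather than requirements, so adjust them to taste:

```yaml
version: '2'

services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.3.0
    hostname: zookeeper
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  broker:
    image: confluentinc/cp-kafka:7.3.0
    hostname: broker
    depends_on:
      - zookeeper
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://broker:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1

  postgres:
    image: postgres:12
    hostname: postgres
    container_name: postgres
    environment:
      POSTGRES_PASSWORD: password

  ksqldb-server:
    image: confluentinc/ksqldb-server:0.29.0
    hostname: ksqldb-server
    container_name: ksqldb-server
    depends_on:
      - broker
    ports:
      - "8088:8088"
    volumes:
      # Make the JDBC connector downloaded in step 2 visible to embedded Connect.
      - "./confluent-hub-components/:/usr/share/kafka/plugins/"
    environment:
      KSQL_LISTENERS: http://0.0.0.0:8088
      KSQL_BOOTSTRAP_SERVERS: broker:9092
      # Embedded Connect worker settings; the ksqlDB Docker image turns
      # KSQL_CONNECT_* variables into the embedded Connect worker config.
      KSQL_CONNECT_GROUP_ID: "ksql-connect-cluster"
      KSQL_CONNECT_BOOTSTRAP_SERVERS: "broker:9092"
      KSQL_CONNECT_KEY_CONVERTER: "org.apache.kafka.connect.storage.StringConverter"
      KSQL_CONNECT_VALUE_CONVERTER: "org.apache.kafka.connect.json.JsonConverter"
      KSQL_CONNECT_VALUE_CONVERTER_SCHEMAS_ENABLE: "false"
      KSQL_CONNECT_CONFIG_STORAGE_TOPIC: "ksql-connect-configs"
      KSQL_CONNECT_OFFSET_STORAGE_TOPIC: "ksql-connect-offsets"
      KSQL_CONNECT_STATUS_STORAGE_TOPIC: "ksql-connect-statuses"
      KSQL_CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR: "1"
      KSQL_CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR: "1"
      KSQL_CONNECT_STATUS_STORAGE_REPLICATION_FACTOR: "1"
      KSQL_CONNECT_PLUGIN_PATH: "/usr/share/kafka/plugins"

  ksqldb-cli:
    image: confluentinc/ksqldb-cli:0.29.0
    container_name: ksqldb-cli
    depends_on:
      - broker
      - ksqldb-server
    entrypoint: /bin/sh
    tty: true
```

The later steps in this tutorial refer to these assumed names: the postgres and ksqldb-cli containers, the ksqldb-server listener on port 8088, and the ./confluent-hub-components directory mounted as the Connect plugin path.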
2. Get the JDBC connector
The easiest way to download connectors for use in ksqlDB with embedded Connect is via the Confluent Hub Client.
To download the JDBC connector, use the following command, ensuring that the confluent-hub-components directory exists first:
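For example, with the Confluent Hub client installed locally (the flags and the connector coordinate shown here are a sketch; pin a specific kafka-connect-jdbc version if you prefer):

```bash
# Create the plugin directory mounted by the compose file, then fetch the connector into it.
mkdir -p confluent-hub-components
confluent-hub install --no-prompt \
  --component-dir confluent-hub-components/ \
  --worker-configs /dev/null \
  confluentinc/kafka-connect-jdbc:latest
```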
This command downloads the JDBC connector into the ./confluent-hub-components directory.
3. Start ksqlDB and PostgreSQL
In the directory containing the docker-compose.yml file you created in the first step, run the following command to start all services in the correct order.
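```bash
# Start all services in the background; use the newer "docker compose up -d" if you
# run the Compose plugin instead of the standalone docker-compose binary.
docker-compose up -d
```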
4. Connect to PostgreSQL
Run the following command to establish an interactive session with PostgreSQL.
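Assuming the postgres container name and superuser from the compose sketch above:

```bash
docker exec -it postgres psql -U postgres
```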
5. Populate PostgreSQL with vehicle/driver data
In the PostgreSQL session, run the following SQL statements to set up the driver data. You will join this PostgreSQL data with event streams in ksqlDB.
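A sketch of the driver data follows; the column names and sample rows are illustrative assumptions (the later steps only rely on there being a vehicle description, license plate, and rating per driver, keyed by driver_id):

```sql
CREATE TABLE driver_profiles (
  driver_id     integer PRIMARY KEY,
  make          text,
  model         text,
  year          integer,
  license_plate text,
  rating        double precision
);

INSERT INTO driver_profiles (driver_id, make, model, year, license_plate, rating) VALUES
  (0, 'Toyota', 'Prius',   2019, 'KAFKA-1', 5.00),
  (1, 'Kia',    'Sorento', 2017, 'STREAMS', 4.89),
  (2, 'Tesla',  'Model S', 2019, 'CNFLNT',  4.92);
```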
6. Start ksqlDB's interactive CLI
ksqlDB runs as a server that clients connect to in order to issue queries.
Run the following command to connect to the ksqlDB server and start an interactive command-line interface (CLI) session.
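With the compose sketch above, the CLI container can reach the server at http://ksqldb-server:8088:

```bash
docker exec -it ksqldb-cli ksql http://ksqldb-server:8088
```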
7. Create source connector
Make your PostgreSQL data accessible to ksqlDB by creating a source connector. In the ksqlDB CLI, run the following command.
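A sketch of the connector definition follows. The connection settings and the key-related lines are assumptions that match the compose and PostgreSQL sketches above; the jdbc_ prefix and driver_profiles whitelist are the values referenced in the next steps:

```sql
CREATE SOURCE CONNECTOR jdbc_source WITH (
  'connector.class'          = 'io.confluent.connect.jdbc.JdbcSourceConnector',
  'connection.url'           = 'jdbc:postgresql://postgres:5432/postgres',
  'connection.user'          = 'postgres',
  'connection.password'      = 'password',
  'topic.prefix'             = 'jdbc_',
  'table.whitelist'          = 'driver_profiles',
  'mode'                     = 'incrementing',
  'incrementing.column.name' = 'driver_id',
  'numeric.mapping'          = 'best_fit',
  -- Promote driver_id into the record key so the ksqlDB table can declare it as PRIMARY KEY.
  'transforms'                     = 'keyFromValue,extractKey',
  'transforms.keyFromValue.type'   = 'org.apache.kafka.connect.transforms.ValueToKey',
  'transforms.keyFromValue.fields' = 'driver_id',
  'transforms.extractKey.type'     = 'org.apache.kafka.connect.transforms.ExtractField$Key',
  'transforms.extractKey.field'    = 'driver_id',
  'key.converter'                  = 'org.apache.kafka.connect.converters.IntegerConverter'
);
```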
When the source connector is created, it imports any PostgreSQL tables matching the specified table.whitelist. Tables are imported via Kafka topics, with one topic per imported table. Once these topics are created, you can interact with them just like any other Kafka topic used by ksqlDB.
8. View imported topic
In the ksqlDB CLI session, run the following command to verify that the driver_profiles table has been imported as a Kafka topic. Because you specified jdbc_ as the topic prefix, you should see a jdbc_driver_profiles topic in the output.
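```sql
SHOW TOPICS;
```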
9. Create drivers table in ksqlDB
The driver data is now integrated as a Kafka topic, but you need to create a ksqlDB table over this topic to begin referencing it from ksqlDB queries. Streams and tables in ksqlDB essentially associate a schema with a Kafka topic, breaking each message in the topic into strongly typed columns.
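A sketch that mirrors the PostgreSQL columns; it assumes the connector keys each record by driver_id, as configured in the previous step:

```sql
CREATE TABLE driverProfiles (
  driver_id     INTEGER PRIMARY KEY,
  make          STRING,
  model         STRING,
  year          INTEGER,
  license_plate STRING,
  rating        DOUBLE
) WITH (
  kafka_topic  = 'jdbc_driver_profiles',
  value_format = 'JSON'
);
```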
Tables in ksqlDB support update semantics, where each message in the underlying topic represents a row in the table. For messages in the topic with the same key, the latest message associated with a given key represents the latest value for the corresponding row in the table.
Note
When the data is ingested from the database, it is written to the Kafka topic using JSON serialization. Because JSON itself doesn't declare a schema, you need to declare it again when you run CREATE TABLE. In practice, you would normally use Avro or Protobuf, which support schema retention and ensure compatibility between producers and consumers. You then don't have to declare the schema each time you want to use the data in ksqlDB.
10. Create streams for driver locations and rider locations
In this step, you create streams over new topics to encapsulate location pings that are sent every few seconds by drivers’ and riders’ phones. In contrast to tables, ksqlDB streams are append-only collections of events, so they're suitable for a continuous stream of location updates.
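Sketches of the two streams follow; the column names and the backing topic names are assumptions:

```sql
CREATE STREAM driverLocations (
  driver_id INTEGER KEY,
  latitude  DOUBLE,
  longitude DOUBLE,
  speed     DOUBLE
) WITH (
  kafka_topic  = 'driver_locations',
  value_format = 'JSON',
  partitions   = 1
);

CREATE STREAM riderLocations (
  driver_id INTEGER KEY,
  latitude  DOUBLE,
  longitude DOUBLE
) WITH (
  kafka_topic  = 'rider_locations',
  value_format = 'JSON',
  partitions   = 1
);
```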
11. Enrich driverLocations stream by joining with PostgreSQL data
The driverLocations stream has a relatively compact schema, and it doesn’t contain much data that a human would find particularly useful. You can enrich the stream of driver location events by joining them with the human-friendly vehicle information stored in the PostgreSQL database. This enriched data can be presented by the rider’s mobile application, ultimately helping the rider to safely identify the driver’s vehicle.
You can achieve this result easily by joining the driverLocations stream with the driver_profiles table sourced from PostgreSQL.
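A sketch of the enrichment query, assuming the column names used in the earlier sketches:

```sql
CREATE STREAM enrichedDriverLocations AS
  SELECT
    dl.driver_id    AS driver_id,
    dl.latitude     AS latitude,
    dl.longitude    AS longitude,
    dl.speed        AS speed,
    p.make          AS make,
    p.model         AS model,
    p.year          AS year,
    p.license_plate AS license_plate,
    p.rating        AS rating
  FROM driverLocations dl
    JOIN driverProfiles p ON dl.driver_id = p.driver_id
  EMIT CHANGES;
```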
12. Create the rendezvous stream
To put all of this together, create a final stream that the ridesharing app can use to facilitate a driver-rider rendezvous in real time. This stream is defined by a query that joins together rider and driver location updates, resulting in a contextualized output that the app can use to show the rider their driver’s position as the rider waits to be picked up.
The rendezvous stream includes human-friendly information describing the driver’s vehicle for the rider. Also, the rendezvous stream computes (albeit naively) the driver’s estimated time of arrival (ETA) at the rider’s location.
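A sketch of the rendezvous query follows. The naive ETA simply divides the driver-to-rider distance, computed with ksqlDB's GEO_DISTANCE function, by the driver's current speed; the one-minute join window is an arbitrary choice:

```sql
CREATE STREAM rendezvous AS
  SELECT
    e.driver_id     AS driver_id,
    e.license_plate AS license_plate,
    e.make          AS make,
    e.model         AS model,
    e.year          AS year,
    e.rating        AS rating,
    e.latitude      AS vehicle_lat,
    e.longitude     AS vehicle_long,
    GEO_DISTANCE(e.latitude, e.longitude, r.latitude, r.longitude) / e.speed AS eta
  FROM enrichedDriverLocations e
    JOIN riderLocations r WITHIN 1 MINUTE ON e.driver_id = r.driver_id
  EMIT CHANGES;
```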
13. Start two ksqlDB CLI sessions
Run the following command twice to open two separate ksqlDB CLI sessions. If you still have a CLI session open from a previous step, you can reuse that session.
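This is the same command as in step 6, using the container and server names assumed in the compose sketch; run it from two separate terminals:

```bash
docker exec -it ksqldb-cli ksql http://ksqldb-server:8088
```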
14. Run a continuous query
In this step, you run a continuous query over the rendezvous stream.
This may feel a bit unfamiliar, because the query never returns until you terminate it. The query perpetually pushes output rows to the client as events are written to the rendezvous stream. Leave the query running in your CLI session for now. It will begin producing output as soon as events are written into ksqlDB.
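In one of the CLI sessions, a push query such as the following does the job:

```sql
SELECT * FROM rendezvous EMIT CHANGES;
```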
15. Write data to input streams
Your continuous query reads from the rendezvous stream, which takes its input from the enrichedDriverLocations and riderLocations streams. In turn, enrichedDriverLocations takes its input from the driverLocations stream, so you need to write data into driverLocations and riderLocations before rendezvous produces the joined output that the continuous query reads.
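In the other CLI session, insert a few events; the coordinates and speed below are purely illustrative values:

```sql
INSERT INTO driverLocations (driver_id, latitude, longitude, speed) VALUES (0, 37.3965, -122.0818, 23.2);
INSERT INTO riderLocations  (driver_id, latitude, longitude) VALUES (0, 37.4133, -122.1162);
```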
As soon as you start writing rows to the input streams, your continuous query from the previous step starts producing joined output. The rider's location pings are joined with their inbound driver's location pings in real time, providing the rider with driver ETA, rating, and additional information describing the driver's vehicle.
Next steps
This tutorial shows how to run ksqlDB in embedded Connect mode using Docker. It uses the JDBC connector to integrate ksqlDB with PostgreSQL data, but this is just one of many connectors that are available to help you integrate ksqlDB with external systems. Check out Confluent Hub to learn more about all of the various connectors that enable integration with a wide variety of external systems.
You may also want to take a look at our examples to better understand how you can use ksqlDB for your specific workload.