One-to-many Kafka Streams KTable join

Kafka Streams is a lightweight data processing library for Kafka. It's built on top of the Kafka consumer/producer APIs and provides higher-level abstractions like streams and tables that can be joined and grouped with some flexibility.

One current limitation is the lack of non-key table joins, i.e. the impossibility of joining two tables on anything other than their primary key.

This post discusses approaches to work around this limitation.


  • For now, use composite keys in a state store and query it with a range scan.
  • Alternatively, wait for (or contribute to) KIP-213.
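
A minimal sketch of the first workaround, using a sorted in-memory map as a stand-in for a RocksDB-backed Kafka Streams state store (which is also sorted by key); all key and value names here are illustrative:

```scala
import scala.collection.immutable.TreeMap

object CompositeKeyRangeScan {
  // Illustrative composite key: parent key + separator + child key.
  // A real Kafka Streams implementation would use a custom Serde so that
  // the byte-level ordering groups all children of a parent together.
  def compositeKey(parentKey: String, childKey: String): String =
    s"$parentKey|$childKey"

  def main(args: Array[String]): Unit = {
    // Stand-in for a KeyValueStore; RocksDB stores are sorted by key too.
    val store = TreeMap(
      compositeKey("order-1", "item-a") -> "2 apples",
      compositeKey("order-1", "item-b") -> "1 banana",
      compositeKey("order-2", "item-c") -> "3 cherries"
    )

    // Range scan: all entries whose key starts with "order-1|".
    // The upper bound "order-1}" covers the prefix because '}' sorts just after '|'.
    val children = store.range("order-1|", "order-1}")
    children.foreach { case (k, v) => println(s"$k -> $v") }
  }
}
```

The same prefix-based range scan can then be performed against the real store from a processor, yielding all "child" records of a given "parent" key in one pass.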

An example implementation of my …

Continue reading »

A commented Kafka configuration

Diving into Kafka configuration is a beautiful journey into its features.

As a preparation for a production deployment of Kafka 0.11, I gathered a set of comments on what I think are some interesting parameters. All this amazing wisdom is mostly extracted from the few resources mentioned at the end of this post.
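
To give the flavour of such commented parameters, here is a tiny illustrative fragment of a broker configuration (the values are examples only, not recommendations):

```properties
# Durability over availability: every topic gets 3 replicas unless
# overridden at creation time.
default.replication.factor=3
# A producer write with acks=all must reach at least 2 in-sync replicas
# before it is acknowledged.
min.insync.replicas=2
# Never elect an out-of-sync replica as leader; prefer unavailability
# to silent data loss.
unclean.leader.election.enable=false
```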

A grain of salt...

This is all for information only. I honestly think most of the points below are relevant and correct, though mistakes and omissions are likely present here and there.

You should not apply any of this blindly to your production environment and hope …

Continue reading »

Event time low-latency joins with Kafka Streams

This post attempts to illustrate the difficulty of performing an event-time join between two time series with a stream processing framework. It also describes one solution based on Kafka Streams.

An event-time join is an attempt to join two time series while taking into account the timestamps. More precisely, for each event from the first time series, it looks up the latest event from the other that occurred before it. This blog post is based on Kafka Streams, although I found the original idea in this Flink tutorial, where the idea of event-time join is very …
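
A minimal batch sketch of the join semantics just described, with illustrative names and types (the hard part the post addresses is doing this with low latency on unbounded streams, which requires buffering events in state stores; this only pins down what the join should compute):

```scala
object EventTimeJoin {
  case class Event[A](timestamp: Long, value: A)

  // For each event on the left, look up the latest right-hand event that
  // occurred at or before it (None if the right stream hasn't started yet).
  // Quadratic and batch-only: a sketch of the semantics, not of the streaming solution.
  def eventTimeJoin[A, B](left: Seq[Event[A]],
                          right: Seq[Event[B]]): Seq[(Event[A], Option[Event[B]])] = {
    val rightSorted = right.sortBy(_.timestamp)
    left.map { l =>
      val matching = rightSorted.takeWhile(_.timestamp <= l.timestamp).lastOption
      (l, matching)
    }
  }

  def main(args: Array[String]): Unit = {
    val temperatures = Seq(Event(10L, 20.5), Event(30L, 21.0))
    val positions    = Seq(Event(5L, "Paris"), Event(25L, "Lyon"))
    // Each temperature is paired with the latest position known at its timestamp.
    eventTimeJoin(temperatures, positions).foreach(println)
  }
}
```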

Continue reading »

How to integrate Flink with Confluent's schema registry

This post illustrates how to use Confluent's Avro serializer in order to let a Flink program consume and produce Avro messages through Kafka while keeping track of the Avro schemas in Confluent's schema registry. This can be interesting if the messages are pumped into or out of Kafka with Kafka Connect, Kafka Streams, or anything else integrated with the schema registry.

Warning: As of now (Aug 2017), it turns out using Confluent's Avro deserializer as explained below is not ideal when deploying to Flink in standalone mode, because of the way caching is implemented at the Avro level …
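
As background, Confluent's serializer wraps each Avro payload in a small, documented wire format: one magic byte (0x00) followed by the 4-byte big-endian schema registry id, then the raw Avro binary encoding. A self-contained sketch of splitting that header out (function and variable names are illustrative; in practice Confluent's deserializer does this for you):

```scala
import java.nio.ByteBuffer

object ConfluentWireFormat {
  // Splits a Confluent-framed Kafka message into its schema registry id
  // and its raw Avro payload.
  def parse(message: Array[Byte]): (Int, Array[Byte]) = {
    val buffer = ByteBuffer.wrap(message)
    val magic = buffer.get()
    require(magic == 0, s"Unknown magic byte: $magic")
    val schemaId = buffer.getInt()          // id to look up in the schema registry
    val payload = new Array[Byte](buffer.remaining())
    buffer.get(payload)
    (schemaId, payload)                     // payload is the raw Avro binary encoding
  }

  def main(args: Array[String]): Unit = {
    // A fabricated message: magic byte, schema id 42, then two payload bytes.
    val msg = Array[Byte](0, 0, 0, 0, 42, 1, 2)
    val (schemaId, payload) = parse(msg)
    println(s"schema id = $schemaId, payload = ${payload.length} bytes")
  }
}
```

This framing is why a plain Avro decoder cannot read Confluent-serialized messages directly: the 5-byte header must be consumed (and the schema fetched from the registry) first.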

Continue reading »

Sending Avro records from Scala to Azure eventhub over AMQP

This post illustrates how to emit Avro records to Azure EventHub from Scala in such a way that they can be directly parsed by the other services of the Azure platform (e.g. Azure Stream Analytics).

There exists a Java API for communicating with Azure EventHub, documented as part of the Azure documentation and even open-sourced on GitHub (things have changed at Microsoft...). That said, the most detailed documentation still seems to be based on the .NET API as manipulated with Visual Studio on Windows. Me being a Scala developer on a Mac, it took me a …

Continue reading »