AWS MSK secure Python Kafka client

This post describes how to write a secure Python client for AWS MSK (i.e. Kafka) using TLS for authentication.

Kafka offers various security options, including traffic encryption with TLS, client authentication with either TLS or SASL, and ACLs for authorization. AWS MSK, the Kafka offering of AWS, currently only supports TLS as an authentication mechanism. MSK is also integrated with AWS IAM, although not for controlling access at topic granularity but rather for cluster administration tasks (e.g. describeCluster,...)
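As a rough illustration of what TLS client authentication involves on the Python side, here is a minimal sketch that only assembles the connection settings. It assumes the kafka-python library, whose clients accept the `security_protocol`, `ssl_cafile`, `ssl_certfile` and `ssl_keyfile` keyword arguments; the helper name `tls_client_config`, the broker address and the file names are hypothetical placeholders.

```python
from typing import Any, Dict


def tls_client_config(bootstrap_servers: str, ca_file: str,
                      cert_file: str, key_file: str) -> Dict[str, Any]:
    """Build keyword arguments for a kafka-python client using TLS authentication."""
    return {
        "bootstrap_servers": bootstrap_servers,
        "security_protocol": "SSL",  # encrypt traffic and authenticate over TLS
        "ssl_cafile": ca_file,       # CA bundle used to verify the brokers
        "ssl_certfile": cert_file,   # client certificate presented to the brokers
        "ssl_keyfile": key_file,     # private key matching the client certificate
    }


config = tls_client_config(
    "broker-1.example.com:9094",                    # hypothetical broker address
    "ca.pem", "client-cert.pem", "client-key.pem",  # hypothetical file names
)

# With kafka-python installed and the files in place, the settings could be used as:
#   from kafka import KafkaProducer
#   producer = KafkaProducer(**config)
```

Keeping the settings in a plain dictionary means the same values can be passed to both a producer and a consumer.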

In case you are interested in writing a Java or Scala client instead, have a look at the official MSK documentation ...

One-to-many Kafka Streams KTable join

Kafka Streams is a lightweight data processing library for Kafka. It's built on top of the Kafka consumer/producer APIs and provides higher-level abstractions like streams and tables that can be joined and grouped with some flexibility.

One current limitation is the lack of non-key table joins, i.e. it is impossible to join two tables on anything other than their primary keys.

This post discusses approaches to work around this limitation.

TLDR:

  • For now, use composite keys in a state store and query it with a range scan.
  • Alternatively, wait for (or contribute to) KIP-213.
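The composite-key idea from the first bullet can be sketched independently of Kafka. The snippet below is not Kafka Streams code: it uses a sorted in-memory list as a stand-in for an ordered state store, and the class name `CompositeKeyStore`, its methods and the sample keys are all hypothetical. Each entry is keyed by `(parent_key, child_key)`, so all children of one parent sit next to each other and can be fetched with a single range scan.

```python
from bisect import bisect_left, insort
from typing import Any, Dict, Iterator, List, Tuple

Key = Tuple[str, str]  # (parent_key, child_key)


class CompositeKeyStore:
    """In-memory stand-in for an ordered key-value state store."""

    def __init__(self) -> None:
        self._keys: List[Key] = []         # kept sorted at all times
        self._values: Dict[Key, Any] = {}

    def put(self, parent_key: str, child_key: str, value: Any) -> None:
        key = (parent_key, child_key)
        if key not in self._values:
            insort(self._keys, key)        # insert while preserving order
        self._values[key] = value

    def range_scan(self, parent_key: str) -> Iterator[Tuple[str, Any]]:
        # (parent_key,) sorts just before every (parent_key, child_key) pair.
        start = bisect_left(self._keys, (parent_key,))
        for key in self._keys[start:]:
            if key[0] != parent_key:
                break                      # left the range for this parent
            yield key[1], self._values[key]


store = CompositeKeyStore()
store.put("customer-1", "order-7", 30)
store.put("customer-2", "order-3", 12)
store.put("customer-1", "order-2", 15)

orders = list(store.range_scan("customer-1"))
print(orders)  # [('order-2', 15), ('order-7', 30)]
```

The one-to-many join then amounts to running one such scan per record of the "one" side.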

An example implementation of ...

Generating Realistic Random Datasets with Python

As a data engineer, after you have written your new awesome data processing application, you think it is time to start testing end-to-end and you therefore need some input data.

As a data scientist, you can benefit from data generation since it allows you to experiment with various ways of exploring datasets, algorithms and data visualization techniques, or to validate assumptions about the behaviour of some method against many different datasets of your choosing.

In both cases, a tempting option is just to use real data. One small problem though is that production data is typically hard to obtain, even partially ...

A commented Kafka configuration

Diving into Kafka configuration is a beautiful journey into its features.

In preparation for a production deployment of Kafka 0.11, I gathered a set of comments on what I think are some interesting parameters. All this amazing wisdom is mostly extracted from the few resources mentioned at the end of this post.

A grain of salt...

This is all for information only. I honestly think most of the points below are relevant and correct, though mistakes and omissions are likely present here and there.

You should not apply any of this blindly to your production environment and hope ...

Event time low-latency joins with Kafka Streams

This post attempts to illustrate the difficulty of performing an event-time join between two time series with a stream processing framework. It also describes one solution based on Kafka Streams 0.11.0.0.

An event-time join is an attempt to join two time series while taking into account the timestamps. More precisely, for each event from the first time series, it looks up the latest event from the other that occurred before it. This blog post is based on Kafka Streams, although I found the original idea in this Flink tutorial, where the idea of event-time join is very ...
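To make the definition concrete, here is a minimal batch sketch of the lookup itself. It assumes both series are fully available and sorted by timestamp, which sidesteps exactly the part that is hard in a streaming setting, so it illustrates the semantics rather than the Kafka Streams solution. The function name `event_time_join` and the sample data are hypothetical, and "before" is read strictly, so events with equal timestamps are not matched.

```python
from bisect import bisect_left
from typing import Any, List, Optional, Tuple

Event = Tuple[int, Any]  # (timestamp, value)


def event_time_join(left: List[Event],
                    right: List[Event]) -> List[Tuple[int, Any, Optional[Any]]]:
    """For each left event, attach the latest right value strictly before it.

    Both inputs must already be sorted by timestamp.
    """
    right_ts = [ts for ts, _ in right]
    joined = []
    for ts, value in left:
        # Index of the last right event whose timestamp is < ts, or -1 if none.
        i = bisect_left(right_ts, ts) - 1
        match = right[i][1] if i >= 0 else None
        joined.append((ts, value, match))
    return joined


prices = [(1, "a"), (5, "b"), (9, "c")]
trades = [(0, "x"), (4, "y"), (5, "z"), (10, "w")]

result = event_time_join(trades, prices)
print(result)  # [(0, 'x', None), (4, 'y', 'a'), (5, 'z', 'a'), (10, 'w', 'c')]
```

In a stream processor the same lookup has to cope with events from either side arriving late or out of order, which is where the difficulty lies.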
