Apache Kafka

What it is

Kafka is an open source distributed real-time messaging system originally developed by LinkedIn. A Kafka cluster consists of a set of brokers that process messages from producers.

Kafka deals with publisher-consumer and queue semantics by grouping data into topics.

Pub/Sub Queue

Each message is received by all the subscribers

Each message has to be comsumed by only one consumer

Each subscriber receives all the messages

Each message is consumed by any one of the available consumers

Use cases

  • Event spikes

  • Elasticsearch not reachable

You don’t necessary need Kafka if you can tolerate a relaxed search latency. If your application is emitting more logs than Logstash/Elasticsearch can ingest at real time, logs can be rotated and ships by Filebeat. In that your local filesystem will become the temporary buffer.

Topics

Topics are logical grouping of messages. They provide a way to isolate data from other consumers if necessary. For logging and event driven data, you might group multiple users in one topic based on attributes like data volume and expected search latency.

A good analogy is an expressway — you mostly want the fast lane to be free flowing, so slower vehicles are expected to move to other lanes. Think of Kafka topics as lanes and events as cars!

Topics can be created on the fly when data is first published to a non-existent one.

Logstash

logstash config
kafka {
   zk_connect => "hostname:port"
   topic_id => "apache_logs"
   ...
}

Another reason to use multiple Logstash instances is to add fault tolerance.

Partitions

Topics are divided into one or more partitions. Each contains an ordered set of messages. Partitions allows messages in a topic to be distributed to multiple servers. Each partition must fit entirely on a single server. Partition can be replicated across several servers for fault-tolerance. One server is marked as a leader and control the read and write for the partition. The others are the followers and replicate the data. When a leader fails, one of the followers automatically become the leader (using ZooKeeper). The partition count controls how many logs the topic will be sharded into. The more partitions you have, the more throughput you get when consuming data.

Consumer

Consumers are divided into consumer groups.

Listeners

A listener is a combination of

Host/IP
Port
Protocol