This document is written based on below:

Apache Kafka Series - Learn Apache Kafka for Beginners v3 https://learning.oreilly.com/course/apache-kafka-series/9781789342604/ by Stéphane Maarek

Kafka in 5min

Messaging queue benefits

able to adjust protocols (convert TCP, HTTP, FTP etc) into what the subscriber wants
able to adjust data formats (binary, csv, json, avro, protobuf)
able to adjust data scheme (adding new field/remove new field)
publisher doesn't need to know the subscriber (where should we sent etc)

about Kafka

created by LinkedIn, OSS
distributed, resilient architecture, fault torelant
horizonal scalability
able to scale 100s of brokers
scale millinons of messages per sec
10ms latency

Usage

messaging
activity tracking
gatgher metrics
application logs gathering
stream processing
decoupling of system dependencies
integration with Spark, Flink, Hadoop etc
micro-service pub/sub

Netflix: apply recommmendation in real-time while you're watching TV show

Toics, Partitions, and Offsets

Topic

Particular stream of data
like a table in a database
you can have many topics as you want
tiouc us identifies by its name
any kind of message format
json,avro,protobuf etc
the sequence of messages is called data stream

Topics are split in partitions

Partitions are created for the topic, each message holds id that called offset. each messages are immutable, once it's written, cannot be changed.

Important notes about topic:

once the data is written, it cannot be changes, so called immutability.
retention period of msg: 1week by default
order is guaranteed only within a partition (not across partitions)
data is assigned randomly to a partition unless key is provided
we can have as many partitions per topics as we cant

Producers and Message keys

Producers

who write data to topics
producers know to which partition to write to
in case of kafka broker failures, producers will automatically recover

Message keys

producer can choose to send a key with the message (str, int, binary etc)
key==null, data is setnt round robin to the partitions
key != null, all messages for that key will always go to the same partition (hashing)

kafka messages anatomy

Hashing method:

targetPartitions = Math.abs(Utils.murmur2(keyBytes)) % (numPartitions - 1)

Consumers

consumers read data from a topic - pull model
consumers automatically know which broker to read from
in case of broker failuers, consumers know how to recover
data is read in order from low to high offset within each partitions

Consumer group

for consumer, we can apply one of three semantics:

at least once (in app side, need to impl logic as idempotent)
at most once (message might be lost)
exactly once

Kafka Brokers

Kafka cluster is composed of multiple brokers (=~ servers). each broker is identified with its ID. each broker contain topic partitions. After connecting to any broker (called a bootstrap broker), you will be connected to the entire cluster.

Exampe) Topic A: 3 partitions, TopicB: 2 partitions

Topic replication factor

Producer acknowledements (acks)

Able to choose from three ways, these are the tradeoff of performance vs durability:

no wait: data might loss in case of master cannot receive data / replica cannot receive data when master failure
wait for master ack: data might loss in case of replica cannot receive data when the master failure
wait for all acks (master + replica): data won't loss, but performance might be down

kubosuke/Kafka.md