This document is based on the following course:
Apache Kafka Series - Learn Apache Kafka for Beginners v3 https://learning.oreilly.com/course/apache-kafka-series/9781789342604/ by Stéphane Maarek
- able to convert between protocols (TCP, HTTP, FTP, etc.) to match what the subscriber expects
- able to convert between data formats (binary, CSV, JSON, Avro, Protobuf)
- able to handle changes to the data schema (adding or removing fields)
- the publisher doesn't need to know about the subscriber (where to send the data, etc.)
- created by LinkedIn, now open source
- distributed, resilient architecture, fault tolerant
- horizontal scalability
- able to scale to 100s of brokers
- able to scale to millions of messages per second
- high performance: latency can be under 10 ms (near real time)
- messaging
- activity tracking
- gathering metrics
- application logs gathering
- stream processing
- decoupling of system dependencies
- integration with Spark, Flink, Hadoop etc
- micro-service pub/sub
Netflix: applies recommendations in real time while you're watching a TV show
Topic
- a particular stream of data
- like a table in a database
- you can have as many topics as you want
- a topic is identified by its name
- supports any kind of message format (JSON, Avro, Protobuf, etc.)
- the sequence of messages is called a data stream
Topics are split into partitions
Partitions are created for each topic. Each message within a partition gets an ID called an offset. Messages are immutable: once written, they cannot be changed.
Important notes about topics:
- once data is written to a partition, it cannot be changed (immutability)
- messages are retained only for a limited time (1 week by default)
- order is guaranteed only within a partition (not across partitions)
- data is assigned randomly to a partition unless a key is provided
- we can have as many partitions per topic as we want
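To make the offset and partition rules above concrete, here is a minimal in-memory sketch. It is not how Kafka stores data; the `Partition` and `Topic` classes and the topic name `orders` are hypothetical names chosen for illustration. It only shows that each partition is an append-only log and that offsets are counted per partition.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Hypothetical model: an append-only log of messages."""
    messages: list = field(default_factory=list)

    def append(self, value: str) -> int:
        # Messages are only ever appended, never modified in place.
        self.messages.append(value)
        # The offset is simply the message's position in this partition.
        return len(self.messages) - 1


@dataclass
class Topic:
    """Hypothetical model: a named topic holding a fixed set of partitions."""
    name: str
    num_partitions: int
    partitions: list = field(init=False)

    def __post_init__(self):
        self.partitions = [Partition() for _ in range(self.num_partitions)]


topic = Topic(name="orders", num_partitions=3)
first = topic.partitions[0].append("order-1 created")
second = topic.partitions[0].append("order-1 paid")
other = topic.partitions[1].append("order-2 created")
print(first, second, other)  # offsets start at 0 in every partition
```

Because offsets are per partition, offset 0 in partition 0 and offset 0 in partition 1 refer to different messages, which is why ordering holds only within a single partition.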
Producers
- write data to topics
- producers know in advance which partition (and which broker) to write to
- if a Kafka broker fails, producers recover automatically
Message keys
- a producer can choose to send a key with the message (string, number, binary, etc.)
- key == null: data is sent round robin to the partitions
- key != null, all messages for that key will always go to the same partition (hashing)
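The two key rules above can be sketched as follows. This is an illustration under stated assumptions, not Kafka's partitioner: `SketchPartitioner` is a hypothetical class, and `zlib.crc32` is used as a stand-in hash only because it ships with Python (the Java client hashes keys with murmur2, so real partition numbers will differ).

```python
import itertools
import zlib


class SketchPartitioner:
    """Hypothetical partitioner: round robin for null keys, hashing otherwise."""

    def __init__(self, num_partitions: int):
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def partition_for(self, key):
        if key is None:
            # No key: spread messages across partitions in turn.
            return next(self._round_robin)
        # Stand-in hash (crc32, NOT murmur2): same key bytes, same partition.
        return zlib.crc32(key) % self.num_partitions


partitioner = SketchPartitioner(num_partitions=3)
keyed = [partitioner.partition_for(b"user-42") for _ in range(5)]
unkeyed = [partitioner.partition_for(None) for _ in range(4)]
print(keyed)    # five identical partition numbers
print(unkeyed)  # [0, 1, 2, 0]
```

Note that the mapping depends on the number of partitions, so changing the partition count of a topic changes which partition a given key lands on.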
Kafka message anatomy
Hashing method:
targetPartition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions
Consumers
- consumers read data from a topic (pull model)
- consumers automatically know which broker to read from
- if a broker fails, consumers know how to recover
- data is read in order from low to high offset within each partition
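The pull model and in-order reads described above might look like this in miniature. `PartitionLog` and its `fetch` method are hypothetical and stand in for a broker; a real consumer keeps polling indefinitely rather than stopping on an empty batch.

```python
class PartitionLog:
    """Hypothetical stand-in for one partition stored on a broker."""

    def __init__(self, messages):
        self._messages = list(messages)

    def fetch(self, offset, max_messages):
        # Return (offset, message) pairs starting at the requested offset.
        batch = self._messages[offset:offset + max_messages]
        return list(enumerate(batch, start=offset))


log = PartitionLog(["a", "b", "c", "d", "e"])
position = 0   # next offset the consumer wants to read
consumed = []

while True:
    # The consumer asks for data (pull); the log never pushes to it.
    batch = log.fetch(position, max_messages=2)
    if not batch:
        break  # sketch only: stop once the log is drained
    for offset, message in batch:
        consumed.append((offset, message))
    # Continue from the offset after the last message received.
    position = batch[-1][0] + 1

print(consumed)
```

The consumer, not the broker, tracks its position, which is what lets it resume from a known offset after a restart.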
Consumers can apply one of three delivery semantics:
- at least once (messages may be delivered again, so the application logic must be idempotent)
- at most once (messages might be lost)
- exactly once
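One way to see the difference between the first two semantics is the order of "commit the offset" and "process the message". The sketch below simulates a single crash between those two steps; `run_consumer` and its parameters are hypothetical, and the whole thing is a toy model rather than real consumer code.

```python
def run_consumer(messages, commit_first, crash_at):
    """Simulate one consumer that crashes once at offset `crash_at`,
    between its two steps, then restarts from the committed offset."""
    committed = 0
    processed = []
    crashed = False
    while committed < len(messages):
        offset = committed
        crash_now = offset == crash_at and not crashed
        if commit_first:
            committed = offset + 1              # step 1: commit
            if crash_now:
                crashed = True
                continue                        # crash: message never processed
            processed.append(messages[offset])  # step 2: process
        else:
            processed.append(messages[offset])  # step 1: process
            if crash_now:
                crashed = True
                continue                        # crash: offset never committed
            committed = offset + 1              # step 2: commit
    return processed


messages = ["m0", "m1", "m2"]
at_most_once = run_consumer(messages, commit_first=True, crash_at=1)
at_least_once = run_consumer(messages, commit_first=False, crash_at=1)
print(at_most_once)   # ['m0', 'm2']
print(at_least_once)  # ['m0', 'm1', 'm1', 'm2']

# Idempotent handling on the application side: ignoring repeats
# makes the duplicate delivery harmless.
deduplicated = list(dict.fromkeys(at_least_once))
print(deduplicated)   # ['m0', 'm1', 'm2']
```

Committing first loses `m1`; processing first delivers `m1` twice, which is why at-least-once consumers need idempotent processing.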
A Kafka cluster is composed of multiple brokers (roughly, servers). Each broker is identified by its ID and contains some of the topic partitions. After connecting to any broker (called a bootstrap broker), a client is connected to the entire cluster.
Example: Topic A has 3 partitions and Topic B has 2 partitions.
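As a rough illustration of how those five partitions could be spread over a cluster, the sketch below places them round robin on three brokers. The broker IDs 101 to 103, the function `assign_partitions`, and the round-robin placement are all assumptions made for this example; Kafka's actual placement logic is not described in these notes.

```python
from collections import defaultdict


def assign_partitions(topics, broker_ids):
    """Hypothetical placement: hand out partitions to brokers in turn."""
    assignment = defaultdict(list)
    slot = 0
    for topic, num_partitions in topics.items():
        for partition in range(num_partitions):
            broker = broker_ids[slot % len(broker_ids)]
            assignment[broker].append((topic, partition))
            slot += 1
    return dict(assignment)


layout = assign_partitions({"Topic-A": 3, "Topic-B": 2}, [101, 102, 103])
for broker, partitions in layout.items():
    print(broker, partitions)
```

The point is only that no single broker holds all of the data: each one stores a subset of the partitions, which is what lets the cluster scale horizontally.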
Producers can choose one of three acknowledgment levels (the acks setting), trading performance against durability:
- acks=0 (no wait): data may be lost if the leader never receives it, or if the leader fails before a replica has received it
- acks=1 (wait for the leader's ack): data may be lost if the leader fails before a replica has received it
- acks=all (wait for acks from the leader and replicas): data is not lost, but throughput and latency may suffer
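The trade-off above can be checked with a small truth table. This is a toy model, not broker behaviour: `send_outcome` is a hypothetical function, "leader" means the broker that receives the write first, and a message is assumed to survive a leader failure only if it reached the leader and was copied to a replica.

```python
import itertools


def send_outcome(acks, leader_received, replicated):
    """Return (producer_sees_success, message_survives_leader_failure)."""
    survives = leader_received and replicated
    if acks == "0":
        producer_sees_success = True             # never waits for any ack
    elif acks == "1":
        producer_sees_success = leader_received  # waits for the leader only
    elif acks == "all":
        producer_sees_success = survives         # waits for leader and replica
    else:
        raise ValueError(f"unknown acks value: {acks}")
    return producer_sees_success, survives


# "Silent loss": the producer believes the send worked, but the data is gone.
silent_loss_possible = {}
for acks in ("0", "1", "all"):
    outcomes = [
        send_outcome(acks, leader_received, replicated)
        for leader_received, replicated in itertools.product([True, False], repeat=2)
    ]
    silent_loss_possible[acks] = any(ok and not survives for ok, survives in outcomes)

print(silent_loss_possible)  # {'0': True, '1': True, 'all': False}
```

Only the strictest setting rules out silent loss in this model, at the cost of waiting for more acknowledgments on every send.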

