In the world of event streaming, Kafka plays an important role. We live in a fast-growing technological era, and analysing our data is essential to deliver a better user experience and build a more customer-friendly business. In this article I will cover the real-world Kafka details that help SysAdmins/SREs in their day-to-day operations.
In this post I have also added the basics of Apache Kafka. If you want to go to the cheat sheet directly, click here: https://www.crybit.com/kafka-cheat-sheet/#Jump_to_cheat_sheet
What is event streaming?
Technically, event streaming means capturing data from different sources, such as databases, sensors, mobile devices, cloud services, and software applications, in the form of streams of events, and delivering those streams to whatever destination technologies need them. We need some mechanism to manage this event streaming. In this post we are discussing the Kafka technology for event/data streaming.
Quote from official site: Event streaming is the digital equivalent of the human body’s central nervous system. It is the technological foundation for the ‘always-on’ world where businesses are increasingly software-defined and automated, and where the user of software is more software.
Read more at https://kafka.apache.org/intro
Some examples of event streaming in real life
- To process payments and financial transactions in real-time, such as in stock exchanges, banks, and insurances.
- To track and monitor cars, trucks, fleets, and shipments in real-time, such as in logistics and the automotive industry.
- To continuously capture and analyze sensor data from IoT devices or other equipment, such as in factories and wind parks.
- To collect and immediately react to customer interactions and orders, such as in retail, the hotel and travel industry, and mobile applications.
- To monitor patients in hospital care and predict changes in condition to ensure timely treatment in emergencies.
- To connect, store, and make available data produced by different divisions of a company.
- To serve as the foundation for data platforms, event-driven architectures, and microservices.
What is Kafka?
In event streaming terminology, Kafka is a well-matured solution. Kafka acts as the nervous system that streams your data for different use cases. It can be deployed on bare-metal hardware, virtual machines, and containers, in on-premise as well as cloud environments. Kafka is a distributed, highly scalable, elastic, fault-tolerant, and secure solution for data streaming.
Key capabilities of Kafka
Kafka combines three core capabilities to cover these use cases:
- To publish (write) and subscribe to (read) streams of events, including continuous import/export of your data from other systems.
- To store streams of events durably and reliably for as long as you want.
- To process streams of events.
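The three capabilities above can be pictured with a toy append-only log in plain Python. This is only an illustrative sketch, not a Kafka client; the `MiniLog` class and its event dictionaries are made up for this example.

```python
# Toy sketch of Kafka's three core capabilities: publish events,
# store them durably in order, and process them later.

class MiniLog:
    """Toy append-only log: each event gets an incremental offset."""

    def __init__(self):
        self.events = []

    def publish(self, event):
        offset = len(self.events)
        self.events.append(event)   # stored for as long as we want
        return offset

    def read_from(self, offset):
        # Consumers can re-read from any offset; events are not
        # deleted when they are read.
        return self.events[offset:]

log = MiniLog()
log.publish({"type": "order_created", "id": 1})
log.publish({"type": "order_paid", "id": 1})

# Process the stream from the beginning.
for event in log.read_from(0):
    print(event["type"])
```

The key property mirrored here is that reading is non-destructive: unlike a classic message queue, the same events can be replayed by any number of readers.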
Kafka core APIs
- Producer API
- Consumer API
- Streams API
- Connector API
Read more from here: https://docs.confluent.io/platform/current/kafka/introduction.html
Kafka core components
- A Kafka server or Kafka Node is known as a Kafka broker.
- We can set up a Kafka cluster with one or many brokers.
- In real life, a broker is a third party standing between a buyer and a seller. In Kafka's model the broker plays the same role, but between producer and consumer. A Kafka broker receives messages from producers and stores them on disk, keyed by a unique offset.
- A Kafka broker hosts topics with events in them. Topics can have one or more partitions.
- A Kafka broker allows consumers to fetch messages by topic, partition and offset.
- Kafka brokers can create a Kafka cluster by sharing information between each other directly or indirectly using Zookeeper.
Topic
- A topic is a channel where publishers (producers) publish data and subscribers (consumers) receive it.
- A topic is a stream of a particular type/classification of data.
- Messages are organised into topics; a particular type of message is published to a particular topic.
- Kafka topics are identified by name, and each name is unique within the cluster. The name is our choice.
- You can create as many topics as you want.
- The data retention period (how long data is kept) is configurable. The default is one week.
- Data in a topic is immutable; once written at an offset, it cannot be changed.
- Compared with an RDBMS, topics are similar to tables (but without all the constraints).
Partitions
- Kafka topics are further split into partitions.
- These partitions are replicated across the brokers in a Kafka cluster.
- Data is written to partitions randomly unless a key is provided with the message.
- There is no limit on the number of partitions per topic.
- Messages are stored in sequence within a partition.
- Each message in a partition is assigned an incremental id, called an offset.
- We can define a replication factor to increase availability: partitions can have copies, which increases durability and lets Kafka fail over to a broker holding a replica if the broker with the leader partition fails.
- Read more from: https://www.instaclustr.com/the-power-of-kafka-partitions-how-to-get-the-most-out-of-your-kafka-cluster/
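The partitioning rules above can be sketched in a few lines of plain Python. This is a toy model: Kafka's default partitioner actually uses a murmur2 hash of the key, and `crc32` merely stands in for it here; the topic, key, and partition count are invented for the example.

```python
import zlib

NUM_PARTITIONS = 3
partitions = [[] for _ in range(NUM_PARTITIONS)]  # one append-only log per partition

def pick_partition(key):
    # Kafka hashes the message key modulo the partition count
    # (murmur2 in the real default partitioner; crc32 here).
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

def produce(key, value):
    p = pick_partition(key)
    partitions[p].append(value)           # offsets are just list indices
    return p, len(partitions[p]) - 1      # (partition, offset)

# Messages sharing a key always land in the same partition, in order.
for i in range(3):
    produce("user-42", f"event-{i}")

p = pick_partition("user-42")
print(partitions[p])   # ['event-0', 'event-1', 'event-2']
```

Note how ordering is only guaranteed *within* the chosen partition; two messages with different keys may land on different partitions and be read in any relative order.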
Topic Replication Factor
- We already pointed out the replication factor above. It provides high availability and durability for the data/events stored in a partition.
- A replication factor of 1 means we have only a single copy of our data. If we add replicas, Kafka creates copies of the partition and stores them on other brokers in the cluster. So in case of a broker/node failure, a replica on another node can become the leader, and we do not lose data.
- At any time only one broker can be the leader for a given partition; the other brokers hold in-sync replicas, known as ISRs.
- You can't have a replication factor greater than the number of available brokers.
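Leader failover for a replicated partition can be modelled with a tiny simulation. This is a deliberately simplified sketch (real elections involve the controller and ISR state); the broker names and the `Partition` class are made up for illustration.

```python
# Toy model of one partition with replication factor 3: one leader,
# the rest acting as in-sync replicas (ISR).

class Partition:
    def __init__(self, replicas):
        self.replicas = list(replicas)   # first entry acts as leader

    @property
    def leader(self):
        return self.replicas[0]

    @property
    def isr(self):
        return self.replicas[1:]

    def broker_failed(self, broker):
        # Drop the failed broker; if it was the leader, the next
        # in-sync replica takes over as leader.
        self.replicas.remove(broker)

part = Partition(["broker-1", "broker-2", "broker-3"])
print(part.leader)            # broker-1
part.broker_failed("broker-1")
print(part.leader)            # broker-2, promoted from the ISR
```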
Producers
- Producers are the ones who write data to a topic.
- Producers need to specify the topic name and the details of at least one broker to connect to the cluster. Kafka takes care of the rest, automatically sending the data to the right partition on the right broker.
- Producers can receive an acknowledgement for the data they write. There are three modes:
- acks = 0 [Fastest writes. The producer does not wait for an acknowledgement; it writes and moves on.]
- acks = 1 [The producer waits for an acknowledgement from the leader partition.]
- acks = all [The producer waits for acknowledgements from the leader as well as the replica partitions. Slowest writes.]
- If a producer sends a key along with a message, Kafka guarantees that all messages with the same key will go to the same partition.
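The trade-off between the three acks modes boils down to how many acknowledgements the producer waits for before considering a write complete. The sketch below is not a real client API call, just a toy mapping of mode to required acknowledgements for an assumed replication factor of 3.

```python
# Toy illustration of the acks trade-off: wait for 0 acknowledgements
# (fastest, least safe), 1 (leader only), or all in-sync replicas
# (slowest, most durable).

def required_acks(mode, replication_factor):
    if mode == "0":
        return 0                      # fire and forget
    if mode == "1":
        return 1                      # leader only
    if mode == "all":
        return replication_factor     # leader plus every in-sync replica
    raise ValueError(f"unknown acks mode: {mode}")

for mode in ("0", "1", "all"):
    print(f"acks={mode}: wait for {required_acks(mode, 3)} ack(s)")
```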
Consumers
- Consumers are the ones who read data from topics.
- Consumers read data from partitions in a topic.
- Consumers need to specify the topic name and a broker while connecting. Because of Kafka's distributed architecture, connecting to a single broker is enough: Kafka ensures the consumer is connected to the entire cluster.
- Consumers read data from a partition in order.
- However, order is not guaranteed across partitions; consumers read from multiple partitions in parallel.
- A consumer group can have multiple consumer processes running.
- Each consumer group has one unique group-id.
- Within one consumer group, data from one partition is read by exactly one consumer instance.
- If there is more than one consumer group, one instance from each group can read from the same partition.
- Read more: https://docs.confluent.io/platform/current/clients/consumer.html
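How partitions get divided inside a consumer group can be sketched as a simple assignment function. This is a round-robin toy (real Kafka offers range, round-robin, and sticky assignors negotiated during rebalance); the partition and consumer names are invented for the example.

```python
# Sketch of partition sharing inside a consumer group: each partition
# goes to exactly one consumer within the group, while independent
# groups each see every partition.

def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, part in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(part)
    return assignment

topic_partitions = ["p0", "p1", "p2", "p3"]

# Two independent groups: each group as a whole reads all partitions.
for group in (["app1-c1", "app1-c2"], ["app2-c1"]):
    print(assign(topic_partitions, group))
```

Adding a consumer to a group spreads the partitions thinner; once there are more consumers than partitions, the extras sit idle, which is why partition count caps a group's parallelism.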
Zookeeper
- Zookeeper is a required component in the Kafka ecosystem.
- Zookeeper helps in managing Kafka brokers.
- It also helps in the leader election of partitions.
- It helps maintain cluster membership. For example, when a broker is added or removed, a topic is created or deleted, or a broker goes down or comes up, Zookeeper handles the situation and informs Kafka.
- It also manages topic configuration, such as the number of partitions a topic has and the leader of each partition.
Jump to cheat sheet
If you are new to Kafka, there is no point jumping into the cheat sheet without covering the key components; that's why I added those details above. I hope you now have a basic knowledge of Kafka internals. I suggest going through the official documentation to get a clearer picture. Here we go!!
Confluent service names and default ports
- Zookeeper: confluent-zookeeper.service → 2181
- Kafka: confluent-kafka.service → 9092
- Schema Registry: confluent-schema-registry.service
- Kafka REST: confluent-kafka-rest.service → 8082
- Connect: confluent-kafka-connect.service → 8083
- KSQL Server: confluent-ksql.service → 8088
- Control Center: confluent-control-center.service
Manage all services together
sudo systemctl start confluent-kafka-rest.service confluent-kafka.service confluent-control-center.service confluent-kafka-connect.service confluent-ksql.service confluent-schema-registry.service
sudo systemctl stop confluent-kafka-rest.service confluent-kafka.service confluent-control-center.service confluent-kafka-connect.service confluent-ksql.service confluent-schema-registry.service
sudo systemctl status confluent-kafka-rest.service confluent-kafka.service confluent-control-center.service confluent-kafka-connect.service confluent-ksql.service confluent-schema-registry.service
To restart the Connect service [one service]
sudo systemctl stop confluent-kafka-connect.service
sudo systemctl start confluent-kafka-connect.service
sudo systemctl status confluent-kafka-connect.service
Like above you can manage services separately.
List all configured connectors
sudo curl localhost:8083/connectors
List all installed connector plugins
sudo curl localhost:8083/connector-plugins
Fetch a connector configuration details
sudo curl localhost:8083/connectors/<connector-name> | jq .
Check connector status
sudo curl localhost:8083/connectors/<connector-name>/status
Restart a connector
sudo curl -XPOST localhost:8083/connectors/<connector-name>/restart
Delete a connector
sudo curl -X DELETE http://localhost:8083/connectors/<connector-name>
List tasks of a connector
sudo curl localhost:8083/connectors/<connector-name>/tasks | jq
Restart a task
sudo curl -X POST localhost:8083/connectors/<connector-name>/tasks/<task-id>/restart
Create a topic
sudo kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 3 --topic new-topic
List all topics
sudo kafka-topics --list --zookeeper localhost:2181
List topic with details (describe)
sudo kafka-topics --zookeeper localhost:2181 --describe --topic <topic-name>
This will show the “ReplicationFactor”, “PartitionCount” and more details about a topic.
Describe all topics
sudo kafka-topics --describe --zookeeper localhost:2181
Alter a topic / add more partitions
sudo kafka-topics --zookeeper localhost:2181 --alter --topic <topic-name> --partitions 16
Delete a topic
sudo kafka-topics --zookeeper localhost:2181 --delete --topic <topic-name>
Details about under-replicated partitions
sudo kafka-topics --zookeeper localhost:2181/kafka-cluster --describe --under-replicated-partitions
This is a live document and I will keep updating it.
Let me know if you have more details or any questions.