With the development of microservice architecture, new tools are emerging that make working with microservice applications easier and more streamlined. One of these tools is Apache Kafka — a popular platform and system for stream data processing and real-time messaging. It is used by various companies around the world to build scalable message transmission systems, data analytics, and integration with microservice applications.
As a core service in application architecture, Kafka requires monitoring. Without proper monitoring, the cluster may experience failures that can lead to data loss or information leaks. Today, we will examine in detail how to organize monitoring for Apache Kafka.
Before moving on to the process of organizing monitoring and securing Kafka, let’s break down the program’s architecture.
Kafka is a distributed system consisting of several key components:
Brokers — physical or virtual servers (hosts) that receive, store, and process messages. Each broker is responsible for specific topic partitions.
Topics — logical categories where messages arrive. Topics are divided into partitions for parallel processing.
Producers — data sources or, more simply, clients that send data to topics.
Consumers — clients that read data from topics, often combined in groups for load distribution.
ZooKeeper — used to coordinate brokers; it also stores metadata and configuration. Starting from version 3.3, Kafka can work without ZooKeeper thanks to KRaft (a protocol for storing and managing metadata inside Kafka itself). The key feature of KRaft is that it removes Apache Kafka’s dependence on an external ZooKeeper service.
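For reference, here is a minimal sketch of a single-node KRaft setup, based on the standard commands shipped with the Kafka distribution and its bundled config/kraft/server.properties sample config. The paths assume the /opt/kafka layout used later in this article; the guide itself uses the classic ZooKeeper-based setup.
cd /opt/kafka
# Generate a cluster ID and format the metadata storage for KRaft mode
KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
bin/kafka-storage.sh format -t "$KAFKA_CLUSTER_ID" -c config/kraft/server.properties
# Start the broker, which also acts as its own controller, with no ZooKeeper involved
bin/kafka-server-start.sh config/kraft/server.properties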
Messages in Kafka are key-value pairs written to partitions as logs. Consumers read these messages by tracking their position in the log. This architecture ensures high throughput but makes the system vulnerable to failures if monitoring and security are not given sufficient attention.
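To illustrate, the console tools shipped with Kafka can produce keyed messages and print the key and offset of every record read. A minimal sketch, assuming a broker on localhost:9092 and a hypothetical topic named demo (topic creation is shown later in the article):
# Produce key:value pairs, using ':' as the separator between key and value
bin/kafka-console-producer.sh --topic demo --bootstrap-server localhost:9092 --property parse.key=true --property key.separator=:
# Read from the beginning, printing the key and the offset of each message
bin/kafka-console-consumer.sh --topic demo --from-beginning --bootstrap-server localhost:9092 --property print.key=true --property print.offset=true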
Kafka often plays the role of a central component in the infrastructure of large applications, especially in microservice architectures. For example, it can transmit millions of events per second between multiple systems or databases. Any delay, failure, or data loss can lead to serious consequences, including financial damage. Therefore, it is necessary to build Kafka monitoring that addresses the following tasks:
Performance control. Broker performance degrades when there are delivery delays or when the broker itself is overloaded, and such problems slow down the entire data processing chain.
Data integrity control. With data integrity monitoring, it is possible to minimize problems associated with message loss, duplication, or data corruption.
Scaling planning. Monitoring helps understand when to add brokers (horizontal scaling) or increase server resources (vertical scaling).
Effective monitoring requires tracking metrics at all system levels. Let’s look at the main categories and examples.
Broker Metrics
Incoming and Outgoing Traffic. Shows how much data the broker receives and sends. If the values approach network or disk limits, this is a signal for scaling.
Request Processing Latency. The average time to process requests from clients. Growth in latency may indicate a lack of resources.
Number of Active Connections. An abnormally high number of connections may indicate an attack or incorrect client behavior.
Resource Utilization. CPU, RAM, and disk space usage.
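For a quick host-level look at the last two items (not a replacement for the full monitoring setup built later in this article), connection counts and disk usage can be checked directly on the broker host; port 9092 and /tmp/kafka-logs are the defaults assumed in this guide.
# Count established TCP connections to the broker's default listener port
ss -Htn state established '( sport = :9092 )' | wc -l
# Disk space consumed by Kafka log segments (default log.dirs is /tmp/kafka-logs)
du -sh /tmp/kafka-logs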
Topic and Partition Metrics
Log Size. The total volume of data in a topic. If it grows uncontrollably, the cleanup policy should be reviewed.
Number of Messages. Data arrival rate. Sharp spikes may indicate peak loads.
Offset. The position of the last recorded message and the position up to which consumers have read.
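Once the cluster from the practical part of this article is running, the on-disk size of a topic can also be checked with Kafka's bundled kafka-log-dirs.sh tool; new-topic1 here refers to the test topic created later.
# Show how much disk space each partition of the topic occupies on the broker
bin/kafka-log-dirs.sh --bootstrap-server localhost:9092 --describe --topic-list new-topic1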
Consumer and Producer Metrics
Consumer Lag. The lag of consumers behind producers. For example, if the lag exceeds 10,000 messages, it may mean that consumers cannot keep up with processing.
Producer Request Rate. The frequency of producer requests. A drop in this metric may signal failures on the sender side.
Fetch Latency. The time required by the consumer to fetch data. High values indicate network or broker problems.
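Consumer lag and offsets can be inspected directly with the kafka-consumer-groups.sh tool shipped with Kafka; its output includes CURRENT-OFFSET, LOG-END-OFFSET, and LAG columns per partition. The group test-group below is the one used for the load test at the end of this article.
# Show per-partition offsets and lag for a consumer group
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group test-group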
Let’s break down how to set up Kafka monitoring in practice.
We will need one server or virtual machine with a pre-installed Linux distribution. In this article, we will use Ubuntu 24.04 as an example.
The server must meet the following requirements:
At least 4 GB of RAM. This amount is suitable only for setting up and test usage of Apache Kafka and is not intended for high-resource tasks. For more serious tasks, at least 8 GB of RAM is required.
At least a single-core processor for basic configuration. For real workloads (for example, working with large data volumes, mathematical or scientific calculations), a 4-core processor is recommended.
A public IP address, which can be rented when creating the server in the “Network” section.
The server can be created in the control panel under Cloud Servers. During setup, we recommend choosing a region with minimal ping for fast data transfer. Other parameters can be left unchanged.
The server will launch in a couple of minutes, and you will find its IP address, login, and password in the server’s dashboard.
Let’s start by installing Kafka using these steps:
Update the repository index and install the OpenJDK 11 package needed to run Kafka:
apt update && apt -y install openjdk-11-jdk
Check that Java was successfully installed by displaying its version:
java -version
If a version is returned, Java was successfully installed.
Next, use wget to download the program archive (version 3.9.1):
wget https://downloads.apache.org/kafka/3.9.1/kafka_2.13-3.9.1.tgz
Unpack the downloaded archive with the command:
tar -xvzf kafka_2.13-3.9.1.tgz
A directory named kafka_2.13-3.9.1 will appear. Move it to /opt/kafka:
mv kafka_2.13-3.9.1 /opt/kafka
Next, for convenient Kafka management, create systemd units. Let’s start with ZooKeeper. Using any text editor, create a file zookeeper.service:
nano /etc/systemd/system/zookeeper.service
Use the following content:
[Unit]
Description=Apache Zookeeper service
Requires=network.target remote-fs.target
After=network.target remote-fs.target
[Service]
Type=simple
ExecStart=/opt/kafka/bin/zookeeper-server-start.sh /opt/kafka/config/zookeeper.properties
ExecStop=/opt/kafka/bin/zookeeper-server-stop.sh
Restart=on-abnormal
[Install]
WantedBy=multi-user.target
Save changes and exit the file.
Also create a systemd file for Kafka:
nano /etc/systemd/system/kafka.service
Use this content:
[Unit]
Description=Apache Kafka Service
Requires=zookeeper.service
[Service]
Type=simple
Environment="JAVA_HOME=/usr/lib/jvm/java-1.11.0-openjdk-amd64"
ExecStart=/opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
ExecStop=/opt/kafka/bin/kafka-server-stop.sh
[Install]
WantedBy=multi-user.target
Reload the daemon configuration files with:
systemctl daemon-reload
Start ZooKeeper:
systemctl start zookeeper
Check its status:
systemctl status zookeeper
It should show active (running), indicating ZooKeeper started successfully.
Next, start Kafka:
systemctl start kafka
And also check its status:
systemctl status kafka
It should show active (running), indicating Kafka started successfully.
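Optionally, add both services to autostart so they come back up after a reboot:
systemctl enable zookeeper kafka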
Additionally, create a separate user who will be assigned as the owner of all Kafka-related files and directories:
useradd -r -m -s /bin/false kafka
Set the necessary permissions:
chown -R kafka:kafka /opt/kafka
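If you also want the services to run under this account rather than root, add User=kafka and Group=kafka to the [Service] section of both units, give the kafka user ownership of the data directories, and restart the services. A sketch, assuming the default data directories from the bundled configs:
# Default dataDir and log.dirs from the sample configs; adjust if you changed them
chown -R kafka:kafka /tmp/zookeeper /tmp/kafka-logs
systemctl daemon-reload && systemctl restart zookeeper kafka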
After both services—ZooKeeper and Kafka—have been started, let’s test Kafka’s operation.
All commands below should be run from the /opt/kafka directory:
cd /opt/kafka
Create a new topic called new-topic1:
bin/kafka-topics.sh --create --topic new-topic1 --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
If successful, the terminal will display Created topic new-topic1.
Also list all topics in the current Kafka instance:
bin/kafka-topics.sh --list --bootstrap-server localhost:9092
The topic new-topic1 should be listed.
Next, test the producer. Launch it with:
bin/kafka-console-producer.sh --topic new-topic1 --bootstrap-server localhost:9092
Send a test message:
Hello from kafka!
Without closing the current SSH session, open a new one and go to /opt/kafka:
cd /opt/kafka
Start the consumer:
bin/kafka-console-consumer.sh --topic new-topic1 --from-beginning --bootstrap-server localhost:9092
If everything works correctly, you will see the previously sent message.
Now let’s install Prometheus, which will collect and store the metrics. Create a user named prometheus:
useradd --no-create-home --shell /bin/false prometheus
Create directories for Prometheus configuration files:
mkdir /etc/prometheus
mkdir /var/lib/prometheus
Assign the directory owner:
chown prometheus:prometheus /var/lib/prometheus
Move to the /tmp directory:
cd /tmp/
And download the program archive:
wget https://github.com/prometheus/prometheus/releases/download/v2.53.5/prometheus-2.53.5.linux-amd64.tar.gz
Unpack the downloaded archive:
tar xvfz prometheus-2.53.5.linux-amd64.tar.gz
Go into the extracted directory:
cd prometheus-2.53.5.linux-amd64
Move the console directories, the prometheus.yml config file, and the Prometheus binary, and set ownership:
mv console* /etc/prometheus
mv prometheus.yml /etc/prometheus
mv prometheus /usr/local/bin/
chown -R prometheus:prometheus /etc/prometheus
chown prometheus:prometheus /usr/local/bin/prometheus
Additionally, create a systemd unit for Prometheus:
nano /etc/systemd/system/prometheus.service
Use the following content:
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
--config.file /etc/prometheus/prometheus.yml \
--storage.tsdb.path /var/lib/prometheus/ \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries
[Install]
WantedBy=multi-user.target
By default, Prometheus is only accessible from localhost. Let’s allow access from all addresses by editing the main config:
nano /etc/prometheus/prometheus.yml
At the end of the file, find the targets parameter under static_configs and replace localhost with the external IP address of your server (your IP will differ):
    static_configs:
      - targets: ["166.1.227.100:9090"]
Save and exit.
Start Prometheus, add it to autostart, and check its status:
systemctl start prometheus && systemctl enable prometheus && systemctl status prometheus
If the status shows active (running), Prometheus has started successfully.
Restart the systemd daemon and Prometheus and check the status again:
systemctl daemon-reload && systemctl restart prometheus && systemctl status prometheus
If active (running) is displayed, Prometheus is running successfully.
Now open a browser and go to the server’s IP address on port 9090 (the default Prometheus port). You should see the Prometheus web interface.
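You can also verify from the console that Prometheus is responding; the /-/healthy endpoint is part of the standard Prometheus HTTP API:
# Should print 200 if Prometheus is up
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9090/-/healthy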
Next, install Grafana, which we will use to visualize the metrics. Install the prerequisite packages:
apt-get install -y apt-transport-https software-properties-common wget
Create a directory to store the key:
mkdir -p /etc/apt/keyrings/
Download the Grafana GPG key and save it to that directory:
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | tee /etc/apt/keyrings/grafana.gpg > /dev/null
Add the repository:
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | tee -a /etc/apt/sources.list.d/grafana.list
Update the package index and install Grafana:
apt update && apt -y install grafana
Start the service with the following commands:
systemctl daemon-reload && systemctl enable grafana-server && systemctl start grafana-server
Check Grafana’s status:
systemctl status grafana-server
If it shows active (running), Grafana has started successfully.
Using the server’s IP address and port 3000 (Grafana’s default port), go to the web interface. The initial login and password for the web interface are admin / admin. On first login, the system will prompt you to set a new password for the admin user.
After authentication, the web interface will open.
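As with Prometheus, Grafana can also be checked from the console via its health endpoint:
# Returns a small JSON document with "database": "ok" when Grafana is healthy
curl -s http://localhost:3000/api/health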
JMX Exporter is a utility that collects metrics from Java applications via JMX and exposes them to monitoring systems such as Prometheus. To install JMX Exporter, perform the following steps:
Download the utility from the official repository using wget:
wget https://repo.maven.apache.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.20.0/jmx_prometheus_javaagent-0.20.0.jar
Move the downloaded JAR file to the /opt/kafka/libs directory:
mv jmx_prometheus_javaagent-0.20.0.jar /opt/kafka/libs/
Open the kafka-server-start.sh file for editing:
nano /opt/kafka/bin/kafka-server-start.sh
And add the following line before the final exec command, so that the JMX Exporter java agent is loaded on port 9091 together with its rule file, which we will create shortly (note that Kafka will fail to restart until that file exists):
export KAFKA_OPTS="-javaagent:/opt/kafka/libs/jmx_prometheus_javaagent-0.20.0.jar=9091:/opt/kafka/config/sample_jmx_exporter.yml"
Save the changes and exit the file.
Restart Kafka using the commands:
systemctl daemon-reload && systemctl restart kafka
Let's proceed to configure JMX Exporter.
Go to the /opt/kafka/config directory:
cd /opt/kafka/config
Create the sample_jmx_exporter.yml file:
nano sample_jmx_exporter.yml
And use the following content:
lowercaseOutputName: true

rules:
  # Special cases and very specific rules
  - pattern: kafka.server<type=(.+), name=(.+), clientId=(.+), topic=(.+), partition=(.*)><>Value
    name: kafka_server_$1_$2
    type: GAUGE
    labels:
      clientId: "$3"
      topic: "$4"
      partition: "$5"
  - pattern: kafka.server<type=(.+), name=(.+), clientId=(.+), brokerHost=(.+), brokerPort=(.+)><>Value
    name: kafka_server_$1_$2
    type: GAUGE
    labels:
      clientId: "$3"
      broker: "$4:$5"
  - pattern: kafka.coordinator.(\w+)<type=(.+), name=(.+)><>Value
    name: kafka_coordinator_$1_$2_$3
    type: GAUGE

  # Generic per-second counters with 0-2 key/value pairs
  - pattern: kafka.(\w+)<type=(.+), name=(.+)PerSec\w*, (.+)=(.+), (.+)=(.+)><>Count
    name: kafka_$1_$2_$3_total
    type: COUNTER
    labels:
      "$4": "$5"
      "$6": "$7"
  - pattern: kafka.(\w+)<type=(.+), name=(.+)PerSec\w*, (.+)=(.+)><>Count
    name: kafka_$1_$2_$3_total
    type: COUNTER
    labels:
      "$4": "$5"
  - pattern: kafka.(\w+)<type=(.+), name=(.+)PerSec\w*><>Count
    name: kafka_$1_$2_$3_total
    type: COUNTER

  - pattern: kafka.server<type=(.+), client-id=(.+)><>([a-z-]+)
    name: kafka_server_quota_$3
    type: GAUGE
    labels:
      resource: "$1"
      clientId: "$2"
  - pattern: kafka.server<type=(.+), user=(.+), client-id=(.+)><>([a-z-]+)
    name: kafka_server_quota_$4
    type: GAUGE
    labels:
      resource: "$1"
      user: "$2"
      clientId: "$3"

  # Generic gauges with 0-2 key/value pairs
  - pattern: kafka.(\w+)<type=(.+), name=(.+), (.+)=(.+), (.+)=(.+)><>Value
    name: kafka_$1_$2_$3
    type: GAUGE
    labels:
      "$4": "$5"
      "$6": "$7"
  - pattern: kafka.(\w+)<type=(.+), name=(.+), (.+)=(.+)><>Value
    name: kafka_$1_$2_$3
    type: GAUGE
    labels:
      "$4": "$5"
  - pattern: kafka.(\w+)<type=(.+), name=(.+)><>Value
    name: kafka_$1_$2_$3
    type: GAUGE

  # Emulate Prometheus 'Summary' metrics for the exported 'Histogram's.
  #
  # Note that these are missing the '_sum' metric!
  - pattern: kafka.(\w+)<type=(.+), name=(.+), (.+)=(.+), (.+)=(.+)><>Count
    name: kafka_$1_$2_$3_count
    type: COUNTER
    labels:
      "$4": "$5"
      "$6": "$7"
  - pattern: kafka.(\w+)<type=(.+), name=(.+), (.+)=(.*), (.+)=(.+)><>(\d+)thPercentile
    name: kafka_$1_$2_$3
    type: GAUGE
    labels:
      "$4": "$5"
      "$6": "$7"
      quantile: "0.$8"
  - pattern: kafka.(\w+)<type=(.+), name=(.+), (.+)=(.+)><>Count
    name: kafka_$1_$2_$3_count
    type: COUNTER
    labels:
      "$4": "$5"
  - pattern: kafka.(\w+)<type=(.+), name=(.+), (.+)=(.*)><>(\d+)thPercentile
    name: kafka_$1_$2_$3
    type: GAUGE
    labels:
      "$4": "$5"
      quantile: "0.$6"
  - pattern: kafka.(\w+)<type=(.+), name=(.+)><>Count
    name: kafka_$1_$2_$3_count
    type: COUNTER
  - pattern: kafka.(\w+)<type=(.+), name=(.+)><>(\d+)thPercentile
    name: kafka_$1_$2_$3
    type: GAUGE
    labels:
      quantile: "0.$4"
Save the changes and exit the file.
Next, open the main Prometheus configuration file prometheus.yml for editing:
nano /etc/prometheus/prometheus.yml
We need to add the Kafka endpoint so that Prometheus can collect data. To do this, add the following block at the very bottom, where 166.1.227.100 is the external IP address of the server (do not forget to change it to your actual external IP address):
  - job_name: 'kafka'
    static_configs:
      - targets: ["166.1.227.100:9091"]
Save the changes and exit the file.
Restart Prometheus and check its status:
systemctl daemon-reload && systemctl restart prometheus && systemctl status prometheus
Next, we need to change how Kafka is started so that the JMX Exporter java agent and its configuration file are passed to the JVM.
Open the Kafka systemd file for editing:
nano /etc/systemd/system/kafka.service
And add the following line to the [Service] block:
Environment="KAFKA_OPTS=-javaagent:/opt/kafka/libs/jmx_prometheus_javaagent-0.20.0.jar=9091:/opt/kafka/config/sample_jmx_exporter.yml"
Save the changes and exit the file.
Restart Kafka and check its status:
systemctl daemon-reload && systemctl restart kafka && systemctl status kafka
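At this point the JMX Exporter agent should be serving metrics on port 9091 of the Kafka host, which can be checked before looking at Prometheus:
# The agent exposes Prometheus-format metrics over HTTP; look for kafka_* series
curl -s http://localhost:9091/metrics | grep '^kafka_' | head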
Go to the Prometheus web interface, then to the Status section, and in the dropdown menu select Targets:
A new kafka target will appear in the list.
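With the target up, broker metrics can already be queried. For example, with the rules from sample_jmx_exporter.yml above, the broker’s incoming byte rate is available under a name derived from those rules (it may differ if you change them) and can be requested through the Prometheus HTTP API:
# Per-second incoming byte rate on the broker over the last 5 minutes
curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=rate(kafka_server_brokertopicmetrics_bytesin_total[5m])'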
The final step is to connect Grafana to Prometheus so that the collected metrics can be visualized as graphs. In Grafana, go to the Connections section on the left panel, open Data sources, click Add data source, and select Prometheus. In the URL field, enter the Prometheus address (http://<server IP>:9090) and click Save & test.
If the connection to Prometheus is successful, a corresponding message will be displayed.
After we have configured monitoring, it is time to add a dashboard for visualization in Grafana.
On the left panel, go to the Dashboards section.
In the opened window, click the New button on the right and in the dropdown menu select New dashboard.
Next, go to the Import dashboard section:
Use dashboard number 11962 to add it to Grafana and click the Load button:
In the opened section, you can set a name for the dashboard. At the bottom, as the data source, select the previously added Prometheus instance:
Click the Import button.
The added dashboard currently does not show any load. Let’s simulate it ourselves.
On the server, go to the /opt/kafka directory:
cd /opt/kafka
Create a new topic named test-load:
bin/kafka-topics.sh --create --topic test-load --bootstrap-server localhost:9092 --partitions 4 --replication-factor 1
Kafka has a built-in tool, kafka-producer-perf-test.sh, which allows you to simulate message sending by a producer. Let’s launch it to create a test load:
bin/kafka-producer-perf-test.sh --topic test-load --num-records 1000000 --record-size 100 --throughput -1 --producer-props bootstrap.servers=localhost:9092
The command above will generate and send 1,000,000 messages.
Also create load on the consumer side by reading 1,000,000 messages:
bin/kafka-consumer-perf-test.sh --topic test-load --messages 1000000 --bootstrap-server localhost:9092 --group test-group
Go back to the Grafana dashboard, and you will see the graphs populate:
Monitoring Apache Kafka is a comprehensive process that requires close attention to detail. It starts with metrics collection, which can be organized using modern tools such as Prometheus and Grafana. Once metric collection is set up, the cluster’s state must be checked regularly for potential problems. Proper monitoring ensures stable operation: Apache Kafka is a powerful tool that reveals its full potential only when correctly configured and operated.