Aiven for Apache Kafka® metrics available via Prometheus
The following list contains only the most common metrics available via Prometheus for an Aiven for Apache Kafka® service.
You can retrieve the complete list of available metrics for your specific service by requesting the Prometheus endpoint, substituting:
- the Aiven project certificate (ca.pem)
- the Prometheus credentials (<PROMETHEUS_USER>:<PROMETHEUS_PASSWORD>)
- the Aiven for Apache Kafka hostname (<KAFKA_HOSTNAME>)
- the Prometheus port (<PROMETHEUS_PORT>)
curl --cacert ca.pem \
--user '<PROMETHEUS_USER>:<PROMETHEUS_PASSWORD>' \
'https://<KAFKA_HOSTNAME>:<PROMETHEUS_PORT>/metrics'
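The /metrics response uses the Prometheus text format, so the names of all available metrics can be extracted with standard tools. The following is a minimal sketch, assuming a POSIX shell with grep, sed, and sort; the function name list_metric_names is hypothetical.

```shell
#!/bin/sh
# Print the distinct metric names found in Prometheus text-format input on stdin.
# Comment lines (# HELP, # TYPE) are skipped, then labels and values are stripped.
list_metric_names() {
  grep -v '^#' | sed -e 's/[{ ].*$//' | grep -v '^$' | sort -u
}

# Usage, with the same placeholders as the curl command above:
# curl --cacert ca.pem \
#   --user '<PROMETHEUS_USER>:<PROMETHEUS_PASSWORD>' \
#   'https://<KAFKA_HOSTNAME>:<PROMETHEUS_PORT>/metrics' | list_metric_names
```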
Tip
To learn how to use Prometheus with Aiven, see the dedicated document.
CPU utilization
- cpu_usage_guest: CPU time spent running a virtual CPU for guest operating systems.
- cpu_usage_guest_nice: Time spent running a virtual CPU for low-priority (niced) guest operating systems, which can be interrupted by other processes. This metric is measured in hundredths of a second.
- cpu_usage_idle: Time the CPU spends idle.
- cpu_usage_iowait: Time spent waiting for I/O to complete.
- cpu_usage_irq: Time spent servicing interrupts.
- cpu_usage_nice: Time spent running niced (low-priority) user processes.
- cpu_usage_softirq: Time spent servicing softirqs (software interrupts).
- cpu_usage_steal: Time spent in other operating systems when running in a virtualized environment.
- cpu_usage_system: Time spent running in kernel (system) mode.
- cpu_usage_user: Time spent running user processes.
- system_load1: System load average over the last minute.
- system_load15: System load average over the last 15 minutes.
- system_load5: System load average over the last 5 minutes.
- system_n_cpus: Number of available CPU cores.
- system_n_users: Number of logged-in users.
- system_uptime: Time the system has been up and running.
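Load averages are easier to interpret when divided by the number of cores. A minimal sketch, assuming you have already read system_load1 (or system_load5, system_load15) and system_n_cpus for the same host; the function name load_per_cpu is hypothetical. A value that stays above 1 generally suggests more runnable (or uninterruptibly waiting) tasks than cores.

```shell
#!/bin/sh
# Normalize a load average by the number of CPU cores on the same host.
load_per_cpu() {
  # $1: system_load1, system_load5, or system_load15
  # $2: system_n_cpus
  awk -v load="$1" -v cpus="$2" 'BEGIN {
    if (cpus + 0 <= 0) { print "n/a"; exit }   # avoid dividing by zero
    printf "%.2f\n", load / cpus
  }'
}

# Example: a load of 3.0 on a 4-core host
# load_per_cpu 3.0 4   # prints 0.75
```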
Disk space utilization
- disk_free: Amount of free disk space.
- disk_inodes_free: Number of free inodes.
- disk_inodes_total: Total number of inodes.
- disk_inodes_used: Number of used inodes.
- disk_total: Total disk space.
- disk_used: Amount of used disk space.
- disk_used_percent: Percentage of disk space used.
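disk_used_percent lends itself to a simple threshold check. The sketch below prints every disk_used_percent series above a given percentage from Prometheus text output read on stdin; the function name disk_usage_over and the 80% threshold in the usage line are illustrative assumptions, not Aiven recommendations.

```shell
#!/bin/sh
# Print disk_used_percent series whose value exceeds a threshold (in percent).
# Input: Prometheus text-format output on stdin.
disk_usage_over() {
  awk -v limit="$1" '
    # Return the sample value: the first field after the label set (or the name).
    function value_of(line,    f) {
      if (index(line, "}")) { sub(/^[^}]*[}]/, "", line) } else { sub(/^[^ \t]+/, "", line) }
      split(line, f, " ")
      return f[1] + 0
    }
    /^disk_used_percent[{ ]/ {
      if (value_of($0) > limit + 0) print $0
    }'
}

# Usage: pipe the output of the curl command shown earlier into
#   disk_usage_over 80
```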
Disk input and output
The diskio_* metrics, such as diskio_io_time and diskio_iops_in_progress, describe disk I/O activity: the number of read and write operations, the time spent on them, the bytes read and written, and more.
- diskio_io_time
- diskio_iops_in_progress
- diskio_merged_reads
- diskio_merged_writes
- diskio_read_bytes
- diskio_read_time
- diskio_reads
- diskio_weighted_io_time
- diskio_write_bytes
- diskio_write_time
- diskio_writes
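Two scrapes of the cumulative read counters give an average time per read, similar in spirit to the read latency that iostat reports. A minimal sketch, assuming diskio_reads and diskio_read_time are cumulative counters for the same disk and that diskio_read_time is in milliseconds; check both against your own /metrics output. The function name avg_read_latency_ms is hypothetical, and the same arithmetic applies to diskio_writes and diskio_write_time.

```shell
#!/bin/sh
# Average time per completed read between two scrapes of the same disk.
# ASSUMPTION: both metrics are cumulative counters and diskio_read_time is in ms.
avg_read_latency_ms() {
  # $1 $2: diskio_reads and diskio_read_time from the earlier scrape
  # $3 $4: the same two metrics from the later scrape
  awk -v r1="$1" -v t1="$2" -v r2="$3" -v t2="$4" 'BEGIN {
    reads = (r2 + 0) - (r1 + 0)
    if (reads <= 0) { print "n/a"; exit }   # no reads completed, or counter reset
    printf "%.2f\n", ((t2 + 0) - (t1 + 0)) / reads
  }'
}

# Example: 500 reads took 1000 ms in total between the two scrapes
# avg_read_latency_ms 1000 5000 1500 6000   # prints 2.00
```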
Garbage collector MXBean
The java_lang_GarbageCollector_* metrics describe the JVM’s garbage collection activity, such as the number of collections and the time spent on them.
- java_lang_GarbageCollector_G1_Young_Generation_CollectionCount: Total number of collections that have occurred.
- java_lang_GarbageCollector_G1_Young_Generation_CollectionTime: Approximate accumulated collection elapsed time, in milliseconds.
- java_lang_GarbageCollector_G1_Young_Generation_duration
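Because CollectionTime is a cumulative value in milliseconds, the difference between two scrapes divided by the elapsed wall-clock time shows what share of time the JVM spent in young-generation collections. A minimal sketch; the function name gc_time_percent is hypothetical, and since CollectionTime is described as approximate, treat the result as an estimate.

```shell
#!/bin/sh
# Share of wall-clock time (percent) spent in G1 young-generation collections
# between two scrapes, based on the cumulative CollectionTime value (ms).
gc_time_percent() {
  # $1: CollectionTime (ms) at the earlier scrape
  # $2: CollectionTime (ms) at the later scrape
  # $3: seconds elapsed between the two scrapes
  awk -v c1="$1" -v c2="$2" -v secs="$3" 'BEGIN {
    start = c1 + 0; stop = c2 + 0; window = secs + 0
    if (window <= 0 || stop < start) { print "n/a"; exit }   # bad interval or JVM restart
    printf "%.2f\n", (stop - start) / (window * 1000) * 100
  }'
}

# Example: 450 ms of young GC during a 60 s window
# gc_time_percent 12000 12450 60   # prints 0.75
```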
Memory usage
The java_lang_Memory_* metrics describe the JVM’s memory usage: committed, initial, maximum, and used memory.
- java_lang_Memory_committed: Amount of memory, in bytes, that is committed for the Java virtual machine to use.
- java_lang_Memory_init: Amount of memory, in bytes, that the Java virtual machine initially requests from the operating system for memory management.
- java_lang_Memory_max: Maximum amount of memory, in bytes, that can be used for memory management.
- java_lang_Memory_used: Amount of used memory, in bytes.
- java_lang_Memory_ObjectPendingFinalizationCount: Approximate number of objects for which finalization is pending.
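java_lang_Memory_used and java_lang_Memory_max can be combined into a utilization percentage. A minimal sketch, assuming both values are taken from the same series (for example the same memory-area label, if your output has one). Java's MemoryUsage.getMax() returns -1 when no maximum is defined, so the sketch reports that case instead of dividing by it; the function name memory_used_percent is hypothetical.

```shell
#!/bin/sh
# Percentage of the JVM memory maximum currently in use.
memory_used_percent() {
  # $1: java_lang_Memory_used (bytes)
  # $2: java_lang_Memory_max (bytes); -1 means no maximum is defined
  awk -v used_bytes="$1" -v max_bytes="$2" 'BEGIN {
    if (max_bytes + 0 <= 0) { print "undefined"; exit }
    printf "%.1f\n", used_bytes / max_bytes * 100
  }'
}

# Example: 512 MiB used out of a 1 GiB maximum
# memory_used_percent 536870912 1073741824   # prints 50.0
```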
Apache Kafka Connect
The list of Apache Kafka Connect metrics is available on the dedicated page.
Apache Kafka broker
Descriptions of the following metrics are available in the Monitoring section of the Apache Kafka documentation.
Note
Metrics with a _Count suffix are cumulative counters, for example kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_Count.
This also applies when the name contains PerSec: kafka_server_BrokerTopicMetrics_MessagesInPerSec_Count is a cumulative count of incoming messages, not a per-second rate.
To see how fast a _Count metric changes, apply a function such as rate() in PromQL, for example rate(kafka_server_BrokerTopicMetrics_MessagesInPerSec_Count[5m]).
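The idea behind rate() can be reproduced by hand from two scrapes of a _Count metric. A minimal sketch; the function name counter_rate is hypothetical. Like Prometheus, it treats a decrease in the counter (for example after a broker restart) as a reset to zero, but unlike rate() it uses only two samples and does no extrapolation.

```shell
#!/bin/sh
# Per-second rate of a cumulative _Count metric from two scrapes.
counter_rate() {
  # $1 $2: counter value and Unix timestamp (s) at the earlier scrape
  # $3 $4: counter value and Unix timestamp (s) at the later scrape
  awk -v v1="$1" -v t1="$2" -v v2="$3" -v t2="$4" 'BEGIN {
    before = v1 + 0; after = v2 + 0; elapsed = (t2 + 0) - (t1 + 0)
    if (elapsed <= 0) { print "n/a"; exit }
    # A drop means the counter was reset, so the later value is the increase since the reset.
    increase = (after >= before) ? after - before : after
    printf "%.3f\n", increase / elapsed
  }'
}

# Example: MessagesInPerSec_Count went from 120000 to 126000 in 60 s
# counter_rate 120000 1700000000 126000 1700000060   # prints 100.000
```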
Apache Kafka controller
Note
Metrics named kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_<X>thPercentile, where <X> is 50, 75, 95, 98, 99, or 999 (the 99.9th percentile), report the time taken for leader elections to complete at that percentile. Together they show how leader election times are distributed.
The kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_ metrics with a rate suffix (OneMinuteRate, FiveMinuteRate, FifteenMinuteRate, MeanRate) report the rate of leader elections over different time intervals.
The kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_ metrics with a Max, Mean, Min, or StdDev suffix give summary statistics of leader election times.
The kafka_controller_KafkaController_* metrics describe the state of the Kafka controller, such as the number of active brokers, offline partitions, and replicas to delete.
- kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_50thPercentile
- kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_75thPercentile
- kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_95thPercentile
- kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_98thPercentile
- kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_999thPercentile
- kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_99thPercentile
- kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_Count: Total number of leader elections.
- kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_FifteenMinuteRate
- kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_FiveMinuteRate
- kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_Max
- kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_Mean
- kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_MeanRate
- kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_Min
- kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_OneMinuteRate
- kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_StdDev
- kafka_controller_ControllerStats_UncleanLeaderElectionsPerSec_Count: Total number of unclean leader elections. Unclean leader elections can lead to data loss.
- kafka_controller_KafkaController_ActiveBrokerCount_Value
- kafka_controller_KafkaController_ActiveControllerCount_Value
- kafka_controller_KafkaController_FencedBrokerCount_Value
- kafka_controller_KafkaController_OfflinePartitionsCount_Value
- kafka_controller_KafkaController_PreferredReplicaImbalanceCount_Value
- kafka_controller_KafkaController_ReplicasIneligibleToDeleteCount_Value
- kafka_controller_KafkaController_ReplicasToDeleteCount_Value
- kafka_controller_KafkaController_TopicsIneligibleToDeleteCount_Value
- kafka_controller_KafkaController_TopicsToDeleteCount_Value
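The controller gauges support a quick cluster check. The Apache Kafka documentation describes a healthy cluster as having exactly one active controller and no offline partitions; the sketch below applies those two rules to Prometheus text output covering all brokers (for example, the /metrics output of every node combined). The function name controller_health is hypothetical.

```shell
#!/bin/sh
# Minimal controller health check over Prometheus text-format input on stdin.
# Flags the cluster if ActiveControllerCount does not sum to exactly 1
# across all series, or if any OfflinePartitionsCount series is above 0.
controller_health() {
  awk '
    function value_of(line,    f) {
      if (index(line, "}")) { sub(/^[^}]*[}]/, "", line) } else { sub(/^[^ \t]+/, "", line) }
      split(line, f, " ")
      return f[1] + 0
    }
    /^kafka_controller_KafkaController_ActiveControllerCount_Value[{ ]/ {
      active += value_of($0)
    }
    /^kafka_controller_KafkaController_OfflinePartitionsCount_Value[{ ]/ {
      offline += value_of($0)
    }
    END {
      healthy = 1
      if (active + 0 != 1) { print "WARN active controllers: " (active + 0); healthy = 0 }
      if (offline + 0 > 0) { print "WARN offline partitions: " (offline + 0); healthy = 0 }
      if (healthy) print "OK"
    }'
}
```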
Jolokia collector collect time
kafka_jolokia_collector_collect_time: Time taken by the Jolokia collector to collect metrics. Jolokia is a JMX-HTTP bridge that provides an alternative to native JMX access.
Apache Kafka log
Note
Metrics such as kafka_log_LogCleaner_cleaner_recopy_percent_Value and kafka_log_LogCleanerManager_time_since_last_run_ms_Value describe the operation of the log cleaner, which compacts Kafka logs.
The kafka_log_LogFlushStats_LogFlushRateAndTimeMs_* metrics describe log flush operations, which write data from memory to disk. The kafka_log_LogFlushStats_LogFlushRateAndTimeMs_<X>thPercentile metrics report the time taken to flush logs at various percentiles.
- kafka_log_LogCleaner_cleaner_recopy_percent_Value
- kafka_log_LogCleanerManager_time_since_last_run_ms_Value
- kafka_log_LogCleaner_max_clean_time_secs_Value
- kafka_log_LogFlushStats_LogFlushRateAndTimeMs_50thPercentile
- kafka_log_LogFlushStats_LogFlushRateAndTimeMs_75thPercentile
- kafka_log_LogFlushStats_LogFlushRateAndTimeMs_95thPercentile
- kafka_log_LogFlushStats_LogFlushRateAndTimeMs_98thPercentile
- kafka_log_LogFlushStats_LogFlushRateAndTimeMs_999thPercentile
- kafka_log_LogFlushStats_LogFlushRateAndTimeMs_99thPercentile
- kafka_log_LogFlushStats_LogFlushRateAndTimeMs_Count
- kafka_log_LogFlushStats_LogFlushRateAndTimeMs_FifteenMinuteRate
- kafka_log_LogFlushStats_LogFlushRateAndTimeMs_FiveMinuteRate
- kafka_log_LogFlushStats_LogFlushRateAndTimeMs_Max
- kafka_log_LogFlushStats_LogFlushRateAndTimeMs_Mean
- kafka_log_LogFlushStats_LogFlushRateAndTimeMs_MeanRate
- kafka_log_LogFlushStats_LogFlushRateAndTimeMs_Min
- kafka_log_LogFlushStats_LogFlushRateAndTimeMs_OneMinuteRate
- kafka_log_LogFlushStats_LogFlushRateAndTimeMs_StdDev
- kafka_log_Log_LogEndOffset_Value
- kafka_log_Log_LogStartOffset_Value
- kafka_log_Log_Size_Value
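kafka_log_Log_LogEndOffset_Value and kafka_log_Log_LogStartOffset_Value can be paired to see how many offsets each partition currently spans. A minimal sketch, assuming both metrics carry the same label set for a given partition (for example topic and partition labels); the function name log_offset_span is hypothetical. The span counts offsets, not necessarily retained messages: compaction, for example, leaves gaps in the offset sequence.

```shell
#!/bin/sh
# Offset span (LogEndOffset - LogStartOffset) per label set, read from
# Prometheus text-format input on stdin. Output is sorted by label set.
log_offset_span() {
  awk '
    function value_of(line,    f) {
      if (index(line, "}")) { sub(/^[^}]*[}]/, "", line) } else { sub(/^[^ \t]+/, "", line) }
      split(line, f, " ")
      return f[1] + 0
    }
    # The full label set, e.g. {partition="0",topic="orders"}, is the pairing key.
    function labels_of(line) {
      if (match(line, /[{][^}]*[}]/)) return substr(line, RSTART, RLENGTH)
      return "{}"
    }
    /^kafka_log_Log_LogStartOffset_Value[{ ]/ { first[labels_of($0)] = value_of($0) }
    /^kafka_log_Log_LogEndOffset_Value[{ ]/ { next_off[labels_of($0)] = value_of($0) }
    END {
      for (key in next_off) {
        if (key in first) printf "%s %.0f\n", key, next_off[key] - first[key]
      }
    }' | sort
}
```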
Apache Kafka network
Note
Metrics such as kafka_network_RequestMetrics_RequestsPerSec_Count and kafka_network_RequestMetrics_TotalTimeMs_Mean describe the network requests made to the Kafka brokers.
- kafka_network_RequestChannel_RequestQueueSize_Value
- kafka_network_RequestChannel_ResponseQueueSize_Value
- kafka_network_RequestMetrics_RequestsPerSec_Count
- kafka_network_RequestMetrics_TotalTimeMs_95thPercentile
- kafka_network_RequestMetrics_TotalTimeMs_Count
- kafka_network_RequestMetrics_TotalTimeMs_Mean
- kafka_network_SocketServer_NetworkProcessorAvgIdlePercent_Value
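The Apache Kafka documentation describes NetworkProcessorAvgIdlePercent as a fraction between 0 and 1 (despite the Percent in its name) that should ideally stay above 0.3. A minimal sketch that prints the series below a threshold, defaulting to 0.3, from Prometheus text output on stdin; the function name network_idle_below is hypothetical.

```shell
#!/bin/sh
# Print NetworkProcessorAvgIdlePercent series below a threshold.
network_idle_below() {
  # $1: threshold between 0 and 1 (defaults to 0.3)
  awk -v limit="${1:-0.3}" '
    function value_of(line,    f) {
      if (index(line, "}")) { sub(/^[^}]*[}]/, "", line) } else { sub(/^[^ \t]+/, "", line) }
      split(line, f, " ")
      return f[1] + 0
    }
    /^kafka_network_SocketServer_NetworkProcessorAvgIdlePercent_Value[{ ]/ {
      if (value_of($0) < limit + 0) print $0
    }'
}
```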
Apache Kafka server
Note
The kafka_server_BrokerTopicMetrics_* metrics describe topic-level operations, such as bytes in and out and failed fetch and produce requests.
The kafka_server_ReplicaManager_* metrics, such as kafka_server_ReplicaManager_LeaderCount_Value, describe the state of replicas in the Kafka cluster.
If you do not filter on the topic tag, a query returns a series that combines all topics as well as one series for each individual topic. To view specific topics, filter on the topic tag. To exclude the combined series and list only individual topics, filter with topic!="".
- kafka_server_BrokerTopicMetrics_BytesInPerSec_Count: Bytes in from clients, per topic. Omitting ‘topic=(…)’ yields the all-topic total.
- kafka_server_BrokerTopicMetrics_BytesOutPerSec_Count: Bytes out to clients, per topic. Omitting ‘topic=(…)’ yields the all-topic total.
- kafka_server_BrokerTopicMetrics_BytesRejectedPerSec_Count: Bytes rejected per topic because the record batch size was greater than the max.message.bytes configuration. Omitting ‘topic=(…)’ yields the all-topic total.
- kafka_server_BrokerTopicMetrics_FailedFetchRequestsPerSec_Count: Failed fetch requests (from clients or followers) per topic. Omitting ‘topic=(…)’ yields the all-topic total.
- kafka_server_BrokerTopicMetrics_FailedProduceRequestsPerSec_Count: Failed produce requests per topic. Omitting ‘topic=(…)’ yields the all-topic total.
- kafka_server_BrokerTopicMetrics_FetchMessageConversionsPerSec_Count: Message format conversions for fetch requests, per topic. Omitting ‘topic=(…)’ yields the all-topic total.
- kafka_server_BrokerTopicMetrics_MessagesInPerSec_Count: Incoming messages per topic. Omitting ‘topic=(…)’ yields the all-topic total.
- kafka_server_BrokerTopicMetrics_ProduceMessageConversionsPerSec_Count: Message format conversions for produce requests, per topic. Omitting ‘topic=(…)’ yields the all-topic total.
- kafka_server_BrokerTopicMetrics_ReassignmentBytesInPerSec_Count: Incoming bytes of reassignment traffic.
- kafka_server_BrokerTopicMetrics_ReassignmentBytesOutPerSec_Count: Outgoing bytes of reassignment traffic.
- kafka_server_BrokerTopicMetrics_ReplicationBytesInPerSec_Count: Bytes in from other brokers, per topic. Omitting ‘topic=(…)’ yields the all-topic total.
- kafka_server_BrokerTopicMetrics_ReplicationBytesOutPerSec_Count: Bytes out to other brokers, per topic. Omitting ‘topic=(…)’ yields the all-topic total.
- kafka_server_BrokerTopicMetrics_TotalFetchRequestsPerSec_Count: Fetch requests (from clients or followers) per topic. Omitting ‘topic=(…)’ yields the all-topic total.
- kafka_server_BrokerTopicMetrics_TotalProduceRequestsPerSec_Count: Produce requests per topic. Omitting ‘topic=(…)’ yields the all-topic total.
- kafka_server_DelayedOperationPurgatory_NumDelayedOperations_Value
- kafka_server_DelayedOperationPurgatory_PurgatorySize_Value
- kafka_server_KafkaRequestHandlerPool_RequestHandlerAvgIdlePercent_OneMinuteRate
- kafka_server_KafkaServer_BrokerState_Value
- kafka_server_ReplicaManager_IsrExpandsPerSec_Count
- kafka_server_ReplicaManager_IsrShrinksPerSec_Count
- kafka_server_ReplicaManager_LeaderCount_Value
- kafka_server_ReplicaManager_PartitionCount_Value
- kafka_server_ReplicaManager_UnderMinIsrPartitionCount_Value
- kafka_server_ReplicaManager_UnderReplicatedPartitions_Value
- kafka_server_group_coordinator_metrics_group_completed_rebalance_count
- kafka_server_group_coordinator_metrics_group_completed_rebalance_rate
- kafka_server_group_coordinator_metrics_offset_commit_count
- kafka_server_group_coordinator_metrics_offset_commit_rate
- kafka_server_group_coordinator_metrics_offset_deletion_count
- kafka_server_group_coordinator_metrics_offset_deletion_rate
- kafka_server_group_coordinator_metrics_offset_expiration_count
- kafka_server_group_coordinator_metrics_offset_expiration_rate
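The topic!="" idea can also be applied to raw /metrics output: series without a topic label (the all-topic series) are skipped, and the remaining values are summed per topic across brokers. A minimal sketch, assuming the label is literally named topic; the function name per_topic_values is hypothetical. The results are cumulative totals, so compare two scrapes (or use rate() in PromQL) to get per-second values.

```shell
#!/bin/sh
# Sum a BrokerTopicMetrics series per topic from Prometheus text on stdin,
# skipping series with no (or an empty) topic label, like topic!="" in PromQL.
per_topic_values() {
  # $1: metric name, for example kafka_server_BrokerTopicMetrics_BytesInPerSec_Count
  awk -v name="$1" '
    function value_of(line,    f) {
      if (index(line, "}")) { sub(/^[^}]*[}]/, "", line) } else { sub(/^[^ \t]+/, "", line) }
      split(line, f, " ")
      return f[1] + 0
    }
    index($0, name "{") == 1 || index($0, name " ") == 1 {
      topic = ""
      if (match($0, /topic="[^"]*"/)) topic = substr($0, RSTART + 7, RLENGTH - 8)
      if (topic != "") total[topic] += value_of($0)   # skip the all-topic series
    }
    END { for (t in total) printf "%s %.0f\n", t, total[t] }' | sort
}
```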
Kernel
Note
Metrics such as kernel_boot_time and kernel_context_switches describe the operation of the underlying system’s kernel.
- kernel_boot_time
- kernel_context_switches
- kernel_entropy_avail
- kernel_interrupts
- kernel_processes_forked
Generic memory
Note
Metrics such as mem_active and mem_available describe the system’s memory usage.
- mem_active
- mem_available
- mem_available_percent
- mem_buffered
- mem_cached
- mem_commit_limit
- mem_committed_as
- mem_dirty
- mem_free
- mem_high_free
- mem_high_total
- mem_huge_pages_free
- mem_huge_page_size
- mem_huge_pages_total
- mem_inactive
- mem_low_free
- mem_low_total
- mem_mapped
- mem_page_tables
- mem_shared
- mem_slab
- mem_swap_cached
- mem_swap_free
- mem_swap_total
- mem_total
- mem_used
- mem_used_percent
- mem_vmalloc_chunk
- mem_vmalloc_total
- mem_vmalloc_used
- mem_wired
- mem_write_back
- mem_write_back_tmp
Network
Note
Metrics such as net_bytes_recv and net_packets_sent describe the system’s network activity.
- net_bytes_recv
- net_bytes_sent
- net_drop_in
- net_drop_out
- net_err_in
- net_err_out
- net_icmp_inaddrmaskreps
- net_icmp_inaddrmasks
- net_icmp_incsumerrors
- net_icmp_indestunreachs
- net_icmp_inechoreps
- net_icmp_inechos
- net_icmp_inerrors
- net_icmp_inmsgs
- net_icmp_inparmprobs
- net_icmp_inredirects
- net_icmp_insrcquenchs
- net_icmp_intimeexcds
- net_icmp_intimestampreps
- net_icmp_intimestamps
- net_icmpmsg_intype3
- net_icmpmsg_intype8
- net_icmpmsg_outtype0
- net_icmpmsg_outtype3
- net_icmp_outaddrmaskreps
- net_icmp_outaddrmasks
- net_icmp_outdestunreachs
- net_icmp_outechoreps
- net_icmp_outechos
- net_icmp_outerrors
- net_icmp_outmsgs
- net_icmp_outparmprobs
- net_icmp_outredirects
- net_icmp_outsrcquenchs
- net_icmp_outtimeexcds
- net_icmp_outtimestampreps
- net_icmp_outtimestamps
- net_ip_defaultttl
- net_ip_forwarding
- net_ip_forwdatagrams
- net_ip_fragcreates
- net_ip_fragfails
- net_ip_fragoks
- net_ip_inaddrerrors
- net_ip_indelivers
- net_ip_indiscards
- net_ip_inhdrerrors
- net_ip_inreceives
- net_ip_inunknownprotos
- net_ip_outdiscards
- net_ip_outnoroutes
- net_ip_outrequests
- net_ip_reasmfails
- net_ip_reasmoks
- net_ip_reasmreqds
- net_ip_reasmtimeout
- net_packets_recv
- net_packets_sent
- netstat_tcp_close
- netstat_tcp_close_wait
- netstat_tcp_closing
- netstat_tcp_established
- netstat_tcp_fin_wait1
- netstat_tcp_fin_wait2
- netstat_tcp_last_ack
- netstat_tcp_listen
- netstat_tcp_none
- netstat_tcp_syn_recv
- netstat_tcp_syn_sent
- netstat_tcp_time_wait
- netstat_udp_socket
- net_tcp_activeopens
- net_tcp_attemptfails
- net_tcp_currestab
- net_tcp_estabresets
- net_tcp_incsumerrors
- net_tcp_inerrs
- net_tcp_insegs
- net_tcp_maxconn
- net_tcp_outrsts
- net_tcp_outsegs
- net_tcp_passiveopens
- net_tcp_retranssegs
- net_tcp_rtoalgorithm
- net_tcp_rtomax
- net_tcp_rtomin
- net_udp_ignoredmulti
- net_udp_incsumerrors
- net_udp_indatagrams
- net_udp_inerrors
- net_udplite_ignoredmulti
- net_udplite_incsumerrors
- net_udplite_indatagrams
- net_udplite_inerrors
- net_udplite_noports
- net_udplite_outdatagrams
- net_udplite_rcvbuferrors
- net_udplite_sndbuferrors
- net_udp_noports
- net_udp_outdatagrams
- net_udp_rcvbuferrors
- net_udp_sndbuferrors
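A common way to read the TCP counters is the retransmission ratio: retransmitted segments relative to segments sent between two scrapes. A minimal sketch, assuming net_tcp_retranssegs and net_tcp_outsegs are cumulative counters mirroring the RetransSegs and OutSegs fields of Linux /proc/net/snmp; the function name tcp_retrans_percent is hypothetical.

```shell
#!/bin/sh
# TCP retransmitted segments as a percentage of segments sent between two scrapes.
# ASSUMPTION: both metrics are cumulative counters from the same host.
tcp_retrans_percent() {
  # $1 $2: net_tcp_retranssegs and net_tcp_outsegs at the earlier scrape
  # $3 $4: the same two metrics at the later scrape
  awk -v r1="$1" -v o1="$2" -v r2="$3" -v o2="$4" 'BEGIN {
    sent = (o2 + 0) - (o1 + 0)
    if (sent <= 0) { print "n/a"; exit }   # nothing sent, or counter reset
    printf "%.2f\n", ((r2 + 0) - (r1 + 0)) / sent * 100
  }'
}

# Example: 50 retransmissions out of 10000 segments sent
# tcp_retrans_percent 100 100000 150 110000   # prints 0.50
```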
Processes
Note
Metrics such as processes_running and processes_zombies describe the state of processes on the system.
- processes_blocked
- processes_dead
- processes_idle
- processes_paging
- processes_running
- processes_sleeping
- processes_stopped
- processes_total
- processes_total_threads
- processes_unknown
- processes_zombies
Swap usage
Note
Metrics such as swap_free and swap_used describe the system’s swap memory usage.
- swap_free
- swap_in
- swap_out
- swap_total
- swap_used
- swap_used_percent