Spark Streaming with Kafka Leading to an Increase in Open File Descriptors (Kafka)

Open file descriptors on Kafka brokers come from two sources:

1. File descriptors used just to track log segment files.
2. Additional file descriptors used to communicate over network sockets with external parties (such as clients, other brokers, ZooKeeper, and Kerberos).

For #1, the formula is:

(number of partitions)*(partition size / segment size)

Reference - https://docs.cloudera.com/cdp-private-cloud-base/7.1.6/kafka-performance-tuning/topics/kafka-tune-broker-syslevel-file-descriptors.html
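As a rough worked example of the formula above, here is a minimal Python sketch. The function name and the broker figures (2,000 partitions of 10 GiB each, 1 GiB segments) are hypothetical, and rounding up to whole segments is an assumption added here; the formula itself uses a plain division.

```python
import math

def segment_file_descriptors(num_partitions, partition_size_bytes, segment_size_bytes):
    """Estimate descriptors needed to track log segment files on one broker."""
    # (number of partitions) * (partition size / segment size),
    # rounded up because a partially filled segment is still an open file.
    segments_per_partition = math.ceil(partition_size_bytes / segment_size_bytes)
    return num_partitions * segments_per_partition

GIB = 1024 ** 3

# Hypothetical broker: 2,000 partitions of 10 GiB each, with 1 GiB segments.
estimate = segment_file_descriptors(2000, 10 * GIB, 1 * GIB)
print(estimate)  # 20000
```

This only covers the log segment side; the socket descriptors described next come on top of it.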

For #2, every connection made by a consumer, producer, ZooKeeper, or Kerberos opens file descriptors. Note that each TCP connection creates 2 file descriptors. These connections can be for internal heartbeats, security handshakes, or data transfer to or from a client (producer or consumer).
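To see how many descriptors a process holds, and how many of them are sockets, one option on Linux is to read `/proc/<pid>/fd`. The sketch below is an illustration rather than a Kafka tool: the helper name `count_open_fds` is made up here, it assumes a Linux `/proc` filesystem, and it demonstrates on the current Python process instead of a real broker.

```python
import os
import socket

def count_open_fds(pid="self"):
    """Count a process's open descriptors, split into sockets and everything else."""
    fd_dir = f"/proc/{pid}/fd"  # Linux-only; reading another PID needs permission
    counts = {"socket": 0, "other": 0}
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed while we were scanning
        kind = "socket" if target.startswith("socket:") else "other"
        counts[kind] += 1
    return counts

# Demonstration on the current process: opening a socket shows up as one
# more socket descriptor held by this process.
before = count_open_fds()
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
after = count_open_fds()
sock.close()
print(before, after)
```

To inspect a broker, the same function could be called with the broker's process ID (for example `count_open_fds(12345)`, where `12345` is a placeholder) from a user allowed to read that process's `/proc` entry.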

Now consider a Spark application integrated with Kafka that is not stable, meaning:

The streaming window (batch interval) for micro-batches is shorter than the processing time, which creates a growing scheduling delay for each batch.
As a result, the backlog of Active Batches starts to grow in the Spark program.

Each Active Batch opens connections to Kafka, and therefore file descriptors, to read messages or metadata. These descriptors remain open until that batch is processed.

So if Active Batches keep piling up, open file descriptors pile up on the Kafka broker side as well.
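The backlog growth described above can be illustrated with a small simulation. This is a simplified model, not Spark's actual scheduler: it assumes one batch is submitted per interval, batches are processed one at a time, and every batch takes the same processing time. The function name and the numbers are hypothetical.

```python
def active_batch_backlog(batch_interval_s, processing_time_s, num_batches):
    """Return the number of active (queued or running) batches at each submission."""
    finish_times = []
    backlog = []
    previous_finish = 0
    for i in range(num_batches):
        submit = i * batch_interval_s
        # A batch starts when it is submitted or when the previous one ends,
        # whichever is later (serial processing).
        start = max(submit, previous_finish)
        previous_finish = start + processing_time_s
        finish_times.append(previous_finish)
        # Active = batches submitted so far minus those already finished.
        finished = sum(1 for f in finish_times if f <= submit)
        backlog.append(i + 1 - finished)
    return backlog

# Unstable: 10 s batch interval, 25 s processing time.
print(active_batch_backlog(10, 25, 6))  # [1, 2, 3, 3, 4, 4]

# Stable: 10 s batch interval, 5 s processing time.
print(active_batch_backlog(10, 5, 6))   # [1, 1, 1, 1, 1, 1]
```

In the unstable case the count keeps climbing for as long as the job runs, and if each active batch holds Kafka connections open as described above, the broker's open descriptor count climbs with it.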

[Figure: Open file descriptors due to increased Spark Streaming Active Batches]

Thus, a single Spark Streaming application can degrade overall Kafka broker performance, and with it the entire cluster, which may be shared by other teams and applications.
