An interesting issue with Kafka log cleaner threads

Xin Li
2 min read · Oct 17, 2021

The issue

At our company we use Kafka's log compaction feature heavily for CDC topics. Recently the operations team reported continuous growth in disk usage without any increase in traffic. And the box was almost FULL!!!
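
For context, compaction is enabled per topic via cleanup.policy. A compacted CDC topic typically carries settings roughly like the following (shown here with Kafka's defaults for illustration, not our exact configuration):

cleanup.policy=compact
min.cleanable.dirty.ratio=0.5
delete.retention.ms=86400000

With cleanup.policy=compact, the broker's log cleaner threads periodically rewrite old segments so that only the latest record per key is kept, which is what normally keeps disk usage for these topics bounded.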

The investigation

The affected topics were all CDC topics, which suggested that compaction wasn't working as expected. Looking into log-cleaner.log, we found the following entries.

[2021-09-18 06:49:19,416] INFO Cleaner 0: Building offset map for log cdc-topic-14 for 1 segments in offset range [449610557, 459651607). (kafka.log.LogCleaner)
[2021-09-18 06:50:06,092] ERROR [kafka-log-cleaner-thread-0]: Error due to (kafka.log.LogCleaner)
java.io.IOException: Input/output error
at sun.nio.ch.FileDispatcherImpl.pread0(Native Method)
at sun.nio.ch.FileDispatcherImpl.pread(FileDispatcherImpl.java:52)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:220)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:741)
at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:727)
at org.apache.kafka.common.utils.Utils.readFully(Utils.java:776)
at org.apache.kafka.common.record.FileRecords.readInto(FileRecords.java:115)
at kafka.log.Cleaner.kafka$log$Cleaner$$buildOffsetMapForSegment(LogCleaner.scala:736)
at kafka.log.Cleaner$$anonfun$buildOffsetMap$3.apply(LogCleaner.scala:707)
at kafka.log.Cleaner$$anonfun$buildOffsetMap$3.apply(LogCleaner.scala:704)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at kafka.log.Cleaner.buildOffsetMap(LogCleaner.scala:704)
at kafka.log.Cleaner.doClean(LogCleaner.scala:373)
at kafka.log.Cleaner.clean(LogCleaner.scala:361)
at kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:256)
at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:236)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:64)
[2021-09-18 06:50:06,092] INFO [kafka-log-cleaner-thread-0]: Stopped (kafka.log.LogCleaner)

The first thing that came to my mind: maaaaybe there was a malformed record that killed the log cleaner thread, so let's try a restart. It worked for the first 5 minutes, then the same error occurred, but on a different topic partition. Hmmmm, that can't be right. Taking another look at the stack trace, it seemed like something was wrong with the underlying disk. A quick dmesg -T gave us this.

[Fri Oct  1 04:17:36 2021] end_request: critical medium error, dev sde, sector 1041778840
[Fri Oct 1 04:17:37 2021] sd 0:0:0:4: [sde] Unhandled sense code
[Fri Oct 1 04:17:37 2021] sd 0:0:0:4: [sde]
[Fri Oct 1 04:17:37 2021] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Fri Oct 1 04:17:37 2021] sd 0:0:0:4: [sde]
[Fri Oct 1 04:17:37 2021] Sense Key : Medium Error [current]
[Fri Oct 1 04:17:37 2021] sd 0:0:0:4: [sde]
[Fri Oct 1 04:17:37 2021] Add. Sense: Unrecovered read error
[Fri Oct 1 04:17:37 2021] sd 0:0:0:4: [sde] CDB:
[Fri Oct 1 04:17:37 2021] Read(10): 28 00 3e 18 48 98 00 00 08 00
[Fri Oct 1 04:17:37 2021] end_request: critical medium error, dev sde, sector 1041778840

The solution

After figuring out the root cause, we did the following to mitigate the issue:

  1. Double-checked that every topic in the cluster had a replication factor of at least 2.
  2. Adjusted server.properties by taking the directory on the problematic disk out of the log.dirs list (see the sketch after this list).
  3. Restarted the broker.
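
To make steps 1 and 2 concrete, here is roughly what they looked like; the paths and topic name below are illustrative, not our actual layout:

# step 1: confirm the replication factor of a topic
bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic cdc-topic
# the output includes something like "ReplicationFactor: 2"

# step 2: log.dirs in server.properties, before (the directory on the failing sde disk still listed)
log.dirs=/data1/kafka-logs,/data2/kafka-logs,/data3/kafka-logs,/data4/kafka-logs
# step 2: log.dirs after dropping that directory
log.dirs=/data1/kafka-logs,/data2/kafka-logs,/data3/kafka-logs

Step 1 matters because once the directory is removed, the broker no longer has a local copy of the partitions that lived on it; after the restart it rebuilds those replicas by fetching from the leaders on other brokers, which only works if at least one other replica exists.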

Compaction started working again afterwards and freed up the disk space. The operations team will replace the faulty disk later, so we can bring that slot back into service.

Aaaaand case closed on my side :)
