Disable seek compaction

Seek compaction is causing a cascade effect in the chainstate DB, causing large parts of the database to be rewritten every ~hour.

Every periodic flush writes around 2 MiB. Since this is roughly the `write_buffer_size`, these writes regularly cause the memtable to rotate into a small L0 file. This file has a small seek budget, and with the random UTXO reads done during validation, it can get scheduled for seek compaction quickly.

That seek compaction pushes the small file down to L1. Since most UTXOs are already lower down in L4/L5, many reads that consult this file do not find the key there and continue downward. The bloom filter makes those misses cheap, but LevelDB still decrements the file's seek budget. The file then gets scheduled for another seek compaction, and the same pattern pushes it down through L2 and L3.

The expensive part happens around L3/L4. L4 has many ~32 MiB files holding the bulk of the UTXO set. When LevelDB compacts into L3, it may split the output into many smaller L3 files to limit how much L4 "grandparent" data any one output overlaps. Each of these small L3 files then gets its own small seek budget. Because chainstate keys are hash-random, each small L3 file can still have a broad key range, so many random reads consult it and quickly drain its budget. Once seek-compacted into L4, each tiny L3 file can overlap many L4 files, so compacting a few hundred KiB from L3 can require rewriting hundreds of MiB from L4. Repeating that across many small L3 files can rewrite most of the chainstate.

This is a poor fit for chainstate because UTXO keys are hash-random, the DB is large enough to have many levels, writes are relatively small and periodic, and reads are frequent. The result is that read misses trigger compactions much earlier than size pressure would, and those compactions have very high write amplification.

Disabling seek compaction may leave more files in upper levels for longer, so reads could theoretically consult more files. But Bitcoin Core enables bloom filters for all its LevelDB instances, so these misses are usually cheap in-memory filter checks rather than disk reads.

For the other DBs, the risk is much smaller. They also use bloom filters, and most are smaller and less read-heavy. With fewer levels and less random read pressure, disabling seek compaction should have little effect there.

Co-authored-by: l0rinc <pap.lorinc@gmail.com>
Github-Pull: #35313
Rebased-From: 6bfdb6093bba4710d0f8313ed0113967a8b5176f
2026-06-15 17:21:09 +02:00 · 2026-05-19 11:38:51 -04:00
parent 1dba05e7f6
commit c913cd9add
3 changed files with 24 additions and 27 deletions
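
To make the arithmetic in the commit message concrete, here is a minimal sketch of the seek budget and of the cost of one L3-to-L4 compaction. The budget formula is copied from the `version_set.cc` hunk in this commit (one seek per 16 KiB, with a floor of 100). The helper names `AllowedSeeks` and `OverlapRewriteBytes` are hypothetical, the overlap model is a deliberate simplification (every overlapped L4 file is assumed to be rewritten in full), and the sizes used are illustrative rather than measured.

```cpp
#include <cstdint>

// Seek budget as computed in VersionSet::Builder: roughly one allowed
// seek per 16 KiB of file data, never fewer than 100.
constexpr int AllowedSeeks(uint64_t file_size) {
  int allowed = static_cast<int>(file_size / 16384U);
  return allowed < 100 ? 100 : allowed;
}

// Simplified estimate (an assumption, not LevelDB code) of the bytes a
// compaction touches when one upper-level file is merged into
// `overlapped_files` lower-level files of `lower_file_size` bytes each.
constexpr uint64_t OverlapRewriteBytes(uint64_t input_size,
                                       uint64_t overlapped_files,
                                       uint64_t lower_file_size) {
  return input_size + overlapped_files * lower_file_size;
}

constexpr uint64_t kKiB = 1024;
constexpr uint64_t kMiB = 1024 * kKiB;

// A ~2 MiB file flushed from the memtable gets a budget of 128 seeks.
static_assert(AllowedSeeks(2 * kMiB) == 128, "2 MiB / 16 KiB");
// A 300 KiB file would compute to 18 seeks, so the floor of 100 applies.
static_assert(AllowedSeeks(300 * kKiB) == 100, "floor of 100 seeks");
// A full ~32 MiB file gets 2048 seeks.
static_assert(AllowedSeeks(32 * kMiB) == 2048, "32 MiB / 16 KiB");
```

Under these assumptions, `OverlapRewriteBytes(300 * kKiB, 10, 32 * kMiB)` comes to a little over 320 MiB touched for about 300 KiB of new data, which is the kind of write amplification the message describes. With a budget of only 100 to 128 seeks per small file, a read-heavy workload with hash-random keys can exhaust it quickly.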
--- a/src/leveldb/db/autocompact_test.cc
+++ b/src/leveldb/db/autocompact_test.cc
@@ -54,8 +54,8 @@ static const int kValueSize = 200 * 1024;
 static const int kTotalSize = 100 * 1024 * 1024;
 static const int kCount = kTotalSize / kValueSize;

-// Read through the first n keys repeatedly and check that they get
-// compacted (verified by checking the size of the key space).
+// Read through the first n keys repeatedly and check that reads do NOT
+// trigger compaction (seek compaction is disabled in this fork).
 void AutoCompactTest::DoReads(int n) {
  std::string value(kValueSize, 'x');
  DBImpl* dbi = reinterpret_cast<DBImpl*>(db_);
@@ -76,25 +76,23 @@ void AutoCompactTest::DoReads(int n) {
  const int64_t initial_size = Size(Key(0), Key(n));
  const int64_t initial_other_size = Size(Key(n), Key(kCount));

-  // Read until size drops significantly.
+  // Read repeatedly. The size of the read range must NOT shrink: with
+  // seek compaction disabled, reads never schedule a compaction.
  std::string limit_key = Key(n);
-  for (int read = 0; true; read++) {
-    ASSERT_LT(read, 100) << "Taking too long to compact";
+  for (int read = 0; read < 100; read++) {
    Iterator* iter = db_->NewIterator(ReadOptions());
    for (iter->SeekToFirst();
         iter->Valid() && iter->key().ToString() < limit_key; iter->Next()) {
      // Drop data
    }
    delete iter;
-    // Wait a little bit to allow any triggered compactions to complete.
-    Env::Default()->SleepForMicroseconds(1000000);
    uint64_t size = Size(Key(0), Key(n));
    fprintf(stderr, "iter %3d => %7.3f MB [other %7.3f MB]\n", read + 1,
            size / 1048576.0, Size(Key(n), Key(kCount)) / 1048576.0);
-    if (size <= initial_size / 10) {
-      break;
-    }
  }
+  // Give any background work a chance to run, even though none should.
+  Env::Default()->SleepForMicroseconds(1000000);
+  ASSERT_EQ(Size(Key(0), Key(n)), static_cast<uint64_t>(initial_size));

  // Verify that the size of the key space not touched by the reads
  // is pretty much unchanged.
--- a/src/leveldb/db/db_test.cc
+++ b/src/leveldb/db/db_test.cc
@@ -735,15 +735,14 @@ TEST(DBTest, GetPicksCorrectFile) {
  } while (ChangeOptions());
 }

-TEST(DBTest, GetEncountersEmptyLevel) {
+TEST(DBTest, GetDoesNotTriggerSeekCompaction) {
  do {
    // Arrange for the following to happen:
    //   * sstable A in level 0
    //   * nothing in level 1
    //   * sstable B in level 2
-    // Then do enough Get() calls to arrange for an automatic compaction
-    // of sstable A.  A bug would cause the compaction to be marked as
-    // occurring at level 1 (instead of the correct level 0).
+    // Seek compaction is disabled in this fork, so repeated reads must
+    // not change the level layout. A manual compaction must still work.

    // Step 1: First place sstables in levels 0 and 2
    int compaction_count = 0;
@@ -761,14 +760,17 @@ TEST(DBTest, GetEncountersEmptyLevel) {
    ASSERT_EQ(NumTableFilesAtLevel(1), 0);
    ASSERT_EQ(NumTableFilesAtLevel(2), 1);

-    // Step 3: read a bunch of times
+    // Step 3: many read misses must not schedule any compaction.
    for (int i = 0; i < 1000; i++) {
      ASSERT_EQ("NOT_FOUND", Get("missing"));
    }
-
-    // Step 4: Wait for compaction to finish
    DelayMilliseconds(1000);
+    ASSERT_EQ(NumTableFilesAtLevel(0), 1);
+    ASSERT_EQ(NumTableFilesAtLevel(1), 0);
+    ASSERT_EQ(NumTableFilesAtLevel(2), 1);

+    // Step 4: a manual compaction still moves the L0 file down.
+    dbfull()->TEST_CompactRange(0, nullptr, nullptr);
    ASSERT_EQ(NumTableFilesAtLevel(0), 0);
  } while (ChangeOptions());
 }
--- a/src/leveldb/db/version_set.cc
+++ b/src/leveldb/db/version_set.cc
@@ -400,16 +400,11 @@ Status Version::Get(const ReadOptions& options, const LookupKey& k,
  return state.found ? state.s : Status::NotFound(Slice());
 }

-bool Version::UpdateStats(const GetStats& stats) {
-  FileMetaData* f = stats.seek_file;
-  if (f != nullptr) {
-    f->allowed_seeks--;
-    if (f->allowed_seeks <= 0 && file_to_compact_ == nullptr) {
-      file_to_compact_ = f;
-      file_to_compact_level_ = stats.seek_file_level;
-      return true;
-    }
-  }
+bool Version::UpdateStats(const GetStats& /*stats*/) {
+  // Disable automatic compactions triggered by read seek counters.
+  // The heuristic was tuned for expensive random seeks and can create
+  // severe write amplification on large random-key databases.
+  // Size and manual compactions still run.
  return false;
 }

@@ -661,6 +656,8 @@ class VersionSet::Builder {
      // same as the compaction of 40KB of data.  We are a little
      // conservative and allow approximately one seek for every 16KB
      // of data before triggering a compaction.
+      //
+      // Note: seek compactions are disabled. See Version::UpdateStats.
      f->allowed_seeks = static_cast<int>((f->file_size / 16384U));
      if (f->allowed_seeks < 100) f->allowed_seeks = 100;