Skip to main content

Diffing Large Arrays Efficiently: How Merkle Trees Unlock Scalable Sync

🧠 The Problem: Efficiently Diffing Large Arrays​

Let’s say we have two large arrays β€” could be thousands or millions of entries.

❓ Goal:​

Find out what changed between them β€” not by checking every item blindly, but in a scalable way.

πŸͺ“ Step 1: The Naive Approach​

def naive_diff(a, b):
diffs = []
for i in range(min(len(a), len(b))):
if a[i] != b[i]:
diffs.append((i, a[i], b[i]))
# Handle any extra elements
if len(a) > len(b):
diffs.extend((i, a[i], None) for i in range(len(b), len(a)))
elif len(b) > len(a):
diffs.extend((i, None, b[i]) for i in range(len(a), len(b)))
return diffs

πŸ“‰ Benchmark This​

import time

a = ["x"] * 1_000_000
b = a.copy()
b[123456] = "y"

start = time.perf_counter()
diffs = naive_diff(a, b)
end = time.perf_counter()
print(f"Naive diff took {end - start:.4f} seconds")

πŸ” Step 2: Let's Structure the Data Differently​

"What if we store the array as a tree of hashes instead?"

  • Hash every element
  • Group into pairs and hash up
  • Keep doing it until we have a root

We now have a Merkle tree.

🌳 Step 3: Building the Merkle Tree​

import hashlib

def hash_fn(data):
return hashlib.sha256(data.encode()).hexdigest()

def build_merkle_tree(leaves):
if len(leaves) == 1:
return leaves
parent_level = []
for i in range(0, len(leaves), 2):
left = leaves[i]
right = leaves[i + 1] if i + 1 < len(leaves) else left
parent_level.append(hash_fn(left + right))
return build_merkle_tree(parent_level)

leaf_hashes = [hash_fn(x) for x in a]
root = build_merkle_tree(leaf_hashes)

You might memoize branches if you plan to diff or update later.

πŸͺž Step 4: Diffing Two Trees​

Now we recursively walk two Merkle trees side by side:

def merkle_diff(tree1, tree2, depth=0):
if tree1 == tree2:
return []
if depth == MAX_DEPTH:
return [(depth, tree1, tree2)]
left_diff = merkle_diff(tree1.left, tree2.left, depth + 1)
right_diff = merkle_diff(tree1.right, tree2.right, depth + 1)
return left_diff + right_diff

➑️ Show how this drastically reduces work when only 1 or 2 nodes differ.

πŸ§ͺ Benchmark it against the naive approach​

<PK_ToDo>


✍️ Updating a Tree​

  • Update a leaf node
  • Recalculate hashes on the path to the root
def update_leaf(tree, index, new_value):
tree.leaves[index] = hash_fn(new_value)
# propagate updates up the tree

🧩 Addendum: On Order and Position​

"You're diffing ordinal position as well."

Merkle trees work best for ordered data like arrays. They don’t just detect changes in a set β€” they detect where the change is, based on the position of the leaf.

[b, a, c, d] and [a, b, c, d] would produce different hashes in that middle layer

  • In our hash functions, we just need to ensure hash(left + right) isn't symmetric β€” i.e., hash(A+B) β‰  hash(B+A)
  • This naturally happens with most cryptographic hashes (they are not commutative)

🧠 Final Thoughts​

Merkle trees shine when:

  • Most of the data stays the same
  • You want to detect changes fast
  • Bandwidth or compute is a concern

They're not for searching, indexing, or querying values β€” they’re for detecting and verifying change efficiently.

Traversal​

  • Building or updating a Merkle tree traverses from leaf to root β€” because each node’s hash depends on its children. The root ultimately reflects all underlying data.
    • Diffing two Merkle trees traverses from root to leaf β€” allowing us to skip entire subtrees when hashes match. This is what enables logarithmic-time comparison in the best case.