Diffing Large Arrays Efficiently: How Merkle Trees Unlock Scalable Sync
π§ The Problem: Efficiently Diffing Large Arraysβ
Letβs say we have two large arrays β could be thousands or millions of entries.
β Goal:β
Find out what changed between them β not by checking every item blindly, but in a scalable way.
πͺ Step 1: The Naive Approachβ
def naive_diff(a, b):
diffs = []
for i in range(min(len(a), len(b))):
if a[i] != b[i]:
diffs.append((i, a[i], b[i]))
# Handle any extra elements
if len(a) > len(b):
diffs.extend((i, a[i], None) for i in range(len(b), len(a)))
elif len(b) > len(a):
diffs.extend((i, None, b[i]) for i in range(len(a), len(b)))
return diffs
π Benchmark Thisβ
import time
a = ["x"] * 1_000_000
b = a.copy()
b[123456] = "y"
start = time.perf_counter()
diffs = naive_diff(a, b)
end = time.perf_counter()
print(f"Naive diff took {end - start:.4f} seconds")
π Step 2: Let's Structure the Data Differentlyβ
"What if we store the array as a tree of hashes instead?"
- Hash every element
- Group into pairs and hash up
- Keep doing it until we have a root
We now have a Merkle tree.
π³ Step 3: Building the Merkle Treeβ
import hashlib
def hash_fn(data):
return hashlib.sha256(data.encode()).hexdigest()
def build_merkle_tree(leaves):
if len(leaves) == 1:
return leaves
parent_level = []
for i in range(0, len(leaves), 2):
left = leaves[i]
right = leaves[i + 1] if i + 1 < len(leaves) else left
parent_level.append(hash_fn(left + right))
return build_merkle_tree(parent_level)
leaf_hashes = [hash_fn(x) for x in a]
root = build_merkle_tree(leaf_hashes)
You might memoize branches if you plan to diff or update later.
πͺ Step 4: Diffing Two Treesβ
Now we recursively walk two Merkle trees side by side:
def merkle_diff(tree1, tree2, depth=0):
if tree1 == tree2:
return []
if depth == MAX_DEPTH:
return [(depth, tree1, tree2)]
left_diff = merkle_diff(tree1.left, tree2.left, depth + 1)
right_diff = merkle_diff(tree1.right, tree2.right, depth + 1)
return left_diff + right_diff
β‘οΈ Show how this drastically reduces work when only 1 or 2 nodes differ.
π§ͺ Benchmark it against the naive approachβ
<PK_ToDo>
βοΈ Updating a Treeβ
- Update a leaf node
- Recalculate hashes on the path to the root
def update_leaf(tree, index, new_value):
tree.leaves[index] = hash_fn(new_value)
# propagate updates up the tree
π§© Addendum: On Order and Positionβ
"You're diffing ordinal position as well."
Merkle trees work best for ordered data like arrays. They donβt just detect changes in a set β they detect where the change is, based on the position of the leaf.
[b, a, c, d]
and [a, b, c, d]
would produce different hashes in that middle layer
- In our hash functions, we just need to ensure
hash(left + right)
isn't symmetric β i.e.,hash(A+B)
βhash(B+A)
- This naturally happens with most cryptographic hashes (they are not commutative)
π§ Final Thoughtsβ
Merkle trees shine when:
- Most of the data stays the same
- You want to detect changes fast
- Bandwidth or compute is a concern
They're not for searching, indexing, or querying values β theyβre for detecting and verifying change efficiently.
Traversalβ
- Building or updating a Merkle tree traverses from leaf to root β because each nodeβs hash depends on its children. The root ultimately reflects all underlying data.
-
- Diffing two Merkle trees traverses from root to leaf β allowing us to skip entire subtrees when hashes match. This is what enables logarithmic-time comparison in the best case.