Diffing Large Arrays Efficiently: How Merkle Trees Unlock Scalable Sync

🧠 The Problem: Efficiently Diffing Large Arrays

Let’s say we have two large arrays — could be thousands or millions of entries.

❓ Goal:

Find out what changed between them — not by checking every item blindly, but in a scalable way.

🪓 Step 1: The Naive Approach

def naive_diff(a, b):
    diffs = []
    for i in range(min(len(a), len(b))):
        if a[i] != b[i]:
            diffs.append((i, a[i], b[i]))
    # Handle any extra elements
    if len(a) > len(b):
        diffs.extend((i, a[i], None) for i in range(len(b), len(a)))
    elif len(b) > len(a):
        diffs.extend((i, None, b[i]) for i in range(len(a), len(b)))
    return diffs

📉 Benchmark This

import time

a = ["x"] * 1_000_000
b = a.copy()
b[123456] = "y"

start = time.perf_counter()
diffs = naive_diff(a, b)
end = time.perf_counter()
print(f"Naive diff took {end - start:.4f} seconds")

🔁 Step 2: Let's Structure the Data Differently

"What if we store the array as a tree of hashes instead?"

Hash every element
Group into pairs and hash up
Keep doing it until we have a root

We now have a Merkle tree.

🌳 Step 3: Building the Merkle Tree

import hashlib

def hash_fn(data):
    return hashlib.sha256(data.encode()).hexdigest()

def build_merkle_tree(leaves):
    if len(leaves) == 1:
        return leaves
    parent_level = []
    for i in range(0, len(leaves), 2):
        left = leaves[i]
        right = leaves[i + 1] if i + 1 < len(leaves) else left
        parent_level.append(hash_fn(left + right))
    return build_merkle_tree(parent_level)

leaf_hashes = [hash_fn(x) for x in a]
root = build_merkle_tree(leaf_hashes)

You might memoize branches if you plan to diff or update later.

🪞 Step 4: Diffing Two Trees

Now we recursively walk two Merkle trees side by side:

def merkle_diff(tree1, tree2, depth=0):
    if tree1 == tree2:
        return []
    if depth == MAX_DEPTH:
        return [(depth, tree1, tree2)]
    left_diff = merkle_diff(tree1.left, tree2.left, depth + 1)
    right_diff = merkle_diff(tree1.right, tree2.right, depth + 1)
    return left_diff + right_diff

➡️ Show how this drastically reduces work when only 1 or 2 nodes differ.

🧪 Benchmark it against the naive approach

<PK_ToDo>

✍️ Updating a Tree

Update a leaf node
Recalculate hashes on the path to the root

def update_leaf(tree, index, new_value):
    tree.leaves[index] = hash_fn(new_value)
    # propagate updates up the tree

🧩 Addendum: On Order and Position

"You're diffing ordinal position as well."

Merkle trees work best for ordered data like arrays. They don’t just detect changes in a set — they detect where the change is, based on the position of the leaf.

[b, a, c, d] and [a, b, c, d] would produce different hashes in that middle layer

In our hash functions, we just need to ensure hash(left + right) isn't symmetric — i.e., hash(A+B) ≠ hash(B+A)
This naturally happens with most cryptographic hashes (they are not commutative)

🧠 Final Thoughts

Merkle trees shine when:

Most of the data stays the same
You want to detect changes fast
Bandwidth or compute is a concern

They're not for searching, indexing, or querying values — they’re for detecting and verifying change efficiently.

Traversal

Building or updating a Merkle tree traverses from leaf to root — because each node’s hash depends on its children. The root ultimately reflects all underlying data.
- Diffing two Merkle trees traverses from root to leaf — allowing us to skip entire subtrees when hashes match. This is what enables logarithmic-time comparison in the best case.

🧠 The Problem: Efficiently Diffing Large Arrays​

❓ Goal:​

🪓 Step 1: The Naive Approach​

📉 Benchmark This​

🔁 Step 2: Let's Structure the Data Differently​

🌳 Step 3: Building the Merkle Tree​

🪞 Step 4: Diffing Two Trees​

🧪 Benchmark it against the naive approach​

✍️ Updating a Tree​

🧩 Addendum: On Order and Position​

🧠 Final Thoughts​

Traversal​