Reports

I reviewed the code in jgit and came up with this. I have also written a couple of posts on this on my blog, and a C99 implementation on github.

Histogram diff algorithm

Compare files A and B via histogram diff algorithm:

A range is a sequence of zero or more consecutive lines in a file. A region is a range of file A together with a range of file B. A matching region is one with the same number of lines in the A and B ranges, with each line in the A range matching the corresponding line in the B range.

Begin with a region comprising all of file A and file B.

The idea is to find the "best" matching region within the current region, then recursively do the same with the regions before and after the matching region. The best matching region is the longest one with the lowest low count of A lines.

Eventually, the region under consideration will have no matching lines, and is therefore a difference.

If the region has no lines in its A range, the difference is an insertion of the B range.
If it has no lines in its B range, the difference is a deletion of the A range.
If it has lines in both its A and B ranges, the difference is a change.

In the following, consider only lines in the current region. That is, "lines in A" mean lines in the A range of the current region, unless otherwise noted, and similarly for B.

For each line in A, count how many times that line occurs in A.

Set lowcount initially to 65 (in jgit; somewhat arbitrary to limit time in pathological cases; I use 512).

Start finding matching regions with the first line in B, and process the B lines as follows.

Find_best_matching_region:

Take each line in A (from low to high line number) that matches the current line from B. If there is no matching line, or the occurrence count of this A line exceeds lowcount, go to the next line in B.

If no line in B matches any line in A, or all the A lines have counts over lowcount, the region is a difference, and we're done with the current region.

Having a matching line in A and B, widen the match by finding the longest matching region containing the original match. Note the line in A (of the matching region) that has the fewest occurrence count, and call this count the region lowcount. If this matching region is the first one, note its length and region lowcount, and call it the best matching region (so far). If it's not the first matching region found, then if it has more lines OR lower region lowcount than the current best matching region, this becomes the new best matching region. Keep the lowest region lowcount in lowcount. Skip any remaining matching lines in A that fall within the matching region just found. If any more lines in A (after the matching region) match the B line, repeat the process. Skip the remaining B lines in the matching region and go to Find_best_matching_region to continue with the B line that follows the current matching region.

After all B lines have been considered, the best matching region is used to split the original region into before-and-after regions, and the process continues until all the differences have been found.

Pseudocode:

push region(all of file A and file B) on region stack.

While region stack is non-empty:
    pop region stack to current_region
    for each B line, find all matching A lines
                        and count how many times this line occurs in A
    best_match = find_best_matching_region_in(current_region)
    if best_match is empty region:
        append current_region to diff_list
    else:
        after_match = lines in current_region after best_match
        if after_match non-empty: push after_match on region stack
        before_match = lines in current_region before best_match
        if before_match non-empty: push before_match on region stack

def find_best_matching_region_in(current_region):
    set best_match as an empty region
    set lowcount = 65 // arbitrary limit
    set i to first B line
    set nexti to i+1
    for (i = first B line; i <= last B line; i = nexti):
        nexti = i + 1
        if no line in A matches line i in B: continue
        find first line in A matching line i in B
        if count of A occurrences > lowcount: continue
        set j to this matching A line
        loop:   // consider each matching A line
            set current_match to region(j in A, i in B)
            widen current_match to include consecutive matching lines
                    j-1 and i-1, j+1 and i+1, etc.
                    to largest matching region inside current_region
            set region_lowcount to lowest occurrence count of
                    any line of A in current_match
            // Compare current_match to best_match
            if best_match is empty
                OR current_match is longer than best_match
                OR region_lowcount is less than lowcount:
                    set best_match to current_match and
                    set lowcount to region_lowcount
            set nexti to B line following current_match
            if another line in A (within current region) matches i in B:
                set j to that line in A
            else:
                break loop
        end loop    // loop on A lines matching i in B
    return best_match

79400763

Histogram diff algorithm

Find_best_matching_region:

Pseudocode: