The diff output is more specific:
[I]f a whole block of text is moved, then all of it, rather than just the beginning and end, is detected as changed.
The algorithm described here avoids these difficulties. It detects differences that correspond very closely to our intuitive notion of difference.
Compare to Python's difflib. We get an in-line diff instead of a block-level diff. Instead of this:
-I did not have sexual relations with that woman.
+I may have had sexual relations with that woman.... you get this:
I did not have may have had sexual relations with that woman.
Linear time and space for all cases.
To compare two files, it is usually convenient to take a line as the basic unit, although other units are possible, such as word, sentence, paragraph, or character. We approach the problem by finding similarities until only differences remain. We make two observations:
-
If a line occurs only once in each file, then it must be the same line, although it may have been moved.
We use this observation to locate unaltered lines that we subsequently exclude from further treatment.
-
If a line has been found to be unaltered, and the lines immediately adjacent to it in both files are identical, then these lines must be the same line. This information can be used to find blocks of unchanged lines.
- Files:
O, the old fileN, the new file
- Symbol table:
table
- Arrays:
OANA
Each line works as the key in the table look-up, i.e. as table[line].
The symbol table stores three entries for each line:
- Counters:
OCNC
- Line reference:
OLNO
In Python:
table = {
line: {
"OC": ocurrences_of_line_in_o
"NC": ocurrences_of_line_in_n
"OLNO": line_number_in_o
}
}The value entry for each line in table has two counters. They specify the line's number of occurrences in O and N: OC and NC.
OC and NC can assume three values: 1, 2, and many.
Aside from the two counters, the line's entry also includes a reference to the line's line number in O: OLNO.
OLNO is only interesting, if OC == 1. Alternatively, OLNO would have to assume multiple values or none at all.
The arrays OA and NA have one entry for each line in their respective files, O and N. The arrays contain either:
- a pointer to the line's symbol table entry,
table[line] - the line's number in the other file (
NforOA,OforNA)
... and some code to determine which. (See Third Pass part c.)
In other words, you would have for each array:
-
OA- one entry for each line of file
Ocontaining either- a pointer to
table[line] - the line's number in file
N
- a pointer to
- one entry for each line of file
-
NA- one entry for each line of file
Ncontaining either- a pointer to
table[line] - the line's number in file
O
- a pointer to
- one entry for each line of file
In Python:
OA = []
NA = []The algorithm has six passes:
- a. Each line
iof fileNis read in sequence - b. An entry for each line
iis created in the table, if it doesn't already exist - c.
NCfor the line'stableentry is incremented - d.
NA[i]is set to point to thetableentry of linei
a. each line
iof fileNis read in sequence
i = 0
for line in N:b. An entry for each line
iis created in the table, if it doesn't already exist
if not line in table:
table[line] = {}
# (...)
# else:
# (...)c.
NCfor the line'stableentry is incremented
if not "NC" in table[line]:
table[line]["NC"] = 1
else:
if table[line]["NC"] == 1:
table[line]["NC"] = 2 # Why not just have 1 and "many"?
else:
table[line]["NC"] = "many"d.
NA[i]is set to point to thetableentry of linei
NA[i] = table[line]
i += 1All steps combined:
i = 0
for line in N:
if not line in table:
table[line] = {}
table[line]["NC"] = 1
else:
if table[line]["NC"] == 1:
table[line]["NC"] = 2
else:
table[line]["NC"] = "many"
NA[i] = table[line]
i += 1Similar to first pass, except it acts on files
OOAOCOLNOwhich is set to the line's number
O
j = 0
for line in O:
OA
OA[j] = table[line]
j += 1
OC
if not "OC" in table[line]:
table[line]["OC"] = 1
else:
if table[line]["OC"] == 1:
table[line]["OC"] = 2
else:
table[line]["OC"] = "many"
OLNO
# for line in O:
table[line]["OLNO"] = jAll steps combined:
j = 0
for line in O:
if not line in table:
table[line] = {}
table[line]["OC"] = 1
else:
if not "OC" in table[line]:
table[line]["OC"] = 1
elif table[line]["OC"] == 1:
table[line]["OC"] = 2
else:
table[line]["OC"] = "many"
table[line]["OLNO"] = j # Gets overwritten with multiple occurrences.
# Check to see if this is the intended implementation.
# Maybe only relevant for "OC" == "NC" == 1
OA[j] = table[line]
j += 1-
a. We use Observation 1:
If a line occurs only once in each file, then it must be the same line, although it may have been moved.
We use this observation to locate unaltered lines that we subsequently exclude from further treatment.
-
b. Using this, we only process the lines where
OC == NC == 1. -
c. As the lines between
OandN"must be the same line, although it may have been moved", we alter thetablepointers inOAandNAto the number of the line in the other file. -
d. We also locate unique virtual lines
- immediately before the first and
- immediately after the last
lines of the files (???)
b. Using this, we only process the lines where
OC == NC == 1.
if "OC" in NA[i] and "NC" in NA[i]:
if NA[i]["OC"] == NA[i]["NC"] == 1:c. As the lines between
OandN"must be the same line, although it may have been moved", we alter thetablepointers inOAandNAto the number of the line in the other file.
"""cf. the paper's ex. description of Pass 3."""
olno = NA[i]["OLNO"]
NA[i] = olno
OA[olno] = id. "Find" unique virtual lines immediately before the first and immediately after the last lines of the files (???)
All steps combined:
i = 0
for i in NA:
if "OC" in NA[i] and "NC" in NA[i]:
if NA[i]["OC"] == NA[i]["NC"] == 1:
olno = NA[i]["OLNO"]
NA[i], OA[olno] = olno, i
i += 1-
a. We use Observation 2:
If a line has been found to be unaltered, and the lines immediately adjacent to it in both files are identical, then these lines must be the same line. This information can be used to find blocks of unchanged lines.
-
b. Using this, we process each entry in ascending order.
-
c. If
NA[i]points toOA[j], andNA[i+1]andOA[j+1]contain identicaltableentry pointers
then
OA[j+1]is set to linei+1, andNA[i+1]is set to linej+1
b. Using this, we process each entry in ascending order.
# ascending
for i, j in NA, OA:c. If (...) then (...)
if NA[i] == OA[j] and NA[i+1] == OA[j+1]:
OA[j+1] = table[O[i+1]]
NA[i+1] = table[N[j+1]]Similar to fourth pass, except:
- It processes each entry in descending order
- It uses
j-1andi-1instead ofj+1andi+1
It processes each entry in descending order
# descending
for i, j in NA, OA:It uses
j-1andi-1instead ofj+1andi+1
if NA[i] == OA[j] and NA[i-1] == OA[j-1]:
OA[j-1] = table[O[i-1]]
NA[i-1] = table[N[j-1]]At this point following our five passes, we have the necessary information contained in NA to tell the differences between O and N.
This pass uses NA and OA to tell when a line has changed between O and N, and how far the change extends.
Recall our initial description of NA in which we said that the array has either:
one entry for each line of file
Ncontaining either
- a pointer to
table[line]- the line's number in file
O
Using these two cases, we know that if NA[i] refers to an entry in table (case 1), then line i must be new.
We know this, because otherwise, NA[i] would have contained the line's number in O (case 2), if it existed in O and N.
We now know that we are dealing with a new line, but we have yet to figure where the change ends.
Recall Observation 2:
If a line has been found to be unaltered, and the lines immediately adjacent to it in both files are identical, then these lines must be the same line. This information can be used to find blocks of unchanged lines.
If NA[i] points to OA[j], but NA[i+1] does not point to OA[j+1], then line i is the boundary for the alteration.
You can look at it this way:
i : The quick brown fox | j : The quick brown fox
i+1: jumps over the lazy dog | j+1: jumps over the loafing cat
Here, NA[i] == OA[j], but NA[i+1] != OA[j+1]. This means our boundary is between the two lines.
At this point, we know all we need to know and have all the relevant logic in place to output the diff(erence) between O and N.
Hi,
Thanks for putting this gist up.
I was wondering what your interpretation of Pass 3 of Heckel's diff algorithm is wrt. "virtual lines".
In his paper he talks about locating "virtual lines" before the first and after the last. Maybe he means putting the begin and end lines in there?
It looks like that can be ignored, but I was just curious if I was missing something?