leeper · August 29, 2015 14:26
diff --git a/KVF.R b/KVF.R
 # The UNF algorithm provides a clearly defined specification for creating a data hash.
 # The algorithm has many strengths, including the fact that it is file format-independent,
 # controls for floating point and rounding errors, and is independent of the column-
 # organization of a dataset (i.e., variable order is irrelevant).
 #
 # A weakness, however, is that UNF is row-order sensitive. This is useful in terms of 
 # data subsetting (i.e., a subset of a dataset produces a different UNF than the full
 # dataset), but it produces a weakness when two identical datasets are arranged in 
 # different row orders. Here's a simple example:

 library("UNF")
 unf(mtcars[1:2,])
 # UNF6:MlIby1x5LHlZxU5UfQNgMQ== 
 unf(mtcars[2:1,])
 # UNF6:m8ra3OOjIH1RLXa7Mm31+A== 

 # Both of these two-row datasets are identical, but have simply had rows rearranged.
 # Because row order is meaningless in modern statistical software (e.g., R), this 
 # is an unnecessary feature of UNF. It might matter in software such as SPSS, SAS, or 
 # Stata that have sort-dependent operations, but such an approach is fundamentally 
 # flawed. Thus, what is needed is a data hashing algorithm that is row-independent.
 #
 # Another problem with UNF is that it ignores variable names. While this makes sense
 # for focusing on the core content of the data, it ignores a potentially vital part
 # of the metadata associated with a dataset. If variable names change or the the 
 # label applied to two variables are swapped, the UNF signatures are identical even
 # though the usability of the dataset has fundamentally changed. Consider:

 unf(mtcars)
 # UNF6:lJ2kCuaI9qFfW9XPRhy/aA==
 unf(setNames(mtcars, 1:length(mtcars)))
 # UNF6:lJ2kCuaI9qFfW9XPRhy/aA==

 # Despite completely different variable names, the UNF signatures are identical.
 #
 # Luckily, the strengths of UNF can be used to build a better algorithm. Namely:
 # 1. An improved algorithm would use the UNF specification for formatting variables.
 # 2. The column-independent of UNF, involving sorting a dataset would be preserved.
 # 3. The hashing algorithm underlying UNF would be preserved.
 #
 # The proposed hashing strategy involves a key-value store. Any dataset can be 
 # converted from a rectangular structure into a simple array of tuples, where
 # the first element of each tuple is a key and the second element is a value.
 #
 # The Key: The key would consist of two parts. The first part would be a record/row
 # indicator set by the user during the hashing process. This might be a row name, 
 # or another unique identifier in the dataset. The second part of the key would be 
 # a variable name indicator.
 # 
 # The Value: The value would the corresponding value from the dataset for that record
 # and variable. The variable values would have been pre-standardized following the 
 # existing UNF specification, so that all of the values are binary representations.
 # 
 # The Algorithm: With a dataset converted to an array of key-value pairs, the array
 # would be sorted according to a simple rule (e.g., alphabetically sort by key).
 # The final signature would then a hash of the hash for each pair.
 #
 # For example, the general approach for the following simple dataset would be:

 mtcars[1:2,1:2]
 #               mpg cyl
 # Mazda RX4      21   6
 # Mazda RX4 Wag  21   6
 keyvalue <- list("Mazda RX4/mpg" = "+21.e+", "Mazda RX4/cyl" = "+6.e+", 
                 "Mazda RX4 Wag/mpg" = "+21.e+", "Mazda RX4 Wag/cyl" = "+6.e+")
 (pairs <- unname(mapply(paste, names(keyvalue), keyvalue, sep = ":")))
 # [1] "Mazda RX4/mpg:+21.e+"     "Mazda RX4/cyl:+6.e+"      "Mazda RX4 Wag/mpg:+21.e+" "Mazda RX4 Wag/cyl:+6.e+"
 library("digest")
 library("base64enc")
 base64encode(digest(pairs, algo = "sha256", serialize=FALSE, raw=TRUE))
 # [1] "aY5Q9qR4x+ZsuCqG0chVFm8kr4VHHT8llR+CTQZgdb8="

 # Some things to consider:
 # 1. Truncation and formatting of record indicators prior to constructing keys
 # 2. Truncation and formatting of variable labels prior to constructing keys
 # 3. Specification of keys for nested structures (e.g., JSON, databases)
 # 4. Sort order of keys
 #
 # Comments welcome.
 #
 #
	# The UNF algorithm provides a clearly defined specification for creating a data hash.
	# The algorithm has many strengths, including the fact that it is file format-independent,
	# controls for floating point and rounding errors, and is independent of the column-
	# organization of a dataset (i.e., variable order is irrelevant).
	#
	# A weakness, however, is that UNF is row-order sensitive. This is useful in terms of
	# data subsetting (i.e., a subset of a dataset produces a different UNF than the full
	# dataset), but it produces a weakness when two identical datasets are arranged in
	# different row orders. Here's a simple example:

	library("UNF")
	unf(mtcars[1:2,])
	# UNF6:MlIby1x5LHlZxU5UfQNgMQ==
	unf(mtcars[2:1,])
	# UNF6:m8ra3OOjIH1RLXa7Mm31+A==

	# Both of these two-row datasets are identical, but have simply had rows rearranged.
	# Because row order is meaningless in modern statistical software (e.g., R), this
	# is an unnecessary feature of UNF. It might matter in software such as SPSS, SAS, or
	# Stata that have sort-dependent operations, but such an approach is fundamentally
	# flawed. Thus, what is needed is a data hashing algorithm that is row-independent.
	#
	# Another problem with UNF is that it ignores variable names. While this makes sense
	# for focusing on the core content of the data, it ignores a potentially vital part
	# of the metadata associated with a dataset. If variable names change or the the
	# label applied to two variables are swapped, the UNF signatures are identical even
	# though the usability of the dataset has fundamentally changed. Consider:

	unf(mtcars)
	# UNF6:lJ2kCuaI9qFfW9XPRhy/aA==
	unf(setNames(mtcars, 1:length(mtcars)))
	# UNF6:lJ2kCuaI9qFfW9XPRhy/aA==

	# Despite completely different variable names, the UNF signatures are identical.
	#
	# Luckily, the strengths of UNF can be used to build a better algorithm. Namely:
	# 1. An improved algorithm would use the UNF specification for formatting variables.
	# 2. The column-independent of UNF, involving sorting a dataset would be preserved.
	# 3. The hashing algorithm underlying UNF would be preserved.
	#
	# The proposed hashing strategy involves a key-value store. Any dataset can be
	# converted from a rectangular structure into a simple array of tuples, where
	# the first element of each tuple is a key and the second element is a value.
	#
	# The Key: The key would consist of two parts. The first part would be a record/row
	# indicator set by the user during the hashing process. This might be a row name,
	# or another unique identifier in the dataset. The second part of the key would be
	# a variable name indicator.
	#
	# The Value: The value would the corresponding value from the dataset for that record
	# and variable. The variable values would have been pre-standardized following the
	# existing UNF specification, so that all of the values are binary representations.
	#
	# The Algorithm: With a dataset converted to an array of key-value pairs, the array
	# would be sorted according to a simple rule (e.g., alphabetically sort by key).
	# The final signature would then a hash of the hash for each pair.
	#
	# For example, the general approach for the following simple dataset would be:

	mtcars[1:2,1:2]
	# mpg cyl
	# Mazda RX4 21 6
	# Mazda RX4 Wag 21 6
	keyvalue <- list("Mazda RX4/mpg" = "+21.e+", "Mazda RX4/cyl" = "+6.e+",
	"Mazda RX4 Wag/mpg" = "+21.e+", "Mazda RX4 Wag/cyl" = "+6.e+")
	(pairs <- unname(mapply(paste, names(keyvalue), keyvalue, sep = ":")))
	# [1] "Mazda RX4/mpg:+21.e+" "Mazda RX4/cyl:+6.e+" "Mazda RX4 Wag/mpg:+21.e+" "Mazda RX4 Wag/cyl:+6.e+"
	library("digest")
	library("base64enc")
	base64encode(digest(pairs, algo = "sha256", serialize=FALSE, raw=TRUE))
	# [1] "aY5Q9qR4x+ZsuCqG0chVFm8kr4VHHT8llR+CTQZgdb8="

	# Some things to consider:
	# 1. Truncation and formatting of record indicators prior to constructing keys
	# 2. Truncation and formatting of variable labels prior to constructing keys
	# 3. Specification of keys for nested structures (e.g., JSON, databases)
	# 4. Sort order of keys
	#
	# Comments welcome.
	#
	#