Last active
August 29, 2015 14:26
-
-
Save leeper/e8bfb6200798d565e484 to your computer and use it in GitHub Desktop.
A key-value alternative to UNF
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# The UNF algorithm provides a clearly defined specification for creating a data hash. | |
# The algorithm has many strengths, including the fact that it is file format-independent, | |
# controls for floating point and rounding errors, and is independent of the column- | |
# organization of a dataset (i.e., variable order is irrelevant). | |
# | |
# A weakness, however, is that UNF is row-order sensitive. This is useful in terms of | |
# data subsetting (i.e., a subset of a dataset produces a different UNF than the full | |
# dataset), but it produces a weakness when two identical datasets are arranged in | |
# different row orders. Here's a simple example: | |
library("UNF") | |
unf(mtcars[1:2,]) | |
# UNF6:MlIby1x5LHlZxU5UfQNgMQ== | |
unf(mtcars[2:1,]) | |
# UNF6:m8ra3OOjIH1RLXa7Mm31+A== | |
# Both of these two-row datasets are identical, but have simply had rows rearranged. | |
# Because row order is meaningless in modern statistical software (e.g., R), this | |
# is an unnecessary feature of UNF. It might matter in software such as SPSS, SAS, or | |
# Stata that have sort-dependent operations, but such an approach is fundamentally | |
# flawed. Thus, what is needed is a data hashing algorithm that is row-independent. | |
# | |
# Another problem with UNF is that it ignores variable names. While this makes sense | |
# for focusing on the core content of the data, it ignores a potentially vital part | |
# of the metadata associated with a dataset. If variable names change or the the | |
# label applied to two variables are swapped, the UNF signatures are identical even | |
# though the usability of the dataset has fundamentally changed. Consider: | |
unf(mtcars) | |
# UNF6:lJ2kCuaI9qFfW9XPRhy/aA== | |
unf(setNames(mtcars, 1:length(mtcars))) | |
# UNF6:lJ2kCuaI9qFfW9XPRhy/aA== | |
# Despite completely different variable names, the UNF signatures are identical. | |
# | |
# Luckily, the strengths of UNF can be used to build a better algorithm. Namely: | |
# 1. An improved algorithm would use the UNF specification for formatting variables. | |
# 2. The column-independent of UNF, involving sorting a dataset would be preserved. | |
# 3. The hashing algorithm underlying UNF would be preserved. | |
# | |
# The proposed hashing strategy involves a key-value store. Any dataset can be | |
# converted from a rectangular structure into a simple array of tuples, where | |
# the first element of each tuple is a key and the second element is a value. | |
# | |
# The Key: The key would consist of two parts. The first part would be a record/row | |
# indicator set by the user during the hashing process. This might be a row name, | |
# or another unique identifier in the dataset. The second part of the key would be | |
# a variable name indicator. | |
# | |
# The Value: The value would the corresponding value from the dataset for that record | |
# and variable. The variable values would have been pre-standardized following the | |
# existing UNF specification, so that all of the values are binary representations. | |
# | |
# The Algorithm: With a dataset converted to an array of key-value pairs, the array | |
# would be sorted according to a simple rule (e.g., alphabetically sort by key). | |
# The final signature would then a hash of the hash for each pair. | |
# | |
# For example, the general approach for the following simple dataset would be: | |
mtcars[1:2,1:2] | |
# mpg cyl | |
# Mazda RX4 21 6 | |
# Mazda RX4 Wag 21 6 | |
keyvalue <- list("Mazda RX4/mpg" = "+21.e+", "Mazda RX4/cyl" = "+6.e+", | |
"Mazda RX4 Wag/mpg" = "+21.e+", "Mazda RX4 Wag/cyl" = "+6.e+") | |
(pairs <- unname(mapply(paste, names(keyvalue), keyvalue, sep = ":"))) | |
# [1] "Mazda RX4/mpg:+21.e+" "Mazda RX4/cyl:+6.e+" "Mazda RX4 Wag/mpg:+21.e+" "Mazda RX4 Wag/cyl:+6.e+" | |
library("digest") | |
library("base64enc") | |
base64encode(digest(pairs, algo = "sha256", serialize=FALSE, raw=TRUE)) | |
# [1] "aY5Q9qR4x+ZsuCqG0chVFm8kr4VHHT8llR+CTQZgdb8=" | |
# Some things to consider: | |
# 1. Truncation and formatting of record indicators prior to constructing keys | |
# 2. Truncation and formatting of variable labels prior to constructing keys | |
# 3. Specification of keys for nested structures (e.g., JSON, databases) | |
# 4. Sort order of keys | |
# | |
# Comments welcome. | |
# | |
# |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment