
finish implementation

pull/1/merge
Luke VanderHart 6 years ago
parent
commit
2d18c4f43c
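  1. README.md (106 lines changed)
  2. project.clj (7 lines changed)
  3. src/valuehash/api.clj (67 lines changed)
  4. src/valuehash/impl.clj (111 lines changed)
  5. src/valuehash/specs.clj (25 lines changed)
  6. test/valuehash/api_test.clj (67 lines changed)
  7. test/valuehash/bench.clj (57 lines changed)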

106
README.md

@@ -1,21 +1,115 @@
-# identihash
+# Valuehash
A Clojure library that provides higher-bit hashes of arbitrary
-Clojure data structures, which respect Clojure's identity semantics. That is, if
+Clojure data structures, which respect Clojure's value semantics. That is, if
two objects are `clojure.core/=`, they will have the same hash value. To my
knowledge, no other Clojure data hashing libraries make this guarantee.
The protocol is extensible to arbitrary data types and can work with any hash
-function that can take a byte array.
+function.
Although the library uses byte streams as an intermediate format, it does not
tag types, or perform any optimization or compaction of the byte stream. Therefore it should _not_ be used as a serialization library. Use
[Fressian](https://github.com/clojure/data.fressian),
[Transit](https://github.com/cognitect/transit-clj) or something similar
instead.
## Usage
-TODO
To get an MD5 hash of any Clojure object, do this:
```
(valuehash.api/md5 {:hello "world"})
=> #object["[B" 0x30cb9804 "[B@30cb9804"]
```
To get the hexadecimal string version, do this:
```
(valuehash.api/md5-str {:hello "world"})
=> "d3d7ccf8b8c217f3b52dc08929eabab8"
```
Also provided are `sha-1`, `sha-1-str`, `sha-256` and `sha-256-str`, which do
what they say on the tin.
### Custom hash functions
If you want a hash function other than md5, sha-1 or sha-256, you can obtain a
digest function for any algorithm supported by `java.security.MessageDigest` in
your JVM.
Obtain the digest function using the `messagedigest-fn` function, then pass it
and the object to be hashed to `digest`.
If you wish to obtain a hexadecimal string of the result, call the `hex-str` function on the result. The example below assumes `valuehash.api` is aliased as `h`.
```clojure
(h/hex-str (h/digest (h/messagedigest-fn "MD2") {:hello "World"}))
=> "81c9637d9fcb071a486eeb0c76dce1f6"
```
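As a rough illustration of what a `MessageDigest`-backed digest function does with the stream it receives, the sketch below feeds an `InputStream` through a digest in fixed-size chunks. It uses only the JDK, not this library; `stream-digest` and `bytes->hex` are hypothetical names chosen for the example.

```clojure
(import '[java.security MessageDigest]
        '[java.io InputStream ByteArrayInputStream])

(defn stream-digest
  "Hypothetical helper: return a function that reads an InputStream to the end
  and returns the digest bytes for the named algorithm."
  [^String algorithm]
  (fn [^InputStream is]
    (let [md  (MessageDigest/getInstance algorithm)
          buf (byte-array 1024)]
      (loop []
        (let [n (.read is buf)]
          ;; .read returns -1 once the stream is exhausted
          (when-not (neg? n)
            (.update md buf 0 n)
            (recur))))
      (.digest md))))

(defn bytes->hex
  "Hypothetical helper: lowercase hexadecimal rendering of a byte array."
  [^bytes ba]
  (apply str (map #(format "%02x" %) ba)))

(def sha-256-of-stream (stream-digest "SHA-256"))

(bytes->hex (sha-256-of-stream (ByteArrayInputStream. (.getBytes "abc" "UTF-8"))))
;; => "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
```

Any algorithm name accepted by `MessageDigest/getInstance` on your JVM can be passed in the same way.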
### Even more custom hash functions
If nothing in `java.security.MessageDigest` meets your needs, you can supply
your own digest function to `valuehash.api/digest`. This may be any function
which takes a `java.io.InputStream` and returns a byte array.
For example, the following defines and uses a valid but terrible hash function:
```clojure
(defn lazyhash [is]
  ;; chosen by a fair dice roll, guaranteed to be a random oracle
  (byte-array [(byte 4)]))
(h/digest lazyhash {:hello "world"})
```
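A slightly more useful custom digest function could look like the sketch below, which computes a CRC-32 checksum of the stream with `java.util.zip.CRC32` and returns it as a 4-byte array. `crc32-digest` is a hypothetical name, and CRC-32 is not a cryptographic hash; it is used here only to show the required shape (an `InputStream` in, a byte array out).

```clojure
(import '[java.io InputStream ByteArrayInputStream]
        '[java.nio ByteBuffer]
        '[java.util.zip CRC32])

(defn crc32-digest
  "Hypothetical digest function: CRC-32 of the stream as 4 big-endian bytes."
  [^InputStream is]
  (let [crc (CRC32.)
        buf (byte-array 1024)]
    (loop []
      (let [n (.read is buf)]
        (when-not (neg? n)
          (.update crc buf 0 n)
          (recur))))
    ;; CRC32 reports an unsigned 32-bit value in a long; keep the low 4 bytes.
    (-> (ByteBuffer/allocate 4)
        (.putInt (unchecked-int (.getValue crc)))
        (.array))))

;; Called directly on a stream here; with the library loaded it would be
;; passed to digest in the same way as lazyhash above.
(map #(bit-and % 0xff)
     (crc32-digest (ByteArrayInputStream. (.getBytes "123456789" "US-ASCII"))))
;; => (203 244 57 38), i.e. 0xCBF43926, the standard CRC-32 check value
```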
## Semantics
This does not combine hashes: it converts the entire input data to binary data,
and hashes that. As such, it is suitable for cryptographic applications when
used with an appropriate hash function.
The binary data supplied to the hash function matches Clojure's equality
semantics. That is, objects that are semantically `clojure.core/=` will have the
same binary representation.
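This means that, for values that are `clojure.core/=`:
- All list types (lists, vectors, seqs and other `java.util.List` implementations) are encoded the same
- All set types are encoded the same
- All integer types (`Byte`, `Integer`, `Long`) are encoded the same
- All floating-point types (`Float`, `Double`) are encoded the same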
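To make the list above concrete, here is a small sketch that uses only `clojure.core` and the JDK, not this library. It shows values of different concrete types that are `clojure.core/=`, and one way to give equal integers identical bytes: routing every integer through `BigInteger`, which is also what `valuehash.impl/long->bytes` does. `int->bytes` is a hypothetical name.

```clojure
(defn int->bytes
  "Hypothetical helper: encode any integer type as the two's-complement bytes
  of its value, discarding the width of the original type."
  ^bytes [n]
  (.toByteArray (biginteger n)))

;; A Byte, an Integer and a Long holding the same value are clojure.core/= ...
(= (byte 5) (int 5) 5)
;; => true

;; ... and they produce the same bytes.
(java.util.Arrays/equals (int->bytes (byte 5)) (int->bytes 5))
;; => true

;; A vector and a list with the same elements are also clojure.core/=, so a
;; value-based encoding must not record which of the two it was given.
(= [1 2 3] '(1 2 3))
;; => true
```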
The system does take some steps to rule out common types of "collisions", where two unequal objects have the same binary representation (and therefore the same hash). It injects "separator" bytes in collections, so that (for example) the binary representation of `["ab" "c"]` is not equal to `["a" "bc"]`.
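The collision that the separator bytes guard against can be sketched without the library. Below, `concat-bytes` and `encode-strings` are hypothetical helpers: naive concatenation gives both vectors the bytes of `"abc"`, while interposing a separator byte (42 here, the value `valuehash.impl` uses for lists) keeps them distinct.

```clojure
(defn concat-bytes
  "Hypothetical helper: concatenate a seq of byte arrays into one byte array."
  ^bytes [arrays]
  (byte-array (mapcat seq arrays)))

(defn encode-strings
  "Hypothetical helper: encode a seq of strings, interposing the separator
  byte `sep` between elements when it is non-nil."
  ^bytes [strings sep]
  (let [parts (map #(.getBytes ^String % "UTF-8") strings)]
    (concat-bytes (if sep
                    (interpose (byte-array [(byte sep)]) parts)
                    parts))))

;; Without a separator, both inputs collapse to the bytes of "abc".
(java.util.Arrays/equals (encode-strings ["ab" "c"] nil)
                         (encode-strings ["a" "bc"] nil))
;; => true

;; With a separator, the element boundaries become part of the encoding.
(java.util.Arrays/equals (encode-strings ["ab" "c"] 42)
                         (encode-strings ["a" "bc"] 42))
;; => false
```

A single separator byte only rules out the simple cases: in this sketch `["a*b"]` and `["a" "b"]` still encode identically, because `*` is itself byte 42.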
## Supported Types
By default, Clojure's native types are supported: as a rule of thumb, if it can be printed to EDN by the default printer, it can be hashed with no fuss.
If you want to extend the system to hash arbitrary values, you can extend the `valuehash.impl/CanonicalByteArray` protocol to any type of your choosing.
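Extending the protocol might look like the sketch below. So that the snippet runs on its own, it declares a local stand-in protocol with the same shape as `valuehash.impl/CanonicalByteArray` (a single `to-byte-array` method); with the library on the classpath you would extend the real protocol instead. The choice of `java.time.Instant`, and of encoding it through its ISO-8601 string form, are assumptions made for illustration.

```clojure
;; Local stand-in mirroring valuehash.impl/CanonicalByteArray; this is not the
;; library's own protocol.
(defprotocol CanonicalByteArray
  (to-byte-array [this] "Convert an object to a canonical byte array"))

(extend-protocol CanonicalByteArray
  String
  (to-byte-array [this] (.getBytes this "UTF-8"))

  ;; Equal Instants print identically, so equal values get equal bytes.
  java.time.Instant
  (to-byte-array [this] (to-byte-array (str this))))

(def t (java.time.Instant/parse "2016-11-01T00:00:00Z"))

(String. ^bytes (to-byte-array t) "UTF-8")
;; => "2016-11-01T00:00:00Z"
```

Whatever encoding you choose, the property to preserve is the one stated above: two objects that are `clojure.core/=` must produce the same bytes.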
## Performance
On my MacBook Pro, this library can determine the MD5 hash of small vectors
(0-10 elements) at a rate of about 22,000 hashed objects per second.
Larger, more complex nested objects are significantly slower, at a rate of around
2,600 per second for objects generated by
`(clojure.test.check.generators/sample-seq gen/any-printable 100)`.
To run your own benchmarks, check out the `valuehash.bench` namespace in the
`test` directory.
The current implementation is known to be somewhat naive, as it is
single-threaded and performs lots of redundant array copying. If you have ideas for
how to make this faster, please see the `valuehash.impl` namespace and
re-implement/replace `CanonicalByteArray`, then submit a pull request with your
alternative impl in a separate namespace, with comparative benchmarks attached.
## License
Copyright © 2016 Luke VanderHart
Distributed under the Eclipse Public License either version 1.0 or (at
your option) any later version.

7
project.clj

@@ -1,6 +1,7 @@
-(defproject arachne-framework/identihash "0.1.0-SNAPSHOT"
-  :description "Identity based hashing for Clojure data"
+(defproject arachne-framework/valuehash "0.1.0-SNAPSHOT"
+  :description "Value based hashing for Clojure data"
   :license {:name "Eclipse Public License"
             :url "http://www.eclipse.org/legal/epl-v10.html"}
   :dependencies [[org.clojure/clojure "1.9.0-alpha14"]
-                 [org.clojure/test.check "0.9.0" :scope "test"]])
+                 [org.clojure/test.check "0.9.0" :scope "test"]
+                 [criterium "0.4.4" :scope "test"]])

67
src/valuehash/api.clj

@@ -0,0 +1,67 @@
(ns valuehash.api
  (:require [valuehash.impl :as impl]
            [valuehash.specs])
  (:import [java.security MessageDigest]
           [java.io InputStream ByteArrayInputStream]))

(defn digest
  "Given a digest function and an arbitrary Clojure object, return a byte array
  representing the digest of the object.

  The digest function must take an InputStream as its argument, and return a
  byte array."
  ^bytes [digest-fn obj]
  (digest-fn (ByteArrayInputStream. (impl/to-byte-array obj))))

(defn- consume
  "Fully consume the specified input stream, using the supplied MessageDigest
  object."
  [^MessageDigest digest ^InputStream is]
  (let [buf (byte-array 64)]
    (loop []
      (let [read (.read is buf)]
        (when (<= 0 read)
          (.update digest buf 0 read)
          (recur))))
    (.digest digest)))

(defn messagedigest-fn
  "Return a digest function using java.security.MessageDigest, using the specified algorithm"
  [algorithm]
  (fn [is]
    (consume (MessageDigest/getInstance algorithm) is)))

(defn hex-str
  "Return the hexadecimal string representation of a byte array"
  [ba]
  (apply str (map #(format "%02x" %) ba)))

(defn md5
  "Return the MD5 digest of an arbitrary Clojure data structure"
  [obj]
  (digest (messagedigest-fn "MD5") obj))

(defn md5-str
  "Return the MD5 digest of an arbitrary Clojure data structure, as a string"
  [obj]
  (hex-str (md5 obj)))

(defn sha-1
  "Return the SHA-1 digest of an arbitrary Clojure data structure"
  [obj]
  (digest (messagedigest-fn "SHA-1") obj))

(defn sha-1-str
  "Return the SHA-1 digest of an arbitrary Clojure data structure, as a string"
  [obj]
  (hex-str (sha-1 obj)))

(defn sha-256
  "Return the SHA-256 digest of an arbitrary Clojure data structure"
  [obj]
  (digest (messagedigest-fn "SHA-256") obj))

(defn sha-256-str
  "Return the SHA-256 digest of an arbitrary Clojure data structure, as a string"
  [obj]
  (hex-str (sha-256 obj)))

111
src/valuehash/impl.clj

@@ -0,0 +1,111 @@
(ns valuehash.impl
  "Simple implementation based on plain byte arrays"
  (:import [java.util UUID Date]))

(defprotocol CanonicalByteArray
  "An object that can be converted to a canonical byte array, with value
  semantics intact (that is, two objects that are clojure.core/= will always
  have the same binary representation)"
  (to-byte-array [this] "Convert an object to a canonical byte array"))

(defn- ba-comparator
  "Comparator function for byte arrays"
  ^long [^bytes a ^bytes b]
  (let [alen (alength a)
        blen (alength b)]
    (if (not= alen blen)
      (- alen blen)
      ; compare backward, since lots of symbols/keywords have a common prefix
      (loop [i (dec alen)]
        (if (< i 0)
          0
          (let [c (- (aget a i) (aget b i))]
            (if (= 0 c)
              (recur (dec i))
              c)))))))

(defn long->bytes
  "Convert a long value to a byte array"
  [val]
  (.toByteArray (biginteger val)))

(defn- join-byte-arrays
  "Copy multiple byte arrays to a single output byte array in the order they
  are given."
  [arrays]
  (let [dest (byte-array (+ (reduce + (map alength arrays))))]
    (loop [offset 0
           [^bytes src & more] arrays]
      (when src
        (let [srclen (alength src)]
          (System/arraycopy src 0 dest offset srclen)
          (recur (+ offset srclen) more))))
    dest))

;; Primitive values
(extend-protocol CanonicalByteArray
  nil
  (to-byte-array [_] (byte-array 1 (byte 0)))
  String
  (to-byte-array [this] (.getBytes ^String this))
  clojure.lang.Keyword
  (to-byte-array [this] (.getBytes (str this)))
  clojure.lang.Symbol
  (to-byte-array [this] (.getBytes (str this)))
  Byte
  (to-byte-array [this] (long->bytes this))
  Integer
  (to-byte-array [this] (long->bytes this))
  Long
  (to-byte-array [this] (long->bytes this))
  Double
  (to-byte-array [this] (long->bytes (Double/doubleToLongBits this)))
  Float
  (to-byte-array [this] (long->bytes (Double/doubleToLongBits this)))
  clojure.lang.Ratio
  (to-byte-array [this] (long->bytes (Double/doubleToLongBits (double this))))
  Boolean
  (to-byte-array [this] (byte-array 1 (if this (byte 1) (byte 0))))
  Character
  (to-byte-array [this] (.getBytes (str this)))
  UUID
  (to-byte-array [this]
    (join-byte-arrays [(long->bytes (.getMostSignificantBits ^UUID this))
                       (long->bytes (.getLeastSignificantBits ^UUID this))]))
  Date
  (to-byte-array [this]
    (long->bytes (.getTime this))))

(def list-sep (byte-array 1 (byte 42)))
(def set-sep (byte-array 1 (byte 21)))
(def map-sep (byte-array 1 (byte 19)))

(defn- map-entry->byte-array
  [map-entry]
  (join-byte-arrays [(to-byte-array (.getKey map-entry))
                     (to-byte-array (.getValue map-entry))]))

;; Collections
(extend-protocol CanonicalByteArray
  java.util.List
  (to-byte-array [this]
    (->> this
         (map to-byte-array)
         (interpose list-sep)
         (cons list-sep)
         (join-byte-arrays)))
  java.util.Set
  (to-byte-array [this]
    (->> this
         (map to-byte-array)
         (sort ba-comparator)
         (interpose set-sep)
         (cons set-sep)
         (join-byte-arrays)))
  java.util.Map
  (to-byte-array [this]
    (->> this
         (map map-entry->byte-array)
         (sort ba-comparator)
         (cons map-sep)
         (join-byte-arrays))))

25
src/valuehash/specs.clj

@@ -0,0 +1,25 @@
(ns valuehash.specs
  (:require [clojure.spec :as s]))

(def byte-array-class (class (byte-array 0)))
(defn byte-array? [obj] (instance? byte-array-class obj))
(defn input-stream? [obj] (instance? java.io.InputStream obj))

(s/def ::digest-fn
  (s/fspec
    :args (s/cat :input-stream input-stream?)
    :ret byte-array?))

(s/fdef valuehash.api/digest
  :args (s/cat :digest-fn ::digest-fn, :obj any?)
  :ret byte-array?)

;; the spec must name the function exactly as it is defined in valuehash.api
(s/fdef valuehash.api/messagedigest-fn
  :args (s/cat :algorithm string?)
  :ret ::digest-fn)

(s/fdef valuehash.api/hex-str
  :args (s/cat :byte-array byte-array?)
  :ret string?)

67
test/valuehash/api_test.clj

@@ -0,0 +1,67 @@
(ns valuehash.api-test
  (:require [clojure.test.check.generators :as gen]
            [clojure.test.check.properties :as prop]
            [clojure.test.check.clojure-test :refer [defspec]]
            [valuehash.api :as api]))

(defprotocol Perturbable
  "A value that can be converted to a value of a different type, but still be equal"
  (perturb [obj] "Convert an object to a different but equal object"))

(defn select
  "Deterministically select one of the options (based on the hash of the key)"
  [key & options]
  (nth options (mod (hash key) (count options))))

(extend-protocol Perturbable
  Object
  (perturb [obj] obj)
  Long
  (perturb [l]
    (select l
      (if (< Byte/MIN_VALUE l Byte/MAX_VALUE) (byte l) l)
      (if (< Integer/MIN_VALUE l Integer/MAX_VALUE) (int l) l)))
  Double
  (perturb [d]
    (if (= d (unchecked-float d))
      (unchecked-float d)
      d))
  java.util.Map
  (perturb [obj]
    (let [keyvals (interleave (reverse (keys obj))
                              (reverse (map perturb (vals obj))))]
      (select obj
        (apply array-map keyvals)
        (apply hash-map keyvals)
        (java.util.HashMap. (apply array-map keyvals)))))
  java.util.List
  (perturb [obj]
    (let [l (map perturb obj)]
      (select obj
        (lazy-seq l)
        (apply vector l)
        (apply list l)
        (java.util.ArrayList. l)
        (java.util.LinkedList. l))))
  java.util.Set
  (perturb [obj]
    (let [s (reverse (map perturb obj))]
      (select obj
        (apply hash-set s)
        (java.util.HashSet. s)
        (java.util.LinkedHashSet. s)))))

(defspec value-semantics-hold 150
  (prop/for-all [o gen/any-printable]
    (let [p (perturb o)]
      ;; all three comparisons must hold; without `and`, only the last
      ;; expression would determine the result of the property
      (and (= (api/md5-str o) (api/md5-str p))
           (= (api/sha-1-str o) (api/sha-1-str p))
           (= (api/sha-256-str o) (api/sha-256-str p))))))

57
test/valuehash/bench.clj

@@ -0,0 +1,57 @@
(ns valuehash.bench
  (:require [valuehash.api :as api]
            [criterium.core :as c]
            [clojure.test.check.generators :as gen]
            [clojure.test.check.random :as random]
            [clojure.test.check.rose-tree :as rose]))

(defn- sample-seq
  "Return a sequence of realized values from `generator`.

  Copy of the built in `sample-seq`, but lets you pass in the seed so benchmark
  runs can be deterministic across machines."
  [generator seed max-size]
  (let [r (random/make-random seed)
        size-seq (gen/make-size-range-seq max-size)]
    (map #(rose/root (gen/call-gen generator %1 %2))
         (gen/lazy-random-states r)
         size-seq)))

(defn- do-bench
  "Benchmark the specified digest function, using the specified seq of sample data"
  [do-digest data]
  (let [data (doall data)]
    (c/with-progress-reporting
      (let [results (c/benchmark
                      (doseq [obj data]
                        (do-digest obj))
                      {})
            hps (/ (count data) (first (:mean results)))]
        (c/report-result results)
        (println "\nThis translates to about" (Math/round hps) "hashed objects per second")))))

(defn bench-small-vectors
  []
  (do-bench api/md5 (take 1000 (sample-seq (gen/vector gen/simple-type-printable) 42 10))))

(defn bench-small-maps
  []
  (do-bench api/md5 (take 1000 (sample-seq (gen/map
                                             gen/simple-type-printable
                                             gen/simple-type-printable)
                                           42 10))))

(defn bench-complex
  []
  (do-bench api/md5 (take 1000 (sample-seq gen/any-printable 42 100))))

(comment
  (bench-small-vectors)
  (bench-small-maps)
  (bench-complex)
  )