Library to generate hashes from Clojure data.
# Valuehash

[![CircleCI](https://circleci.com/gh/arachne-framework/valuehash.svg?style=svg)](https://circleci.com/gh/arachne-framework/valuehash)

A Clojure library that computes higher-bit hashes (for example MD5 or SHA-256)
of arbitrary Clojure data structures while respecting Clojure's value
semantics. That is, if two objects are `clojure.core/=`, they will have the
same hash value. To my knowledge, no other Clojure data hashing library makes
this guarantee.

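To see why this guarantee matters, consider a naive approach that hashes an object's printed form. The sketch below is not part of valuehash and uses only the JDK; `naive-md5-str` is a hypothetical helper written to show that two values can be `clojure.core/=` and still hash differently when the hash depends on the concrete type.

```clojure
(import 'java.security.MessageDigest)

(defn naive-md5-str
  "Hypothetical helper: MD5 of the printed form of x, as a lowercase hex string."
  [x]
  (let [digest (.digest (MessageDigest/getInstance "MD5")
                        (.getBytes (pr-str x) "UTF-8"))]
    (apply str (map #(format "%02x" %) digest))))

;; A vector and a list with the same elements are equal values...
(= [1 2] '(1 2))                                   ;; => true

;; ...but they print differently ("[1 2]" vs "(1 2)"), so the naive hashes differ.
(= (naive-md5-str [1 2]) (naive-md5-str '(1 2)))   ;; => false
```

A hash that respects value semantics has to agree for both of these values.
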
The protocol is extensible to arbitrary data types and can work with any hash
function.

Although the library uses byte streams as an intermediate format, it does not
tag types or perform any optimization or compaction of the byte stream.
Therefore it should _not_ be used as a serialization library. Use
[Fressian](https://github.com/clojure/data.fressian),
[Transit](https://github.com/cognitect/transit-clj) or something similar
instead.

## Usage

To get an MD5 hash of any Clojure object as a byte array, do this:

```clojure
(valuehash.api/md5 {:hello "world"})
=> #object["[B" 0x30cb9804 "[B@30cb9804"]
```

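The result printed above is a Java byte array (`[B`). JVM arrays compare by identity under `=`, so two hashes need to be compared by content. A minimal sketch using only the JDK follows; `bytes=` is a hypothetical helper name, and the arrays are stand-ins for real hash output.

```clojure
(import 'java.util.Arrays)

(defn bytes=
  "True when two byte arrays have identical contents."
  [^bytes a ^bytes b]
  (Arrays/equals a b))

;; Stand-ins for two hash results with identical contents.
(def hash-a (byte-array [1 2 3]))
(def hash-b (byte-array [1 2 3]))

(= hash-a hash-b)               ;; => false, arrays compare by identity
(bytes= hash-a hash-b)          ;; => true
(= (seq hash-a) (seq hash-b))   ;; => true, comparing as sequences also works
```
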
To get the hexadecimal string version, do this:

```clojure
(valuehash.api/md5-str {:hello "world"})
=> "d3d7ccf8b8c217f3b52dc08929eabab8"
```

Also provided are `sha-1`, `sha-1-str`, `sha-256` and `sha-256-str`, which do
what they say on the tin.

### Custom hash functions

If you want a hash function other than md5, sha-1 or sha-256, you can obtain a
digest function for any algorithm supported by `java.security.MessageDigest` in
your JVM.

Obtain the digest function using the `messagedigest-fn` function, then pass it
and the object to be hashed to `digest`. If you wish to obtain a hexadecimal
string of the result, call the `hex-str` function on the result.

```clojure
;; The examples below assume `h` is an alias for valuehash.api.
(require '[valuehash.api :as h])

(h/hex-str (h/digest (h/messagedigest-fn "MD2") {:hello "World"}))
=> "81c9637d9fcb071a486eeb0c76dce1f6"
```

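Which algorithm names are accepted depends on the security providers installed in your JVM. As a sketch that uses only the JDK (not valuehash), the following lists the `MessageDigest` algorithms available at runtime; `available-digests` is a name chosen here for illustration.

```clojure
(import 'java.security.Security)

(def available-digests
  "Sorted, upper-cased names of the MessageDigest algorithms this JVM provides."
  (->> (Security/getAlgorithms "MessageDigest")
       (map #(.toUpperCase ^String %))
       (into (sorted-set))))

;; SHA-256 is one of the algorithms every Java platform is required to support.
(contains? available-digests "SHA-256")
```

A name from this set should be usable as the argument to `messagedigest-fn`, assuming it resolves algorithms through `java.security.MessageDigest` as described above.
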
### Even more custom hash functions

If nothing in `java.security.MessageDigest` meets your needs, you can supply
your own digest function to `valuehash.api/digest`. This may be any function
which takes a `java.io.InputStream` and returns a byte array.

For example, the following defines and uses a valid but terrible hash function:

```clojure
(defn lazyhash [is]
  ;; chosen by a fair dice roll, guaranteed to be a random oracle
  (byte-array [(byte 4)]))

(h/digest lazyhash {:hello "world"})
```

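A digest function that actually reads its input might look like the sketch below. It uses `java.util.zip.CRC32` from the JDK purely as an illustration of the "`InputStream` in, byte array out" contract; `crc32-digest` is a hypothetical name, and CRC-32 is a checksum, not a cryptographic hash.

```clojure
(import '(java.io ByteArrayInputStream InputStream)
        '(java.nio ByteBuffer)
        '(java.util.zip CRC32))

(defn crc32-digest
  "Reads `is` to exhaustion and returns its CRC-32 checksum as 4 big-endian bytes."
  [^InputStream is]
  (let [crc (CRC32.)
        buf (byte-array 4096)]
    ;; Feed the stream to the checksum one chunk at a time until EOF (-1).
    (loop []
      (let [n (.read is buf)]
        (when (pos? n)
          (.update crc buf 0 n)
          (recur))))
    ;; getValue returns the checksum in the low 32 bits of a long.
    (-> (ByteBuffer/allocate 4)
        (.putInt (unchecked-int (.getValue crc)))
        (.array))))

;; Direct use on a stream, independent of valuehash:
(vec (crc32-digest (ByteArrayInputStream. (.getBytes "123456789" "US-ASCII"))))
;; => [-53 -12 57 38], i.e. 0xCBF43926, the standard CRC-32 check value
```

Given the contract described above, it could then be passed to `digest` in the same way as `lazyhash`, for example `(h/digest crc32-digest {:hello "world"})`.
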
## Semantics

This does not combine hashes: it converts the entire input data to binary data,
and hashes that. As such, it is suitable for cryptographic applications when
used with an appropriate hash function.

The binary data supplied to the hash function matches Clojure's equality
semantics. That is, objects that are semantically `clojure.core/=` will have the
same binary representation.

This means:

- All list types are encoded the same
- All set types are encoded the same
- All map types are encoded the same
- All integer numbers are encoded the same
- All floating-point numbers are encoded the same

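The groupings in this list mirror how `clojure.core/=` itself behaves. The following plain Clojure expressions (no valuehash calls) show pairs of different concrete types that are equal as values, and one pair that is not:

```clojure
;; Sequential collections: a vector and a list with the same elements
(= [1 2 3] '(1 2 3))            ;; => true

;; Sets: a hash set and a sorted set
(= #{1 2} (sorted-set 1 2))     ;; => true

;; Maps: a literal map and a sorted map
(= {:a 1} (sorted-map :a 1))    ;; => true

;; Integers: Long, Integer and BigInt
(= 1 (int 1))                   ;; => true
(= 1 1N)                        ;; => true

;; Floating point: Double and Float
(= 1.5 (float 1.5))             ;; => true

;; Integers and floating-point numbers are different categories under =
(= 1 1.0)                       ;; => false
```
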
The system does take some steps to rule out common types of "collisions", where two unequal objects have the same binary representation (and therefore the same hash). It injects "separator" bytes in collections, so that (for example) the binary representation of `["ab" "c"]` is not equal to that of `["a" "bc"]`.

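The effect of separator bytes can be sketched in a few lines. This is not valuehash's actual encoding: the helper names are hypothetical, and the separator value `0` is an assumption chosen only for the illustration.

```clojure
(defn concat-bytes
  "Naive encoding: the UTF-8 bytes of each string, back to back."
  [strs]
  (vec (mapcat #(.getBytes ^String % "UTF-8") strs)))

(defn separated-bytes
  "Same encoding, but with a separator byte (assumed to be 0 here) after each element."
  [strs]
  (vec (mapcat #(concat (.getBytes ^String % "UTF-8") [(byte 0)]) strs)))

;; Without separators, the element boundaries are lost and the encodings collide.
(= (concat-bytes ["ab" "c"]) (concat-bytes ["a" "bc"]))         ;; => true

;; With separators, the boundaries are part of the bytes, so they differ.
(= (separated-bytes ["ab" "c"]) (separated-bytes ["a" "bc"]))   ;; => false
```
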
## Supported Types

By default, Clojure's native types are supported: as a rule of thumb, if it can be printed to EDN by the default printer, it can be hashed with no fuss.

If you want to extend the system to hash arbitrary values, you can extend the `valuehash.impl/CanonicalByteArray` protocol to any object of your choosing.

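The general shape of such an extension can be sketched with a stand-in protocol. This README does not show the methods of `valuehash.impl/CanonicalByteArray`, so `CanonicalBytes` and `canonical-bytes` below are hypothetical names defined locally; check the `valuehash.impl` namespace for the real protocol before adapting this.

```clojure
(import '(java.nio ByteBuffer)
        '(java.time LocalDate))

;; Hypothetical stand-in for the library's protocol, defined here so the
;; example is self-contained.
(defprotocol CanonicalBytes
  (canonical-bytes [this] "Returns a byte array that is identical for equal values."))

(extend-protocol CanonicalBytes
  LocalDate
  (canonical-bytes [d]
    ;; Encode the date as its epoch-day count in 8 big-endian bytes, so that
    ;; equal dates always produce equal bytes.
    (-> (ByteBuffer/allocate 8)
        (.putLong (.toEpochDay d))
        (.array))))

(vec (canonical-bytes (LocalDate/ofEpochDay 1)))
;; => [0 0 0 0 0 0 0 1]
```

The key property to preserve in any real extension is the one stated under Semantics: values that are `=` must produce the same bytes.
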
## Performance

On my MacBook Pro, this library can determine the MD5 hash of small objects
(vectors of 0-10 elements) at a rate of about 22,000 hashed objects per second.

Larger, more complex nested objects slow it down significantly, to a rate of
around 2,600 per second for objects generated by
`(clojure.test.check.generators/sample-seq gen/any-printable 100)`.

To run your own benchmarks, check out the `valuehash.bench` namespace in the
`test` directory.

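For a quick, rough measurement outside that namespace, a throughput figure like the ones above can be approximated as follows. This is a sketch and not the code in `valuehash.bench`; `hashes-per-second` is a hypothetical helper, and `clojure.core/hash` stands in for whichever hash function you want to time. It does no JIT warm-up, so treat the numbers as indicative only.

```clojure
(defn hashes-per-second
  "Applies hash-fn to every element of inputs once and returns the observed rate."
  [hash-fn inputs]
  (let [inputs (vec inputs)              ; realize lazy input before timing
        start  (System/nanoTime)]
    (doseq [x inputs]
      (hash-fn x))
    (let [elapsed-seconds (/ (- (System/nanoTime) start) 1e9)]
      (/ (count inputs) elapsed-seconds))))

;; Stand-in workload: small vectors hashed with clojure.core/hash.
(hashes-per-second hash (map #(vec (range %)) (range 10)))
```

To time valuehash itself, pass a function such as `valuehash.api/md5` in place of `hash`.
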
The current implementation is known to be somewhat naive, as it is
single-threaded and performs lots of redundant array copying. If you have ideas
for how to make this faster, please see the `valuehash.impl` namespace and
re-implement/replace `CanonicalByteArray`, then submit a pull request with your
alternative impl in a separate namespace, with comparative benchmarks attached.

## License

Copyright © 2016 Luke VanderHart

Distributed under the Eclipse Public License either version 1.0 or (at
your option) any later version.