Monday, November 9, 2009

Why Fuzzy Hashing is Really Cool

For years, computer forensic investigators have put a great deal of stock in the effectiveness of MD5 hashing. To qualify that statement, I mean specifically using MD5 hashes to identify known malicious files. The key word in that sentence is “known”, but let's take that one step further and say “unmodified” known files. One minor change to a file, and the MD5 hash is completely different, rendering the investigator's search totally ineffective. So, what's the answer? Easy: fuzzy hashing.

Hash comparisons are either a yes or a no: either the hash matches, or it doesn't. But a mismatch does not mean that the files are unrelated; it only means they are not exactly the same, byte for byte. I am going to use a simple example to illustrate exactly what I am talking about.

The photograph of Oklahoma State University wide receiver Dez Bryant below was taken from, “” on November 09, 2009.

Using MD5Deep, I generated an MD5 hash for this picture:

b2cedc90072bacc43fdcc533ad4f24ad /home/cepogue/Pictures/DezBryant.jpg

Now, if you were an investigator, and you were going to search for that image of Dez based on the MD5 hash, you would only find it if the image were totally and completely identical to this original.
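To make the exact-match limitation concrete, here is a minimal Python sketch of a known-hash search. It is only an illustration and not how MD5Deep works internally; the helper names `md5_of_file` and `find_known_files` and the stand-in file contents are assumptions made for this example.

```python
import hashlib
import os
import tempfile

def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_known_files(root, known_hashes):
    """Walk `root` and return the paths whose MD5 appears in `known_hashes`."""
    matches = []
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            if md5_of_file(path) in known_hashes:
                matches.append(path)
    return sorted(matches)

if __name__ == "__main__":
    # Throwaway directory with placeholder data (not the real image).
    with tempfile.TemporaryDirectory() as root:
        original = os.path.join(root, "original.bin")
        altered = os.path.join(root, "altered.bin")
        with open(original, "wb") as handle:
            handle.write(b"stand-in file contents" * 100)
        with open(altered, "wb") as handle:
            handle.write(b"stand-in file contents" * 100 + b".")
        known = {md5_of_file(original)}
        # Only the byte-for-byte identical file is reported.
        print(find_known_files(root, known))
```

The lookup is a plain set membership test, so a file that differs by even one byte never shows up in the results.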

To show how easy it is to modify an image like this, I used Ghex to open the image and scrolled to the bottom of the content.

Note at offset 5879 (the last line), there are only four characters, which on the right translate to a blank space, a question mark, and two periods. Using Ghex, I am simply going to replace the blank space with a period.

Look at offset 5879 again in the figure above. I replaced the blank space with the period, changing that last line from "20 3F FF D9" to "2E 3F FF 2E". A very minor change. As you can see from the modified image of Dez below, there is no visible change to the image.
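The same kind of edit can be scripted rather than made by hand in a hex editor. The sketch below copies a file and overwrites a single byte in the copy. `patch_byte` is a hypothetical helper written for this example, and the demo uses a few placeholder bytes ending in 20 3F FF D9 rather than the real image. The offset is passed as a plain integer, so it is up to the caller to decide whether an offset read from a hex editor is hexadecimal or decimal.

```python
import os
import shutil
import tempfile

def patch_byte(source, destination, offset, new_value):
    """Copy `source` to `destination`, overwrite one byte, return the old byte."""
    shutil.copyfile(source, destination)
    with open(destination, "r+b") as handle:
        handle.seek(offset)
        old = handle.read(1)
        if len(old) != 1:
            raise ValueError("offset is past the end of the file")
        handle.seek(offset)
        handle.write(bytes([new_value]))
    return old[0]

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        src = os.path.join(folder, "original.bin")
        dst = os.path.join(folder, "modified.bin")
        # Placeholder content: padding followed by the four bytes quoted above.
        with open(src, "wb") as handle:
            handle.write(b"\x00" * 16 + bytes([0x20, 0x3F, 0xFF, 0xD9]))
        old = patch_byte(src, dst, 16, 0x2E)  # 0x20 (space) becomes 0x2E (period)
        print(f"replaced 0x{old:02X} with 0x2E at offset 16")
```

Working on a copy keeps the original evidence file untouched, which is why the helper never opens the source for writing.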

Again using MD5Deep, I calculated the MD5 hash of the modified image, and it is totally different from the hash of the first image.

Here is the unmodified image hash one more time:
b2cedc90072bacc43fdcc533ad4f24ad /home/cepogue/Pictures/DezBryant.jpg

Here is the modified image hash:
df3e3d942610781f9b9d0b41683c46db /home/cepogue/Pictures/DezBryant2.jpg

The hashes are not even close. So, if an investigator were performing a search for this image based on the MD5 hash, he would fail to find it.
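"Not even close" can be put into numbers by counting how many of the 128 bits differ between two MD5 digests. For a well-behaved hash, roughly half of the bits are expected to flip after any change to the input. The Python sketch below does the count for the two digests listed above and for a one-byte change in some placeholder data; `differing_bits` and `md5_hex` are helper names made up for this example.

```python
import hashlib

def differing_bits(hex_a, hex_b):
    """Count the bit positions that differ between two hex digests."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")

def md5_hex(data):
    """Return the hex MD5 digest of a bytes object."""
    return hashlib.md5(data).hexdigest()

if __name__ == "__main__":
    # The two digests from the example above.
    unmodified = "b2cedc90072bacc43fdcc533ad4f24ad"
    modified = "df3e3d942610781f9b9d0b41683c46db"
    print(differing_bits(unmodified, modified), "of 128 bits differ")

    # The same effect on placeholder data: swap a trailing space for a period.
    before = b"placeholder bytes, not the real image " * 100
    after = before[:-1] + b"."
    print(md5_hex(before))
    print(md5_hex(after))
    print(differing_bits(md5_hex(before), md5_hex(after)), "of 128 bits differ")
```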

So, if you are an investigator, you may be thinking, “Aw what?! So ALL of the hash comparisons I have been doing could have failed while the evidence was still present?”

The answer to that question is, “Yes...if the images were modified in any way...yes, they could have.” But there is hope, and that hope is called fuzzy hashing.

Since the one-to-one comparison of hash sets is obviously antiquated and inadequate, Jesse Kornblum of ManTech thought up a fantastic solution called fuzzy hashing. Using a tool called SSDEEP, you can generate hash values that can then be compared against other files, producing a percentage that indicates how closely each file matches!

Using SSDEEP, I generated a fuzzy hash of the first image of Dez with this command:

ssdeep -b DezBryant.jpg

I simply redirected the output to a file named dez.hash.

Then I used that file to compare against the second image of Dez:

root@Linux-Forensic1:/home/cepogue/Pictures# ssdeep -bm dez.hash DezBryant2.jpg
DezBryant2.jpg matches DezBryant.jpg (99)

As you can see from the output, these two images are 99% similar.
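To give a feel for why a fuzzy hash survives a small edit while MD5 does not, here is a deliberately simplified Python toy. It is not SSDEEP's actual algorithm and its scores will not match SSDEEP's. It only imitates the general idea of piecewise hashing: cut the file into pieces at positions chosen by the content itself, reduce each piece to one character, and compare the resulting strings. The window size, block size, use of MD5 for each piece, and use of `difflib` for the comparison are all arbitrary choices for this sketch.

```python
import difflib
import hashlib
import random

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
WINDOW = 7        # number of trailing bytes that decide where a piece ends
BLOCK_SIZE = 64   # roughly the average piece length (arbitrary for this toy)

def piece_char(piece):
    """Map one piece of the file to a single signature character."""
    return ALPHABET[hashlib.md5(piece).digest()[0] % len(ALPHABET)]

def toy_signature(data):
    """Cut `data` at content-defined boundaries; emit one character per piece."""
    chars = []
    start = 0
    rolling = 0
    for index, byte in enumerate(data):
        # Rolling sum over the last WINDOW bytes only.
        rolling += byte
        if index >= WINDOW:
            rolling -= data[index - WINDOW]
        # A boundary depends only on nearby bytes, so an edit stays local.
        if rolling % BLOCK_SIZE == BLOCK_SIZE - 1:
            chars.append(piece_char(data[start:index + 1]))
            start = index + 1
    if start < len(data):
        chars.append(piece_char(data[start:]))
    return "".join(chars)

def toy_similarity(sig_a, sig_b):
    """Score two signatures from 0 to 100 using difflib's matching ratio."""
    matcher = difflib.SequenceMatcher(None, sig_a, sig_b, autojunk=False)
    return round(100 * matcher.ratio())

if __name__ == "__main__":
    rng = random.Random(2009)
    original = bytes(rng.randrange(256) for _ in range(8192))  # placeholder data
    modified = bytearray(original)
    modified[-4] ^= 0x0E  # change a single byte near the end
    unrelated = bytes(rng.randrange(256) for _ in range(8192))

    sig = toy_signature(original)
    print("modified :", toy_similarity(sig, toy_signature(bytes(modified))))
    print("unrelated:", toy_similarity(sig, toy_signature(unrelated)))
```

Because a one-byte edit can only disturb the pieces around it, most of the signature stays the same, whereas the MD5 of the whole file changes completely.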

Fuzzy hashing can efficiently and effectively help investigators identify files that share a high percentage of their content. While two files may not be 100% identical, as shown by my example, that does not necessarily mean that they are not the same image. The same approach can be used with really any type of file. An investigator can then take the files with the highest similarity scores and manually review them individually.
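The review step can be sketched as a small Python script that reads match lines and keeps only the strongest candidates. It assumes the output format shown in the example above, `X matches Y (score)`; other SSDEEP versions may print something slightly different. The function names, the cutoff of 80, and every sample line except the first are hypothetical.

```python
import re

# Matches lines such as: DezBryant2.jpg matches DezBryant.jpg (99)
MATCH_LINE = re.compile(
    r"^(?P<candidate>.+) matches (?P<known>.+) \((?P<score>\d+)\)$"
)

def parse_matches(output):
    """Return (candidate, known, score) tuples from match-mode output lines."""
    results = []
    for line in output.splitlines():
        found = MATCH_LINE.match(line.strip())
        if found:
            results.append(
                (found["candidate"], found["known"], int(found["score"]))
            )
    return results

def review_queue(matches, minimum_score=80):
    """Keep matches at or above `minimum_score`, highest score first."""
    kept = [match for match in matches if match[2] >= minimum_score]
    return sorted(kept, key=lambda match: match[2], reverse=True)

if __name__ == "__main__":
    # First line is from the example above; the other two are made up.
    sample_output = "\n".join([
        "DezBryant2.jpg matches DezBryant.jpg (99)",
        "holiday.jpg matches DezBryant.jpg (35)",
        "DezBryant_copy.jpg matches DezBryant.jpg (88)",
    ])
    for candidate, known, score in review_queue(parse_matches(sample_output)):
        print(f"{score:3d}  {candidate}  (similar to {known})")
```

The cutoff is a judgment call: a lower value catches more heavily modified files at the cost of more manual review.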

SSDEEP is a free utility and can be downloaded from


  1. Chris,

    Great post! I'm sure that there are some who take this for granted, while others simply have no idea.

    The math behind each of the hashing algorithms is poorly understood by many, but as you've pointed out, a change in one BIT in a file will cause the MD5 hash to change; this is why it has been so valuable when it comes to validating the integrity of data.

    Fuzzy hashing takes it a step further by letting you get an idea of how close two files may be to each other. Like you, I've used this before...during PCI exams 8 months apart, I had two files of the same name (and they had very similar artifacts on the respective systems), but different MD5/SHA-1 hashes. Ssdeep let me know that they were 98% similar.

    As a side note, VirusTotal uses ssdeep, as well.

  2. Great point, Harlan! I hope people who read this will make the connection between the example with the images and really any other type of file. We have used fuzzy hashing extensively in malware cases across multiple franchise locations. From what we have seen, in certain circumstances the malware is polymorphic, thereby throwing off any attempts by the parent company to use the MD5 of the malicious files to identify the malware across multiple locations. By using fuzzy hashing, we were able to quickly identify the files regardless of location.

    Jesse really came up with a fantastic tool here!

  3. "As you can see from the output, these two images are 99% similar."
    This is a wrong interpretation; it says the *files* are 99% similar. To do approximate image comparison, you should use an algorithm such as "An image signature for any kind of image", as implemented by the LibPuzzle library.

  4. Oldie but still very useful. Thanks!

  5. There's an easier way to create an altered image. Simply open the jpg using a jpg viewer and choose "Save As". The file you get is usually larger than the original file.

  6. Not sure if it's a good idea to compare images with ssdeep, as different image formats encode images in very different ways. It'd work well for text files, though.

  7. Interesting post, thank you for sharing. You are making a slight error in your narrative: "As you can see from the output, these two images are 99% similar." The images are identical; the image files are not. The byte you changed in the JPEG file does not change the image payload of the file. It's in the EOF section, which is not part of the image; it's part of the JFIF container that holds the image.