For years, computer forensic investigators have put a great deal of stock in the effectiveness of MD5 hashing. To qualify that statement, I mean specifically using MD5 hashes to identify known malicious files. The key word in that sentence is “known,” but let's take it one step further and add the word “unmodified”: known, unmodified files. One minor change to a file, and the MD5 hash is completely different, rendering the investigator's search totally ineffective. So, what's the answer? Easy: fuzzy hashing.
Hash comparisons are either a yes or a no: either the hash matches, or it doesn't. But a mismatch does not mean that the files are not the same; it just means they are not exactly the same. I am going to use a simple example that will illustrate exactly what I am talking about.
The photograph of Oklahoma State University wide receiver Dez Bryant below was taken from “http://media.photobucket.com/image/dez%20bryant/imandyduckworth/DezBryant.jpg” on November 9, 2009.
Using MD5Deep, I generated an MD5 hash for this picture:
Now, if you were an investigator, and you were going to search for that image of Dez based on the MD5 hash, you would only find it if the image were totally and completely identical to this original.
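To make that kind of search concrete, here is a minimal sketch in Python of a known-hash lookup. It is not how MD5Deep or any forensic suite works internally; the function names (md5_hex, find_known) and the byte strings standing in for image files are hypothetical, chosen only for illustration.

```python
import hashlib


def md5_hex(data):
    """Return the MD5 digest of a byte string as 32 hex characters."""
    return hashlib.md5(data).hexdigest()


def find_known(files, known_hashes):
    """Return the names of files whose MD5 appears in the known-hash set."""
    return [name for name, data in files.items() if md5_hex(data) in known_hashes]


# Hypothetical stand-in bytes; a real search would read files from disk.
known_image = b"\xff\xd8 pretend this is the known image \x20\x3f\xff\xd9"
known_hashes = {md5_hex(known_image)}

candidates = {
    "exact_copy.jpg": known_image,
    "unrelated.jpg": b"\xff\xd8 some other picture entirely \xff\xd9",
}

hits = find_known(candidates, known_hashes)
print(hits)  # ['exact_copy.jpg']
```

The lookup is a plain set membership test, so a candidate is reported only when every byte of it matches the known file.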
To show how easy it is to modify an image like this, I used Ghex to open the image and scrolled to the bottom of the content.
Note that at offset 5879 (the last line) there are only four bytes, which on the right translate to a blank space, a question mark, and two periods. Using Ghex, I am simply going to replace the blank space with a period.
Look at offset 5879 again in the figure above. I replaced the blank space with the period, changing that last line from "20 3F FF D9" to "2E 3F FF D9". A very minor change. As you can see from the modified image of Dez below, there is no visible change to the image.
Again, using MD5Deep, I calculated the MD5 hash of the modified image, and it is totally different from the hash of the first image.
Here is the unmodified image hash one more time:
Here is the modified image hash:
The hashes are not even close. So, if an investigator were searching for this image based on the MD5 hash, he would fail to find it.
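The same effect is easy to reproduce without a hex editor. The sketch below uses a made-up byte string as a stand-in for the image (it is not the actual JPEG), ends it with a space, a question mark, and two non-printable bytes, and then swaps the space (0x20) for a period (0x2E) before hashing both versions.

```python
import hashlib

# Hypothetical stand-in for the image data, not the real file.
original = bytes(range(256)) * 4 + b"\x20\x3f\xff\xd9"

# Patch a single byte: the space four bytes from the end becomes a period.
patched = bytearray(original)
offset = len(patched) - 4
patched[offset] = 0x2E
modified = bytes(patched)

hash_original = hashlib.md5(original).hexdigest()
hash_modified = hashlib.md5(modified).hexdigest()

# Count how many bytes differ, and how many hex digits happen to line up.
changed_bytes = sum(a != b for a, b in zip(original, modified))
matching_digits = sum(a == b for a, b in zip(hash_original, hash_modified))

print("bytes changed:      ", changed_bytes)
print("original MD5:       ", hash_original)
print("modified MD5:       ", hash_modified)
print("hex digits in common:", matching_digits, "of 32")
```

Only one byte differs between the two inputs, yet the digests share no more hex digits than chance would give, which is exactly why a one-to-one hash search misses the modified file.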
So, if you are an investigator, you may be thinking, “Aw crap...now what?! So ALL of the hash comparisons I have been doing could have failed while the evidence was still present?”
The answer to that question is, “Yes...if the images were modified in any way...yes they did.” But, there is hope, and that hope is called fuzzy hashing.
Since the one-to-one comparison of hash sets is obviously antiquated and inadequate, Jesse Kornblum of ManTech thought up a fantastic solution called fuzzy hashing. Using a tool called SSDEEP, you can generate a fuzzy hash for a file and compare it against other files, producing a score from 0 to 100 that indicates how closely the files match!
Using SSDEEP, I generated a fuzzy hash of the first image of Dez with this command:
ssdeep -b DezBryant.jpg
I simply redirected the output to a file named dez.hash.
Then, I used that file as the set of known hashes to compare against the second image of Dez:
root@Linux-Forensic1:/home/cepogue/Pictures# ssdeep -bm dez.hash DezBryant2.jpg
DezBryant2.jpg matches DezBryant.jpg (99)
As you can see from the output, these two images are 99% similar.
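To give a feel for why a fuzzy hash survives a small edit, here is a heavily simplified sketch in Python. It is not SSDEEP's actual algorithm and will not produce SSDEEP-compatible hashes or scores. It assumes a toy scheme: a rolling sum over a 7-byte window decides where to cut the data into pieces, each piece contributes one character to a signature, and two signatures are compared with the standard library's difflib. The names fuzzy_signature and similarity, the window size, and the trigger value are all illustrative choices.

```python
import hashlib
import random
from difflib import SequenceMatcher

B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
WINDOW = 7    # bytes in the rolling window (assumed value)
TRIGGER = 32  # a cut happens roughly once every 32 bytes (assumed value)


def fuzzy_signature(data):
    """Build a toy piecewise signature: one character per content-defined piece."""
    signature = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]  # drop the byte that left the window
        if rolling % TRIGGER == TRIGGER - 1:
            piece = data[start:i + 1]
            signature.append(B64[hashlib.md5(piece).digest()[0] % 64])
            start = i + 1
    if start < len(data):  # hash whatever is left after the last cut
        signature.append(B64[hashlib.md5(data[start:]).digest()[0] % 64])
    return "".join(signature)


def similarity(sig_a, sig_b):
    """Score two signatures from 0 to 100 using difflib's matching ratio."""
    ratio = SequenceMatcher(None, sig_a, sig_b, autojunk=False).ratio()
    return round(100 * ratio)


# Hypothetical test data: reproducible random bytes standing in for files.
rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(4096))
patched = bytearray(original)
patched[2000] ^= 0xFF  # guaranteed one-byte change
modified = bytes(patched)
unrelated = bytes(rng.randrange(256) for _ in range(4096))

sig_original = fuzzy_signature(original)
score_modified = similarity(sig_original, fuzzy_signature(modified))
score_unrelated = similarity(sig_original, fuzzy_signature(unrelated))

print("MD5 equal:       ", hashlib.md5(original).digest() == hashlib.md5(modified).digest())
print("modified score:  ", score_modified)
print("unrelated score: ", score_unrelated)
```

Because the cut points depend only on nearby bytes, a one-byte edit disturbs only the piece or two around it, so most of the signature is unchanged and the score stays high, while unrelated data scores far lower.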
Fuzzy hashing can efficiently and effectively help investigators identify files that contain a high percentage of similarities. While two files may not be 100% identical, as shown by my example, that does not necessarily mean that they are not the same image. The same approach can be used with really any type of file. An investigator can then take the files with the highest percentage of similarities and manually review those individual files.
SSDEEP is a free utility and can be downloaded from http://ssdeep.sourceforge.net/.