CRC-32 Uniqueness and Usefulness in Profiling Text (non-binary) Files

letusknow@cogentcomputingsystems.com 18/08/2016

“Profiling files” is the technique of uniquely identifying computer files.

I assert that CRC-32 numbers, under specific conditions, can contribute to uniquely identifying files.

Why is profiling necessary?

Consider the following types of organizations: insurance companies receive claims for reimbursement, manufacturers receive purchase orders, and banks send out statements. Information is coming and going all the time for many organizations, and it is a challenge to organize the influx of so much information, most of which arrives in the form of computer files.

The benefits of profiling are as follows:

  1. Never processing the same file more than once
  2. Being able to store a file with a unique identification, i.e. unique file name
  3. Being able to recognize when you accidentally receive a duplicate file, i.e. two files with different names that are in fact the same

Manufacturers can’t afford to ship the same merchandise more than once because they accidentally received the purchase order file multiple times; insurers can’t afford to pay the same medical claims more than once because they accidentally received the same claim or set of claims more than once.

One such file profiling technique evaluates a file’s contents to produce a number that is unique among files of a given size. For example, the idea is that every eighteen-byte, twenty-byte, or thousand-byte file would yield a unique number, meaning that the combination of calculated number and file size would be unique for any computer file. However, there is a caveat to what I am about to suggest.

The technique I am suggesting involves a calculation known as a cyclic redundancy check that produces a 32-bit number, known more concisely as a CRC-32 number.

Cyclic redundancy checks were invented to quickly verify whether a transmitted set of data arrived across the wire intact. You can learn more about CRC calculation and its history at the following:
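To make the calculation concrete, here is a minimal sketch of a bitwise CRC-32 in C++. It assumes the common reflected variant (polynomial 0xEDB88320, initial value and final XOR of 0xFFFFFFFF) used by zip and PNG; the article does not say which variant its program uses, and the function name `crc32` is only illustrative.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Bitwise CRC-32 over a byte buffer.
// Assumption: the reflected zip/PNG variant (polynomial 0xEDB88320,
// initial value 0xFFFFFFFF, final XOR 0xFFFFFFFF).
std::uint32_t crc32(const unsigned char* data, std::size_t length) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < length; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            // Shift right one bit; XOR in the polynomial only when the
            // bit shifted out was 1 (the mask is all ones or all zeros).
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

// Convenience overload for file contents already held in a std::string.
std::uint32_t crc32(const std::string& bytes) {
    return crc32(reinterpret_cast<const unsigned char*>(bytes.data()),
                 bytes.size());
}
```

For this variant, the standard check string "123456789" yields 0xCBF43926, which is a quick way to confirm an implementation is the variant you expect. A table-driven version would be faster; the bitwise form is shown only because it is short.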

Wikipedia link to Cyclic Redundancy Check
exhaustively detailed paper on Cyclic Redundancy Check

Unfortunately, the uniqueness of the CRC-32 number is not totally guaranteed, unless (and this is the premise of the title) you begin with the prerequisite that a computer file can only contain ASCII text.

I wrote a C++ program based on, and incorporating portions of, the code in mdgray’s article and sample program found on CodeProject; the article and sample program describe how a file’s contents can be manipulated to resolve to any CRC-32 number you want. A corollary is the ability to manipulate a file’s contents to match an already known CRC-32 number, which is what my program does, seemingly undermining the notion that a CRC-32 number and file size will uniquely correlate to any computer file.

However, my program also demonstrates that the manipulation cannot occur without altering the nature of an ASCII-only text file: in my program, manipulation always results in some binary values ending up within the file, as can be seen in this screen shot of the program’s output.

[Screen shot: crcspoof program output]

Save the source file as a .cpp file and create a sample1.txt file. The sample1.txt file seen in the example contains “This is just some text.”

Therefore, as long as you know and can verify that your computer files contain only ASCII text, as many already do, then by this argument the combination of CRC-32 number and file size will be unique for every file of that type.
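The profiling scheme described here can be sketched in C++ as follows. This is not the crcspoof program; it is a hypothetical illustration in which `isAsciiText`, `crc32Of`, `FileProfile`, `Verdict`, and `ProfileRegistry` are all names introduced for the example. It assumes "ASCII text" means printable 7-bit characters plus tab, line feed, and carriage return, and it assumes the zip/PNG CRC-32 variant.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <string>
#include <utility>

// Assumption: "ASCII text" means printable characters (0x20..0x7E)
// plus tab, line feed, and carriage return. Anything else is "binary".
bool isAsciiText(const std::string& bytes) {
    for (unsigned char c : bytes) {
        const bool printable = (c >= 0x20 && c <= 0x7E);
        const bool whitespace = (c == '\t' || c == '\n' || c == '\r');
        if (!printable && !whitespace) return false;
    }
    return true;
}

// Bitwise CRC-32, zip/PNG variant (assumed; see the lead-in).
std::uint32_t crc32Of(const std::string& bytes) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char byte : bytes) {
        crc ^= byte;
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

// A file's profile: its size in bytes paired with its CRC-32 number.
using FileProfile = std::pair<std::size_t, std::uint32_t>;

enum class Verdict { Accepted, Duplicate, NotAsciiText };

// Remembers the profiles of files already processed.
class ProfileRegistry {
public:
    Verdict submit(const std::string& contents) {
        // Enforce the prerequisite first: only ASCII text is profiled.
        if (!isAsciiText(contents)) return Verdict::NotAsciiText;
        const FileProfile profile{contents.size(), crc32Of(contents)};
        // insert() reports whether the profile was new to the set.
        return seen_.insert(profile).second ? Verdict::Accepted
                                            : Verdict::Duplicate;
    }

private:
    std::set<FileProfile> seen_;
};
```

Submitting the sample text “This is just some text.” twice would return `Verdict::Accepted` and then `Verdict::Duplicate`, regardless of what the two files were named, while a file containing a byte such as 0x01 would be rejected as `Verdict::NotAsciiText` before any profile is recorded.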
