Deep Content Fingerprinting

To meet the content protection requirements of today’s enterprises, Code Green Networks has developed proprietary Deep Content Fingerprinting™ technology that specifically protects unstructured data: content contained in more than 400 file types, including MS Office and other documents, across all languages, including those that use multi-byte character sets.  Based on research conducted at Stanford University and supported by a number of pending patents, the technology computes a series of sliding hashes that are mathematically reduced to uniquely represent a document and all of its constituent parts.
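The overview does not disclose the actual algorithm, so the following is only a minimal Python sketch of one well-known way to build sliding hashes and then reduce them: hash every k-character window, then keep the smallest hash from each run of consecutive hashes (a winnowing-style selection). The function names, the window sizes, and the use of SHA-1 are assumptions made for illustration, not Code Green Networks' implementation.

```python
import hashlib


def sliding_hashes(text, k=8):
    """Hash every k-character window of the text (hypothetical helper)."""
    return [
        int.from_bytes(hashlib.sha1(text[i:i + k].encode("utf-8")).digest()[:8], "big")
        for i in range(max(len(text) - k + 1, 0))
    ]


def reduce_hashes(hashes, window=4):
    """Keep only the minimum hash of each run of `window` consecutive hashes."""
    if not hashes:
        return set()
    if len(hashes) < window:
        return {min(hashes)}
    return {min(hashes[i:i + window]) for i in range(len(hashes) - window + 1)}


document = "The quick brown fox jumps over the lazy dog near the river bank."
fingerprint = reduce_hashes(sliding_hashes(document))
```

Under these assumptions, any excerpt of at least k + window - 1 characters contributes at least one hash that also appears in the fingerprint of the full document, which is what lets a reduced set of hashes stand in for a document and its parts.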

The Key to Effective Content Protection

To protect confidential information accurately and with minimal false positives, Code Green Networks Deep Content Fingerprinting captures and stores representative signatures of the content to be protected.  It then compares these signatures, in real time, to content transmitted across the network.  If it detects a match, it can then invoke the appropriate policy and take action on it.
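The capture, compare, and act cycle described above can be sketched roughly as follows. This is an illustrative Python outline only: the registry layout, the `inspect` function, the `min_shared` threshold, and the action names are all hypothetical, and the signatures here are plain hashes of overlapping character windows rather than the product's proprietary encoding.

```python
import hashlib


def signatures(text, k=8):
    """Hashes of every k-character window after simple normalization (assumed scheme)."""
    norm = " ".join(text.split()).lower()
    return {
        hashlib.sha1(norm[i:i + k].encode("utf-8")).hexdigest()[:16]
        for i in range(max(len(norm) - k + 1, 0))
    }


def inspect(payload, registry, min_shared=20):
    """Return (document name, policy action) for each registered document
    that shares at least `min_shared` signatures with the payload."""
    seen = signatures(payload)
    hits = []
    for name, (stored, action) in registry.items():
        if len(seen & stored) >= min_shared:
            hits.append((name, action))
    return hits


secret = ("Project Falcon pricing: the enterprise tier will launch in March "
          "at a forty percent discount for existing customers.")
# Registration step: store signatures together with the policy action to invoke.
registry = {"falcon-pricing": (signatures(secret), "block")}

outbound = "FYI " + secret[24:90] + " Keep it quiet."
for name, action in inspect(outbound, registry):
    print(f"match on {name}: {action}")
```

Counting shared signatures, rather than requiring the whole payload to match, is one simple way to flag an excerpt embedded in an otherwise unrelated message.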

Traditionally, digital documents have been compared using hashes of entire files.  Sufficient for detecting exact file matches, this method is inadequate for the multitude of ways that content is used and transmitted today.  The digital workflow of today’s enterprises requires a content fingerprinting methodology that:

  • Reliably and accurately detects derivatives and excerpts of confidential content independent of format and message protocol
  • Fingerprints all languages, including those with non-Roman scripts (e.g., Japanese, Chinese)
  • Can be implemented to protect content at all potential leakage points
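To illustrate why a whole-file hash falls short of these requirements, the short Python sketch below compares it with a comparison of overlapping character windows. It is a simplified stand-in, not the product's method: `whole_file_hash`, `kgram_set`, and `containment` are hypothetical helpers, and the sample text is invented. Because Python strings are sequences of Unicode code points, the same windowing applies unchanged to Japanese or Chinese text.

```python
import hashlib


def whole_file_hash(data):
    """Traditional approach: one hash over the entire file."""
    return hashlib.sha256(data).hexdigest()


def kgram_set(text, k=5):
    """All overlapping k-character windows of the text."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 0))}


def containment(candidate, original, k=5):
    """Fraction of the candidate's windows that also occur in the original."""
    grams = kgram_set(candidate, k)
    if not grams:
        return 0.0
    return len(grams & kgram_set(original, k)) / len(grams)


original = "Quarterly forecast. 機密情報につき社外への転送を禁止します。 Internal use only."
excerpt = original[15:40]  # spans both the English and the Japanese text

# The file hashes differ, so an exact-match check misses the excerpt.
print(whole_file_hash(original.encode("utf-8")) == whole_file_hash(excerpt.encode("utf-8")))
# Every window of the excerpt still occurs in the original.
print(containment(excerpt, original))
```

An edited copy of the excerpt lowers the containment score gradually instead of dropping it to zero, which is the behavior a derivative-detection scheme needs.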

Deep Content Fingerprinting in Action

Deep Content Fingerprinting occurs either automatically, by “crawling” data repositories, or manually, by Web upload.

Crawling is a key feature of the Content Inspection Appliances, providing an efficient and scalable solution to rapidly register confidential content contained in file shares (CIFS or NFS) and repositories such as enterprise content management systems.  The content crawling engine recursively traverses file system trees on a file share to identify and efficiently encode confidential content into a set of unique digital signatures.  It does this by opening and inspecting files stored in data repositories and then generating unique signatures, similar to an individual’s fingerprint.  These fingerprints are then stored in a fingerprint database and later used to identify confidential content transmitted on the network, even if the content has been cut and pasted into another document, compressed or modified.
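The recursive traversal described above might look roughly like the following Python sketch. It assumes the file share is already mounted as a local directory and treats every file as plain text; extracting text from the many supported file formats is outside the scope of the example. The `crawl` and `signatures` functions and the in-memory dictionary standing in for the fingerprint database are hypothetical.

```python
import hashlib
import os


def signatures(text, k=8):
    """Hashes of every k-character window after simple normalization (assumed scheme)."""
    norm = " ".join(text.split()).lower()
    return {
        hashlib.sha1(norm[i:i + k].encode("utf-8")).hexdigest()[:16]
        for i in range(max(len(norm) - k + 1, 0))
    }


def crawl(root):
    """Walk every directory under `root` and map each readable file to its signatures."""
    database = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    database[path] = signatures(handle.read())
            except OSError:
                continue  # skip files that cannot be opened or read
    return database
```

Calling `crawl` on the mount point of a share would return one signature set per file, which a later inspection step could compare against network traffic.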

Code Green Networks’ Deep Content Fingerprinting differs from other content detection methods in the accuracy and efficiency with which it inspects large volumes of network traffic while still detecting confidential content in transit.

Because content fingerprints are a unique and accurate representation of the original content, they can later be used to identify confidential content even if it has been cut and pasted into another document, compressed or modified.  For example, if an employee pasted a section of registered C++ source code into an email and attempted to send it outside the network, the Content Inspection Appliance would detect the derivative work.

The unique and efficient encoding scheme used in Deep Content Fingerprinting produces fingerprint hashes that are 1/300 the size of the original document, so fingerprints for large volumes of confidential content can be stored in a minimum amount of space and inspected in real time.
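As a quick back-of-the-envelope check of what the 1/300 ratio implies, the Python snippet below estimates fingerprint storage for a corpus of a given size. The function name and the 1 TB corpus are assumptions chosen for illustration, and the calculation simply applies the stated ratio uniformly.

```python
def fingerprint_storage_bytes(corpus_bytes, ratio=300):
    """Estimated fingerprint size at 1/ratio of the original, rounded up to whole bytes."""
    return -(-corpus_bytes // ratio)  # ceiling division on integers


one_terabyte = 10**12
estimate = fingerprint_storage_bytes(one_terabyte)
print(f"{estimate / 10**9:.2f} GB of fingerprints for 1 TB of content")
```

At that ratio, fingerprints for a terabyte of registered content would occupy a little over 3 GB.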