Using Spider from Helix (Linux version)
URL: http://helix.cit.cornell.edu Download Spider 4.0
Prerequisites: Perl 5.6.1 or higher with the following modules:
auxiliary programs:
Spider is intended to recursively process a mounted volume, searching files for a limited set of regular expressions. Our default regular expressions focus on social security numbers and credit card numbers, both because that data specifically needs protection and because the numeric format lends itself well to accurate matching. Other data, such as driver's license numbers, dates of birth, and telephone numbers, is frequently present on disk but is not in and of itself protected data.
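As a rough illustration of why numeric formats match well, the sketch below scans text for SSN-like and credit-card-like strings. The two patterns and the find_candidates helper are assumptions made up for this example; they are not the entries shipped in Spider's REGEXES file.

```python
import re

# Illustrative patterns only; NOT Spider's shipped REGEXES entries.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# Sixteen digits in four groups, optionally separated by a space or hyphen.
CC_RE = re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b")


def find_candidates(text):
    """Return (label, matched_text) pairs for every pattern hit in text."""
    hits = []
    for label, pattern in (("ssn", SSN_RE), ("cc", CC_RE)):
        for match in pattern.finditer(text):
            hits.append((label, match.group(0)))
    return hits
```

A telephone number such as 607-255-1234 does not fit either pattern, but any nine-digit string in 3-2-4 grouping will, which is why results still need manual review.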
Spider will misidentify certain types of files as containing confidential data. Every effort should be made to verify Spider's results before moving, encrypting, or removing files.
Spider's logs can function as a roadmap to confidential data and must be well secured.
Spider consists of four basic pieces and has as its prerequisites several Perl modules intended to improve its efficiency. It should be portable across UNIX flavors provided these modules exist.
The four basic components are (in /usr/local/cornell/spider in Cornell's Helix distro):
- spider_server.pl: the spider collector process that receives matches from spider clients and records them to its log file
- spider-4.0.pl: the spider client, actually responsible for processing files
- spider.conf: the configuration file used by both the client and server. It contains the shared secret, path to the log file written by spider_server, path to the regular expressions file, and a few other configuration parameters
- REGEXES: the list of regular expressions to use
The default configuration will encrypt all traffic between the spider client and server and use an MD5 hash of each packet to ensure the data is received intact and undamaged. Spider_server listens on UDP port 3000 for incoming client messages, processes them, and writes them to the log file.
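The integrity check can be pictured as a digest placed in front of each payload and recomputed on receipt. The sketch below assumes a keyed MD5 digest (HMAC-MD5) and a 16-byte prefix; Spider's actual packet layout and whether its digest is keyed are not specified here, and the seal and unseal names are hypothetical.

```python
import hashlib
import hmac

DIGEST_SIZE = hashlib.md5().digest_size  # 16 bytes for MD5


def seal(key, payload):
    """Prefix payload with a keyed MD5 digest (assumed layout: digest + payload)."""
    tag = hmac.new(key, payload, hashlib.md5).digest()
    return tag + payload


def unseal(key, packet):
    """Return the payload if the prefix matches, otherwise None."""
    tag, payload = packet[:DIGEST_SIZE], packet[DIGEST_SIZE:]
    expected = hmac.new(key, payload, hashlib.md5).digest()
    # compare_digest avoids leaking where the first mismatching byte is.
    return payload if hmac.compare_digest(tag, expected) else None
```

A receiver built this way would drop any packet for which unseal returns None, covering both corruption in transit and a wrong shared secret.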
Startup:
It is only necessary to start the server once, either on the Helix instance used for examination or on a central spider log host:
/usr/local/cornell/spider/spider_server.pl
Spider_server will print its configuration then background itself.
Scanning:
Normal spider behavior is to recursively scan directories and process all readable files. Files of certain types, including binary graphics data, RPMs, and executables, are not processed as they yield extreme false positive rates. Every effort is made to deal with Unicode, Office file formats, PDFs, mailboxes, and the like. Spider_server's log file deliberately begins with "GIF89a" so that spider skips it if the log happens to sit inside the examination tree.
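One way to skip such files is to compare the first few bytes against known file signatures, which is also why a log that starts with "GIF89a" would be passed over. The signature table and function names below are assumptions for illustration; Spider's real skip list and detection method are not given in this document.

```python
# Hypothetical signature table; Spider's real skip list is not documented here.
SKIP_SIGNATURES = (
    b"GIF89a",             # GIF image (also the first bytes of spider_server's log)
    b"GIF87a",             # older GIF variant
    b"\x89PNG\r\n\x1a\n",  # PNG image
    b"\xff\xd8\xff",       # JPEG image
    b"\x7fELF",            # ELF executable
    b"\xed\xab\xee\xdb",   # RPM package
)


def should_skip(head):
    """Return True if the leading bytes match a signature in the skip table."""
    return head.startswith(SKIP_SIGNATURES)


def should_skip_file(path, probe=16):
    """Read only the first `probe` bytes of a file and test them."""
    with open(path, "rb") as handle:
        return should_skip(handle.read(probe))
```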
Default arguments to spider are generally sufficient:
/usr/local/cornell/spider/spider-4.0.pl -D /path/to/mount
Spider can be fairly verbose, and will report files under examination.
The Spider log file:
The spider log file records the IP address of the spider client making the report (localhost in most cases), the path to the file under examination, the regular expression that caused the match, and roughly 1K worth of human-readable text in which the match was found.
False positives are not uncommon, as many things are likely to match SSN or CC patterns. Visual inspection of the spider log file is necessary as a final step to determine whether or not sensitive data is present and to what degree.
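The human-readable context recorded with each match could be produced roughly as follows: take about 1K of bytes around the match and replace anything unprintable with a filler character. This is a sketch under assumptions; the excerpt function, its centering of the window on the match, and its definition of "printable" are not taken from Spider's source.

```python
def is_printable(byte):
    """Treat ASCII 0x20-0x7E plus tab and newline as printable (an assumption)."""
    return 0x20 <= byte <= 0x7E or byte in (0x09, 0x0A)


def excerpt(data, start, end, window=1024, unprint="."):
    """Return about `window` bytes around data[start:end], made printable."""
    pad = max(0, (window - (end - start)) // 2)
    lo = max(0, start - pad)
    hi = min(len(data), end + pad)
    fill = ord(unprint)
    return bytes(b if is_printable(b) else fill for b in data[lo:hi])
```

Reading the excerpt in context is usually what separates a real SSN from, say, a part number that happens to share its shape.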
Spider_server command line options:
| -c <file> | read configuration from <file> instead of /usr/local/cornell/spider/spider.conf |
| -l <log> | write to <log>; path in spider.conf supersedes this option |
| -r <regexes> | read regular expressions from <regexes>; path in spider.conf supersedes this option |
Spider client command line options:
| -v | [verbose] will cause spider to be excessively chatty in its operation. Also disables spider's default behavior of reporting only the first match in a file |
| -u | [unscrew] will cause spider to recursively process directories converting Windows filename convention to UNIX filename convention. This is NOT forensically sound as it requires modifying the evidence drive, can be extremely destructive to production systems, and makes post-examination followup difficult. It is also unnecessary 99% of the time with spider 4.0 |
| -T | [test] will cause spider to compile the regular expressions in the file specified by spider.conf and report any syntactical errors |
| -D <dir> | will cause spider to begin processing at directory <dir> |
| -S | [show] will print the regular expressions in the file specified in spider.conf |
| -c <config> | will cause spider to use <config> as its configuration file instead of /usr/local/cornell/spider/spider.conf |
Spider.conf options:
The spider.conf configuration file consists of a series of keyword/value pairs delimited by equals ("=") characters. Pound ("#") signs anywhere on a line are considered comments and are ignored. Keywords are case insensitive.
Definable options are as follows:
Logfile (path) path to spider_server's record of incoming pattern matches; defaults to /tmp/logfile.IP where "IP" will be replaced by the spider client IP address.
Regexes (path) path to spider's regular expressions file. Be aware that overzealous regular expressions slow spider considerably and result in higher false positive rates.
Use_hmac (0|1) determines whether spider will prefix each packet with an MD5 of the payload. Default is 1, which allows spider to detect decryption errors or tampering.
Hmac (md5|sha1) determines which hashing algorithm will be used for integrity checking
Encrypt (0|1) determines whether the communications between spider client and server are encrypted. Defaults to 1. As spider will, by design, selectively discover and concentrate sensitive data, encrypted communications are extremely desirable. See Note 1 below.
Cipher (Blowfish|DES) selects the cipher to use of those available to the Crypt::CBC perl module. Blowfish is the default.
Key (user defined) gives the encryption key used by both spider client and spider_server for secure communications. Ideally this is uniquely defined for each spider use in non-loopback communications and never reused. No effort is made to periodically change keying material and the communication of keying material to spider clients is assumed to be the security responsibility of the user.
Interface (user defined, default 127.0.0.1) gives the interface on which spider_server should listen for connections. Giving the localhost interface prevents outside clients from accessing spider_server.
Port (user defined, default 3000) gives the port on which spider client and server communicate. The protocol is UDP and the maximum payload size is 1024 bytes plus any overhead incurred by HMAC.
Loghost (user defined, default localhost) is the log server to which spider clients should send their results.
Summary (0|1) determines whether to send a summary of file types to the loghost. Can be used to keep server logs brief.
Types (number) reports only on the top <number> file types found.
Max_depth (bytes) scans files only to a depth of <bytes>. Generally results in a speed improvement if kept below 20000.
Unprint (quoted char) replaces unprintable characters with <char>. Defaults to ".".
Unpack (0|1) unpacks archives in /tmp and scans the results. Can slow spider but does result in better discovery.
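The keyword/value format described above could be read with a parser along these lines. This is a minimal sketch, not Spider's own parser: it assumes that everything from a "#" to the end of the line is the comment, and the parse_conf name, the DEFAULTS table (filled only with defaults stated above), and the sample values are all made up for illustration.

```python
# Defaults taken from the option descriptions above; other options have none stated.
DEFAULTS = {
    "use_hmac": "1",
    "encrypt": "1",
    "cipher": "Blowfish",
    "interface": "127.0.0.1",
    "port": "3000",
    "loghost": "localhost",
}


def parse_conf(text):
    """Parse keyword=value lines; keywords are lowercased, '#' starts a comment."""
    settings = dict(DEFAULTS)
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumed: drop '#' through end of line
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().lower()] = value.strip()
    return settings


# Made-up sample file contents, for illustration only.
SAMPLE = """\
# example configuration
Logfile = /tmp/spider.log
PORT=3001   # non-default port
Key = change-me
"""
```

Because "#" is treated as a comment marker anywhere on a line, a value such as a key containing "#" would be truncated by this sketch.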
Security Notes:
- Spider uses an encryption and integrity checking mechanism intended to prevent the eavesdropping of results on the local LAN. The protocol should be sufficiently sound to prevent disclosure in a low threat environment. Captured spider sessions can be decrypted after the fact if the shared secret itself is reused or disclosed, so every effort should be made to protect spider's key when used outside the local machine.
- Spider serves to select and concentrate sensitive data in the log file. That file exists in a RAM disk under Helix but may not when operating under other systems. The log file should be securely recorded to removable media, which in turn is properly handled. Once this is done, the log file should be wiped with a disk wiping utility of known and reliable behavior.
