Bug#990302: ITP: bulk-extractor -- A stream-based forensics tool for triage and cross-evidence analysis, which scans the media and extracts recognizable content
Package: wnpp
X-Debbugs-Cc: debian-devel@lists.debian.org, debian-security-tools@lists.debian.org
Owner: Jan Gru <j4n6ru@gmail.com>
Severity: wishlist
* Package name : bulk-extractor
Version : 1.6.0
Upstream Author : Simson L. Garfinkel <slgarfin@nps.edu>
* URL : https://github.com/simsong/bulk_extractor
* License : MIT and CC0
Programming Lang: C++, Python (and Java for the BEViewer, which will probably not be packaged)
Description : A stream-based forensics tool for triage and cross-evidence analysis, which scans the media and extracts recognizable content
bulk_extractor is a program for bulk data extraction and analysis. It carves relevant features such as email addresses, credit card numbers, URLs,
and other types of information from digital evidence files in a stream-based manner, processing blocks in parallel to avoid disk seeking.
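
To illustrate what block-wise, stream-based feature extraction means, here is a minimal Python sketch. It is not bulk_extractor's code: the names scan_block and scan_image, the simplified email pattern, and the block and overlap sizes are all assumptions made for this example, and it assumes features are shorter than the overlap. Python threads also do not give real CPU parallelism here; the sketch only shows the structure.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Simplified email pattern (assumption); real scanners are far more thorough.
EMAIL_RE = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scan_block(data, offset, block_size, overlap):
    """Return (absolute_offset, feature) pairs that start inside one block.

    The block is scanned together with a margin on both sides so that a
    feature crossing a block boundary is seen whole. Only hits that start
    inside the block itself are reported, so no hit is counted twice.
    """
    start = max(0, offset - overlap)
    window = data[start:offset + block_size + overlap]
    hits = []
    for match in EMAIL_RE.finditer(window):
        position = start + match.start()
        if offset <= position < offset + block_size:
            hits.append((position, match.group().decode("ascii")))
    return hits


def scan_image(data, block_size=4096, overlap=256, workers=4):
    """Scan all blocks independently and merge the hits in offset order."""
    offsets = range(0, len(data), block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_block = pool.map(
            lambda off: scan_block(data, off, block_size, overlap), offsets
        )
        return sorted(hit for block in per_block for hit in block)
```

Because every block is scanned independently, the input can be read strictly front to back and the blocks handed to as many workers as there are cores, without following file system structures.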
** Why is this package relevant?
It is a useful tool for forensic investigations, because it is way more than just another file carver. The program provides several unusual capabilities including:
- It finds email addresses, URLs and credit card numbers that other tools miss because it can process compressed data (like ZIP, PDF and GZIP files) and incomplete or partially corrupted data.
- It can carve JPEGs, office documents and other kinds of files out of fragments of compressed data. It will detect and carve encrypted RAR files.
- It builds word lists based on all of the words found within the data, even those in compressed files that are in unallocated space. Those word lists can be useful for password cracking.
- It is multi-threaded; running bulk_extractor on a computer with twice the number of cores typically makes it complete a run in half the time.
- It creates histograms showing the most common email addresses, URLs, domains, search terms and other kinds of information on the drive.
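
Two of these capabilities, looking inside compressed data and building histograms, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how bulk_extractor is implemented: the function names, the simplified email pattern, and the restriction to gzip streams are all made up for this example.

```python
import re
import zlib
from collections import Counter

# Simplified email pattern (assumption); real scanners are far more thorough.
EMAIL_RE = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
GZIP_MAGIC = b"\x1f\x8b\x08"  # gzip magic bytes followed by the deflate method


def inflate_candidates(data):
    """Yield decompressed bytes for every plausible gzip stream in data.

    A streaming decompressor is used so that a truncated stream still
    yields whatever could be decoded before the data ran out.
    """
    pos = data.find(GZIP_MAGIC)
    while pos != -1:
        decoder = zlib.decompressobj(31)  # wbits=31: expect gzip framing
        try:
            out = decoder.decompress(data[pos:])
        except zlib.error:
            out = b""  # magic bytes were a false positive or data is corrupt
        if out:
            yield out
        pos = data.find(GZIP_MAGIC, pos + 1)


def email_histogram(data):
    """Count email addresses in the raw bytes and in inflated gzip data."""
    counts = Counter(m.group().decode("ascii") for m in EMAIL_RE.finditer(data))
    for inflated in inflate_candidates(data):
        counts.update(
            m.group().decode("ascii") for m in EMAIL_RE.finditer(inflated)
        )
    return counts
```

Calling email_histogram(image_bytes).most_common(10) would then list the ten most frequent addresses, including those that only appear inside compressed fragments.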
The program is authored by the renowned forensics researcher Simson L. Garfinkel, who is probably most recognized for his work on DFXML at the Naval Postgraduate School (NPS) and the National Institute of Standards and Technology (NIST). It provides rich documentation -- for the end-users as well as for potential contributors [0].
To sum it up, bulk_extractor has great potential for improving triage and automation workflows within digital forensics and should therefore be included in Debian's package sources.
** Resolved issues
bulk_extractor is already packaged in Kali [1], but had licensing issues until recently.
To be more precise, it linked code with OpenSSL without its license explicitly permitting this, and it used the modified MIT license from the
JSON project, which is considered non-free and not DFSG-compliant. I resolved these issues in cooperation
with upstream by sending two patches [2], which have already been accepted.
** Maintenance plan
I plan to maintain it within the pkg-security-team's repository on salsa, where a lot of forensics packages live [3].
I am looking for a sponsor for this package, ideally a member of the above-mentioned team.
Best regards
Jan
[0] See http://digitalcorpora.org/downloads/bulk_extractor/BEUsersManual.pdf, https://digitalcorpora.s3.amazonaws.com/downloads/bulk_extractor/BEProgrammersManual.pdf and https://digitalcorpora.s3.amazonaws.com/downloads/bulk_extractor/BEWorkedExamplesStandalone.pdf
[1] See https://tools.kali.org/forensics/bulk-extractor
[2] See https://github.com/simsong/bulk_extractor/issues/168, https://github.com/simsong/bulk_extractor/pull/169 and https://github.com/simsong/bulk_extractor/pull/170
[3] See https://salsa.debian.org/pkg-security-team/