Cinthya Grajeda, Frank Breitinger, and Ibrahim Baggili. “Availability of Datasets for digital forensics – and what is missing”. In: Digital Investigation (2017). (Presented at DFRWS 2017, Austin, TX)


| Source | Description | Sample Datasets |
| --- | --- | --- |
| Data.gov | U.S. Government's open data website. Contains data, tools, and resources to conduct research, develop web and mobile applications, design data visualizations, and more. | Data pertaining to education, health, economy, technology, etc. |
| Forensics Wiki Corpora | This site lists information and links to other forensic dataset repositories. | DARPA, Digital Corpora, Wireshark, DFRWS challenges, The Honeynet Project's Forensic Challenge, etc. |
| Forensic Focus | A digital forensics portal for computer forensics and eDiscovery professionals. It offers links to major test image dataset repositories and forensic challenges. | Disk images, pcap, mobile images, APKs, incident response, etc. |
| Impact Cyber Trust | Sponsored by DHS and other agencies, this website includes a central metadata index database of ground truth and synthetic data available for sharing. The data comes from at least 10 providers, including Georgia Tech and Packet Clearing House. An account is needed to access the datasets. | IDS and firewall, DNS, IP, BGP routing data, etc. |
| PeekaTorrent | 2.6 petabytes of publicly shared files acquired through P2P file-sharing networks. | Torrent archives and hashdb datasets with info hash |
| Perdisci - Useful Public Resources | Personal website containing a few links to network traffic, malware, and machine learning dataset repositories. | Pcap, traces, logs, malware, etc. |
| Security Repo | This repository includes samples of security-related data created by the owner of the site and links to other dataset repositories found online. | Networking (scanning/recon, exploitation, shell traffic, security incidents, system logs, net traffic, FTP, SSH, (DNS) lookups, SSL certs, UDP, TCP, URLs, exploit kits and benign traffic, malware, threat feeds, etc.) |
| Stratosphere IPS Project | Dataset captures with malware, normal, and background traffic for machine learning algorithms. | Binetflow files, Biargus files, and Pcap files |
| The Open American National Corpus | Massive electronic collection of American English, including texts of all genres and transcripts of spoken data produced from 1990 onward. | Texts and transcripts |
| UC Irvine Machine Learning Repository | This website contains machine learning datasets on almost any topic. It currently contains 351 data sets, 106 of which are related to CS and Engineering. | Anonymous Microsoft Web Data logs, mobile robots sensor data, SMS spam collection, etc. |
| Zeltser - Malware Samples | Personal website that has a few links to other malware dataset sources. | Malware samples |