Other Dataset Repositories
If you use this website to find a reference set for your research, please cite our publication:
Cinthya Grajeda, Frank Breitinger, and Ibrahim Baggili. “Availability of Datasets for digital forensics – and what is missing”. In: Digital Investigation (2017). (Presented at DFRWS 2017, Austin, TX)
Source | Description | Sample Datasets |
ASCL.net | The Astrophysics Source Code Library (ASCL) is a free online registry for source codes of interest to astronomers and astrophysicists. It lists codes that have been used in research that has appeared in, or been submitted to, peer-reviewed publications. The ASCL is indexed by the SAO/NASA Astrophysics Data System (ADS). | Source codes of interest to astronomers and astrophysicists. Note that this website provides links to other sources as well as locations of the code. |
The Drone Forensics Program | Research sponsored by the United States Department of Homeland Security (DHS) Science and Technology Directorate, Cyber Security Division (DHS S&T/CSD). It contains 3 TB of data from 30 consumer and professional drones to aid law enforcement and government investigations. The program is run by VTO Inc. | DJI, Parrot, Intel, and more. |
Data.gov | U.S. Government open data website. Contains data, tools, and resources to conduct research, develop web and mobile applications, design data visualizations, and more. | Data pertaining to education, health, economy, technology, etc. |
Forensics Wiki Corpora | This site lists information about, and links to, other forensic dataset repositories. | DARPA, Digital Corpora, Wireshark, DFRWS challenges, The Honeynet Project's Forensic Challenge, etc. |
Forensic Focus | This site is a digital forensics portal for computer forensics and eDiscovery professionals. It offers links to major test image dataset repositories and forensic challenges. | Disk images, pcap files, mobile images, APKs, incident response data, etc. |
Impact Cyber Trust | Sponsored by DHS and other agencies, this website includes a central metadata index database of ground truth and synthetic data available for sharing. The data comes from at least 10 providers, including Georgia Tech and Packet Clearing House. An account is needed to access the datasets. | IDS and firewall, DNS, IP, BGP routing data, etc. |
PeekaTorrent | 2.6 petabytes of publicly shared files acquired through P2P file-sharing networks. | Torrent archives and hashdb datasets with info hash |
Perdisci - Useful Public Resources | Personal website containing a few links to network traffic, malware, and machine learning dataset repositories. | Pcap, traces, logs, malware, etc. |
Real Data Corpus | This is a collection of disk images extracted from secondary storage devices that were acquired from second-hand markets around the world. The RDC currently consists of 58 TB of data contained in 3,127 disk images from 29 countries. | Extractions from magnetic media, solid state storage from laptops, desktops, mobile phones, USB memory sticks, and other media. |
Security Repo | This repository includes samples of security-related data created by the owner of the site and links to other dataset repositories found online. | Networking (scanning/recon, exploitation, shell traffic, security incidents, system logs, net traffic, FTP, SSH, (DNS) lookups, SSL certs, UDP, TCP, URLs, exploit kits and benign traffic, malware, threat feeds, etc.) |
Stratosphere IPS Project | Dataset captures with malware, normal and background traffic for machine learning algorithms. | Binetflow files, Biargus files, and Pcap files |
The Open American National Corpus | Massive electronic collection of American English, including texts of all genres and transcripts of spoken data produced from 1990 onward. | Texts and transcripts |
UC Irvine Machine Learning Repository | This website contains machine learning datasets on almost any topic. It currently contains 351 datasets, 106 of which are related to CS and Engineering. | Anonymous Microsoft Web Data logs, mobile robots sensor data, SMS spam collection, etc. |
Zeltser - Malware Samples | Personal website that has a few links to other malware dataset sources. | Malware samples |