Other Dataset Repositories

If you use this website to find a reference set for your research, please cite our publication:

Cinthya Grajeda, Frank Breitinger, and Ibrahim Baggili. “Availability of Datasets for digital forensics – and what is missing”. In: Digital Investigation (2017). (Presented at DFRWS 2017, Austin, TX)

| Source | Description | Sample Datasets |
| --- | --- | --- |
| ASCL.net | The Astrophysics Source Code Library (ASCL) is a free online registry for source codes of interest to astronomers and astrophysicists. It lists codes that have been used in research that has appeared in, or been submitted to, peer-reviewed publications. The ASCL is indexed by the SAO/NASA Astrophysics Data System (ADS). | Source codes of interest to astronomers and astrophysicists. Note that this website provides links to other sources as well as locations of the code. |
| Data.gov | U.S. Government open data website. Contains data, tools, and resources to conduct research, develop web and mobile applications, design data visualizations, and more. | Data pertaining to education, health, economy, technology, etc. |
| Forensics Wiki Corpora | This site lists information about, and links to, other forensic dataset repositories. | DARPA, Digital Corpora, Wireshark, DFRWS challenges, The Honeynet Project's Forensic Challenge, etc. |
| Forensic Focus | This site is a digital forensics portal for computer forensics and eDiscovery professionals. It offers links to major test image dataset repositories and forensic challenges. | Disk images, pcap, mobile images, APKs, incident response, etc. |
| Impact Cyber Trust | Sponsored by DHS and other agencies, this website includes a central metadata index database of ground truth and synthetic data available for sharing. The data comes from at least 10 providers, including Georgia Tech and Packet Clearing House. An account is needed to access the datasets. | IDS and firewall, DNS, IP, BGP routing data, etc. |
| PeekaTorrent | 2.6 petabytes of publicly shared files acquired through P2P file-sharing networks. | Torrent archives and hashdb datasets with info hash |
| Perdisci - Useful Public Resources | Personal website containing a few links to network traffic, malware, and machine learning dataset repositories. | Pcap, traces, logs, malware, etc. |
| Real Data Corpus | This is a collection of disk images extracted from secondary storage devices that were acquired from second-hand markets around the world. The RDC currently consists of 58 TB of data contained in 3,127 disk images from 29 countries. | Extractions from magnetic media, solid state storage from laptops, desktops, mobile phones, USB memory sticks, and other media. |
| Security Repo | This repository includes samples of security-related data created by the owner of the site and links to other dataset repositories found online. | Networking (scanning/recon, exploitation, shell traffic, security incidents, system logs, net traffic, FTP, SSH, DNS lookups, SSL certs, UDP, TCP, URLs, exploit kits and benign traffic, malware, threat feeds, etc.) |
| Stratosphere IPS Project | Dataset captures with malware, normal, and background traffic for machine learning algorithms. | Binetflow files, Biargus files, and Pcap files |
| The Open American National Corpus | Massive electronic collection of American English, including texts of all genres and transcripts of spoken data produced from 1990 onward. | Texts and transcripts |
| UC Irvine Machine Learning Repository | This website contains machine learning datasets on almost any topic. It currently contains 351 data sets, 106 of which are related to CS and Engineering. | Anonymous Microsoft Web Data logs, mobile robots sensor data, SMS spam collection, etc. |
| Zeltser - Malware Samples | Personal website that has a few links to other malware dataset sources. | Malware samples |