SoReL-20M (Sophos-ReversingLabs – 20 Million)

Sophos-ReversingLabs 20 Million dataset.
The baseline models available at s3://sorel-20m/09-DEC-2020/baselines were produced by the code included in this repository.

This code depends on the SoReL-20M dataset, available via Amazon S3 at s3://sorel-20m/09-DEC-2020/processed-data/. To train the LightGBM models, you can either use the npz files available at s3://sorel-20m/09-DEC-2020/lightGBM-features/ or use the scripts included here to extract the required files from the processed data.

If you use this code or this data in your own research, please cite our paper: [Link forthcoming].

The full size of this dataset is approximately 8 TB. We strongly recommend that you download only the specific elements you need.
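
As a rough illustration of fetching only part of the dataset, the sketch below builds an AWS CLI command that copies a single subdirectory rather than the whole bucket. It is not part of this repository: the helper name `build_download_command`, the local destination path, and the `*.npz` filter are hypothetical, and the use of `--no-sign-request` assumes the bucket allows anonymous reads. The command is built with `--dryrun` by default so you can see what would be transferred before committing to a large download.

```python
import shlex

# Dataset root as given in this README.
DATASET_ROOT = "s3://sorel-20m/09-DEC-2020/"


def build_download_command(subdir, dest_dir, include=None, dry_run=True):
    """Build an `aws s3 cp` command for one subdirectory of the dataset.

    subdir   -- e.g. "baselines", "lightGBM-features" or "processed-data"
    dest_dir -- local destination directory (hypothetical, choose your own)
    include  -- optional glob; when set, only matching objects are copied
    dry_run  -- when True, the CLI only lists what it would copy
    """
    source = DATASET_ROOT + subdir.strip("/") + "/"
    # Assumption: the bucket permits unsigned (anonymous) requests.
    cmd = ["aws", "s3", "cp", source, dest_dir, "--recursive", "--no-sign-request"]
    if include:
        # Exclude everything first, then re-include the wanted pattern.
        cmd += ["--exclude", "*", "--include", include]
    if dry_run:
        cmd.append("--dryrun")
    return cmd


if __name__ == "__main__":
    # Preview a download of only the npz feature files.
    command = build_download_command(
        "lightGBM-features", "./sorel-20m/lightGBM-features/", include="*.npz"
    )
    print(shlex.join(command))
    # To actually run it, pass dry_run=False and hand the list to
    # subprocess.run(command, check=True).
```

Running the script only prints the command. Review the dry-run output from the AWS CLI first, since even a single subdirectory may be large.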