Featured Dataset
The NIST Cybersecurity Training dataset has been downloaded over 2,200 times since its release. It contains 124,946 validated links to NIST cybersecurity publications with pre-computed embeddings and FAISS indices for efficient similarity search.
Full Collection
Open Source Security Compliance
A collection of my open source security compliance datasets, fine-tuned LLMs, and pre-computed embeddings.

What’s In the Collection
Datasets - NIST cybersecurity publications with pre-computed embeddings and FAISS indices for similarity search. The training data covers SP 800-series, CSF, FIPS, Zero Trust, privacy, cloud security, IoT, and supply chain risk management documents.
Fine-Tuned Models - LLMs trained on the datasets above. HackIDLE-NIST-Coder is the flagship model: Qwen2.5-Coder-7B fine-tuned via LoRA on 530K+ examples from 596 NIST documents. Available through Ollama and in GGUF and MLX formats.
Embeddings - Pre-computed vector embeddings for efficient semantic search across NIST publications. Useful for building RAG pipelines or similarity-based lookup tools.
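To make the similarity-search idea concrete, here is a minimal sketch of the retrieval step in a RAG pipeline: rank documents by cosine similarity between a query vector and pre-computed document vectors. It is not the collection's actual loading code. The vectors are toy 3-dimensional stand-ins, the document labels are made up for illustration, and the function names `cosine` and `top_k` are hypothetical. A FAISS index plays the same role as this brute-force loop, only much faster on a large corpus.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float],
          corpus: List[Tuple[str, Sequence[float]]],
          k: int = 2) -> List[Tuple[str, float]]:
    """Return the k corpus entries most similar to the query, best first."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in corpus]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Assumption: toy vectors standing in for real pre-computed embeddings.
# Real embeddings have hundreds of dimensions and come from the dataset files.
corpus = [
    ("SP 800-53 (toy)", [0.9, 0.1, 0.0]),
    ("SP 800-207 (toy)", [0.1, 0.9, 0.1]),
    ("FIPS 140-3 (toy)", [0.0, 0.2, 0.9]),
]

query_vec = [0.2, 0.8, 0.0]  # pretend embedding of a Zero Trust question
results = top_k(query_vec, corpus, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

In a real pipeline the query would be embedded with the same model that produced the stored vectors, and the top-ranked passages would be passed to the LLM as context.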
Why Open Source
Compliance knowledge shouldn’t be locked behind vendor paywalls. NIST publications are public domain. The datasets derived from them and the models trained on those datasets should be too.
The collection is released under CC0 1.0 (public domain), so anyone can use it for commercial, research, or any other purpose. The goal is to lower the barrier to compliance knowledge, especially for smaller organizations that can’t afford enterprise GRC tools.
Related
- HackIDLE-NIST-Coder - The fine-tuned model built from these datasets
- myctrl.tools - Security controls reference covering 105+ frameworks





