Large Language Models Surpass Humans in Cybersecurity Knowledge

In a new study, researchers have developed benchmark datasets to evaluate the general knowledge of Large Language Models (LLMs) in cybersecurity. The creation of CyberMetric-80, CyberMetric-500, CyberMetric-2000, and CyberMetric-10000 marks a significant step towards understanding the capabilities and limitations of LLMs in this domain.

These multiple-choice QA datasets comprise questions collected from various sources, including NIST standards, research papers, publicly accessible books, RFCs, and other publications. The results show that state-of-the-art LLMs such as GPT-4o, GPT-4-Turbo, Mixtral-8x7B-Instruct, Falcon-180B-Chat, and Gemini-Pro 1.0 outperformed humans on CyberMetric-80, although highly experienced human experts still excelled on complex tasks.

The study highlights the importance of balancing human expertise with AI capabilities in cybersecurity. By making the CyberMetric dataset publicly available, the researchers enable others to benchmark and improve their own LLMs, accelerating progress in the field and paving the way for more effective solutions to complex cybersecurity challenges.

The CyberMetric dataset is a collection of multiple-choice QA benchmark datasets designed to evaluate the general knowledge of Large Language Models (LLMs) in cybersecurity. The dataset was created by researchers from the Technology Innovation Institute (TII) in Abu Dhabi, United Arab Emirates, in collaboration with other institutions. The goal of the CyberMetric dataset is to provide a diverse, accurate, and up-to-date benchmark for evaluating LLMs’ knowledge in various fields of cybersecurity.

The dataset consists of four versions: CyberMetric-80, CyberMetric-500, CyberMetric-2000, and CyberMetric-10000, containing 80, 500, 2,000, and 10,000 questions respectively. The questions were generated by GPT-3.5 using Retrieval-Augmented Generation (RAG) over documents drawn from various sources, such as NIST standards, research papers, publicly accessible books, RFCs, and other publications in the cybersecurity domain.
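
For readers who want to inspect the questions directly, the sketch below loads one of the dataset files and prints a few entries. The filename and field names (a question, four answer options keyed A-D, and a solution letter) are assumptions based on the public release and may differ from the actual schema.

```python
import json

# Load one of the CyberMetric question files. The filename and JSON schema
# below are assumptions based on the public release and may differ slightly.
with open("CyberMetric-80.json", encoding="utf-8") as f:
    data = json.load(f)

for item in data["questions"][:3]:
    print(item["question"])
    for letter, text in item["answers"].items():  # four options, A-D
        print(f"  {letter}) {text}")
    print("  correct:", item["solution"])
```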

The CyberMetric dataset is significant because it addresses a gap in the research community’s ability to accurately test the general knowledge of LLMs in cybersecurity. The dataset provides a comprehensive benchmark for evaluating LLMs’ performance in various fields of cybersecurity, including cryptography, reverse engineering, and risk assessment.

How Was the CyberMetric Dataset Created?

The creation of the CyberMetric dataset involved several steps. First, the researchers collected documents from various sources in the cybersecurity domain, including NIST standards, research papers, publicly accessible books, RFCs, and other publications.

Next, the researchers used GPT-3.5 with RAG over these documents to generate questions for each version of the dataset (CyberMetric-80, CyberMetric-500, CyberMetric-2000, and CyberMetric-10000). Each question was designed to have four possible answers with a single correct option. The generated items then underwent several rounds of error checking and refinement to ensure accuracy and relevance.
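
To make the generation step concrete, here is a hypothetical sketch of how a retrieved document excerpt could be turned into a four-option question with GPT-3.5. The prompt wording, function name, and client code are illustrative assumptions, not the authors' actual pipeline.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "You are given an excerpt from a cybersecurity document.\n"
    "Write one multiple-choice question about it with exactly four options "
    "(A-D), and state which single option is correct.\n\nExcerpt:\n{chunk}"
)

def draft_question(chunk: str) -> str:
    """Ask GPT-3.5 to draft one four-option question from a retrieved excerpt."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT.format(chunk=chunk)}],
    )
    return response.choices[0].message.content

# In the actual pipeline the excerpts would come from a retriever over NIST
# standards, RFCs, books, and papers; here we pass one hard-coded snippet.
print(draft_question("AES-GCM is an authenticated encryption mode that..."))
```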

Human experts invested over 200 hours validating the questions and solutions to ensure their accuracy and relevance. This process helped filter out any questions unrelated to cybersecurity. The resulting dataset is a comprehensive benchmark for evaluating LLMs’ knowledge in various fields of cybersecurity.

What are Large Language Models (LLMs) and How Do They Relate to Cybersecurity?

Large Language Models (LLMs) are artificial intelligence models that can process and generate human-like language. These models have been increasingly used across various domains, including software development, cyber threat intelligence, and more. However, understanding the different fields of cybersecurity poses a challenge even for human experts.

To accurately test the general knowledge of LLMs in cybersecurity, researchers need a diverse, accurate, and up-to-date dataset. The CyberMetric dataset was created to address this gap. By pairing an LLM such as GPT-3.5 with RAG, the researchers were able to generate questions grounded in authoritative sources across the various fields of cybersecurity.

LLMs have the potential to revolutionize the field of cybersecurity by providing a scalable and efficient way to analyze vast amounts of data. However, their accuracy and reliability in specific domains, such as cybersecurity, need to be evaluated using comprehensive benchmarks like the CyberMetric dataset.

What is the Significance of Evaluating LLMs on the CyberMetric Dataset?

Evaluating LLMs on the CyberMetric dataset provides a comprehensive benchmark for comparing the general knowledge of humans and LLMs in cybersecurity. The results can serve as a reference point for researchers, developers, and practitioners to evaluate the performance of various LLM models.

The evaluation tested 25 state-of-the-art LLMs on the CyberMetric datasets. In addition, human participants solved CyberMetric-80 in a closed-book setting. The results showed that GPT-4o, GPT-4-Turbo, Mixtral-8x7B, and other top-performing LLMs were more accurate than humans on CyberMetric-80.
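
As an illustration of how such an evaluation can be scored, the sketch below computes a model's accuracy over one dataset file. The `ask_model` placeholder stands in for whichever LLM call is being benchmarked, and the file schema is assumed to match the loading example earlier in this article.

```python
import json

def ask_model(question: str, answers: dict[str, str]) -> str:
    """Placeholder: return the benchmarked model's chosen option letter (A-D)."""
    raise NotImplementedError  # wire up your own LLM call here

def evaluate(path: str) -> float:
    """Score a model on one CyberMetric file and return its accuracy."""
    with open(path, encoding="utf-8") as f:
        questions = json.load(f)["questions"]
    correct = sum(
        ask_model(q["question"], q["answers"]).strip().upper() == q["solution"]
        for q in questions
    )
    return correct / len(questions)

# Example (commented out until ask_model is implemented):
# print(f"accuracy: {evaluate('CyberMetric-80.json'):.1%}")
```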

However, highly experienced human experts still outperformed smaller models such as Llama 3 8B, Phi-2, and Gemma 7B. The evaluation highlights the potential of LLMs in cybersecurity but also underscores the need for further research and development to improve their accuracy and reliability.

What are the Implications of the CyberMetric Dataset for the Field of Cybersecurity?

The creation and evaluation of the CyberMetric dataset carry significant implications for cybersecurity, giving the community a validated benchmark that spans areas from cryptography and reverse engineering to risk assessment.

The evaluation results reinforce the potential of LLMs to support cybersecurity work at scale, while also showing that their accuracy in specific subfields must still be verified against comprehensive benchmarks such as CyberMetric.

The implications are far-reaching: with models vetted against such a benchmark, researchers and practitioners can build more dependable methods for analyzing data, detecting threats, and mitigating risks across cybersecurity domains.

What is the Future of the CyberMetric Dataset?

The future of the CyberMetric dataset is bright, with significant potential for growth and development. The dataset has been made publicly available on GitHub (https://github.com/CyberMetric), allowing researchers and practitioners to access and utilize it for various purposes.

As the field of cybersecurity continues to evolve, the CyberMetric dataset will likely become an essential tool for evaluating LLMs’ knowledge in various domains. Researchers and developers can build upon the existing dataset by adding new questions, improving the evaluation process, or exploring new applications.

The future of the CyberMetric dataset is closely tied to the development of more advanced LLMs that can analyze cybersecurity data accurately and reliably, and it offers a ready yardstick for measuring that progress.

Conclusion

The CyberMetric dataset is a significant contribution to the field of cybersecurity, providing a comprehensive benchmark for evaluating LLMs’ knowledge in various fields. The creation and evaluation process involved several steps, including collecting documents from various sources, generating questions using GPT-3.5 and RAG, and refining the results through human validation.

Its implications are far-reaching, and the dataset leaves ample room for growth as researchers and practitioners continue to explore new applications and improve existing methods.

By adopting CyberMetric as a common yardstick, the community can measure how well LLMs actually understand cybersecurity before entrusting them with data analysis, threat detection, and risk mitigation.

Publication details: “CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge”
Publication Date: 2024-09-02
Authors: Norbert Tihanyi, Mohamed Amine Ferrag, Ridhi Jain, Tamás Bisztray, et al.
Source: 2024 IEEE International Conference on Cyber Security and Resilience (CSR)
DOI: https://doi.org/10.1109/csr61664.2024.10679494
