KEENHash, a hashing approach built on function embeddings generated by a large language model, significantly accelerates binary code similarity analysis. It enables program-level comparison by condensing each binary into a fixed-length representation, runs at least 215 times faster than existing function-matching tools, and delivers superior malware detection across extensive datasets.
Analysing the similarity of binary code is fundamental to numerous areas of computer science, notably cybersecurity, where identifying malicious software variants and vulnerabilities requires efficient comparison of program behaviour. Current methods rely on detailed function-level comparisons and struggle to scale to vast codebases. Researchers at ShanghaiTech University, Tencent Security Keen Lab, and the University of Glasgow address this challenge with a hashing technique detailed in their paper, ‘KEENHash: Hashing Programs into Function-Aware Embeddings for Large-Scale Binary Code Similarity Analysis’. Zhi Jie Liu, Qi Yi Tang, Sen Nie, Shi Wu, Liang Feng Zhang, and Yu Tian Tang demonstrate a system that transforms binary programs into compact, fixed-length embeddings using large language models and feature hashing. This enables faster and more scalable similarity searches, with speed improvements of over 215 times compared to existing function-matching tools and superior malware detection across extensive datasets.
KEENHash targets the challenges inherent in large-scale binary code similarity analysis (BCSA). It builds program-level representations from large language model (LLM)-generated function embeddings, condensing each binary into a compact, fixed-length program embedding. The approach runs at least 215 times faster than current state-of-the-art function-matching tools, and it completes 5.3 billion similarity evaluations in 395.83 seconds, a task requiring a minimum of 56 days using conventional methods.
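To make the idea of a fixed-length program embedding concrete, here is a minimal sketch of pooling a variable number of function embeddings with a hashing trick. It is an illustration under stated assumptions, not the authors' implementation: the function names, the use of a sign-pattern signature as the hash key, the bucket count, and the toy 3-dimensional vectors are all hypothetical.

```python
import hashlib
import math


def bucket_index(func_embedding, num_buckets):
    """Pick a bucket from the embedding's sign pattern (an assumed hash key)."""
    signature = bytes(1 if x >= 0 else 0 for x in func_embedding)
    digest = hashlib.sha256(signature).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def program_embedding(func_embeddings, num_buckets=8):
    """Accumulate any number of function embeddings into num_buckets * dim floats."""
    dim = len(func_embeddings[0])
    pooled = [0.0] * (num_buckets * dim)
    for emb in func_embeddings:
        offset = bucket_index(emb, num_buckets) * dim
        for i, value in enumerate(emb):
            pooled[offset + i] += value
    return pooled


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy 3-dimensional "function embeddings"; real LLM embeddings are far wider.
program_a = [[0.5, -0.2, 0.1], [-0.3, 0.4, 0.9]]
program_b = [[-0.3, 0.4, 0.9], [0.5, -0.2, 0.1]]  # same functions, reordered
program_c = program_a + [[0.7, 0.7, -0.1]]        # one extra function

emb_a = program_embedding(program_a)
emb_b = program_embedding(program_b)
emb_c = program_embedding(program_c)
print(len(emb_a), len(emb_c), cosine_similarity(emb_a, emb_b))
```

Because every program maps to a vector of the same length, comparing two programs costs one vector operation regardless of how many functions each contains, which is the property that makes program-level search scale.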
Performance evaluations indicate that Pythia-410M offers the best balance between accuracy and resource use for function embedding. Larger LLMs, such as Pythia-1B and StarCoder-1B, do not demonstrably improve performance, while smaller models like Pythia-160M and Jina-137M are noticeably less accurate. Pythia-410M achieves a Recall@1 of 0.7915, a Mean Reciprocal Rank (MRR) of 0.8588, and a Recall@5 of 0.9267, establishing it as the preferred model for this application. MRR is the average, over all queries, of the reciprocal rank of the first relevant result, while Recall@k is the proportion of relevant results retrieved within the top k.
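The two retrieval metrics can be computed in a few lines. The sketch below follows the standard definitions; the helper names and the toy ranked lists are illustrative assumptions, not data from the paper.

```python
def reciprocal_rank(ranked, relevant):
    """Return 1/position of the first relevant item, or 0.0 if none is ranked."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(results):
    """Average the reciprocal rank over (ranked_list, relevant_set) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in results) / len(results)


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top k results."""
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mean_recall_at_k(results, k):
    """Average Recall@k over all queries."""
    return sum(recall_at_k(r, rel, k) for r, rel in results) / len(results)


# Three toy queries, each with exactly one matching function.
results = [
    (["f1", "f7", "f3"], {"f1"}),        # match at rank 1
    (["f9", "f2", "f4"], {"f2"}),        # match at rank 2
    (["f5", "f6", "f8", "f0"], {"f0"}),  # match at rank 4
]

mrr = mean_reciprocal_rank(results)      # (1 + 1/2 + 1/4) / 3
recall_1 = mean_recall_at_k(results, 1)  # only the first query hits at rank 1
recall_5 = mean_recall_at_k(results, 5)  # every match is within the top 5
print(mrr, recall_1, recall_5)
```

When each query has exactly one relevant item, as in this toy set, Recall@1 can never exceed MRR or Recall@5, since a hit at rank 1 contributes fully to all three.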
The research explores two hashing strategies, KEENHash-stru and KEENHash-sem, both offering advantages in program clone search. KEENHash-stru focuses on structural features of the code, while KEENHash-sem emphasises semantic characteristics, allowing for a more comprehensive representation of binary programs. The investigation demonstrates that utilising either approach independently yields strong results, providing flexibility in adapting to different analysis requirements.
A hybrid search combining both modalities was also explored, but it did not yield significant improvements over either approach alone, achieving a mean Average Precision (mAP) of 0.9384 and 0.9300 at a recall of 100. mAP is the mean, over a set of queries, of each query's average precision, which rewards rankings that place relevant results near the top. This suggests that the benefits of combining the two approaches are limited, and that optimising either KEENHash-stru or KEENHash-sem is a more efficient path to improved performance.
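For readers unfamiliar with mAP, the following sketch computes it from ranked result lists using the standard definition. The function names and the toy queries are assumptions made for illustration and do not come from the paper's evaluation.

```python
def average_precision(ranked, relevant):
    """Mean of precision@position taken at each position holding a relevant item."""
    hits = 0
    precision_sum = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(results):
    """Average the per-query average precision over (ranked_list, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in results) / len(results)


# Query 1 finds its two clones at ranks 1 and 3; query 2 finds its clone at rank 2.
results = [
    (["a", "b", "c", "d"], {"a", "c"}),  # AP = (1/1 + 2/3) / 2
    (["x", "y", "z"], {"y"}),            # AP = (1/2) / 1
]

ap_first = average_precision(*results[0])
ap_second = average_precision(*results[1])
map_score = mean_average_precision(results)
print(ap_first, ap_second, map_score)
```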
Further analysis identifies a potential weakness in the Number of Strings (NoS) feature used within KEENHash-sem: it is susceptible to variations in compilation options and to obfuscation techniques. Attackers can manipulate the strings within a binary to evade detection, rendering the NoS feature unreliable. To address this, the research incorporates more robust features such as Lines of Code (LoC).
Across the evaluated binary code datasets, KEENHash demonstrates superior performance, which the researchers attribute to the combination of LLM-generated function embeddings, careful selection of the embedding LLM, and the incorporation of robust features like LoC. The results establish KEENHash as a promising solution for efficient and effective binary code analysis.
This superior performance addresses the limitations of existing BCSA systems, providing a significant advancement in the field of cybersecurity and a clear path towards building more effective and reliable tools for detecting and mitigating malicious code.
The investigation into KEENHash modalities highlights their complementary strengths in representing binary programs from different perspectives. KEENHash-stru captures the structural organisation of the code, while KEENHash-sem focuses on the semantic meaning of the instructions. This dual representation provides a more comprehensive understanding of the binary, enabling more accurate similarity comparisons.
Future work should focus on exploring more sophisticated methods for combining KEENHash-stru and KEENHash-sem, potentially leveraging techniques like ensemble learning or feature fusion. Additionally, the researchers intend to explore the application of KEENHash to other cybersecurity tasks, such as vulnerability detection and malware classification.
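One simple form such a combination could take is late fusion, where each modality scores the candidates independently and the scores are blended before ranking. The sketch below is a hypothetical illustration only: the function names, the weighting scheme, and the scores are assumptions, and it is not the hybrid search evaluated in the paper.

```python
def fuse_scores(stru_scores, sem_scores, weight=0.5):
    """Blend per-candidate similarity scores from two modalities.

    A candidate missing from one modality is treated as scoring 0.0 there.
    """
    candidates = set(stru_scores) | set(sem_scores)
    return {
        name: weight * stru_scores.get(name, 0.0)
        + (1.0 - weight) * sem_scores.get(name, 0.0)
        for name in candidates
    }


def rank_candidates(scores):
    """Sort candidates by descending score, breaking ties by name."""
    return sorted(scores, key=lambda name: (-scores[name], name))


# Hypothetical similarity scores for three candidate programs.
stru_scores = {"prog_a": 0.9, "prog_b": 0.4}
sem_scores = {"prog_a": 0.2, "prog_b": 0.8, "prog_c": 0.5}

fused = fuse_scores(stru_scores, sem_scores, weight=0.5)
ranking = rank_candidates(fused)
print(ranking)
```

Varying `weight` between 0 and 1 moves the ranking from purely semantic to purely structural, which is one way to test whether any blend beats either modality on its own.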
By addressing the limitations of existing BCSA systems and introducing innovative techniques, this research makes a significant contribution to the field of cybersecurity. KEENHash provides a powerful and efficient solution for analysing binary code, enabling faster and more accurate detection of malicious software. The results demonstrate the potential of LLM-based approaches to revolutionise the field of cybersecurity.
👉 More information
🗞 KEENHash: Hashing Programs into Function-Aware Embeddings for Large-Scale Binary Code Similarity Analysis
🧠 DOI: https://doi.org/10.48550/arXiv.2506.11612
