Software engineering relies heavily on practical knowledge embedded in grey literature such as reports, blog posts, and forum discussions, yet gathering and analysing this information systematically is difficult because its sources and formats are so diverse. Houcine Abdelkader Cherief, Brahim Mahmoudi, and Zacharie Chenail-Larcher, from École de technologie supérieure, along with colleagues from Université du Québec à Montréal, now present GLiSE, an automated tool designed to extract relevant grey literature from key online platforms. GLiSE transforms research topics into targeted queries, collects results from sources such as GitHub and Stack Overflow, and then applies embedding-based semantic analysis to filter and rank the findings, offering a reproducible and scalable method for synthesising real-world software engineering practices. The work delivers not only a functional tool but also a curated dataset of classified search results and an evaluation of the tool’s usability, a step forward in making use of the knowledge contained in grey literature.

GLiSE addresses the time-consuming task of manually collecting and screening grey literature for comprehensive literature reviews and evidence-based research. It first converts a research question into specific queries for relevant data sources, including platforms such as GitHub and documentation repositories. It then retrieves the results and uses embedding-based machine learning classifiers to filter out irrelevant content and rank the remaining results by relevance, relying on semantic meaning rather than keyword matching alone.
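The filter-and-rank step can be illustrated with a minimal sketch. It is not GLiSE's code: it swaps the trained classifiers for a plain cosine-similarity threshold, and the three-dimensional "embeddings", the `filter_and_rank` helper, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def filter_and_rank(intent_vec, results, threshold=0.5):
    """Keep results whose similarity to the intent meets the threshold,
    sorted from most to least similar."""
    scored = [(cosine(intent_vec, vec), title) for title, vec in results]
    kept = [(score, title) for score, title in scored if score >= threshold]
    return sorted(kept, reverse=True)

# Toy hand-written vectors standing in for real text embeddings.
intent = [1.0, 0.0, 0.0]
results = [
    ("Blog post on microservice testing", [0.9, 0.1, 0.0]),
    ("Unrelated recipe page", [0.0, 0.0, 1.0]),
    ("Forum thread on test doubles", [0.6, 0.8, 0.0]),
]

ranked = filter_and_rank(intent, results)
for score, title in ranked:
    print(f"{score:.2f}  {title}")
```

In this toy run the recipe page falls below the threshold and is dropped, while the two testing-related items are kept and ordered by similarity. A real system would obtain the vectors from an embedding model and could replace the threshold with a learned classifier.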

A usability study showed that participants identified relevant sources significantly faster with GLiSE than with manual methods, indicating that the tool reduces the effort of grey literature screening. The tool combines embeddings, machine learning classifiers such as Gaussian Naive Bayes and Support Vector Machines, and systematic literature review techniques, and is implemented in Python with the scikit-learn library. Planned future development includes expanding the data sources, creating a web-based interface, increasing the size of the training dataset, and adding mechanisms to assess the trustworthiness of information. GLiSE is a promising answer to a significant bottleneck in software engineering research: handling grey literature efficiently.
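To give a sense of what a Gaussian Naive Bayes relevance classifier does with embedding features, here is a dependency-free sketch. It is an illustration only, not the tool's implementation: scikit-learn provides this model as `GaussianNB`, whereas the functions below (`fit_gaussian_nb`, `log_posterior`, `predict`), the variance floor, and the two-dimensional toy features are assumptions made for the example.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, var_floor=1e-3):
    """Estimate a prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for features, label in zip(X, y):
        by_class[label].append(features)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small floor keeps variances away from zero on tiny datasets.
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model

def log_posterior(model, x):
    """Unnormalised log posterior of each class for feature vector x."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for value, mean, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var)
            score -= (value - mean) ** 2 / (2 * var)
        scores[label] = score
    return scores

def predict(model, x):
    """Return the class with the highest log posterior."""
    scores = log_posterior(model, x)
    return max(scores, key=scores.get)

# Toy 2-D features standing in for embedding-derived features.
X = [[0.90, 0.80], [0.80, 0.90], [0.85, 0.70],
     [0.10, 0.20], [0.20, 0.10], [0.15, 0.30]]
y = ["relevant"] * 3 + ["irrelevant"] * 3

model = fit_gaussian_nb(X, y)
print(predict(model, [0.88, 0.75]))
print(predict(model, [0.12, 0.20]))
```

Because the model estimates only a mean and a variance per feature and class, it has few parameters to fit, which is one reason simple classifiers of this kind are often chosen when labelled data is limited.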

GLiSE, Automated Grey Literature Search for Software Engineering

The researchers developed GLiSE to systematically collect and assess the grey literature that software engineering research depends on, addressing the difficulties created by its diverse sources. The approach integrates web-scale search with specialized platform connectors and natural language processing for classifying and ranking results. GLiSE transforms research topics into platform-specific queries for sources such as GitHub, Stack Overflow, and Google Search. It employs embedding-based semantic classifiers to filter and rank results by relevance, prioritizing the most pertinent information. All settings are configuration-based, and the generated queries are recorded for review, which supports reproducibility. A key contribution is a curated dataset of 1,137 software engineering grey literature search results, each paired with its search intent and classified by relevance, which serves as a resource for validation and refinement. GLiSE offers a dedicated framework for the automated discovery, acquisition, and curation of grey literature at scale, handling heterogeneous data and sparse metadata.
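The idea of configuration-driven query generation with a recorded query log can be sketched as follows. The templates, the `build_queries` and `record_run` helpers, and the example topic are hypothetical; they show the general pattern rather than GLiSE's actual configuration format or query syntax.

```python
import json

# Hypothetical per-platform query templates (not GLiSE's real configuration).
CONFIG = {
    "google": '"{topic}" {keywords}',
    "github": "{topic} {keywords} in:readme",
    "stackoverflow": "{topic} {keywords}",
}

def build_queries(topic, keywords, config=CONFIG):
    """Render one query string per configured platform."""
    joined = " ".join(keywords)
    return {
        platform: template.format(topic=topic, keywords=joined)
        for platform, template in config.items()
    }

def record_run(topic, keywords, queries):
    """Serialise the inputs and generated queries so a search can be replayed."""
    return json.dumps(
        {"topic": topic, "keywords": keywords, "queries": queries},
        indent=2,
        sort_keys=True,
    )

queries = build_queries("flaky tests", ["mitigation", "CI"])
run_log = record_run("flaky tests", ["mitigation", "CI"], queries)
print(run_log)
```

Keeping the templates in data rather than in code means a search can be repeated or audited by storing the configuration and the log together.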

GLiSE Automatically Curates Software Engineering Grey Literature

Researchers created GLiSE, a new tool that automatically extracts and curates grey literature in software engineering, addressing a gap in systematic evidence collection. The diverse formats and sources of this information have hindered large-scale analysis and reproducibility, and the tool is designed to overcome those obstacles. GLiSE transforms research topics into specific queries for platforms such as GitHub, Stack Overflow, and Google Search, enabling efficient data gathering. It then employs embedding-based semantic classifiers to assess and rank results by their relevance to the original research intent, streamlining the identification of valuable information.

A key achievement is a curated dataset comprising 1,137 search results, each paired with its originating search intent and classified by semantic relevance, serving as a benchmark for evaluating GLiSE’s performance. The tool systematically collects and normalizes data from heterogeneous sources, overcoming limitations of manual approaches. GLiSE prioritizes reproducibility with configurable settings and accessible queries, ensuring reliable replication of searches and validation of findings. By automating grey literature discovery, acquisition, and curation, GLiSE delivers a breakthrough in evidence-based software engineering, enabling more comprehensive research.
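Normalizing results from heterogeneous sources into one record type, paired with the originating intent and a relevance label, might look like the sketch below. The `SearchResult` fields, the converter functions, and the shapes of the raw payloads are assumptions for illustration; they are not the schema of the published dataset or of any platform's API.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass(frozen=True)
class SearchResult:
    """Common record shared by all sources (hypothetical schema)."""
    source: str
    title: str
    url: str
    snippet: str
    intent: str
    relevant: Optional[bool] = None  # set later by a classifier or annotator

def from_repo_hit(hit, intent):
    """Map a repository-style payload onto the common record."""
    return SearchResult(
        source="github",
        title=hit["full_name"],
        url=hit["html_url"],
        snippet=hit.get("description") or "",  # sparse metadata becomes ""
        intent=intent,
    )

def from_question_hit(hit, intent):
    """Map a Q&A-style payload onto the common record."""
    return SearchResult(
        source="stackoverflow",
        title=hit["title"],
        url=hit["link"],
        snippet=hit.get("excerpt", ""),
        intent=intent,
    )

intent = "How do teams mitigate flaky tests?"
records = [
    from_repo_hit(
        {"full_name": "example/flaky-detector",
         "html_url": "https://example.com/repo",
         "description": None},
        intent,
    ),
    from_question_hit(
        {"title": "Why is my test flaky?", "link": "https://example.com/q/1"},
        intent,
    ),
]
for record in records:
    print(asdict(record))
```

Once every source is mapped onto the same fields, filtering, ranking, and labelling code no longer needs to know where a result came from.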

GLiSE Automates Grey Literature Retrieval and Screening

This work presents GLiSE, a tool that automates the retrieval and screening of grey literature from sources including GitHub, Stack Overflow, and Google Search. The system translates research topics into platform-specific queries, gathers the results, and uses machine learning classifiers to filter and rank them by relevance. A curated dataset of classified grey literature search results accompanies the tool, facilitating further research and evaluation. In a usability study, participants identified relevant sources significantly faster with GLiSE than with manual methods, indicating a clear benefit for software engineering researchers. The team acknowledges that the training dataset is small and mitigated this limitation through careful choices of features and model capacity. Future work focuses on expanding the data sources, creating a web-based version, enlarging the training dataset, and developing methods to assess the trustworthiness of information.

👉 More information
🗞 An Automated Grey Literature Extraction Tool for Software Engineering
🧠 ArXiv: https://arxiv.org/abs/2512.23066

Rohail T.
