Xian Gong, Paul X. McCarthy, Colin Griffith, Claire McFarland, and Marian-Andrei Rizoiu have introduced Cosmos 1.0, a novel dataset and methodology for mapping the landscape of emerging technologies and related entities. Published November 19, 2025, this work details a dataset comprising 23,544 technology-adjacent entities (TA23k), each represented by a 100-dimensional contextual embedding vector and categorized into seven thematic tech-clusters (TC7) and three meta tech-clusters (TC3). The dataset incorporates external indices, including the Technology Awareness, Generality, Deeptech, and Age of Tech indices, and extensive metadata from sources such as Wikipedia, Crunchbase, and Google Scholar to assess and validate the relevance of emerging technologies.
Cosmos 1.0 Dataset and Technology Mapping
The Cosmos 1.0 dataset introduces a novel methodology for mapping emerging technologies and comprises 23,544 technology-adjacent entities (TA23k). These entities are structured hierarchically and categorized using eight external indices. Each entity is represented by a 100-dimensional contextual embedding vector, allowing for assignment into seven thematic tech-clusters (TC7) and three meta tech-clusters (TC3). A subset of 100 emerging technologies (ET100) within the TA23k has undergone manual verification, enhancing dataset reliability.
This dataset incorporates indices designed to assess emerging technologies, including the Technology Awareness, Generality, Deeptech, and Age of Tech Index. Data sources include extensive metadata from Wikipedia, Crunchbase, Google Books, OpenAlex, and Google Scholar, used to validate index relevance and accuracy. The aim is to provide researchers, policymakers, and corporations with a tool for informed decision-making and resource allocation in a rapidly evolving technological landscape.
The Cosmos 1.0 approach utilizes a “bottom-up” methodology, leveraging the Wikipedia corpus and natural language processing (NLP) techniques to identify and explore the structure of technology-adjacent space. Entity embeddings, derived from Wikipedia articles, provide context-specific representations, and dimensionality reduction and clustering algorithms reveal hierarchical relationships, resulting in the three-level TC3, TC7, and ET100 structure.
Technology-Adjacent Entities (TA23k) and Structure
The Cosmos 1.0 dataset features 23,544 technology-adjacent entities (TA23k) organized with a hierarchical structure. This structure is categorized into three meta tech-clusters (TC3), seven theme tech-clusters (TC7), and a subset of 100 manually verified emerging technologies (ET100). Each entity within the TA23k is represented by a 100-dimensional contextual embedding vector, allowing for analysis and assignment to these thematic and meta clusters. This “bottom-up” approach leverages the Wikipedia corpus and NLP techniques to define and explore the technology landscape.
This dataset utilizes a novel methodology, moving away from traditional “top-down” expert panels. Instead, it employs NLP techniques and data from sources like Wikipedia, Crunchbase, and Google Books to construct a technology-adjacent space. Dimensionality reduction and clustering algorithms are then applied to detect the hierarchical structure within this space, resulting in the TC3, TC7, and ET100 classifications. This approach aims to identify emerging technologies through data analysis rather than subjective expert opinion.
The study highlights the limitations of current emerging technology identification methods, specifically the reliance on patents, publications, and news articles – representing 89% of data sources. Cosmos 1.0 aims to address this by using a broader range of data and a “bottom-up” approach. The resulting dataset provides technology indices—like the Technology Awareness Index—to filter both mature and emerging technologies, aiding informed decision-making for researchers, policymakers, and corporations.
Manually Verified Emerging Technologies (ET100)
The Cosmos 1.0 dataset includes 23,544 technology-adjacent entities (TA23k) organized with a hierarchical structure and categorized by eight external indices. Within this broad dataset, a specific subset of 100 emerging technologies (ET100) has been manually verified, providing a focused collection of recognizable advancements. This manual verification process aims to ensure the accuracy and relevance of identified emerging technologies within the larger technology landscape mapped by the dataset.
This dataset employs a “bottom-up” approach, leveraging the Wikipedia corpus and natural language processing (NLP) techniques to identify and explore technology-adjacent space. The process results in a three-level hierarchical tree – three meta tech-clusters (TC3), seven theme tech-clusters (TC7), and the 100 manually verified emerging technologies (ET100). The ET100 represents the most granular level, offering a focused list of confirmed emerging technologies.
The inclusion of manually verified ET100 within the Cosmos 1.0 dataset is designed to assist researchers, policymakers, and corporations in making informed decisions. By identifying these technologies early, stakeholders can strategically allocate resources, foster development, and proactively adapt to changes and opportunities. This dataset aims to facilitate sustainable growth and maintain competitive advantage by pinpointing key areas of technological advancement.
Technology Awareness, Generality, and Deeptech Indices
The Cosmos 1.0 dataset includes several indices designed to assess emerging technologies, notably the Technology Awareness Index, Generality Index, Deeptech, and Age of Tech Index. These indices are used to filter both mature and emerging technologies from a universe of 23,544 technology-adjacent entities (TA23k). This filtering process aims to help stakeholders make informed decisions and allocate resources effectively, fostering sustainable growth and competitive advantage in rapidly evolving fields.
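As a sketch of how such index-based filtering might look in practice: the index names below echo the paper, but the column names, entity rows, index values, and thresholds are all hypothetical, invented purely for illustration.

```python
import pandas as pd

# Toy slice of entity metadata; values are hypothetical, and only
# the index names echo those described in the Cosmos 1.0 paper.
entities = pd.DataFrame({
    "entity": ["quantum_computing", "steam_engine", "crispr", "fax_machine"],
    "awareness_index": [0.81, 0.95, 0.74, 0.40],
    "age_of_tech_index": [0.15, 0.98, 0.10, 0.90],
})

# One plausible filter: high awareness combined with a low "age of
# tech" score suggests an emerging (rather than mature) technology.
emerging = entities[
    (entities["awareness_index"] > 0.5) & (entities["age_of_tech_index"] < 0.5)
]
print(sorted(emerging["entity"]))  # → ['crispr', 'quantum_computing']
```

Flipping the age threshold would instead select mature technologies, which is the dual use the paper describes for these indices.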
This research utilizes a “bottom-up” approach, leveraging the Wikipedia corpus and Natural Language Processing (NLP) techniques to explore the structure of technology-adjacent space. Entity embeddings, derived from Wikipedia articles, provide context-specific representations for these technologies. These embeddings are then used with dimensionality reduction and clustering algorithms to create a three-level hierarchical structure: three meta tech-clusters (TC3), seven theme tech-clusters (TC7), and a verified set of 100 emerging technologies (ET100).
The study addresses limitations in existing methods for identifying emerging technologies, which often rely on limited data types like patents, publications, and news articles. By employing a “bottom-up” approach with Wikipedia and NLP, the Cosmos 1.0 dataset offers a more comprehensive exploration of the technology landscape. The resulting hierarchical structure and associated indices aim to provide insights into both the nature and functions of these technologies for researchers, policymakers, and corporations.
Age of Tech Index and Data Sources
The Cosmos 1.0 dataset includes a novel “bottom-up” methodology for mapping emerging technologies, culminating in a dataset of 23,544 technology-adjacent entities (TA23k). This dataset is structured hierarchically, moving from three meta tech-clusters (TC3) down to seven thematic tech-clusters (TC7) and finally to 100 manually verified emerging technologies (ET100). The study leverages the Wikipedia corpus and natural language processing techniques to identify and explore these technology-adjacent spaces.
Several indices are incorporated to assess technologies within the dataset, including the Technology Awareness Index, Generality Index, Deeptech, and the Age of Tech Index. These indices are designed to filter both mature and emerging technologies from the larger universe of technology-adjacent entities. The study emphasizes a shift away from traditional “top-down” methods relying on expert panels, instead utilizing a data-driven approach.
Data sources for this research include Wikipedia, Crunchbase, Google Books, OpenAlex, and Google Scholar. The majority (89%) of data used in emerging technology forecasting typically comes from patents, publications, and news articles. However, this study aims to diversify data sources, using Wikipedia’s extensive content and linked data to validate relevance and accuracy of constructed indices, providing a comprehensive overview of the technology landscape.
Identifying Emerging Technologies: Importance and Impact
Identifying emerging technologies is crucial as these innovations drive economic benefits, improved health outcomes, and sustained innovation. The source highlights that understanding and adopting technologies early can lead to growth and competitiveness, as evidenced by a study of 181 companies demonstrating the broad positive impact of digital adoption in the retail industry. Technologies like artificial intelligence, robotics, and new materials are proving pivotal for economies worldwide.
The source details a “bottom-up” methodology for identifying emerging technologies, differing from traditional “top-down” approaches relying on expert panels. This new method leverages the Wikipedia corpus and natural language processing (NLP) techniques to define a universe of 23,544 technology-adjacent entities (TA23k). From this, 100 emerging technologies (ET100) were manually verified, creating a hierarchical structure with three meta tech-clusters (TC3) and seven theme tech-clusters (TC7).
This dataset, built using entity embeddings and dimensionality reduction, aims to help stakeholders make informed decisions about resource allocation and foster technological development. By analyzing a broad range of data—including Wikipedia content—researchers can move beyond relying solely on patents, publications, and news articles—which currently comprise 89% of data used in technology forecasting—to gain unique insights into emerging technologies.
Qualitative and Quantitative Methods for Tech Identification
Both qualitative and quantitative methods are currently employed to identify emerging technologies. Qualitative approaches traditionally utilize “top-down” processes, relying on panels of experts to discuss and vote on critical technologies—a method used by organizations like the OECD, WEF, and MIT Technology Review. The Delphi method exemplifies this, systematically gathering and refining expert opinions through iterative questionnaires. Quantitative methods are rapidly evolving with the application of text analysis and deep learning techniques.
Quantitative methods historically relied on data sources like publications, patents, and news articles, often counting keywords, authors, or citations. However, the development of tools like natural language processing (NLP), subject-action-object (SAO) structure analysis, and large language models (LLMs) offers potential for creating new data types. Currently, 89% of data sources for emerging technology forecasting are still based on these traditional publications, patents, and news articles.
This study takes a “bottom-up” approach, leveraging the Wikipedia corpus and NLP techniques to identify and explore technology-adjacent space. The researchers utilized Wikipedia2Vec, a pre-trained language model, to filter relevant articles based on cosine similarity. This resulted in a hierarchical structure comprised of three meta tech-clusters (TC3), seven thematic tech-clusters (TC7), and a manually verified set of 100 emerging technologies (ET100) representing the lowest level of the hierarchy.
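The cosine-similarity filtering step can be illustrated with toy vectors standing in for Wikipedia2Vec entity embeddings. The 100-dimensional random vectors, the seed entity, and the 0.5 cutoff below are all illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot product over norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
# Hypothetical 100-d embeddings standing in for Wikipedia2Vec vectors.
seed = rng.normal(size=100)  # e.g. the vector for a seed "technology" entity
candidates = {
    "near_tech": seed + 0.1 * rng.normal(size=100),  # close to the seed
    "far_topic": rng.normal(size=100),               # unrelated direction
}

THRESHOLD = 0.5  # illustrative cutoff, not the paper's actual value
kept = [name for name, vec in candidates.items()
        if cosine_similarity(seed, vec) >= THRESHOLD]
print(kept)  # → ['near_tech']
```

Applied over the full Wikipedia corpus with real entity vectors, this kind of similarity filter is what narrows millions of articles down to a technology-adjacent subset.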
Limitations of Current Emerging Tech Methods
Current methods for identifying emerging technologies face limitations in the types of data used and the methodology applied. The majority (89%) of data sources rely on patents, publications, and news articles. This reliance hinders a broader understanding of the technology landscape. The source highlights a historical preference for “top-down” approaches, like expert panels using the Delphi method, rather than methods leveraging diverse data sources for a more comprehensive view.
Quantitative methods traditionally count keywords, authors, or citations, but the development of tools like natural language processing (NLP) and large language models (LLMs) offers potential for creating new data types. The study addresses this by utilizing a “bottom-up” approach, leveraging the Wikipedia corpus and NLP techniques to explore the underlying structure of technology-adjacent space. This aims to move beyond solely relying on publications and patents.
The Cosmos 1.0 dataset attempts to address these limitations by creating entity embeddings of 23,544 technology-adjacent entities (TA23k) and a set of technology indices. By using Wikipedia and techniques like Wikipedia2Vec, the researchers aim to filter and explore emerging technologies based on context-specific representations, rather than relying solely on traditional data sources and top-down methodologies. This results in a three-level hierarchical tree, including 100 manually verified emerging technologies (ET100).
Data Sources: Patents, Publications, and News
The Cosmos 1.0 dataset utilizes a variety of data sources to map emerging technologies, with 89% of current forecasting methods relying on patents, publications, and news articles. This study moves beyond solely using these traditional sources by leveraging the Wikipedia corpus and Natural Language Processing (NLP) techniques. The dataset comprises 23,544 technology-adjacent entities (TA23k), aiming to provide a more comprehensive understanding of the technology landscape and aid in identifying key advancements.
This research employs a “bottom-up” approach, utilizing Wikipedia’s extensive content and hyperlinks to define a universe of technology-adjacent entities. The Wikipedia2Vec model, a pre-trained language model with entity embeddings, filters relevant articles and provides context-specific representations of technologies. This contrasts with traditional “top-down” methodologies relying on expert panels and voting, offering a data-driven alternative for identifying and mapping emerging technology trends.
The resulting dataset features a hierarchical structure, categorized into three meta tech-clusters (TC3), seven theme tech-clusters (TC7), and a manually verified set of 100 emerging technologies (ET100). These clusters are derived using dimensionality reduction and clustering algorithms applied to entity embeddings, providing a visualized map of the technology-adjacent space. The aim is to assist stakeholders in making informed decisions regarding resource allocation and fostering sustainable growth.
Natural Language Processing and Text Analysis Tools
The Cosmos 1.0 dataset utilizes natural language processing (NLP) techniques, alongside methods like subject-action-object (SAO) structure analysis and large language models (LLMs), to create new data for detecting emerging technologies. Current quantitative methods often rely on keyword counts from publications, patents, and news articles; however, advancements in text-mining are expanding data diversity and offering insights into the nature of emerging technologies. This approach aims to overcome limitations found in traditional “top-down” methodologies.
This dataset leverages the Wikipedia corpus and NLP to explore the underlying structure of technology-adjacent space. Wikipedia’s extensive and reliable content, edited by numerous experts, provides relevant descriptions and links. Entity embeddings, derived using a pre-trained language model called Wikipedia2Vec, enable filtering of technology-adjacent articles based on cosine similarity and provide context-specific representations of technologies.
The study employs dimensionality reduction and clustering algorithms to detect and visualize the hierarchical structure among technology-adjacent entities. This “bottom-up” approach results in a three-level hierarchical tree consisting of three meta tech-clusters (TC3), seven theme tech-clusters (TC7), and a manually verified set of 100 emerging technologies (ET100). The dataset includes 23,544 technology-adjacent entities (TA23k).
Subject-Action-Object Structure and Knowledge Networks
The Cosmos 1.0 dataset utilizes natural language processing (NLP) techniques, including subject-action-object (SAO) structure analysis and knowledge networks, to identify emerging technologies. This “bottom-up” approach contrasts with traditional “top-down” methods reliant on expert panels. By leveraging the Wikipedia corpus, the study aims to move beyond data limitations previously focused on patents, publications, and news articles. This allows for a broader exploration of technology-adjacent spaces and a deeper understanding of emerging tech structures.
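A minimal illustration of the SAO idea follows. Real SAO structure analysis relies on dependency parsing with an NLP library; this naive rule-based extractor, with its invented verb list and example sentence, only demonstrates the shape of the subject-action-object triples such analysis produces.

```python
# Naive subject-action-object (SAO) extractor for simple
# "subject verb object" sentences. The verb list is a toy assumption;
# production SAO analysis uses dependency parsing instead.
VERBS = {"enables", "improves", "detects"}

def extract_sao(sentence):
    tokens = sentence.rstrip(".").split()
    for i, tok in enumerate(tokens):
        if tok in VERBS:
            # Everything before the verb is the subject,
            # everything after is the object.
            return (" ".join(tokens[:i]), tok, " ".join(tokens[i + 1:]))
    return None  # no known verb found

print(extract_sao("machine learning improves medical diagnosis"))
# → ('machine learning', 'improves', 'medical diagnosis')
```

Aggregating such triples across a corpus is what lets SAO-based methods build knowledge networks linking technologies to the functions they perform.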
The dataset constructs a hierarchical structure of technology-adjacent entities (TA23k), culminating in 100 manually verified emerging technologies (ET100). This structure is visualized as three meta tech-clusters (TC3) and seven theme tech-clusters (TC7), arranged from top to bottom. Entity embeddings, derived from Wikipedia2Vec, provide context-specific representations of technologies, enabling filtering and analysis based on cosine similarity, and revealing the relationships between them.
The study utilizes dimensionality reduction and clustering algorithms to detect and visualize this hierarchical structure within the technology-adjacent space. These methods analyze the Wikipedia corpus to create a map of 23,544 technology-adjacent entities. This process aims to provide researchers, policymakers, and corporations with a tool for informed decision-making and resource allocation within the rapidly evolving technological landscape.
“Bottom-Up” Approach Using Wikipedia and NLP
The study employs a “bottom-up” approach to map emerging technologies, differing from traditional “top-down” methods relying on expert panels. This methodology leverages the extensive data within the Wikipedia corpus and utilizes natural language processing (NLP) techniques to identify and explore technology-adjacent entities. By analyzing Wikipedia’s content, researchers aim to reveal the underlying structure of the technology landscape, moving beyond reliance on publications, patents, or news articles as primary data sources.
This bottom-up approach begins with a universe of 23,544 technology-adjacent entities (TA23k) sourced from Wikipedia. These entities are represented by 100-dimensional contextual embedding vectors, allowing for analysis and clustering based on similarity. Dimensionality reduction and clustering algorithms are then used to reveal a hierarchical structure, resulting in three meta tech-clusters (TC3), seven theme tech-clusters (TC7), and ultimately, a manually verified subset of 100 emerging technologies (ET100).
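The reduce-then-cluster step described above might be sketched as follows. PCA and k-means are stand-ins here (the paper does not specify these exact algorithms or parameters in this summary), and random vectors replace the real 100-dimensional embeddings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Stand-in for the 100-dimensional entity embeddings (TA23k has
# 23,544 real vectors; 700 random ones suffice to show the mechanics).
embeddings = rng.normal(size=(700, 100))

# Reduce dimensionality before clustering; the component count and
# algorithm choice are illustrative assumptions.
reduced = PCA(n_components=10, random_state=0).fit_transform(embeddings)

# First level: seven theme tech-clusters (TC7).
tc7 = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(reduced)

# Second level: group the seven cluster centroids into three
# meta tech-clusters (TC3), giving the nested hierarchy.
centroids = np.array([reduced[tc7 == k].mean(axis=0) for k in range(7)])
tc3_of_tc7 = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(centroids)
tc3 = tc3_of_tc7[tc7]  # map each entity to its meta-cluster

print(len(set(tc7)), len(set(tc3)))  # → 7 3
```

Clustering the centroids rather than re-clustering the raw points is one simple way to guarantee that each TC7 cluster sits wholly inside a single TC3 meta-cluster.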
The use of Wikipedia2Vec, a pre-trained language model, is key to filtering relevant articles and identifying technology-adjacent entities. Entity embeddings, unlike word embeddings, provide context-specific representations of Wikipedia articles, enabling a nuanced understanding of technology relationships. Cosine similarity is used to identify related articles, supporting the creation of the hierarchical structure and validation of the 100 emerging technologies identified within the larger dataset.
Hierarchical Structure: Meta, Theme, and Emerging Tech
The Cosmos 1.0 dataset utilizes a “bottom-up” approach to map emerging technologies, constructing a hierarchical structure from a universe of 23,544 technology-adjacent entities (TA23k). This structure organizes technologies into three meta tech-clusters (TC3), seven thematic tech-clusters (TC7), and a manually verified subset of 100 emerging technologies (ET100). The methodology leverages the Wikipedia corpus and NLP techniques to identify relationships and patterns within the technology landscape.
This hierarchical organization is created through dimensionality reduction and clustering algorithms applied to entity embeddings derived from Wikipedia articles. Entity embeddings, unlike word embeddings, provide context-specific representations, enabling a more nuanced understanding of technology relationships. The resulting tree structure allows for visualization of the technology-adjacent space, ranging from broad meta-clusters down to the specific, verified ET100 at the lowest level.
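Hierarchy detection over embeddings can also be illustrated with agglomerative clustering, where cutting a single merge tree at two heights yields nested levels analogous to TC7 inside TC3. Ward linkage, the toy data, and the cluster counts' extraction method are illustrative choices, not details taken from the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(7)
# Toy 100-d embeddings for a handful of entities; the real dataset
# applies this kind of hierarchy detection to all 23,544 vectors.
embeddings = rng.normal(size=(60, 100))

# Build a full merge tree with Ward linkage (an illustrative choice).
tree = linkage(embeddings, method="ward")

# Cutting the same tree at two granularities yields nested partitions,
# analogous to the TC7 clusters nesting inside the TC3 meta-clusters.
fine = fcluster(tree, t=7, criterion="maxclust")    # 7 clusters
coarse = fcluster(tree, t=3, criterion="maxclust")  # 3 clusters
print(len(set(fine)), len(set(coarse)))  # → 7 3
```

Because both partitions come from cuts of the same tree, every fine cluster is contained in exactly one coarse cluster, which is the defining property of a hierarchical tech-cluster structure.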
The dataset aims to move beyond traditional “top-down” methods—like expert panels—by utilizing a data-driven approach. By analyzing a vast number of technology-adjacent entities from Wikipedia, researchers can identify and map emerging technologies in a more systematic and comprehensive way. The goal is to provide stakeholders with tools to make informed decisions and foster innovation through early identification and understanding of technological advancements.
Entity Embeddings and Cosine Similarity
The Cosmos 1.0 dataset utilizes a “bottom-up” approach to map emerging technologies, beginning with a universe of 23,544 technology-adjacent entities (TA23k). These entities are represented by 100-dimensional contextual embedding vectors, enabling analysis of relationships within the technology landscape. This methodology leverages the Wikipedia corpus and natural language processing techniques to identify and structure these entities, moving beyond traditional “top-down” expert-driven approaches to technology forecasting.
A key element of this research is the use of Wikipedia2Vec, a pre-trained language model, to filter technology-adjacent articles. Cosine similarity is specifically employed to identify articles related to emerging technologies. Entity embeddings, unlike standard word embeddings, provide context-specific representations of Wikipedia articles, allowing for a more nuanced understanding of technology relationships and aiding in the creation of a hierarchical structure of tech-clusters.
The analysis culminates in a three-level hierarchical tree composed of three meta tech-clusters (TC3), seven theme tech-clusters (TC7), and a manually verified set of 100 emerging technologies (ET100). This structure is created through dimensionality reduction and clustering algorithms applied to the entity embedding vectors, and ultimately aims to provide researchers and policymakers with a structured view of the emerging technology landscape.
Applications for Researchers, Policymakers, and Corporations
The Cosmos 1.0 dataset aims to support informed decision-making for researchers, policymakers, and corporations. It provides a dataset of 23,544 technology-adjacent entities (TA23k), including 100 manually verified emerging technologies (ET100), alongside a set of technology indices. These indices allow for filtering both mature and emerging technologies, aiding in resource allocation and fostering sustainable growth and competitive advantage within various research and industry fields.
This dataset utilizes a “bottom-up” approach, leveraging the Wikipedia corpus and natural language processing (NLP) techniques. Wikipedia, with its numerous expert contributors and reliable sources, provides relevant descriptions and hyperlinks for technologies. The dataset outputs a three-level hierarchical structure – three meta tech-clusters (TC3), seven theme tech-clusters (TC7), and the ET100 – enabling visualization of relationships within the technology-adjacent space.
The technology indices within Cosmos 1.0 are designed to help stakeholders identify emerging technologies early. By using these indices alongside the entity embeddings, researchers and policymakers can proactively adapt to changes and strategically plan, while corporations can make informed decisions for competitive advantage. This approach differs from traditional “top-down” methods reliant on expert panels, utilizing a data-driven methodology.
