Scientists are tackling the challenge of managing increasingly complex materials data with a new categorisation system called M-CODE, or Materials Categorization via Ontology, Dimensionality and Evolution. Vsevolod Biryukov and Kamal Choudhary, working in collaboration between Exabyte Inc. and Johns Hopkins University, alongside Timur Bazhirov from Exabyte Inc., have developed a compact system linking materials science terminology to reusable concepts and traceable transformations. This research is significant because it provides a standardised approach to classifying materials by dimensionality and structural complexity, ranging from pristine to defective and processed forms, facilitating reproducible dataset generation, validation and wider community contributions. The team’s practical implementation, including open-source codebases with JSON schemas and Python/TypeScript interfaces, promises to advance data management practices crucial for the ongoing development of artificial intelligence in materials science.
Effective categorisation is essential if machines are to accelerate the discovery of better materials for everything from electronics to energy storage. This new system offers a common language for describing materials, promising to unlock the full potential of data-driven materials science.
Artificial intelligence is rapidly advancing materials science, yet current data standards struggle to capture the complexity of real-world materials, including surfaces, defects, and reduced dimensionality. Unlike previous approaches, this work establishes a shared language for describing materials and how they are created. The system classifies structures based on their dimensionality and structural complexity, ranging from perfect crystals to those containing defects or created through specific processing techniques.
This categorization isn’t merely descriptive; it’s designed to be actively used in computational workflows. At its core, M-CODE represents materials generation as explicit transformations with traceable origins. Each structure arises from a ‘Configuration’, specifying input components and parameters, processed by a ‘Builder’ to yield the final ‘Material’ along with detailed metadata.
This approach ensures that structures can be reliably reproduced and their provenance accurately tracked. The researchers implemented this framework using JSON schemas and examples, alongside Python and TypeScript interfaces, fostering interoperability and community contributions. M-CODE goes beyond defining structure classes by providing an ontology of entities and operations, essentially a set of building blocks and instructions for constructing materials.
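To make the Configuration, Builder and Material pattern concrete, here is a minimal Python sketch. It is not the M-CODE reference implementation: the names SupercellConfiguration and SupercellBuilder, their fields, and the SHA-256 fingerprint stored in the metadata are all assumptions chosen for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class SupercellConfiguration:
    """Hypothetical configuration: a base cell repeated along each axis."""
    base_atoms: tuple                 # (element, (x, y, z)) pairs, fractional coordinates
    repetitions: tuple = (1, 1, 1)    # number of copies along a, b, c


@dataclass
class Material:
    """Result of a build: the atoms plus a record of how they were made."""
    atoms: list
    metadata: dict = field(default_factory=dict)


class SupercellBuilder:
    """Hypothetical builder: consumes a configuration and returns a Material."""

    def build(self, config: SupercellConfiguration) -> Material:
        nx, ny, nz = config.repetitions
        atoms = []
        for i in range(nx):
            for j in range(ny):
                for k in range(nz):
                    for element, (x, y, z) in config.base_atoms:
                        # Rescale so coordinates stay fractional in the larger cell.
                        atoms.append((element, ((x + i) / nx, (y + j) / ny, (z + k) / nz)))
        config_dict = asdict(config)
        # A deterministic fingerprint of the inputs (an illustrative choice).
        digest = hashlib.sha256(
            json.dumps(config_dict, sort_keys=True).encode("utf-8")
        ).hexdigest()
        metadata = {
            "builder": type(self).__name__,
            "configuration": config_dict,
            "configuration_sha256": digest,
        }
        return Material(atoms=atoms, metadata=metadata)


config = SupercellConfiguration(base_atoms=(("Ni", (0.0, 0.0, 0.0)),), repetitions=(2, 1, 1))
material = SupercellBuilder().build(config)
```

Because the metadata carries both the full configuration and the name of the builder, rebuilding from the stored record should yield the same atoms, which is the sense of reproducibility and provenance described above.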
These entities, categorized by their role in the construction process, are defined using JSON schemas, allowing for automated validation and exchange of data. By focusing on the intermediate layer between realistic structures and reusable build operations, the work avoids the need to standardize simulation inputs or property ontologies, offering a flexible and targeted solution. The resulting open-source codebase promises to accelerate materials discovery by streamlining data management and promoting collaboration.
M-CODE classification of materials structures and composable software elements
The research delivers a categorization system, M-CODE, capable of concisely tagging materials structures with compact identifiers. These tags, encoding dimensionality, structural complexity, and creation methods, provide a standardized language for materials science data. A key output is the development of 58 distinct M-CODE tags, covering pristine, compound, defective, and processed structures across 0D, 1D, 2D, and 3D dimensionalities.
For instance, a monolayer material receives the tag P-2D-MNL, while a defective structure featuring a nitrogen substitution in graphene is labelled D-0D-SUB. The work presents a software-focused view of reusable entities and operations, implemented as composable elements designed for validation and assembly into complex structures.
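As an illustration of how such compact tags might be consumed by software, the sketch below splits a tag into three hyphen-separated fields. The field meanings (a category letter, a dimensionality and a structure-type code) and the readings P for pristine and D for defective are inferred from the two examples quoted here, not taken from the M-CODE specification. The names parse_tag and MCodeTag are hypothetical.

```python
from dataclasses import dataclass

# Only the category letters seen in the two quoted examples are mapped;
# the letters for other categories are not given here and are left out.
KNOWN_CATEGORIES = {"P": "pristine", "D": "defective"}
VALID_DIMENSIONALITIES = {"0D", "1D", "2D", "3D"}


@dataclass(frozen=True)
class MCodeTag:
    category: str          # e.g. "P" or "D"
    dimensionality: int    # 0, 1, 2 or 3
    structure_type: str    # e.g. "MNL" or "SUB"

    def describe(self) -> str:
        label = KNOWN_CATEGORIES.get(self.category, "unmapped category")
        return f"{label}, {self.dimensionality}D, type {self.structure_type}"


def parse_tag(tag: str) -> MCodeTag:
    """Split a tag such as 'P-2D-MNL' into its assumed three fields."""
    parts = tag.split("-")
    if len(parts) != 3:
        raise ValueError(f"expected three hyphen-separated fields, got {tag!r}")
    category, dimensionality, structure_type = parts
    if dimensionality not in VALID_DIMENSIONALITIES:
        raise ValueError(f"unknown dimensionality {dimensionality!r} in {tag!r}")
    return MCodeTag(category, int(dimensionality[0]), structure_type)


monolayer = parse_tag("P-2D-MNL")
substitution = parse_tag("D-0D-SUB")
```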
The system’s architecture allows for reproducible dataset generation and community contributions, fostering collaborative materials science research. Schemas are defined once and language bindings are automatically generated, ensuring consistency across implementations. A representative example illustrates the schema for a two-dimensional interface configuration, comprising a stack of two strained supercells and vacuum, with a default xy_shift of [0.0, 0.0].
Corresponding JSON instances demonstrate how this schema translates into concrete data, such as a Ni-Graphene interface separated by 10 Å. Similarly, a schema for a substitutional point defect allows for the precise definition of defects within a crystal structure, using a merge operation to combine a bulk material with a specific defect site. The research targets an intermediate layer connecting realistic structures to reusable build operations, rather than focusing on direct simulation inputs or universal property ontologies.
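The relationship between a schema with defaults and a concrete instance can be sketched in a few lines of Python. Everything below is a simplified assumption rather than the published schema: the property names other than xy_shift, the layout of the stack, and the reading of the 10 Å figure as the size of the vacuum layer are illustrative. The helper with_defaults handles only the required and default keywords, so it is not a full JSON Schema validator.

```python
import copy

# Hypothetical, heavily simplified schema for a 2D interface configuration.
INTERFACE_CONFIGURATION_SCHEMA = {
    "type": "object",
    "required": ["stack_components"],
    "properties": {
        "stack_components": {"type": "array"},
        "xy_shift": {"type": "array", "default": [0.0, 0.0]},
    },
}


def with_defaults(instance: dict, schema: dict) -> dict:
    """Check required fields, then fill in any missing fields that have a default."""
    missing = [key for key in schema.get("required", []) if key not in instance]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    result = copy.deepcopy(instance)
    for name, spec in schema.get("properties", {}).items():
        if name not in result and "default" in spec:
            result[name] = copy.deepcopy(spec["default"])
    return result


# Illustrative instance: two strained supercells plus vacuum (field names assumed).
ni_graphene = {
    "stack_components": [
        {"kind": "strained_supercell", "material": "Ni"},
        {"kind": "strained_supercell", "material": "Graphene"},
        {"kind": "vacuum", "size_angstrom": 10.0},  # assumed meaning of the 10 Å value
    ]
}

resolved = with_defaults(ni_graphene, INTERFACE_CONFIGURATION_SCHEMA)
```

The instance omits xy_shift, so the resolved copy picks up the schema default while the original dictionary is left untouched.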
This methodology combines a categorization scheme for target structures with a software-oriented ontology of entities and operations, enabling reproducible materials construction. JavaScript Object Notation (JSON) and JSON Schema form the foundation for data organization, validation, and exchange, providing a single source of truth for definitions. Consequently, language bindings, specifically Python and TypeScript interfaces, are automatically generated from these schemas to ensure consistency across implementations.
The reference implementation employs object-oriented design, promoting modularity and reusability of concepts, and is distributed via PyPI and npm packages as mat3ra-esse. Each material structure is represented as a combination of building blocks and the process used to assemble them, emphasizing explicit transformations and provenance tracking. A Configuration specifies input structures and physical parameters, which are then consumed by a Builder to generate the resulting Material alongside metadata detailing the transformation applied.
Entities within this framework are grouped into four categories based on their construction role, and are specified as JSON Schemas for validation and interchange. These schemas are then realised as corresponding classes and methods within the codebase, allowing translation of scientific descriptions into machine-readable, reproducible configurations.
For instance, a “Ni slab with N layers and 10 Å vacuum” can be encoded as a slab configuration, demonstrating the system’s ability to capture complex structural details. Core entities, such as 3D crystals, 3D voids, and 0D atoms, are defined by dimensionality and purpose, with required and optional fields specified within the JSON Schema. Auxiliary entities like supercell matrices and Miller indices further refine the description of structures and their creation processes.
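A slab description of this kind might be encoded as follows. This is a sketch under stated assumptions: the class SlabConfiguration, its field names, the split between required and optional fields, and the default vacuum of 0.0 are hypothetical and do not come from the mat3ra-esse schemas.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SlabConfiguration:
    """Hypothetical slab configuration with required and optional fields."""
    bulk: str                 # required: name of the bulk crystal, e.g. "Ni"
    miller_indices: tuple     # required: (h, k, l) of the exposed surface
    number_of_layers: int     # required: the "N layers" in the description
    vacuum: float = 0.0       # optional: vacuum thickness in Å (default assumed)

    def __post_init__(self):
        if len(self.miller_indices) != 3 or not any(self.miller_indices):
            raise ValueError("miller_indices must be three integers, not all zero")
        if self.number_of_layers < 1:
            raise ValueError("number_of_layers must be at least 1")
        if self.vacuum < 0:
            raise ValueError("vacuum must be non-negative")

    def to_json(self) -> str:
        """Serialise to JSON so the configuration can be stored or exchanged."""
        return json.dumps(asdict(self), sort_keys=True)


# "Ni slab with N layers and 10 Å vacuum", here with N = 4 on an assumed (111) surface.
ni_slab = SlabConfiguration(bulk="Ni", miller_indices=(1, 1, 1), number_of_layers=4, vacuum=10.0)
serialised = ni_slab.to_json()
```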
By representing materials generation as explicit transformations with provenance, the work aims to improve data quality and facilitate collaboration within the materials science community. Once a structure is defined, the Builder component utilizes the Configuration to create the Material, ensuring a clear and traceable pathway from initial parameters to final result.
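The substitutional point defect mentioned earlier gives a compact example of a merge-style build. The sketch below is an assumption about what such an operation could look like, not the authors' code: SubstitutionDefectConfiguration, SubstitutionDefectBuilder, the merge helper and the metadata keys are all hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SubstitutionDefectConfiguration:
    """Hypothetical configuration: a host structure plus one site to substitute."""
    host_atoms: tuple    # (element, fractional position) pairs
    site_index: int      # index of the host atom to replace
    new_element: str     # element placed on that site


@dataclass
class Material:
    atoms: list
    metadata: dict = field(default_factory=dict)


def merge(host_atoms, substitutions):
    """Return a new atom list in which the listed sites take the substituting element."""
    return [
        (substitutions.get(index, element), position)
        for index, (element, position) in enumerate(host_atoms)
    ]


class SubstitutionDefectBuilder:
    """Hypothetical builder that merges a host structure with a defect site."""

    def build(self, config: SubstitutionDefectConfiguration) -> Material:
        if not 0 <= config.site_index < len(config.host_atoms):
            raise IndexError("site_index is outside the host structure")
        replaced = config.host_atoms[config.site_index][0]
        atoms = merge(config.host_atoms, {config.site_index: config.new_element})
        metadata = {
            "builder": type(self).__name__,
            "operation": "merge",
            "site_index": config.site_index,
            "replaced_element": replaced,
            "new_element": config.new_element,
        }
        return Material(atoms=atoms, metadata=metadata)


# Two-atom graphene cell in fractional coordinates; one carbon becomes nitrogen.
graphene = (("C", (0.0, 0.0, 0.0)), ("C", (1 / 3, 2 / 3, 0.0)))
config = SubstitutionDefectConfiguration(host_atoms=graphene, site_index=1, new_element="N")
doped = SubstitutionDefectBuilder().build(config)
```

The host structure is never modified in place, so the pristine input and the defective output can both be kept, with the metadata recording which site changed.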
Developing a standardised data framework to unlock machine learning in materials science
Scientists are increasingly reliant on machine learning to accelerate materials discovery, yet a fundamental bottleneck persists: data. The problem is not simply the volume of data, but its consistency and how readily it can be shared and reused between research groups. Establishing a common language for materials is not simple, as materials exhibit a vast range of compositions, structures, and processing histories, each of which influences their properties.
M-CODE tackles this complexity through a hierarchical system, classifying materials not just by what they are, but also by how they were created and their structural characteristics. This focus on provenance, the history of a material’s creation, is particularly valuable, as it allows algorithms to account for the impact of synthesis and processing on final performance.
The true test will be widespread adoption, requiring a shift in culture as much as technology. The availability of open-source tools and schemas is a welcome step, lowering the barrier to entry for researchers. However, the success of any standard hinges on community buy-in and ongoing maintenance. While existing initiatives like OPTIMADE provide platforms for data sharing, they rely on consistent data formatting to be effective.
Once implemented across multiple datasets, this categorization system could unlock new possibilities for predictive modelling and automated materials design. Beyond this, the framework could extend to other areas of physical science where complex structures and histories are important, such as chemistry and geology. This work signals a move towards greater data maturity.
It is not a final solution, but a vital building block. Open questions remain regarding the scalability of the ontology to encompass entirely new classes of materials and the integration of experimental and computational data. Future efforts should focus on developing automated tools for data validation and transformation, ensuring that materials data is not just abundant, but also reliable and readily accessible to the growing community of materials scientists and AI researchers.
👉 More information
🗞 M-CODE: Materials Categorization via Ontology, Dimensionality and Evolution
🧠 arXiv: https://arxiv.org/abs/2602.14384
