AI Learns Building Details with Language Model Boost

Researchers are increasingly focused on accurately representing building semantics for effective artificial intelligence training in the architecture, engineering, construction and operation (AECO) sectors. Suhyung Jang and Ghang Lee (Yonsei University and the Technical University of Munich), together with Jaekun Lee and Hyunjun Lee, demonstrate a novel training approach that uses large language model (LLM) embeddings to better preserve nuanced distinctions between building object subtypes. The work addresses the limitations of conventional encoding methods by employing embeddings from models such as OpenAI’s GPT and Meta’s LLaMA, and evaluates their performance with GraphSAGE by classifying 42 building object subtypes across five high-rise residential building information models. The findings, a weighted average F1-score of 0.8766 with compacted llama-3 embeddings compared with 0.8475 for one-hot encoding, highlight the potential of LLM-based encodings to significantly enhance semantic comprehension in AI applications throughout the AECO industry.

Artificial intelligence is increasingly used to design and manage the buildings we live and work in. However, current systems struggle to fully understand the complex details within these structures, and this limits their effectiveness. New techniques utilising advanced language models promise to give computers a far richer grasp of building information, improving performance across the construction industry.

Researchers are applying large language models (LLMs) to building design and construction. Accurate interpretation of building semantics, the detailed meaning of building components and their relationships, is vital for artificial intelligence systems used in the architecture, engineering, construction, and operation (AECO) industry. The team’s approach uses LLM embeddings as class encodings during training, and they evaluated it by training a graph neural network called GraphSAGE to classify 42 different types of building objects found within five high-rise residential building information models (BIMs).

These BIMs, digital representations of a building’s physical and functional characteristics, provided a complex dataset for testing. Experiments involved varying the dimensions of the LLM embeddings, including both high-resolution and compacted versions generated using a technique called Matryoshka representation. Results demonstrate that LLM encodings outperform traditional one-hot encoding in accurately identifying building subtypes.

Specifically, a compacted embedding from the llama-3 model achieved a weighted average F1-score of 0.8766, exceeding the 0.8475 score obtained with one-hot encoding. This improvement suggests that leveraging the semantic understanding embedded within LLMs can significantly enhance AI’s ability to interpret complex building data. This approach promises to unlock more intelligent automation throughout the AECO industry, from design validation to construction monitoring and building operations.

As LLMs and data compression techniques continue to advance, the potential for applying these enriched encodings to a wider range of semantic tasks appears substantial. The challenge remains in efficiently translating complex building information into a format that AI can readily process and utilise for informed decision-making. Accurate interpretation of building semantics is essential for reliable and informed decisions throughout construction projects.

Representing and interpreting this information, referred to as “building semantics,” is vital for both AI systems and their training. To enable AI models to comprehend this data through supervised learning, the data is labelled and fed to the models via an encoding method, which plays a critical role in differentiating classes during training. Prior studies have often defaulted to conventional methods such as one-hot or label encoding, treating the choice of encoding method itself as an afterthought.

The advent of large language models (LLMs) and their generated embeddings has demonstrated the ability to capture domain-specific contextual nuances within the AECO fields, though their utilisation has primarily been limited to retrieving similar information. This study proposes a novel AI model training method that employs LLM embeddings as encodings to preserve finer distinctions between building semantics.

To validate this approach, the researchers conducted an experiment on building object subtype classification using GraphSAGE models. GraphSAGE models were trained for node classification over 42 building object subtypes drawn from five BIM models used by a major contractor. The experiment contrasted one-hot encoding with LLM encodings generated by OpenAI’s ‘text-embedding-3-small’ and ‘text-embedding-3-large’ models, as well as Meta’s ‘llama3’.

Additionally, diverse embedding dimensions were compacted using the Matryoshka representation model to assess whether semantic nuances were preserved during training. The findings offer a new pathway for enhancing AI’s comprehension of complex building information, potentially revolutionising workflows across the AECO industry. AI’s ability to function effectively within the AECO industry depends on its capacity to interpret the terminology and concepts used in building projects.

This interpretation of building semantics is largely achieved through supervised learning, in which objects, their conceptual labels, and the chosen encoding method together shape what the model learns. Previous research has highlighted the importance of representing frequently occurring and well-defined building elements. The use of five high-rise residential building information models (BIMs) provided a diverse dataset for training and evaluation, supporting the generalisability of the results.

Furthermore, the GraphSAGE models employed in this work demonstrated their capacity to effectively learn from the LLM-based encodings. GraphSAGE, a graph neural network, formed the core of our methodology for classifying building object subtypes within building information models (BIMs). We selected this technique because of its ability to effectively learn node embeddings from graph-structured data, mirroring the relationships inherent in BIMs.

Five high-rise residential BIMs, sourced from a major construction firm, provided the foundation for our experiments. Within these models, 42 distinct building object subtypes, ranging from walls and columns to doors and windows, were identified for classification. Each object subtype initially received a conventional one-hot encoding, establishing a baseline for performance comparison.
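As a concrete illustration of this baseline, one-hot encoding assigns each subtype a vector with a single 1 and 0 elsewhere, so every pair of classes is equally distant. The sketch below uses four hypothetical subtype names rather than the paper's actual 42-label set.

```python
# Minimal sketch of the one-hot baseline: each building object subtype
# gets a vector with a 1 in its own position and 0 elsewhere.
# The subtype names are illustrative, not the paper's actual label set.

def one_hot_encode(subtypes):
    """Map each subtype name to a one-hot vector of length len(subtypes)."""
    index = {name: i for i, name in enumerate(sorted(set(subtypes)))}
    n = len(index)
    return {name: [1 if j == i else 0 for j in range(n)]
            for name, i in index.items()}

subtypes = ["wall", "column", "door", "window"]  # the study used 42 subtypes
encodings = one_hot_encode(subtypes)
print(encodings["door"])  # [0, 1, 0, 0]  ('door' is index 1 in sorted order)
```

Because every one-hot vector is orthogonal to every other, this encoding carries no information about which subtypes are semantically similar, which is exactly the limitation the LLM embeddings address.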

We then explored utilising embeddings generated by large language models (LLMs). Embeddings are numerical representations of words or concepts that capture semantic meaning. We tested embeddings from OpenAI’s ‘text-embedding-3-small’ and ‘text-embedding-3-large’ models, alongside Meta’s ‘llama3’, each offering differing levels of contextual understanding.

To assess the impact of embedding size, we employed original high-dimensional embeddings of 1,536, 3,072, and 4,096 dimensions. High-dimensional embeddings can be computationally expensive, so we also implemented the Matryoshka representation model, a dimensionality reduction technique, to create compacted 1,024-dimensional embeddings. This allowed us to investigate whether semantic information could be preserved even with reduced dimensionality.
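The paper does not spell out its compaction pipeline here, but a common way to compact Matryoshka-style embeddings is simply to keep the first k dimensions and renormalise, since such embeddings are trained so that prefixes remain useful on their own. A minimal NumPy sketch under that assumption:

```python
import numpy as np

def compact_embedding(vec, k=1024):
    """Truncate a Matryoshka-style embedding to its first k dimensions
    and renormalise to unit length, a common way to compact such vectors."""
    truncated = np.asarray(vec, dtype=np.float64)[:k]
    norm = np.linalg.norm(truncated)
    return truncated / norm if norm > 0 else truncated

# Stand-in for a 4,096-dimensional llama-3 embedding (random, for illustration)
full = np.random.default_rng(0).normal(size=4096)
small = compact_embedding(full, k=1024)
print(small.shape)  # (1024,), unit length
```

Truncation with renormalisation preserves the relative orientation of the leading dimensions, which is why cosine similarities between compacted vectors can remain close to those of the full embeddings.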

During training, the GraphSAGE model learned to predict the subtype of each object based on its features and relationships within the BIM graph. By comparing the performance of models trained with one-hot encoding versus various LLM embeddings, we aimed to demonstrate the benefits of capturing finer distinctions in building semantics. Once the GraphSAGE models were trained, we evaluated their performance using a weighted average F1-score, a metric that balances precision and recall across all 42 object subtypes.
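GraphSAGE's core step can be sketched in a few lines: each node's representation is updated by combining its own features with the mean of its neighbours' features through learned weight matrices. The toy NumPy version below (random weights, one layer, a made-up four-node graph) only illustrates the mean-aggregation mechanism, not the paper's trained model.

```python
import numpy as np

def sage_layer(features, adjacency, W_self, W_neigh):
    """One GraphSAGE layer with mean aggregation:
    h_v = ReLU(W_self @ x_v + W_neigh @ mean(x_u for u in N(v)))."""
    out = []
    for v, neighbours in enumerate(adjacency):
        x_v = features[v]
        if neighbours:
            x_n = np.mean([features[u] for u in neighbours], axis=0)
        else:
            x_n = np.zeros_like(x_v)
        h = W_self @ x_v + W_neigh @ x_n
        out.append(np.maximum(h, 0.0))  # ReLU
    return np.stack(out)

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8))      # 4 objects, 8 input features each
adjacency = [[1, 2], [0], [0, 3], [2]]  # hypothetical BIM relationship graph
W_self = rng.normal(size=(16, 8))
W_neigh = rng.normal(size=(16, 8))
hidden = sage_layer(features, adjacency, W_self, W_neigh)
print(hidden.shape)  # (4, 16); a final classifier head would score the 42 subtypes
```

In the study's setting, the input features would be the one-hot or LLM-embedding encodings, and the graph edges the relationships between objects in the BIM.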

This metric provides a comprehensive assessment of the model’s ability to accurately classify each object type. The selection of weighted average F1-score was deliberate, as it accounts for potential imbalances in the distribution of object subtypes within the BIMs. Careful consideration was given to ensure a fair comparison between the different encoding methods.
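For clarity, the weighted average F1-score computes a per-class F1 and averages it with each class's support as the weight, the same quantity scikit-learn reports with `f1_score(..., average='weighted')`. A from-scratch sketch on toy labels over three hypothetical subtypes:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Weighted average F1: per-class F1 weighted by class support."""
    classes = sorted(set(y_true))
    support = Counter(y_true)
    total = 0.0
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        total += f1 * support[c] / len(y_true)
    return total

y_true = ["wall", "wall", "door", "door", "window", "window"]
y_pred = ["wall", "door", "door", "door", "window", "wall"]
print(round(weighted_f1(y_true, y_pred), 4))  # 0.6556
```

Because each class's F1 is scaled by its share of the true labels, a rare subtype that is badly classified lowers the score far less than a common one, which is the imbalance-robustness the authors cite.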

By systematically varying the LLM model and embedding dimensions, we sought to identify the optimal configuration for maximising semantic comprehension in AI-driven AECO applications. The intention was to move beyond simple label encoding. Unlike traditional methods, LLM embeddings encode semantic relationships, allowing the AI to recognise that a ‘fire door’ is more closely related to a ‘standard door’ than to a ‘window’.

This nuanced understanding is critical for tasks such as automated code compliance checking or generating accurate construction schedules. By leveraging the power of LLMs, we aimed to create AI models that can interpret building semantics with greater accuracy and sophistication. Instead of merely recognising object types, the models can begin to understand their function and relationships within the broader building context.
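The intuition that embeddings place related subtypes closer together can be checked with cosine similarity. The vectors below are made-up four-dimensional stand-ins, not real LLM embeddings; with genuine embeddings one would expect the ‘fire door’–‘standard door’ similarity to exceed the ‘fire door’–‘window’ similarity in just this way.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors: a.b / (|a||b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# Made-up stand-ins for illustration only; real LLM embeddings
# have hundreds or thousands of dimensions.
emb = {
    "fire door":     [0.9, 0.8, 0.1, 0.0],
    "standard door": [0.8, 0.9, 0.2, 0.1],
    "window":        [0.1, 0.2, 0.9, 0.8],
}
print(cosine_similarity(emb["fire door"], emb["standard door"]))  # close to 1
print(cosine_similarity(emb["fire door"], emb["window"]))         # much lower
```

One-hot vectors, by contrast, give a cosine similarity of exactly 0 for every pair of distinct classes, discarding this relational structure entirely.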

👉 More information
🗞 Enhancing Building Semantics Preservation in AI Model Training with Large Language Model Encodings
🧠 ArXiv: https://arxiv.org/abs/2602.15791

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
