Paper ID: 2502.06836 • Published Feb 6, 2025
CAST: Cross Attention based multimodal fusion of Structure and Text for materials property prediction
Jaewan Lee, Changyoung Park, Hongjun Yang, Sungbin Lim, Sehui Han
Recent advancements in AI have revolutionized property prediction in
materials science and accelerated material discovery. Graph neural networks
(GNNs) stand out due to their ability to represent crystal structures as
graphs, effectively capturing local interactions and delivering superior
predictions. However, these methods often lose critical global information,
such as crystal systems and repetitive unit connectivity. To address this, we
propose CAST, a cross-attention-based multimodal fusion model that integrates
graph and text modalities to preserve essential material information. CAST
combines node- and token-level features using cross-attention mechanisms,
surpassing previous approaches reliant on material-level embeddings like graph
mean-pooling or [CLS] tokens. A masked node prediction pretraining strategy
further enhances atomic-level information integration. Our method achieved up
to 22.9% improvement in property prediction across four crystal properties
including band gap compared to methods like CrysMMNet and MultiMat. Pretraining
was key to aligning node and text embeddings, with attention maps confirming
its effectiveness in capturing relationships between nodes and tokens. This
study highlights the potential of multimodal learning in materials science,
paving the way for more robust predictive models that incorporate both local
and global information.
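To make the fusion mechanism concrete, the sketch below shows scaled dot-product cross-attention in pure Python, where each graph-node embedding (query) attends over all text-token embeddings (keys/values), as in the node- and token-level fusion the abstract describes. This is a minimal illustration under assumed toy dimensions, not the paper's actual CAST implementation; the function names and example vectors are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    Each query (e.g. a graph-node embedding) attends over all
    keys/values (e.g. text-token embeddings), producing one fused
    vector per query as a convex combination of the values.
    """
    d = len(keys[0])  # key dimension for the 1/sqrt(d) scaling
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        fused.append([sum(w * v[j] for w, v in zip(weights, values))
                      for j in range(len(values[0]))])
    return fused

# Toy example: 2 node embeddings attend over 3 token embeddings (dim 2).
nodes = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(nodes, tokens, tokens)
```

Because the attention weights sum to one, each fused node vector lies inside the convex hull of the token embeddings; this is the sense in which node-level features are enriched with token-level (global, textual) information rather than being collapsed into a single material-level embedding.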