Model Card for CrystaLLM-pi_Chili100K-XRD

Model Details

Model Description

CrystaLLM-pi_Chili100K-XRD is a conditional generative model designed for the recovery of crystal structures from X-ray Diffraction (XRD) data. It is a fine-tuned version of the CrystaLLM-pi framework, utilizing a GPT-2 decoder-only architecture. This model employs a Residual Attention (Slider) mechanism to condition the generation of Crystallographic Information Files (CIFs) on heterogeneous X-ray diffraction data.

The model generates crystal structures based on an XRD pattern input vector consisting of the 20 most intense peaks:

  1. Peak Positions ($2\theta$)
  2. Peak Intensities
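
As a rough illustration of the input described above, the sketch below builds a fixed-length pair of lists from a raw peak list: it keeps the 20 most intense peaks and pads missing slots with -100, the padding value this card mentions for missing peaks. The function name `build_xrd_condition`, the intensity scaling (strongest peak set to 1.0), and the ordering by descending intensity are assumptions made for illustration only; the `_load_and_generate.py` script in the repository handles the actual tokenization and normalization.

```python
PAD_VALUE = -100.0  # padding for missing peaks, as stated in this card
N_PEAKS = 20        # number of most intense peaks used as input


def build_xrd_condition(two_theta, intensities, n_peaks=N_PEAKS, pad=PAD_VALUE):
    """Return (positions, scaled_intensities), each a list of length n_peaks.

    Hypothetical helper: the scaling and ordering are assumptions,
    not the repository's preprocessing.
    """
    if len(two_theta) != len(intensities):
        raise ValueError("two_theta and intensities must have the same length")
    if any(i <= 0 for i in intensities):
        raise ValueError("intensities must be positive")

    # Keep the n_peaks most intense peaks, strongest first.
    peaks = sorted(zip(two_theta, intensities), key=lambda p: p[1], reverse=True)
    peaks = peaks[:n_peaks]

    # Assumption: scale so that the strongest peak has intensity 1.0.
    i_max = peaks[0][1] if peaks else 1.0
    positions = [p[0] for p in peaks]
    scaled = [p[1] / i_max for p in peaks]

    # Pad both lists to the fixed length.
    n_missing = n_peaks - len(peaks)
    positions += [pad] * n_missing
    scaled += [pad] * n_missing
    return positions, scaled


# Illustrative peak values only, not taken from the dataset.
pos, inten = build_xrd_condition([28.4, 47.3, 56.1], [100.0, 55.0, 30.0])
```

With only three peaks supplied, the remaining 17 slots in each list are filled with -100.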

The model is fine-tuned on the Chili-100K XRD dataset, which contains experimentally determined structures sourced from Chili-100K, a curated and filtered subset of the Crystallography Open Database (COD) covering experimental inorganic nanomaterials. Notably, this model features an extended context window of 1536 tokens, enabling the generation of larger and more complex unit cells containing up to ~100 atoms.

  • Developed by: Bone et al. (University College London)
  • Model type: Autoregressive Transformer with Residual Attention Conditioning
  • Language(s): CIF (Crystallographic Information File) syntax
  • License: MIT
  • Finetuned from model: c-bone/CrystaLLM-pi_Mattergen-XRD

Model Sources

Uses

Direct Use

The model is intended for structure solution and recovery from powder XRD data. Researchers can input a list of peak positions and intensities derived from experimental diffraction patterns to generate candidate crystal structures that match the experimental signature.

Out-of-Scope Use

  • Disordered Systems: The model does not natively handle partial occupancies or significant disorder.
  • Organic/MOFs: The training data was strictly filtered for inorganic nanomaterials as per the Chili-100K dataset methodology, so organic compounds and MOFs are out of scope.
  • Extremely Large Unit Cells: Although the context window is expanded to 1536 tokens, structures with many atoms per unit cell may suffer degraded generation quality.

Bias, Risks, and Limitations

  • Experimental Noise: Performance depends on the quality of the input peak extraction and on how rare the material is in the training data.
  • Missing Data: The "Slider" mechanism handles missing peaks (padded with -100), but significant data loss degrades recovery rates.
  • Polymorphs: In cases of strong structural similarity, the model may bias towards the polymorph most prevalent in the Chili-100K distribution.

How to Get Started with the Model

For instructions on loading and running generation, refer to the _load_and_generate.py script in the CrystaLLM-pi GitHub Repository. This script handles XRD vector tokenization and normalization.

Training Details

Training Data

The model underwent a two-stage fine-tuning process:

  1. MatterGen XRD: Theoretical XRD patterns generated from the MatterGen dataset.
  2. Chili-100K XRD (c-bone/chili100k_strat): An experimentally determined, curated, and filtered subset of inorganic nanomaterials from the COD (accessed April 2026). After deduplication, this comprises ~14K materials derived from ~21K CIFs.

Dataset Splitting (Chili-100K):

  • Train:Val:Test Ratio: 78.6:10.7:10.7
  • Leakage-Aware Test Set: The test set was strictly stratified to evaluate generalization:
    • 500 materials: Fully seen during training (LeMaterial, MatterGen XRD, or Chili-100K train/val).
    • 500 materials: Reduced formula seen during training, but the specific structure was unseen (measured via Structure Novelty metric).
    • 500 materials: Neither reduced formula nor structure seen in any training phase.
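
The three test buckets above can be sketched as a simple classification rule. This is a minimal illustration, not the authors' pipeline: the function `leakage_bucket` and the hashable `structure_key` are hypothetical stand-ins, whereas the card states that structural novelty is measured with a Structure Novelty metric. The formulas and labels in the example are illustrative.

```python
def leakage_bucket(formula, structure_key, train_formulas, train_structures):
    """Assign a test material to one of the three leakage-aware buckets.

    structure_key is a hypothetical hashable stand-in for a real
    structure comparison (an assumption made for illustration).
    """
    if structure_key in train_structures:
        return "seen"            # structure seen during training
    if formula in train_formulas:
        return "formula_seen"    # reduced formula seen, structure unseen
    return "unseen"              # neither formula nor structure seen


# Illustrative training-side sets.
train_formulas = {"TiO2", "ZnO"}
train_structures = {("TiO2", "rutile"), ("ZnO", "wurtzite")}

buckets = [
    leakage_bucket("TiO2", ("TiO2", "rutile"), train_formulas, train_structures),
    leakage_bucket("TiO2", ("TiO2", "anatase"), train_formulas, train_structures),
    leakage_bucket("CeO2", ("CeO2", "fluorite"), train_formulas, train_structures),
]
```

As a consistency check on the numbers above, three buckets of 500 materials give a 1,500-material test set, which is about 10.7% of ~14K materials, in line with the stated split ratio.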

Training Procedure

  • Architecture: GPT-2 with Residual Attention (Slider) layers (~47.7M parameters).
  • Mechanism: The Slider mechanism computes a parallel attention score for the conditioning vector, dynamically weighting it against base self-attention to robustly handle heterogeneous/missing diffraction data.
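
To make the idea of a parallel conditioning branch concrete, here is a toy single-query sketch in plain Python. It is one possible reading of the description above, not the CrystaLLM-pi implementation: the sigmoid gate, the additive (residual) combination, the `cond_present` flag, and all function names are assumptions chosen for illustration.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def toy_slider_attention(query, keys, values, cond_key, cond_value, cond_present=True):
    """Toy attention for one query with a residual conditioning branch.

    Assumption: the conditioning score is squashed to a gate in (0, 1)
    and the gated conditioning value is added to the base output.
    """
    scale = math.sqrt(len(query))

    # Base self-attention over the token keys and values.
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    base = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

    # With no usable conditioning, fall back to plain self-attention.
    if not cond_present:
        return base

    # Parallel score for the conditioning vector, turned into a gate.
    gate = 1.0 / (1.0 + math.exp(-dot(query, cond_key) / scale))
    return [b + gate * c for b, c in zip(base, cond_value)]


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]

base_out = toy_slider_attention(query, keys, values, [0.0, 0.0], [2.0, 2.0], cond_present=False)
cond_out = toy_slider_attention(query, keys, values, [0.0, 0.0], [2.0, 2.0])
```

In this toy setup a zero conditioning key gives a gate of 0.5, so the conditioned output differs from the base output by half of the conditioning value.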

Evaluation

Metrics

The model is evaluated on the leakage-aware test splits using:

  1. Match Rate: Percentage of ground truth structures successfully recovered.
  2. RMS-d: Root Mean Square distance between ground truth and generated structures.
  3. Lattice Parameter and Volume MAE: Mean Absolute Error of predicted unit cell dimensions.
  4. N atoms match: The average number of atoms in the unit cell of matched materials in the test set.
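
The aggregate metrics above can be computed from per-structure results roughly as follows. This is a minimal sketch under stated assumptions: the record fields (`matched`, `a_true`, `a_pred`, `n_atoms`) are hypothetical, the numbers are made up for illustration, and the MAE is taken over all records because the card does not specify the subset. Deciding whether a generated structure matches the ground truth, and computing RMS-d, requires a structure comparison step that is outside this sketch.

```python
def match_rate(records):
    """Percentage of ground-truth structures flagged as recovered."""
    return 100.0 * sum(r["matched"] for r in records) / len(records)


def lattice_mae(records, true_key="a_true", pred_key="a_pred"):
    """Mean absolute error of one lattice parameter over all records."""
    return sum(abs(r[true_key] - r[pred_key]) for r in records) / len(records)


def mean_atoms_matched(records):
    """Average number of atoms per unit cell among matched materials."""
    matched = [r["n_atoms"] for r in records if r["matched"]]
    return sum(matched) / len(matched) if matched else float("nan")


# Made-up per-structure results, for illustration only.
records = [
    {"matched": True, "a_true": 4.0, "a_pred": 4.1, "n_atoms": 8},
    {"matched": False, "a_true": 5.0, "a_pred": 5.5, "n_atoms": 20},
    {"matched": True, "a_true": 3.0, "a_pred": 3.0, "n_atoms": 4},
    {"matched": True, "a_true": 6.0, "a_pred": 5.8, "n_atoms": 12},
]

rate = match_rate(records)
mae_a = lattice_mae(records)
n_atoms_match = mean_atoms_matched(records)
```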

Citation

Primary Model Paper:

@misc{bone2025discoveryrecoverycrystallinematerials,
      title={Discovery and recovery of crystalline materials with property-conditioned transformers}, 
      author={Cyprien Bone and Matthew Walker and Kuangdai Leng and Luis M. Antunes and Ricardo Grau-Crespo and Amil Aligayev and Javier Dominguez and Keith T. Butler},
      year={2025},
      eprint={2511.21299},
      archivePrefix={arXiv},
      primaryClass={cond-mat.mtrl-sci},
      url={https://arxiv.org/abs/2511.21299}, 
}

CHILI Dataset:

@inproceedings{10.1145/3637528.3671538,
      author = {Friis-Jensen, Ulrik and Johansen, Frederik L. and Anker, Andy S. and Dam, Erik B. and Jensen, Kirsten M. \O{}. and Selvan, Raghavendra},
      title = {CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning},
      year = {2024},
      isbn = {9798400704901},
      publisher = {Association for Computing Machinery},
      address = {New York, NY, USA},
      url = {https://doi.org/10.1145/3637528.3671538},
      doi = {10.1145/3637528.3671538},
      pages = {4962--4973},
      numpages = {12},
      keywords = {atomic structure, chemistry, datasets, deep learning, graph neural network, graphs, machine learning, nanomaterials, neutron, scattering, x-ray},
      location = {Barcelona, Spain},
      series = {KDD '24}
}