ChartVerse-Coder is a complexity-aware chart code generator that autonomously synthesizes diverse, high-complexity chart code from scratch. It was developed as part of the opendatalab/ChartVerse project. For more details about our method, datasets, and full model series, please visit our Project Page.
Unlike prior template-based or seed-conditioned approaches, ChartVerse-Coder generates chart code via high-temperature sampling, enabling broad exploration of the long-tail chart distribution and producing diverse, realistic charts with high structural complexity.
Highlights
- Autonomous Synthesis: Generates diverse chart codes from scratch without templates or seed charts
- Complexity-Aware: Trained with RPE-guided filtering to master high-complexity visualizations
- High Diversity: Produces charts spanning 3D plots, hierarchical structures, multi-subplot layouts, and more
- Iterative Self-Enhancement: Progressively improves code quality through generation-filtering-retraining loops
Method Overview
Rollout Posterior Entropy (RPE)
We propose Rollout Posterior Entropy (RPE) to quantify intrinsic chart complexity via generative stability:
- VLM Rollout: Given a chart, prompt a VLM to generate executable code 8 times with temperature 1.0
- Feature Extraction: Extract CLIP embeddings from the reconstructed images and compute their Gram matrix
- Spectral Entropy: Calculate entropy from normalized singular values
Key Insight: Simple charts yield consistent reconstructions (low RPE), while complex charts result in divergent outcomes (high RPE). We retain only samples with RPE ≥ 0.4.
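The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's implementation: the function name `spectral_entropy`, the L2 normalization of the embeddings, the natural-log entropy, and the division by log(n) are all assumptions, and the exact scaling behind the 0.4 threshold is not specified here. The input stands in for the CLIP embeddings of the 8 reconstructed images.

```python
import numpy as np

def spectral_entropy(embeddings, normalize=True):
    """Entropy of the normalized singular values of the Gram matrix.

    embeddings: array of shape (n_rollouts, dim), one row per reconstructed image.
    Assumption: rows are L2-normalized first, so the Gram matrix holds cosine similarities.
    """
    E = np.asarray(embeddings, dtype=np.float64)
    if E.ndim != 2 or E.shape[0] < 2:
        raise ValueError("need at least two embedding rows")
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    gram = E @ E.T                                  # (n, n) pairwise similarities
    s = np.linalg.svd(gram, compute_uv=False)       # singular values of the Gram matrix
    p = s / s.sum()                                 # normalize to a probability vector
    p = p[p > 1e-12]                                # drop numerically zero entries
    h = float(-(p * np.log(p)).sum())               # Shannon entropy in nats
    if normalize:
        h /= np.log(E.shape[0])                     # assumption: rescale to [0, 1]
    return h

# Identical rollouts give entropy near 0; mutually orthogonal ones give the maximum.
consistent = np.tile(np.array([[1.0, 2.0, 3.0]]), (8, 1))
divergent = np.eye(8)
print(spectral_entropy(consistent), spectral_entropy(divergent))
```

With this normalization, a chart whose 8 reconstructions all look alike scores close to 0, and one whose reconstructions share nothing scores 1.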
Training Pipeline
Stage 1: Difficulty-Filtered Cold Start
- Aggregate charts from existing datasets and filter by RPE ≥ 0.4
- Use Claude-4-Sonnet to infer source code for high-complexity charts
- Curate 60K high-quality seed samples
Stage 2: Iterative Self-Enhancement
- Generate 2M raw candidates via high-temperature sampling
- Apply tri-fold filtering:
  - Valid Execution
  - High Complexity (RPE ≥ 0.4)
  - Low Similarity to existing data (Cosine Sim ≤ 0.65)
- Retrain coder on expanded dataset
- Repeat for 2 iterations
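The tri-fold filter in Stage 2 can be sketched as follows. This is a hedged illustration only: the dictionary keys (`executes`, `rpe`, `embedding`), the function names, and the choice to compare each candidate against its most similar neighbor are assumptions. The embedding used for the similarity check is not specified above, and adding accepted candidates to the comparison pool is likewise an assumption.

```python
import numpy as np

RPE_MIN = 0.4    # complexity threshold from the pipeline description
SIM_MAX = 0.65   # cosine-similarity ceiling from the pipeline description

def max_cosine_similarity(vec, pool):
    """Highest cosine similarity between `vec` and any vector in `pool`."""
    if len(pool) == 0:
        return -1.0  # nothing to compare against, so the candidate is trivially novel
    pool = np.asarray(pool, dtype=np.float64)
    vec = np.asarray(vec, dtype=np.float64)
    sims = pool @ vec / (np.linalg.norm(pool, axis=1) * np.linalg.norm(vec))
    return float(sims.max())

def tri_fold_filter(candidates, existing_embeddings):
    """Keep candidates that execute, are complex enough, and are not near-duplicates."""
    pool = [np.asarray(e, dtype=np.float64) for e in existing_embeddings]
    kept = []
    for cand in candidates:
        if not cand["executes"]:                    # 1. valid execution
            continue
        if cand["rpe"] < RPE_MIN:                   # 2. high complexity
            continue
        if max_cosine_similarity(cand["embedding"], pool) > SIM_MAX:  # 3. low similarity
            continue
        kept.append(cand)
        # Assumption: accepted samples join the pool so later candidates are
        # also deduplicated against them.
        pool.append(np.asarray(cand["embedding"], dtype=np.float64))
    return kept
```

In a full iteration, the kept samples would be merged into the training set before the coder is retrained.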
Final Output: Generate 1M high-complexity chart code samples for downstream QA synthesis.
Training Details
- Base Model: Qwen2.5-Coder-7B-Instruct
- Cold Start Data: 60K high-complexity samples
- Boost Data: 200K iteratively filtered samples
- Training: Full-parameter fine-tuning with LLaMA-Factory
- Learning Rate: 2.0 × 10⁻⁵
- Batch Size: 16
- Context Length: 4,096 tokens
- Epochs: 5
- Precision: BF16
Synthesized Data Quality
Comparison with Existing Datasets
ChartVerse-Coder synthesizes charts with significantly higher complexity and diversity than the existing chart datasets it was compared against.
Synthesized Chart Examples
Our synthesized charts demonstrate exceptional diversity:
- 3D Visualizations: Surface plots, 3D bar charts, scatter plots
- Hierarchical Structures: Treemaps, sunburst charts, dendrograms
- Statistical Plots: Violin plots, radar charts, box plots with annotations
- Multi-Subplot Layouts: Complex dashboards with mixed chart types
- Specialized Charts: Sankey diagrams, chord diagrams, heatmaps with clustering
Quick Start
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load Model
model_path = "opendatalab/ChartVerse-Coder"
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Prompt (sent as the user message below)
prompt = """You are a Python visualization expert. Generate a random Python visualization code focusing on charts, tables, or diagrams.
Requirements:
- Choose any visualization type (chart, table, flowchart, diagram, etc.)
- Create sample data
- Use Python visualization library (matplotlib, graphviz, etc.)
- Make it visually appealing with proper labels, titles, and colors
- Include sufficient visual elements
- Carefully design the layout to avoid any overlapping text or elements
- Adjust figure size, margins, and spacing for optimal clarity
- Make it visually appealing with proper labels, titles, and colors
Output format: Only output the Python visualization code wrapped in ```python```
"""
# Generate Chart Code
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Move inputs to wherever device_map="auto" placed the model
inputs = tokenizer(text, return_tensors="pt").to(model.device)
# High-temperature sampling for diversity
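outputs = model.generate(
    **inputs,
    max_new_tokens=4096,
    temperature=1.0,
    top_p=0.95,
    top_k=20,
    do_sample=True
)
# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
generated_code = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(generated_code)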
Execute Generated Code
import re
import matplotlib.pyplot as plt
# Extract code from response
code_match = re.search(r'```python\n(.*?)```', generated_code, re.DOTALL)
if code_match:
    code = code_match.group(1)
    exec(code)  # This will save the figure as 'image.png'
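When many samples are generated, it can be more convenient to run each script in a separate interpreter than to call `exec` in the current process, so that a crash or an endless loop in one script does not affect the rest. The helper below is a minimal sketch under stated assumptions: the name `run_chart_code`, the 60-second timeout, and the use of a temporary working directory are all hypothetical choices, and a subprocess gives process isolation only, not a security sandbox.

```python
import os
import subprocess
import sys
import tempfile

def run_chart_code(code, timeout_s=60.0):
    """Return True if `code` runs to completion in a fresh Python interpreter.

    Files written by the script (for example a saved figure) land in a
    temporary directory that is deleted afterwards, so this only checks
    that the script executes without errors.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = os.path.join(workdir, "chart.py")
        with open(script, "w", encoding="utf-8") as f:
            f.write(code)
        # Force a headless Matplotlib backend so no display is required.
        env = dict(os.environ, MPLBACKEND="Agg")
        try:
            result = subprocess.run(
                [sys.executable, script],
                cwd=workdir,
                env=env,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0
```

A call such as `run_chart_code(code)` on the extracted snippet then yields a simple pass/fail signal, which is the kind of check a valid-execution filter needs.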
Citation
@article{chartverse2026,
  title={ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch},
  author={Anonymous Authors},
  journal={Anonymous ACL Submission},
  year={2026}
}
License
This model is released under the Apache 2.0 License.
Acknowledgements
- Base model: Qwen2.5-Coder-7B-Instruct
- Training framework: LLaMA-Factory
- Code inference: Claude-4-Sonnet for cold start data generation