---
license: mit
language:
- en
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---

# Model Card

This is a **simulator model** used to score candidate natural-language explanations of internal features in Llama-3.1-8B. Given:

- an input text sequence `x` (tokenized),
- a candidate explanation `E` (e.g., “encodes city names”),

the simulator predicts **where the described feature should activate** in the sequence, as token-level activation scores. These simulated activations can then be compared with the target feature’s *true* activations, so that an explanation is scored by the correlation between the two (the "simulator score" / correlation objective described in [the paper](https://arxiv.org/abs/2511.08579)).
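
The comparison step can be illustrated with a small, self-contained sketch. It assumes the score is a plain Pearson correlation over the tokens of a single sequence; the paper's exact objective (for example, how scores are aggregated across sequences) may differ. The function name `simulator_score` and the activation values are hypothetical and are not part of the repository.

```python
import math
from typing import Sequence


def simulator_score(simulated: Sequence[float], true: Sequence[float]) -> float:
    """Pearson correlation between simulated and true per-token activations.

    Assumption: one sequence at a time, plain Pearson correlation.
    """
    if len(simulated) != len(true):
        raise ValueError("activation sequences must have the same length")
    if len(simulated) < 2:
        raise ValueError("need at least two tokens to compute a correlation")
    n = len(simulated)
    mean_s = sum(simulated) / n
    mean_t = sum(true) / n
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(simulated, true))
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    var_t = sum((t - mean_t) ** 2 for t in true)
    if var_s == 0.0 or var_t == 0.0:
        # A constant sequence has undefined correlation; treat it as no signal.
        return 0.0
    return cov / math.sqrt(var_s * var_t)


# Hypothetical activations for the five tokens of "I flew to Paris yesterday".
true_acts = [0.0, 0.0, 0.1, 0.9, 0.0]
good_sim = [0.0, 0.1, 0.0, 0.8, 0.0]  # e.g. simulated from "encodes city names"
bad_sim = [0.7, 0.0, 0.0, 0.0, 0.6]   # e.g. simulated from an unrelated explanation

print(simulator_score(good_sim, true_acts))  # close to 1: explanation fits
print(simulator_score(bad_sim, true_acts))   # negative: explanation does not fit
```

Under this assumption, an explanation whose simulated activations peak on the same tokens as the true feature scores near 1, while one that peaks elsewhere scores near or below 0.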

---
## Usage

**Note:** This simulator cannot be used through the standard `transformers` APIs alone. You must first **clone and install [our repository](https://github.com/TransluceAI/introspective-interp/tree/main#)**, which provides the custom simulator wrapper and scoring utilities.

```python
from observatory_utils.simulator import FinetunedSimulator

simulator_device_idx = 0  # index of the GPU that will host the simulator
cache_dir = None  # optional cache directory, e.g. config.get("cache_dir", None)

simulator = FinetunedSimulator.setup(
    model_path="Transluce/features_explain_llama3.1_8b_simulator",
    add_special_tokens=True,
    gpu_idx=simulator_device_idx,
    tokenizer_path="meta-llama/Llama-3.1-8B",
    cache_dir=cache_dir,
)
```