---
license: apache-2.0
datasets:
- bigcode/the-stack-dedup
library_name: transformers
language:
- code
---

## CodeSage-Base

### Updates
* [12/2024] <span style="color:blue">We are excited to announce the release of the CodeSage V2 model family, with largely improved performance and flexible embedding dimensions!</span> Please check out our [models](https://huggingface.co/codesage) and [blogpost](https://code-representation-learning.github.io/codesage-v2.html) for more details.
* [11/2024] You can now access CodeSage models through SentenceTransformer.

### Model description
CodeSage is a new family of open code embedding models with an encoder architecture that supports a wide range of source code understanding tasks. It is introduced in the paper:

[Code Representation Learning At Scale by Dejiao Zhang*, Wasi Uddin Ahmad*, Ming Tan, Hantian Ding, Ramesh Nallapati, Dan Roth, Xiaofei Ma, Bing Xiang](https://arxiv.org/abs/2402.01935) (* indicates equal contribution).

### Pretraining data
This checkpoint is trained on [The Stack (deduplicated)](https://huggingface.co/datasets/bigcode/the-stack-dedup). Nine languages are supported: c, c-sharp, go, java, javascript, typescript, php, python, ruby.
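To inspect the pretraining corpus, a per-language slice can be streamed from the Hub. A minimal sketch, assuming the dataset's per-language layout under `data/<language>` (see the dataset card for the exact structure and access terms):

```python
from datasets import load_dataset

# Stream the Python subset of The Stack (dedup) without downloading the full corpus.
# The "data/python" layout follows the dataset card; adjust for other languages.
ds = load_dataset("bigcode/the-stack-dedup", data_dir="data/python",
                  split="train", streaming=True)

sample = next(iter(ds))
print(sample["content"][:200])  # "content" holds the raw source file text
```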

### Training procedure
This checkpoint is first trained on code data via masked language modeling (MLM) and then further trained on bimodal text-code pairs. Please refer to the paper for more details.
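The paper trains this second stage contrastively on (text, code) pairs, pulling matched pairs together while pushing apart in-batch negatives. The sketch below is a generic InfoNCE-style contrastive loss, not the authors' exact objective; the function name and temperature value are illustrative:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(text_emb: torch.Tensor, code_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """Contrastive loss with in-batch negatives.

    Row i of `text_emb` should match row i of `code_emb`; every other row
    in the batch serves as a negative. Generic sketch, not CodeSage's code.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    code_emb = F.normalize(code_emb, dim=-1)
    logits = text_emb @ code_emb.T / temperature  # (batch, batch) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric: text->code and code->text retrieval directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```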

### How to Use
This checkpoint consists of a 356M-parameter encoder that can be used to extract 1024-dimensional code embeddings.

1. Accessing CodeSage via Hugging Face: the model can be loaded with the `AutoModel` functionality and uses the [StarCoder tokenizer](https://arxiv.org/pdf/2305.06161.pdf).
```python
from transformers import AutoModel, AutoTokenizer

checkpoint = "codesage/codesage-base"
device = "cuda"  # "cpu" for CPU usage

# Note: CodeSage requires adding an eos token at the end of each tokenized sequence
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, add_eos_token=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)

inputs = tokenizer.encode("def print_hello_world():\tprint('Hello World!')", return_tensors="pt").to(device)
embedding = model(inputs)[0]
```
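To compare two snippets you need one vector per input. A minimal sketch continuing from the code above, assuming the first model output holds the per-token hidden states and using mean pooling over tokens (one common choice; not necessarily the pooling used in the paper):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def embed(code: str) -> torch.Tensor:
    ids = tokenizer.encode(code, return_tensors="pt").to(device)
    hidden = model(ids)[0]      # assumed: per-token hidden states, (1, seq_len, dim)
    return hidden.mean(dim=1)   # mean-pool into one vector per snippet

a = embed("def add(x, y):\n    return x + y")
b = embed("def sum_two(p, q):\n    return p + q")
print(F.cosine_similarity(a, b).item())  # semantically similar -> higher score
```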

2. Accessing CodeSage via SentenceTransformer:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("codesage/codesage-base", trust_remote_code=True)
```
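The SentenceTransformer wrapper handles pooling internally, so `encode` plus a cosine-similarity utility is enough for a small code-search loop. A minimal sketch (the query and corpus snippets are illustrative):

```python
from sentence_transformers import util

query = "reverse a string"
corpus = [
    "def reverse(s):\n    return s[::-1]",
    "def factorial(n):\n    return 1 if n <= 1 else n * factorial(n - 1)",
]

query_emb = model.encode(query, convert_to_tensor=True)
corpus_emb = model.encode(corpus, convert_to_tensor=True)

scores = util.cos_sim(query_emb, corpus_emb)  # shape (1, len(corpus))
best = int(scores.argmax())
print(corpus[best])  # expected: the string-reversal snippet
```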

### BibTeX entry and citation info
```bibtex
@inproceedings{zhang2024codesage,
  title={CodeSage: Code Representation Learning At Scale},
  author={Dejiao Zhang* and Wasi Ahmad* and Ming Tan and Hantian Ding and Ramesh Nallapati and Dan Roth and Xiaofei Ma and Bing Xiang},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=vfzRRjumpX}
}
```