---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
license: mit
---

<div align="center">
  <img src="https://raw.githubusercontent.com/Anditty/OASIS/refs/heads/main/Group.svg" width="60%" alt="Kwaipilot" />
</div>
<hr>

# Kwaipilot OASIS-1.3B

## News 📢

- 🔥 [2025/03/12] Our latest Code Embedding Model [OASIS-code-1.5B](https://huggingface.co/Kwaipilot/OASIS-code-1.5B) is now released.
- 🔥 [2025/03/12] Our preprint is now available at [OASIS-arxiv](https://arxiv.org/abs/2503.08161).

## Model Details

**Model Name**: OASIS (Order-Augmented Strategy for Improved code Search)

**Introduction**

OASIS is a state-of-the-art code embedding model developed by Kwaipilot. It combines three proprietary methods: **repository-level program analysis**, the **OASIS-instruct data synthesis** algorithm, and a **specialized fusion loss function**. Together these improve code search accuracy, as reported in the Performance section.

**Intended Use**

This model is intended for developers and researchers building or improving **code retrieval systems**. OASIS is suited to tasks that require semantic understanding and retrieval of code snippets across varied programming contexts, such as matching a natural-language query to the function that implements it.

**Training and Performance**

OASIS was trained on a synthetic dataset created through repository-level analysis, giving it coverage of different coding styles and languages. It achieves the highest average score among the models compared on the code search benchmarks listed in the Performance section.

## Future Directions

Kwaipilot's upcoming initiatives include:

- ~~Open sourcing improved models.~~ Please visit our latest model [OASIS-code-1.5B](https://huggingface.co/Kwaipilot/OASIS-code-1.5B).
- ~~Releasing technical reports.~~ Our preprint is now available at [OASIS-arxiv](https://arxiv.org/abs/2503.08161).
- Releasing natural language processing models.
- ...

## Performance

| Model | Size | CoSQA | AdvTest | CSN-Py | CSN-Java | CSN-JS | CSN-PHP | CSN-Go | CSN-Ruby | Avg |
|-----------------|:-----:|:------:|:---------:|:--------:|:-------:|:-------:|:-------:|:-------:|:-------:|:-------:|
| OpenAI-Embedding-Ada-002 | Unknown | 0.4423 | 0.3808 | 0.6802 | 0.7149 | 0.6750 | 0.6062 | 0.8563 | **0.7472** | 0.6378 |
| jina-embeddings-v2-base-code | 161M | **0.6837** | 0.385 | 0.6634 | 0.6803 | 0.6304 | 0.5701 | 0.8595 | 0.7095 | 0.6477 |
| CodeSage-large | 1.3B | 0.4753 | **0.5267** | 0.7077 | 0.7021 | **0.695** | 0.6133 | 0.8371 | 0.7192 | 0.6595 |
| CodeFuse-CGE-Small | 3.8B | 0.5619 | 0.4639 | 0.6958 | 0.6863 | 0.6564 | 0.6133 | 0.8637 | 0.7341 | 0.6594 |
| OASIS-1.3B | 1.3B | 0.5532 | 0.4861 | **0.7110** | **0.7199** | 0.6727 | **0.6217** | **0.8732** | 0.7333 | **0.6713** |

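
The Avg column appears to be the unweighted mean of the eight benchmark columns. The sketch below recomputes it from the rounded scores in the table as a sanity check. The helper name `mean_score` and the tolerance are illustrative assumptions; the tolerance only absorbs rounding in the published figures.

```python
# Scores copied from the table, in column order CoSQA ... CSN-Ruby,
# paired with the Avg value reported in the last column.
rows = {
    "OpenAI-Embedding-Ada-002": ([0.4423, 0.3808, 0.6802, 0.7149, 0.6750, 0.6062, 0.8563, 0.7472], 0.6378),
    "jina-embeddings-v2-base-code": ([0.6837, 0.385, 0.6634, 0.6803, 0.6304, 0.5701, 0.8595, 0.7095], 0.6477),
    "CodeSage-large": ([0.4753, 0.5267, 0.7077, 0.7021, 0.695, 0.6133, 0.8371, 0.7192], 0.6595),
    "CodeFuse-CGE-Small": ([0.5619, 0.4639, 0.6958, 0.6863, 0.6564, 0.6133, 0.8637, 0.7341], 0.6594),
    "OASIS-1.3B": ([0.5532, 0.4861, 0.7110, 0.7199, 0.6727, 0.6217, 0.8732, 0.7333], 0.6713),
}


def mean_score(scores):
    """Unweighted mean over the benchmark columns (assumed definition of Avg)."""
    return sum(scores) / len(scores)


TOLERANCE = 2e-4  # assumption: slack for rounding of the per-benchmark scores

for name, (scores, reported) in rows.items():
    recomputed = mean_score(scores)
    status = "ok" if abs(recomputed - reported) < TOLERANCE else "mismatch"
    print(f"{name:30s} recomputed={recomputed:.4f} reported={reported:.4f} {status}")
```
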
## Usage

### Direct Usage

```bash
pip install -U torch
pip install -U transformers
```

Avoid `torch==2.5.0` when loading the model with `torch_dtype=torch.bfloat16`. For optimal performance and stability, use PyTorch 2.4.1 or earlier, or upgrade to 2.5.1 or later.

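
One way to guard against the affected release is to check the installed version string before choosing a dtype. The following is a minimal sketch, not part of the model's tooling: `is_affected_torch_version` is a hypothetical helper, and falling back to `float32` is an assumption about what a reasonable workaround looks like.

```python
def is_affected_torch_version(version: str) -> bool:
    """Return True if the version string denotes PyTorch 2.5.0.

    Local build tags such as "+cu121" are ignored.
    """
    public = version.split("+", 1)[0]  # drop the local build tag, if any
    return public.split(".")[:3] == ["2", "5", "0"]


for v in ["2.4.1", "2.5.0", "2.5.0+cu121", "2.5.1"]:
    print(v, is_affected_torch_version(v))

# Hypothetical usage with PyTorch installed:
#   import torch
#   dtype = torch.float32 if is_affected_torch_version(torch.__version__) else torch.bfloat16
```
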
```python
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoModel, AutoTokenizer


def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # With left padding, the final position holds a real token in every row.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        # With right padding, select each row's last non-padding position.
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


# Add query prompt
def get_query_prompt(query: str):
    query_description = 'Given a code search query, retrieve relevant code snippet that answer the query'
    prompt = f'Instruct: {query_description}\nQuery: {query}'
    return prompt


query = "How to do quicksort in python?"

code1 = """def bubble_sort(arr):
    n = len(arr)
    for i in range(n):
        swapped = False
        for j in range(1, n - i):
            if arr[j - 1] > arr[j]:
                arr[j - 1], arr[j] = arr[j], arr[j - 1]
                swapped = True
        if not swapped:
            break
    return arr"""

code2 = """def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[0]
        less = [x for x in arr[1:] if x <= pivot]
        greater = [x for x in arr[1:] if x > pivot]
        return quick_sort(less) + [pivot] + quick_sort(greater)"""

model = AutoModel.from_pretrained("Kwaipilot/OASIS-code-1.3B", output_hidden_states=True)
tokenizer = AutoTokenizer.from_pretrained("Kwaipilot/OASIS-code-1.3B")

# Tokenize and run inference (only the query gets the instruction prompt)
inputs = tokenizer([get_query_prompt(query), code1, code2], max_length=8192, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():  # gradients are not needed for embedding extraction
    outputs = model(**inputs)

# Last token pooling
embeddings = last_token_pool(outputs.hidden_states[-1], inputs['attention_mask'])
print(embeddings.shape)
# torch.Size([3, 2048])

# L2-normalize so the dot product equals cosine similarity
embeddings = F.normalize(embeddings, dim=1, p=2)
similarity = embeddings @ embeddings.T
print(similarity[0, 1:])
# tensor([0.6495, 0.8036])
```
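
To make the pooling step easier to follow, here is a dependency-free toy version of the index logic in `last_token_pool`. It is only an illustration: `last_token_positions` is a hypothetical helper that works on plain lists and returns the position that would be pooled for each row, rather than the hidden states themselves.

```python
def last_token_positions(attention_mask):
    """Return, for each row, the index of the token whose hidden state is pooled."""
    last_column = len(attention_mask[0]) - 1
    # Left padding: every row ends with a real token, so the last column is used.
    if all(row[-1] == 1 for row in attention_mask):
        return [last_column] * len(attention_mask)
    # Right padding: the last real token sits at (number of real tokens - 1).
    return [sum(row) - 1 for row in attention_mask]


right_padded = [
    [1, 1, 1, 0, 0],  # 3 real tokens, then padding
    [1, 1, 1, 1, 1],  # 5 real tokens
]
left_padded = [
    [0, 0, 1, 1, 1],  # padding first, then 3 real tokens
    [1, 1, 1, 1, 1],
]

print(last_token_positions(right_padded))  # [2, 4]
print(last_token_positions(left_padded))   # [4, 4]
```
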

### Sentence Transformers

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
# Optionally add model_kwargs={"torch_dtype": torch.bfloat16} (requires `import torch`).
model = SentenceTransformer("Kwaipilot/OASIS-code-1.3B")

query = "How to do quicksort in python?"

code1 = """def bubble_sort(arr):
    n = len(arr)
    for i in range(n):
        swapped = False
        for j in range(1, n - i):
            if arr[j - 1] > arr[j]:
                arr[j - 1], arr[j] = arr[j], arr[j - 1]
                swapped = True
        if not swapped:
            break
    return arr"""

code2 = """def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[0]
        less = [x for x in arr[1:] if x <= pivot]
        greater = [x for x in arr[1:] if x > pivot]
        return quick_sort(less) + [pivot] + quick_sort(greater)"""

# Run inference: the query uses the "query" prompt, code snippets are encoded as-is
query_embedding = model.encode([query], prompt_name="query")
code_embeddings = model.encode([code1, code2])
print(code_embeddings.shape)
# (2, 2048)

# Get the similarity scores for the embeddings
print(model.similarity(query_embedding[0], code_embeddings[0]))
print(model.similarity(query_embedding[0], code_embeddings[1]))
# tensor([[0.6495]])
# tensor([[0.8036]])
```
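
In a retrieval setting, the similarity scores are typically used to rank candidate snippets for a query. The sketch below shows that ranking step in plain Python, assuming cosine similarity. The three-dimensional vectors are made-up stand-ins for real embeddings, and `cosine_similarity` and `rank_by_similarity` are hypothetical helpers, not part of sentence-transformers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, candidate_vecs):
    """Return (candidate index, score) pairs, most similar first."""
    scores = [cosine_similarity(query_vec, vec) for vec in candidate_vecs]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for model.encode(...) outputs (illustrative only).
query_vec = [1.0, 0.0, 1.0]
candidate_vecs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 0.0, 0.9],  # nearly parallel to the query
    [0.5, 0.5, 0.5],  # partially aligned
]

ranking = rank_by_similarity(query_vec, candidate_vecs)
print([index for index, _ in ranking])  # [1, 2, 0]
```

With real embeddings, `query_vec` would be one row of `query_embedding` and `candidate_vecs` the rows of `code_embeddings`.
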

### BibTeX

```bibtex
@misc{kwaipilotoasis,
  title = {Optimized Augmentation Strategy for Improved code Search},
  author = {Kwaipilot team},
  year = {2024},
}
```