	---
	library_name: sentence-transformers
	pipeline_tag: sentence-similarity
	tags:
	- sentence-transformers
	- sentence-similarity
	- feature-extraction
	license: mit
	---
	<div align="center">
	<img src="https://raw.githubusercontent.com/Anditty/OASIS/refs/heads/main/Group.svg" width="60%" alt="Kwaipilot" />
	</div>
	<hr>

	# Kwaipilot OASIS-1.3B

	## News 📢

	- 🔥 [2025/03/12] Our latest Code Embedding Model [OASIS-code-1.5B](https://huggingface.co/Kwaipilot/OASIS-code-1.5B) is now released.
	- 🔥 [2025/03/12] Our preprint is now available at [OASIS-arxiv](https://arxiv.org/abs/2503.08161).

	## Model Details
	Model Name: OASIS (Order-Augmented Strategy for Improved code Search)

	### Introduction

	OASIS is a state-of-the-art code embedding model developed by Kwaipilot. It combines repository-level program analysis, the OASIS-instruct data synthesis algorithm, and a specialized fusion loss function to improve code search efficiency and accuracy.

	### Intended Use

	This model is intended for developers and researchers who build or improve code retrieval systems. OASIS is best suited to scenarios that require semantic understanding and retrieval of code snippets across varied programming contexts.

	### Training and Performance

	OASIS was trained on a synthetic dataset created through repository-level analysis, which exposes the model to a broad range of coding styles and languages. It achieves the highest average score among the models compared in the Performance section.

	## Future Directions
	Kwaipilot's upcoming initiatives include:

	- ~~Open sourcing improved models.~~  Please visit our latest model [OASIS-code-1.5B](https://huggingface.co/Kwaipilot/OASIS-code-1.5B).
	- ~~Releasing technical reports.~~  Our preprint is now available at [OASIS-arxiv](https://arxiv.org/abs/2503.08161).
	- Releasing natural language processing models.
	- ...


	## Performance

	| Model | Size | CoSQA | AdvTest | CSN-Py | CSN-Java | CSN-JS | CSN-PHP | CSN-Go | CSN-Ruby | Avg |
	|-----------------|:-----:|:------:|:---------:|:--------:|:-------:|:-------:|:-------:|:-------:|:-------:|:-------:|
	| Openai-Embedding-Ada-002 | Unknown | 0.4423 | 0.3808 | 0.6802 | 0.7149 | 0.6750 | 0.6062 | 0.8563 | 0.7472 | 0.6378 |
	| jina-embeddings-v2-base-code | 161M | 0.6837 | 0.385 | 0.6634 | 0.6803 | 0.6304 | 0.5701 | 0.8595 | 0.7095 | 0.6477 |
	| CodeSage-large | 1.3B | 0.4753 | 0.5267 | 0.7077 | 0.7021 | 0.695 | 0.6133 | 0.8371 | 0.7192 | 0.6595 |
	| CodeFuse-CGE-Small | 3.8B | 0.5619 | 0.4639 | 0.6958 | 0.6863 | 0.6564 | 0.6133 | 0.8637 | 0.7341 | 0.6594 |
	| OASIS-1.3B | 1.3B | 0.5532 | 0.4861 | 0.7110 | 0.7199 | 0.6727 | 0.6217 | 0.8732 | 0.7333 | 0.6713 |
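
The Avg column appears to be the unweighted mean of the eight benchmark scores. This is an inference from the numbers, not something the table states, and the small differences in the last digit are presumably due to rounding of the per-benchmark values. A minimal sketch that recomputes the averages from the figures in the table:

```python
from statistics import mean

# Scores copied from the table, in column order:
# CoSQA, AdvTest, CSN-Py, CSN-Ja, CSN-JS, CSN-PHP, CSN-Go, CSN-Ruby.
scores = {
    "Openai-Embedding-Ada-002": [0.4423, 0.3808, 0.6802, 0.7149, 0.6750, 0.6062, 0.8563, 0.7472],
    "jina-embeddings-v2-base-code": [0.6837, 0.385, 0.6634, 0.6803, 0.6304, 0.5701, 0.8595, 0.7095],
    "CodeSage-large": [0.4753, 0.5267, 0.7077, 0.7021, 0.695, 0.6133, 0.8371, 0.7192],
    "CodeFuse-CGE-Small": [0.5619, 0.4639, 0.6958, 0.6863, 0.6564, 0.6133, 0.8637, 0.7341],
    "OASIS-1.3B": [0.5532, 0.4861, 0.7110, 0.7199, 0.6727, 0.6217, 0.8732, 0.7333],
}

# Avg column as reported in the table.
reported_avg = {
    "Openai-Embedding-Ada-002": 0.6378,
    "jina-embeddings-v2-base-code": 0.6477,
    "CodeSage-large": 0.6595,
    "CodeFuse-CGE-Small": 0.6594,
    "OASIS-1.3B": 0.6713,
}

# Assumption: Avg is the plain arithmetic mean over the eight benchmarks.
recomputed_avg = {name: mean(values) for name, values in scores.items()}

for name, value in recomputed_avg.items():
    print(f"{name}: recomputed {value:.4f}, reported {reported_avg[name]:.4f}")

best_model = max(recomputed_avg, key=recomputed_avg.get)
print("Highest average:", best_model)
```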

	## Usage

	### Direct Usage

	```bash
	pip install -U torch
	pip install -U transformers
	```

	Avoid PyTorch 2.5.0 (`torch==2.5.0`) when loading the model with `torch_dtype=torch.bfloat16`. For optimal performance and stability, use PyTorch 2.4.1 or earlier, or upgrade to 2.5.1 or later.
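
The version constraint can be checked before the model is loaded. The following is a minimal sketch rather than part of the official instructions: the helper names `is_affected_torch_version` and `installed_torch_version` are hypothetical, and the check assumes the installed version string has the usual `major.minor.patch` form, optionally followed by a local build suffix such as `+cu121`.

```python
from importlib import metadata
from typing import Optional


def is_affected_torch_version(version: str) -> bool:
    """Return True if `version` is exactly PyTorch 2.5.0, ignoring build suffixes."""
    # Drop a local build suffix such as "+cu121" or "+cpu".
    release = version.split("+", 1)[0]
    return release == "2.5.0"


def installed_torch_version() -> Optional[str]:
    """Return the installed torch version string, or None if torch is not installed."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


version = installed_torch_version()
if version is not None and is_affected_torch_version(version):
    print(
        f"torch {version} detected: avoid torch_dtype=torch.bfloat16, "
        "or install 2.4.1 or earlier, or 2.5.1 or later."
    )
```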

	```python
	import torch
	import torch.nn.functional as F

	from torch import Tensor
	from transformers import AutoModel, AutoTokenizer


	def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
	    """Return the hidden state of the last non-padding token of each sequence."""
	    # With left padding, every sequence ends on a real token.
	    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
	    if left_padding:
	        return last_hidden_states[:, -1]
	    else:
	        # With right padding, the last real token sits at (sequence length - 1).
	        sequence_lengths = attention_mask.sum(dim=1) - 1
	        batch_size = last_hidden_states.shape[0]
	        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


	# Add query prompt (only queries get the instruction prefix, code snippets do not)
	def get_query_prompt(query: str) -> str:
	    query_description = 'Given a code search query, retrieve relevant code snippet that answer the query'
	    prompt = f'Instruct: {query_description}\nQuery: {query}'
	    return prompt


	query = "How to do quicksort in python?"

	code1 = """def bubble_sort(arr):
	    n = len(arr)
	    for i in range(n):
	        swapped = False
	        for j in range(1, n - i):
	            if arr[j - 1] > arr[j]:
	                arr[j - 1], arr[j] = arr[j], arr[j - 1]
	                swapped = True
	        if not swapped:
	            break
	    return arr"""

	code2 = """def quick_sort(arr):
	    if len(arr) <= 1:
	        return arr
	    else:
	        pivot = arr[0]
	        less = [x for x in arr[1:] if x <= pivot]
	        greater = [x for x in arr[1:] if x > pivot]
	        return quick_sort(less) + [pivot] + quick_sort(greater)"""

	model = AutoModel.from_pretrained("Kwaipilot/OASIS-code-1.3B", output_hidden_states=True)
	tokenizer = AutoTokenizer.from_pretrained("Kwaipilot/OASIS-code-1.3B")
	model.eval()

	# Tokenize and run inference without tracking gradients
	inputs = tokenizer([get_query_prompt(query), code1, code2], max_length=8192, padding=True, truncation=True, return_tensors='pt')
	with torch.no_grad():
	    outputs = model(**inputs)

	# Last token pooling
	embeddings = last_token_pool(outputs.hidden_states[-1], inputs['attention_mask'])
	print(embeddings.shape)
	# torch.Size([3, 2048])

	# L2-normalize so that the dot product equals cosine similarity
	embeddings = F.normalize(embeddings, dim=1, p=2)
	similarity = embeddings @ embeddings.T
	print(similarity[0, 1:])
	# tensor([0.6495, 0.8036])
	```
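
The padding logic in `last_token_pool` can be illustrated without loading the model. The sketch below mirrors the same index selection on plain Python lists; `last_token_indices` is a hypothetical helper written for illustration, and the masks are made-up examples rather than real tokenizer output.

```python
from typing import List


def last_token_indices(attention_mask: List[List[int]]) -> List[int]:
    """Return, per row, the position that last-token pooling would select."""
    # Left padding: every row ends with a real token (mask value 1),
    # so the final position is always the last real token.
    left_padding = all(row[-1] == 1 for row in attention_mask)
    if left_padding:
        return [len(row) - 1 for row in attention_mask]
    # Right padding: the last real token sits at (number of real tokens - 1).
    return [sum(row) - 1 for row in attention_mask]


# Made-up masks for a batch of two sequences padded to length 5.
right_padded = [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
left_padded = [[0, 0, 1, 1, 1], [1, 1, 1, 1, 1]]

print(last_token_indices(right_padded))
# [2, 4]
print(last_token_indices(left_padded))
# [4, 4]
```

Either padding side therefore yields the embedding of the final real token, which is why the example works regardless of the tokenizer's `padding_side` setting.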



	### Sentence Transformers

	First install the Sentence Transformers library:

	```bash
	pip install -U sentence-transformers
	```

	Then you can load this model and run inference.
	```python
	from sentence_transformers import SentenceTransformer

	# Download from the 🤗 Hub
	# Optionally pass model_kwargs={"torch_dtype": torch.bfloat16} (requires `import torch`).
	model = SentenceTransformer("Kwaipilot/OASIS-code-1.3B")

	query = "How to do quicksort in python?"

	code1 = """def bubble_sort(arr):
	    n = len(arr)
	    for i in range(n):
	        swapped = False
	        for j in range(1, n - i):
	            if arr[j - 1] > arr[j]:
	                arr[j - 1], arr[j] = arr[j], arr[j - 1]
	                swapped = True
	        if not swapped:
	            break
	    return arr"""

	code2 = """def quick_sort(arr):
	    if len(arr) <= 1:
	        return arr
	    else:
	        pivot = arr[0]
	        less = [x for x in arr[1:] if x <= pivot]
	        greater = [x for x in arr[1:] if x > pivot]
	        return quick_sort(less) + [pivot] + quick_sort(greater)"""

	# Run inference (the "query" prompt is applied to queries only, not to code)
	query_embedding = model.encode([query], prompt_name="query")
	code_embeddings = model.encode([code1, code2])

	print(code_embeddings.shape)
	# (2, 2048)

	# Get the similarity scores for the embeddings
	print(model.similarity(query_embedding[0], code_embeddings[0]))
	print(model.similarity(query_embedding[0], code_embeddings[1]))
	# tensor([[0.6495]])
	# tensor([[0.8036]])
	```
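
In a retrieval setting, the similarity scores are typically used to rank many candidate snippets for one query. The following is a minimal, dependency-free sketch of that ranking step. The helpers `cosine_similarity` and `rank_candidates` are hypothetical, and the 3-dimensional vectors are toy stand-ins for the 2048-dimensional embeddings the model produces.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(
    query_vec: Sequence[float], candidate_vecs: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (candidate index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(candidate_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors for illustration only, not real model output.
query_vec = [1.0, 0.0, 1.0]
candidate_vecs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 0.0, 0.9],  # nearly parallel to the query
    [0.5, 0.5, 0.5],  # partially aligned
]

ranking = rank_candidates(query_vec, candidate_vecs)
print([index for index, _ in ranking])
# [1, 2, 0]
```

With real embeddings, `query_embedding[0]` and the rows of `code_embeddings` from the example above could be passed in place of the toy vectors.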

	### BibTeX
	```bibtex
	@misc{kwaipilotoasis,
	  title = {Optimized Augmentation Strategy for Improved code Search},
	  author = {Kwaipilot team},
	  year = {2024},
	}
	```