---
base_model:
- Qwen/Qwen3-1.7B
datasets:
- codefuse-ai/F2LLM
language:
- en
tags:
- transformers
license: apache-2.0
pipeline_tag: feature-extraction
library_name: sentence-transformers
---

# F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data

This model is presented in the paper [F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data](https://huggingface.co/papers/2510.02294).

The code for this model is available on [GitHub](https://github.com/codefuse-ai/CodeFuse-Embeddings/tree/main/F2LLM).

F2LLMs (Foundation to Feature Large Language Models) are foundation models directly finetuned on 6 million high-quality query-document pairs (available in [codefuse-ai/F2LLM](https://huggingface.co/datasets/codefuse-ai/F2LLM)) covering a diverse range of retrieval, classification, and clustering data, curated solely from open-source datasets without any synthetic data. These models are trained with homogeneous macro batches in a single stage, without sophisticated multi-stage pipelines.

## Usage

### With Sentence Transformers

To encode text using F2LLM with the [Sentence Transformers](https://www.sbert.net/) library:

```python
from sentence_transformers import SentenceTransformer

# Load the model in bfloat16 to reduce memory use
model = SentenceTransformer("codefuse-ai/F2LLM-1.7B", model_kwargs={"torch_dtype": "bfloat16"})

# A sample query and some documents
query = "What is F2LLM used for?"
documents = [
    'We present F2LLM, a family of fully open embedding LLMs that achieve a strong balance between model size, training data, and embedding performance.',
    'Model checkpoints, training datasets, and training code are released, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future research in text embedding models.',
    'F2LLM is a model for computing text embeddings that can be used for various NLP tasks such as information retrieval, semantic search, and text classification.'
]

# Encode the query and documents separately; encode_query applies the query prompt
query_embedding = model.encode_query(query)
document_embeddings = model.encode_document(documents)
print(query_embedding.shape, document_embeddings.shape)
# (2048,) (3, 2048)

# Compute cosine similarity between the query and documents
similarity = model.similarity(query_embedding, document_embeddings)
print(similarity)
# tensor([[0.5373, 0.6257, 0.8218]])
```

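The similarity scores are usually consumed as a ranking. The following is a minimal sketch, not part of the F2LLM code: `rank_documents` is a hypothetical helper, the document labels are shorthand invented for this illustration, and the scores are copied from the example output above. With real model output, a plain list can be obtained with `similarity[0].float().tolist()`.

```python
def rank_documents(scores, documents, top_k=None):
    """Return (score, document) pairs sorted from most to least similar.

    Hypothetical helper for illustration; not part of sentence-transformers.
    """
    if len(scores) != len(documents):
        raise ValueError("scores and documents must have the same length")
    ranked = sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


# Scores copied from the example output above.
scores = [0.5373, 0.6257, 0.8218]
# Shorthand labels standing in for the three sample documents.
labels = ["doc 0 (model family)", "doc 1 (released artifacts)", "doc 2 (use cases)"]

ranked = rank_documents(scores, labels, top_k=2)
for score, label in ranked:
    print(f"{score:.4f}  {label}")
```

The third document, which describes what F2LLM is used for, ranks first for the query "What is F2LLM used for?".
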
### With Transformers

Or directly with the [Transformers](https://huggingface.co/docs/transformers/index) library:

```python
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

model_path = "codefuse-ai/F2LLM-1.7B"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Load the model in bfloat16 on GPU 0
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map={'': 0})

query = "What is F2LLM used for?"
query_prompt = "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:"
documents = [
    'We present F2LLM, a family of fully open embedding LLMs that achieve a strong balance between model size, training data, and embedding performance.',
    'Model checkpoints, training datasets, and training code are released, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future research in text embedding models.',
    'F2LLM is a model for computing text embeddings that can be used for various NLP tasks such as information retrieval, semantic search, and text classification.'
]

def encode(sentences):
    batch_size = len(sentences)
    tokenized_inputs = tokenizer(sentences, padding=True, return_tensors='pt').to(model.device)
    last_hidden_state = model(**tokenized_inputs).last_hidden_state
    # Index of the last non-padding token in each sequence (this relies on right padding)
    eos_positions = tokenized_inputs.attention_mask.sum(dim=1) - 1
    # Use the hidden state at that position as the sequence embedding
    embeddings = last_hidden_state[torch.arange(batch_size, device=model.device), eos_positions]
    # L2-normalize so that the dot product below equals cosine similarity
    embeddings = F.normalize(embeddings, p=2, dim=1)
    return embeddings

# Encode the query (with its prompt prepended) and the documents
query_embedding = encode([query_prompt + query])
document_embeddings = encode(documents)
print(query_embedding.shape, document_embeddings.shape)
# torch.Size([1, 2048]) torch.Size([3, 2048])

# Compute cosine similarity between the query and documents
similarity = query_embedding @ document_embeddings.T
print(similarity)
# tensor([[0.5391, 0.6250, 0.8242]], device='cuda:0', dtype=torch.bfloat16,
#        grad_fn=<MmBackward0>)
```

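The pooling step in `encode` can be checked by hand on a toy batch. The sketch below is an illustration in pure Python rather than the model's actual code path: the helper names and the toy numbers are made up, and it assumes right padding, so that `sum(mask) - 1` is the index of the last real token (the same assumption behind `attention_mask.sum(dim=1) - 1` above).

```python
import math


def last_token_index(attention_mask_row):
    """Index of the last real token, assuming right padding (1s followed by 0s)."""
    return sum(attention_mask_row) - 1


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def pool_last_token(hidden_states, attention_mask):
    """Pick the last real token's hidden state per sequence and normalize it.

    hidden_states: nested list shaped [batch][seq_len][hidden_dim]
    attention_mask: nested list shaped [batch][seq_len]
    """
    return [
        l2_normalize(states[last_token_index(mask)])
        for states, mask in zip(hidden_states, attention_mask)
    ]


# Toy batch (made-up values): 2 sequences padded to length 3, hidden size 2.
hidden_states = [
    [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]],  # 3 real tokens
    [[0.0, 5.0], [6.0, 8.0], [9.0, 9.0]],  # 2 real tokens, then 1 padding position
]
attention_mask = [[1, 1, 1], [1, 1, 0]]

embeddings = pool_last_token(hidden_states, attention_mask)

# For unit-length vectors, the dot product is the cosine similarity.
cosine = sum(a * b for a, b in zip(embeddings[0], embeddings[1]))
print(embeddings, cosine)
```

The second sequence is pooled at position 1, not at the padded position 2, so the padding values never reach the embedding. If a tokenizer pads on the left instead, this indexing would pick the wrong position and the last position of each row should be used instead.
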
## Evaluation

To evaluate F2LLMs on MTEB (currently requires installing MTEB from source):

```python
import mteb
import logging

logging.basicConfig(level=logging.INFO)

task_names = [
    'AmazonCounterfactualClassification', 'ArXivHierarchicalClusteringP2P', 'ArXivHierarchicalClusteringS2S',
    'ArguAna', 'AskUbuntuDupQuestions', 'BIOSSES', 'Banking77Classification', 'BiorxivClusteringP2P.v2',
    'CQADupstackGamingRetrieval', 'CQADupstackUnixRetrieval', 'ClimateFEVERHardNegatives', 'FEVERHardNegatives',
    'FiQA2018', 'HotpotQAHardNegatives', 'ImdbClassification', 'MTOPDomainClassification',
    'MassiveIntentClassification', 'MassiveScenarioClassification', 'MedrxivClusteringP2P.v2',
    'MedrxivClusteringS2S.v2', 'SCIDOCS', 'SICK-R', 'STS12', 'STS13', 'STS14', 'STS15', 'STS17', 'STS22.v2',
    'STSBenchmark', 'SprintDuplicateQuestions', 'StackExchangeClustering.v2', 'StackExchangeClusteringP2P.v2',
    'SummEvalSummarization.v2', 'TRECCOVID', 'Touche2020Retrieval.v3', 'ToxicConversationsClassification',
    'TweetSentimentExtractionClassification', 'TwentyNewsgroupsClustering.v2', 'TwitterSemEval2015',
    'TwitterURLCorpus', 'MindSmallReranking',
]

# Restrict every task to its English test split
tasks = [
    mteb.get_task(task_name, languages=["eng"], eval_splits=["test"], exclusive_language_filter=True)
    for task_name in task_names
]

model = mteb.get_model("codefuse-ai/F2LLM-1.7B", device="cuda:0")
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, encode_kwargs={"batch_size": 16})
```

## Training

Training code is available in our [GitHub repo](https://github.com/codefuse-ai/CodeFuse-Embeddings/tree/main/F2LLM).

## Citation

If you use the F2LLM models, data, or code, please cite the following technical report.

```
@article{2025F2LLM,
  title      = {F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data},
  author     = {Ziyin Zhang and Zihan Liao and Hang Yu and Peng Di and Rui Wang},
  journal    = {CoRR},
  volume     = {abs/2510.02294},
  year       = {2025},
  url        = {https://doi.org/10.48550/arXiv.2510.02294},
  doi        = {10.48550/ARXIV.2510.02294},
  eprinttype = {arXiv},
  eprint     = {2510.02294}
}
```