	---
	base_model:
	- Qwen/Qwen3-1.7B
	datasets:
	- codefuse-ai/F2LLM
	language:
	- en
	tags:
	- transformers
	license: apache-2.0
	pipeline_tag: feature-extraction
	library_name: sentence-transformers
	---

	# F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data

	This model is presented in the paper [F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data](https://huggingface.co/papers/2510.02294).
	The code for this model is available on [GitHub](https://github.com/codefuse-ai/CodeFuse-Embeddings/tree/main/F2LLM).

	F2LLMs (Foundation to Feature Large Language Models) are foundation models directly finetuned on 6 million high-quality query-document pairs (available in [codefuse-ai/F2LLM](https://huggingface.co/datasets/codefuse-ai/F2LLM)) covering a diverse range of retrieval, classification, and clustering data, curated solely from open-source datasets without any synthetic data. These models are trained with homogeneous macro batches in a single stage, without sophisticated multi-stage pipelines.

	## Usage

	### With Sentence Transformers

	To encode text using F2LLM with the [Sentence Transformers](https://www.sbert.net/) library:

	```python
	from sentence_transformers import SentenceTransformer

	model = SentenceTransformer("codefuse-ai/F2LLM-1.7B", model_kwargs={"torch_dtype": "bfloat16"})

	# Some sample query and documents
	query = "What is F2LLM used for?"
	documents = [
	    'We present F2LLM, a family of fully open embedding LLMs that achieve a strong balance between model size, training data, and embedding performance.',
	    'Model checkpoints, training datasets, and training code are released, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future research in text embedding models.',
	    'F2LLM is a model for computing text embeddings that can be used for various NLP tasks such as information retrieval, semantic search, and text classification.'
	]

	# Encode the query and documents separately; encode_query applies the query prompt
	query_embedding = model.encode_query(query)
	document_embeddings = model.encode_document(documents)
	print(query_embedding.shape, document_embeddings.shape)
	# (2048,) (3, 2048)

	# Compute cosine similarity between the query and documents
	similarity = model.similarity(query_embedding, document_embeddings)
	print(similarity)
	# tensor([[0.5373, 0.6257, 0.8218]])
	```
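
The similarity scores above are returned in the same order as `documents`, so turning them into a ranking is a matter of sorting. The following is a minimal sketch, not part of the F2LLM code: `rank_documents` is a hypothetical helper, the short labels stand in for the three sample documents, and the scores are copied from the sample output above. In practice the scores could be obtained with something like `similarity[0].tolist()`.

```python
def rank_documents(documents, scores):
    """Return (score, document) pairs sorted from most to least similar."""
    if len(documents) != len(scores):
        raise ValueError("documents and scores must have the same length")
    return sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)


# Hypothetical short labels standing in for the three sample documents.
documents = ["overview", "release", "use cases"]
# Scores copied from the sample output above, one per document.
scores = [0.5373, 0.6257, 0.8218]

ranked = rank_documents(documents, scores)
for score, doc in ranked:
    print(f"{score:.4f}  {doc}")
```

The third document, which describes what F2LLM is used for, ranks first for the sample query.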

	### With Transformers

	Or directly with the [Transformers](https://huggingface.co/docs/transformers/index) library:

	```python
	from transformers import AutoModel, AutoTokenizer
	import torch
	import torch.nn.functional as F


	model_path = "codefuse-ai/F2LLM-1.7B"
	tokenizer = AutoTokenizer.from_pretrained(model_path)
	model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map={'': 0})

	query = "What is F2LLM used for?"
	query_prompt = "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:"
	documents = [
	    'We present F2LLM, a family of fully open embedding LLMs that achieve a strong balance between model size, training data, and embedding performance.',
	    'Model checkpoints, training datasets, and training code are released, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future research in text embedding models.',
	    'F2LLM is a model for computing text embeddings that can be used for various NLP tasks such as information retrieval, semantic search, and text classification.'
	]

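	def encode(sentences):
	    batch_size = len(sentences)
	    # Tokenize the batch with padding and move the tensors to the model's device
	    tokenized_inputs = tokenizer(sentences, padding=True, return_tensors='pt').to(model.device)
	    last_hidden_state = model(**tokenized_inputs).last_hidden_state
	    # Index of the last attended token in each sequence (assumes right padding)
	    eos_positions = tokenized_inputs.attention_mask.sum(dim=1) - 1
	    # Use the hidden state at that position as the sequence embedding
	    embeddings = last_hidden_state[torch.arange(batch_size, device=model.device), eos_positions]
	    # L2-normalize so that a dot product equals cosine similarity
	    embeddings = F.normalize(embeddings, p=2, dim=1)
	    return embeddings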

	# Encode the query and documents
	query_embedding = encode([query_prompt + query])
	document_embeddings = encode(documents)
	print(query_embedding.shape, document_embeddings.shape)
	# torch.Size([1, 2048]) torch.Size([3, 2048])

	# Compute cosine similarity between the query and documents
	similarity = query_embedding @ document_embeddings.T
	print(similarity)
	# tensor([[0.5391, 0.6250, 0.8242]], device='cuda:0', dtype=torch.bfloat16,
	# grad_fn=<MmBackward0>)
	```
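
The `encode` function above pools the hidden state at the last attended token, normalizes it, and compares embeddings with a dot product. The dependency-free sketch below walks through those three steps on made-up numbers so the indexing is easy to follow. It is an illustration under stated assumptions, not the model's implementation: all helper names are hypothetical, and `last_token_index` scans the mask from the end, so it also handles left padding, where `attention_mask.sum(dim=1) - 1` would not point at the last real token.

```python
import math


def last_token_index(mask_row):
    """Index of the last attended token (mask value 1), for left or right padding."""
    for i in range(len(mask_row) - 1, -1, -1):
        if mask_row[i] == 1:
            return i
    raise ValueError("attention mask has no attended tokens")


def pool_last_token(hidden_states, attention_mask):
    """Pick one vector per sequence: the hidden state at the last real token."""
    return [seq[last_token_index(mask)] for seq, mask in zip(hidden_states, attention_mask)]


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy batch (made-up numbers): 2 sequences, 3 positions, hidden size 2.
hidden_states = [
    [[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]],  # right-padded: last real token at index 1
    [[9.0, 9.0], [0.0, 1.0], [0.0, 2.0]],  # left-padded: last real token at index 2
]
attention_mask = [
    [1, 1, 0],
    [0, 1, 1],
]

pooled = pool_last_token(hidden_states, attention_mask)
embeddings = [l2_normalize(v) for v in pooled]
cosine = dot(embeddings[0], embeddings[1])
print(pooled)
print(cosine)
```

For the second row, `sum - 1` would select index 1 rather than index 2, which is why the padding side matters when reusing the snippet above with a different tokenizer configuration.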

	## Evaluation

	To evaluate F2LLMs on MTEB (currently requires installing MTEB from source):

	```python
	import mteb
	import logging
	logging.basicConfig(level=logging.INFO)

	task_names = ['AmazonCounterfactualClassification', 'ArXivHierarchicalClusteringP2P', 'ArXivHierarchicalClusteringS2S', 'ArguAna', 'AskUbuntuDupQuestions', 'BIOSSES', 'Banking77Classification', 'BiorxivClusteringP2P.v2', 'CQADupstackGamingRetrieval', 'CQADupstackUnixRetrieval', 'ClimateFEVERHardNegatives', 'FEVERHardNegatives', 'FiQA2018', 'HotpotQAHardNegatives', 'ImdbClassification', 'MTOPDomainClassification', 'MassiveIntentClassification', 'MassiveScenarioClassification', 'MedrxivClusteringP2P.v2', 'MedrxivClusteringS2S.v2', 'SCIDOCS', 'SICK-R', 'STS12', 'STS13', 'STS14', 'STS15', 'STS17', 'STS22.v2', 'STSBenchmark', 'SprintDuplicateQuestions', 'StackExchangeClustering.v2', 'StackExchangeClusteringP2P.v2', 'SummEvalSummarization.v2', 'TRECCOVID', 'Touche2020Retrieval.v3', 'ToxicConversationsClassification', 'TweetSentimentExtractionClassification', 'TwentyNewsgroupsClustering.v2', 'TwitterSemEval2015', 'TwitterURLCorpus', 'MindSmallReranking']

	tasks = [
	    mteb.get_task(task_name, languages=["eng"], eval_splits=["test"], exclusive_language_filter=True)
	    for task_name in task_names
	]


	model = mteb.get_model("codefuse-ai/F2LLM-1.7B", device="cuda:0")
	evaluation = mteb.MTEB(tasks=tasks)
	evaluation.run(model, encode_kwargs={"batch_size": 16})
	```

	## Training

	Training code is available in our [GitHub repo](https://github.com/codefuse-ai/CodeFuse-Embeddings/tree/main/F2LLM).

	## Citation

	If you use the F2LLM models, data, or code, please cite the following technical report.

	```bibtex
	@article{2025F2LLM,
	title={F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data},
	author={Ziyin Zhang and Zihan Liao and Hang Yu and Peng Di and Rui Wang},
	journal = {CoRR},
	volume = {abs/2510.02294},
	year = {2025},
	url = {https://doi.org/10.48550/arXiv.2510.02294},
	doi = {10.48550/ARXIV.2510.02294},
	eprinttype = {arXiv},
	eprint = {2510.02294}
	}
	```