---
license: cc-by-nc-sa-4.0
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- patent
- embeddings
- mteb
language:
- en
pipeline_tag: sentence-similarity
---

# patembed-base_small

This is a **sentence-transformers** model trained specifically for **patent text embeddings**. It is part of the **PatenTEB** project, which provides state-of-the-art models for patent document understanding and retrieval.

**Note:** This model uses task-specific instruction prompts during inference for optimal performance.

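
The prompt strings themselves are not listed in this card. As a rough sketch of the mechanism only: sentence-transformers prepends a prompt string to each input before tokenization, and `encode` accepts one through its `prompt` or `prompt_name` argument. The snippet below mimics that prepending in plain Python. The prompt text and the helper name `with_prompt` are placeholders invented for illustration, not the prompts this model was trained with.

```python
# Hypothetical prompt text, NOT taken from the model's configuration.
EXAMPLE_PROMPT = "Retrieve prior art for the following patent text: "


def with_prompt(prompt: str, texts: list[str]) -> list[str]:
    """Prepend an instruction prompt to every input text."""
    return [prompt + text for text in texts]


texts = ["A method for manufacturing semiconductor devices..."]
prompted = with_prompt(EXAMPLE_PROMPT, texts)
print(prompted[0])
```

With the real model loaded, the corresponding call would be `model.encode(texts, prompt=...)`. In recent sentence-transformers releases, any prompts shipped with a checkpoint are exposed through the `model.prompts` dictionary, which is the place to look for the actual task-specific strings.
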
## Model Details

- **Model Type**: Sentence Transformer
- **Base Architecture**: Distilled from patembed-large using layers {0,3,6,9,12,15,18,21}
- **Parameters**: 143M
- **Number of Layers**: 8
- **Hidden Size**: 1024
- **Embedding Dimension**: 512
- **Max Sequence Length**: 512 tokens
- **Language**: English
- **License**: CC BY-NC-SA 4.0

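
The layer set listed under Base Architecture is every third layer starting from 0. A minimal sketch of that selection, assuming the teacher `patembed-large` has 24 transformer layers (a count this card does not state); `NUM_TEACHER_LAYERS` and `STRIDE` are illustrative names:

```python
# Assumption: the teacher has 24 transformer layers (not stated in this card).
NUM_TEACHER_LAYERS = 24
STRIDE = 3

# Keep every third teacher layer, starting from layer 0.
student_layers = list(range(0, NUM_TEACHER_LAYERS, STRIDE))

print(student_layers)
print(len(student_layers))
```

Under that assumption the selection yields the eight indices listed above, which matches the stated number of layers.
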
## Model Description

This is a memory-constrained deployment variant. It keeps the 1024-dimensional hidden size and projects the output to 512-dimensional embeddings.

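
As a back-of-the-envelope illustration of the memory footprint, the size of the weights alone can be estimated from the 143M parameter count given in Model Details. The bytes-per-parameter values are the standard sizes of fp32 and fp16; activations and framework overhead are ignored, so treat the figures as a lower bound rather than a measurement. `weights_mb` is a hypothetical helper.

```python
PARAMS = 143_000_000  # "Parameters: 143M" from Model Details


def weights_mb(num_params: int, bytes_per_param: int) -> float:
    """Approximate size of the weights alone, in megabytes (10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


fp32_mb = weights_mb(PARAMS, 4)  # 32-bit floats
fp16_mb = weights_mb(PARAMS, 2)  # 16-bit floats
print(f"fp32: {fp32_mb:.0f} MB, fp16: {fp16_mb:.0f} MB")
```
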
This model is part of the **patembed family**, developed through multi-task learning on 13 training tasks from the PatenTEB benchmark. For the training methodology, architecture, and full evaluation results, see [our paper](https://arxiv.org/abs/2510.22264).

## Usage

### Using Sentence Transformers

```python
from sentence_transformers import SentenceTransformer, util

# Load the model
model = SentenceTransformer('datalyes/patembed-base_small')

# Encode patent texts
patent_texts = [
    "A method for manufacturing semiconductor devices...",
    "An apparatus for processing chemical compounds...",
]
embeddings = model.encode(patent_texts)

# Compute the cosine similarity between the two embeddings
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")
```
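
For reference, `util.cos_sim` computes the cosine of the angle between two vectors: their dot product divided by the product of their L2 norms. A dependency-free sketch on toy 3-dimensional vectors follows. The values are made up, and real embeddings from this model have 512 dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of a and b divided by the product of their L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings.
u = [1.0, 2.0, 2.0]
v = [2.0, 4.0, 4.0]   # same direction as u
w = [2.0, -1.0, 0.0]  # orthogonal to u

print(cosine_similarity(u, v))
print(cosine_similarity(u, w))
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score close to 0, which is why the measure suits comparing embeddings.
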

### Using Transformers

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('datalyes/patembed-base_small')
model = AutoModel.from_pretrained('datalyes/patembed-base_small')


def mean_pooling(model_output, attention_mask):
    # The first element of the model output holds the token embeddings
    token_embeddings = model_output[0]
    # Expand the mask so padding tokens contribute nothing to the sum
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Tokenize and encode
texts = ["A method for manufacturing semiconductor devices..."]
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded)

# Note: this pools the encoder's hidden states (hidden size 1024). If the
# 512-dim projection is stored as a separate sentence-transformers module,
# it is not applied on this path.
embeddings = mean_pooling(model_output, encoded['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
```
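
To make the masking step concrete, here is the same mean pooling and L2 normalization written in plain Python on a toy sequence. The token vectors are invented for illustration. The point is that padding positions (mask 0) are excluded from the average.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                summed[i] += value
    count = max(sum(attention_mask), 1e-9)  # avoid division by zero
    return [s / count for s in summed]


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


# Two real tokens followed by one padding token.
tokens = [[2.0, 0.0], [4.0, 8.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
unit = l2_normalize(pooled)
print(pooled)
print(unit)
```

The large values in the padding row never reach the result, so the pooled vector depends only on the two real tokens.
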

### Patent Retrieval Example

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('datalyes/patembed-base_small')

# Query patent
query = "Method for reducing power consumption in mobile devices"

# Candidate patents
candidates = [
    "A power management system for portable electronic devices...",
    "Chemical composition for battery manufacturing...",
    "Method for wireless data transmission in mobile networks...",
]

# Encode the query and the candidates
query_emb = model.encode(query)
candidate_embs = model.encode(candidates)

# Compute cosine similarities between the query and each candidate
scores = util.cos_sim(query_emb, candidate_embs)[0]

# Rank candidates from most to least similar
results = [(candidates[i], scores[i].item()) for i in range(len(candidates))]
results.sort(key=lambda x: x[1], reverse=True)

for patent, score in results:
    print(f"Score: {score:.4f} - {patent[:100]}...")
```
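
When the candidate pool is large, it is usually enough to keep only the top-k matches rather than sorting everything. A small sketch using `heapq.nlargest` from the standard library follows. The scores are placeholder numbers, not output from this model, and `top_k` is a hypothetical helper.

```python
import heapq


def top_k(candidates, scores, k):
    """Return the k (candidate, score) pairs with the highest scores."""
    return heapq.nlargest(k, zip(candidates, scores), key=lambda pair: pair[1])


# Placeholder candidates and scores for illustration only.
candidates = ["doc A", "doc B", "doc C", "doc D"]
scores = [0.12, 0.87, 0.45, 0.63]

best = top_k(candidates, scores, k=2)
print(best)
```

sentence-transformers also provides `util.semantic_search`, which performs the scoring and top-k selection over embedding tensors in one call and may be the more convenient choice for larger corpora.
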

## Intended Use

This model is designed for patent-specific tasks, including:

- Patent search and retrieval
- Prior art search
- Patent classification and clustering
- Technology landscape analysis

For evaluation protocols and performance analysis, see [our paper](https://arxiv.org/abs/2510.22264).

## Citation

If you use this model, please cite our paper:

```bibtex
@misc{ayaou2025patentebcomprehensivebenchmarkmodel,
  title={PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding},
  author={Iliass Ayaou and Denis Cavallucci},
  year={2025},
  eprint={2510.22264},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.22264}
}
```

**Paper**: [PatenTEB on arXiv](https://arxiv.org/abs/2510.22264)

## License

This model is released under the **Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)** license.

**Key Terms:**

- ✅ You can use, share, and adapt the model
- ✅ You must give appropriate credit
- ❌ You may not use the model for commercial purposes
- ⚠️ If you adapt or build upon this model, you must distribute under the same license

For full license details: https://creativecommons.org/licenses/by-nc-sa/4.0/

## Contact

- **Authors**: Iliass Ayaou, Denis Cavallucci
- **Institution**: ICUBE Laboratory, INSA Strasbourg
- **GitHub**: [iliass-y/patenteb](https://github.com/iliass-y/patenteb)
- **HuggingFace**: [datalyes](https://huggingface.co/datalyes)