How to use albertan017/hashencoder with Transformers:

```
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("feature-extraction", model="albertan017/hashencoder")
```

```
# Load model directly
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("albertan017/hashencoder")
model = AutoModel.from_pretrained("albertan017/hashencoder")
```
# Model Card for albertan017/hashencoder
#Encoder is the encoder from HICL: Hashtag-Driven In-Context Learning for Social Media Natural Language Understanding.
The model encodes a tweet into a topic-level embedding, which can be used to estimate **topic-level similarity** between tweets.
## Model Details
#Encoder leverages hashtags to learn inter-post topic relevance (for retrieval) via contrastive learning over 179M tweets.
It was pre-trained on pairs of posts: contrastive learning guided the model to learn topic relevance by identifying posts that share the same hashtag.
We randomly add noise to the hashtags to avoid trivial representations.
Please refer to https://github.com/albertan017/HICL for more details.
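The card does not say what form the hashtag noise takes. As a minimal sketch of one plausible reading (an assumption, not the authors' confirmed method), the hypothetical helper `noise_hashtags` below randomly deletes hashtags from a tweet so that a model cannot match pairs simply by spotting the shared hashtag string. The hashtag pattern and the `drop_prob` parameter are illustrative choices.

```python
import random
import re

# Assumed hashtag pattern: '#' followed by word characters.
HASHTAG = re.compile(r"#\w+")


def noise_hashtags(tweet, drop_prob=0.5, rng=None):
    """Randomly delete hashtags from a tweet (hypothetical noising scheme)."""
    rng = rng or random.Random()

    def maybe_drop(match):
        # Drop this hashtag with probability drop_prob, otherwise keep it.
        return "" if rng.random() < drop_prob else match.group(0)

    noised = HASHTAG.sub(maybe_drop, tweet)
    # Collapse whitespace left behind by removed hashtags.
    return " ".join(noised.split())
```

Passing a seeded `random.Random` as `rng` makes the noising reproducible; see the linked repository for the procedure actually used.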
### Model Description
- **Developed by:** Hanzhuo Tan, Department of Computing, the Hong Kong Polytechnic University
- **Model type:** RoBERTa
- **Language(s) (NLP):** English
- **License:** n.a
- **Finetuned from model:** BERTweet
### Model Sources
- **Repository:** https://github.com/albertan017/HICL
- **Paper:** HICL: Hashtag-Driven In-Context Learning for Social Media Natural Language Understanding
## Uses
```
import torch
from transformers import AutoModel, AutoTokenizer

# Load the pre-trained encoder and its tokenizer from the Hub
hashencoder = AutoModel.from_pretrained("albertan017/hashencoder")
tokenizer = AutoTokenizer.from_pretrained("albertan017/hashencoder")

tweet = "here's a sample tweet for encoding"
input_ids = torch.tensor([tokenizer.encode(tweet)])

with torch.no_grad():
    # features[0] (the last hidden state) holds the token-level embeddings
    features = hashencoder(input_ids)
```
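The card says the embeddings can be used to estimate topic-level similarity but does not specify how to pool token embeddings or which similarity measure to use. The sketch below assumes mean pooling over tokens followed by cosine similarity, which is a common choice and not necessarily what the authors use. It runs on toy vectors in plain Python; in practice the per-token vectors would come from the model's last hidden state.

```python
import math


def mean_pool(token_vectors):
    """Average per-token vectors into a single tweet-level vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy stand-ins for the token embeddings of two tweets
# (one inner list per token).
tweet_a = [[1.0, 0.0], [0.0, 1.0]]
tweet_b = [[2.0, 2.0]]
score = cosine_similarity(mean_pool(tweet_a), mean_pool(tweet_b))
```

A score near 1 suggests the two tweets point in the same direction in embedding space; given how the model was trained, this is best read as topical closeness.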
## Bias, Risks, and Limitations
We do not enforce semantic similarity: the model is trained for topic-level relevance, so embedding similarity should not be read as semantic similarity.
## Training Details
### Training Data
#Encoder is pre-trained on 15 GB of plain text, comprising 179 million tweets and 4 billion tokens.
Following the practice used to pre-train BERTweet, the raw data was collected from the archived Twitter stream, which contains 4 TB of sampled tweets from January 2013 to June 2021.
For data pre-processing, we ran the following steps.
First, we employed fastText to extract English tweets and only kept tweets with hashtags.
Then, low-frequency hashtags appearing in fewer than 100 tweets were filtered out to alleviate sparsity.
The resulting large-scale dataset contains 179M tweets, each with at least one hashtag, covering 180K hashtags in total.
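The hashtag filtering steps above can be sketched as follows. This is an illustrative approximation, not the authors' pipeline: it assumes the tweets have already been restricted to English (the fastText step is omitted), uses an assumed regular expression for hashtags, and lowercases hashtags before counting. The names `extract_hashtags` and `filter_tweets` are hypothetical.

```python
import re
from collections import Counter

# Assumed hashtag pattern: '#' followed by word characters.
HASHTAG = re.compile(r"#\w+")


def extract_hashtags(tweet):
    """Return the set of lowercased hashtags in a tweet."""
    return {tag.lower() for tag in HASHTAG.findall(tweet)}


def filter_tweets(tweets, min_tweets=100):
    """Keep tweets with at least one hashtag used in >= min_tweets tweets."""
    tagged = [(tweet, extract_hashtags(tweet)) for tweet in tweets]
    # Step 1: keep only tweets that contain a hashtag.
    tagged = [(tweet, tags) for tweet, tags in tagged if tags]
    # Step 2: count how many tweets each hashtag appears in.
    counts = Counter(tag for _, tags in tagged for tag in tags)
    frequent = {tag for tag, count in counts.items() if count >= min_tweets}
    # Step 3: drop tweets whose hashtags are all low-frequency.
    return [tweet for tweet, tags in tagged if tags & frequent]
```

For example, with `min_tweets=2`, the input `["a #x", "b #x", "c #y", "no tag"]` keeps only the two tweets tagged `#x`.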
### Training Procedure
To leverage hashtag-gathered context in pre-training, we use contrastive learning: #Encoder is trained to identify pairs of posts that share the same hashtag, and thereby learns topic relevance.
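The card does not name the exact contrastive objective. As a rough illustration only, the sketch below assumes an in-batch InfoNCE loss over cosine similarities: `anchors[i]` and `positives[i]` are embeddings of two posts sharing a hashtag, and the other positives in the batch act as negatives. The function name `info_nce_loss` and the `temperature` value are assumptions, and a real implementation would use a tensor library with automatic differentiation.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce_loss(anchors, positives, temperature=0.1):
    """In-batch InfoNCE: positives[j] for j != i serve as negatives for anchors[i]."""
    anchors = [normalize(a) for a in anchors]
    positives = [normalize(p) for p in positives]
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the max subtracted for numerical stability.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        # Negative log-probability of picking the true pair.
        total += log_denominator - logits[i]
    return total / len(anchors)
```

The loss is small when each anchor is closest to its own hashtag partner and large when it is closer to another post in the batch. Note that in-batch negatives can occasionally share a hashtag with the anchor; this sketch ignores that case.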
## Citation
**BibTeX:**
[More Information Needed]
**APA:**
[More Information Needed]