nyuuzyou 
posted an update 2 days ago
🇨🇳 Gitee Code Dataset - The Missing Piece of the Stack
nyuuzyou/gitee-code

Gitee is not included in the Software Heritage archive, meaning it is currently missing from datasets like The Stack. This release fills that massive gap, serving as the largest Chinese code dataset and one of the largest code corpora overall.

- 819,472,785 files from 3,105,923 repositories
- 536 GB compressed Parquet storage
- 554 programming languages
- Extensive quality filtering: Removed vendor code, artifacts, and generated files
- Rich Chinese language understanding: High volume of Chinese comments and docs
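
At 536 GB compressed, streaming is probably a more practical way to sample the data than a full download. Below is a minimal sketch, not a documented recipe: the repo ID `nyuuzyou/gitee-code` comes from this post, but the `train` split and the `language` column are assumptions, so check the dataset card for the real schema first.

```python
from collections import Counter
from itertools import islice


def language_counts(rows, limit=1000, key="language"):
    """Count values of `key` over the first `limit` rows of an iterable of dicts.

    `key="language"` is an assumed column name, not confirmed by the post.
    Rows that lack the key are counted under "unknown".
    """
    return Counter(row.get(key, "unknown") for row in islice(rows, limit))


def stream_gitee_code(split="train"):
    """Open the dataset lazily, without downloading all Parquet shards.

    Needs `pip install datasets` and network access.
    The split name "train" is an assumption.
    """
    from datasets import load_dataset

    return load_dataset("nyuuzyou/gitee-code", split=split, streaming=True)
```

Calling `language_counts(stream_gitee_code(), limit=1000).most_common(10)` would then give a rough language breakdown of the first 1,000 records, assuming the column exists under that name.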

Huge thanks to Hugging Face for the storage grant that made hosting this (and all my other datasets) possible!

I have also released several other new code datasets and rolled out quality-of-life (QoL) improvements for older ones. I will be posting about those throughout the week.

Amazing! Thanks for sharing!
