GitCode Dataset - Continuing the Chinese Code Series
nyuuzyou/gitcode-code
Following up on the Gitee release, here's another major Chinese code dataset from GitCode (CSDN's code hosting platform). Same pipeline, same clean format, more valuable data from China's developer ecosystem.
Key Stats:
- 48,142,567 files from 85,632 repositories
- 40 GB compressed Parquet storage
- 537 programming languages
- Extensive quality filtering applied
- Rich metadata: repo names, file paths, licenses, and sizes
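Since the files ship with license and size metadata, a common first step is to stream the dataset and keep only the records you can use. Below is a minimal sketch, not an official loader. The column names `license`, `size`, `repo_name`, and `path` and the `train` split are assumptions based on the metadata listed above; check the dataset card for the real schema.

```python
from typing import Iterable, Iterator


def filter_records(
    records: Iterable[dict],
    allowed_licenses: set[str],
    max_size: int,
) -> Iterator[dict]:
    """Yield records whose license is allowed and whose size is within max_size.

    "license" and "size" are hypothetical field names, not confirmed
    against the dataset schema.
    """
    for rec in records:
        size = rec.get("size") or 0  # treat a missing size as 0 bytes
        if rec.get("license") in allowed_licenses and size <= max_size:
            yield rec


def stream_gitcode():
    """Open the dataset in streaming mode so the 40 GB is not downloaded up front.

    Requires the Hugging Face `datasets` package; the "train" split name
    is an assumption.
    """
    from datasets import load_dataset  # imported lazily; only needed here

    return load_dataset("nyuuzyou/gitcode-code", split="train", streaming=True)
```

To try it, pass the stream through the filter and take a few items, for example `itertools.islice(filter_records(stream_gitcode(), {"MIT", "Apache-2.0"}, 100_000), 5)`. Keeping the filter separate from the loader means the same function works on any iterable of dicts, including a local sample.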
The final dataset in the Chinese code series is also available: nyuuzyou/jihulab-code. It's smaller in size but shares the same pipeline and formatting.