docx-corpus

The largest open corpus of .docx files for document processing research

GitHub Stars: 45

Launch Date: Jan 9, 2026

First Tracked: 6h ago

About

AI Summary

docx-corpus is the largest open dataset of .docx files designed to facilitate research in document processing.

Tags

bun
common-crawl
corpus
dataset
document-processing
docx
machine-learning
nlp
typescript
word-documents
TypeScript