| license: apache-2.0 | |
| datasets: | |
| - opendatalab/AICC | |
| language: | |
| - en | |
| - zh | |
| pipeline_tag: text-generation | |
| tags: | |
| - commoncrawl | |
| - html-extraction | |
| - content-extraction | |
| - information-extraction | |
| - qwen | |
| base_model: | |
| - Qwen/Qwen3-0.6B | |
| # Dripper(MinerU-HTML) | |
| <a href="https://github.com/opendatalab/MinerU-HTML"> | |
| <img src="https://img.shields.io/badge/GitHub-Repo-black?style=flat-square&logo=github" alt="GitHub Repo" /> | |
| </a> | |
| **Dripper(MinerU-HTML)** is an advanced HTML main content extraction tool based on Large Language Models (LLMs). It provides a complete pipeline for extracting primary content from HTML pages using LLM-based classification and state machine-guided generation. | |
| ## Features | |
| - **LLM-Powered Extraction**: Uses state-of-the-art language models to intelligently identify main content | |
| - **State Machine Guidance**: Implements logits processing with state machines for structured JSON output | |
| - **Fallback Mechanism**: Automatically falls back to alternative extraction methods on errors | |
| - **Comprehensive Evaluation**: Built-in evaluation framework with ROUGE and item-level metrics | |
| - **REST API Server**: FastAPI-based server for easy integration | |
| - **Distributed Processing**: Ray-based parallel processing for large-scale evaluation | |
| - **Multiple Extractors**: Supports various baseline extractors for comparison | |
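To illustrate the idea behind state machine guidance: a logits processor masks every token that the current state does not allow, so decoding can only produce well-formed output. The sketch below is a toy under stated assumptions. The five-token vocabulary, the JSON-array output format, and the names `allowed_tokens`, `mask_logits`, and `constrained_decode` are all hypothetical; this is not Dripper's `v1` or `v2` state machine.

```python
import json
import math

# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of tokens.
VOCAB = ["[", "]", ",", '"main"', '"other"']
LABELS = ('"main"', '"other"')


def allowed_tokens(generated, n_items):
    """Return the set of tokens the state machine permits next."""
    if not generated:
        return {"["}
    labels_done = sum(tok in LABELS for tok in generated)
    last = generated[-1]
    if last in ("[", ","):
        # Emit a label while items remain, otherwise close the array.
        return set(LABELS) if labels_done < n_items else {"]"}
    if last in LABELS:
        return {","} if labels_done < n_items else {"]"}
    return set()  # after "]" the output is complete


def mask_logits(logits, allowed):
    """Set the logit of every disallowed token to -inf."""
    return [x if tok in allowed else -math.inf for tok, x in zip(VOCAB, logits)]


def constrained_decode(logits_fn, n_items):
    """Greedy decoding in which the state machine filters each step."""
    generated = []
    while True:
        allowed = allowed_tokens(generated, n_items)
        if not allowed:
            break
        masked = mask_logits(logits_fn(generated), allowed)
        generated.append(VOCAB[masked.index(max(masked))])
    return "".join(generated)


def biased_logits(generated):
    """Stand-in for a model: it always prefers "]" and then '"other"'."""
    return [0.0, 5.0, 1.0, 2.0, 3.0]


output = constrained_decode(biased_logits, n_items=3)
print(output)
print(json.loads(output))
```

Even though the stand-in model always prefers `]`, the mask forces one label per item before the array can close, so the result always parses as JSON.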
| --- | |
| ## Installation | |
| ### Prerequisites | |
| - Python >= 3.10 | |
| - CUDA-capable GPU (recommended for LLM inference) | |
| - Sufficient memory for model loading | |
| ### Install from Source | |
| The installation process automatically handles dependencies. The `setup.py` reads dependencies from `requirements.txt` and optionally from `baselines.txt`. | |
| #### Basic Installation (Core Functionality) | |
| For basic usage of Dripper, install with core dependencies only: | |
| ```bash | |
| # Clone the repository | |
| git clone https://github.com/opendatalab/MinerU-HTML | |
| cd MinerU-HTML | |
| # Install the package with core dependencies only | |
| # Dependencies from requirements.txt are automatically installed | |
| pip install . | |
| ``` | |
| #### Installation with Baseline Extractors (for Evaluation) | |
| If you need to run baseline evaluations and comparisons, install with the `baselines` extra: | |
| ```bash | |
| # Install with baseline extractor dependencies | |
| pip install -e ".[baselines]" | |
| ``` | |
| This will install additional libraries required for baseline extractors: | |
| - `readabilipy`, `readability_lxml` - Readability-based extractors | |
| - `resiliparse` - Resilient HTML parsing | |
| - `justext` - jusText extractor | |
| - `gne` - General News Extractor | |
| - `goose3` - Goose3 article extractor | |
| - `boilerpy3` - Boilerplate removal | |
| - `crawl4ai` - AI-powered web content extraction | |
| **Note**: The baseline extractors are only needed for running comparative evaluations. For basic usage of Dripper, the core installation is sufficient. | |
| ## Quick Start | |
| ### 1. Download the model | |
| Visit the [MinerU-HTML](https://huggingface.co/opendatalab/MinerU-HTML) model page on Hugging Face to download the model, or fetch it with the following command: | |
| ```bash | |
| huggingface-cli download opendatalab/MinerU-HTML | |
| ``` | |
| ### 2. Using the Python API | |
| ```python | |
| from dripper.api import Dripper | |
| # Initialize Dripper with model configuration | |
| dripper = Dripper( | |
|     config={ | |
|         'model_path': '/path/to/your/model', | |
|         'tp': 1,  # Tensor parallel size | |
|         'state_machine': None,  # or 'v1', or 'v2' | |
|         'use_fall_back': True, | |
|         'raise_errors': False, | |
|     } | |
| ) | |
| # Extract main content from HTML | |
| html_content = "<html>...</html>" | |
| result = dripper.process(html_content) | |
| # Access results | |
| main_html = result[0].main_html | |
| ``` | |
| ### 3. Using the REST API Server | |
| ```bash | |
| # Start the server | |
| python -m dripper.server \ | |
| --model_path /path/to/your/model \ | |
| --state_machine v2 \ | |
| --port 7986 | |
| # Or use environment variables | |
| export DRIPPER_MODEL_PATH=/path/to/your/model | |
| export DRIPPER_STATE_MACHINE=v2 | |
| export DRIPPER_PORT=7986 | |
| python -m dripper.server | |
| ``` | |
| Then make requests to the API: | |
| ```bash | |
| # Extract main content | |
| curl -X POST "http://localhost:7986/extract" \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"html": "<html>...</html>", "url": "https://example.com"}' | |
| # Health check | |
| curl http://localhost:7986/health | |
| ``` | |
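The same request can be issued from Python. The sketch below mirrors the `curl` call above using only the standard library. The helper names `build_extract_request` and `extract` are hypothetical, and the response schema is not documented here, so the code only assumes that the server replies with JSON and returns it unparsed beyond that.

```python
import json
import urllib.request


def build_extract_request(html, url, base_url="http://localhost:7986"):
    """Build the POST /extract request shown in the curl example."""
    payload = {"html": html, "url": url}
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/extract",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract(html, url, base_url="http://localhost:7986", timeout=120):
    """Send the request and return the decoded JSON body (assumed to be JSON)."""
    request = build_extract_request(html, url, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not require a running server.
request = build_extract_request("<html>...</html>", "https://example.com")
print(request.get_method(), request.full_url)
```

Calling `extract(...)` requires the server from the previous step to be running; inspect the returned object to see which fields your server version provides.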
| ## Configuration | |
| ### Dripper Configuration Options | |
| | Parameter | Type | Default | Description | | |
| | --------------- | ---- | ------------ | ------------------------------------------------ | | |
| | `model_path` | str | **Required** | Path to the LLM model directory | | |
| | `tp` | int | 1 | Tensor parallel size for model inference | | |
| | `state_machine` | str | None | State machine version: `'v1'`, `'v2'`, or `None` | | |
| | `use_fall_back` | bool | True | Enable fallback to trafilatura on errors | | |
| | `raise_errors` | bool | False | Raise exceptions on errors (vs returning None) | | |
| | `debug` | bool | False | Enable debug logging | | |
| | `early_load` | bool | False | Load model during initialization | | |
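As a convenience, the defaults in the table can be made explicit before the dictionary is handed to `Dripper`. This is a minimal sketch: `make_config` is a hypothetical helper, not part of the package, and the validation rules are only what the table implies.

```python
# Defaults copied from the configuration table above.
DEFAULTS = {
    "tp": 1,
    "state_machine": None,
    "use_fall_back": True,
    "raise_errors": False,
    "debug": False,
    "early_load": False,
}
VALID_STATE_MACHINES = (None, "v1", "v2")


def make_config(model_path, **overrides):
    """Merge overrides into the documented defaults and run basic checks."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    if not model_path:
        raise ValueError("model_path is required")
    config = {"model_path": model_path, **DEFAULTS, **overrides}
    if config["state_machine"] not in VALID_STATE_MACHINES:
        raise ValueError("state_machine must be 'v1', 'v2', or None")
    if not isinstance(config["tp"], int) or config["tp"] < 1:
        raise ValueError("tp must be a positive integer")
    return config


config = make_config("/path/to/your/model", state_machine="v2", tp=2)
print(config)
```

The resulting dictionary can then be passed as `Dripper(config=config)`.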
| ### Environment Variables | |
| - `DRIPPER_MODEL_PATH`: Path to the LLM model | |
| - `DRIPPER_STATE_MACHINE`: State machine version (`v1`, `v2`, or empty) | |
| - `DRIPPER_PORT`: Server port number (default: 7986) | |
| - `VLLM_USE_V1`: Must be set to `'0'` when using a state machine | |
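A pre-flight check can catch a missing or inconsistent environment before the server starts. The sketch below is an assumption about how one might read these variables in a wrapper script; `config_from_env` is a hypothetical name and this is not how `dripper.server` itself parses them.

```python
import os


def config_from_env(env=None):
    """Read the documented variables and check VLLM_USE_V1 consistency."""
    env = os.environ if env is None else env
    model_path = env.get("DRIPPER_MODEL_PATH")
    if not model_path:
        raise RuntimeError("DRIPPER_MODEL_PATH is not set")
    # An empty DRIPPER_STATE_MACHINE means "no state machine".
    state_machine = env.get("DRIPPER_STATE_MACHINE") or None
    port = int(env.get("DRIPPER_PORT") or "7986")
    if state_machine and env.get("VLLM_USE_V1") != "0":
        raise RuntimeError("VLLM_USE_V1 must be '0' when a state machine is used")
    return {"model_path": model_path, "state_machine": state_machine, "port": port}


example_env = {
    "DRIPPER_MODEL_PATH": "/path/to/your/model",
    "DRIPPER_STATE_MACHINE": "v2",
    "VLLM_USE_V1": "0",
}
settings = config_from_env(example_env)
print(settings)
```

Passing a plain dictionary, as in the example, keeps the check testable; calling `config_from_env()` with no argument reads the real process environment.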
| ## Usage Examples | |
| ### Batch Processing | |
| ```python | |
| from dripper.api import Dripper | |
| dripper = Dripper(config={'model_path': '/path/to/model'}) | |
| # Process multiple HTML strings | |
| html_list = ["<html>...</html>", "<html>...</html>"] | |
| results = dripper.process(html_list) | |
| for result in results: | |
|     print(result.main_html) | |
| ``` | |
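For long lists it can help to process fixed-size chunks and keep track of failed items. The sketch below takes any callable with the same shape as `dripper.process`. It assumes results come back in input order, and that with `raise_errors=False` a failure shows up as `None` (either for the whole call or for a single item); both are assumptions, and `process_in_batches` and `fake_process` are hypothetical names.

```python
from types import SimpleNamespace


def process_in_batches(process, html_list, batch_size=32):
    """Run `process` on chunks; return ({index: main_html}, [failed indices])."""
    extracted, failed = {}, []
    for start in range(0, len(html_list), batch_size):
        batch = html_list[start:start + batch_size]
        results = process(batch)
        if results is None:  # assumed failure mode for a whole batch
            failed.extend(range(start, start + len(batch)))
            continue
        for offset, result in enumerate(results):
            main_html = getattr(result, "main_html", None)
            if main_html is None:
                failed.append(start + offset)
            else:
                extracted[start + offset] = main_html
    return extracted, failed


def fake_process(batch):
    """Stand-in for dripper.process so the sketch runs without a model."""
    return [SimpleNamespace(main_html=h.upper()) if h else None for h in batch]


extracted, failed = process_in_batches(
    fake_process, ["<p>a</p>", "", "<p>b</p>"], batch_size=2
)
print(extracted, failed)
```

With a real model, replace `fake_process` with `dripper.process`.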
| ### Evaluation | |
| #### Baseline Evaluation | |
| ```bash | |
| python app/eval_baseline.py \ | |
| --bench /path/to/benchmark.jsonl \ | |
| --task_dir /path/to/output \ | |
| --extractor_name dripper-md \ | |
| --default_config gpu \ | |
| --model_path /path/to/model | |
| ``` | |
| #### Two-Step Evaluation | |
| ```bash | |
| # For inference without a state machine, set VLLM_USE_V1=1 | |
| export VLLM_USE_V1=1 | |
| # When using a state machine, set VLLM_USE_V1=0 instead | |
| # export VLLM_USE_V1=0 | |
| RESULT_PATH=/path/to/output | |
| MODEL_NAME=MinerU-HTML | |
| MODEL_PATH=/path/to/model | |
| BENCH_DATA=/path/to/benchmark.jsonl | |
| # Step 1: Prepare for evaluation | |
| python app/eval_with_answer.py \ | |
| --bench $BENCH_DATA \ | |
| --task_dir $RESULT_PATH/$MODEL_NAME \ | |
| --step 1 --cpus 128 --force_update | |
| # Step 2: Run inference | |
| python app/run_inference.py \ | |
| --task_dir $RESULT_PATH/$MODEL_NAME \ | |
| --model_path $MODEL_PATH \ | |
| --output_path $RESULT_PATH/$MODEL_NAME/res.jsonl \ | |
| --no_logits | |
| # Step 3: Process results | |
| python app/process_res.py \ | |
| --response $RESULT_PATH/$MODEL_NAME/res.jsonl \ | |
| --answer $RESULT_PATH/$MODEL_NAME/ans.jsonl \ | |
| --error $RESULT_PATH/$MODEL_NAME/err.jsonl | |
| # Step 4: Evaluate with answers | |
| python app/eval_with_answer.py \ | |
| --bench $BENCH_DATA \ | |
| --task_dir $RESULT_PATH/$MODEL_NAME \ | |
| --answer $RESULT_PATH/$MODEL_NAME/ans.jsonl \ | |
| --step 2 --cpus 128 --force_update | |
| ``` | |
| ## MinerU Ecosystem & Cloud API (No GPU Required) | |
| MinerU-HTML is part of the broader **MinerU Ecosystem**. If you don't have local GPU resources, or if you want to seamlessly integrate HTML/PDF/Document extraction into your existing workflows, you can use our official Cloud API, SDKs, and RAG integrations. | |
| ### Command Line API | |
| <details> | |
| <summary>Show commands</summary> | |
| ```bash | |
| # Windows (PowerShell) | |
| irm https://cdn-mineru.openxlab.org.cn/open-api-cli/install.ps1 | iex | |
| # macOS / Linux | |
| curl -fsSL https://cdn-mineru.openxlab.org.cn/open-api-cli/install.sh | sh | |
| # Precision extract (token required) | |
| mineru-open-api auth | |
| mineru-open-api extract webpage.html -o ./output/ # local file | |
| mineru-open-api crawl https://mineru.net/apiManage/docs -o ./output/ # crawl from URL | |
| ``` | |
| </details> | |
| ### Python SDK | |
| <details> | |
| <summary>Show code</summary> | |
| ```python | |
| # pip install mineru-open-sdk | |
| from mineru import MinerU | |
| # Precision mode: tables, formulas, large files | |
| client = MinerU("your-token") # https://mineru.net/apiManage/token | |
| result = client.extract("webpage.html") # local file | |
| result = client.crawl("https://mineru.net/apiManage/docs") # crawl from URL | |
| print(result.markdown) | |
| ``` | |
| </details> | |
| ### RAG: LangChain | |
| <details> | |
| <summary>Show code</summary> | |
| ```python | |
| # pip install langchain-mineru | |
| from langchain_mineru import MinerULoader | |
| # Precision mode: full RAG pipeline | |
| from langchain_text_splitters import RecursiveCharacterTextSplitter | |
| from langchain_openai import OpenAIEmbeddings | |
| from langchain_community.vectorstores import FAISS | |
| docs = MinerULoader(source="article.html", mode="precision", token="your-token", | |
| formula=True, table=True).load() | |
| chunks = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=200).split_documents(docs) | |
| vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings()) | |
| results = vectorstore.similarity_search("key requirements", k=3) | |
| ``` | |
| </details> | |
| ### RAG: LlamaIndex | |
| llama-index-readers-mineru is an official LlamaIndex Reader supporting multi-format document extraction. | |
| <details> | |
| <summary>Show code</summary> | |
| ```python | |
| # pip install llama-index-readers-mineru | |
| from llama_index.readers.mineru import MinerUReader | |
| # Precision mode: OCR, formula, table | |
| docs = MinerUReader(mode="precision", token="your-token", | |
| ocr=True, formula=True, table=True).load_data("complex_article.html") | |
| # Full RAG pipeline | |
| from llama_index.core import VectorStoreIndex | |
| index = VectorStoreIndex.from_documents(docs) | |
| response = index.as_query_engine().query("Summarize the key content of this page") | |
| print(response) | |
| ``` | |
| </details> | |
| ### MCP Server (Claude Desktop · Cursor · Windsurf) | |
| mineru-open-mcp lets any MCP-compatible AI client parse web pages and documents as a native tool. | |
| <details> | |
| <summary>Show config</summary> | |
| ```json | |
| { | |
| "mcpServers": { | |
| "mineru": { | |
| "command": "uvx", | |
| "args": ["mineru-open-mcp"], | |
| "env": { "MINERU_API_TOKEN": "your-token" } | |
| } | |
| } | |
| } | |
| ``` | |
| </details> | |
| ## Project Structure | |
| ``` | |
| Dripper/ | |
| ├── dripper/                     # Main package | |
| │   ├── api.py                   # Dripper API class | |
| │   ├── server.py                # FastAPI server | |
| │   ├── base.py                  # Core data structures | |
| │   ├── exceptions.py            # Custom exceptions | |
| │   ├── inference/               # LLM inference modules | |
| │   │   ├── inference.py         # Generation functions | |
| │   │   ├── prompt.py            # Prompt generation | |
| │   │   ├── logits.py            # Response parsing | |
| │   │   └── logtis_processor/    # State machine logits processors | |
| │   ├── process/                 # HTML processing | |
| │   │   ├── simplify_html.py | |
| │   │   ├── map_to_main.py | |
| │   │   └── html_utils.py | |
| │   ├── eval/                    # Evaluation modules | |
| │   │   ├── metric.py            # ROUGE and item-level metrics | |
| │   │   ├── eval.py              # Evaluation functions | |
| │   │   ├── process.py           # Processing utilities | |
| │   │   └── benckmark.py         # Benchmark data structures | |
| │   └── eval_baselines/          # Baseline extractors | |
| │       ├── base.py              # Evaluation framework | |
| │       └── baselines/           # Extractor implementations | |
| ├── app/                         # Application scripts | |
| │   ├── eval_baseline.py         # Baseline evaluation script | |
| │   ├── eval_with_answer.py      # Two-step evaluation | |
| │   ├── run_inference.py         # Inference script | |
| │   └── process_res.py           # Result processing | |
| ├── requirements.txt             # Core Python dependencies (auto-installed) | |
| ├── baselines.txt                # Optional dependencies for baseline extractors | |
| ├── LICENCE                      # Apache License 2.0 | |
| ├── NOTICE                       # Copyright and attribution notices | |
| └── setup.py                     # Package setup (handles dependency installation) | |
| ``` | |
| ## Supported Extractors | |
| Dripper supports various baseline extractors for comparison: | |
| - **Dripper** (`dripper-md`, `dripper-html`): The main LLM-based extractor | |
| - **Trafilatura**: Fast and accurate content extraction | |
| - **Readability**: Mozilla's readability algorithm | |
| - **BoilerPy3**: Python port of Boilerpipe | |
| - **NewsPlease**: News article extractor | |
| - **Goose3**: Article extractor | |
| - **GNE**: General News Extractor | |
| - **Crawl4ai**: AI-powered web content extraction | |
| - And more... | |
| ## Evaluation Metrics | |
| - **ROUGE Scores**: ROUGE-N precision, recall, and F1 scores | |
| - **Item-Level Metrics**: Per-tag-type (main/other) precision, recall, F1, and accuracy | |
| - **HTML Output**: Extracted main HTML for visual inspection | |
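For readers unfamiliar with these metrics, the sketch below shows how ROUGE-N and item-level scores are typically computed. It assumes whitespace tokenization and a binary `main`/`other` labelling with `main` as the positive class; the function names are hypothetical and the project's own `metric.py` may differ in tokenization and aggregation.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1_score(precision, recall):
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def rouge_n(candidate, reference, n=1):
    """ROUGE-N precision, recall, and F1 from clipped n-gram overlap."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall, f1_score(precision, recall)


def item_metrics(predicted, gold, positive="main"):
    """Precision, recall, F1 for one label, plus overall accuracy."""
    pairs = list(zip(predicted, gold))
    tp = sum(p == positive and g == positive for p, g in pairs)
    fp = sum(p == positive and g != positive for p, g in pairs)
    fn = sum(p != positive and g == positive for p, g in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = sum(p == g for p, g in pairs) / len(pairs) if pairs else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
        "accuracy": accuracy,
    }


print(rouge_n("the cat sat", "the cat sat down", n=1))
print(item_metrics(["main", "other", "main", "other"],
                   ["main", "main", "other", "other"]))
```

Passing `positive="other"` gives the scores for the other tag type.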
| ## Development | |
| ### Running Tests | |
| ```bash | |
| # Add test commands here when available | |
| ``` | |
| ### Code Style | |
| The project uses pre-commit hooks for code quality. Install them: | |
| ```bash | |
| pre-commit install | |
| ``` | |
| ## Troubleshooting | |
| ### Common Issues | |
| 1. **VLLM_USE_V1 Error**: When using a state machine, ensure `VLLM_USE_V1=0` is set: | |
| ```bash | |
| export VLLM_USE_V1=0 | |
| ``` | |
| 2. **Model Loading Errors**: Verify model path and ensure sufficient GPU memory | |
| 3. **Import Errors**: Ensure the package is properly installed: | |
| ```bash | |
| # Reinstall the package (this will automatically install dependencies from requirements.txt) | |
| pip install -e . | |
| # If you need baseline extractors for evaluation: | |
| pip install -e ".[baselines]" | |
| ``` | |
| ## License | |
| This project is licensed under the Apache License, Version 2.0. See the [LICENCE](LICENCE) file for details. | |
| ### Copyright Notice | |
| This project contains code and model weights derived from Qwen3. Original Qwen3 Copyright 2024 Alibaba Cloud, licensed under Apache License 2.0. Modifications and additional training Copyright 2025 OpenDatalab Shanghai AILab, licensed under Apache License 2.0. | |
| For more information, please see the [NOTICE](NOTICE) file. | |
| ## Contributing | |
| Contributions are welcome! Please feel free to submit a Pull Request. | |
| ## Acknowledgments | |
| - Built on top of [vLLM](https://github.com/vllm-project/vllm) for efficient LLM inference | |
| - Uses [Trafilatura](https://github.com/adbar/trafilatura) for fallback extraction | |
| - Fine-tuned from [Qwen3](https://github.com/QwenLM/Qwen3) | |
| - Inspired by various HTML content extraction research |