# qmtStudy

**Repository Path**: ooooinfo/qmt-study

## Basic Information

- **Project Name**: qmtStudy
- **Description**: QMT study materials
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-12-25
- **Last Updated**: 2025-12-25

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# QMT Documentation Scraper

A Python web scraper that extracts content from the QMT (QuantMiniTrader) documentation website and saves it locally, in a structured layout, under a `qmthelp` directory.

## Project Structure

```
qmt-doc-scraper/
├── qmt_doc_scraper/           # Main package directory
│   ├── __init__.py           # Package initialization
│   ├── web_client.py         # HTTP requests and rate limiting
│   ├── html_parser.py        # HTML parsing and content extraction
│   ├── content_processor.py  # Content processing and formatting
│   ├── file_manager.py       # File system operations
│   └── config_manager.py     # Configuration management
├── main.py                   # Main entry point script
├── config.json              # Default configuration file
├── requirements.txt         # Python dependencies
└── README.md               # This file
```

## Installation

1. Install dependencies:
```bash
pip install -r requirements.txt
```

## Usage

### Basic Usage

Run the scraper with default settings:
```bash
python main.py
```

Run with custom configuration:
```bash
python main.py --config custom_config.json --output-dir my_docs
```

### Command-Line Interface

The scraper accepts the following command-line options:

#### Configuration Options

- `--config, -c FILE`: Path to configuration file (default: config.json)
- `--output-dir, -o DIR`: Output directory for scraped content (overrides config)
- `--url, -u URL`: Add URL to scrape (can be used multiple times)

#### Scraping Behavior Options

- `--delay SECONDS`: Delay between requests in seconds (overrides config)
- `--max-retries N`: Maximum number of retries for failed requests (overrides config)
- `--timeout SECONDS`: Request timeout in seconds (overrides config)
- `--no-assets`: Skip downloading assets (images, CSS, etc.)
- `--no-index`: Skip generating navigation index

#### Output and Logging Options

- `--verbose, -v`: Increase verbosity (use -v, -vv, or -vvv)
- `--quiet, -q`: Suppress all output except errors
- `--log-file FILE`: Write logs to file (overrides config)
- `--no-progress`: Disable progress reporting

#### Utility Options

- `--dry-run`: Show what would be scraped without actually scraping
- `--list-config`: Display current configuration and exit
- `--validate-config`: Validate configuration file and exit
- `--version`: Show version information
- `--help`: Show help message
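
For orientation, the options above could be declared with the standard-library `argparse` module roughly as follows. This is a minimal sketch, not the project's actual `main.py`: the function name `build_parser`, the use of `None` to mean "fall back to the config file", and the version string (taken from the sample output later in this README) are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the documented options (hypothetical helper)."""
    p = argparse.ArgumentParser(prog="main.py", description="QMT Documentation Scraper")

    # Configuration options
    p.add_argument("--config", "-c", metavar="FILE", default="config.json")
    p.add_argument("--output-dir", "-o", metavar="DIR", default=None)
    # action="append" lets --url be repeated; the result is a list, or None if unused
    p.add_argument("--url", "-u", metavar="URL", action="append", default=None)

    # Scraping behavior: None is assumed to mean "use the value from the config file"
    p.add_argument("--delay", type=float, metavar="SECONDS", default=None)
    p.add_argument("--max-retries", type=int, metavar="N", default=None)
    p.add_argument("--timeout", type=float, metavar="SECONDS", default=None)
    p.add_argument("--no-assets", action="store_true")
    p.add_argument("--no-index", action="store_true")

    # Output and logging: action="count" turns -v, -vv, -vvv into 1, 2, 3
    p.add_argument("--verbose", "-v", action="count", default=0)
    p.add_argument("--quiet", "-q", action="store_true")
    p.add_argument("--log-file", metavar="FILE", default=None)
    p.add_argument("--no-progress", action="store_true")

    # Utility options (--help is added by argparse automatically)
    p.add_argument("--dry-run", action="store_true")
    p.add_argument("--list-config", action="store_true")
    p.add_argument("--validate-config", action="store_true")
    p.add_argument("--version", action="version", version="%(prog)s 1.0.0")
    return p


if __name__ == "__main__":
    print(build_parser().parse_args())
```

With `argparse`, an unknown or malformed option makes the parser print a usage message and exit with status 2, which lines up with the "invalid arguments" exit code documented below.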

### Examples

#### Basic scraping with default configuration:
```bash
python main.py
```

#### Scrape specific URLs with custom output directory:
```bash
python main.py --url https://dict.thinktrader.net/innerApi/start_now.html \
               --url https://dict.thinktrader.net/innerApi/another_page.html \
               --output-dir ./qmt_docs
```

#### Scrape with custom settings and verbose output:
```bash
python main.py --delay 2.0 --max-retries 5 --verbose --log-file scraper.log
```
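
How `--delay`, `--max-retries`, and `--timeout` interact is not spelled out in this README. One plausible reading is "try once, then retry up to N times, pausing for the delay before each retry". The sketch below illustrates that reading only; `fetch_with_retries`, its default values, and the injected `fetch` callable are hypothetical and are not taken from `web_client.py`.

```python
import time
from typing import Callable, Optional


def fetch_with_retries(
    fetch: Callable[[str, float], str],
    url: str,
    *,
    delay: float = 1.0,        # assumed default, not from config.json
    max_retries: int = 3,      # assumed default
    timeout: float = 30.0,     # assumed default
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url, timeout); on OSError, wait `delay` seconds and retry."""
    last_error: Optional[OSError] = None
    # One initial attempt plus up to max_retries retries
    for attempt in range(max_retries + 1):
        if attempt > 0:
            sleep(delay)  # pause before every retry, never before the first try
        try:
            return fetch(url, timeout)
        except OSError as exc:  # network failures and timeouts are OSError subclasses
            last_error = exc
    raise RuntimeError(f"giving up on {url} after {max_retries} retries") from last_error
```

Passing `fetch` and `sleep` in as parameters keeps the retry logic independent of any particular HTTP library and makes it testable without network access.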

#### Preview what would be scraped (dry run):
```bash
python main.py --dry-run
```

#### Validate configuration file:
```bash
python main.py --validate-config --config my_config.json
```

#### Scrape without downloading assets, with minimal output:
```bash
python main.py --no-assets --quiet --no-progress
```

#### Display current configuration:
```bash
python main.py --list-config
```

### Exit Codes

The scraper returns the following exit codes:

- `0`: Success
- `1`: General error (scraping failed, unexpected error)
- `2`: Invalid arguments or configuration
- `130`: Interrupted by user (Ctrl+C)
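
A minimal sketch of how an entry point might map outcomes onto these codes is shown below. The `ConfigError` exception and the `run_with_exit_code` wrapper are hypothetical names introduced for illustration; the real `main.py` may be organized differently.

```python
import sys
from typing import Callable

EXIT_SUCCESS = 0
EXIT_ERROR = 1
EXIT_INVALID = 2
EXIT_INTERRUPTED = 130  # 128 + SIGINT (2), the usual shell convention for Ctrl+C


class ConfigError(Exception):
    """Hypothetical exception for invalid arguments or configuration."""


def run_with_exit_code(task: Callable[[], bool]) -> int:
    """Run task() and translate its outcome into one of the documented exit codes."""
    try:
        ok = task()
    except KeyboardInterrupt:
        return EXIT_INTERRUPTED
    except ConfigError as exc:
        print(f"Configuration error: {exc}", file=sys.stderr)
        return EXIT_INVALID
    except Exception as exc:
        print(f"Unexpected error: {exc}", file=sys.stderr)
        return EXIT_ERROR
    # task() returns True when scraping succeeded, False when it failed
    return EXIT_SUCCESS if ok else EXIT_ERROR
```

A script would then end with something like `sys.exit(run_with_exit_code(scrape))`, where `scrape` stands for whatever callable performs the work.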

### Progress Reporting

By default, the scraper displays progress information during execution:

```
QMT Documentation Scraper v1.0.0
Configuration: config.json
Output directory: qmthelp
URLs to scrape: 1
--------------------------------------------------
Progress: 1/1 (100.0%) - Success: 1, Failed: 0, Assets: 5

Scraping Results:
==================================================
Success: Yes
Pages scraped: 1
Assets downloaded: 5
Duration: 12.34 seconds
Output directory: qmthelp

Documentation successfully saved to: qmthelp
Open qmthelp/index.html in your browser to browse the documentation.
```

Use `--quiet` to suppress output or `--no-progress` to disable progress updates while keeping other output.
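
The progress line in the sample output can be reproduced with a small formatter such as the one below. The function name and signature are assumptions made for illustration; only the output format comes from the sample above.

```python
def format_progress(done: int, total: int, success: int, failed: int, assets: int) -> str:
    """Build a progress line in the format shown in the sample output."""
    # Guard against division by zero when there is nothing to scrape
    percent = (done / total * 100.0) if total else 0.0
    return (
        f"Progress: {done}/{total} ({percent:.1f}%) - "
        f"Success: {success}, Failed: {failed}, Assets: {assets}"
    )


if __name__ == "__main__":
    print(format_progress(1, 1, 1, 0, 5))
```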

## Configuration

The scraper uses a JSON configuration file to specify:
- Target URLs to scrape
- Output directory settings
- Request parameters (delays, retries, timeouts)
- Content processing options
- Logging configuration

See `config.json` for the default configuration structure.
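
Several command-line options are described as overriding the config file. A minimal sketch of that layering (defaults, then the JSON file, then command-line values) follows. The key names and default values in `DEFAULTS` are assumptions chosen for illustration, apart from the `qmthelp` output directory shown in the sample output; the authoritative keys are the ones in `config.json`, and `load_config` is a hypothetical helper rather than the API of `config_manager.py`.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional

# Hypothetical defaults; the real keys are whatever config.json defines.
DEFAULTS: Dict[str, Any] = {
    "urls": [],
    "output_dir": "qmthelp",
    "delay": 1.0,
    "max_retries": 3,
    "timeout": 30.0,
}


def load_config(path: str, overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Merge defaults, the JSON file at `path`, and CLI overrides, in that order."""
    config = dict(DEFAULTS)
    config_path = Path(path)
    if config_path.exists():
        # Values from the file replace the built-in defaults
        config.update(json.loads(config_path.read_text(encoding="utf-8")))
    for key, value in (overrides or {}).items():
        # None means the option was not given on the command line
        if value is not None:
            config[key] = value
    return config
```

Treating `None` as "not given" lets an option such as `--delay` win only when the user actually passes it.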

## Development Status

This project is under development. Core functionality is not yet complete and will be implemented in later development phases.