# load_model

**Repository Path**: bestForwarder_admin/load_model

## Basic Information

- **Project Name**: load_model
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-12-03
- **Last Updated**: 2025-12-04

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

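# 🤖 Local Deployment and Serving for 8B Models

A complete solution for deploying and serving 8B large language models locally. It supports several mainstream models and provides a RESTful API and a web interface.

## ✨ Features

- 🚀 **One-command startup** - Start the full service with a single command
- 🤖 **Multi-model support** - Supports mainstream 8B-class models such as Qwen, Llama, and ChatGLM
- 🌐 **RESTful API** - Standard HTTP endpoints that are easy to integrate
- 💬 **Web chat interface** - A polished Streamlit chat UI
- ⚡ **GPU acceleration** - Automatically detects and uses an available GPU
- 📊 **Real-time monitoring** - Memory usage and performance monitoring
- 🔧 **Flexible configuration** - Adjustable generation parameters
- 🛡️ **Error handling** - Robust exception handling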

## 📋 System Requirements

- Python 3.8+
- CUDA 11.0+ (optional, for GPU acceleration)
- RAM: 16 GB+ (32 GB recommended)
- VRAM: 8 GB+ (for GPU inference)

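## 🚀 Quick Start

### Option 1: One-command startup (recommended)

```bash
# Clone or download the project, then change into its directory
cd d:/code/load_model

# Install dependencies
pip install -r requirements.txt

# Start the service with one command
python quick_start.py
```

### Option 2: Manual startup

```bash
# 1. Install dependencies
pip install torch transformers accelerate uvicorn fastapi streamlit requests pydantic python-dotenv

# 2. Start the API service
python model_server.py

# 3. Start the web interface in a new terminal
streamlit run web_interface.py
```

### Option 3: Use the launcher script

```bash
# Start the API service
python start_server.py

# Start the API service and the web interface
python start_server.py --web
```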

## 🌐 Service URLs

Once startup succeeds, the service is available at the following addresses:

- **API service**: http://localhost:8000
- **Web chat interface**: http://localhost:8501
- **API documentation**: http://localhost:8000/docs
- **Health check**: http://localhost:8000/health

## 🤖 Supported Models

| Model                 | Description                          | Default path                           |
| --------------------- | ------------------------------------ | -------------------------------------- |
| Qwen2.5-8B-Instruct   | Tongyi Qianwen 2.5 8B instruct model | `Qwen2.5-0.5B-Instruct`                |
| Llama-3.1-8B-Instruct | Meta Llama 3.1 8B instruct model     | `meta-llama/Llama-3.1-8B-Instruct`     |
| ChatGLM3-6B           | Zhipu ChatGLM3 6B model              | `THUDM/chatglm3-6b`                    |
| Baichuan2-7B-Chat     | Baichuan 2 7B chat model             | `baichuan-inc/Baichuan2-7B-Chat`       |
| DeepSeek-Coder-6.7B   | DeepSeek code model                  | `deepseek-ai/deepseek-coder-6.7b-base` |

## 📝 Usage

### 1. Configure the model

Edit `config.py` to select the model you want to use:

```python
MODEL_CONFIG = {
    "model_name": "Qwen2.5-0.5B-Instruct",  # change to your model path
    "device": "auto",  # auto, cpu, cuda
    "torch_dtype": "auto",  # auto, float16, bfloat16
    "trust_remote_code": True,
}
```
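
The README does not show how the `"device": "auto"` value is resolved. The sketch below is one plausible interpretation, not the project's actual code: `resolve_device` is a hypothetical helper that picks the GPU when one is available and falls back to the CPU otherwise.

```python
def resolve_device(requested: str, cuda_available: bool) -> str:
    """Map a config value ("auto", "cpu", "cuda") to a concrete device name.

    Hypothetical helper for illustration; config.py may behave differently.
    """
    requested = requested.lower()
    if requested not in ("auto", "cpu", "cuda"):
        raise ValueError(f"unsupported device: {requested!r}")
    if requested == "auto":
        # Prefer the GPU when present, otherwise fall back to the CPU.
        return "cuda" if cuda_available else "cpu"
    if requested == "cuda" and not cuda_available:
        raise RuntimeError("device='cuda' requested but no CUDA device is available")
    return requested
```

In a real service, `cuda_available` would typically come from `torch.cuda.is_available()`.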

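### 2. API call example

```python
import requests

# Generate text
response = requests.post("http://localhost:8000/generate", json={
    "prompt": "Please give a brief introduction to artificial intelligence",
    "max_length": 512,
    "temperature": 0.7,
    "top_p": 0.9
})
response.raise_for_status()  # fail early on an HTTP error status

result = response.json()
print(result["generated_text"])
```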

### 3. Streaming generation

```python
import requests

response = requests.post(
    "http://localhost:8000/generate_stream",
    json={"prompt": "Write a poem about spring", "max_length": 200},
    stream=True
)

# Print each streamed line as soon as it arrives
for line in response.iter_lines():
    if line:
        print(line.decode('utf-8'), end='', flush=True)
```
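
Note that `iter_lines()` strips line terminators, so multi-line output such as a poem is printed without its line breaks. The wire format of `/generate_stream` is not documented here; assuming it sends raw UTF-8 text, reading raw chunks preserves newlines, but a chunk boundary can then fall in the middle of a multi-byte character. The sketch below shows one way to handle that with an incremental decoder. `decode_stream` is a hypothetical helper name.

```python
import codecs
from typing import Iterable, Iterator


def decode_stream(chunks: Iterable[bytes], encoding: str = "utf-8") -> Iterator[str]:
    """Decode byte chunks to text without breaking multi-byte characters."""
    # The incremental decoder buffers incomplete byte sequences between calls.
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush whatever is still buffered once the stream ends.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail
```

With `requests`, this could be used as `for text in decode_stream(response.iter_content(chunk_size=None)): print(text, end="", flush=True)`.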

### 4. Using the web interface

Open http://localhost:8501 and type a question into the chat box to talk to the model.

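## ⚙️ Configuration Parameters

### Generation parameters

| Parameter            | Type  | Default | Description                                    |
| -------------------- | ----- | ------- | ---------------------------------------------- |
| `prompt`             | str   | -       | Input prompt                                   |
| `max_length`         | int   | 512     | Maximum generation length                      |
| `temperature`        | float | 0.7     | Sampling temperature; higher is more random    |
| `top_p`              | float | 0.9     | Nucleus sampling threshold                     |
| `top_k`              | int   | 50      | Keep only the K most probable tokens           |
| `do_sample`          | bool  | True    | Whether to sample                              |
| `repetition_penalty` | float | 1.1     | Repetition penalty factor                      |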
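
As a convenience on the client side, the defaults from the table can be merged with per-request overrides before calling `/generate`. This is a minimal sketch: `build_generate_payload` is a hypothetical helper, and the range checks are assumptions based on common sampling constraints rather than rules documented by this server.

```python
from typing import Any, Dict

# Defaults copied from the generation parameter table.
DEFAULTS: Dict[str, Any] = {
    "max_length": 512,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 50,
    "do_sample": True,
    "repetition_penalty": 1.1,
}


def build_generate_payload(prompt: str, **overrides: Any) -> Dict[str, Any]:
    """Build a JSON payload for /generate, rejecting unknown parameter names."""
    if not prompt:
        raise ValueError("prompt must be a non-empty string")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    payload = {"prompt": prompt, **DEFAULTS, **overrides}
    # Assumed sanity checks; the server may enforce different limits.
    if payload["max_length"] <= 0:
        raise ValueError("max_length must be positive")
    if payload["temperature"] <= 0:
        raise ValueError("temperature must be positive")
    if not 0 < payload["top_p"] <= 1:
        raise ValueError("top_p must be in (0, 1]")
    return payload
```

The result can be passed directly as the `json=` argument of `requests.post`.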

### Server configuration

Edit the server settings in `config.py`:

```python
SERVER_CONFIG = {
    "host": "0.0.0.0",  # listen address
    "port": 8000,       # port number
    "workers": 1,       # number of worker processes
}
```
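
How `SERVER_CONFIG` reaches the web server is not shown in this README. One plausible mapping, sketched below, turns the settings into a `uvicorn` command line. The application path `model_server:app` is an assumption (the README only names the file `model_server.py`), and `uvicorn_command` is a hypothetical helper.

```python
import sys
from typing import Any, Dict, List

SERVER_CONFIG = {
    "host": "0.0.0.0",
    "port": 8000,
    "workers": 1,
}


def uvicorn_command(config: Dict[str, Any], app_path: str = "model_server:app") -> List[str]:
    """Translate the config dict into an argument list for `python -m uvicorn`."""
    return [
        sys.executable, "-m", "uvicorn", app_path,  # app_path is assumed
        "--host", str(config["host"]),
        "--port", str(config["port"]),
        "--workers", str(config["workers"]),
    ]
```

The list can be handed to `subprocess.run`. Because each worker is a separate process that would typically load its own copy of the model, keeping `workers` at 1 is the safer choice on a single GPU. Setting `host` to `127.0.0.1` instead of `0.0.0.0` restricts access to the local machine.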

## 🧪 Testing

### Run the test client

```bash
python test_client.py
```

### Run the demo

```bash
python demo.py
```

### Health check

```bash
curl http://localhost:8000/health
```
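
Loading a large model can take a while, so scripts that call the API right after startup may want to poll the health endpoint first. The sketch below uses only the standard library. It assumes `/health` answers with HTTP 200 once the service is ready, which this README does not spell out; `http_ok` and `wait_until_healthy` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready.
        return False


def wait_until_healthy(
    url: str = "http://localhost:8000/health",
    attempts: int = 30,
    interval: float = 2.0,
    probe: Callable[[str], bool] = http_ok,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the health endpoint until it responds or attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(interval)  # wait before the next probe
    return False
```

The `probe` and `sleep` parameters exist so the loop can be exercised without a running server.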

## 📁 Project Structure

```
d:/code/load_model/
├── model_server.py      # Main service file - FastAPI model server
├── web_interface.py     # Web chat interface - Streamlit
├── config.py           # Configuration file
├── start_server.py     # Launcher script
├── quick_start.py      # Quick-start script
├── test_client.py      # Test client
├── demo.py             # Demo script
├── requirements.txt    # Dependency list
├── .env.example       # Example environment variables
└── .gitignore         # Git ignore file
```

## 🔧 Troubleshooting

### Common issues

1. **Out of memory**

   - Try a quantized model: `load_in_4bit=True` or `load_in_8bit=True`
   - Reduce the maximum generation length
   - Switch to CPU mode: `device="cpu"`

2. **Model download fails**

   ```bash
   # Use a mirror endpoint
   export HF_ENDPOINT=https://hf-mirror.com  # Linux/Mac
   set HF_ENDPOINT=https://hf-mirror.com     # Windows CMD
   $env:HF_ENDPOINT="https://hf-mirror.com"  # Windows PowerShell
   ```

3. **CUDA errors**

   - Check the CUDA version: `nvidia-smi`
   - Install the matching PyTorch build: see the [PyTorch website](https://pytorch.org/)

4. **Port already in use**
   - Change the port number in `config.py`
   - Or stop the process that is occupying the port
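
To confirm a port conflict before changing `config.py`, a short check with the standard library is enough. This is a minimal sketch: `is_port_in_use` is a hypothetical helper that simply tries to connect to the port on the local machine.

```python
import socket


def is_port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0
```

For example, `is_port_in_use(8000)` checks the default API port and `is_port_in_use(8501)` checks the default Streamlit port.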

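### Performance tuning

1. **Use quantization**

   ```python
   MODEL_CONFIG = {
       # Enable only one of the two; transformers rejects both at once
       "load_in_4bit": True,     # 4-bit quantization
       # "load_in_8bit": True,   # 8-bit quantization
   }
   ```

2. **Enable Flash Attention**

   ```python
   MODEL_CONFIG = {
       "use_flash_attention_2": True,
   }
   ```

3. **Batch inference**
   Use the `/generate_batch` endpoint to process multiple prompts together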
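
To see why quantization matters for an 8B model, it helps to estimate the memory taken by the weights alone: parameter count times bytes per parameter. The sketch below is a rough lower bound that ignores activations, the KV cache, and quantization overhead; `weight_memory_gib` is a hypothetical helper.

```python
# Approximate storage cost of one parameter for each format.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "bfloat16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Estimate the memory needed for model weights, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


for dtype in ("float16", "int8", "int4"):
    print(f"{dtype}: {weight_memory_gib(8e9, dtype):.1f} GiB")
```

For 8 billion parameters this comes to roughly 14.9 GiB in float16, 7.5 GiB in 8-bit, and 3.7 GiB in 4-bit, which suggests that fitting an 8B model into 8 GB of VRAM generally requires quantization.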

## 📄 API Reference

### Main endpoints

| Endpoint           | Method | Description          |
| ------------------ | ------ | -------------------- |
| `/`                | GET    | Service information  |
| `/health`          | GET    | Health check         |
| `/status`          | GET    | Model status         |
| `/generate`        | POST   | Text generation      |
| `/generate_stream` | POST   | Streaming generation |
| `/generate_batch`  | POST   | Batch generation     |

For the full API documentation, visit http://localhost:8000/docs

## 🤝 Contributing

Issues and pull requests that improve this project are welcome!

## 📄 License

This project is licensed under the Apache 2.0 License. See the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- [Hugging Face Transformers](https://github.com/huggingface/transformers)
- [FastAPI](https://github.com/tiangolo/fastapi)
- [Streamlit](https://github.com/streamlit/streamlit)
- [PyTorch](https://github.com/pytorch/pytorch)

---

### Installing the GPU build of torch

To install PyTorch with CUDA 12.1 support, run `pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121`.

If this project helps you, please give it a ⭐️!