# Native-LLM-for-Android
**Repository Path**: y1320722/Native-LLM-for-Android
## Basic Information
- **Project Name**: Native-LLM-for-Android
- **License**: Apache-2.0
- **Default Branch**: main
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-03-18
- **Last Updated**: 2025-04-12
## README
# Native-LLM-for-Android
## Overview
Demonstration of running a native Large Language Model (LLM) on Android devices. Currently supported models include:
- **DeepSeek-R1-Distill-Qwen**: 1.5B
- **Qwen2.5-Instruct**: 0.5B, 1.5B
- **Qwen2/2.5VL**: 2B, 3B
- **MiniCPM-DPO/SFT**: 1B, 2.7B
- **Gemma2-it**: 2B
- **Phi3.5-mini-instruct**: 3.8B
- **Llama-3.2-Instruct**: 1B
- **InternVL-Mono**: 2B
## Recent Updates
- 2025/04/05: Updated the Qwen and InternVL-Mono exports to `q4f32` + `dynamic_axes`.
- 2025/02/22: Added a low-memory loading mode for `Qwen`, `QwenVL`, `Phi_single`, and `MiniCPM_2B_single`; set `low_memory_mode = true` in `MainActivity.java` to enable it.
- 2025/02/07: **DeepSeek-R1-Distill-Qwen**: 1.5B (export it with `Qwen_Export.py`).
## Getting Started
1. **Download Models:**
- Demo models are available on [Google Drive](https://drive.google.com/drive/folders/1E43ApPcOq3I2xvb9b7aOxazTcR3hn5zK?usp=drive_link).
- Alternatively, use [Baidu Cloud](https://pan.baidu.com/s/1NHbUyjZ_VC-o62G13KCrSA?pwd=dake) with the extraction code: `dake`.
- Quick Try: [DeepSeek-R1-Distill-Qwen-1.5B-Android](https://drive.google.com/drive/folders/1cwVeZj14DLYvl75wOH0_Cf8CWJKjSO1M?usp=sharing) / [Qwen2VL-2B](https://drive.google.com/file/d/11POekmCRLsYk9B_ivJ9st5zRIqjJKlov/view?usp=sharing)
2. **Setup Instructions:**
- Place the downloaded model files into the `assets` folder.
- Decompress the `*.so` files stored in the `libs/arm64-v8a` folder.
3. **Model Notes:**
   - Demo models are converted from HuggingFace or ModelScope and optimized for maximum execution speed.
   - Their inputs and outputs may therefore differ slightly from those of the original models.
   - For Qwen2VL / Qwen2.5VL, adjust the key variables at the following locations to match the model parameters and `export_config.py`:
- `GLRender.java: Line 37, 38, 39`
- `project.h: Line 14, 15, 16, 35, 36, 39, 116, 117, 118, 121, 122`
4. **ONNX Export Considerations:**
   - It is recommended to export with dynamic axes and `q4f32` quantization; a minimal export sketch follows this list.
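
As a minimal sketch of the dynamic-axes recommendation above: the tiny module and all names here are illustrative stand-ins, not the repo's actual export code. The real `*_Export.py` scripts trace the full decoder together with its KV-cache inputs.

```python
# Illustrative stand-in module; the real export scripts trace the full LLM decoder.
import torch

class TinyDecoderStep(torch.nn.Module):
    def forward(self, hidden):              # (batch, seq_len, dim)
        return hidden * 2.0                 # placeholder computation

model = TinyDecoderStep().eval()
dummy = torch.zeros(1, 8, 64)

torch.onnx.export(
    model, (dummy,), "model.onnx",
    input_names=["hidden"],
    output_names=["out"],
    # Marking seq_len as dynamic lets the exported graph accept any context
    # length at runtime instead of baking in the tracing shape.
    dynamic_axes={"hidden": {1: "seq_len"}, "out": {1: "seq_len"}},
    opset_version=17,
)
```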
## Tokenizer Files
- The `tokenizer.cpp` and `tokenizer.hpp` files are sourced from the [mnn-llm repository](https://github.com/alibaba/MNN/tree/master/transformers/llm/engine/src).
## Exporting Models
1. Navigate to the `Export_ONNX` folder.
2. Follow the comments in the Python scripts to set the folder paths.
3. Execute the `***_Export.py` script to export the model.
4. Quantize or optimize the ONNX model manually; a hedged quantization sketch follows these steps.
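
For step 4, one possible `q4f32` recipe (4-bit weights, FP32 activations) is onnxruntime's `MatMul4BitsQuantizer`. This is a hedged sketch; the `block_size` and symmetry settings are assumptions, not the recipes documented in `Do_Quantize`.

```python
# Hedged sketch: quantize MatMul weights to 4 bits while activations stay FP32.
# The repo's Do_Quantize folder documents the recipes actually used.
import onnx
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

model = onnx.load("model.onnx")
quantizer = MatMul4BitsQuantizer(model, block_size=128, is_symmetric=True)
quantizer.process()
quantizer.model.save_model_to_file("model_q4f32.onnx", use_external_data_format=True)
```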
## Quantization Notes
- Use `onnxruntime.tools.convert_onnx_models_to_ort` to convert models to `*.ort` format (see the invocation sketch after this list). Note that the conversion automatically inserts `Cast` operators that change FP16 multiplication to FP32.
- The quantization methods are detailed in the `Do_Quantize` folder.
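
The converter is typically invoked as a Python module from the command line; below is a minimal sketch, with the input file name assumed.

```python
# Minimal sketch of driving the ORT-format converter; it is more commonly run
# directly as: python -m onnxruntime.tools.convert_onnx_models_to_ort model_q4f32.onnx
import subprocess
import sys

subprocess.run(
    [sys.executable, "-m", "onnxruntime.tools.convert_onnx_models_to_ort",
     "model_q4f32.onnx"],   # assumed file name from the quantization step above
    check=True,
)
```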
## Additional Resources
- Explore more projects: [DakeQQ Projects](https://github.com/DakeQQ?tab=repositories)
## Performance Metrics
### DeepSeek-R1
| OS | Device | Backend | Model | Inference (1024 Context) |
|:----------:|:------------:|:-----------------------:|:----------------------:|:------------------------:|
| Android 13 | Nubia Z50 | 8_Gen2-CPU | Distill-Qwen-1.5B<br>q4f32<br>dynamic | 31.5 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | Distill-Qwen-1.5B<br>q4f32<br>dynamic | 20 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | Distill-Qwen-1.5B<br>q8f32 | 13 token/s |
| HyperOS 2 | Xiaomi-14T-Pro | MediaTek_9300+-CPU | Distill-Qwen-1.5B<br>q8f32 | 22 token/s |
### Qwen2VL
| OS | Device | Backend | Model | Inference (1024 Context) |
|:----------:|:------------:|:-----------------------:|:-----------------:|:------------------------:|
| Android 13 | Nubia Z50 | 8_Gen2-CPU | Qwen2VL-2B<br>q8f32 | 15 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | Qwen2VL-2B<br>q8f32 | 9 token/s |
### Qwen
| OS | Device | Backend | Model | Inference (1024 Context) |
|:----------:|:------------:|:-----------------------:|:----------------------:|:------------------------:|
| Android 13 | Nubia Z50 | 8_Gen2-CPU | Qwen2-1.5B-Instruct<br>q8f32 | 20 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | Qwen2.5-1.5B-Instruct<br>q4f32<br>dynamic | 20 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | Qwen2-1.5B-Instruct<br>q8f32 | 13 token/s |
| Harmony 3 | Honor 20S | Kirin_810-CPU | Qwen2-1.5B-Instruct<br>q8f32 | 7 token/s |
### MiniCPM
| OS | Device | Backend | Model | Inference (1024 Context) |
|:----------:|:------------:|:-----------------------:|:----------------------:|:------------------------:|
| Android 13 | Nubia Z50 | 8_Gen2-CPU | MiniCPM-2.7B<br>q8f32 | 9.5 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | MiniCPM-2.7B<br>q8f32 | 6 token/s |
| Android 13 | Nubia Z50 | 8_Gen2-CPU | MiniCPM-1.3B<br>q8f32 | 16.5 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | MiniCPM-1.3B<br>q8f32 | 11 token/s |
### Gemma
| OS | Device | Backend | Model | Inference (1024 Context) |
|:----------:|:------------:|:-----------------------:|:----------------------:|:------------------------:|
| Android 13 | Nubia Z50 | 8_Gen2-CPU | Gemma1.1-it-2B<br>q8f32 | 16 token/s |
### Phi
| OS | Device | Backend | Model | Inference (1024 Context) |
|:----------:|:------------:|:-----------------------:|:----------------------:|:------------------------:|
| Android 13 | Nubia Z50 | 8_Gen2-CPU | Phi2-2B-Orange-V2<br>q8f32 | 9.5 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | Phi2-2B-Orange-V2<br>q8f32 | 5.8 token/s |
### Llama
| OS | Device | Backend | Model | Inference (1024 Context) |
|:----------:|:------------:|:-----------------------:|:----------------------:|:------------------------:|
| Android 13 | Nubia Z50 | 8_Gen2-CPU | Llama3.2-1B-Instruct<br>q8f32 | 25 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU | Llama3.2-1B-Instruct<br>q8f32 | 16 token/s |
### InternVL
| OS | Device | Backend | Model | Inference (1024 Context) |
|:----------:|:------------:|:-----------------------:|:----------------------:|:------------------------:|
| Harmony 4 | P40 | Kirin_990_5G-CPU | Mono-2B-S1-3<br>q4f32<br>dynamic | 10.5 token/s |
## Demo Results
### Qwen2VL-2B / 1024 Context
*(demo recording not reproduced here)*
### Qwen2-1.5B / 1024 Context
*(demo recording not reproduced here)*