
# oLLM

LLM Inference for Large-Context Offline Workloads

oLLM is a lightweight Python library for large-context LLM inference, built on top of Hugging Face Transformers and PyTorch. It enables running models like [gpt-oss-20B](https://huggingface.co/openai/gpt-oss-20b), [qwen3-next-80B](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct) or [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) with 100k-token contexts on a ~$200 consumer GPU with 8 GB of VRAM. No quantization is used, only fp16/bf16 precision.
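To get a feel for why this requires offloading, here is a rough back-of-the-envelope estimate. This is an illustrative sketch, not part of oLLM: the helper functions and the full-width (non-GQA) cache assumption below are ours. In bf16 every weight and every cached K/V value takes 2 bytes, so an 8B-parameter model needs roughly 16 GB for the weights alone, and a 100k-token KV cache adds tens of GB on top, which is in line with the table below.

```python
# Back-of-the-envelope VRAM estimate (illustrative only, not part of oLLM).
# Assumes bf16 (2 bytes per value) and a full-width K/V cache of
# 2 * n_layers * hidden_size values per token (no grouped-query reduction).

def weights_gb(n_params: float, bytes_per_param: int = 2) -> float:
    return n_params * bytes_per_param / 1e9

def kv_cache_gb(n_layers: int, hidden_size: int, seq_len: int, bytes_per_value: int = 2) -> float:
    return 2 * n_layers * hidden_size * bytes_per_value * seq_len / 1e9

# Llama-3.1-8B: ~8e9 parameters, 32 layers, hidden size 4096
print(f"weights:  {weights_gb(8e9):.1f} GB")                              # ~16 GB in bf16
print(f"KV cache: {kv_cache_gb(32, 4096, 100_000):.1f} GB at 100k tokens")  # ~52 GB
```

Neither number fits in 8 GB of VRAM on its own, which is why oLLM streams weights from SSD and offloads the KV cache instead of keeping everything resident.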

## Latest updates (1.0.3) 🔥

---

### 8GB Nvidia 3060 Ti inference memory usage

| Model | Weights | Context length | KV cache | Baseline VRAM (no offload) | oLLM GPU VRAM | oLLM Disk (SSD) |
| ----- | ------- | -------------- | -------- | -------------------------- | ------------- | --------------- |
| [qwen3-next-80B](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct) | 160 GB (bf16) | 50k | 20 GB | ~190 GB | ~7.5 GB | 180 GB |
| [gpt-oss-20B](https://huggingface.co/openai/gpt-oss-20b) | 13 GB (packed bf16) | 10k | 1.4 GB | ~40 GB | ~7.3 GB | 15 GB |
| [gemma3-12B](https://huggingface.co/google/gemma-3-12b-it) | 25 GB (bf16) | 50k | 18.5 GB | ~45 GB | ~6.7 GB | 43 GB |
| [llama3-1B-chat](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) | 2 GB (bf16) | 100k | 12.6 GB | ~16 GB | ~5 GB | 15 GB |
| [llama3-3B-chat](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | 7 GB (bf16) | 100k | 34.1 GB | ~42 GB | ~5.3 GB | 42 GB |
| [llama3-8B-chat](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | 16 GB (bf16) | 100k | 52.4 GB | ~71 GB | ~6.6 GB | 69 GB |

By "Baseline" we mean typical inference without any offloading.

How do we achieve this:

- Loading layer weights from SSD directly to the GPU, one layer at a time (see the sketch below)
- Offloading the KV cache to SSD and loading it back directly to the GPU, with no quantization or PagedAttention
- Offloading layer weights to CPU if needed
- FlashAttention-2 with online softmax, so the full attention matrix is never materialized
- Chunked MLP. The intermediate up-projection can get large, so we chunk the MLP as well
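The first bullet is the core of the approach: only the layer currently being executed needs to be resident in VRAM. Below is a minimal, generic PyTorch sketch of that layer-streaming idea. It is not oLLM's actual implementation; the toy `Block` module, the `/tmp/layer_*.pt` checkpoint paths, and the loop structure are illustrative assumptions only.

```python
# Minimal sketch of layer-by-layer weight streaming (illustrative, not oLLM's code).
# Each layer's weights live on disk; only the layer currently being executed
# is resident on the GPU, so peak VRAM is roughly one layer plus activations.
import torch
import torch.nn as nn


class Block(nn.Module):
    """Toy stand-in for a transformer layer."""
    def __init__(self, d):
        super().__init__()
        self.ff = nn.Linear(d, d)

    def forward(self, x):
        return x + self.ff(x)


device = "cuda:0" if torch.cuda.is_available() else "cpu"
d_model, n_layers = 256, 4

# Pretend these per-layer checkpoints were written to SSD ahead of time (hypothetical paths).
layer_paths = []
for i in range(n_layers):
    path = f"/tmp/layer_{i}.pt"
    torch.save(Block(d_model).state_dict(), path)
    layer_paths.append(path)

x = torch.randn(1, 8, d_model, device=device)   # activations stay on the GPU
layer = Block(d_model).to(device)               # one reusable layer container in VRAM
for path in layer_paths:
    state = torch.load(path, map_location=device)  # stream this layer's weights in
    layer.load_state_dict(state)
    with torch.no_grad():
        x = layer(x)                            # run the layer; only x carries forward
    del state                                   # this layer's weights are no longer needed
print(x.shape)
```

oLLM applies the same principle to the KV cache (its `DiskCache`, shown in the example below, keeps the cache on SSD) and chunks the MLP so that no single intermediate tensor has to fit in VRAM at once.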
---

Typical use cases include:

- Analyze contracts, regulations, and compliance reports in one pass
- Summarize or extract insights from massive patient histories or medical literature
- Process very large log files or threat reports locally
- Analyze historical chats to extract the most common issues/questions users have

---

**Supported GPUs**: NVIDIA (with additional performance benefits from `kvikio` and `flash-attn`), AMD, and Apple Silicon (MacBook).

## Getting Started

It is recommended to create a venv or conda environment first:

```bash
python3 -m venv ollm_env
source ollm_env/bin/activate
```

Install oLLM with `pip install --no-build-isolation ollm` or [from source](https://github.com/Mega4alik/ollm):

```bash
git clone https://github.com/Mega4alik/ollm.git
cd ollm
pip install --no-build-isolation -e .

# for Nvidia GPUs with CUDA (optional), speeds up inference:
pip install kvikio-cu{cuda_version}   # e.g. kvikio-cu12
```

> 💡 **Note**
> **voxtral-small-24B** requires additional pip dependencies: `pip install "mistral-common[audio]"` and `pip install librosa`

Check out the [Troubleshooting](https://github.com/Mega4alik/ollm/wiki/Troubleshooting) page in case of any installation issues.

## Example

Code snippet sample:

```python
from ollm import Inference, file_get_contents, TextStreamer

o = Inference("llama3-1B-chat", device="cuda:0", logging=True)  # llama3-1B/3B/8B-chat, gpt-oss-20B, qwen3-next-80B
o.ini_model(models_dir="./models/", force_download=False)
o.offload_layers_to_cpu(layers_num=2)  # (optional) offload some layers to CPU for a speed boost

past_key_values = o.DiskCache(cache_dir="./kv_cache/")  # set to None if the context is small
text_streamer = TextStreamer(o.tokenizer, skip_prompt=True, skip_special_tokens=False)

messages = [
    {"role": "system", "content": "You are helpful AI assistant"},
    {"role": "user", "content": "List planets"},
]
input_ids = o.tokenizer.apply_chat_template(
    messages, reasoning_effort="minimal", tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(o.device)

outputs = o.model.generate(
    input_ids=input_ids, past_key_values=past_key_values, max_new_tokens=500, streamer=text_streamer
).cpu()
answer = o.tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=False)
print(answer)
```

or run the sample python script as `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python example.py`

```python
# with AutoInference, you can run any Llama3/gemma3 model with PEFT adapter support
# pip install peft
from ollm import AutoInference

o = AutoInference(
    "./models/gemma3-12B",                    # any llama3 or gemma3 model
    adapter_dir="./myadapter/checkpoint-20",  # PEFT adapter checkpoint if available
    device="cuda:0",
    multimodality=False,
    logging=True,
)
...
```

**More samples**

- [gemma3-12B image+text](https://github.com/Mega4alik/ollm/blob/main/example_image.py)
- [voxtral-small-24B audio+text](https://github.com/Mega4alik/ollm/blob/main/example_audio.py)
- [AutoInference + SFT](https://github.com/Mega4alik/peftee?tab=readme-ov-file#usage)

## Knowledge base

- [Documentation](https://github.com/Mega4alik/ollm/wiki/Documentation)
- [Community](https://github.com/Mega4alik/ollm/wiki/Community) articles, videos, blogs
- [Troubleshooting](https://github.com/Mega4alik/ollm/wiki/Troubleshooting)

## Roadmap

*For visibility of what's coming next (subject to change)*

- Qwen3-Next quantized version
- Qwen3-VL or alternative vision model
- Qwen3-Next MultiTokenPrediction in R&D

## Contact us

If there’s a model you’d like to see supported, feel free to suggest it in the [discussion](https://github.com/Mega4alik/ollm/discussions/4) and I’ll do my best to make it happen.