The Complete Guide to OSS-Licensed Local Coding LLMs (2026 Edition)
From 2024 to 2025, coding AI evolved from simple code completion tools into “coding agents” capable of understanding entire repositories and autonomously performing debugging and refactoring. Meanwhile, the security risks of sending proprietary source code to external servers and the increasing costs of API usage have become pressing concerns.
This article covers only coding-focused LLMs released under true OSS licenses (Apache 2.0 / MIT) that permit commercial use, and provides practical guidance for running them locally.
Why OSS Licensing Matters
The definition of “open source” has become increasingly blurred in the AI world. Many models are released as “open weights,” but they don’t necessarily carry true OSS licenses.
For example, these popular models have commercial use restrictions under their custom licenses:
| Model | License | Restrictions |
|---|---|---|
| Meta Llama 3.x / 4 | Llama Community License | 700M MAU cap, acceptable use restrictions |
| Codestral (Mistral) | MNPL | Paid license required for commercial use |
| CodeGemma (Google) | Gemma Terms of Use | Must agree to Google’s usage license |
| DeepSeek-Coder-V2 | DeepSeek Model License | Custom license with use-based restrictions |
There’s also the “code is OSS but weights aren’t” pattern — DeepSeek-Coder’s repository itself is labeled MIT, but model weight distribution falls under a separate Model License with usage restrictions.
This article evaluates models under Apache 2.0 and MIT licenses, aligned with the OSI (Open Source Initiative) definition. The practical differences between these two licenses:
| Aspect | Apache 2.0 | MIT |
|---|---|---|
| Commercial use | Yes | Yes |
| Modification & redistribution | Yes (LICENSE inclusion, change notice required) | Yes (copyright & permission notice required) |
| Patent license | Explicitly granted | Not explicitly stated |
| Simplicity | Somewhat lengthy | Very concise |
Model Tier List (March 2026)
Tier 1 — Frontier-Class (Matches Proprietary Models)
| Model | Developer | License | Parameters | SWE-bench Verified | Highlights |
|---|---|---|---|---|---|
| Qwen3-Coder-Next | Alibaba | Apache 2.0 | 80B total / 3B active | 70.6% | 2026’s efficiency breakthrough |
| DeepSeek-V3.2 | DeepSeek | MIT | 671B total / 37B active | 70.2% | LiveCodeBench 86%. Requires serious hardware |
| GLM-4.7 | Zhipu AI | MIT | 355B total / 32B active | 73.8% | HumanEval 94.2%, thinking mode |
| Qwen3.5-397B-A17B | Alibaba | Apache 2.0 | 397B total / 17B active | 76.4% | LiveCodeBench 83.6%. Multimodal, up to 1M context |
| Kimi K2.5 | Moonshot AI | MIT* | 1T total / 32B active | 76.8% | HumanEval 99.0% (highest among OSS) |
| MiMo-V2-Flash | Xiaomi | MIT | 309B total / 15B active | 73.4% | LiveCodeBench 87%. Remarkable efficiency |
*Kimi K2.5 is MIT-licensed but includes a 100M MAU cap on commercial use
Tier 2 — Excellent for Daily Development
| Model | Developer | License | Parameters | Key Score | Highlights |
|---|---|---|---|---|---|
| Qwen2.5-Coder-32B | Alibaba | Apache 2.0 | 32B (dense) | HumanEval ~92% | Best FIM code completion. Runs on 24GB GPU |
| gpt-oss-20b | OpenAI | Apache 2.0+policy | 20B MoE | SWE-bench 60.7% | Runs on 16GB memory |
| QwQ-32B | Alibaba | Apache 2.0 | 32B (dense) | LiveCodeBench 63.4% | Best reasoning-to-size ratio among dense models |
| DeepSeek-R1 Distills | DeepSeek | Apache 2.0 | 7B–32B | AIME 72.6% (32B) | CoT reasoning for debugging |
Tier 3 — Strong for Specific Use Cases
| Model | Developer | License | Parameters | Highlights |
|---|---|---|---|---|
| Seed-Coder-8B | ByteDance | MIT | 8B | Top performance in the 8B class |
| Ling-Coder-Lite | InclusionAI | MIT | 16.8B / 2.75B active | Low-latency IDE completion |
| Yi-Coder-9B | 01.AI | Apache 2.0 | 9B | Only sub-10B model with 128K context |
| IBM Granite 3.3-8B | IBM | Apache 2.0 | 8B | 116 languages. Enterprise-grade |
| Microsoft Phi-4 | Microsoft | MIT | 14B | Outperforms 70B models in reasoning |
Detailed Model Profiles
Qwen2.5-Coder-32B — The Local Development Workhorse
Released in late 2024, the Qwen2.5 Coder series reshaped the local LLM landscape. Under Apache 2.0, it matches GPT-4o-class coding performance.
The secret lies in training data quality and composition. Trained on 5.5 trillion tokens with a 70% code / 20% text / 10% math ratio, it achieves the critical property of being “coding-focused without losing general conversational ability.”
| Benchmark | Qwen2.5-Coder 32B | GPT-4o (Reference) |
|---|---|---|
| HumanEval | ~92% | 90.6% |
| MBPP | 91.1 (Base) | — |
| Aider (Code Repair) | 73.7 | ~73.7 |
| MultiPL-E (8-lang avg) | 79.4 | — |
| BigCodeBench Full | SOTA (OSS) | — |
The architecture uses Grouped Query Attention (GQA) and SwiGLU activation for optimized memory efficiency during inference. It supports a 128K token context window and comes in six sizes: 0.5B / 1.5B / 3B / 7B / 14B / 32B.
Qwen3-Coder-Next — The 2026 Efficiency Revolution
Released February 2026, Qwen3-Coder-Next uses a highly sparse MoE architecture: 80B total parameters with only 3B active per token. Of its 512 experts, only 10 routed experts plus 1 shared expert activate for each token.
Trained with ~800K verifiable tasks using executable RL environments, it excels at agentic coding: long-horizon planning, tool usage, and autonomous failure recovery.
| Benchmark | Score |
|---|---|
| SWE-bench Verified (SWE-Agent) | 70.6% |
| SWE-bench Verified (OpenHands) | 71.3% |
| SWE-bench Pro | 44.3 |
| Aider-Polyglot | 66.2 |
| Codeforces Elo | 2100 |
| TerminalBench 2.0 | 36.2 |
It supports a native 262K token context window and integrates with Claude Code, Qwen Code CLI, and Cline.
Qwen3.5 — The Next-Gen Multimodal × MoE Flagship
Released March 2026, Qwen3.5 is the latest flagship of the Qwen series. It introduces a novel architecture combining Gated DeltaNet and Gated Attention with MoE, achieving just 17B active parameters out of 397B total — a highly efficient design.
Its standout feature is early fusion training on trillions of multimodal tokens, enabling vision-language capabilities across all model sizes. It supports 201 languages and dialects, excelling not only at coding but also visual understanding tasks.
| Benchmark | Score |
|---|---|
| SWE-bench Verified | 76.4% |
| LiveCodeBench v6 | 83.6% |
| SWE-bench Multilingual | 69.3% |
| SecCodeBench | 68.3% |
| TerminalBench 2.0 | 52.5% |
| AIME26 | 91.3% |
It supports a native 262K token context window, extendable to ~1 million tokens via RoPE scaling. Available in 8 sizes: 397B-A17B / 122B-A10B / 35B-A3B / 27B / 9B / 4B / 2B / 0.8B, covering everything from edge devices to large-scale deployments.
gpt-oss — OpenAI’s First OSS Model
Released August 2025 under Apache 2.0 (+ usage policy), gpt-oss comes in 20B and 120B sizes. Its standout feature is strength in agent-based workflows with tool usage.
| Metric | gpt-oss-20b | gpt-oss-120b |
|---|---|---|
| SWE-bench Verified (high) | 60.7% | 62.4% |
| Codeforces Elo (no tools) | 2230 | 2463 |
| Codeforces Elo (with tools) | 2516 | 2622 |
| Aider Polyglot (high) | 34.2 | 44.4 |
Checkpoint sizes are 12.8 GiB (20b) and 60.8 GiB (120b). With MoE + MXFP4 quantization, 20b runs on 16GB memory and 120b runs on a single 80GB GPU (H100, etc.).
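As a sanity check, those checkpoint sizes imply an average bit width per weight (a back-of-the-envelope sketch; `bits_per_weight` is a hypothetical helper, not part of any tool):

```shell
# Back out average bits per weight from checkpoint size (GiB) and
# parameter count (billions): bits = GiB x 2^30 x 8 / (params x 1e9)
bits_per_weight() {
  awk -v gib="$1" -v p="$2" \
    'BEGIN { printf "%.1f\n", gib * 1024 * 1024 * 1024 * 8 / (p * 1e9) }'
}

bits_per_weight 12.8 20   # gpt-oss-20b → 5.5
```

The average lands above MXFP4's nominal ~4 bits, consistent with some tensors remaining in higher precision; exactly which layers are quantized is an assumption here, not something the sizes alone tell you.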
Note: Requires the Harmony chat format for optimal performance.
IBM Granite Code — The Enterprise Choice
IBM’s Granite Code series stands out for clear data provenance and legal cleanliness. Its training data preparation framework “data-prep-kit” is itself open-sourced, covering 116 programming languages. For organizations wanting to minimize copyright infringement risk, Granite is among the most trustworthy choices.
It’s also optimized for “application modernization” — migrating legacy systems (e.g., COBOL) to modern languages.
Microsoft Phi-4 — The Small Giant
Phi-4 (14B, MIT) embodies the philosophy that “data quality beats quantity.” Trained on “textbook-style” synthetic data generated by powerful models like GPT-4, it achieves logical reasoning that previously required tens of billions of parameters.
| Model | Parameters | HumanEval | License |
|---|---|---|---|
| Phi-4 | 14B | 82.6 | MIT |
| Qwen 2.5 | 14.7B | 72.1 | Apache 2.0 |
| Llama-3.3 | 70B | 78.9 | Llama (non-OSS) |
A 14B model outperforming the 70B Llama-3.3 is remarkable. It supports 128K context, and the latest Phi-4 Multimodal handles images, audio, and text in a single checkpoint.
Hardware Requirements and Quantization Guide
Quantization Basics
Quantization converts model weights to lower bit precision to reduce size. Q4_K_M (4-bit quantization) is the community standard: minimal quality loss while cutting model size roughly 4× versus FP16.
| Quantization Type | Quality Retention | Best For |
|---|---|---|
| Q8_0 (8-bit) | Very high | When maximum precision is needed |
| Q4_K_M (4-bit) | High | General coding use — the sweet spot |
| IQ2_XXS (2-bit) | Low | Testing or ultra-low-spec environments |
A general rule: a larger model at Q4 outperforms a smaller model at Q8.
VRAM Budget Guide
| VRAM Budget | Best Coding Models | GPU Examples |
|---|---|---|
| 4–8 GB | Qwen2.5-Coder-7B (Q4: ~5GB), Yi-Coder-9B | RTX 3060/4060 8GB |
| 12–16 GB | Qwen2.5-Coder-14B (Q4: ~9GB), Phi-4, gpt-oss-20b | RTX 4060 Ti 16GB |
| 24 GB | Qwen2.5-Coder-32B (Q5: ~22GB), DeepSeek-R1-Distill-32B (Q4: ~20GB) | RTX 3090/4090 — the sweet spot |
| 48–64 GB | Qwen3-Coder-Next (Q4: ~46GB), GLM-4.7 | Mac M-series 64GB+, 2× RTX 4090 |
| 128–512 GB | DeepSeek-V3.2, Qwen3.5-397B-A17B, gpt-oss-120b | Mac Studio M3 Ultra 512GB, multi-H100 |
VRAM estimation formula: VRAM (GB) ≈ (Parameters in billions × Bits per weight) / 8 + KV cache overhead + ~1GB
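Plugged into a quick script, the formula looks like this (a sketch; `estimate_vram` is a hypothetical helper, and the KV-cache figure is a rough placeholder you should measure for your own context lengths):

```shell
# Hypothetical helper applying the estimation formula above:
# VRAM (GB) ≈ (params in billions x bits per weight) / 8 + KV cache + ~1 GB
estimate_vram() {
  awk -v p="$1" -v b="$2" -v kv="$3" \
    'BEGIN { printf "%.1f GB\n", p * b / 8 + kv + 1 }'
}

estimate_vram 32 4 3   # Qwen2.5-Coder-32B at Q4 with ~3 GB KV cache → 20.0 GB
```

The result lines up with the ~20GB Q4 figures in the table above. Note that KV cache grows with context length, so long-context sessions need extra headroom.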
Apple Silicon with unified memory is a community favorite for large models; community reports describe an M3 Pro with 36GB running heavily quantized 70B models at ~15 tok/s.
Practical Setup Guide
Inference Engine Selection
| Engine | License | Commercial Use | Best For | Setup Complexity |
|---|---|---|---|---|
| Ollama | MIT | Yes | Simplest setup — one command to start | Minimal |
| LM Studio | Proprietary | Paid plan required | GUI-based model management and chat | Minimal |
| llama.cpp | MIT | Yes | Maximum customization control | Moderate |
| vLLM | Apache 2.0 | Yes | Team sharing, high throughput | Moderate–High |
| SGLang | Apache 2.0 | Yes | Large MoE models | Moderate–High |
| MLX | MIT | Yes | Apple Silicon native optimization | Low–Moderate |
Quick Start with Ollama
```shell
# Start Qwen 2.5 Coder 32B (24GB GPU recommended)
ollama pull qwen2.5-coder:32b
ollama run qwen2.5-coder:32b

# For lighter setups, use 7B
ollama pull qwen2.5-coder:7b

# Start Phi-4
ollama run phi4

# DeepSeek-R1 distill (for reasoning/debugging)
ollama pull deepseek-r1:32b
```
Recommended IDE Stack
Continue.dev (31K+ GitHub stars, Apache 2.0) is the most recommended open-source Copilot replacement. It supports VS Code and JetBrains, connects to Ollama/LM Studio, and handles both chat and tab-complete (FIM).
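A minimal Continue.dev setup wired to Ollama might look like the following (a sketch based on the classic `~/.continue/config.json` schema; the field names and file location shown here may differ in newer releases, so check the current Continue docs):

```json
{
  "models": [
    { "title": "DeepSeek-R1 32B", "provider": "ollama", "model": "deepseek-r1:32b" }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen2.5-Coder 32B",
    "provider": "ollama",
    "model": "qwen2.5-coder:32b"
  }
}
```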
For 24GB VRAM (RTX 3090/4090):
```shell
ollama pull qwen2.5-coder:32b   # FIM autocomplete (best quality)
ollama pull deepseek-r1:32b     # Chat-based reasoning/debugging
```
Plus Continue.dev for VS Code and Aider for terminal work.
For 8GB VRAM:
```shell
ollama pull qwen2.5-coder:7b   # FIM autocomplete
ollama pull qwen3:8b           # Chat/debugging
```
Why Fill-in-the-Middle (FIM) Matters
FIM reads the context both before and after the cursor to insert the most appropriate code in between. Qwen2.5-Coder includes extensive FIM training data.
Internally, inputs are converted to <PRE> {prefix} <SUF> {suffix} <MID> format, with the model generating from <MID> onward — producing far more accurate completions than traditional “continue writing” models.
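The transformation can be sketched as a tiny prompt builder, using the generic token names above (real models define their own special tokens; Qwen2.5-Coder, for instance, uses `<|fim_prefix|>` / `<|fim_suffix|>` / `<|fim_middle|>`):

```shell
# Sketch: build a FIM prompt in the generic format described above.
build_fim_prompt() {
  printf '<PRE> %s <SUF> %s <MID>' "$1" "$2"
}

build_fim_prompt 'def add(a, b):' 'print(add(1, 2))'
# → <PRE> def add(a, b): <SUF> print(add(1, 2)) <MID>
```

A raw prompt like this can be sent straight to a local server (for example, Ollama's `/api/generate` endpoint with `"raw": true`) so the chat template doesn't rewrap it; the model then generates the middle span.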
Production Considerations
The Benchmark-Reality Gap
Local models match GPT-4o on HumanEval and MBPP, but SWE-bench Verified (which tests real GitHub issue resolution) reveals a persistent gap versus top proprietary models like Claude Opus 4.5 (80.9%) and GPT-5.2 (80.0%).
The practical community consensus:
Use your local model for 80% of daily work; switch to cloud for the 20% that requires frontier reasoning. — Practical consensus from r/LocalLLaMA
Common pain points include hallucinated APIs, quality degradation at context window limits, and weaker performance on niche languages and frameworks.
Security and Data Privacy
“Local execution ≠ secure.” These risks still require management:
- Sensitive data in prompts/logs: Source code may persist in I/O logs
- Supply chain for dependencies: Tamper protection for model weights (hash verification, internal storage)
- Generated code quality assurance: Automated testing, static analysis, and review are essential
- Model server access control: Network-level access management
IBM Granite’s model card explicitly warns about over-reliance on generated code. Rather than merging LLM output directly, run lint / type checks / tests mechanically and minimize diffs before merging.
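That review loop can be wired up as a simple gate script (a sketch; `run_checks` is a hypothetical helper, and the linter, type checker, and test runner are placeholders for whatever your project actually uses):

```shell
# Hypothetical pre-merge gate for LLM-generated diffs: run each
# mechanical check in order and stop at the first failure.
run_checks() {
  for cmd in "$@"; do
    $cmd || { echo "FAIL: $cmd"; return 1; }
  done
  echo "all checks passed"
}

# Example invocation (placeholder tools):
#   run_checks "ruff check ." "mypy ." "pytest -q"
```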
License Compliance in Practice
| License | Minimum Operational Requirements |
|---|---|
| Apache 2.0 | Include LICENSE file, preserve NOTICE attributions, provide change notice |
| MIT | Include copyright and permission notices |
Some models like gpt-oss append additional usage policies alongside Apache 2.0 — verify with your legal team before adoption.
Future Outlook
1. Reasoning-at-Inference Goes Mainstream
Chain-of-thought reasoning (as seen in DeepSeek-R1 and OpenAI’s o-series) is being applied to coding, significantly reducing logical errors in algorithm generation.
2. Small Model Ensembles (Multi-Agent)
Rather than relying on a single massive LLM, role-based specialization is becoming the norm:
- Phi-4 Mini (ultra-fast) → Code completion
- Qwen2.5-Coder-32B → Refactoring
- IBM Granite → Documentation and legal checks
3. MoE Architecture Dominance
Every Tier 1 model uses Mixture of Experts, maximizing performance per active parameter. Qwen3-Coder-Next achieving 70%+ on SWE-bench with 3B active parameters, and Qwen3.5 reaching 76%+ with 17B active parameters, symbolize this paradigm shift.
Summary: Best Models by Use Case
| Use Case | Recommended Model | Why |
|---|---|---|
| Best code completion on 24GB GPU | Qwen2.5-Coder-32B | FIM support, Apache 2.0, ~22GB at Q5 |
| Getting started on ≤16GB | gpt-oss-20b / Seed-Coder-8B | Agent workflows on low resources |
| Agentic autonomous development | Qwen3-Coder-Next / Qwen3.5 | 70%+ SWE-bench, agent-specialized design. Qwen3.5 adds multimodal |
| Enterprise deployment | IBM Granite Code | Data transparency, 116 languages, minimal legal risk |
| Reasoning and debugging | DeepSeek-R1 Distills / Phi-4 | CoT reasoning, strong logic in small packages |
| Low-latency IDE completion | Ling-Coder-Lite | ~1.5–2× faster at equivalent performance |
OSS-licensed coding LLMs are no longer “cheap substitutes” for proprietary models. With the right combination of model and hardware, you can build a development environment that balances privacy and productivity — entirely under your control.
That wraps up the complete guide to OSS-licensed local coding LLMs (2026 edition), straight from the front lines.