The Complete Guide to OSS-Licensed Local Coding LLMs (2026 Edition)

Tadashi Shigeoka ·  Wed, March 18, 2026

From 2024 to 2025, coding AI evolved from simple code completion tools into “coding agents” capable of understanding entire repositories and autonomously performing debugging and refactoring. Meanwhile, the security risks of sending proprietary source code to external servers and the increasing costs of API usage have become pressing concerns.

This article focuses exclusively on coding-focused LLMs available under OSS licenses (Apache 2.0 / MIT) that allow commercial use, providing comprehensive guidance for running them locally.

Why OSS Licensing Matters

The definition of “open source” has become increasingly blurred in the AI world. Many models are released as “open weights,” but they don’t necessarily carry true OSS licenses.

For example, these popular models have commercial use restrictions under their custom licenses:

| Model | License | Restrictions |
|---|---|---|
| Meta Llama 3.x / 4 | Llama Community License | 700M MAU cap, acceptable use restrictions |
| Codestral (Mistral) | MNPL | Paid license required for commercial use |
| CodeGemma (Google) | Gemma Terms of Use | Must agree to Google’s usage license |
| DeepSeek-Coder-V2 | DeepSeek Model License | Custom license with use-based restrictions |

There’s also the “code is OSS but weights aren’t” pattern — DeepSeek-Coder’s repository itself is labeled MIT, but model weight distribution falls under a separate Model License with usage restrictions.

This article evaluates models under Apache 2.0 and MIT licenses, aligned with the OSI (Open Source Initiative) definition. The practical differences between these two licenses:

| Aspect | Apache 2.0 | MIT |
|---|---|---|
| Commercial use | Yes | Yes |
| Modification & redistribution | Yes (LICENSE inclusion, change notice required) | Yes (copyright & permission notice required) |
| Patent license | Explicitly granted | Not explicitly stated |
| Simplicity | Somewhat lengthy | Very concise |

Model Tier List (March 2026)

Tier 1 — Frontier-Class (Matches Proprietary Models)

| Model | Developer | License | Parameters | SWE-bench Verified | Highlights |
|---|---|---|---|---|---|
| Qwen3-Coder-Next | Alibaba | Apache 2.0 | 80B total / 3B active | 70.6% | 2026’s efficiency breakthrough |
| DeepSeek-V3.2 | DeepSeek | MIT | 671B total / 37B active | 70.2% | LiveCodeBench 86%. Requires serious hardware |
| GLM-4.7 | Zhipu AI | MIT | 355B total / 32B active | 73.8% | HumanEval 94.2%, thinking mode |
| Qwen3.5-397B-A17B | Alibaba | Apache 2.0 | 397B total / 17B active | 76.4% | LiveCodeBench 83.6%. Multimodal, up to 1M context |
| Kimi K2.5 | Moonshot AI | MIT* | 1T total / 32B active | 76.8% | HumanEval 99.0% (highest among OSS) |
| MiMo-V2-Flash | Xiaomi | MIT | 309B total / 15B active | 73.4% | LiveCodeBench 87%. Remarkable efficiency |

*Kimi K2.5 is MIT-licensed but includes a 100M MAU cap on commercial use

Tier 2 — Excellent for Daily Development

| Model | Developer | License | Parameters | Key Score | Highlights |
|---|---|---|---|---|---|
| Qwen2.5-Coder-32B | Alibaba | Apache 2.0 | 32B (dense) | HumanEval ~92% | Best FIM code completion. Runs on 24GB GPU |
| gpt-oss-20b | OpenAI | Apache 2.0 + policy | 20B MoE | SWE-bench 60.7% | Runs on 16GB memory |
| QwQ-32B | Alibaba | Apache 2.0 | 32B (dense) | LiveCodeBench 63.4% | Best reasoning-to-size ratio among dense models |
| DeepSeek-R1 Distills | DeepSeek | Apache 2.0 | 7B–32B | AIME 72.6% (32B) | CoT reasoning for debugging |

Tier 3 — Strong for Specific Use Cases

| Model | Developer | License | Parameters | Highlights |
|---|---|---|---|---|
| Seed-Coder-8B | ByteDance | MIT | 8B | Top performance in the 8B class |
| Ling-Coder-Lite | InclusionAI | MIT | 16.8B / 2.75B active | Low-latency IDE completion |
| Yi-Coder-9B | 01.AI | Apache 2.0 | 9B | Only sub-10B model with 128K context |
| IBM Granite 3.3-8B | IBM | Apache 2.0 | 8B | 116 languages. Enterprise-grade |
| Microsoft Phi-4 | Microsoft | MIT | 14B | Outperforms 70B models in reasoning |

Detailed Model Profiles

Qwen2.5-Coder-32B — The Local Development Workhorse

Released in late 2024, the Qwen2.5 Coder series reshaped the local LLM landscape. Under Apache 2.0, it matches GPT-4o-class coding performance.

The secret lies in training data quality and composition. Trained on 5.5 trillion tokens with a 70% code / 20% text / 10% math ratio, it achieves the critical property of being “coding-focused without losing general conversational ability.”

| Benchmark | Qwen2.5-Coder 32B | GPT-4o (Reference) |
|---|---|---|
| HumanEval | ~92% | 90.6% |
| MBPP | 91.1 (Base) | |
| Aider (Code Repair) | 73.7 | ~73.7 |
| MultiPL-E (8-lang avg) | 79.4 | |
| BigCodeBench Full | SOTA (OSS) | |

The architecture uses Grouped Query Attention (GQA) and SwiGLU activation for optimized memory efficiency during inference. It supports a 128K token context window and comes in six sizes: 0.5B / 1.5B / 3B / 7B / 14B / 32B.

Qwen3-Coder-Next — The 2026 Efficiency Revolution

Released in February 2026, Qwen3-Coder-Next uses an aggressively sparse MoE architecture: 80B total parameters with only 3B active per token. Of its 512 experts, just 10 routed experts plus 1 shared expert fire on each token.

Trained with ~800K verifiable tasks using executable RL environments, it excels at agentic coding: long-horizon planning, tool usage, and autonomous failure recovery.

| Benchmark | Score |
|---|---|
| SWE-bench Verified (SWE-Agent) | 70.6% |
| SWE-bench Verified (OpenHands) | 71.3% |
| SWE-bench Pro | 44.3 |
| Aider-Polyglot | 66.2 |
| Codeforces Elo | 2100 |
| TerminalBench 2.0 | 36.2 |

It supports a native 262K token context window and integrates with Claude Code, Qwen Code CLI, and Cline.

Qwen3.5 — The Next-Gen Multimodal × MoE Flagship

Released March 2026, Qwen3.5 is the latest flagship of the Qwen series. It introduces a novel architecture combining Gated DeltaNet and Gated Attention with MoE, achieving just 17B active parameters out of 397B total — a highly efficient design.

Its standout feature is early fusion training on trillions of multimodal tokens, enabling vision-language capabilities across all model sizes. It supports 201 languages and dialects, excelling not only at coding but also visual understanding tasks.

| Benchmark | Score |
|---|---|
| SWE-bench Verified | 76.4% |
| LiveCodeBench v6 | 83.6% |
| SWE-bench Multilingual | 69.3% |
| SecCodeBench | 68.3% |
| TerminalBench 2.0 | 52.5% |
| AIME26 | 91.3% |

It supports a native 262K token context window, extendable to ~1 million tokens via RoPE scaling. Available in 8 sizes: 397B-A17B / 122B-A10B / 35B-A3B / 27B / 9B / 4B / 2B / 0.8B, covering everything from edge devices to large-scale deployments.

gpt-oss — OpenAI’s First OSS Model

Released August 2025 under Apache 2.0 (+ usage policy), gpt-oss comes in 20B and 120B sizes. Its standout feature is strength in agent-based workflows with tool usage.

| Metric | gpt-oss-20b | gpt-oss-120b |
|---|---|---|
| SWE-bench Verified (high) | 60.7% | 62.4% |
| Codeforces Elo (no tools) | 2230 | 2463 |
| Codeforces Elo (with tools) | 2516 | 2622 |
| Aider Polyglot (high) | 34.2 | 44.4 |

Checkpoint sizes are 12.8 GiB (20b) and 60.8 GiB (120b). With MoE + MXFP4 quantization, 20b runs on 16GB memory and 120b runs on a single 80GB GPU (H100, etc.).

Note: Requires the Harmony chat format for optimal performance.

IBM Granite Code — The Enterprise Choice

IBM’s Granite Code series stands out for clear data provenance and legal cleanliness. Its training data preparation framework “data-prep-kit” is itself open-sourced, covering 116 programming languages. For organizations wanting to minimize copyright infringement risk, Granite is among the most trustworthy choices.

It’s also optimized for “application modernization” — migrating legacy systems (e.g., COBOL) to modern languages.

Microsoft Phi-4 — The Small Giant

Phi-4 (14B, MIT) embodies the philosophy that “data quality beats quantity.” Trained on “textbook-style” synthetic data generated by powerful models like GPT-4, it achieves logical reasoning that previously required tens of billions of parameters.

| Model | Parameters | HumanEval | License |
|---|---|---|---|
| Phi-4 | 14B | 82.6 | MIT |
| Qwen 2.5 | 14.7B | 72.1 | Apache 2.0 |
| Llama-3.3 | 70B | 78.9 | Llama (non-OSS) |

A 14B model outperforming the 70B Llama-3.3 is remarkable. It supports 128K context, and the latest Phi-4 Multimodal handles images, audio, and text in a single checkpoint.

Hardware Requirements and Quantization Guide

Quantization Basics

Quantization converts model weights to lower bit precision to reduce size. Q4_K_M (4-bit quantization) is the community standard — minimal quality loss while reducing size by roughly 4×.

| Quantization Type | Quality Retention | Best For |
|---|---|---|
| Q8_0 (8-bit) | Very high | When maximum precision is needed |
| Q4_K_M (4-bit) | High | General coding use — the sweet spot |
| IQ2_XXS (2-bit) | Low | Testing or ultra-low-spec environments |

A general rule: a larger model at Q4 outperforms a smaller model at Q8.

VRAM Budget Guide

| VRAM Budget | Best Coding Models | GPU Examples |
|---|---|---|
| 4–8 GB | Qwen2.5-Coder-7B (Q4: ~5GB), Yi-Coder-9B | RTX 3060/4060 8GB |
| 12–16 GB | Qwen2.5-Coder-14B (Q4: ~9GB), Phi-4, gpt-oss-20b | RTX 4060 Ti 16GB |
| 24 GB | Qwen2.5-Coder-32B (Q5: ~22GB), DeepSeek-R1-Distill-32B (Q4: ~20GB) | RTX 3090/4090 — the sweet spot |
| 48–64 GB | Qwen3-Coder-Next (Q4: ~46GB) | Mac M-series 64GB+, 2× RTX 4090 |
| 128–512 GB | GLM-4.7, DeepSeek-V3.2, Qwen3.5-397B-A17B, gpt-oss-120b | Mac Studio M3 Ultra 512GB, multi-H100 |

VRAM estimation formula: VRAM (GB) ≈ (Parameters in billions × Bits per weight) / 8 + KV cache overhead + ~1GB

Apple Silicon with unified memory is a community favorite for large models: quantized 70B-class models fit comfortably in 48–64GB unified-memory configurations that no single consumer GPU can match.

Practical Setup Guide

Inference Engine Selection

EngineLicenseCommercial UseBest ForSetup Complexity
OllamaMITYesSimplest setup — one command to startMinimal
LM StudioProprietaryPaid plan requiredGUI-based model management and chatMinimal
llama.cppMITYesMaximum customization controlModerate
vLLMApache 2.0YesTeam sharing, high throughputModerate–High
SGLangApache 2.0YesLarge MoE modelsModerate–High
MLXMITYesApple Silicon native optimizationLow–Moderate

Quick Start with Ollama

# Start Qwen 2.5 Coder 32B (24GB GPU recommended)
ollama pull qwen2.5-coder:32b
ollama run qwen2.5-coder:32b
 
# For lighter setups, use 7B
ollama pull qwen2.5-coder:7b
 
# Start Phi-4
ollama run phi4
 
# DeepSeek-R1 distill (for reasoning/debugging)
ollama pull deepseek-r1:32b

Continue.dev (31K+ GitHub stars, Apache 2.0) is the most recommended open-source Copilot replacement. It supports VS Code and JetBrains, connects to Ollama/LM Studio, and handles both chat and tab-complete (FIM).
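Wiring Continue.dev to a local Ollama instance takes only a small config file. A minimal sketch, using Continue's config.json schema as I understand it (the file path, field names, and model choices are assumptions; newer Continue releases prefer config.yaml, so verify against the current docs):

```shell
# Write a minimal Continue.dev config that points chat and tab-completion
# at locally pulled Ollama models.
mkdir -p "$HOME/.continue"
cat > "$HOME/.continue/config.json" <<'EOF'
{
  "models": [
    { "title": "Qwen2.5-Coder 32B", "provider": "ollama", "model": "qwen2.5-coder:32b" }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen2.5-Coder 7B", "provider": "ollama", "model": "qwen2.5-coder:7b"
  }
}
EOF
```

Using a smaller model for tab completion and a larger one for chat keeps keystroke latency low without giving up chat quality.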

For 24GB VRAM (RTX 3090/4090):

ollama pull qwen2.5-coder:32b    # FIM autocomplete (best quality)
ollama pull deepseek-r1:32b      # Chat-based reasoning/debugging

Plus Continue.dev for VS Code and Aider for terminal work.

For 8GB VRAM:

ollama pull qwen2.5-coder:7b     # FIM autocomplete
ollama pull qwen3:8b             # Chat/debugging

Why Fill-in-the-Middle (FIM) Matters

FIM reads the context both before and after the cursor to insert the most appropriate code in between. Qwen2.5-Coder includes extensive FIM training data.

Internally, the editor context is serialized with special sentinel tokens (Qwen2.5-Coder uses <|fim_prefix|> {prefix} <|fim_suffix|> {suffix} <|fim_middle|>; other model families use spellings such as <PRE> / <SUF> / <MID>), with the model generating from the middle token onward. This produces far more accurate completions than traditional “continue writing” models.
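A minimal sketch of assembling such a prompt by hand. The sentinel spellings here follow Qwen2.5-Coder's published tokenizer config; other models use different sentinels, so check the model card before reusing this:

```shell
# Build a fill-in-the-middle prompt: the model is asked to generate the
# code that belongs between `prefix` and `suffix`.
prefix='def add(a, b):'
suffix='    return c'
printf '<|fim_prefix|>%s<|fim_suffix|>%s<|fim_middle|>\n' "$prefix" "$suffix"
```

In practice the serving stack (for example llama.cpp's /infill endpoint, or an IDE plugin) applies this template for you; hand-assembly is mainly useful for debugging odd completions.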

Production Considerations

The Benchmark-Reality Gap

Local models match GPT-4o on HumanEval and MBPP, but SWE-bench Verified (which tests real GitHub issue resolution) reveals a persistent gap versus top proprietary models like Claude Opus 4.5 (80.9%) and GPT-5.2 (80.0%).

The practical community consensus:

Use your local model for 80% of daily work; switch to cloud for the 20% that requires frontier reasoning. — Practical consensus from r/LocalLLaMA

Common pain points include hallucinated APIs, quality degradation at context window limits, and weaker performance on niche languages and frameworks.

Security and Data Privacy

“Local execution ≠ secure.” These risks still require management:

  • Sensitive data in prompts/logs: Source code may persist in I/O logs
  • Supply chain for dependencies: Tamper protection for model weights (hash verification, internal storage)
  • Generated code quality assurance: Automated testing, static analysis, and review are essential
  • Model server access control: Network-level access management
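The weight-tampering point above is the easiest to operationalize: pin a checksum when you first vet a model file, and verify it before every load. A minimal sketch using GNU coreutils sha256sum (the file here is a stand-in, not real weights):

```shell
# Record a pinned checksum for a vetted artifact, then verify it later.
printf 'stand-in for vetted model weights' > model.gguf
sha256sum model.gguf > model.gguf.sha256     # pin at vetting time
sha256sum -c model.gguf.sha256               # verify before serving
```

Keeping the .sha256 file in version control (or internal artifact storage) gives you a tamper check that survives re-downloads.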

IBM Granite’s model card explicitly warns about over-reliance on generated code. Rather than merging LLM output directly, run lint / type checks / tests mechanically and minimize diffs before merging.
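That review discipline can be encoded as a small gate script. The commented commands are assumptions for a Python stack (swap in your own linters and test runners); `true` stands in so the sketch runs anywhere:

```shell
#!/bin/sh
# Pre-merge gate for LLM-generated patches: every stage must exit 0,
# otherwise `set -e` aborts the script before the final message.
set -e
true   # lint        (e.g. ruff check .)
true   # type check  (e.g. mypy .)
true   # unit tests  (e.g. pytest -q)
echo "gate passed: open the PR"
```

Running this in CI on every LLM-authored branch makes "never merge unreviewed generated code" an enforced policy rather than a habit.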

License Compliance in Practice

| License | Minimum Operational Requirements |
|---|---|
| Apache 2.0 | Include LICENSE file, preserve NOTICE attributions, provide change notice |
| MIT | Include copyright and permission notices |

Some models like gpt-oss append additional usage policies alongside Apache 2.0 — verify with your legal team before adoption.

Future Outlook

1. Reasoning-at-Inference Goes Mainstream

Chain-of-thought reasoning (as seen in DeepSeek-R1 and OpenAI’s o-series) is being applied to coding, significantly reducing logical errors in algorithm generation.

2. Small Model Ensembles (Multi-Agent)

Rather than relying on a single massive LLM, role-based specialization is becoming the norm:

  • Phi-4 Mini (ultra-fast) → Code completion
  • Qwen2.5-Coder-32B → Refactoring
  • IBM Granite → Documentation and legal checks

3. MoE Architecture Dominance

Every Tier 1 model uses Mixture of Experts, maximizing performance per active parameter. Qwen3-Coder-Next achieving 70%+ on SWE-bench with 3B active parameters, and Qwen3.5 reaching 76%+ with 17B active parameters, symbolize this paradigm shift.

Summary: Best Models by Use Case

| Use Case | Recommended Model | Why |
|---|---|---|
| Best code completion on 24GB GPU | Qwen2.5-Coder-32B | FIM support, Apache 2.0, ~22GB at Q5 |
| Getting started on ≤16GB | gpt-oss-20b / Seed-Coder-8B | Agent workflows on low resources |
| Agentic autonomous development | Qwen3-Coder-Next / Qwen3.5 | 70%+ SWE-bench, agent-specialized design. Qwen3.5 adds multimodal |
| Enterprise deployment | IBM Granite Code | Data transparency, 116 languages, minimal legal risk |
| Reasoning and debugging | DeepSeek-R1 Distills / Phi-4 | CoT reasoning, strong logic in small packages |
| Low-latency IDE completion | Ling-Coder-Lite | ~1.5–2× faster at equivalent performance |

OSS-licensed coding LLMs are no longer “cheap substitutes” for proprietary models. With the right combination of model and hardware, you can build a development environment that balances privacy and productivity — entirely under your control.

That’s all for the complete guide to OSS-licensed local coding LLMs (2026 edition) — reporting from the field.