The Complete Guide to OSS-Licensed Local Coding LLMs (2026 Edition)

Tadashi Shigeoka ·  Wed, March 18, 2026

From 2024 to 2025, coding AI evolved from simple code completion tools into “coding agents” capable of understanding entire repositories and autonomously performing debugging and refactoring. Meanwhile, the security risks of sending proprietary source code to external servers and the increasing costs of API usage have become pressing concerns.

This article focuses exclusively on coding-focused LLMs available under OSS licenses (Apache 2.0 / MIT) that allow commercial use, providing comprehensive guidance for running them locally.

Why OSS Licensing Matters

The definition of “open source” has become increasingly blurred in the AI world. Many models are released as “open weights,” but they don’t necessarily carry true OSS licenses.

For example, these popular models have commercial use restrictions under their custom licenses:

| Model | License | Restrictions |
|---|---|---|
| Meta Llama 3.x / 4 | Llama Community License | 700M MAU cap, acceptable use restrictions |
| Codestral (Mistral) | MNPL | Paid license required for commercial use |
| CodeGemma (Google) | Gemma Terms of Use | Must agree to Google’s usage license |
| DeepSeek-Coder-V2 | DeepSeek Model License | Custom license with use-based restrictions |

There’s also the “code is OSS but weights aren’t” pattern — DeepSeek-Coder’s repository itself is labeled MIT, but model weight distribution falls under a separate Model License with usage restrictions.

This article evaluates models under Apache 2.0 and MIT licenses, aligned with the OSI (Open Source Initiative) definition. The practical differences between these two licenses:

| Aspect | Apache 2.0 | MIT |
|---|---|---|
| Commercial use | Yes | Yes |
| Modification & redistribution | Yes (LICENSE inclusion, change notice required) | Yes (copyright & permission notice required) |
| Patent license | Explicitly granted | Not explicitly stated |
| Simplicity | Somewhat lengthy | Very concise |

Model Tier List (March 2026)

Tier 1 — Frontier-Class (Matches Proprietary Models)

| Model | Developer | License | Parameters | SWE-bench Verified | Highlights |
|---|---|---|---|---|---|
| Qwen3-Coder-Next | Alibaba | Apache 2.0 | 80B total / 3B active | 70.6% | 2026’s efficiency breakthrough |
| DeepSeek-V3.2 | DeepSeek | MIT | 671B total / 37B active | 70.2% | LiveCodeBench 86%. Requires serious hardware |
| GLM-4.7 | Zhipu AI | MIT | 355B total / 32B active | 73.8% | HumanEval 94.2%, thinking mode |
| Qwen3.5-397B-A17B | Alibaba | Apache 2.0 | 397B total / 17B active | 76.4% | LiveCodeBench 83.6%. Multimodal, up to 1M context |
| Kimi K2.5 | Moonshot AI | MIT* | 1T total / 32B active | 76.8% | HumanEval 99.0% (highest among OSS) |
| MiMo-V2-Flash | Xiaomi | MIT | 309B total / 15B active | 73.4% | LiveCodeBench 87%. Remarkable efficiency |

*Kimi K2.5 is MIT-licensed but includes a 100M MAU cap on commercial use

Tier 2 — Excellent for Daily Development

| Model | Developer | License | Parameters | Key Score | Highlights |
|---|---|---|---|---|---|
| Qwen2.5-Coder-32B | Alibaba | Apache 2.0 | 32B (dense) | HumanEval ~92% | Best FIM code completion. Runs on 24GB GPU |
| gpt-oss-20b | OpenAI | Apache 2.0 + policy | 20B MoE | SWE-bench 60.7% | Runs on 16GB memory |
| QwQ-32B | Alibaba | Apache 2.0 | 32B (dense) | LiveCodeBench 63.4% | Best reasoning-to-size ratio among dense models |
| DeepSeek-R1 Distills | DeepSeek | Apache 2.0 | 7B–32B | AIME 72.6% (32B) | CoT reasoning for debugging |

Tier 3 — Strong for Specific Use Cases

| Model | Developer | License | Parameters | Highlights |
|---|---|---|---|---|
| Seed-Coder-8B | ByteDance | MIT | 8B | Top performance in the 8B class |
| Ling-Coder-Lite | InclusionAI | MIT | 16.8B / 2.75B active | Low-latency IDE completion |
| Yi-Coder-9B | 01.AI | Apache 2.0 | 9B | Only sub-10B model with 128K context |
| IBM Granite 3.3-8B | IBM | Apache 2.0 | 8B | 116 languages. Enterprise-grade |
| Microsoft Phi-4 | Microsoft | MIT | 14B | Outperforms 70B models in reasoning |

Detailed Model Profiles

Qwen2.5-Coder-32B — The Local Development Workhorse

Released in late 2024, the Qwen2.5 Coder series reshaped the local LLM landscape. Under Apache 2.0, it matches GPT-4o-class coding performance.

The secret lies in training data quality and composition. Trained on 5.5 trillion tokens with a 70% code / 20% text / 10% math ratio, it achieves the critical property of being “coding-focused without losing general conversational ability.”

| Benchmark | Qwen2.5-Coder 32B | GPT-4o (Reference) |
|---|---|---|
| HumanEval | ~92% | 90.6% |
| MBPP | 91.1 (Base) | |
| Aider (Code Repair) | 73.7 | ~73.7 |
| MultiPL-E (8-lang avg) | 79.4 | |
| BigCodeBench Full | SOTA (OSS) | |

The architecture uses Grouped Query Attention (GQA) and SwiGLU activation for optimized memory efficiency during inference. It supports a 128K token context window and comes in six sizes: 0.5B / 1.5B / 3B / 7B / 14B / 32B.

Qwen3-Coder-Next — The 2026 Efficiency Revolution

Released in February 2026, Qwen3-Coder-Next uses an aggressively sparse MoE architecture: 80B total parameters with only 3B active per token. Of its 512 experts, just 10 routed experts plus 1 shared expert fire on each token.

Trained with ~800K verifiable tasks using executable RL environments, it excels at agentic coding: long-horizon planning, tool usage, and autonomous failure recovery.

| Benchmark | Score |
|---|---|
| SWE-bench Verified (SWE-Agent) | 70.6% |
| SWE-bench Verified (OpenHands) | 71.3% |
| SWE-bench Pro | 44.3 |
| Aider-Polyglot | 66.2 |
| Codeforces Elo | 2100 |
| TerminalBench 2.0 | 36.2 |

It supports a native 262K token context window and integrates with Claude Code, Qwen Code CLI, and Cline.

Qwen3.5 — The Next-Gen Multimodal × MoE Flagship

Released March 2026, Qwen3.5 is the latest flagship of the Qwen series. It introduces a novel architecture combining Gated DeltaNet and Gated Attention with MoE, achieving just 17B active parameters out of 397B total — a highly efficient design.

Its standout feature is early fusion training on trillions of multimodal tokens, enabling vision-language capabilities across all model sizes. It supports 201 languages and dialects, excelling not only at coding but also visual understanding tasks.

| Benchmark | Score |
|---|---|
| SWE-bench Verified | 76.4% |
| LiveCodeBench v6 | 83.6% |
| SWE-bench Multilingual | 69.3% |
| SecCodeBench | 68.3% |
| TerminalBench 2.0 | 52.5% |
| AIME26 | 91.3% |

It supports a native 262K token context window, extendable to ~1 million tokens via RoPE scaling. Available in 8 sizes: 397B-A17B / 122B-A10B / 35B-A3B / 27B / 9B / 4B / 2B / 0.8B, covering everything from edge devices to large-scale deployments.

gpt-oss — OpenAI’s First OSS Model

Released August 2025 under Apache 2.0 (+ usage policy), gpt-oss comes in 20B and 120B sizes. Its standout feature is strength in agent-based workflows with tool usage.

| Metric | gpt-oss-20b | gpt-oss-120b |
|---|---|---|
| SWE-bench Verified (high) | 60.7% | 62.4% |
| Codeforces Elo (no tools) | 2230 | 2463 |
| Codeforces Elo (with tools) | 2516 | 2622 |
| Aider Polyglot (high) | 34.2 | 44.4 |

Checkpoint sizes are 12.8 GiB (20b) and 60.8 GiB (120b). With MoE + MXFP4 quantization, 20b runs on 16GB memory and 120b runs on a single 80GB GPU (H100, etc.).

Note: Requires the Harmony chat format for optimal performance.

IBM Granite Code — The Enterprise Choice

IBM’s Granite Code series stands out for clear data provenance and legal cleanliness. Its training data preparation framework “data-prep-kit” is itself open-sourced, covering 116 programming languages. For organizations wanting to minimize copyright infringement risk, Granite is among the most trustworthy choices.

It’s also optimized for “application modernization” — migrating legacy systems (e.g., COBOL) to modern languages.

Microsoft Phi-4 — The Small Giant

Phi-4 (14B, MIT) embodies the philosophy that “data quality beats quantity.” Trained on “textbook-style” synthetic data generated by powerful models like GPT-4, it achieves logical reasoning that previously required tens of billions of parameters.

| Model | Parameters | HumanEval | License |
|---|---|---|---|
| Phi-4 | 14B | 82.6 | MIT |
| Qwen 2.5 | 14.7B | 72.1 | Apache 2.0 |
| Llama-3.3 | 70B | 78.9 | Llama (non-OSS) |

A 14B model outperforming the 70B Llama-3.3 is remarkable. It supports 128K context, and the latest Phi-4 Multimodal handles images, audio, and text in a single checkpoint.

Hardware Requirements and Quantization Guide

Quantization Basics

Quantization converts model weights to lower bit precision to reduce size. Q4_K_M (4-bit quantization) is the community standard — minimal quality loss while reducing size by roughly 4×.

| Quantization Type | Quality Retention | Best For |
|---|---|---|
| Q8_0 (8-bit) | Very high | When maximum precision is needed |
| Q4_K_M (4-bit) | High | General coding use — the sweet spot |
| IQ2_XXS (2-bit) | Low | Testing or ultra-low-spec environments |

A general rule: a larger model at Q4 outperforms a smaller model at Q8.

VRAM Budget Guide

| VRAM Budget | Best Coding Models | GPU Examples |
|---|---|---|
| 4–8 GB | Qwen2.5-Coder-7B (Q4: ~5GB), Yi-Coder-9B | RTX 3060/4060 8GB |
| 12–16 GB | Qwen2.5-Coder-14B (Q4: ~9GB), Phi-4, gpt-oss-20b | RTX 4060 Ti 16GB |
| 24 GB | Qwen2.5-Coder-32B (Q5: ~22GB), DeepSeek-R1-Distill-32B (Q4: ~20GB) | RTX 3090/4090 — the sweet spot |
| 48–64 GB | Qwen3-Coder-Next (Q4: ~46GB) | Mac M-series 64GB+, 2× RTX 4090 |
| 128–512 GB | GLM-4.7, DeepSeek-V3.2, Qwen3.5-397B-A17B, gpt-oss-120b | Mac Studio M3 Ultra 512GB, multi-H100 |

VRAM estimation formula: VRAM (GB) ≈ (Parameters in billions × Bits per weight) / 8 + KV cache overhead + ~1GB

Apple Silicon with unified memory is a community favorite for large models: quantized 70B-class models fit comfortably in 48–64GB unified-memory configurations that no single consumer GPU can match.

Practical Setup Guide

Inference Engine Selection

EngineLicenseCommercial UseBest ForSetup Complexity
OllamaMITYesSimplest setup — one command to startMinimal
LM StudioProprietaryPaid plan requiredGUI-based model management and chatMinimal
llama.cppMITYesMaximum customization controlModerate
vLLMApache 2.0YesTeam sharing, high throughputModerate–High
SGLangApache 2.0YesLarge MoE modelsModerate–High
MLXMITYesApple Silicon native optimizationLow–Moderate

Quick Start with Ollama

# Start Qwen 2.5 Coder 32B (24GB GPU recommended)
ollama pull qwen2.5-coder:32b
ollama run qwen2.5-coder:32b
 
# For lighter setups, use 7B
ollama pull qwen2.5-coder:7b
 
# Start Phi-4
ollama run phi4
 
# DeepSeek-R1 distill (for reasoning/debugging)
ollama pull deepseek-r1:32b

Continue.dev (31K+ GitHub stars, Apache 2.0) is the most recommended open-source Copilot replacement. It supports VS Code and JetBrains, connects to Ollama/LM Studio, and handles both chat and tab-complete (FIM).
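Wiring Continue.dev to a local Ollama instance takes only a small config file. A minimal sketch, using Continue's config.json schema as I understand it (the file path, field names, and model choices are assumptions; newer Continue releases prefer config.yaml, so verify against the current docs):

```shell
# Write a minimal Continue.dev config that points chat and tab-completion
# at locally pulled Ollama models.
mkdir -p "$HOME/.continue"
cat > "$HOME/.continue/config.json" <<'EOF'
{
  "models": [
    { "title": "Qwen2.5-Coder 32B", "provider": "ollama", "model": "qwen2.5-coder:32b" }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen2.5-Coder 7B", "provider": "ollama", "model": "qwen2.5-coder:7b"
  }
}
EOF
```

Using a smaller model for tab completion and a larger one for chat keeps keystroke latency low without giving up chat quality.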

For 24GB VRAM (RTX 3090/4090):

ollama pull qwen2.5-coder:32b    # FIM autocomplete (best quality)
ollama pull deepseek-r1:32b      # Chat-based reasoning/debugging

Plus Continue.dev for VS Code and Aider for terminal work.

For 8GB VRAM:

ollama pull qwen2.5-coder:7b     # FIM autocomplete
ollama pull qwen3:8b             # Chat/debugging

Why Fill-in-the-Middle (FIM) Matters

FIM reads the context both before and after the cursor to insert the most appropriate code in between. Qwen2.5-Coder includes extensive FIM training data.

Internally, the editor context is serialized with special sentinel tokens (Qwen2.5-Coder uses <|fim_prefix|> {prefix} <|fim_suffix|> {suffix} <|fim_middle|>; other model families use spellings such as <PRE> / <SUF> / <MID>), with the model generating from the middle token onward. This produces far more accurate completions than traditional “continue writing” models.
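A minimal sketch of assembling such a prompt by hand. The sentinel spellings here follow Qwen2.5-Coder's published tokenizer config; other models use different sentinels, so check the model card before reusing this:

```shell
# Build a fill-in-the-middle prompt: the model is asked to generate the
# code that belongs between `prefix` and `suffix`.
prefix='def add(a, b):'
suffix='    return c'
printf '<|fim_prefix|>%s<|fim_suffix|>%s<|fim_middle|>\n' "$prefix" "$suffix"
```

In practice the serving stack (for example llama.cpp's /infill endpoint, or an IDE plugin) applies this template for you; hand-assembly is mainly useful for debugging odd completions.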

Production Considerations

The Benchmark-Reality Gap

Local models match GPT-4o on HumanEval and MBPP, but SWE-bench Verified (which tests real GitHub issue resolution) reveals a persistent gap versus top proprietary models like Claude Opus 4.5 (80.9%) and GPT-5.2 (80.0%).

The practical community consensus:

Use your local model for 80% of daily work; switch to cloud for the 20% that requires frontier reasoning. — Practical consensus from r/LocalLLaMA

Common pain points include hallucinated APIs, quality degradation at context window limits, and weaker performance on niche languages and frameworks.

Security and Data Privacy

“Local execution ≠ secure.” These risks still require management:

  • Sensitive data in prompts/logs: Source code may persist in I/O logs
  • Supply chain for dependencies: Tamper protection for model weights (hash verification, internal storage)
  • Generated code quality assurance: Automated testing, static analysis, and review are essential
  • Model server access control: Network-level access management
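The weight-tampering point above is the easiest to operationalize: pin a checksum when you first vet a model file, and verify it before every load. A minimal sketch using GNU coreutils sha256sum (the file here is a stand-in, not real weights):

```shell
# Record a pinned checksum for a vetted artifact, then verify it later.
printf 'stand-in for vetted model weights' > model.gguf
sha256sum model.gguf > model.gguf.sha256     # pin at vetting time
sha256sum -c model.gguf.sha256               # verify before serving
```

Keeping the .sha256 file in version control (or internal artifact storage) gives you a tamper check that survives re-downloads.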

IBM Granite’s model card explicitly warns about over-reliance on generated code. Rather than merging LLM output directly, run lint / type checks / tests mechanically and minimize diffs before merging.
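That review discipline can be encoded as a small gate script. The commented commands are assumptions for a Python stack (swap in your own linters and test runners); `true` stands in so the sketch runs anywhere:

```shell
#!/bin/sh
# Pre-merge gate for LLM-generated patches: every stage must exit 0,
# otherwise `set -e` aborts the script before the final message.
set -e
true   # lint        (e.g. ruff check .)
true   # type check  (e.g. mypy .)
true   # unit tests  (e.g. pytest -q)
echo "gate passed: open the PR"
```

Running this in CI on every LLM-authored branch makes "never merge unreviewed generated code" an enforced policy rather than a habit.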

License Compliance in Practice

| License | Minimum Operational Requirements |
|---|---|
| Apache 2.0 | Include LICENSE file, preserve NOTICE attributions, provide change notice |
| MIT | Include copyright and permission notices |

Some models like gpt-oss append additional usage policies alongside Apache 2.0 — verify with your legal team before adoption.

Future Outlook

1. Reasoning-at-Inference Goes Mainstream

Chain-of-thought reasoning (as seen in DeepSeek-R1 and OpenAI’s o-series) is being applied to coding, significantly reducing logical errors in algorithm generation.

2. Small Model Ensembles (Multi-Agent)

Rather than relying on a single massive LLM, role-based specialization is becoming the norm:

  • Phi-4 Mini (ultra-fast) → Code completion
  • Qwen2.5-Coder-32B → Refactoring
  • IBM Granite → Documentation and legal checks

3. MoE Architecture Dominance

Every Tier 1 model uses Mixture of Experts, maximizing performance per active parameter. Qwen3-Coder-Next achieving 70%+ on SWE-bench with 3B active parameters, and Qwen3.5 reaching 76%+ with 17B active parameters, symbolize this paradigm shift.

Summary: Best Models by Use Case

| Use Case | Recommended Model | Why |
|---|---|---|
| Best code completion on 24GB GPU | Qwen2.5-Coder-32B | FIM support, Apache 2.0, ~22GB at Q5 |
| Getting started on ≤16GB | gpt-oss-20b / Seed-Coder-8B | Agent workflows on low resources |
| Agentic autonomous development | Qwen3-Coder-Next / Qwen3.5 | 70%+ SWE-bench, agent-specialized design. Qwen3.5 adds multimodal |
| Enterprise deployment | IBM Granite Code | Data transparency, 116 languages, minimal legal risk |
| Reasoning and debugging | DeepSeek-R1 Distills / Phi-4 | CoT reasoning, strong logic in small packages |
| Low-latency IDE completion | Ling-Coder-Lite | ~1.5–2× faster at equivalent performance |

OSS-licensed coding LLMs are no longer “cheap substitutes” for proprietary models. With the right combination of model and hardware, you can build a development environment that balances privacy and productivity — entirely under your control.

That’s all for the complete guide to OSS-licensed local coding LLMs (2026 edition) — reporting from the field.