Surveying Services That Serve OSS Model APIs in a Japan Region — Comparing Them on Data Residency and Coding-Agent Practicality
Coding agents that live in your terminal or IDE and read and write code autonomously have become everyday tools. The MIT-licensed OpenCode stands out with a provider-agnostic architecture that is not tied to any single LLM vendor: you just swap in the API key you prefer. Its maker, Anomaly, offers OpenCode Go, which bundles access to several OSS and open-weight models for $10/month ($5 the first month) and has earned a following for its cost efficiency.
The catch is that OpenCode Go’s API endpoints, per the official docs, sit in the US, EU, and Singapore, with no Japan region. The closest is Singapore. Once you factor in data-protection compliance (Japan’s APPI, for example), organizational data governance, and the network latency that compounds across an autonomous agent’s many turns, it becomes worth evaluating alternatives that keep inference infrastructure inside Japan.
This post compares the alternatives for calling OSS coding models of the kind OpenCode Go bundles, but “in a Japan region” and “over an OpenAI-compatible API,” from both the data-residency and the coding-agent-practicality angles. Models, regions, and prices change fast, so treat this as a snapshot based on research from mid-June 2026, and verify against each vendor’s official pages as of your own check date before adopting anything in production.
Evaluation Criteria — Three Conditions and the Data-Residency Trap
When picking an OSS model API you can call directly from a coding agent inside Japan, the conditions you want come down to three.
- Japan region / in-country processing: inference runs physically inside Japan (Tokyo or Osaka), and ideally the data never leaves the country.
- OSS / open-model support: it hosts open models like Qwen, DeepSeek, GLM, or Kimi.
- OpenAI-compatible API: swapping the base URL and API key is enough to connect from OpenCode, Cline, Cursor, and so on.
The biggest trap here is data residency. “Setting the region to Tokyo” and “the data staying inside Japan” are entirely different claims. On many managed services, pointing the endpoint at Tokyo does not stop the actual inference from being processed in a global pool in the US or elsewhere. Google Cloud’s documentation explicitly states that a regional endpoint alone does not guarantee data residency or in-region ML processing, and that the global endpoint does not satisfy data-residency requirements. Azure likewise notes that Global deployments may be processed in any of the deployed regions.
So in this post, rather than just asking “is there a Tokyo region,” I look at each service through the stricter lens of “can you call OSS models, in a managed way, with processing confined to Japan.”
Comparing Services That Can Serve OSS Models in a Japan Region
Here are the main candidates organized against those three conditions plus data governance. Legend: ◎ = clearly supported, ○ = conditional, △ = limited / caution, ✕ = not supported.
| Service | Japan region / in-country | Key OSS models | OpenAI-compatible | Data governance | Pricing feel |
|---|---|---|---|---|---|
| Sakura AI Engine | ◎ in-country DC, no training use | gpt-oss-120b, Qwen3-Coder-480B/30B, llm-jp-3.1, etc. | ◎ OpenAI / Anthropic compatible | domestic-law compliant, no training use | free tier; gpt-oss usage $0.10 in / $0.50 out per 1M tokens |
| OCI Generative AI (Osaka) | ○ hosted in Osaka | gpt-oss-120b/20b (GA), Llama 3.1/4, Grok | ◎ API keys since 2026/1 | ZDR endpoints, sovereign AI | gpt-oss-120b $0.15 in / 1M tokens, etc. |
| Google Cloud Vertex AI (Tokyo) | ○ regional processes in Tokyo (verify) | gpt-oss, Llama 4, DeepSeek, Qwen3, Gemma | △ MaaS-native, partly compatible | regional in-region, global not guaranteed | gpt-oss-120b $0.09 in / $0.36 out per 1M tokens |
| AWS Bedrock (Tokyo) | ○ In-Region in-country (no JP geo for OSS) | Llama 3.3/4, Mistral Large 3 (no gpt-oss) | ○ Bedrock-compatible layer | SOC2 / ISO, IAM / DPA | usage-based, Batch / Flex 50% off |
| Azure AI Foundry (Japan East) | △ single region only, no Japan Data Zone | Llama, gpt-oss, DeepSeek, Mistral, Qwen | ○ | Data Zone is US / EU only | usage-based / PTU |
| Fireworks AI (Tokyo, Ent.) | ○ Tokyo footprint (dedicated / BYOC) | Llama, Qwen, DeepSeek, Kimi, 400+ | ◎ | ZDR / SOC2 / HIPAA, residency controls | usage + enterprise contract |
| Cloudflare Workers AI | △ Tokyo edge, execution location not guaranteed | Llama, Mistral, Gemma, Qwen3, gpt-oss, Kimi | ○ | inference cannot be pinned to Japan | neuron usage-based |
| Domestic GPU IaaS + vLLM | ◎ in-country DC, dedicated | any (self-host Llama / Qwen / DeepSeek / gpt-oss) | ◎ self-built | maximum control (provider’s ISMS, etc.) | GPU-hour billing |
| Together / Groq / DeepInfra, etc. | ✕ US-centric, no published Japan region | Llama / Qwen / DeepSeek and many more | ◎ | residency depends on choice; Japan DC by negotiation | low-cost usage |
Let me go through them in roughly descending order of practical priority.
Sakura AI Engine — The Cleanest Way to Stay In-Country
The service that satisfies all three conditions most clearly is Sakura Internet’s Sakura AI Engine. It became generally available on September 24, 2025, built on the company’s “Koukaryoku” GPU cloud. All data processing is completed inside Japanese data centers, and input data is not used for training. Unlike overseas vendors’ APIs, this physically and legally avoids disclosure risk under things like the US CLOUD Act, which makes it well suited to building secure applications that satisfy APPI and government-agency guidelines.
The OSS models on offer include gpt-oss-120b, Qwen3-Coder-480B-A35B-Instruct-FP8, Qwen3-Coder-30B-A3B-Instruct, and llm-jp-3.1-8x13b-instruct4, plus whisper-large-v3-turbo for audio and multilingual-e5-large for embeddings. For coding-agent work, the Qwen3-Coder family fits. The API exposes OpenAI/Anthropic-compatible endpoints (/v1/chat/completions, /v1/responses, /v1/messages, /v1/embeddings, and more) at the base URL https://api.ai.sakura.ad.jp/v1, with reported working setups from Cursor and qwen-code.
The pricing is a real strength: the foundation models come with a permanent free tier (up to 3,000 chat calls, 50 audio calls, and 10,000 embedding calls per month). Usage-based pricing for gpt-oss-120b and the llm-jp family is roughly $0.10 in / $0.50 out per 1M tokens (15 yen / 75 yen at the published rates). A review write-up (note / synapse_ai) puts Sakura’s gpt-oss-120b at roughly a tenth of OpenAI’s o4-mini on a per-token basis. Being able to start validating an OpenCode or Cline connection on the free tier alone is a major advantage for individual developers.
When you later need a dedicated environment or private-network connectivity for production, you can step up to Sakura’s AI Solution (launched October 30, 2025), giving you a clean staged path.
OCI Generative AI (Osaka) — gpt-oss, Managed, in a Japan Region
If your requirement is “use gpt-oss or Llama in a managed Japan region,” one of the few hyperscalers that answers it is Oracle Cloud Infrastructure’s OCI Generative AI. Generative AI launched in Osaka (Japan Central) in December 2024, and per Oracle’s blog, OpenAI’s gpt-oss models reached general availability as hosted options in December 2025.
What stands out is the OpenAI-compatible API. According to a third-party write-up (Qiita, January 2026), OCI Generative AI API keys arrived on January 23, 2026, letting you use OpenAI OSS, Meta Llama, and xAI Grok by “just setting an API key and base URL,” with openai.gpt-oss-120b confirmed working in the Osaka region. This is one of the few cases where “Japan region × gpt-oss × OpenAI-compatible” actually holds on a hyperscaler. Pricing runs around $0.15 in per 1M tokens for gpt-oss-120b and $0.07 in / $0.30 out per 1M tokens for gpt-oss-20b, and Oracle promotes zero-data-retention (ZDR) endpoints and sovereign AI options.
That said, per-region model availability shifts, so confirm Osaka availability on Oracle’s “Models by Region” page before adopting. The Osaka gpt-oss confirmation rests on third-party reporting, so official SLAs and pricing are worth verifying before you sign.
Google Cloud Vertex AI (Tokyo) — Strong Evidence, but Read the Residency Terms
Google Cloud’s Vertex AI Model Garden organizes open-model delivery into four layers (MaaS, self-deploy, prebuilt containers, and custom vLLM), so you can choose managed or self-managed on the same platform. The official open-model locations list includes Tokyo (asia-northeast1), with endpoints for gpt-oss, Llama 4, DeepSeek, Qwen3, Gemma, and others. Pricing is spelled out per model, for example gpt-oss-120b at $0.09 in / $0.36 out per 1M tokens.
The residency reading takes care, though. With a regional endpoint, ML processing happens in the chosen region and at-rest data stays in the selected location, but the global endpoint does not satisfy data-residency requirements. On top of that, the scope of residency guarantees for open and partner models has shifted over time, and at points only the Gemini family was covered. Using OSS models in a managed way in Tokyo has become realistic, but if strict in-country processing is a requirement, explicitly use the regional endpoint and confirm the residency guarantee for your specific models before you sign.
AWS Bedrock (Tokyo) — Llama / Mistral In-Region, but No gpt-oss
For organizations already on AWS, Amazon Bedrock in Tokyo (ap-northeast-1) is a practical answer. Meta Llama (Llama 3.3 70B, Llama 4 Scout / Maverick) and Mistral (Mistral Large 3 and others) are available in Tokyo, and IAM integration plus AWS’s DPA and compliance carry over directly. There is an OpenAI-compatible endpoint, and pricing is usage-based with 50% off for Flex / Batch and a premium for Priority.
There are two important constraints, though. First, gpt-oss-120b/20b is not offered on Bedrock in Tokyo; Bedrock’s gpt-oss is US-centric. To run gpt-oss in Tokyo, your only path is deploying it onto your own GPUs via SageMaker JumpStart. Second, the Japan (JP) geo profile (the jp. prefix) that keeps inference confined between Tokyo and Osaka is currently Anthropic-Claude-only and is not offered for open-weight models like Llama, Mistral, or gpt-oss. To meet Japan data residency with OSS models, the reliable route is direct In-Region calls in ap-northeast-1 (for In-Region-listed models only). Avoid the Geo / Global profiles and explicitly specify In-Region.
Azure AI Foundry (Japan East) — No Japan Data Zone
Azure AI Foundry offers a rich set of OSS models (Llama, gpt-oss, DeepSeek, Mistral, Qwen) in Japan East. But the data-residency Data Zone covers only the US and EU; there is no Japan Data Zone. If you need Japan data residency, you must pick Japan East with a single-region (Regional / Standard) deployment, yet open-weight models are often limited to deployment forms like Global Standard, and Microsoft explicitly states that Global may be processed in any geography. In-country confinement for OSS models in a Japan region needs per-model verification and is, honestly, limited.
For what it’s worth, April 2026 brought a three-way collaboration between Microsoft, Sakura Internet, and SoftBank, opening a path toward hybrid setups where domestic companies pull their own physical GPU resources into the Azure environment for open-model inference.
Fireworks AI / Together AI — Global Specialists’ Japan Footprint and Latency
The global inference specialists Fireworks AI and Together AI are appealing for low latency and broad model coverage. Fireworks AI has physical GPU facilities in Tokyo (AP_TOKYO_1 / AP_TOKYO_2) and has demonstrated high decode throughput on long-context, reasoning models like Kimi K2.5. That said, third-party technical comparisons note that Fireworks’s serverless runs in US regions, and non-US footprints, Tokyo included, take the form of dedicated / on-demand / BYOC. Residency controls and ZDR are enterprise features, so if a Japan DC is mandatory, plan to confirm it under an enterprise contract.
Together AI’s strengths are 200+ open models and low cost via batch processing, but here too the public information says serverless inference is US-centric, with no advertised in-region serverless for APAC. Groq, DeepInfra, Novita, and OpenRouter are similar. If data residency is a hard requirement, domestic players are the realistic answer for now; if you can tolerate US processing, these win on cost and speed.
Cloudflare Workers AI and Domestic GPU IaaS
Cloudflare Workers AI offers serverless inference across a global edge that includes Tokyo, with a rich set of OSS models (gpt-oss, Llama, Qwen3, Kimi) and an OpenAI-compatible API. It is unmatched for convenience, but which edge location (country) actually runs the inference is not guaranteed, and there is no mechanism to pin inference to Tokyo for residency purposes. If residency is a requirement, you should assume it likely does not fit.
To maximize control, you can self-build an OpenAI-compatible endpoint with vLLM on a domestic GPU provider’s IaaS. Sakura’s Koukaryoku, GMO’s GPU cloud, and Highreso’s GPUSOROBAN provide GPUs in Japanese data centers, letting you run any OSS model (Llama, Qwen, DeepSeek, gpt-oss) in-country. The operational burden goes up, but model choice and data control are maximized.
Why Latency Matters for Autonomous Agents
Alongside data residency, the other reason to pick a Japan region is latency. Unlike a one-shot chat, a coding agent’s work repeats many steps: mapping the directory structure, reading files, generating diffs, running tests, and re-editing. It is not unusual for a single task to span dozens of inference turns.
Total task latency is the sum, across all turns, of network RTT, time to first byte (TTFB), generation time, and tool-execution time. The thing that bites is that network RTT gets multiplied by the number of turns. When the endpoint is in the US or Europe, geographic distance adds 200–400 ms or more per round trip, and multiplied by the turn count this can degrade a whole task by tens of seconds to minutes. Use a Tokyo region and RTT shrinks to single-digit milliseconds, so the agent starts responding almost instantly. In work where a human and an AI write code together, this difference dominates perceived productivity.
A Recommended Strategy — Working Backward from Requirements
Here is the whole picture distilled into a requirement-driven selection strategy.
- Organizations that prioritize in-country processing and privacy: Sakura AI Engine
- It satisfies APPI compliance, in-country processing, no-training-use, and OpenAI compatibility most clearly, and you can PoC immediately on the free tier. Step up to Sakura’s AI Solution when production needs dedicated or private-network setups.
- Want gpt-oss / Llama managed in a Japan region: OCI Generative AI Osaka
- The OpenAI-compatible API keys (since January 2026) let you use Osaka-hosted gpt-oss by swapping the base URL. Re-confirm Osaka availability before adopting.
- Want a managed MaaS with strong evidence: Google Cloud Vertex AI Tokyo
- Tokyo open-model endpoints are officially listed, with deployment forms and per-model pricing all in place. Use the regional endpoint explicitly and confirm the residency scope before signing.
- Already locked into AWS / Azure: AWS Bedrock Tokyo In-Region (Llama / Mistral)
- If gpt-oss is required, Tokyo Bedrock cannot do it, so self-deploy via SageMaker JumpStart, or go to Sakura / OCI. Note that Azure has no Japan Data Zone.
- A global specialist with a mandatory Japan DC: Fireworks AI enterprise (Tokyo dedicated / BYOC)
- If you can tolerate US processing, Together / Groq and others win on cost and speed.
- Maximize model choice and data control: domestic GPU IaaS (Sakura Koukaryoku / GMO / Highreso) + vLLM
- Heavy to operate, but control is maximal.
The conditions that split the decision are simple.
- If gpt-oss is required and you want it managed, the options narrow to Sakura AI Engine or OCI Osaka. Tokyo Bedrock cannot serve gpt-oss.
- If you strictly need a managed service that confines inference to Tokyo and Osaka, Bedrock’s in-country profile (the “JP geo” that keeps processing between Tokyo and Osaka) is currently Anthropic-Claude-only, so for OSS models it’s an In-Region single region, a domestic provider, or OCI Osaka.
- If you can tolerate US processing and prioritize global low cost and speed, the candidates are Together, DeepInfra, and Groq.
Conclusion
To sum up:
- OpenCode Go is a cheap, capable bundle, but its regions are US / EU / Singapore only, with no Japan region.
- “Region = Tokyo” and “data stays in Japan” are different problems. With global endpoints and managed MaaS, residency is often not guaranteed.
- The service that most clearly satisfies all three conditions (Japan region × OSS models × OpenAI-compatible) is Sakura AI Engine. If you want gpt-oss managed inside Japan, OCI Osaka is one of the few options.
- For hyperscalers, having a Tokyo endpoint and offering open models In-Region are separate things. Bedrock has no gpt-oss in Tokyo, its JP geo is Claude-only, and Azure has no Japan Data Zone; you have to read each constraint individually.
- Latency compounds multiplicatively across an autonomous agent’s many turns. A Japan region’s single-digit-millisecond RTT, versus 200–400 ms to the US or Europe, makes a tens-of-seconds-to-minutes difference per task.
One last emphasis: this space sees rapid change in models, regions, and pricing, and each vendor’s “model × region” tables get rewritten on short timescales. Treat this post as a mid-June 2026 map, and always check primary sources as of your own date before adopting anything in production.
That’s all from the Gemba, where I surveyed the services that serve OSS model APIs in a Japan region from the angles of data residency and coding-agent practicality.