NeuralWatt: AI Inference Priced in Kilowatt-Hours — How It Works and What It Costs
NeuralWatt is a small startup that sells access to AI models — Qwen, Kimi, GLM, MiniMax, Mistral — but bills in kilowatt-hours instead of tokens. A basic subscription costs $20/month for 6 kWh of compute energy, with a flat rate of $5/kWh across all 14 available models. The pitch: your bill reflects real GPU power consumed, not arbitrary per-token markups. Here is how the model works, where it actually saves money, and what it signals about the future of AI pricing.
Why Kilowatt-Hours Instead of Tokens
The standard AI pricing model charges per million tokens — input and output separately, with different rates per model. This creates opacity: a “cheap” model might consume more GPU time per token than an expensive one, and users have no way to verify. NeuralWatt flips this by measuring the actual electrical energy a GPU uses to process a request and billing at a single flat rate of $5.00/kWh.
The logic is straightforward. GPU compute is ultimately constrained by power. A data center pays for electricity, cooling, and hardware depreciation — all of which scale with watts drawn, not tokens emitted. By pricing in kWh, NeuralWatt aligns the customer’s cost with the provider’s actual resource consumption. Every API response includes an energy metric in milliwatt-hours, so users can see exactly what each request cost in physical terms.
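To make the unit concrete, here is a minimal sketch of turning a per-request energy reading into a dollar figure. The function name `request_cost_usd` is hypothetical and not part of any NeuralWatt SDK. The only inputs taken from the article are the $5.00/kWh flat rate and the fact that readings are reported in milliwatt-hours.

```python
MWH_PER_KWH = 1_000_000  # 1 kWh = 1,000 Wh = 1,000,000 mWh


def request_cost_usd(energy_mwh: float, rate_per_kwh: float = 5.00) -> float:
    """Convert a per-request energy reading (mWh) into a dollar cost."""
    return energy_mwh / MWH_PER_KWH * rate_per_kwh


# Example: a reading of 336.69 mWh (the figure the article lists for
# Qwen3.5 397B), billed at the flat pay-as-you-go rate.
print(f"${request_cost_usd(336.69):.5f} per request")
```

Passing a different `rate_per_kwh` gives the same estimate at a subscription's effective rate.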
Subscription Plans and Pay-As-You-Go Rates
NeuralWatt offers three subscription tiers, each with a monthly kWh allocation and a per-kWh effective discount compared to the pay-as-you-go rate of $5/kWh:
| Plan | Price | Included | Effective $/kWh | Key Feature |
|---|---|---|---|---|
| Basic | $20/mo | 6 kWh | ~$3.33 | All models, energy dashboard |
| Standard | $50/mo | 16 kWh | ~$3.13 | Priority access, analytics |
| Pro | $100/mo | 33 kWh | ~$3.03 | Highest priority, $5/kWh overage |
The Basic plan effectively gives a ~33% discount off pay-as-you-go pricing. Pro subscribers get the steepest discount at roughly 39% off, plus the option to buy overage at the standard $5/kWh rate when the included allocation runs out. New accounts also receive $1 in free credits to test the API before committing.
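The effective rates and discounts can be recomputed from the plan prices and allocations. The sketch below uses only the numbers in the table; the names `PLANS`, `effective_rate`, and `discount_vs_payg` are illustrative.

```python
PAYG_RATE = 5.00  # $/kWh, pay-as-you-go rate from the article

# Plan name -> (monthly price in $, included kWh), as listed in the table
PLANS = {"Basic": (20, 6), "Standard": (50, 16), "Pro": (100, 33)}


def effective_rate(price: float, included_kwh: float) -> float:
    """Dollars per kWh if the full monthly allocation is used."""
    return price / included_kwh


def discount_vs_payg(price: float, included_kwh: float,
                     payg_rate: float = PAYG_RATE) -> float:
    """Fractional discount of the effective rate relative to pay-as-you-go."""
    return 1 - effective_rate(price, included_kwh) / payg_rate


for name, (price, kwh) in PLANS.items():
    rate = effective_rate(price, kwh)
    off = discount_vs_payg(price, kwh)
    print(f"{name:9s} ${rate:.3f}/kWh  {off:.1%} off pay-as-you-go")
```

These discounts only materialize when the full allocation is consumed. A subscriber who uses half of the included energy pays twice the effective rate shown.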
Where Energy Pricing Beats Token Pricing
The key advantage of kWh billing shows up with Mixture-of-Experts (MoE) models. MoE architectures like Qwen3.5 397B and Kimi K2.5 activate only a fraction of their total parameters per request — Qwen3.5 397B uses roughly 17B active parameters out of 397B total. This means the GPU does significantly less work per token than a dense model of comparable size.
Under token pricing, MoE models are often priced similarly to dense models with the same parameter count, because providers charge for the model’s “size.” Under energy pricing, MoE models cost dramatically less because they physically consume less power. NeuralWatt claims up to 95% savings on MoE models compared to equivalent token-based rates. For a model like Kimi K2.5, which consumes 1.23 Wh per request, the energy cost at $5/kWh is roughly $0.006 — fractions of a cent per inference.
Available Models and Energy Per Request
The platform currently hosts 14 models from seven providers; the table below lists nine of them. Energy consumption per request varies widely with model architecture:
| Model | Provider | Energy/Request | Context |
|---|---|---|---|
| GPT-OSS 20B | OpenAI | 52.78 mWh | 16K |
| Qwen3.6 35B | Qwen | 191.93 mWh | 131K |
| MiniMax M2.5 | MiniMax | 296.02 mWh | 196K |
| Devstral Small 2 24B | Mistral | 331.63 mWh | 262K |
| Qwen3.5 397B | Qwen | 336.69 mWh | 262K |
| GLM-5.1 Fast | ZhipuAI | 712.39 mWh | 200K |
| GLM-5.1 | ZhipuAI | 922.83 mWh | 200K |
| Kimi K2.5 | MoonshotAI | 1.23 Wh (1,230 mWh) | 262K |
| Kimi K2.6 Fast | MoonshotAI | 1.42 Wh (1,420 mWh) | 262K |
The most energy-efficient option is GPT-OSS 20B at just 52.78 mWh per request, which works out to roughly 19,000 requests per kWh. At the Basic plan’s effective rate of $3.33/kWh, each inference costs about $0.00018. The most demanding model listed, Kimi K2.6 Fast at 1.42 Wh, still delivers about 700 requests per kWh, and GLM-5.1 at 922.83 mWh delivers over 1,000, which is practical for most workloads at these price points.
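The requests-per-kWh figures follow directly from the table. The sketch below converts every listed energy figure to mWh and derives throughput per kWh and cost per request at the $5/kWh pay-as-you-go rate. The dictionary simply transcribes the table; the helper name `requests_per_kwh` is illustrative.

```python
# Energy per request in mWh, transcribed from the table above
ENERGY_MWH = {
    "GPT-OSS 20B": 52.78,
    "Qwen3.6 35B": 191.93,
    "MiniMax M2.5": 296.02,
    "Devstral Small 2 24B": 331.63,
    "Qwen3.5 397B": 336.69,
    "GLM-5.1 Fast": 712.39,
    "GLM-5.1": 922.83,
    "Kimi K2.5": 1230.0,       # listed as 1.23 Wh
    "Kimi K2.6 Fast": 1420.0,  # listed as 1.42 Wh
}


def requests_per_kwh(energy_mwh: float) -> float:
    """How many requests of this size fit into one kWh (1,000,000 mWh)."""
    return 1_000_000 / energy_mwh


# Print models from least to most energy per request
for model, mwh in sorted(ENERGY_MWH.items(), key=lambda kv: kv[1]):
    n = requests_per_kwh(mwh)
    print(f"{model:22s} {n:>8,.0f} req/kWh  ${5.00 / n:.5f}/req at $5/kWh")
```

The listed figures are per-request averages, so real throughput will shift with prompt length and output size.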
How NeuralWatt Measures Energy
Each API response includes a per-request energy figure in milliwatt-hours. NeuralWatt applies an attribution cap, a maximum energy charge per single request, so under concurrent load, when several requests share the same GPU, the energy attributed to each request can come in below the listed figure. The measurement is based on GPU compute: the electricity drawn by the GPU during inference, plus proportional shares of cooling and overhead.
The platform also supports prefix caching: when the same system prompt or context prefix is reused across requests, cached tokens skip recomputation and are billed at 25% of the normal input rate. This directly reduces energy consumption for long-context applications — a chatbot with a 10,000-token system prompt, for example, avoids re-processing that prefix on every turn.
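A rough sketch of what the 25% cached rate means over a conversation follows. It makes several simplifying assumptions that the article does not confirm: the prefix is processed at full rate on the first turn and served from cache on every later turn, only the fixed prefix is cached (growing chat history is ignored), and billing is expressed in full-rate input-token equivalents. The function name `billed_input_units` is hypothetical.

```python
CACHED_RATE = 0.25  # cached tokens billed at 25% of the normal input rate


def billed_input_units(prefix_tokens: int, new_tokens_per_turn: int,
                       turns: int, cached_rate: float = CACHED_RATE) -> float:
    """Input billed across a conversation, in full-rate token equivalents.

    Assumes the prefix is computed at full rate on the first turn and
    hits the cache on every subsequent turn.
    """
    if turns <= 0:
        return 0.0
    first_turn = prefix_tokens + new_tokens_per_turn
    later_turns = (turns - 1) * (prefix_tokens * cached_rate + new_tokens_per_turn)
    return first_turn + later_turns


# The article's example: a 10,000-token system prompt. The 200 new tokens
# per turn and the 20 turns are made-up illustration values.
with_cache = billed_input_units(10_000, 200, turns=20)
without_cache = billed_input_units(10_000, 200, turns=20, cached_rate=1.0)
print(f"with cache:    {with_cache:,.0f}")
print(f"without cache: {without_cache:,.0f}")
print(f"reduction:     {1 - with_cache / without_cache:.1%}")
```

The longer the shared prefix is relative to the new tokens in each turn, the closer the reduction gets to the 75% ceiling implied by the cached rate.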
Neuralwatt Optimize: The Technology Behind the Pricing
The energy-based pricing is not just a billing gimmick — it is backed by a GPU power optimization engine called Neuralwatt Optimize. The system performs real-time adaptive power tuning on NVIDIA datacenter GPUs (Ada, Hopper, Blackwell architectures), dynamically adjusting power limits based on workload characteristics.
According to NeuralWatt, the optimization achieves roughly 33% more inference output from the same power envelope — essentially running 8 GPUs on the power of 6. The system also supports grid-aware demand response, meaning it can flex power consumption based on electricity grid conditions, reducing cooling requirements and carbon intensity during peak demand periods. The technology is tested in data centers with 100+ MW capacity and is also available as an on-premise deployment (Neuralwatt Deploy) for organizations that want to run the optimization stack on their own hardware.
Who Benefits from Energy-Based Pricing
The model makes the most sense for three categories of users. Developers running high-volume MoE model workloads benefit the most: the 95% savings claim on models like Qwen3.5 and Kimi K2.5 is significant at scale. A batch job processing 100,000 requests through Qwen3.5 397B would consume roughly 33.7 kWh of energy. That is about $168 at the $5/kWh pay-as-you-go rate (about $112 at the Basic plan’s effective $3.33/kWh), and it fits almost entirely within a single Pro subscription’s 33 kWh, with roughly 0.7 kWh (about $3.35) left to overage.
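The batch-job arithmetic can be checked with a short sketch. It takes the per-request energy for Qwen3.5 397B, the $5/kWh pay-as-you-go and overage rate, and the Pro plan's $100 for 33 kWh from the article. It assumes overage is billed linearly per kWh beyond the allocation; the function names are illustrative.

```python
def batch_energy_kwh(requests: int, energy_mwh_per_request: float) -> float:
    """Total energy for a batch, converted from mWh to kWh."""
    return requests * energy_mwh_per_request / 1_000_000


def plan_cost_usd(energy_kwh: float, plan_price: float, included_kwh: float,
                  overage_rate: float = 5.00) -> float:
    """Monthly cost on a plan: flat price plus any overage beyond the allocation."""
    overage_kwh = max(0.0, energy_kwh - included_kwh)
    return plan_price + overage_kwh * overage_rate


energy = batch_energy_kwh(100_000, 336.69)   # Qwen3.5 397B, per the table
payg = energy * 5.00                         # pay-as-you-go at $5/kWh
pro = plan_cost_usd(energy, plan_price=100, included_kwh=33)

print(f"energy:        {energy:.3f} kWh")
print(f"pay-as-you-go: ${payg:.2f}")
print(f"Pro plan:      ${pro:.2f}")
```

The same helper can compare tiers for a given monthly volume by swapping in another plan's price and allocation, provided that plan also allows overage.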
Cost-conscious teams who want predictable monthly AI spend benefit from the subscription model: a runaway prompt loop cannot silently run up the bill, because usage is metered against the monthly kWh allocation. Once the allocation is exhausted, requests stop or switch to overage pricing at $5/kWh.
Energy-aware organizations — companies with ESG reporting requirements or sustainability mandates — get built-in energy reporting at no extra cost. Every API response includes energy metrics, and the dashboard provides consumption trends and model efficiency comparisons. For a company tracking the carbon footprint of its AI usage, this data would otherwise require third-party tools or manual estimation.
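For teams that want their own record alongside the dashboard, per-request readings can be rolled up client-side. The sketch below assumes the caller has already collected `(model, energy_mwh)` pairs from API responses. The article does not specify the response field name, so the log format here is hypothetical.

```python
from collections import defaultdict


def energy_by_model_kwh(readings):
    """Sum per-request energy readings (mWh) into kWh totals per model."""
    totals = defaultdict(float)
    for model, energy_mwh in readings:
        totals[model] += energy_mwh / 1_000_000  # mWh -> kWh
    return dict(totals)


# Hypothetical log of readings gathered from individual responses
log = [
    ("GPT-OSS 20B", 52.78),
    ("Qwen3.5 397B", 336.69),
    ("GPT-OSS 20B", 52.78),
]

report = energy_by_model_kwh(log)
for model, kwh in sorted(report.items()):
    print(f"{model:15s} {kwh:.8f} kWh")
```

Converting these totals into a carbon figure would additionally require a grid emission factor, which the article does not provide.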
Does Energy Pricing Make Sense as an Industry Model
NeuralWatt’s approach is novel for the inference API market, but it mirrors how cloud compute has always worked: AWS, GCP, and Azure all charge for VM hours and GPU time, not per-operation. The token pricing model emerged as a convenient abstraction for LLM APIs, but it obscures the physical cost of computation. As models grow larger and data center power consumption becomes a genuine constraint — major providers are now spending billions on power infrastructure — pricing that reflects real energy use may become more common.
The limitation is that energy pricing requires users to think in a new unit. Most developers are accustomed to estimating costs per million tokens and have budgeting tools built around that model. Switching to kWh means recalibrating expectations and building new cost-estimation workflows. For now, NeuralWatt mitigates this by also offering traditional token-based pricing as an alternative, so users can choose the billing model that fits their existing processes.
Bottom Line
NeuralWatt is a small player in a market dominated by OpenAI, Anthropic, and Google, but its pricing model is genuinely different. For workloads heavy on MoE models, the energy-based approach promises savings that are easy to verify: every request comes with an energy reading in milliwatt-hours. The $20/month Basic plan is a low-risk entry point for individual developers, and the $1 free credit removes the barrier to testing. Whether energy-based pricing becomes an industry standard is uncertain, but NeuralWatt shows that it is technically feasible today.