Every GPU cluster has dead time. Training jobs finish, workloads shift, and hardware sits idle while power and cooling costs keep running.
FriendliAI wants to fill that gap. The company is launching a platform called InferenceSense that runs paid AI inference workloads on unused GPU cycles and splits the token revenue with the operator.
The pitch draws on an ad-tech parallel: just as publishers use Google AdSense to monetize unsold inventory, neocloud operators use InferenceSense to fill unused compute with live inference demand. The operator’s own jobs always take priority — the moment a scheduler reclaims a GPU, InferenceSense yields. According to the announcement, the handoff happens within seconds.
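The announcement does not describe the preemption mechanism in detail, but the priority rule can be illustrated with a toy sketch. All names below (`InferenceWorker`, `reclaim`, `serve`) are hypothetical, not FriendliAI's actual API; the point is only that borrowed work stops the moment the operator's scheduler takes the GPU back.

```python
class InferenceWorker:
    """Toy model of an opportunistic inference worker that yields
    the GPU as soon as the host scheduler reclaims it.
    Illustrative only; not InferenceSense's real interface."""

    def __init__(self):
        self.reclaimed = False
        self.tokens_served = 0

    def reclaim(self):
        # Called when the operator's own job needs the GPU back.
        self.reclaimed = True

    def serve(self, requests):
        served = []
        for req in requests:
            if self.reclaimed:
                # Stop immediately; in a real system the remaining
                # requests would be rerouted by the aggregation layer.
                break
            served.append(req)
            self.tokens_served += 1
        return served
```

In this sketch, any request still in flight when `reclaim()` fires is simply handed back to the router rather than finished on the reclaimed GPU.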
The researcher behind continuous batching
FriendliAI was founded in 2021 by Byung-Gon Chun, a former professor at Seoul National University who spent over a decade studying the efficient execution of machine learning models at scale. His research produced Orca, the paper that introduced continuous batching: instead of waiting for a fixed batch to fill before executing, the scheduler admits new requests and retires finished ones at every decoding iteration.
That paper became foundational to vLLM, the open-source inference engine now used across most production deployments.
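The scheduling idea can be shown in a few lines. The sketch below is a simplified model in the spirit of Orca's continuous batching, not its actual implementation: each "step" is one decode iteration, finished sequences leave the batch immediately, and queued requests take their slots without waiting for the whole batch to drain.

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Iteration-level scheduling sketch.
    `requests` maps request id -> number of tokens to generate.
    Returns (completion order, total decode steps)."""
    queue = deque(requests.items())
    active = {}          # request id -> tokens still to generate
    completed = []
    steps = 0
    while queue or active:
        # Admit waiting requests into any free batch slots.
        while queue and len(active) < max_batch:
            rid, n = queue.popleft()
            active[rid] = n
        # One decode step: every active sequence emits one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]
                completed.append(rid)
        steps += 1
    return completed, steps
```

With requests of 1, 3, and 2 tokens and a batch size of 2, this finishes in 3 steps; a static scheduler that waits for each full batch to complete would need 5, which is the efficiency gap continuous batching closes.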
“What we are providing is that instead of letting GPUs be idle, by running inferences they can monetize those idle GPUs,” Chun said.
How the platform works
InferenceSense runs on top of Kubernetes, which most neocloud operators already use for resource orchestration. An operator allocates a pool of GPUs to a Kubernetes cluster managed by FriendliAI, declaring which nodes are available and under what conditions they can be reclaimed. Idle detection runs through Kubernetes itself.
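The article does not specify what "idle" means in practice, but a plausible detection pass over a declared node pool might look like the following. The field names (`lendable`, `gpu_util`, `seconds_idle`) and thresholds are assumptions for illustration, not InferenceSense's actual schema; a real agent would read equivalent signals from the Kubernetes API and GPU metrics.

```python
def idle_gpus(nodes, util_threshold=0.05, idle_seconds=60):
    """Hypothetical idle-detection pass over a declared GPU pool.
    A node is eligible for borrowed inference work only if the
    operator marked it lendable, its GPU utilization is near zero,
    and it has sat idle longer than a grace period."""
    return [
        n["name"]
        for n in nodes
        if n["lendable"]
        and n["gpu_util"] < util_threshold
        and n["seconds_idle"] >= idle_seconds
    ]
```

The grace period matters: lending a GPU that freed up two seconds ago risks immediate churn if the operator's scheduler is about to place another job on it.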
When GPUs go unused, InferenceSense spins up isolated containers serving paid inference workloads. Supported models include DeepSeek, Qwen, Kimi, GLM, and MiniMax. The company also supports more than 500,000 open-weight models through its existing endpoint service, which appears alongside Azure, AWS, and GCP as a deployment option on Hugging Face.
Demand is aggregated through FriendliAI’s direct clients and through inference aggregators including OpenRouter. There are no upfront fees and no minimum commitments. A real-time dashboard shows operators which models are running, tokens being processed, and revenue accrued.
The distinction from spot GPU markets is structural. Platforms like CoreWeave, Lambda Labs, and RunPod rent out the vendor's own hardware; InferenceSense runs on hardware the operator already owns, monetizing tokens rather than raw capacity.
The company claims its inference engine delivers two to three times the throughput of a standard vLLM deployment, though Chun notes the figure varies by workload type.
This article is a curated summary based on third-party sources.