
Need persistent HTTP server mode for deployments (like Ollama) #55


Description

@pfrydids

Very new here, so apologies if this already exists...

The intent here is to let JVM warm-up kick in once and then benefit every subsequent request.

Description:
Currently, GPULlama3.java requires spawning a new JVM process for each inference request when wrapped in a web API. This causes 20-80s latency per request due to repeated JVM/TornadoVM/model loading overhead.

Request: Add a persistent server mode where:

  1. Model loads once at startup and stays in GPU memory
  2. HTTP server accepts inference requests without process restarts
  3. Similar to how Ollama operates (loads model once, serves all requests from same process)
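Not having dug into the codebase, here is a rough sketch of what such a mode could look like using only the JDK's built-in com.sun.net.httpserver, so no web framework is required. The Model type below is a hypothetical stand-in for whatever GPULlama3.java actually exposes (not its real API); the point is just that the model is loaded once in main and reused by every request handler:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class LlamaServer {

    public static void main(String[] args) throws IOException {
        // Load the model once at startup; weights stay resident and the
        // JVM/TornadoVM warm-up cost is paid a single time.
        Model model = Model.load("/models/llama3.gguf"); // hypothetical path

        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/generate", exchange -> {
            // Read the raw prompt from the request body.
            String prompt = new String(exchange.getRequestBody().readAllBytes(),
                                       StandardCharsets.UTF_8);

            // Run inference in the already-warm process, with no JVM restart.
            String completion = model.generate(prompt);

            byte[] response = completion.getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, response.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(response);
            }
        });
        server.setExecutor(null); // default executor: requests handled sequentially
        server.start();
        System.out.println("Serving on http://localhost:8080/generate");
    }

    // Dummy stand-in for the real GPULlama3.java engine so the sketch runs
    // end to end; it just echoes the prompt back.
    record Model(String path) {
        static Model load(String path) { return new Model(path); }
        String generate(String prompt) { return "echo: " + prompt; }
    }
}
```

With something like this, a request such as `curl -d 'Why is the sky blue?' http://localhost:8080/generate` would hit the already-loaded model, so only the very first request pays the load and warm-up cost.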

Current workaround limitations:

  • Flask + subprocess: 20-80s latency (JVM/model reload per request)
  • Spring Boot + LangChain4j: Version incompatibility (langchain4j-gpu-llama3 requires Java 21, base image has Java 17)

Ideal solution: a built-in HTTP server (like Ollama's) or a Java 17-compatible LangChain4j integration
