Very new here, so apologies if this already exists.
The intent here is to allow JVM warm-up to kick in instead of paying the startup cost on every request.
Description:
Currently, GPULlama3.java requires spawning a new JVM process for each inference request when wrapped in a web API. This causes 20-80s of latency per request due to repeated JVM startup, TornadoVM initialization, and model-loading overhead.
Request: Add a persistent server mode where:
- Model loads once at startup and stays in GPU memory
- HTTP server accepts inference requests without process restarts
- Similar to how Ollama operates (loads the model once, serves all requests from the same process)
Current workaround limitations:
- Flask + subprocess: 20-80s latency (JVM/model reload per request)
- Spring Boot + LangChain4j: version incompatibility (langchain4j-gpu-llama3 requires Java 21, but the base image has Java 17)
Ideal solution: a built-in HTTP server (like Ollama's) or a Java 17-compatible LangChain4j integration. A rough sketch of the former is below.
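To make the request concrete, here is a minimal sketch of what such a persistent server mode could look like using only the JDK's built-in com.sun.net.httpserver. The Model type, loadFromFile, and generate are placeholders for whatever GPULlama3.java's real entry points are, not its actual API:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class Llama3Server {

    public static void main(String[] args) throws IOException {
        // Load the model once at startup; it stays resident for the
        // lifetime of the process (placeholder call, real API may differ).
        Model model = Model.loadFromFile("/models/llama3.gguf");

        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/generate", exchange -> {
            // Read the raw prompt from the request body.
            String prompt = new String(exchange.getRequestBody().readAllBytes(),
                    StandardCharsets.UTF_8);

            // Reuse the already-loaded model; no JVM/TornadoVM restart,
            // so only the warm inference cost is paid per request.
            String completion = model.generate(prompt);

            byte[] response = completion.getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, response.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(response);
            }
        });
        server.start();
        System.out.println("Serving on http://localhost:8080/generate");
    }

    // Hypothetical stand-in for GPULlama3.java's model type.
    record Model(String path) {
        static Model loadFromFile(String path) { return new Model(path); }
        String generate(String prompt) { return "<completion for: " + prompt + ">"; }
    }
}
```

With something like this running, a client would hit the endpoint directly, e.g. `curl -s -X POST --data 'Why is the sky blue?' http://localhost:8080/generate`, and every request after the first reuses the warm JVM and the model already on the GPU, which is essentially how Ollama keeps a single resident process behind its HTTP API.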