Large Language Models (LLMs) are everywhere, from everyday apps to advanced tools. Using them is easy. But what if you need to run your own model? Whether you’ve fine-tuned one or are dealing with privacy-sensitive data, the complexity increases. In this post, we’ll share what we learned while building our own LLM inference system. We’ll cover storing and deploying models, designing the service architecture, and solving real-world issues like routing, streaming, and managing microservices. The process involved challenges, but ultimately, we built a reliable system and gathered lessons worth sharing.
Introduction

LLMs are powering a wide range of applications — from chatbots and workflow agents to smart automation tools. While retrieval-augmented generation, tool-calling, and multi-agent protocols are important, they operate at a level above the core engine: a foundational LLM.
Many projects rely on external providers, such as OpenAI, Gemini, or Anthropic, which is sufficient for most use cases. But for certain applications, it quickly becomes a problem. What if the provider goes down? What if you need full control over latency, pricing, or uptime? Most importantly — what if you care about privacy and can’t afford to send user data to a third party?
That’s where self-hosting becomes essential. Serving a pretrained or fine-tuned model provides control, security, and the ability to tailor the model to specific business needs. Building such a system doesn’t require a large team or extensive resources. We built ours with a modest budget, a small team, and just a few nodes. This constraint shaped our architectural decisions, requiring us to focus on practicality and efficiency. In the following sections, we’ll cover the challenges we faced, the solutions we implemented, and the lessons learned along the way.
General overview

These are the core components that form the backbone of the system.
Choosing the right schema for data transfer is crucial. A shared format across services simplifies integration, reduces errors, and improves adaptability. We aimed to design the system to work seamlessly with both self-hosted models and external providers — without exposing differences to the user.
Why Schema Design Matters

There’s no universal standard for LLM data exchange. Many providers follow schemas similar to OpenAI’s, while others — like Claude or Gemini — introduce subtle but important differences. Many of these providers offer OpenAI-compatible SDKs that retain the same schema, though often with limitations or reduced feature sets (e.g., Anthropic’s OpenAI-compatible SDK, Gemini’s OpenAI compatibility layer). Other projects such as OpenRouter aim to unify these variations by wrapping them into an OpenAI-compatible interface.
Sticking to a single predefined provider’s schema has its benefits:
But there are real downsides too:
To address this, we chose to define our own internal data model — a schema designed around our needs, which we can then map to various external formats when necessary.
Internal Schema Design

Before addressing the challenges, let’s define the problem and outline our expectations for the solution:
We began by reviewing major LLM schemas to understand how providers structure messages, parameters, and outputs. This allowed us to extract core domain entities common across most systems, including:
We identified certain parameters, such as service_tier, usage_metadata, or reasoning_mode, as being specific to the provider's internal configuration and business logic. These elements lie outside the core LLM domain and are not part of the shared schema. Instead, they are treated as optional extensions. Whenever a feature becomes widely adopted or necessary for broader interoperability, we evaluate integrating it into the core schema.
At a high level, our input schema is structured with these key components:
This leads us to the following schema, represented in a Pydantic-like format. It illustrates the structure and intent of the design, though some implementation details are omitted for simplicity.
```python
class ChatCompletionRequest(BaseModel):
    model: str                                   # Routing key to select the appropriate model or service
    messages: list[Message]                      # Prompt and dialogue history
    generation_parameters: GenerationParameters  # Core generation settings
    tools: list[Tool]                            # Optional tool definitions


class GenerationParameters(BaseModel):
    temperature: float
    top_p: float
    max_tokens: int
    beam_search: BeamSearchParams
    # Optional, non-core fields specific to certain providers
    provider_extensions: dict[str, Any] = {}
    ...  # Other parameters
```

We deliberately moved generation parameters into a separate nested field instead of placing them at the root level. This design choice makes a distinction between constant parameters (e.g., temperature, top-p, model settings) and variable components (e.g., messages, tools). Many teams in our ecosystem store these constant parameters in external configuration systems, making this separation both practical and necessary.
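For illustration, a request might be assembled like this, with the constant generation settings coming from configuration and the messages varying per call. This is a minimal usage sketch against the schema above; the Message fields and the optional-field defaults are assumptions, not the exact production models.

```python
# Minimal usage sketch, assuming Message(role, content) and that non-essential
# fields (e.g., beam_search) are optional in the full schema.
params = GenerationParameters(
    temperature=0.2,
    top_p=0.95,
    max_tokens=1024,
    provider_extensions={"service_tier": "default"},  # passed through untouched
)

request = ChatCompletionRequest(
    model="llama-3-70b-instruct",
    messages=[Message(role="user", content="Summarize the quarterly report.")],
    generation_parameters=params,
    tools=[],
)
```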
We include an additional field called provider_extensions within the GenerationParameters class. Because these parameters vary significantly across LLM providers, validation and interpretation of these fields is delegated to the final module that handles model inference — the component that knows how to communicate with a specific model provider. This way, we avoid unnecessary pass-through coupling caused by redundant data validation across multiple services.
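As a sketch of that delegation (the adapter class, method, and extension names below are illustrative rather than our exact code), only the provider-facing module inspects provider_extensions:

```python
# Illustrative sketch: only the provider-specific adapter interprets
# provider_extensions; upstream services pass the dict through untouched.
class ExampleProviderAdapter:
    SUPPORTED_EXTENSIONS = {"service_tier", "reasoning_mode"}

    def build_payload(self, request: ChatCompletionRequest) -> dict:
        extensions = request.generation_parameters.provider_extensions
        unknown = set(extensions) - self.SUPPORTED_EXTENSIONS
        if unknown:
            raise ValueError(f"Unsupported provider extensions: {unknown}")

        payload = {
            "model": request.model,
            "messages": [m.model_dump() for m in request.messages],
            "temperature": request.generation_parameters.temperature,
            "max_tokens": request.generation_parameters.max_tokens,
        }
        payload.update(extensions)  # validated provider-specific fields
        return payload
```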
To ensure backward compatibility, new output schema features are introduced as explicit, optional fields in the request schema. These fields act as feature flags — users must set them to opt into specific behaviors. This approach keeps the core schema stable while enabling incremental evolution. For example, reasoning traces will only be included in the output if the corresponding field is set in the request.
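A sketch of what such an opt-in flag looks like (the field name here is illustrative):

```python
class ChatCompletionRequest(BaseModel):
    # ...existing fields from the sketch above...
    include_reasoning: bool = False  # feature flag: reasoning traces are emitted
                                     # in the output only when this is set to True
```

Clients that never send the field keep getting exactly the previous behavior, so the change is invisible to existing integrations.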
These schemas are maintained in a shared Python library and used across services to ensure consistent request and response handling.
Working with Third-Party providers

We began by outlining how we built our own platform — so why bother with compatibility across external providers? Despite relying on our internal infrastructure, there are still several scenarios where external models play a role:
The overall communication flow with external providers can be summarized as follows:
(Diagram: high-level communication flow with external providers.)
This process involves the following steps:
This is a high-level schematic that abstracts away some individual microservices. Details about specific components and the streaming response format will be covered in the following sections.
Streaming format

LLM responses are generated incrementally — token by token — and then aggregated into chunks for efficient transmission. From the user’s perspective, whether through a browser, mobile app, or terminal, the experience must remain fluid and responsive. This requires a transport mechanism that supports low-latency, real-time streaming.
There are two primary options for achieving this:
While both options are viable, SSE is the more commonly used solution for standard LLM inference — particularly for OpenAI-compatible APIs and similar systems. This is due to several practical advantages:
Because of these benefits, SSE is typically chosen for text-only, prompt-response streaming use cases.
However, some emerging use cases require richer, low-latency, bidirectional communication — such as real-time transcription or speech-to-speech interactions. OpenAI’s Realtime API addresses these needs using WebSockets (for server-to-server). These protocols are better suited for continuous multimodal input and output.
Since our system focuses exclusively on text-based interactions, we stick with SSE for its simplicity, compatibility, and alignment with our streaming model.
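As a concrete reference, here is a minimal sketch of serving an SSE stream with FastAPI. The endpoint path and chunk payload mirror the OpenAI-style format, but this is illustrative rather than our gateway code.

```python
# Minimal SSE sketch with FastAPI: each frame is "data: <json>" plus a blank line.
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def generate_chunks():
    # In the real service, tokens come from the inference backend.
    for token in ["Hel", "lo", "!"]:
        payload = {"choices": [{"index": 0, "delta": {"content": token}}]}
        yield f"data: {json.dumps(payload)}\n\n"
    yield "data: [DONE]\n\n"  # conventional end-of-stream marker

@app.post("/v1/chat/completions")
async def chat_completions():
    return StreamingResponse(generate_chunks(), media_type="text/event-stream")
```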
Response Stream Content

With SSE selected as the transport layer, the next step was defining what data to include in the stream. Effective streaming requires more than just raw text — it needs to provide sufficient structure, metadata, and context to support downstream consumers such as user interfaces and automation tools. The stream must include the following information:
After defining the structure of the streamed response, we also considered several non-functional requirements essential for reliability and future evolution.
Our stream design is intended to be:
In many applications — such as side-by-side comparison or diverse sampling — multiple sequences (completions) are generated in parallel as part of a single generation request.
The most comprehensive format for streaming responses is defined in the OpenAI API reference. According to the specification, a single generation chunk may include multiple sequences in the choices array:
choices (array): A list of chat completion choices. Can contain more than one element if n is greater than 1. Can also be empty for the last chunk.
Although, in practice, individual chunks usually contain only a single delta, the format allows for multiple sequence updates per chunk. It’s important to account for this, as future updates might make broader use of this capability. Notably, even the official Python SDK is designed to support this structure.
We chose to follow the same structure to ensure compatibility with a wide range of potential features. The diagram below illustrates an example from our implementation, where a single generation consists of three sequences, streamed in six chunks over time:
As you see, to make the stream robust and easier to parse, we opted to explicitly signal Start and Finish events for both the overall generation and each individual sequence, rather than relying on implicit mechanisms such as null checks, EOFs, or magic tokens. This structured approach simplifies downstream parsing, especially in environments where multiple completions are streamed in parallel, and also improves debuggability and fault isolation during development and runtime inspection.
Moreover, we introduce an additional Error chunk that carries structured information about failures. Some errors — such as malformed requests or authorization issues — can be surfaced directly via standard HTTP response codes. However, if an error occurs during the generation process, we have two options: either abruptly terminate the HTTP stream or emit a well-formed SSE error event. We chose the latter. Abruptly closing the connection makes it hard for clients to distinguish between network issues and actual model/service failures. By using a dedicated error chunk, we enable more reliable detection and propagation of issues during streaming.
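A simplified sketch of such a chunk envelope, covering the lifecycle and error events described above; the event names and fields are illustrative of the approach, not our exact schema.

```python
# Illustrative chunk envelope: explicit lifecycle events instead of implicit
# EOF, null checks, or magic tokens.
from enum import Enum
from pydantic import BaseModel

class ChunkType(str, Enum):
    GENERATION_START = "generation_start"
    SEQUENCE_START = "sequence_start"
    DELTA = "delta"                      # incremental tokens for one sequence
    SEQUENCE_FINISH = "sequence_finish"
    GENERATION_FINISH = "generation_finish"
    ERROR = "error"                      # structured failure info, still a valid SSE event

class StreamChunk(BaseModel):
    request_id: str                      # lets the Scheduler route chunks to the right handler
    type: ChunkType
    sequence_index: int | None = None    # which of the parallel sequences this refers to
    content: str | None = None           # token text for DELTA chunks
    error: dict | None = None            # populated only for ERROR chunks
```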
Backend Services and Request Flow

At the center of the system is a single entrypoint: LLM-Gateway. It handles basic concerns like authentication, usage tracking and quota enforcement, request formatting, and routing based on the specified model. While it may look like the Gateway carries a lot of responsibility, each task is intentionally simple and modular. For external providers, it adapts requests to their APIs and maps responses back into a unified format. For self-hosted models, requests are routed directly to internal systems using our own unified schema. This design allows seamless support for both external and internal models through a consistent interface.
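In spirit, the Gateway’s routing step is a small dispatch on the model name. The sketch below is illustrative; the adapter and scheduler objects are hypothetical placeholders injected as dependencies.

```python
# Simplified routing sketch inside the LLM-Gateway; model names and clients
# are placeholders, not our production registry.
from typing import Any, AsyncIterator

EXTERNAL_PROVIDERS = {"gpt-4o": "openai", "claude-sonnet-4": "anthropic"}

async def route(
    request: ChatCompletionRequest,
    adapters: dict[str, Any],
    scheduler: Any,
) -> AsyncIterator[Any]:
    provider = EXTERNAL_PROVIDERS.get(request.model)
    if provider is not None:
        # External model: adapt our schema to the provider's API and map the
        # streamed response back into the unified chunk format.
        return adapters[provider].stream(request)
    # Self-hosted model: hand the unified schema to the internal Scheduler,
    # which publishes the task to the per-model queue.
    return scheduler.stream(request)
```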
Self-Hosted Models

As mentioned earlier, Server-Sent Events (SSE) is well-suited for streaming responses to end users, but it’s not a practical choice for internal backend communication. When a request arrives, it must be routed to a suitable worker node for model inference, and the result streamed back. While some systems handle this using chained HTTP proxies and header-based routing, in our experience, this approach becomes difficult to manage and evolve as the logic grows in complexity.
Our internal infrastructure needs to support:
To address these requirements, we use a message broker to decouple task routing from result delivery. This design provides better flexibility and resilience under varying load and routing conditions. We use RabbitMQ for this purpose, though other brokers could also be viable depending on your latency, throughput, and operational preferences. RabbitMQ was a natural fit given its maturity and alignment with our existing tooling.
Now let’s take a closer look at how this system is implemented in practice:
We use dedicated queues per model, allowing us to route requests based on model compatibility and node capabilities. The process is as follows:
To handle large payloads, we avoid overwhelming the message broker:
When it comes to routing and publishing messages, each Request Queue is a regular RabbitMQ queue, dedicated to handling a single model type. We require priority-aware scheduling, which can be achieved using message priorities. In this setup, messages with higher priority values are delivered and processed before lower priority ones. For hardware-aware routing, where messages should be directed to the most performant available nodes first, consumer priorities can help. Consumers with higher priority receive messages as long as they are active; lower-priority consumers only receive messages when the higher-priority ones are blocked or unavailable.
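With RabbitMQ (via pika), both mechanisms come down to a queue argument plus publish and consume options, roughly as follows; queue and model names are placeholders.

```python
# Sketch of priority-aware scheduling with RabbitMQ (pika); names are placeholders.
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("rabbitmq"))
channel = connection.channel()

# Message priorities: the queue must declare a maximum priority level.
channel.queue_declare(
    queue="requests.llama-3-70b",
    durable=True,
    arguments={"x-max-priority": 10},
)

# Publisher side: higher priority values are delivered and processed first.
channel.basic_publish(
    exchange="",
    routing_key="requests.llama-3-70b",
    body=b"<serialized request>",
    properties=pika.BasicProperties(priority=5, delivery_mode=2),  # persistent
)

# Consumer side: consumer priorities steer work to the most performant nodes first.
channel.basic_consume(
    queue="requests.llama-3-70b",
    on_message_callback=lambda ch, method, props, body: None,  # real handler omitted
    arguments={"x-priority": 10},  # e.g. nodes with newer GPUs get a higher value
)
```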
If message loss is unacceptable, the following must be in place:
So far, we’ve covered how tasks are published — but how is the streamed response handled? The first step is to understand how temporary queues work in RabbitMQ. The broker supports a concept called exclusive queues, which are bound to a single connection and automatically deleted when that connection closes. This makes them a natural fit for our setup.
We create one exclusive queue per Scheduler service replica, ensuring it’s automatically cleaned up when the replica shuts down. However, this introduces a challenge: while each service replica has a single RabbitMQ queue, it must handle many requests in parallel.
To address this, we treat the RabbitMQ queue as a transport layer, routing responses to the correct Scheduler replica. Each user request is assigned a unique identifier, which is included in every response chunk. Inside the Scheduler, we maintain an additional in-memory routing layer with short-lived in-memory queues — one per active request. Incoming chunks are matched to these queues based on the identifier and forwarded accordingly. These in-memory queues are discarded once the request completes, while the RabbitMQ queue persists for the lifetime of the service replica.
Schematically this looks as follows:
A central dispatcher within the Scheduler routes chunks to the appropriate in-memory queue, each managed by a dedicated handler. Handlers then stream the chunks to users over the SSE protocol.
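A condensed sketch of that routing layer, reusing the StreamChunk model from the earlier sketch; the real implementation carries more bookkeeping (timeouts, cancellation, backpressure).

```python
# Condensed sketch of per-request chunk routing inside a Scheduler replica.
import asyncio

class ChunkDispatcher:
    def __init__(self):
        # One short-lived in-memory queue per active request, keyed by request id.
        self._routes: dict[str, asyncio.Queue] = {}

    def register(self, request_id: str) -> asyncio.Queue:
        queue = asyncio.Queue()
        self._routes[request_id] = queue
        return queue

    def dispatch(self, chunk: StreamChunk) -> None:
        # Called for every chunk read from the replica's exclusive RabbitMQ queue.
        route = self._routes.get(chunk.request_id)
        if route is not None:
            route.put_nowait(chunk)

    def finish(self, request_id: str) -> None:
        # Drop the in-memory queue once the request completes; the RabbitMQ
        # queue itself lives as long as the replica does.
        self._routes.pop(request_id, None)
```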
Inference

There are several mature frameworks available for efficient LLM inference, such as vLLM and SGLang. These systems are designed to process multiple sequences in parallel and generate response tokens in real time, often with features like continuous batching and GPU memory optimization. In our setup, we use vLLM as the core inference engine, with a few custom modifications:
Through experience, we’ve learned that even minor library updates can significantly alter model behavior — whether in output quality, determinism, or concurrency behavior. Because of this, we’ve established a robust testing pipeline:
Most modern systems run in containerized environments — either in the cloud or within Kubernetes (K8s). While this setup works well for typical backend services, it introduces challenges around model weight storage. LLM models can be tens or even hundreds of gigabytes in size, and baking model weights directly into Docker images quickly becomes problematic:
To solve this, we separate model storage from the Docker image lifecycle. Our models are stored in an external S3-compatible object storage, and fetched just before inference service startup. To improve startup time and avoid redundant downloads, we also use local persistent volumes (PVCs) to cache model weights on each node.
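A minimal sketch of the fetch-before-startup step, assuming an S3-compatible store reachable via boto3 and a persistent volume mounted at a cache path; the bucket name and paths are placeholders.

```python
# Sketch: download model weights from S3-compatible storage into a node-local
# cache (persistent volume) before starting the inference server.
import os
import boto3

CACHE_DIR = "/models-cache"   # mounted PVC
BUCKET = "llm-weights"        # placeholder bucket name

def ensure_model(model_name: str) -> str:
    target_dir = os.path.join(CACHE_DIR, model_name)
    if os.path.isdir(target_dir):      # already cached on this node
        return target_dir

    s3 = boto3.client("s3", endpoint_url=os.environ["S3_ENDPOINT"])
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=f"{model_name}/"):
        for obj in page.get("Contents", []):
            local_path = os.path.join(CACHE_DIR, obj["Key"])
            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            s3.download_file(BUCKET, obj["Key"], local_path)
    return target_dir
```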
Observability

A system like this — built on streaming, message queues, and real-time token generation — requires robust observability to ensure reliability and performance at scale.
In addition to standard service-level metrics (CPU, memory, error rates, etc.), we found it essential to monitor the following:
While our system is production-ready, there are still important challenges and opportunities for optimization:
While building a reliable and provider-independent LLM serving system can seem complex at first, it doesn’t require reinventing the wheel. Each component — streaming via SSE, task distribution through message brokers, and inference handled by runtimes like vLLM — serves a clear purpose and is grounded in existing, well-supported tools. With the right structure in place, it’s possible to create a maintainable and adaptable setup that meets production requirements without unnecessary complexity.
In the next post, we’ll explore more advanced topics such as distributed KV-caching, handling multiple models across replicas, and deployment workflows suited to ML-oriented teams.
Authors

Stanislav Shimovolos, Tochka
Maxim Afanasyev, Tochka
Acknowledgments

Dmitry Kryukov, work done at Tochka