DISTRIBUTED SYSTEM PROTOTYPE

Scalable E2E
Inference Platform

A personal engineering project implementing a distributed, multimodal AI inference system, built to showcase proficiency in microservices, low-latency streaming, and system architecture.

Core Technologies

Next.js 16
FastAPI
gRPC / Protobuf
Supabase
Docker

System Architecture

Designed to avoid the latency overhead of request/response HTTP in AI applications by using streaming gRPC for internal service-to-service communication.

01

Frontend Layer

Next.js 16 App Router handles the UI. Establishes a streaming connection to the API Gateway.

02

API Gateway (Orchestrator)

FastAPI service that authenticates via Supabase, validates schemas, and routes requests to inference nodes via gRPC.
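
The request path through the gateway might look like the following sketch. It assumes the Supabase JWT arrives in the Authorization header; verify_supabase_jwt and InferenceClient are hypothetical placeholders for the project's auth check and gRPC client, not its actual code.

# gateway.py (illustrative sketch; helper names are assumptions)
from fastapi import FastAPI, Header, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class ChatBody(BaseModel):
    prompt: str              # schema is validated here, before any gRPC call
    model: str = "default"

@app.post("/v1/chat")
async def chat(body: ChatBody, authorization: str = Header(...)):
    user = await verify_supabase_jwt(authorization)    # hypothetical Supabase JWT check
    if user is None:
        raise HTTPException(status_code=401, detail="Invalid or expired token")

    async def relay():
        # Forward the prompt to an inference node over gRPC, re-emit tokens as SSE
        async for token in InferenceClient().stream(prompt=body.prompt, user_id=user.id):
            yield f"data: {token}\n\n"

    return StreamingResponse(relay(), media_type="text/event-stream")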

03

Inference Engine

Isolated Python service running PyTorch. Models are kept in memory for hot-path execution. Returns a stream of tokens.
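
A minimal sketch of that hot path, assuming grpc.aio and stubs generated from the ChatStream definition shown in the IDL section below; the prompt/token field names, load_model, and generate_tokens are illustrative assumptions rather than the project's real interfaces.

# inference_server.py (illustrative sketch; model API and field names are assumptions)
import asyncio
import grpc
import inference_pb2
import inference_pb2_grpc

class InferenceService(inference_pb2_grpc.InferenceServiceServicer):
    def __init__(self, model):
        self.model = model    # loaded once at startup and kept in memory (hot path)

    async def ChatStream(self, request_iterator, context):
        # Bidirectional stream: consume requests and emit tokens as they are produced
        async for request in request_iterator:
            for token in self.model.generate_tokens(request.prompt):   # placeholder API
                yield inference_pb2.ChatResponse(token=token)

async def serve():
    server = grpc.aio.server()
    inference_pb2_grpc.add_InferenceServiceServicer_to_server(
        InferenceService(model=load_model()), server)   # load_model(): hypothetical PyTorch loader
    server.add_insecure_port("[::]:50051")
    await server.start()
    await server.wait_for_termination()

if __name__ == "__main__":
    asyncio.run(serve())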

Interface Definition (IDL)

service InferenceService {
  // Bidirectional streaming for real-time interaction
  rpc ChatStream (stream ChatRequest) returns (stream ChatResponse);
}
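
The other services consume this contract through generated stubs. Below is a minimal sketch of the gateway-side client, assuming the definition lives in protos/inference.proto and that ChatRequest and ChatResponse carry prompt and token fields (both assumptions, since the message bodies are not shown here).

# Codegen via grpcio-tools:
#   python -m grpc_tools.protoc -I protos --python_out=. --grpc_python_out=. protos/inference.proto
import asyncio
import grpc
import inference_pb2
import inference_pb2_grpc

async def chat(prompt: str) -> None:
    async with grpc.aio.insecure_channel("inference:50051") as channel:
        stub = inference_pb2_grpc.InferenceServiceStub(channel)

        async def requests():
            # A single turn here, but the stream stays open for follow-up messages
            yield inference_pb2.ChatRequest(prompt=prompt)

        async for response in stub.ChatStream(requests()):
            print(response.token, end="", flush=True)

asyncio.run(chat("Hello"))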

Why this stack?

  • gRPC vs REST: Chosen for smaller payload size and strongly typed contracts between microservices, critical for high-throughput AI streams (see the payload sketch after this list).
  • FastAPI: Native async support allows handling thousands of concurrent connections (websockets/streams) efficiently compared to synchronous frameworks.
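
To make the payload-size and typing points concrete, the snippet below encodes the same request as protobuf and as JSON. The prompt field is an assumption (the message bodies are not shown above); the compact binary wire format and type-checked setters are standard protobuf behavior.

import json
import inference_pb2   # generated module; the prompt field below is an assumption

req = inference_pb2.ChatRequest(prompt="Summarize this document")
wire = req.SerializeToString()                               # compact binary encoding
as_json = json.dumps({"prompt": "Summarize this document"}).encode()

print(len(wire), len(as_json))     # the protobuf payload is the smaller of the two
# req.prompt = 42                  # would raise TypeError: the contract is enforced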

Technical Specifications

Containerization

  • Docker Compose
  • Multi-stage Builds
  • Isolated Networks

Frontend

  • Next.js 16 (App Router)
  • Tailwind CSS v4
  • Lucide React

Backend Services

  • Python 3.11
  • FastAPI
  • gRPC / Protobuf

Data & Auth

  • Supabase (PostgreSQL)
  • Row Level Security (RLS)
  • SSR Auth

Performance

  • Streaming Responses
  • Async I/O
  • < 50ms TTFB

Security

  • Service-to-Service Token
  • Environment Isolation
  • Type Safety