DISTRIBUTED SYSTEM PROTOTYPE

Scalable E2E
Inference Platform

A personal engineering project implementing a distributed, multimodal AI inference system, built to showcase proficiency in microservices, low-latency streaming, and system architecture.

Core Technologies

Next.js 16
FastAPI
gRPC / Protobuf
Supabase
Docker

System Architecture

Designed to avoid the latency overhead of request/response HTTP in AI applications by using streaming gRPC for internal service-to-service communication.

01

Frontend Layer

Next.js 16 App Router handles the UI. Establishes a streaming connection to the API Gateway.

02

API Gateway (Orchestrator)

FastAPI service that authenticates via Supabase, validates schemas, and routes requests to inference nodes via gRPC.
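
The request path through the gateway might look like the following sketch. It assumes the Supabase JWT arrives in the Authorization header; verify_supabase_jwt and InferenceClient are hypothetical placeholders for the project's auth check and gRPC client, not its actual code.

# gateway.py (illustrative sketch; helper names are assumptions)
from fastapi import FastAPI, Header, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class ChatBody(BaseModel):
    prompt: str              # schema is validated here, before any gRPC call
    model: str = "default"

@app.post("/v1/chat")
async def chat(body: ChatBody, authorization: str = Header(...)):
    user = await verify_supabase_jwt(authorization)    # hypothetical Supabase JWT check
    if user is None:
        raise HTTPException(status_code=401, detail="Invalid or expired token")

    async def relay():
        # Forward the prompt to an inference node over gRPC, re-emit tokens as SSE
        async for token in InferenceClient().stream(prompt=body.prompt, user_id=user.id):
            yield f"data: {token}\n\n"

    return StreamingResponse(relay(), media_type="text/event-stream")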

03

Inference Engine

Isolated Python service running PyTorch. Models are kept in memory for hot-path execution. Returns a stream of tokens.
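
A minimal sketch of that hot path, assuming grpc.aio and stubs generated from the ChatStream definition shown in the IDL section below; the prompt/token field names, load_model, and generate_tokens are illustrative assumptions rather than the project's real interfaces.

# inference_server.py (illustrative sketch; model API and field names are assumptions)
import asyncio
import grpc
import inference_pb2
import inference_pb2_grpc

class InferenceService(inference_pb2_grpc.InferenceServiceServicer):
    def __init__(self, model):
        self.model = model    # loaded once at startup and kept in memory (hot path)

    async def ChatStream(self, request_iterator, context):
        # Bidirectional stream: consume requests and emit tokens as they are produced
        async for request in request_iterator:
            for token in self.model.generate_tokens(request.prompt):   # placeholder API
                yield inference_pb2.ChatResponse(token=token)

async def serve():
    server = grpc.aio.server()
    inference_pb2_grpc.add_InferenceServiceServicer_to_server(
        InferenceService(model=load_model()), server)   # load_model(): hypothetical PyTorch loader
    server.add_insecure_port("[::]:50051")
    await server.start()
    await server.wait_for_termination()

if __name__ == "__main__":
    asyncio.run(serve())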

Interface Definition (IDL)

service InferenceService {
  // Bidirectional streaming for real-time interaction
  rpc ChatStream (stream ChatRequest) returns (stream ChatResponse);
}
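
The other services consume this contract through generated stubs. Below is a minimal sketch of the gateway-side client, assuming the definition lives in protos/inference.proto and that ChatRequest and ChatResponse carry prompt and token fields (both assumptions, since the message bodies are not shown here).

# Codegen via grpcio-tools:
#   python -m grpc_tools.protoc -I protos --python_out=. --grpc_python_out=. protos/inference.proto
import asyncio
import grpc
import inference_pb2
import inference_pb2_grpc

async def chat(prompt: str) -> None:
    async with grpc.aio.insecure_channel("inference:50051") as channel:
        stub = inference_pb2_grpc.InferenceServiceStub(channel)

        async def requests():
            # A single turn here, but the stream stays open for follow-up messages
            yield inference_pb2.ChatRequest(prompt=prompt)

        async for response in stub.ChatStream(requests()):
            print(response.token, end="", flush=True)

asyncio.run(chat("Hello"))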

Why this stack?

  • gRPC vs REST: Chosen for smaller payload size and strongly typed contracts between microservices, critical for high-throughput AI streams (see the payload sketch after this list).
  • FastAPI: Native async support allows handling thousands of concurrent connections (websockets/streams) efficiently compared to synchronous frameworks.
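
To make the payload-size and typing points concrete, the snippet below encodes the same request as protobuf and as JSON. The prompt field is an assumption (the message bodies are not shown above); the compact binary wire format and type-checked setters are standard protobuf behavior.

import json
import inference_pb2   # generated module; the prompt field below is an assumption

req = inference_pb2.ChatRequest(prompt="Summarize this document")
wire = req.SerializeToString()                               # compact binary encoding
as_json = json.dumps({"prompt": "Summarize this document"}).encode()

print(len(wire), len(as_json))     # the protobuf payload is the smaller of the two
# req.prompt = 42                  # would raise TypeError: the contract is enforced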

Technical Specifications

Containerization

  • Docker Compose
  • Multi-stage Builds
  • Isolated Networks

Frontend

  • Next.js 16 (App Router)
  • Tailwind CSS v4
  • Lucide React

Backend Services

  • Python 3.11
  • FastAPI
  • gRPC / Protobuf

Data & Auth

  • Supabase (PostgreSQL)
  • Row Level Security (RLS)
  • SSR Auth

Performance

  • Streaming Responses
  • Async I/O
  • < 50ms TTFB

Security

  • Service-to-Service Token
  • Environment Isolation
  • Type Safety