
Deploy a Model with TensorFlow Serving on Docker and Kubernetes

In this blog post we will walk through how to package a TensorFlow model, serve it locally with Docker, and scale it on Kubernetes. The post is aimed at technical teams who want a reliable, fast, and maintainable way to serve models in production.

At a high level, TensorFlow Serving is a purpose-built, high-performance inference server. It loads models in TensorFlow’s SavedModel format, exposes standard REST and gRPC endpoints, and supports model versioning and batching out of the box. Compared to DIY Flask or FastAPI wrappers, it’s faster to stand up, easier to operate, and designed for zero-downtime upgrades.

What is TensorFlow Serving

TensorFlow Serving (TF Serving) is a C++ server that:

  • Reads TensorFlow SavedModel directories (versioned as 1, 2, 3…)
  • Serves predictions over HTTP/REST (default port 8501) and gRPC (default port 8500)
  • Hot-reloads new model versions and supports canarying/rollback
  • Optionally batches requests for higher throughput

Because it’s optimized in C++ and tightly integrated with TensorFlow runtimes (CPU and GPU), you get strong performance without writing server code. Your team focuses on model training and packaging; TF Serving handles the serving.

Prerequisites

  • Docker installed locally
  • Python 3.9+ and TensorFlow for exporting a model
  • curl for quick REST testing

Step 1: Export a SavedModel

We’ll create a simple Keras model and export it in the SavedModel format, versioned under models/my_model/1. TF Serving looks for numeric subfolders representing versions.
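A minimal sketch of the export, assuming a toy Keras model with four numeric input features (the architecture and training step are illustrative only):

```bash
python3 - <<'PY'
import tensorflow as tf

# Toy model: 4 numeric features -> 1 output (illustrative only)
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])
# ...train with model.fit(x, y) here...

# Write a SavedModel under a numeric version folder that TF Serving can discover.
# On recent TF/Keras 3 use model.export(); older TF 2.x can use
# tf.saved_model.save(model, "models/my_model/1") to the same effect.
model.export("models/my_model/1")
PY
```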

This export includes a default signature (serving_default) that TF Serving will use for inference.

Step 2: Serve locally with Docker

Run the official TF Serving container, mounting your model directory and exposing REST and gRPC ports:
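A sketch of the command, assuming the export path from Step 1 (adjust the volume path to your layout):

```bash
# Official CPU image; mounts the export from Step 1 and serves it as "my_model"
docker run --rm -p 8501:8501 -p 8500:8500 \
  -v "$(pwd)/models/my_model:/models/my_model" \
  -e MODEL_NAME=my_model \
  tensorflow/serving
```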

What this does:

  • Binds REST on localhost:8501 and gRPC on localhost:8500
  • Loads the highest numeric version under /models/my_model
  • Exposes the model under the name my_model

Step 3: Send a prediction

Use REST for a quick test:
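For the toy four-feature model exported in Step 1, a request might look like this (the input values are arbitrary):

```bash
# Predict with the toy 4-feature model from Step 1
curl -s -X POST http://localhost:8501/v1/models/my_model:predict \
  -H "Content-Type: application/json" \
  -d '{"instances": [[1.0, 2.0, 3.0, 4.0]]}'
# Response is JSON of the form {"predictions": [[...]]}
```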

You’ll get back a JSON with predictions. In production, you can switch to gRPC for lower latency and better throughput, but REST is perfect for quick testing and many web services.

Step 4: Upgrade and roll back with versions

To deploy a new model version without downtime:

  • Export your updated model to models/my_model/2
  • Place it alongside version 1 on the same path
  • TF Serving will detect the new version and start serving it once loaded

Roll back by removing or disabling version 2; the server will return to serving the latest available version. You can tune how quickly it polls the filesystem with --file_system_poll_wait_seconds if needed.
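A sketch of the resulting layout and the poll flag, assuming the Docker setup from Step 2 (extra flags after the image name are passed through to the model server; the value is illustrative):

```bash
# After exporting the updated model to models/my_model/2, the base path holds both:
#   models/my_model/1/  (saved_model.pb, variables/)
#   models/my_model/2/  (saved_model.pb, variables/)
docker run --rm -p 8501:8501 -p 8500:8500 \
  -v "$(pwd)/models/my_model:/models/my_model" \
  -e MODEL_NAME=my_model \
  tensorflow/serving \
  --file_system_poll_wait_seconds=5
```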

Step 5: Serve multiple models

For multi-model setups, point TF Serving at a model config file:
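A sketch using the standard model_config_list text format, assuming two SavedModels under models/ (other_model is a placeholder for your second model):

```bash
# models.config in protobuf text format (names and paths illustrative)
cat > models/models.config <<'EOF'
model_config_list {
  config {
    name: "my_model"
    base_path: "/models/my_model"
    model_platform: "tensorflow"
  }
  config {
    name: "other_model"
    base_path: "/models/other_model"
    model_platform: "tensorflow"
  }
}
EOF

# Mount the whole models/ directory and point the server at the config file
docker run --rm -p 8501:8501 -p 8500:8500 \
  -v "$(pwd)/models:/models" \
  tensorflow/serving \
  --model_config_file=/models/models.config
```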

Step 6: Move to Kubernetes

On Kubernetes, mount your model directory from a PersistentVolume and expose a Service. A minimal example:
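A sketch of a Deployment and Service, assuming a PersistentVolumeClaim named model-store that already contains my_model/<version> (use ReadWriteMany-capable storage if you run more than one replica; names and resource figures are illustrative):

```bash
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: tf-serving
  template:
    metadata:
      labels:
        app: tf-serving
    spec:
      containers:
        - name: tf-serving
          image: tensorflow/serving
          env:
            - name: MODEL_NAME
              value: my_model
          ports:
            - containerPort: 8500   # gRPC
            - containerPort: 8501   # REST
          volumeMounts:
            - name: model-store
              mountPath: /models/my_model
              subPath: my_model          # assumes the PVC holds a my_model/ directory
          resources:
            requests: { cpu: "500m", memory: "1Gi" }
            limits:   { cpu: "2",    memory: "2Gi" }
      volumes:
        - name: model-store
          persistentVolumeClaim:
            claimName: model-store
---
apiVersion: v1
kind: Service
metadata:
  name: tf-serving
spec:
  selector:
    app: tf-serving
  ports:
    - { name: grpc, port: 8500, targetPort: 8500 }
    - { name: http, port: 8501, targetPort: 8501 }
EOF
```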

Add an Ingress or API gateway with TLS, and consider autoscaling:
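For example, with the Kubernetes metrics server installed, a CPU-based HorizontalPodAutoscaler can target the Deployment sketched above (thresholds and replica counts are illustrative):

```bash
kubectl autoscale deployment tf-serving --cpu-percent=70 --min=2 --max=10
```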

Performance and reliability tips

  • Batching: Enable batching to increase throughput under load (see the sketch after this list).
  • CPU vs GPU: For heavy models or large batches, use tensorflow/serving:latest-gpu with NVIDIA Container Toolkit.
  • Model size and cold starts: Keep models lean, and pre-warm by sending a small request after rollout.
  • Versioning strategy: Always deploy to a new numeric folder (e.g., /2), test, then cut traffic. Keep N-1 for quick rollback.
  • Input validation: Enforce shapes and dtypes at your API edge to avoid malformed requests reaching TF Serving.
  • Observability: Log request IDs at the caller, track latency and error rates, and capture model version in every metric/event.
  • Security: Put TF Serving behind an Ingress or API gateway with TLS and authentication. Restrict direct access to ports 8500/8501.
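A sketch of enabling batching on the Docker setup from Step 2; the parameter values are illustrative and should be tuned against your latency budget:

```bash
# Batching parameters in protobuf text format
cat > models/batching.config <<'EOF'
max_batch_size { value: 32 }
batch_timeout_micros { value: 2000 }
num_batch_threads { value: 4 }
max_enqueued_batches { value: 100 }
EOF

docker run --rm -p 8501:8501 -p 8500:8500 \
  -v "$(pwd)/models/my_model:/models/my_model" \
  -v "$(pwd)/models/batching.config:/models/batching.config" \
  -e MODEL_NAME=my_model \
  tensorflow/serving \
  --enable_batching=true \
  --batching_parameters_file=/models/batching.config
```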

Common pitfalls

  • Signature mismatches: Ensure your client payload matches the SavedModel signature (serving_default). If in doubt, inspect with saved_model_cli show --dir <path> --all (see the example after this list).
  • Wrong JSON shape: REST instances must match the model’s expected shape. For a single vector input, wrap it as a list of lists.
  • Mount paths: The container must see versioned subfolders under the base path (/models/my_model/1, /2, …).
  • Resource limits: Without CPU/memory limits in Kubernetes, noisy neighbors can cause latency spikes. Set requests/limits and autoscaling.
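For the model exported in Step 1, inspecting the serving signature looks like this (saved_model_cli ships with TensorFlow):

```bash
# Print the input/output names, dtypes, and shapes of the serving signature
saved_model_cli show --dir models/my_model/1 \
  --tag_set serve --signature_def serving_default
```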

Why this approach works

TF Serving abstracts the serving layer with an optimized, battle-tested server. Docker makes it reproducible on a laptop, CI, or any cloud VM. Kubernetes adds elasticity, resilience, and a paved path to GitOps and blue/green rollouts. Together, they remove bespoke server code and let your team focus on model quality and business impact.

Wrap-up

You now have a clean path from a trained TensorFlow model to a production-ready, scalable serving stack. Start with Docker for fast iteration, then move to Kubernetes when you need high availability and autoscaling. If you want help adapting this for your environment—object storage model syncing, canarying, observability, security—CloudProinc.com.au can assist with reference architectures and hands-on implementation.

