Foundation Models and the JARK Stack on EKS
This technical demonstration explores deploying generative AI models on Amazon Elastic Kubernetes Service (EKS) using HashiCorp Terraform and the JARK stack (JupyterHub, Argo Workflows, Ray, and Kubernetes). The session addresses the fundamental challenges of running generative AI workloads at scale: infrastructure auto-scaling, distributed training across GPU nodes, cost optimization, and performance management. The presenter positions EKS as the optimal platform for the complete generative AI lifecycle, citing its managed Kubernetes control plane, GPU-optimized AMIs, integration with AWS deep learning containers, and support for specialized networking such as Elastic Fabric Adapter for high-performance inter-node communication. The architecture centers on foundation models, which can be adapted to many tasks with minimal data and compute compared to training from scratch; this represents a significant efficiency gain over traditional machine learning approaches, where each specific task typically requires its own model.
Hands-On Implementation: Fine-Tuning Stable Diffusion
The practical demonstration walks through fine-tuning the Stable Diffusion text-to-image model with the DreamBooth technique on EKS infrastructure provisioned entirely through Terraform. The architecture separates workloads across two managed node groups: a core node group hosting infrastructure services such as the AWS Load Balancer Controller and CSI drivers, and a GPU node group running the actual training and inference workloads. Data scientists experiment with model fine-tuning in JupyterHub notebooks running on the GPU nodes, and the resulting models are pushed to Hugging Face for versioning. The implementation uses Hugging Face's Accelerate library to simplify distributed training and the Diffusers library to work with diffusion models. For inference, the solution deploys a Ray cluster via Ray Serve custom resource definitions, pulling the fine-tuned model from Hugging Face and serving predictions at scale. The entire workflow, from infrastructure provisioning to model deployment, is managed as code through Terraform, demonstrating infrastructure-as-code principles applied to AI/ML workloads.
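The serving path just described can be sketched as a KubeRay RayService manifest assembled in plain Python. This is a minimal, stdlib-only sketch, not the session's actual templates: the field names follow KubeRay's `ray.io/v1` RayService API as best understood, and the Hugging Face repo name (`my-org/dreambooth-sd`), the serve application module (`serve_app:entrypoint`), and the `NodeGroupType` label are all hypothetical placeholders.

```python
import json

# Hypothetical Hugging Face repo holding the fine-tuned DreamBooth weights.
MODEL_ID = "my-org/dreambooth-sd"  # assumption, not from the session


def ray_service_manifest(model_id: str) -> dict:
    """Build a minimal RayService manifest (KubeRay ray.io/v1 API, field
    names assumed) that serves a fine-tuned Stable Diffusion model."""
    # serveConfigV2 is a YAML string describing the Ray Serve application.
    serve_config = "\n".join([
        "applications:",
        "  - name: stable-diffusion",
        "    import_path: serve_app:entrypoint",  # hypothetical serve module
        "    runtime_env:",
        "      pip: [diffusers, transformers, accelerate]",
        "      env_vars:",
        f"        MODEL_ID: {model_id}",  # the app pulls this repo from Hugging Face
    ])
    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayService",
        "metadata": {"name": "stable-diffusion"},
        "spec": {
            "serveConfigV2": serve_config,
            "rayClusterConfig": {
                "headGroupSpec": {"rayStartParams": {"dashboard-host": "0.0.0.0"}},
                "workerGroupSpecs": [{
                    "groupName": "gpu-workers",
                    "replicas": 1,
                    # Schedule serving replicas onto the GPU managed node group
                    # via a (hypothetical) node label.
                    "template": {"spec": {"nodeSelector": {"NodeGroupType": "gpu"}}},
                }],
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(ray_service_manifest(MODEL_ID), indent=2))
```

Keeping the serve configuration in the custom resource, rather than in an ad hoc script, is what lets the same Terraform/GitOps pipeline that provisions the cluster also manage the model-serving layer.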
Deployment Architecture and Resource Management
The reference architecture implements a production-ready generative AI platform on EKS with careful attention to compute, storage, and networking requirements. Compute scaling is handled by either Karpenter (a flexible, high-performance node autoscaler) or the Kubernetes Cluster Autoscaler, both of which can provision GPU instances in under a minute in response to workload demand. Storage relies on the EFS CSI driver for shared file systems, with FSx for Lustre available for high-performance workloads and NVMe instance store volumes configurable through EKS managed node group pre-bootstrap user data commands. Networking optimizations include EC2 placement groups for low-latency inter-node communication and the AWS Neuron Kubernetes device plugin for managing Inferentia and Trainium accelerator nodes. The demonstration uses the Data on EKS open-source project, which provides Terraform modules for deploying the complete stack: VPC, EKS cluster, managed node groups, and all necessary Kubernetes operators and controllers. The session concludes with a successful fine-tuning run that generates custom images from text prompts, validating the end-to-end workflow.
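The GPU autoscaling piece can be illustrated with a Karpenter NodePool built the same way, as a plain dict. This is a sketch under assumptions: field names follow Karpenter's `karpenter.sh/v1` NodePool API (earlier releases used a `v1alpha5` Provisioner instead), and the instance families, capacity limit, and taint are illustrative choices, not values from the session.

```python
import json


def gpu_node_pool(instance_families=("g5", "p4d")) -> dict:
    """Build a minimal Karpenter NodePool manifest (karpenter.sh/v1 API,
    field names assumed) that provisions GPU nodes on demand and taints
    them so only GPU workloads land there."""
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": "gpu"},
        "spec": {
            "template": {
                "spec": {
                    "requirements": [
                        # Restrict provisioning to GPU instance families.
                        {"key": "karpenter.k8s.aws/instance-family",
                         "operator": "In",
                         "values": list(instance_families)},
                        {"key": "karpenter.sh/capacity-type",
                         "operator": "In",
                         "values": ["on-demand"]},
                    ],
                    # Keep non-GPU pods off these expensive nodes.
                    "taints": [{"key": "nvidia.com/gpu", "effect": "NoSchedule"}],
                }
            },
            # Cap total GPU capacity to bound cost (illustrative limit).
            "limits": {"nvidia.com/gpu": 8},
        },
    }


if __name__ == "__main__":
    print(json.dumps(gpu_node_pool(), indent=2))
```

With a pool like this in place, a pending training or inference pod that requests `nvidia.com/gpu` (and tolerates the taint) is what triggers Karpenter to launch a matching GPU instance, which is how the architecture avoids paying for idle accelerators.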