What are the main challenges of running generative AI workloads on Kubernetes?

The primary challenges include scaling infrastructure to handle GPU-intensive workloads, distributing models and data across nodes for fast training, reducing training failures, scaling inference up and down based on demand while maintaining performance, and optimizing costs given the expense of GPU compute resources.

How does the JARK stack support the generative AI workflow?

JupyterHub provides collaborative notebooks for data scientist experimentation, Argo Workflows orchestrates parallel training jobs, Ray distributes training across multiple nodes and serves inference at scale, and Kubernetes provides the underlying container orchestration for deployment, scaling, and resource management across the entire AI lifecycle.

What prerequisites are needed to deploy this solution?

You need AWS CLI, kubectl, Terraform, and jq (JSON parser) installed locally. You also need a Hugging Face account with an access token that has write scope to push and pull models from the Hugging Face model hub during the training and inference workflow.
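The prerequisite list above can be checked before starting a deployment. The following is a minimal sketch, not part of the demonstrated project: the executable names are the usual ones for these tools, and the environment variable name `TF_VAR_huggingface_token` is an assumption that should be confirmed against the project's own instructions.

```python
import os
import shutil

# Executable names for the tools listed above ("aws" is the AWS CLI binary).
REQUIRED_TOOLS = ("aws", "kubectl", "terraform", "jq")

# Assumption: the Hugging Face token is exported under this name.
TOKEN_ENV_VAR = "TF_VAR_huggingface_token"


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH, in the order given."""
    return [tool for tool in tools if which(tool) is None]


def token_is_set(env=None, name=TOKEN_ENV_VAR):
    """Return True when the token variable is present and non-empty."""
    env = os.environ if env is None else env
    return bool(env.get(name, "").strip())


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    if not token_is_set():
        print(f"{TOKEN_ENV_VAR} is not set")
    if not missing and token_is_set():
        print("All prerequisites found")
```

The `which` and `env` parameters exist only so the checks can be exercised without touching the real PATH or environment.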

HashiCorp: Deploying Generative AI Models on Amazon EKS with Terraform

Name: HashiCorp: Deploying Generative AI Models on Amazon EKS with Terraform
Uploaded: 2026-03-26T17:23:06-04:00
Duration: 26 min 4 s

Transcript

and I'm a Principal Solutions Architect, and I'm excited to talk to you about the trending topic of generative AI. We will learn, through a practical implementation, how you can deploy a generative AI model on Kubernetes using Terraform. Let's get started with the first question, which is a foundational question: what exactly is generative AI? Now, generative AI is a type of artificial intelligence. It has AI in it, and like the name suggests, it is the type of AI that can create new content and ideas, including conversations, stories, images, videos, and music. Now, like all artificial intelligence, generative AI is powered by machine learning models: very large machine learning models, which we also call LLMs, that are pre-trained on vast corpora of data and are commonly referred to as foundation models, sometimes shortened to FMs. Now, recent advancements in machine learning, specifically the invention of the transformer-based neural network architecture, along with a lot of compute and a large proliferation of data, have led to the rise of models that contain billions, with a B, of parameters or variables.
Now, let's think about why foundation models are so popular. What is so special about them, and how do they compare with traditional models? The size and the general-purpose nature of foundation models make them different from traditional models. If you look at a traditional model, typically it will perform a specific task like analyzing text for sentiment, classifying images, or forecasting trends. In order to achieve these specific tasks, a customer will go ahead and gather labeled data, train a model, and deploy the model. So you can see that if you have six different tasks, you have six different models and six different sets of labeled data.
In contrast to that, with foundation models, instead of gathering labeled data for each model and training multiple models, customers can use the same pre-trained foundation model and adapt it to various tasks. Foundation models can also be customized to perform domain-specific functions that are differentiating to their business, using only a small fraction of the data and compute required to train a model from scratch. If you look at a traditional machine learning cycle, you have the data prep, etc., then feature engineering, and then you do the training and inference.
In generative AI, what are the challenges if you were to do training by yourself, using the compute that you have, or if you have a model ready and you want to do inference? So let's take a look. First is scale. You would want a provision to auto-scale the infrastructure and the control plane to handle node scale. Then once you have that, how do you split the model and the data across different nodes for fast training? Then when you're training, how can you reduce failures? Then on the inference side, the challenge is how you scale up and down based on demand. Remember, these are GPUs. It requires a lot of compute to run inference for these models. Then performance: how can you ensure high performance of these models? Then again, these are GPUs. They are very expensive. How can you optimize the cost?
For folks who are interested in running the end-to-end gen AI lifecycle on Kubernetes, what we propose from our side, from the AWS side, is leveraging EKS for doing that. But why EKS? What is EKS? So let's take a look at that. EKS is a managed Kubernetes service that makes it easier to deploy, manage, and scale containerized applications using Kubernetes on AWS. Now, one of the core strengths of EKS is its scalability.
The data plane can dynamically expand, which ensures that when the AI model demands more computational power, or more compute, it can seamlessly accommodate that. But if you look at how it supports large distributed training, I want to divide it into three buckets. The first one is compute. EKS supports two types of auto scaling. One is Karpenter, which is Karpenter with a K, a flexible, high-performance Kubernetes cluster autoscaler that helps improve application availability and cluster efficiency. Karpenter launches right-sized compute resources, for example, Amazon EC2 instances, in response to changing application load in under a minute. And the second option is the Kubernetes Cluster Autoscaler, which automatically adjusts the number of nodes in your cluster when pods fail or are rescheduled onto other nodes. The Cluster Autoscaler uses Auto Scaling groups.
Now, the EKS optimized accelerated AMI is actually built on top of the standard AMI, but it is configured to serve as an optimal image for EKS nodes that support GPUs. Of course, with gen AI, if you're doing training or fine-tuning, you need support for GPUs and for Inferentia, which is Amazon's inference chip, for Inferentia-based workloads. This AMI includes NVIDIA drivers, the NVIDIA container runtime, and the AWS Neuron container runtime. And then AWS Deep Learning Containers are a set of Docker images for training and serving models in TensorFlow on EKS. It also supports NVIDIA CUDA, which is specifically for GPU instances.
And then in terms of storage, it supports the EFS CSI driver. And if the customer wants FSx for Lustre, it supports that driver as well. And by using Amazon EKS managed node group preboot commands, you can customize provisioning of the NVMe volumes or instance store volumes to meet your unique PV demands. In terms of networking, it has a plugin for Elastic Fabric Adapter.
Elastic Fabric Adapter is a network interface for EC2 instances that enables you to run applications requiring a high level of inter-node communication at scale on AWS. And the advantage is that for ML-specific applications, it uses the NVIDIA Collective Communications Library, also known as NCCL, which can help scale to thousands of GPUs or CPUs. Now, the second point under networking is the EC2 placement group. In order to meet the needs of your workload, you can launch a group of interdependent EC2 instances in a placement group. There are three types of placement group: cluster, partition, and spread. Depending on your use case, you can pick one. For example, if you are looking for low latency, you can pick cluster. And the last point on the network is the AWS Neuron K8s device plugin. This particular plugin is kind of an extension, which helps you deploy and manage Inferentia and Trainium nodes within the Kubernetes cluster.
Now, there is a vast ecosystem of tools available to build and run models, even within the Kubernetes landscape. One emerging stack on Kubernetes is JupyterHub, Argo Workflows, Ray, and Kubernetes, also known as the JARK stack. You can run this entire stack on Amazon EKS. And if you look at our demo, which is coming up next, we can see exactly the magic of the JARK stack. So the first component of the JARK stack, the J, stands for JupyterHub. JupyterHub is a shared platform for running notebooks that are popular in business, education, and research, and very popular with data scientists or folks who are machine learning practitioners. It accelerates the experimentation process in a collaborative environment. You can work together and you can execute code. The second one is Argo Workflows. Argo Workflows is an open source, container-native workflow engine for orchestrating parallel jobs on Kubernetes. So it's primarily an orchestrator.
It provides a structured and automated pipeline tailored for fine-tuning of models. Next up is Ray. Ray is an open source distributed computing framework that makes it easy to scale applications and to use state-of-the-art machine learning libraries. Ray is used to distribute the training of generative models across multiple nodes, which accelerates the training process and allows for handling of larger data sets. And last up is Kubernetes. Kubernetes is a powerful container orchestration platform that automates deployment, scaling, and management of containerized applications. Kubernetes provides the infrastructure to run and scale generative AI models in containers, which ensures high availability, fault tolerance, and efficient resource utilization.
Now here is an architecture, and I know it looks very overwhelming, so I will break it down for you. What we're going to do in our example is take a model and fine-tune that model, and all of it will be in EKS, leveraging Terraform. We use the example of DreamBooth to demonstrate how we can adapt a large text-to-image model. We are going to use the Stable Diffusion model. What this model does is take text as an input and generate an image as an output. Now what we're going to do is fine-tune it for a specific data set. And we will be looking at the different pieces or components that are involved in that. So if you look at this particular image, let's look at the bottom of it, right? Everything which is going on over here, from training to deploying and doing inference, is in EKS. And if you look at the second layer from the bottom, you will see the core managed node group and then the GPU managed node group.
So the core managed node group will have all the infrastructure-related services: you have the AWS LBC, the Load Balancer Controller, you have all the plugins and the drivers, and you have the operator. If you look on the right side, you can see the data scientists, right? The data scientists will be doing experimentation on the Jupyter Notebook. They will be accessing the Jupyter Notebook, which is hosted on the GPU managed node group on EKS. And once they have done their experimentation and the model looks good, the model will be pushed to Hugging Face. And then we will be running the inference on the EKS cluster again, and we'll be leveraging Ray to do that. What Ray does is, once you have a fine-tuned model and you want to host inference in the EKS cluster, you can use a RayService custom resource definition to deploy a Ray cluster with a Ray Serve application that pulls the model from Hugging Face, the one that you pushed earlier via the Accelerate training script as an output of the fine-tuning experiment. So that's what is happening here.
In terms of the other pieces, we are going to use Hugging Face and two popular libraries offered by Hugging Face. One is Accelerate, which is an open source library specifically designed to simplify and optimize the process of training and fine-tuning deep learning models, and then Diffusers, which is a go-to library for state-of-the-art pre-trained diffusion models for generating images, audio, and even 3D structures of molecules. So those are the two libraries that we'll be making use of. I know this architecture looks very overwhelming, but we are going to work together to deploy this architecture, fine-tune the model, and run inference against it. So don't get overwhelmed. So let's do just that.
So the model that we are going to use today is the Stable Diffusion model, as I said, and you can read more about it on Hugging Face. Hugging Face is a central repository where you can find models, data sets, et cetera. And there's also a leaderboard, so you can keep track of which model provider or model is leading. So it has a good variety of open source models. The one that we are going to use is Stable Diffusion, and you can read more details about how the model was trained and what it does, but what it's going to do is take text as an input and provide an image as an output. Just wanted to show you that.
And one of the prereqs for our project today is to have an access token with write scope. So I just want to walk you through how you can do it. Basically, on Hugging Face, you create a profile, and then you can go to the user access tokens and simply create a new token: write the name, set the scope to write, and then generate the token, copy it, and update the spec file in the project so it can take the correct one. We will be using an export command. When we use that, you have to replace it with the token that you generate from your account. So I just wanted to show that.
The project that we are going to use today is the project called Data on EKS. If you look at the website, you can see the different modules within it, like AI/ML, data analytics, Amazon EMR, data streaming platforms, schedulers, et cetera. It's a great way of getting started. And you can additionally look at the different modules like Gen AI on EKS, the blueprints which are available, best practices, benchmarks, and other resources. So I think it's a great repository, and that's the repository that we are going to download today and use. Some of the prerequisites for our lab are to have certain tools. The first one is the AWS CLI. I do have the CLI installed. If you want to check whether you have it or not, you can run the aws command.
The other one is, of course, kubectl. So you can check if kubectl is on your machine. And of course, we need Terraform. So we can also check the Terraform version. And the last one is jq. It's a lightweight JSON parser, and that needs to be installed as well. So now that we have all our prerequisites installed on our machine, we can go ahead with our deployment.
So here I am in my command line terminal, and I am going to put in the first command, which is a git clone of the repository. In this particular repository, I have the JARK stack, and I'm going to navigate to the JARK stack. When you clone the GitHub repository for this particular project, you can see the different modules which are available. The one that we are going to use today is under AI/ML and under the JARK stack. And you can see the Terraform module in this particular folder. If you go to the install.sh that we are going to use, you can see the different steps that are going to be taken. Step number one is that it's going to take the region and run the Terraform script in that region.
And then we can go ahead and explore the source here. There are three folders for the source. We have the app, the notebook, and the service. First, let's take a look at the app. In the app, you can see the Streamlit.py file, which is actually the front end that we are going to take a look at when we have the project deployed and ready. The second folder here is the Jupyter notebook, which has the ipynb extension over here, which is the DogBooth notebook. And we will take a look at and run that notebook as well once our script is done. And then the last one is the DogBooth.py file. The DogBooth.py file is basically a FastAPI application. What it is doing is invoking the endpoint to do the inference when we run our test. So let's take a look at the install file again.
So the two targets that we have are the VPC module and the EKS Terraform module. Let's take a look at the VPC first. The VPC module is a module in which we are creating all the foundational infrastructure, including the VPC, the internet gateway, the NAT gateway, the subnets, and the route tables, to have that ready. Next up is the EKS module, in the other Terraform file, which is the EKS cluster. In the EKS cluster module, as you can see, we are going to have the EKS cluster and all the other resources that are required, like the cluster security group rules, the managed node group, which is the core managed node group, and then we also have the GPU managed node group as well. And if you go through the Terraform file here, you can find more details about it as well.
I'm going to now update the Hugging Face token and then I am going to call the install.sh script. What is going to happen in the backend is that it will initialize the backend and run the Terraform script. You can see the logs of what it's doing in the terminal, and it will take a few minutes and deploy the stack into AWS. Okay, so after a few minutes, you can see all the resources getting created, and then you can go on to do the next steps.
So here I am in the AWS console and I have navigated to EKS, and in EKS, you can see one cluster has been deployed, and then you can click on the cluster and inspect more details about the cluster. We are going to go ahead and look at the compute, and if you look at the node groups, you can see two node groups. One is the core node group and the other one is the GPU node group, as we had defined in the Terraform. Okay, so here we are in JupyterHub, and we are going to click on the DogBooth and open the Jupyter Notebook. Now, for folks who are not aware of what a Jupyter Notebook is, a Jupyter Notebook is actually a combination of code in Python.
So we have cells, something like this, which are code in Python. We can also have some statistical diagrams, architectural diagrams, and some simple markup language, Markdown, to write about what this notebook is all about. This is a simple notebook wherein we are going to use the Stable Diffusion model and fine-tune it for a specific data set. And I am going to run this Jupyter Notebook by clicking the run button over here. But optionally you can click on an individual cell and run it and try to understand what's going on in the notebook. So this notebook will take a few minutes to run and we'll come back once it's done.
So in an hour or so, our training job was completed. And when I queried a dog on the moon, here's the image that our model generated. So isn't it cool that we did everything end to end on EKS using Terraform, and we can clearly see the power of Terraform. So now that we have completed our end to end training, we can go ahead and run our training. So now we have completed our end to end project. Let us clean it up so you don't incur any charges. I'm going to call the cleanup.sh file. I'm going to put in the region name. It'll take a few minutes to clean it up. So thank you so much for taking the time to listen in to my session. Please stay in touch and reach out to me. Here's my LinkedIn profile, and I hope you enjoyed the session, and please let me know if you have any questions. Thank you so much.

TL;DR

HashiCorp demonstrates deploying generative AI models on Amazon EKS using Terraform and the JARK stack (JupyterHub, Argo, Ray, Kubernetes) to address scaling, performance, and cost challenges of GPU-based AI workloads
The architecture uses EKS managed node groups with GPU-optimized AMIs, Karpenter or Cluster Autoscaler for dynamic scaling, and specialized networking like Elastic Fabric Adapter for high-performance distributed training
The hands-on demo fine-tunes the Stable Diffusion text-to-image model using DreamBooth on EKS, with data scientists working in JupyterHub notebooks and models versioned in Hugging Face
Inference is served through Ray Serve running on the EKS cluster, pulling fine-tuned models from Hugging Face and scaling dynamically based on demand
The entire infrastructure—VPC, EKS cluster, node groups, and Kubernetes operators—is provisioned as code using Terraform modules from the open-source Data on EKS project
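The order of the deployment steps walked through in the session (clone the repository, export the token, run install.sh, and later run cleanup.sh) can be written down as a dry-run plan. This is only a sketch: the repository URL, the folder path, and the token variable name are assumptions for illustration and should be checked against the Data on EKS project before anything is executed. The function only builds command strings; it runs nothing.

```python
import shlex

# Assumptions: URL, folder layout, and variable name are illustrative only.
REPO_URL = "https://github.com/awslabs/data-on-eks.git"
STACK_DIR = "data-on-eks/ai-ml/jark-stack/terraform"
TOKEN_ENV_VAR = "TF_VAR_huggingface_token"


def deployment_plan(repo_url=REPO_URL, stack_dir=STACK_DIR):
    """Return the shell steps from the session, in order, as strings."""
    steps = [
        ["git", "clone", repo_url],
        ["cd", stack_dir],
        # The real token value is deliberately left as a placeholder.
        ["export", f"{TOKEN_ENV_VAR}=<your-write-scoped-token>"],
        ["./install.sh"],   # prompts for a region, then runs Terraform
        ["./cleanup.sh"],   # run last, to avoid ongoing charges
    ]
    return [shlex.join(step) for step in steps]


if __name__ == "__main__":
    for number, command in enumerate(deployment_plan(), start=1):
        print(f"{number}. {command}")
```

Keeping the plan as data makes it easy to review the sequence before typing the commands by hand.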

Foundation Models and the JARK Stack on EKS

This technical demonstration explores deploying generative AI models on Amazon Elastic Kubernetes Service (EKS) using HashiCorp Terraform and the JARK stack (JupyterHub, Argo Workflows, Ray, and Kubernetes). The session addresses the fundamental challenges of running generative AI workloads at scale, including infrastructure auto-scaling, distributed training across GPU nodes, cost optimization, and performance management. The presenter positions EKS as the optimal platform for the complete generative AI lifecycle due to its managed Kubernetes capabilities, support for GPU-optimized AMIs, integration with AWS deep learning containers, and compatibility with specialized networking like Elastic Fabric Adapter for high-performance inter-node communication. The architecture leverages foundation models that can be adapted for multiple tasks using minimal data and compute compared to training from scratch, representing a significant efficiency gain over traditional machine learning approaches that require separate models for each specific task.

Hands-On Implementation: Fine-Tuning Stable Diffusion

The practical demonstration walks through fine-tuning the Stable Diffusion text-to-image model using the DreamBooth technique on EKS infrastructure provisioned entirely through Terraform. The architecture separates workloads across two managed node groups: a core node group hosting infrastructure services like the AWS Load Balancer Controller and CSI drivers, and a GPU node group running the actual training and inference workloads. Data scientists access JupyterHub notebooks running on GPU nodes to experiment with model fine-tuning, with the resulting models pushed to Hugging Face for versioning. The implementation uses Hugging Face's Accelerate library to optimize distributed training and the Diffusers library for working with diffusion models. For inference, the solution deploys a Ray cluster using Ray Serve custom resource definitions to pull the fine-tuned model from Hugging Face and serve predictions at scale. The entire workflow—from infrastructure provisioning to model deployment—is managed as code through Terraform, demonstrating infrastructure-as-code principles applied to AI/ML workloads.
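The inference step described above, a RayService resource that starts a Ray cluster and a Ray Serve application, can be sketched as a manifest built in Python and printed as JSON, which kubectl accepts alongside YAML. This is a hedged illustration rather than the project's actual manifest: the API version depends on the installed KubeRay release, and the container image, the `MODEL_ID` variable, the import path, and the model name are all hypothetical placeholders.

```python
import json

# Assumption: placeholder image; a real deployment needs a GPU-enabled Ray image.
RAY_IMAGE = "example.invalid/ray-gpu:placeholder"


def ray_container(name, gpus=0):
    """Return a container spec, requesting NVIDIA GPUs when gpus > 0."""
    container = {"name": name, "image": RAY_IMAGE}
    if gpus:
        container["resources"] = {"limits": {"nvidia.com/gpu": str(gpus)}}
    return container


def ray_service_manifest(name, model_id, import_path, workers=1):
    """Build a RayService manifest as a plain dict."""
    # serveConfigV2 is a YAML string in the CRD; JSON is valid YAML.
    serve_config = {
        "applications": [{
            "name": name,
            "import_path": import_path,
            "route_prefix": "/",
            # Hypothetical variable the serving code would read.
            "runtime_env": {"env_vars": {"MODEL_ID": model_id}},
        }]
    }
    return {
        "apiVersion": "ray.io/v1",  # older KubeRay releases use ray.io/v1alpha1
        "kind": "RayService",
        "metadata": {"name": name},
        "spec": {
            "serveConfigV2": json.dumps(serve_config),
            "rayClusterConfig": {
                "headGroupSpec": {
                    "rayStartParams": {},
                    "template": {"spec": {"containers": [ray_container("ray-head")]}},
                },
                "workerGroupSpecs": [{
                    "groupName": "gpu-workers",
                    "replicas": workers,
                    "minReplicas": workers,
                    "maxReplicas": workers,
                    "rayStartParams": {},
                    "template": {
                        "spec": {"containers": [ray_container("ray-worker", gpus=1)]}
                    },
                }],
            },
        },
    }


if __name__ == "__main__":
    manifest = ray_service_manifest(
        "dogbooth", "example-user/example-model", "dogbooth:entrypoint"
    )
    print(json.dumps(manifest, indent=2))
```

Only the worker group requests a GPU here, so that the head pod can be scheduled on a non-GPU node; that split is a design choice of this sketch, not something stated in the session.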

Deployment Architecture and Resource Management

The reference architecture implements a generative AI platform on EKS with careful attention to compute, storage, and networking requirements. Compute scaling is handled through either Karpenter, a flexible, high-performance cluster autoscaler that launches right-sized compute resources in under a minute in response to changing load, or the Kubernetes Cluster Autoscaler, which adjusts the number of nodes through Auto Scaling groups. Storage leverages the EFS CSI driver for shared file systems and supports FSx for Lustre for high-performance workloads, with customizable NVMe instance store volume provisioning through EKS managed node group preboot commands. Networking options include Elastic Fabric Adapter for high levels of inter-node communication, EC2 placement groups (cluster, partition, or spread, with cluster suited to low latency), and the AWS Neuron K8s device plugin for managing Inferentia and Trainium accelerator nodes. The demonstration uses the Data on EKS open-source project, which provides Terraform modules for deploying the complete stack including VPC, EKS cluster, managed node groups, and all necessary Kubernetes operators and controllers. The session concludes with a successful fine-tuning run that generates custom images from text prompts, validating the end-to-end workflow.
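To make the idea of right-sizing GPU capacity concrete, the arithmetic behind "how many GPU nodes does a set of pending pods need" can be sketched with first-fit-decreasing bin packing. This is a simplified illustration under stated assumptions: it is not the algorithm Karpenter or the Cluster Autoscaler actually uses, it considers GPUs only (not CPU, memory, or topology), and it assumes every node has the same GPU count.

```python
def nodes_needed(pod_gpu_requests, gpus_per_node):
    """Estimate node count with first-fit-decreasing packing of GPU requests.

    A pod cannot span nodes, so each request must fit on a single node.
    """
    if gpus_per_node <= 0:
        raise ValueError("gpus_per_node must be positive")

    free = []  # remaining GPUs on each node opened so far
    for request in sorted(pod_gpu_requests, reverse=True):
        if request > gpus_per_node:
            raise ValueError(f"request for {request} GPUs exceeds node size")
        for index, remaining in enumerate(free):
            if remaining >= request:
                free[index] -= request  # place on the first node with room
                break
        else:
            free.append(gpus_per_node - request)  # open a new node
    return len(free)


if __name__ == "__main__":
    pending = [1, 1, 2, 4, 3]  # GPUs requested by each pending pod
    print(nodes_needed(pending, gpus_per_node=4))
```

Because pods cannot be split, the result can exceed the simple ceiling of total GPUs divided by GPUs per node; three pods requesting 3 GPUs each need three 4-GPU nodes, not the three-node minimum that raw division of 9 by 4 would round to by coincidence, and four such pods would need four nodes rather than three.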

Chapters

0:00 - Introduction and Generative AI Overview
1:42 - Foundation Models vs Traditional ML
3:04 - Challenges of Running Gen AI on Kubernetes
4:47 - Why Amazon EKS for Generative AI
9:30 - The JARK Stack Architecture
11:30 - Solution Architecture Deep Dive
15:18 - Stable Diffusion Model and Hugging Face
17:29 - Prerequisites and Setup
18:20 - Deploying Infrastructure with Terraform
23:36 - Running the JupyterHub Notebook
25:02 - Results and Cleanup

Key Quotes

1:42 "Foundation models can also be customized to perform domain-specific functions that are differentiating to their business, using only a small fraction of data and compute required to train a model from scratch."
5:29 "Karpenter is a flexible, high-performance cluster autoscaler that helps improve application availability and cluster efficiency. Karpenter launches right-size compute resources, for example, Amazon EC2 instances, in response to changing application load in under a minute."
9:30 "One emerging stack on Kubernetes is JupyterHub, Argo workflows, Ray, and Kubernetes, also known as the JARK stack. You can run this entire stack on Amazon EKS."
11:52 "Ray is used to distribute the training of generative models across multiple nodes, which accelerates the training process and allows for handling of larger data sets."
25:05 "We did everything end to end on EKS using Terraform and we can clearly see the power of Terraform."
