Transcript
and I'm a Principal Solutions Architect, and I'm excited to talk to you about the trending topic of generative AI. We will learn, through a practical implementation, how you can deploy a generative AI model on Kubernetes using Terraform.
Let's get started with the first question, which is a foundational one: what exactly is generative AI? Generative AI is a type of artificial intelligence and, as the name suggests, it is the type of AI that can create new content and ideas, including conversations, stories, images, videos, and music. Like all artificial intelligence, generative AI is powered by machine learning models: very large models, which for text we also call large language models or LLMs, that are pre-trained on vast corpora of data and commonly referred to as foundation models, sometimes shortened to FMs. Recent advancements in machine learning, specifically the invention of the transformer-based neural network architecture, together with a lot of compute and the proliferation of data, have led to the rise of models that contain billions, with a B, of parameters or variables.
Now, let's think about why foundation models are so popular. What is so special about them, and how do they compare with traditional models? The size and the general-purpose nature of foundation models make them different from traditional models. A traditional model typically performs one specific task, like analyzing text for sentiment, classifying images, or forecasting trends. To achieve each of these tasks, a customer gathers labeled data, trains a model, and deploys that model. So if you have six different tasks, you have six different models and six different labeled data sets.
In contrast, with a foundation model, instead of gathering labeled data for each model and training multiple models, customers can use the same pre-trained foundation model and adapt it to various tasks. Foundation models can also be customized to perform domain-specific functions that are differentiating to their business, using only a small fraction of the data and compute required to train a model from scratch.
If you look at a traditional machine learning lifecycle, you have data preparation, then feature engineering, and then training and inference. In generative AI, what are the challenges if you were to do training yourself using the compute that you have, or if you have a model ready and you want to do inference? Let's take a look. The first is scale. You would want a provision to auto-scale the infrastructure, and a control plane that can handle node scale. Once you have that, how do you split the model and the data across different nodes for fast training? And when you're training, how can you reduce failures? On the inference side, the challenge is how you scale up and down based on demand. Remember, these are GPUs, and it takes a lot of compute to run inference for these generative AI models. Then there is performance: how can you ensure high performance of these models? And again, these are GPUs. They are very expensive, so how can you optimize the cost?
For folks who are interested in running the end-to-end gen AI lifecycle on Kubernetes, what we propose from the AWS side is leveraging EKS. But why EKS? What is EKS? Let's take a look at that. Amazon EKS is a managed Kubernetes service that makes it easier to deploy, manage, and scale containerized applications using Kubernetes on AWS. One of the core strengths of EKS is its scalability.
The data plane can dynamically expand, so when an AI model demands more computational power, the cluster can seamlessly accommodate that. To look at how EKS supports large distributed training, I want to divide it into three buckets: compute, storage, and networking.
The first one is compute. EKS supports two types of auto scaling. One is Karpenter, that's Karpenter with a K, which is a flexible, high-performance Kubernetes cluster autoscaler that helps improve application availability and cluster efficiency. Karpenter launches right-sized compute resources, for example Amazon EC2 instances, in response to changing application load in under a minute. The second option is the Kubernetes Cluster Autoscaler, which automatically adjusts the number of nodes in your cluster when pods fail or are rescheduled onto other nodes. The Cluster Autoscaler uses Auto Scaling groups.
Now, the EKS optimized accelerated AMI is built on top of the standard EKS optimized AMI, but it is configured to serve as an optimal image for EKS nodes that need to support GPUs. Of course, with gen AI, if you're doing training or fine-tuning, you need GPU support, and the AMI also supports workloads based on Inferentia, which is AWS's machine learning inference chip. This AMI includes the NVIDIA drivers, the NVIDIA container runtime, and the AWS Neuron container runtime. Then there are AWS Deep Learning Containers, which are a set of Docker images for training and serving models in frameworks such as TensorFlow on EKS. They also support NVIDIA CUDA, which is specifically for GPU instances.
In terms of storage, EKS supports the Amazon EFS CSI driver, and if the customer wants FSx for Lustre, it supports that driver as well. And by using Amazon EKS managed node group pre-bootstrap commands, you can customize provisioning of NVMe instance store volumes to meet your unique persistent volume demands.
In terms of networking, it has a plugin for Elastic Fabric Adapter.
Elastic Fabric Adapter, or EFA, is a network interface for EC2 instances that enables you to run applications requiring high levels of inter-node communication at scale on AWS. The advantage for ML-specific applications is that EFA works with the NVIDIA Collective Communications Library, also known as NCCL, which can help scale to thousands of GPUs or CPUs.
The second point under networking is the EC2 placement group. To meet the needs of your workload, you can launch a group of interdependent EC2 instances in a placement group. There are three types of placement group: cluster, partition, and spread. Depending on your use case, you can pick one. For example, if you are looking for low latency, you can pick cluster.
The last point on networking is the AWS Neuron Kubernetes device plugin. This plugin is an extension that helps you deploy and manage Inferentia and Trainium nodes within the Kubernetes cluster.
Now, there is a vast ecosystem of tools available to build and run models, even within the Kubernetes landscape. One emerging stack on Kubernetes is JupyterHub, Argo Workflows, Ray, and Kubernetes, also known as the JARK stack. You can run this entire stack on Amazon EKS, and in our demo, which is coming up next, we will see exactly the magic of the JARK stack.
The first component of the JARK stack, the J, stands for JupyterHub. JupyterHub is a shared platform for running notebooks that are popular in business, education, and research, and very popular with data scientists and machine learning practitioners. It accelerates experimentation by providing a collaborative environment where you can work together and execute code.
The second one is Argo Workflows. Argo Workflows is an open source, container-native workflow engine for orchestrating parallel jobs on Kubernetes. So it's primarily an orchestrator.
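To make "orchestrating parallel jobs" a bit more concrete, here is a minimal Python sketch. It is not the Argo Workflows API, and the step names are hypothetical; it only uses the standard library's graphlib module to show the idea an orchestrator works from: given the dependencies between steps, work out which steps can run at the same time and which have to wait.

```python
from graphlib import TopologicalSorter

# Hypothetical fine-tuning pipeline: each step maps to the steps it depends on.
PIPELINE = {
    "prepare-data": set(),
    "download-base-model": set(),
    "fine-tune": {"prepare-data", "download-base-model"},
    "evaluate": {"fine-tune"},
    "publish-model": {"evaluate"},
}


def execution_waves(dependencies):
    """Group steps into waves; steps in the same wave can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave finished to unlock the next
    return waves


for number, wave in enumerate(execution_waves(PIPELINE), start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

With these hypothetical steps, data preparation and the base model download land in the same wave, so an orchestrator such as Argo Workflows could run them in parallel before fine-tuning starts.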
Argo Workflows provides a structured and automated pipeline tailored for fine-tuning models.
Next up is Ray. Ray is an open source distributed computing framework that makes it easy to scale applications and to use state-of-the-art machine learning libraries. Ray is used to distribute the training of generative models across multiple nodes, which accelerates the training process and allows for handling larger data sets.
And last up is Kubernetes. Kubernetes is a powerful container orchestration platform that automates the deployment, scaling, and management of containerized applications. It provides the infrastructure to run and scale generative AI models in containers, which ensures high availability, fault tolerance, and efficient resource utilization.
Now, here is an architecture, and I know it looks very overwhelming, so I will break it down for you. In our example, we are going to take a model and fine-tune it, and all of it will be on EKS, provisioned with Terraform. We use the example of DreamBooth to demonstrate how we can adapt a large text-to-image model. We are going to use the Stable Diffusion model, which takes text as an input and generates an image as an output, and we are going to fine-tune it for a specific data set. We will look at the different components involved in that.
If you look at this image, let's start at the bottom. Everything that is going on here, from training to deploying and doing inference, is in EKS. In the second layer from the bottom, you will see the core managed node group and the GPU managed node group.
The core managed node group will have all the infrastructure-related services: the AWS Load Balancer Controller, all the plugins and drivers, and the operator. If you look on the right side, you can see the data scientists. They will be doing their experimentation in a Jupyter notebook, which is hosted on the GPU managed node group on EKS. Once they have done their experimentation and the model looks good, the fine-tuned model will be pushed to Hugging Face. Then we will run inference on the EKS cluster again, and we'll be leveraging Ray to do that. Once you have a fine-tuned model and you want to host inference in the EKS cluster, you can use the RayService custom resource definition to deploy a Ray cluster with a Ray Serve application. That application pulls the model from Hugging Face, the one you pushed earlier via the Accelerate training script as the output of the fine-tuning experiment. So that's what is happening here.
In terms of the other pieces, we are going to use Hugging Face and two popular libraries offered by Hugging Face. One is Accelerate, an open source library specifically designed to simplify and optimize the process of training and fine-tuning deep learning models. The other is Diffusers, the go-to library for state-of-the-art pre-trained diffusion models for generating images, audio, and even 3D structures of molecules. Those are the two libraries that we'll be making use of.
I know this architecture looks very overwhelming, but we are going to work together to deploy it, fine-tune the model, and run inference against it. So don't get overwhelmed. Let's do just that.
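As a rough illustration of the RayService piece described above, here is a sketch in Python that builds such a manifest as a plain dictionary and prints it as JSON, a format kubectl also accepts. The top-level field names are meant to follow the KubeRay RayService resource; the API version, the resource name, the container image, and the Ray Serve import path are all assumptions for illustration and are not taken from the project.

```python
import json


def pod_template(container_name, image, resources=None):
    """Minimal pod template with a single container."""
    container = {"name": container_name, "image": image}
    if resources:
        container["resources"] = resources
    return {"spec": {"containers": [container]}}


def build_ray_service(name, image, import_path, gpu_workers=1):
    """Return a RayService manifest as a dict (all arguments are illustrative)."""
    # Ray Serve application config, embedded in the manifest as a YAML string.
    serve_config = (
        "applications:\n"
        f"  - name: {name}\n"
        f"    import_path: {import_path}\n"
        "    route_prefix: /\n"
    )
    return {
        "apiVersion": "ray.io/v1",  # assumption: depends on the KubeRay version
        "kind": "RayService",
        "metadata": {"name": name},
        "spec": {
            "serveConfigV2": serve_config,
            "rayClusterConfig": {
                "headGroupSpec": {
                    "rayStartParams": {"dashboard-host": "0.0.0.0"},
                    "template": pod_template("ray-head", image),
                },
                "workerGroupSpecs": [
                    {
                        "groupName": "gpu",
                        "replicas": gpu_workers,
                        "minReplicas": gpu_workers,
                        "maxReplicas": gpu_workers,
                        "rayStartParams": {},
                        # One GPU per worker so pods land on the GPU node group.
                        "template": pod_template(
                            "ray-worker", image, {"limits": {"nvidia.com/gpu": 1}}
                        ),
                    }
                ],
            },
        },
    }


# Hypothetical values: replace with your own image and Serve entry point.
manifest = build_ray_service(
    name="dogbooth",
    image="example.com/dogbooth-serve:latest",
    import_path="dogbooth:entrypoint",
)
print(json.dumps(manifest, indent=2))
```

The head pod carries no GPU request here, so it can sit on the core node group, while the worker group asks for one GPU per pod; that split mirrors the two node groups in the architecture.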
The model that we are going to use today is the Stable Diffusion model, as I said, and you can read more about it on Hugging Face. Hugging Face is a central repository where you can find models, data sets, et cetera. There's also a leaderboard, so you can keep track of which provider or model is leading. It has a good variety of open source models. The one we are going to use is Stable Diffusion, and you can read more details about how the model was trained and what it does, but in short it takes text as an input and provides an image as an output. I just wanted to show you that.
One of the prerequisites for our project today is an access token with write scope, so I want to walk you through how to create one. On Hugging Face, you create a profile, go to the user access tokens page, and create a new token: enter a name, set the scope to write, generate the token, and copy it. Then update the spec file in the project so it picks up the correct token. We will be using an export command, and when we use that, you have to replace the value with the token you generated from your account. I just wanted to show that.
The project we are going to use today is called Data on EKS. On the website, you can see the different modules within it, like AI/ML, data analytics, Amazon EMR, streaming data platforms, schedulers, et cetera. It's a great way of getting started. You can additionally look at the gen AI on EKS section, the available blueprints, best practices, benchmarks, and other resources. I think it's a great repository, and it's the one we are going to download and use today.
Our lab has a few tool prerequisites. The first one is the AWS CLI. I do have the CLI installed. If you want to check whether you have it, you can check from the terminal, for example with aws --version.
The other one is, of course, kubectl, so you can check that kubectl is on your machine. And of course, we need Terraform, so we can also check the Terraform version. The last one is jq, a lightweight JSON parser, and that needs to be installed as well. Now that we have all our prerequisites installed on our machine, we can go ahead with our deployment.
Here I am in my command line terminal, and I am going to put in the first command, which is a git clone of the repository. This repository contains the JARK stack, and I'm going to navigate to it. When you clone the GitHub repository for this project, you can see the different modules that are available. The one we are going to use today is under AI/ML, under the JARK stack, and you can see the Terraform module in this folder. If you open the install.sh that we are going to use, you can see the different steps it takes. Step number one is that it takes the region and applies the Terraform configuration in that region.
Then we can go ahead and explore the source. There are three folders for the source: the app, the notebook, and the service. First, let's take a look at the app. In the app, you can see the Streamlit Python file, which is the front end that we will look at once the project is deployed and ready. The second folder holds the Jupyter notebook, the file with the .ipynb extension, which is the dogbooth notebook. We will run that notebook as well once our script is done. The last one is the dogbooth.py file. It is basically a FastAPI application, and it sits behind the endpoint that we invoke to do inference when we run our test. So let's take a look at the install file again.
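The kind of staged apply that an install script like this performs can be sketched as follows. This is a Python sketch, not the contents of install.sh: the module addresses module.vpc and module.eks and the variable name region are assumptions, and the function only builds the command lines rather than running them.

```python
def install_commands(region, targets=("module.vpc", "module.eks")):
    """Build the Terraform command lines for a staged deployment.

    The targets and the `region` variable name are illustrative assumptions.
    """
    var_flag = f"-var=region={region}"
    commands = [["terraform", "init", "-upgrade"]]
    # Apply the foundational modules one at a time, in dependency order.
    for target in targets:
        commands.append(
            ["terraform", "apply", f"-target={target}", var_flag, "-auto-approve"]
        )
    # A final untargeted apply picks up everything that remains.
    commands.append(["terraform", "apply", var_flag, "-auto-approve"])
    return commands


for command in install_commands("us-west-2"):
    print(" ".join(command))
```

Applying the network first and the cluster second is a common way to respect dependencies between modules. To actually execute the steps, each list could be passed to subprocess.run with check=True so that a failed step stops the sequence.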
The install script has two targets: the VPC Terraform module and the EKS Terraform module. Let's take a look at the VPC first. The VPC module is where we create all the foundational infrastructure, including the VPC, the internet gateway, the NAT gateway, the subnets, and the route tables. Next up is the EKS module, in the other Terraform file, which defines the EKS cluster. In the EKS cluster module, as you can see, we create the EKS cluster and all the other resources it requires, like the cluster security group rules, the core managed node group, and the GPU managed node group. If you go through the Terraform file here, you can find more details about it as well.
I'm now going to update the Hugging Face token, and then I am going to call the install.sh script. What happens behind the scenes is that it initializes the Terraform backend and runs the Terraform configuration. You can see the logs of what it's doing in the terminal. It will take a few minutes, and it will deploy the stack into AWS.
Okay, so after a few minutes you can see all the resources have been created, and you can go on to the next steps. Here I am in the AWS console, and I have navigated to EKS. You can see one cluster has been deployed, and you can click on the cluster to inspect more details about it. We are going to look at the compute section, and under node groups you can see two node groups: one is the core node group and the other is the GPU node group, as we defined in the Terraform.
Okay, so here we are in JupyterHub, and we are going to click on dogbooth and open the Jupyter notebook. For folks who are not aware of what a Jupyter notebook is, a Jupyter notebook is essentially a combination of Python code and supporting content.
We have cells like this one, which contain code in Python. We can also have statistical diagrams, architectural diagrams, and text written in a simple markup language, Markdown, to describe what the notebook is all about. This is a simple notebook in which we use the Stable Diffusion model and fine-tune it for a specific data set. I am going to run this Jupyter notebook by clicking the run button over here, but optionally you can run the cells individually and try to understand what's going on in the notebook. This notebook will take some time to run, and we'll come back once it's done.
So in an hour or so, our training job was completed. And when I queried for a dog on the moon, here's the image that our model generated. Isn't it cool that we did everything end to end on EKS using Terraform? We can clearly see the power of Terraform.
So now we have completed our end-to-end project, from training through inference. Let us clean it up so you don't incur any further charges. I'm going to call the cleanup.sh file and put in the region name. It will take a few minutes to clean up.
Thank you so much for taking the time to listen to my session. Please stay in touch and reach out to me. Here's my LinkedIn profile. I hope you enjoyed the session, and please let me know if you have any questions. Thank you so much.