How to improve cloud-based generative AI performance

The problems can be hard to find but easy to solve. With a proactive approach and best practices, you can avoid unhappy users and a damaged business reputation.

It’s Monday. You come into the office only to be met with a dozen emails from your system development teammates requesting to speak with you right away. It seems that the generative AI-enabled inventory management system you launched a week ago is frustrating its new users. It’s taking minutes, not seconds, to respond. Shipments are now running late. Customers are hanging up on your service reps because it takes too long to answer their questions. Website sales are down 20% due to performance lags. Whoops. You have a performance problem.

But you did everything right. You’re using GPUs for both training and inference processing; you did all the recommended performance testing; you over-provisioned memory; and you’re using only the fastest storage with the best I/O performance. Indeed, your cloud bill tops $100K a month. How can performance be failing?

I’m hearing this story more often as the early adopters of generative AI systems on the cloud get around to deploying their first or second system. It’s an exciting time as cloud providers promote their generative AI capabilities, and you’ve basically copied the architecture configurations you saw at the last major cloud-branded conference. You’re a follower, and you’ve followed what you believe are proven architectures and best practices.

Emerging performance problems

The core issues of poorly performing models are difficult to diagnose, but the solution is usually easy to implement. Performance issues normally come from a single component that limits the overall AI system performance: a slow API gateway, a bad network component, or even a bad set of libraries used for the last build. It’s simple to correct, but much harder to find.

Let’s address the fundamentals.

High latency in generative AI systems can impact real-time applications, such as natural language processing or image generation. Suboptimal network connectivity or inefficient resource allocation can contribute to latency. My experience says start there.
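A quick way to tell whether the network or the model is the bottleneck is to compare the round-trip time of a cheap health-check call against a full inference call. Here is a minimal sketch; the endpoint URLs and payload are hypothetical placeholders, and your stack will expose different routes.

```python
import time
import statistics
import requests  # third-party; pip install requests

BASE = "https://inference.example.com"  # hypothetical endpoint

def time_call(url, payload=None, n=20):
    """Return end-to-end latencies in seconds for n calls."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        if payload is None:
            requests.get(url, timeout=30)  # cheap health check
        else:
            requests.post(url, json=payload, timeout=120)  # full inference
        samples.append(time.perf_counter() - start)
    return samples

ping = time_call(f"{BASE}/healthz")                        # network + gateway only
infer = time_call(f"{BASE}/generate", {"prompt": "test"})  # network + model

# If median inference time dwarfs the health-check time, the model (or its
# resources) is the bottleneck; if both are slow, look at the network path.
print(f"health p50: {statistics.median(ping):.3f}s")
print(f"infer  p50: {statistics.median(infer):.3f}s")
```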

Generative AI models can be resource-intensive. Optimizing resources on the public cloud is essential to ensure efficient performance while minimizing costs. This involves auto-scaling capabilities and choosing the right instance types to match the workload requirements. As you review what you provisioned, see whether those resources are reaching saturation or otherwise showing symptoms of performance trouble. Monitoring is a best practice that many organizations overlook. Your AI system management planning should include an observability strategy; with those tools in place, worsening performance should be relatively easy to diagnose.
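If you don’t yet have full observability tooling, even a basic saturation probe on a single instance can surface trouble. A minimal sketch using the psutil library; the 85% thresholds are arbitrary placeholders, and production monitoring belongs in your cloud provider’s observability stack.

```python
import psutil  # third-party; pip install psutil

# Arbitrary example thresholds; tune to your workload's baseline.
CPU_LIMIT_PCT = 85.0
MEM_LIMIT_PCT = 85.0

def saturation_report():
    """Flag CPU or memory saturation on this instance.
    GPU utilization needs vendor tooling (e.g., nvidia-smi) and is omitted."""
    cpu = psutil.cpu_percent(interval=1.0)  # sampled over 1 second
    mem = psutil.virtual_memory().percent
    alerts = []
    if cpu > CPU_LIMIT_PCT:
        alerts.append(f"CPU saturated: {cpu:.0f}%")
    if mem > MEM_LIMIT_PCT:
        alerts.append(f"Memory saturated: {mem:.0f}%")
    return alerts or ["No saturation detected"]

for line in saturation_report():
    print(line)
```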

Scaling generative AI workloads to accommodate fluctuating demand is challenging and a frequent source of problems. Ineffective auto-scaling configurations and improper load balancing can hinder the ability to scale resources efficiently.
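To see why a scaling configuration misbehaves, it helps to reason through the math most autoscalers use. The sketch below mirrors the proportional rule behind Kubernetes’ Horizontal Pod Autoscaler; the metric numbers are illustrative only.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=20):
    """Proportional scaling rule used by autoscalers such as Kubernetes' HPA:
    desired = ceil(current replicas * current metric / target metric)."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(desired, max_replicas))

# Example: 4 instances averaging 180 in-flight requests each, target 100.
print(desired_replicas(4, current_metric=180, target_metric=100))  # -> 8

# A common misconfiguration: a target set near peak capacity leaves no
# headroom, so the fleet scales up only after users already see latency.
print(desired_replicas(4, current_metric=180, target_metric=175))  # -> 5
```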

Managing the training and inference processes of generative AI models requires workflows that facilitate efficient model training and inference. Of course, this must be done while taking advantage of the scalability and flexibility offered by the public cloud.

Inference performance issues are most often the culprits, and although the inclination is to toss resources and money at the problem, a better approach is to tune the model first. Tunables are part of most AI toolkits, and the toolkit documentation should offer guidance on how to set them for your specific use case.
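Before adding hardware, check the knobs your toolkit already exposes. A minimal sketch using the Hugging Face transformers library as one common example; the model name and settings here are illustrative, and your toolkit’s tunables will differ.

```python
from transformers import pipeline  # third-party; pip install transformers

# Small model purely for illustration; substitute your own.
generator = pipeline("text-generation", model="gpt2")

# Two of the cheapest inference tunables: capping output length bounds
# per-request latency, and greedy decoding avoids sampling overhead.
result = generator(
    "Summarize the inventory status:",
    max_new_tokens=64,   # bound output length -> bounds latency
    do_sample=False,     # deterministic greedy decoding
)
print(result[0]["generated_text"])
```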

Other issues to look for

Training generative AI models can be time-consuming and very expensive, especially when dealing with large data sets and complex architectures. Inefficient utilization of parallel processing capabilities and storage resources can prolong the model training process.

Keep in mind that many of these workloads run on GPUs, which are not cheap to purchase or rent. Model training should be as efficient as possible and occur only when the models need to be updated. You have other options to access the information needed, such as retrieval-augmented generation (RAG).

RAG is an approach used in natural language processing (NLP) that combines information retrieval with the creativity of text generation. It addresses the limitations of traditional language models, which often struggle with factual accuracy, and offers access to external and up-to-date knowledge.

You can augment inference processing with access to other information sources that can validate and add updated information as needed to the model. This means the model does not have to be retrained or updated as often, leading to lower costs and better performance.
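Here’s the idea reduced to its skeleton: retrieve the documents most relevant to a query, then prepend them to the prompt so the model answers from current facts rather than stale training data. This is a toy sketch with bag-of-words retrieval; a real system would use a vector database, and `call_llm` is a hypothetical placeholder for whatever inference API you use.

```python
from collections import Counter
import math

DOCS = [  # in practice this lives in a vector database, not a list
    "SKU-1042 restocks every Tuesday; current warehouse count is 312.",
    "Express shipments leave the dock at 2 p.m. local time.",
    "Returns over $500 require manager approval.",
]

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 1) -> list:
    """Return the k documents most similar to the query."""
    q = Counter(query.lower().split())
    ranked = sorted(DOCS, reverse=True,
                    key=lambda d: cosine(q, Counter(d.lower().split())))
    return ranked[:k]

def call_llm(prompt: str) -> str:
    return f"[model response to: {prompt!r}]"  # hypothetical placeholder

question = "When does SKU-1042 restock?"
context = "\n".join(retrieve(question))
print(call_llm(f"Answer using this context:\n{context}\n\nQuestion: {question}"))
```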

Finally, ensuring the security and compliance of generative AI systems on public clouds is paramount. Data privacy, access controls, and regulatory compliance can impact performance if not adequately addressed, and I find that compliance governance is often overlooked during performance testing.

Best practices for AI performance management

My advice here is straightforward and relates to best practices you’re likely already aware of.

  • Training. Stay current on what the people who support your AI tools are saying about performance management. Make sure a few team members are signed up for recurring training.
  • Observability. I’ve already mentioned this, but have a sound observability program in place. This includes key monitoring tools that can alert you to performance issues before users experience them. Once that happens, it’s too late. You’ve lost credibility.
  • Testing. Most organizations don’t do performance testing on their cloud-based AI systems. You may have been told there’s no need since you can always allocate more resources. That’s just silly. Do performance testing as part of every deployment, no exceptions; see the load-test sketch after this list.
  • Performance operations. Don’t wait to address performance until there’s a problem. Actively manage it on an ongoing basis. If you’re reacting to performance issues, you’ve already lost.
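As promised above, a minimal load-test sketch: hit the endpoint with concurrent requests and report tail latency, since p95/p99 is what users actually feel. The URL and payload are hypothetical, and a real test plan would use a purpose-built tool such as Locust or k6.

```python
import time
import statistics
from concurrent.futures import ThreadPoolExecutor
import requests  # third-party; pip install requests

URL = "https://inference.example.com/generate"  # hypothetical endpoint

def one_request(_):
    """Time a single end-to-end inference call."""
    start = time.perf_counter()
    requests.post(URL, json={"prompt": "test"}, timeout=120)
    return time.perf_counter() - start

# 100 requests with 10 in flight at a time; size this to expected peak load.
with ThreadPoolExecutor(max_workers=10) as pool:
    latencies = sorted(pool.map(one_request, range(100)))

print(f"p50: {statistics.median(latencies):.2f}s")
print(f"p95: {latencies[int(0.95 * len(latencies)) - 1]:.2f}s")
print(f"p99: {latencies[int(0.99 * len(latencies)) - 1]:.2f}s")
```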

This is not going away. As more generative AI systems pop up, whether in the cloud or on-premises, more performance issues will arise than most people anticipate. The key is to be proactive. Don’t wait for those Monday morning surprises; they are not fun.
