3 Mistakes When Deploying ML Models to Production (don't do THIS!)

August 17, 2022

Time is money, and mistakes cost money. Don't make these beginner mistakes when you are deploying machine learning models into production.

Our team has identified a few common mistakes we see ML teams make when they are deploying their models to production. Hopefully this article helps prevent you from making the same ones!

Building your own deployment infrastructure is really hard and time consuming. Quite frankly, that is why companies like Banana exist. We take care of the messy, boring, and downright challenging infrastructure work for you with the goal of reducing the friction in deploying ML models to production.

Some of what we talk about might get "into the weeds" when it comes to production infrastructure. If it interests you, that's awesome! If it gets too technical, that's okay too; just take the primary mistakes and learnings from this article with you and you'll be set.

Mistake #1: Over-Provisioning GPUs

If an ML company wants to handle a concurrency of 20 at peak, one thing they will quickly realize is that GPU processes are one-to-one with GPUs. In other words, you can only have one process running against a GPU-based model at a time.

This is in stark contrast to CPU workers, which can run multiple processes at once without issue. In fact, some languages can spin up thousands of concurrent HTTP handlers and return all of those responses without overloading a CPU machine.

Back to our example. If GPUs are a necessity for running a model and you need a concurrency of 20 at peak, the mistake happens when teams conclude they have no choice but to keep 20 GPUs running all the time. This was an incredibly common mistake we saw when we used to offer ML consulting. There are companies that could exist today but don't, because they think this is a roadblock to building with ML.

There are three things you can do to solve this.

  1. ML application-level batching (multiplexing)
  2. Pack GPU as full as possible with replicas
  3. Use Banana (serverless GPUs)

ML Application-level Batching (multiplexing)

Batching means adding an extra dimension to your input so that you can feed multiple requests, or inferences, through the model in a single pass. The concept applies to most machine learning models and is sometimes referred to as "multiplexing".

For example, if you want to handle 20 concurrent calls, you could give your server an accumulation step that gathers as many calls as it can within a one-second window, then stacks them together along the batch dimension of the tensor you feed into the GPU.

This works because GPUs generally have enough cores to run the extra batch entries without any slowdown. Consider it bonus throughput: you stack calls together and run them through at once.
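To make the accumulation step concrete, here is a minimal sketch, assuming a PyTorch model and an asyncio-based server. The one-second window, the placeholder `model`, and the idea that your HTTP layer calls `handle_request` are all assumptions for illustration, not a drop-in implementation.

```python
import asyncio
import torch

# Hypothetical placeholder: any model that accepts a batched input tensor.
model = torch.nn.Linear(128, 10).eval()

request_queue: asyncio.Queue = asyncio.Queue()

async def handle_request(input_tensor: torch.Tensor) -> torch.Tensor:
    """Called by the HTTP layer: enqueue one request and wait for its result."""
    future = asyncio.get_running_loop().create_future()
    await request_queue.put((input_tensor, future))
    return await future

async def batch_worker(window_seconds: float = 1.0):
    """Accumulate requests for up to `window_seconds`, then run them in one pass."""
    while True:
        inputs, futures = [], []
        try:
            # Block until at least one request arrives.
            tensor, future = await request_queue.get()
            inputs.append(tensor)
            futures.append(future)

            # Keep collecting until the window closes.
            deadline = asyncio.get_running_loop().time() + window_seconds
            while True:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                tensor, future = await asyncio.wait_for(request_queue.get(), timeout)
                inputs.append(tensor)
                futures.append(future)
        except asyncio.TimeoutError:
            pass

        # Stack requests along the batch dimension and run one forward pass.
        batch = torch.stack(inputs)
        with torch.no_grad():
            outputs = model(batch)

        # Hand each caller its own slice of the batched output.
        for future, output in zip(futures, outputs):
            future.set_result(output)
```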

Maximize Your GPU Capacity

Another trick to prevent over-provisioning is to pack the GPU as full as possible. If you're running on a GPU with 8GB of memory and your model plus its runtime tensors averages 2GB, you could run four replicas on that machine. Don't bother scaling out to four different machines; pack four servers onto the same machine and use that GPU to its limit.

They do need to be distinct servers, and you do need to redundantly load the model for each one; models and servers stay one-to-one. But it can be very handy if you're able to fit multiple model replicas onto one GPU.

You'll have to keep an eye on GPU memory usage so that one process doesn't overflow the machine and kill the other processes. That can get a little dangerous. But when not running three extra GPUs saves you $1,000+ a month with your cloud provider, it can be very worthwhile to do this.
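As a rough sketch of the idea, you could launch several identical server processes on the same machine, each loading its own copy of the model onto the same GPU. The replica math, the port numbers, and the `serve.py` entry point below are hypothetical stand-ins for your own setup.

```python
import os
import subprocess

# Hypothetical numbers: an 8GB card and a ~2GB model-plus-runtime footprint.
GPU_MEMORY_GB = 8
MODEL_FOOTPRINT_GB = 2

# Pack as many replicas as fit on the card.
replicas = GPU_MEMORY_GB // MODEL_FOOTPRINT_GB  # -> 4

processes = []
for i in range(replicas):
    # Each replica is a separate server process ("serve.py" stands in for your
    # own inference server), each loading its own copy of the model, and all
    # of them pinned to the same physical GPU (CUDA device 0).
    processes.append(subprocess.Popen(
        ["python", "serve.py", "--port", str(8000 + i)],
        env={**os.environ, "CUDA_VISIBLE_DEVICES": "0"},
    ))

# A load balancer or simple round-robin then spreads traffic across the ports.
for p in processes:
    p.wait()
```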

Use Banana (serverless GPUs)

The simplest way to prevent over-provisioning GPUs is to use a serverless GPU provider like Banana for your model deployment. The headache of paying for 20 GPUs running at once goes away, because our serverless platform auto-scales your GPU compute based on what you actually need in real time. You also don't have to go down the GPU-maximization rabbit hole, because our infrastructure already has these kinds of optimizations baked into the product.

Mistake #2: Forgetting About Unit Economics

Many teams start from the assumption that GPUs are expensive, and therefore that spending a lot of money on ML compute is just the cost of doing business. Almost always, these teams either die or have to aggressively cut spending when they realize how unsustainable that acceptance is. Sure, GPUs can be expensive, but don't accept that and stop trying to optimize your compute costs.

Make sure your unit economics actually work. The number of companies we've worked with that use machine learning models but lose money on every call they run would blow your mind. Especially during the peak of the venture-capital-funded AI gold rush, people were throwing money at machine learning.

There are many ways to run your ML that could improve your unit economics: CPUs, serverless CPUs, GPUs, or serverless GPUs. Investigate each option and choose the one that makes the most business sense.
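A quick back-of-the-envelope check is enough to catch the problem. The numbers below are made up purely to illustrate the arithmetic; plug in your own prices and throughput.

```python
# Hypothetical numbers purely for illustration; substitute your own.
gpu_cost_per_hour = 1.10      # on-demand price for one GPU instance ($)
inferences_per_hour = 1_800   # sustained throughput of that instance
revenue_per_call = 0.0004     # what a single API call earns you ($)

cost_per_call = gpu_cost_per_hour / inferences_per_hour
margin_per_call = revenue_per_call - cost_per_call

print(f"cost per call:   ${cost_per_call:.5f}")
print(f"margin per call: ${margin_per_call:.5f}")  # negative -> you lose money on every call
```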

Mistake #3: Choosing the wrong Platform for your Models

Another common mistake is choosing the wrong platform, or the wrong location, to run your model. I'm specifically talking about the difference between running on a device (on the edge) versus running in the cloud.

If you're a self-driving company, don't spend any time offloading machine learning to the cloud. The round-trip latency will never meet the millisecond-level response times that self-driving cars demand.

Or, if you have a computer vision model that needs to do real-time inference and could afford the latency of a round trip to the cloud, you might still download the model to the device and run it on the edge if it's small enough. Why would you do that? Economically, that's one less cloud model you're paying for, which can help your unit economics.

Take the time up-front to choose your deployment target intelligently, based both on latency constraints and on where you can reduce costs. Sometimes it makes sense to deploy on the edge; other times it doesn't, because the models are huge and you need a GPU or TPU to meet your speed requirements, in which case the trip to the cloud is warranted.

Mistake #4 (bonus): Shiny Objects & Not Doing Your Research

When you research deployment options, plenty of content online will steer you toward flashy inference frameworks. They can look really great at first, offering useful-sounding features that hook you, so you jump right into using the tool. A significant amount of work and time goes into figuring out how each system works. And once you go live and deploy, they oftentimes end up significantly more expensive and slower than just running a simple HTTP server yourself.

For example, there are speed-optimizing transpilers out there that claim to increase your inference speed, but in the end, running bare framework inference and wrapping it in an HTTP server yourself will almost always be faster than using one of these general-purpose serving frameworks.
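For reference, "wrapping it in an HTTP server yourself" can be as small as the sketch below, assuming FastAPI and a PyTorch model (both our choices for illustration, not a requirement, and the placeholder model stands in for your own weights):

```python
import torch
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Hypothetical placeholder model; load your own weights here.
model = torch.nn.Linear(128, 10).eval()

class InferenceRequest(BaseModel):
    inputs: list[float]  # one flat feature vector

@app.post("/infer")
def infer(request: InferenceRequest):
    # Bare framework inference, wrapped directly in an HTTP handler.
    with torch.no_grad():
        output = model(torch.tensor(request.inputs).unsqueeze(0))
    return {"outputs": output.squeeze(0).tolist()}

# Run with: uvicorn server:app --host 0.0.0.0 --port 8000
```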

We recognize that this advice is coming from Banana, an inference framework of sorts, and could come off a bit awkward. But we are a team of engineers at our core, and this point was worth mentioning based on our experience using various frameworks in the past. We certainly believe Banana can add significant value for your model deployment needs, but we encourage you to do your own research, make sure tools like Banana are a fit for your team, and come to your own decision. Test the product before you over-invest time and money into any inference framework.