To help customers manage costs and make the most of available infrastructure, Modzy lets you manage that infrastructure directly within the platform. This tutorial describes how to manually adjust the autoscaling of processing engines to decrease latency on inference runs and manage the associated infrastructure costs.
A processing engine is an infrastructure unit that runs your models. Submitting an inference job to a model spins up a processing engine, which then deploys a single instance of that model and runs inference on the user-specified data. You can find the autoscaling controls in the "Model Management" section.
To open Model Management, click the "Operations" tab in the menu at the top of the page. "Model Management" is the last section under Operations. From there, you can scroll through the list of available models or type a specific model's name in the search bar.
By default, Modzy optimizes your account's autoscaling by setting the minimum number of processing engines for each model to 0 and the maximum to 1. With a minimum of 0, no engine is kept running for the model, so a new job must wait for one to spin up. To reduce latency for a model, click the min/max field under the "Engine Autoscaling" column. There you can use the slider to adjust the minimum and maximum number of processing engines dedicated to the model, or type your desired numbers directly into the boxes below.
The "Reservable engines" field will show your account's remaining available processing engines.
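The min/max rules above can be sketched as a small validation check. This is only an illustration, not part of Modzy's API: `EngineAutoscaling` and `validate_autoscaling` are hypothetical names, and the rule that the minimum (always-on) engines count against the "Reservable engines" number is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineAutoscaling:
    """Min/max processing engines for one model (hypothetical helper, not a Modzy API)."""
    minimum: int = 0  # default minimum described in this tutorial
    maximum: int = 1  # default maximum described in this tutorial


def validate_autoscaling(setting: EngineAutoscaling, reservable_engines: int) -> list[str]:
    """Return a list of problems; an empty list means the setting looks valid."""
    problems = []
    if setting.minimum < 0:
        problems.append("minimum must be 0 or greater")
    if setting.maximum < setting.minimum:
        problems.append("maximum must be at least the minimum")
    # Assumption: always-on (minimum) engines are what consume reservable engines.
    if setting.minimum > reservable_engines:
        problems.append("minimum exceeds reservable engines")
    return problems


# Keep one engine always on for a model, with 3 engines still reservable.
print(validate_autoscaling(EngineAutoscaling(minimum=1, maximum=1), reservable_engines=3))
```

Checking a setting like this before saving it can catch a request for more always-on engines than the account has left.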
After you click "Save," the page shows the processing engine spinning up if you scaled the minimum up to 1, or spinning down if you scaled it back down to 0. When the engine is on, its status reads "Ready." When it is off, its status reads "Stopped."
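If you track engine status from a script rather than watching the page, the wait for "Ready" or "Stopped" can be sketched as a simple polling loop. This is a minimal sketch under stated assumptions: `wait_for_status` is a hypothetical helper, and `get_status` stands in for whatever function you supply to read the engine's current status; neither is a Modzy API.

```python
import time
from typing import Callable


def wait_for_status(get_status: Callable[[], str], target: str,
                    timeout_s: float = 300.0, interval_s: float = 5.0) -> bool:
    """Poll get_status() until it returns target; return False if the timeout passes first."""
    deadline = time.monotonic() + timeout_s
    while True:
        if get_status() == target:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Usage with a stand-in status source (hypothetical): the engine reports
# "Stopped" twice while spinning up, then "Ready".
fake_statuses = iter(["Stopped", "Stopped", "Ready"])
print(wait_for_status(lambda: next(fake_statuses), "Ready", timeout_s=10.0, interval_s=0.0))
```

The timeout keeps the script from waiting forever if the engine never reaches the expected status.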