When a job is submitted, Modzy sends its inputs to the input queue. From there, inputs are sent to a processing engine, which runs them. When a model-version is set to run with multiple processing engines, serial queues and multiple model-version instances are created.
Once a job is in progress, its completion speed depends on the input size, the number of inputs, the model, the available memory and hardware, and the number and settings of the processing engines that run the model.
Processing engines are infrastructure units that run models. The total number of processing engines purchased defines the account’s overall parallel processing capacity. To check a processing engine’s current status, get processing details.
Modzy lets you manage job completion speed separately for each model-version: you can set model readiness and autoscaling for each individual model-version in your account.
Each processing engine deploys a single model-version instance and runs one input at a time. The number of inputs that run in parallel equals the number of processing engines set and started for that model-version. For example, if a job with 4 inputs is submitted to a model that has 2 processing engines, 2 inputs are processed at the same time. When a processing engine finishes running an input, it picks up the next input in the queue.
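The queue behavior described above can be sketched as a small simulation. This is an illustrative model only, not Modzy code: the function name `simulate_job` and the idea of giving each input a fixed duration are assumptions made for the example.

```python
import heapq


def simulate_job(input_durations, engines):
    """Simulate inputs run in queue order across a number of engines.

    Hypothetical helper: each engine runs one input at a time and picks
    up the next queued input as soon as it finishes the current one.
    Returns (finish_time, assignments), where each assignment is
    (input_index, engine_id, start_time, end_time).
    """
    if engines < 1:
        raise ValueError("at least one engine is required")
    # Min-heap of (time the engine becomes free, engine id).
    free_at = [(0.0, engine_id) for engine_id in range(engines)]
    heapq.heapify(free_at)
    assignments = []
    finish_time = 0.0
    for index, duration in enumerate(input_durations):
        # The engine that frees up first takes the next input in the queue.
        start, engine_id = heapq.heappop(free_at)
        end = start + duration
        assignments.append((index, engine_id, start, end))
        heapq.heappush(free_at, (end, engine_id))
        finish_time = max(finish_time, end)
    return finish_time, assignments


# 4 inputs of equal duration on 2 engines: 2 inputs run at a time.
finish, plan = simulate_job([1, 1, 1, 1], engines=2)
print(finish)  # 2.0
```

With a single engine the same four inputs would finish at time 4.0, which is the difference the number of engines makes in this simplified model.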
By default, deployed model-versions are set with a minimum of 0 and a maximum of 1 processing engine. Each model-version can be set with its own minimum and maximum number of processing engines. These values set the model-version’s processing capacity: the number of model-version instances stood up to run job inputs in parallel.
Modzy runs inference jobs with a number of engines between the minimum and the maximum, depending on engine requirements and availability. Set the minimum and maximum on the Model Operations page.
The minimum capacity sets a model’s readiness.
The processing engines set as the minimum are always ready to run the model. They are reserved for the model and cannot be used to run other models.
When the minimum is set to multiple processing engines, the model has multiple model-version instances ready to process inputs in parallel. This reduces latency and speeds up job completion, but the reserved processing engines incur a higher infrastructure usage cost.
Every processing engine set to the model’s minimum is fixed within the configuration and becomes unavailable to all other model-versions. Modzy requires that at least 1 processing engine remains available to spin up other models. Therefore, the sum of the minimum processing capacities across all models is capped at the account’s total number of processing engines minus 1.
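As a worked illustration of this cap, the sketch below checks a set of minimum values against an account’s engine count. The function name, the dictionary of model-version names, and the error behavior are all hypothetical; this is not part of any Modzy API.

```python
def check_minimums(total_engines, minimums):
    """Check the account-wide cap on reserved engines.

    Hypothetical helper. `minimums` maps a model-version name to its
    minimum processing capacity. The sum of all minimums may not exceed
    total_engines - 1, so that at least 1 engine stays unreserved.
    Returns the number of engines left unreserved.
    """
    cap = max(total_engines - 1, 0)
    reserved = sum(minimums.values())
    if reserved > cap:
        raise ValueError(
            f"{reserved} engines reserved, but at most {cap} may be "
            f"reserved on an account with {total_engines} engines"
        )
    return total_engines - reserved


# An account with 5 engines can reserve at most 4 across all models.
print(check_minimums(5, {"model-a": 2, "model-b": 1}))  # 2
```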
What each minimum value does:
- If set to 0: Modzy spins up the model upon each run. It’s the default amount.
- If set to 1: The model is always ready to run inputs one by one. The organization has 1 processing engine reserved exclusively for this model.
- If set higher: The model is ready to run this amount of inputs in parallel. The organization has this number of processing engines reserved exclusively for this model.
Set a minimum greater than 0 when:
- a model is frequently used,
- a model has time-sensitive results such as outputs that trigger user interaction.
The maximum capacity sets a model’s autoscaling limit. It controls the maximum number of processing engines that may run in parallel.
Processing engines up to the maximum are started according to availability and to the effort the model requires to process its inputs.
If the number of processing engines in use across all models reaches the account’s total number of processing engines, the model’s maximum may not be reached. In this case, Modzy cannot add processing engines to the model even if the model’s maximum processing capacity is higher than the number of engines it is using.
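The interaction between a model’s maximum and the account’s total can be expressed as simple arithmetic. The sketch below is a simplified reading of the rule above, with hypothetical names; Modzy’s actual scheduling may take other factors into account.

```python
def engines_reachable(total_engines, in_use_by_other_models, model_maximum):
    """Upper bound on the engines a model can scale to right now.

    Hypothetical helper: the model can never use more than its maximum,
    nor more than the engines not currently used by other models.
    """
    free = max(total_engines - in_use_by_other_models, 0)
    return min(model_maximum, free)


# Maximum is 3, but other models already use 4 of the account's 5 engines.
print(engines_reachable(5, 4, 3))  # 1
```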
When the maximum is set to multiple processing engines, the model can spin up multiple model-version instances and process inputs in parallel. This speeds up job completion while keeping the infrastructure usage cost down.
The maximum can be set to any number between 1 and the account’s total number of processing engines, as long as it is equal to or greater than the minimum. It cannot be 0; otherwise, Modzy would not be able to assign any engines to run the model.
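These constraints on a single model-version’s settings can be summarized as a validation routine. This is a minimal sketch with hypothetical names and messages, not the checks Modzy itself performs.

```python
def validate_capacity(minimum, maximum, total_engines):
    """Return a list of problems with one model-version's settings.

    Hypothetical helper encoding the rules above: the minimum is 0 or
    more, the maximum lies between 1 and the account's total number of
    engines, and the maximum is at least the minimum.
    """
    errors = []
    if minimum < 0:
        errors.append("minimum cannot be negative")
    if maximum < 1:
        errors.append("maximum must be at least 1")
    if maximum > total_engines:
        errors.append("maximum cannot exceed the account's total engines")
    if maximum < minimum:
        errors.append("maximum must be equal to or greater than the minimum")
    return errors


print(validate_capacity(0, 1, 4))  # [] (the default settings are valid)
```

An empty list means the settings satisfy every rule; for example, `validate_capacity(2, 1, 4)` would report that the maximum is lower than the minimum.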
Selecting the right maximum:
What each maximum value does:
- If set to 1: The model runs inputs one by one. It’s the default amount.
- If set higher: The model can scale to run this amount of inputs in parallel.
Set the maximum to 1 when:
- a model’s job completion speed is not relevant,
- a model only receives one input per job.
Set a maximum greater than 1 when:
- there is a need to parallelize inputs for a model but there is no need to have fixed processing engines assigned,
- a model generally receives a fixed number of inputs but occasionally receives more, and its capacity must temporarily increase to process the larger number of inputs at the regular speed.