Batch processing is the capability to run inference on multiple input items at the same time on a GPU.
Without it, a model runs inference on one input item at a time. Each item occupies the GPU on its own, so most of the GPU’s capacity goes unused.
With batch processing, a model can process a list of input items simultaneously with the same processing engine, taking as many input items as the GPU can handle. This makes much fuller use of the GPU, which improves processing speed and gets jobs completed faster.
Any model can be set up for batch processing. The feature is set in the model’s container specifications. Batch processing models require a GPU and may require more RAM. The model’s /status route should return a batch processing size.
batch_size is the maximum number of input items the model can process simultaneously while mounted on a GPU.
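As a rough illustration of a status response that carries a batch size, the sketch below builds a JSON body in Python. The function name build_status_payload and the field names "status" and "batch_size" are assumptions made for this example; they are not taken from the container specifications, which define the actual response schema.

```python
import json


def build_status_payload(batch_size: int) -> str:
    """Return a JSON body for a hypothetical /status response."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # The field names below are assumptions for illustration only,
    # not the schema defined by the container specifications.
    return json.dumps({"status": "OK", "batch_size": batch_size})


payload = build_status_payload(8)
print(payload)
```

A model that reports its batch size this way lets the caller decide how many inputs to send per request without hard-coding the value.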
Model authors calculate the model’s batch_size in relation to memory and hardware values. It’s recommended to test the model with two inputs and then keep doubling the number of inputs (4, 8, 16, and so on), testing at each step. Batch sizes of 8 and higher significantly improve processing speed, while batch sizes over 64 are rare. The batch size value may be a fixed number, but ideally it’s a dynamic formula that adapts to variable memory and hardware.
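The testing procedure and the idea of a dynamic formula can be sketched in Python as follows. This is a minimal sketch under stated assumptions: fits is a hypothetical callable standing in for a real test run of the model at a given input count, the memory-based formula is one simple possibility rather than a prescribed calculation, and the cap of 64 only reflects the observation that larger batch sizes are rare.

```python
from typing import Callable


def find_max_batch_size(fits: Callable[[int], bool], limit: int = 64) -> int:
    """Start at 2 inputs and keep doubling until a test fails or limit is passed.

    Returns the largest size that passed, or 1 if even 2 inputs do not fit.
    """
    best = 1
    size = 2
    while size <= limit and fits(size):
        best = size
        size *= 2
    return best


def batch_size_from_memory(free_bytes: int, per_item_bytes: int, limit: int = 64) -> int:
    """Assumed dynamic formula: items that fit in free memory, clamped to [1, limit]."""
    if per_item_bytes <= 0:
        raise ValueError("per_item_bytes must be positive")
    return max(1, min(limit, free_bytes // per_item_bytes))


def fake_fits(n: int) -> bool:
    # Stand-in for a real test run: pretend more than 20 inputs exhausts memory.
    return n <= 20


found = find_max_batch_size(fake_fits)
computed = batch_size_from_memory(free_bytes=10 * 1024**3, per_item_bytes=512 * 1024**2)
print(found, computed)
```

The doubling search only ever tries powers of 2, so it can settle below the true maximum; the memory-based formula trades that coarseness for a dependence on accurate per-item memory estimates.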
Upon a job request, Modzy checks the model’s batch size and sends the inputs to it automatically.
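To make the effect of the batch size on a job concrete, the sketch below splits a list of inputs into groups no larger than batch_size. It is an illustration only and does not describe how Modzy dispatches inputs internally; split_into_batches and the sample input names are hypothetical.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def split_into_batches(inputs: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive groups of at most batch_size items, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(inputs), batch_size):
        yield list(inputs[start:start + batch_size])


job_inputs = [f"input-{i}" for i in range(10)]
batches = list(split_into_batches(job_inputs, batch_size=4))
# 10 inputs with a batch size of 4 give groups of 4, 4, and 2 items.
print([len(b) for b in batches])
```

With a batch size of 4, the ten inputs need three passes through the model instead of ten.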