Benchmarking Model Latency

In Data Science we are obsessed with model performance. For a classification task you might look at the Area Under the Curve (AUC); for a regression task, at the Mean Absolute Error (MAE).

Typically, Data Scientists create an offline evaluation pipeline where each model is benchmarked in different scenarios. The model (or its input features) is then iterated on until the performance metrics improve and we have enough confidence to roll the model out into production (and, if you are fancy enough, into an A/B test).

My question is: are these performance metrics enough for real-life scenarios?

Let's take a step back. Most ML models don't simply live in the vacuum of offline evaluations; they are usually part of a product and serve real customers. Think, for example, of providing recommendations on an e-commerce website, answering questions in a chatbot, or detecting fraud in payments.

When interacting with customers in real time, there are other metrics that matter besides model performance. For example, no one wants to navigate a slow e-commerce website, and it is well known that page load time directly impacts the conversion rate. If your model.predict call adds to the load time of the page, then whether your model takes 1s or 100ms (a.k.a. model latency) can make a big difference.

But how can we measure this model latency in a similar way to the model performance metrics discussed above? How can we get a fast feedback loop for model latency?

The short answer is: using pytest-benchmark. pytest-benchmark is a pytest plugin that allows you to easily benchmark parts of your code by writing (you guessed it) tests.
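
Before we get to the model, here is what a benchmark test looks like in its most minimal form, a toy sketch timing Python's built-in sorted: the benchmark fixture takes a callable (plus its arguments), runs it many times and records the timings.

def test_sorted_benchmark(benchmark):
    # benchmark() calls sorted(data) many times and records how long each call takes.
    data = list(range(10_000, 0, -1))
    result = benchmark(sorted, data)
    assert result[0] == 1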

Let's jump into an example :) All the code used in this blog post can be found in this repo.

Imagine you have a simple torch model.

class SimpleRegressionModel(torch.nn.Module):
    def __init__(
        self,
        embeddings: List[EmbeddingConfig],
        hidden_layers: List[int],
        numerical_cols: List[str],
        dropout: Optional[float] = None,
    ) -> None:
        ...

    def forward(self, x: SimpleRegressionModelData) -> torch.FloatTensor:
        ...

Without getting into much detail, this is a simple neural network (NN) for regression that can receive both categorical and numerical features. If you want to know more about the model, check it here.
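
The constructor receives a list of EmbeddingConfig objects describing the categorical features. Its exact definition lives in the repo, but for the purpose of this post you can picture it as a small dataclass along these lines (a sketch; the embedding_dim field is an assumption, and the benchmark code below only relies on name and cardinality):

from dataclasses import dataclass


@dataclass
class EmbeddingConfig:
    name: str           # column name of the categorical feature
    cardinality: int    # number of distinct categories
    embedding_dim: int  # size of the learned embedding (assumed field)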

The important part is that the model's input is abstracted as x, which is a SimpleRegressionModelData. This wrapper around the input data makes it easy to understand what the inputs to the model are.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

import torch


@dataclass
class NamedTensor:
    columns: Tuple[str, ...]
    data: torch.Tensor
    _column_idx_map: Dict[str, int] = field(init=False)

    def __post_init__(self):
        self._validate_data()
        self._column_idx_map = {
            column: idx for idx, column in enumerate(self.columns)
        }

    def _validate_data(self):
        if len(self.data.shape) != 2:
            raise RuntimeError("NamedTensor only supports data of dim=2!")
        if len(self.columns) != self.data.shape[1]:
            raise RuntimeError(
                "Number of columns should match the size of the second dimension!"
            )

    def get_data(self, columns: List[str]) -> torch.Tensor:
        idx_columns = [self._column_idx_map[column] for column in columns]
        return self.data[:, idx_columns]

    def get_data_at_idx(self, idx: int) -> "NamedTensor":
        # Slice with a list so the result keeps dim=2 and passes validation.
        return NamedTensor(self.columns, self.data[[idx], :])


@dataclass
class SimpleRegressionModelData:
    numericals: NamedTensor
    categoricals: NamedTensor
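
As a quick illustration of the wrapper (the column names here are made up), NamedTensor lets you address columns by name instead of by positional index:

# Two rows of data for three numerical columns.
numericals = NamedTensor(
    columns=("price", "weight", "rating"),
    data=torch.randn(2, 3),
)

# Select a subset of columns by name; returns a tensor of shape (2, 2).
prices_and_ratings = numericals.get_data(["price", "rating"])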

It's now easy to use pytest-benchmark to create a latency benchmark for the forward method. The first step is to generate input data.

def generate_data(
    categorical_data_info: List[EmbeddingConfig],
    numerical_data_info: List[str],
    n_rows: int,
) -> SimpleRegressionModelData:
    # Random category indices, one column per categorical feature.
    _categorical_data = []
    _categorical_data_columns = []
    for embedding_cfg in categorical_data_info:
        _categorical_data.append(
            torch.randint(0, embedding_cfg.cardinality - 1, (n_rows, 1))
        )
        _categorical_data_columns.append(embedding_cfg.name)

    # Random standard-normal values for the numerical features.
    _numerical_data = torch.randn((n_rows, len(numerical_data_info)))

    return SimpleRegressionModelData(
        numericals=NamedTensor(tuple(numerical_data_info), _numerical_data),
        categoricals=NamedTensor(
            tuple(_categorical_data_columns), torch.cat(_categorical_data, dim=1)
        ),
    )
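
For example, asking for 32 rows gives tensors whose first dimension is the batch size (a quick sanity check, not part of the benchmark itself):

data = generate_data(categorical_data_info, numerical_data_info, n_rows=32)
print(data.numericals.data.shape)    # (32, number of numerical columns)
print(data.categoricals.data.shape)  # (32, number of categorical columns)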

We also need to instantiate a model in order to benchmark it. We can wrap up the model creation and the data generation in another function.

def setup_model_and_data(
    categorical_data_info: List[EmbeddingConfig],
    numerical_data_info: List[str],
    n_hidden_layers: int,
    batch_size: int,
) -> Tuple[SimpleRegressionModel, SimpleRegressionModelData]:
    data = generate_data(categorical_data_info, numerical_data_info, batch_size)

    # Hidden layer sizes as decreasing powers of two, e.g. n_hidden_layers=4 -> [8, 4, 2].
    hidden_layers = [2**n for n in range(1, n_hidden_layers)]
    hidden_layers.reverse()
    dummy_model = SimpleRegressionModel(
        embeddings=categorical_data_info,
        hidden_layers=hidden_layers,
        numerical_cols=numerical_data_info,
    )

    return dummy_model, data

With these two functions in place, it's now fairly easy to create a latency benchmark.

def test_benchmark_model_batch_1(
    benchmark,
    categorical_data_info: List[EmbeddingConfig],
    numerical_data_info: List[str],
) -> None:
    model, data = setup_model_and_data(categorical_data_info, numerical_data_info, 4, 1)

    # Put the model in inference mode, as it would be when serving predictions.
    model.eval()
    with torch.no_grad():
        benchmark(model.forward, data)

This test generates an input with a single instance and a model built with n_hidden_layers=4, then uses that model and data to run the forward method many times, measuring how long each call takes. All of that comes for free from the benchmark fixture provided by pytest-benchmark!
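
The categorical_data_info and numerical_data_info arguments are regular pytest fixtures; in the repo they live in a conftest.py. A sketch of what they might look like, with made-up feature names and cardinalities (and the assumed EmbeddingConfig from above):

# conftest.py (sketch)
from typing import List

import pytest

# EmbeddingConfig would be imported from wherever the model code lives.


@pytest.fixture
def categorical_data_info() -> List[EmbeddingConfig]:
    return [
        EmbeddingConfig(name="country", cardinality=50, embedding_dim=8),
        EmbeddingConfig(name="device", cardinality=5, embedding_dim=4),
    ]


@pytest.fixture
def numerical_data_info() -> List[str]:
    return ["price", "weight", "rating"]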

To run it, we can use the command pytest <yourtestfile.py> and it will give you an output like the one summarised below.

We can see that the benchmark was run for 1175 rounds and that, on average, the forward pass of the model takes ~180us.

Since pytest-benchmark is a pytest plugin, we can use all of pytest's cool features. One might want to run the benchmark with models of different sizes, i.e. by changing the number of hidden layers, or simply see how the model latency scales with the batch size. This can easily be done using the pytest.mark.parametrize decorator.

@pytest.mark.parametrize(
    "n_hidden_layers",
    [
        pytest.param(8, id="n_hidden_layers=8"),
        pytest.param(4, id="n_hidden_layers=4"),
    ],
)
@pytest.mark.parametrize("batch_size", [1, 8, 16, 32, 64, 128, 256, 512, 1024])
def test_benchmark_model_parametrized(
    benchmark,
    batch_size: int,
    n_hidden_layers: int,
    categorical_data_info: List[EmbeddingConfig],
    numerical_data_info: List[str],
) -> None:

    model, data = setup_model_and_data(
        categorical_data_info, numerical_data_info, n_hidden_layers, batch_size
    )

    model.eval()
    with torch.no_grad():
        benchmark(model.forward, data)

Finally, we can also use pytest-benchmark to generate a histogram .svg covering all the parametrised tests. For that we should run the benchmark using the command pytest <yourtestfile.py> --benchmark-histogram.

[Histogram generated by --benchmark-histogram: trial duration in microseconds (us) for each parametrised test test_benchmark_model_parametrized[batch_size-n_hidden_layers], across batch sizes 1 to 1024 with n_hidden_layers of 4 and 8.]

This ends our journey through pytest-benchmark and how to measure your model's latency in an easy, reproducible and standardised way. By using this approach a Data Scientist can simply run their benchmarks just like any other unit test or evaluation pipeline and get a fast feedback loop.

There are many more things you can do with pytest-benchmark, like configuring it to run in your CI/CD pipeline, keeping track of the performance tests at each build, and more. For further details I invite you to check the pytest-benchmark documentation.
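
As a sketch of how a CI set-up might look (the file name and the 5% threshold are arbitrary choices for illustration), you could save a baseline run and make later builds fail when the mean latency regresses too much:

# Save a baseline run (stored under .benchmarks/).
pytest test_benchmark_model.py --benchmark-only --benchmark-autosave

# Later, e.g. in CI: compare against the last saved run and fail the build
# if the mean latency regressed by more than 5%.
pytest test_benchmark_model.py --benchmark-only \
    --benchmark-compare --benchmark-compare-fail=mean:5%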

Hope you enjoyed, thanks for reading :)