Given the fast-paced nature of ML innovation, the lack of a standard for publishing, evaluating, and profiling ML models is a significant pain point for AI consumers.
Application builders — who may have limited ML knowledge — struggle to discover and experiment with state-of-the-art models within their application pipelines.
Data scientists find it difficult to reproduce and reuse published models, or to obtain unbiased comparisons between them.
Finally, system developers often fail to keep up with current trends and lag behind in measuring and optimizing frameworks, libraries, and hardware.
There are concerted efforts by ML stakeholders to remedy this.
This section describes some of these efforts and how MLModelScope is different.
To compare models and HW/SW stacks, both research and industry have developed coarse-grained (model-level) and fine-grained (layer-level) benchmark suites. The benchmarks are designed to be run manually and offline, and the benchmark authors encourage users to submit their evaluation results. The submitted results are curated into a scoreboard of model performance across systems.
- Reference Workloads: There have been efforts to codify a set of ML applications [1,2] that are representative of modern AI workloads, to enable comparisons between hardware stacks. These workloads aim to serve a purpose similar to that of the SPEC benchmarks for CPUs.
- Model Benchmark Suites: There has been work to replicate and measure the performance of published ML models [3,4,5,6,7,8,9,10]. These benchmark suites provide scripts that run frameworks and models to capture the end-to-end time of model execution and to evaluate model accuracy.
- Layer Benchmark Suites: At the other end of the spectrum, system and framework developers have developed sets of fine-grained benchmarks that profile specific layers [3,11,12,13]. The target users of these benchmarks are compiler writers (who propose new transformations and analyses for the loop structures found within ML kernels) and system researchers (who propose new hardware to accelerate ML workloads).
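As a rough illustration of what a model-level benchmark script measures, the following sketch times repeated end-to-end runs of a model and reports its accuracy. It is not taken from any of the cited suites: `evaluate`, `toy_model`, and the report field names are hypothetical, and the toy model stands in for a real framework call.

```python
import statistics
import time


def evaluate(model, samples, labels, warmup=2, runs=5):
    """Return end-to-end latency statistics and top-1 accuracy for `model`."""
    for _ in range(warmup):  # warm-up runs are excluded from timing
        model(samples)
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        predictions = model(samples)
        latencies.append(time.perf_counter() - start)
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {
        "median_latency_s": statistics.median(latencies),
        "min_latency_s": min(latencies),
        "accuracy": correct / len(labels),
    }


def toy_model(batch):
    # Hypothetical stand-in for a framework call: predicts 1 for positive inputs.
    return [1 if x > 0 else 0 for x in batch]


report = evaluate(toy_model, samples=[-2.0, 0.5, 3.0, -1.0], labels=[0, 1, 1, 1])
print(report)
```

A measurement like this yields a single end-to-end number per model, which is why it says nothing about where the time is spent inside the framework or the hardware.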
Model repositories [14,15,16,17,18,19,20,21,22] are curated by framework developers. These framework-specific model zoos are used for testing or for demonstrating the kinds of models a framework supports. There are also catalogs of models [23,24] and public hubs that link ML/DL papers with their corresponding code [25].
Artifact Management Frameworks
ModelHub [26] proposes a model catalog design to store and search developed models. The design also includes a model versioning scheme and a domain-specific language for searching through the model catalog.
Runway [27] manages ML models and experiments by maintaining metadata and links to the artifacts. It also provides a web UI to visualize or compare experiment results.
ModelDB [28] defines a common layer of abstractions to represent ML models and pipelines, and provides a web front end for visual exploration.
FAI-PEP [29] is a benchmarking framework that targets mobile devices and features performance regression detection.
The current practice of measuring and profiling ML models is cumbersome.
It relies on the use of a concoction of tools that are aimed at capturing ML model performance characteristics at different granularities, or levels, within the HW/SW stack.
Across-stack profiling thus requires using multiple tools and stitching their outputs together, which researchers often do manually and ad hoc.
It is difficult or sometimes impossible to stitch and correlate results from these disjoint profiling tools to get a consistent across-stack timeline.
- To profile at the application or model level, one must manually log the time taken by the important steps within the pipeline.
- To profile at the framework level, one enables the built-in or community-contributed framework profiler [30,31,32], which usually writes the profile to a file. These profilers are typically bundled with the frameworks and aim to help users understand the framework's layer performance and execution pipeline.
- To understand model performance within a layer, one either intercepts and logs library calls (through tools such as strace [33] or DTrace [34]) or uses hardware vendors' profilers (such as NVIDIA's nvprof, NVVP, and Nsight [35,36,37], or Intel's VTune [38]).
- To capture hardware- and OS-level events, one uses yet another set of tools, such as PAPI and Perf.
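The manual, application-level logging described in the first item above might look like the following minimal sketch. The `timed` helper, the `spans` list, and the `preprocess` and `predict` steps are hypothetical names introduced for illustration; real pipelines would wrap actual framework calls.

```python
import time
from contextlib import contextmanager

spans = []  # (name, start_seconds, end_seconds), appended as each step finishes


@contextmanager
def timed(name):
    """Record the wall-clock interval of the enclosed pipeline step."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, start, time.perf_counter()))


def preprocess(xs):
    # Hypothetical step: scale 8-bit pixel values to [0, 1].
    return [x / 255.0 for x in xs]


def predict(xs):
    # Hypothetical step standing in for a framework inference call.
    return [x > 0.5 for x in xs]


with timed("pipeline"):
    with timed("preprocess"):
        inputs = preprocess([0, 64, 200, 255])
    with timed("predict"):
        outputs = predict(inputs)

# Print the recorded steps ordered by start time.
for name, start, end in sorted(spans, key=lambda s: s[1]):
    print(f"{name:<12} {1e3 * (end - start):8.3f} ms")
```

All spans here share one clock, so nesting them into a timeline is trivial. Framework, library, and hardware profilers each record against their own clocks and output formats, which is what makes correlating their results with such application-level spans difficult.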
We observe that the inability to rapidly understand state-of-the-art model performance is partly due to the lack of tools or methods that allow researchers to introspect model performance across the HW/SW stack while remaining agile enough to cope with the diverse and fast-paced nature of the ML landscape.
Because there is no standard for publishing and evaluating ML models, models shared through repositories (e.g., GitHub), where the authors may provide information on the HW/SW stack requirements and ad hoc scripts to run the experiments, are hard to reproduce.
CK [39] is a community-driven Python framework to abstract, reuse, and share R&D workflows and Python modules.
To ensure reproducibility, CK uses JSON meta-descriptions to describe the software stack of the workflows.
Similar to a Python package manager, CK manages workflows and the corresponding Python modules.
Other solutions that leverage Nix [40], Spack [41], and Docker [42] also exist.
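To make the idea of a JSON description of a software stack concrete, the sketch below records the interpreter, platform, and selected package versions using only the Python standard library. It does not follow CK's actual meta-description schema: the `describe_stack` function and every field name are assumptions chosen for illustration.

```python
import json
import platform
from importlib import metadata


def describe_stack(packages):
    """Build a JSON-serializable description of the current software stack."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.system(),
        "machine": platform.machine(),
        "packages": versions,
    }


stack = describe_stack(["numpy", "pip"])
print(json.dumps(stack, indent=2, sort_keys=True))
```

Storing such a description next to the experiment scripts lets a reader compare their environment against the one used by the authors before attempting to reproduce a result.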
1. MLPerf, https://mlperf.org.
2. HPE Deep Learning Performance Guide, https://dlpg.labs.hpe.com.
3. AI-Matrix, https://aimatrix.ai.
4. Fathom: Reference workloads for modern deep learning methods, IISWC 2016.
5. DAWNBench: An End-to-End Deep Learning Benchmark and Competition, SOSP 2017.
6. Analysis of DAWNBench, a Time-to-Accuracy Machine Learning Performance Benchmark, https://arxiv.org/abs/1806.01427.
7. DNNMark: A Deep Neural Network Benchmark Suite for GPUs, GPGPU 2017.
8. Latency and Throughput Characterization of Convolutional Neural Networks for Mobile Computer Vision, https://arxiv.org/abs/1803.09492.
9. Performance analysis of CNN frameworks for GPUs, ISPASS 2017.
10. TBD: Benchmarking and Analyzing Deep Neural Network Training, IISWC 2018.
11. DeepBench, https://github.com/baidu-research/DeepBench.
12. ConvNet Benchmarks, https://github.com/soumith/convnet-benchmarks.
13. LSTM Benchmarks for Deep Learning Frameworks, https://arxiv.org/abs/1806.01818.
14. Caffe2 Model Zoo, https://caffe2.ai/docs/zoo.html.
15. Caffe Model Zoo, https://caffe.berkeleyvision.org/model_zoo.html.
16. Gluon CV, https://gluon-cv.mxnet.io.
17. Gluon NLP, https://gluon-nlp.mxnet.io/.
18. ONNX Model Zoo, https://github.com/onnx/models.
19. TensorFlow Detection Model Zoo, https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md.
20. TensorFlow-Slim Image Classification Model Library, https://github.com/tensorflow/models/tree/master/research/slim.
21. TensorFlow Hub, https://www.tensorflow.org/hub.
22. PyTorch Vision, https://github.com/pytorch/vision.
23. Modelhub, http://modelhub.ai.
24. ModelZoo, https://modelzoo.co.
25. Papers with Code, https://paperswithcode.com.
26. Modelhub: Deep learning lifecycle management, ICDE 2017.
27. Runway: machine learning model experiment management tool, SysML 2018.
28. ModelDB: a system for machine learning model management, https://mitdbg.github.io/modeldb/.
29. Facebook AI Performance Evaluation Platform, https://github.com/facebook/FAI-PEP.
30. TensorFlow Profiler, https://www.tensorflow.org/api_docs/python/tf/profiler.
31. MXNet Profiler, https://mxnet.incubator.apache.org/api/python/profiler/profiler.html.
32. PyTorch Autograd, https://pytorch.org/docs/stable/autograd.html.
33. strace, https://strace.io/.
34. DTrace, http://dtrace.org/blogs/.
35. nvprof, https://docs.nvidia.com/cuda/profiler-users-guide/index.html.
36. NVIDIA Visual Profiler, https://docs.nvidia.com/cuda/profiler-users-guide/index.html#visual.
37. Nsight, https://developer.nvidia.com/tools-overview.
38. Intel VTune, https://software.intel.com/en-us/vtune.
39. A collective knowledge workflow for collaborative research into multi-objective autotuning and machine learning techniques, https://github.com/ctuning/ck.
40. Nix Package Manager, https://nixos.org/nix/.
41. Spack Package Manager, https://spack.io.
42. Docker, https://www.docker.com.