Sparsify Hugging Face BERT for Better CPU Performance & Smaller File Size

Neural Magic · Published in Deep Sparse · Oct 8, 2021 · 2 min read

Get Started: Sparsify Hugging Face BERT Using Your Data

12-layer BERT performance comparisons
12-layer BERT size comparisons

You can replicate the performance and compression results mentioned in the video with your own data using Neural Magic’s open source and freely available tools.

Visit our BERT Getting Started page in the SparseZoo and:

  1. Ensure that you have a correct setup and that the performance results are compelling by running a quick benchmarking exercise. You can find the code here; a minimal benchmarking sketch also follows this list.
  2. To run with your own data, follow a recipe that encodes the transferable hyperparameters needed to create sparse models. You will create a “teacher” model pre-trained on your dataset and then distill its knowledge down to the pruned “student” BERT model on the same dataset (see the recipe sketch after this list).
  3. Export the “student” model to deploy on your CPU hardware using ONNX-compatible inference engines such as DeepSparse. To achieve the performance mentioned in the video, we encourage you to use the freely available DeepSparse Engine, which is explicitly engineered to accelerate sparsified models (an export sketch follows this list).
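
For step 1, here is a minimal benchmarking sketch using the DeepSparse Python API (deepsparse.compile_model). The SparseZoo stub, batch size, sequence length, and input ordering are illustrative assumptions; swap in the sparsified BERT variant and settings you actually want to measure.

```python
import time
import numpy as np
from deepsparse import compile_model

batch_size, seq_len = 1, 128

# Illustrative SparseZoo stub -- substitute the stub (or a local ONNX path) for the
# sparsified BERT variant you want to test from the BERT Getting Started page.
model_path = "zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned-aggressive_98"

engine = compile_model(model_path, batch_size=batch_size)

# Typical BERT ONNX inputs: token ids, attention mask, segment ids.
# The expected input order can vary by export; adjust to match your model.
inputs = [
    np.random.randint(0, 30000, size=(batch_size, seq_len), dtype=np.int64),  # input_ids
    np.ones((batch_size, seq_len), dtype=np.int64),   # attention_mask
    np.zeros((batch_size, seq_len), dtype=np.int64),  # token_type_ids
]

# Warm up, then time repeated forward passes.
for _ in range(10):
    engine.run(inputs)

iterations = 100
start = time.perf_counter()
for _ in range(iterations):
    engine.run(inputs)
elapsed = time.perf_counter() - start
print(f"~{iterations / elapsed:.1f} items/sec at batch size {batch_size}")
```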
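
For step 2, the sketch below illustrates one way a SparseML recipe can be attached to a standard PyTorch fine-tuning setup via ScheduledModifierManager. The model name, recipe path, learning rate, and steps_per_epoch are placeholders, and the distillation loss against your dense “teacher” is omitted; the recipes and example scripts linked from the Getting Started page cover the full flow.

```python
import torch
from transformers import AutoModelForSequenceClassification
from sparseml.pytorch.optim import ScheduledModifierManager

# Placeholder model/optimizer -- in practice, start from your dense "teacher" checkpoint
# and fine-tune the pruned "student" on the same dataset with a distillation loss.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

steps_per_epoch = 1000  # placeholder: typically len(train_dataloader)

# Load the pruning/transfer recipe downloaded from the SparseZoo and wrap the optimizer
# so the sparsity schedule is applied automatically during training.
manager = ScheduledModifierManager.from_yaml("recipe.yaml")
optimizer = manager.modify(model, optimizer, steps_per_epoch=steps_per_epoch)

# ... run your usual Hugging Face training loop here ...

manager.finalize(model)  # clean up the manager's hooks once training completes
```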
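
For step 3, here is a sketch of the ONNX export, assuming SparseML's ModuleExporter accepts a dictionary sample batch for the Hugging Face model; the checkpoint path, output directory, and input shapes are placeholders.

```python
import torch
from sparseml.pytorch.utils import ModuleExporter
from transformers import AutoModelForSequenceClassification

# Placeholder: load the fine-tuned "student" checkpoint produced in the previous step.
model = AutoModelForSequenceClassification.from_pretrained("path/to/your-finetuned-student")

# Placeholder sample batch matching the sequence length you trained with.
sample_batch = {
    "input_ids": torch.zeros(1, 128, dtype=torch.long),
    "attention_mask": torch.ones(1, 128, dtype=torch.long),
    "token_type_ids": torch.zeros(1, 128, dtype=torch.long),
}

exporter = ModuleExporter(model, output_dir="onnx_export")
exporter.export_onnx(sample_batch=sample_batch)

# The exported onnx_export/model.onnx can then be compiled with DeepSparse,
# as in the benchmarking sketch above.
```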

No matter where you are in the process, if you run into any issues, we are here to help. Join our Slack or Discourse forums to get direct access to our engineering and support teams.

Originally published at https://neuralmagic.com on October 8, 2021.
