How to export Hugging Face's 🤗 NLP Transformers models to ONNX and use the exported model with the appropriate Transformers pipeline.

Overview

Hugging Face 🤗 NLP Transformers pipelines with ONNX

ONNX is an open format for representing machine learning models. It is portable, open source, and a great way to boost inference speed without sacrificing accuracy.

I found a lot of tutorials and articles about ONNX benchmarks, but none of them presented a convenient way to use it for real-world NLP tasks. I have also answered a lot of questions about ONNX and the best way to use it for NLP on Hugging Face's Discord server.

This is why I decided to write this blog post: I want to help you get the best possible results using ONNX with awesome Transformers pipelines.

This project is linked to the following Medium blog post: NLP Transformers 🤗 pipelines with ONNX: How to build real-world NLP applications with ONNX, not just for benchmarking tensors.

How to use it

This repository contains a notebook to show how to export Hugging Face's NLP Transformers models to ONNX and how to use the exported model with the appropriate Transformers pipeline.

This is the fastest way I have found to use ONNX models with the awesome Transformers pipelines in production, without copying and pasting hundreds of lines of code.

The workflow is as follows:

  • Export the model to ONNX.
  • Create a new pipeline that inherits from the Transformers pipeline.
  • Override the pipeline's task class to use the exported model.
  • Run the pipeline with ONNX.

All steps are explained in the notebook. Enjoy! 🤗
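
As an illustration, here is a minimal sketch of the whole workflow for a token-classification (NER) model. It is not a copy of the notebook: the checkpoint name (dslim/bert-base-NER), the output path, and the simplified _forward override are placeholders chosen for the example, and it assumes a transformers version that still ships convert_graph_to_onnx. The notebook also overrides preprocess so the tokenizer explicitly returns the attention mask the ONNX graph expects.

    from pathlib import Path

    import torch
    from onnxruntime import GraphOptimizationLevel, InferenceSession, SessionOptions
    from transformers import (
        AutoModelForTokenClassification,
        AutoTokenizer,
        TokenClassificationPipeline,
    )
    from transformers.convert_graph_to_onnx import convert

    model_name = "dslim/bert-base-NER"           # placeholder checkpoint
    onnx_path = Path("onnx/bert-base-NER.onnx")  # placeholder output path

    # 1. Export the model to ONNX.
    convert(framework="pt", model=model_name, output=onnx_path, opset=12, pipeline_name="ner")

    # Load the exported graph in an ONNX Runtime session.
    options = SessionOptions()
    options.graph_optimization_level = GraphOptimizationLevel.ORT_ENABLE_ALL
    session = InferenceSession(str(onnx_path), sess_options=options, providers=["CPUExecutionProvider"])

    # 2. & 3. Create a pipeline that inherits from the Transformers task pipeline and
    # override its forward pass so it runs the ONNX session instead of the PyTorch model.
    class OnnxTokenClassificationPipeline(TokenClassificationPipeline):
        def _forward(self, model_inputs):
            special_tokens_mask = model_inputs.pop("special_tokens_mask")
            offset_mapping = model_inputs.pop("offset_mapping", None)
            sentence = model_inputs.pop("sentence")
            # Feed numpy inputs to ONNX Runtime, then hand the logits back as a torch
            # tensor so the pipeline's post-processing works unchanged.
            inputs = {k: v.cpu().detach().numpy() for k, v in model_inputs.items()}
            logits = session.run(None, input_feed=inputs)[0]
            return {
                "logits": torch.tensor(logits),
                "special_tokens_mask": special_tokens_mask,
                "offset_mapping": offset_mapping,
                "sentence": sentence,
                **model_inputs,
            }

    # 4. Run the pipeline with ONNX. The PyTorch model is still loaded, but only for its
    # config (labels) and post-processing; inference goes through the ONNX session.
    onnx_pipeline = OnnxTokenClassificationPipeline(
        task="ner",
        model=AutoModelForTokenClassification.from_pretrained(model_name),
        tokenizer=AutoTokenizer.from_pretrained(model_name),
        framework="pt",
        aggregation_strategy="simple",
    )
    print(onnx_pipeline("My name is Thomas and I live in Paris."))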

Support

If you have any questions or face any issues, please open an issue on GitHub.

I'm planning to add more examples and support for other NLP tasks. Let me know if you have any ideas!

Comments
  • Getting incorrect output

    Hey @ChainYo, I tried your script with my custom fine-tuned model, but the output is not as expected. It's predicting for each token instead. Here are some sample outputs:

    [{'entity_group': 'LABEL_6', 'score': 0.14850605, 'word': 'ab', 'start': 0, 'end': 2},
     {'entity_group': 'LABEL_0', 'score': 0.12011145, 'word': '##hishe', 'start': 2, 'end': 7},
     {'entity_group': 'LABEL_6', 'score': 0.11439563, 'word': '##k kumar', 'start': 7, 'end': 14},
     {'entity_group': 'LABEL_13', 'score': 0.11188321, 'word': 'education', 'start': 16, 'end': 25},
     {'entity_group': 'LABEL_0', 'score': 0.11445558, 'word': '&', 'start': 26, 'end': 27},
     {'entity_group': 'LABEL_13', 'score': 0.10697147, 'word': 'credentials', 'start': 28, 'end': 39},
     {'entity_group': 'LABEL_9', 'score': 0.12449409, 'word': 'msc (', 'start': 40, 'end': 45},
     {'entity_group': 'LABEL_13', 'score': 0.123251475, 'word': 'information', 'start': 45, 'end': 56},
     {'entity_group': 'LABEL_0', 'score': 0.13867705, 'word': 'technology management', 'start': 57, 'end': 78},
     {'entity_group': 'LABEL_1', 'score': 0.11498813, 'word': ')', 'start': 78, 'end': 79},
     {'entity_group': 'LABEL_8', 'score': 0.12129795, 'word': 'from', 'start': 80, 'end': 84},
     {'entity_group': 'LABEL_6', 'score': 0.12780227, 'word': 'university of', 'start': 85, 'end': 98}]

    My ONNX export script is as follows:

    import torch
    from transformers import BertTokenizerFast, BertForTokenClassification
    from transformers.convert_graph_to_onnx import convert
    from pathlib import Path
    
    RESUME_NUM_LABELS = 14
    
    DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    MODEL_PATH =  'BertBaseUncased'                              #'YituTech/conv-bert-base'          #'bert-base-uncased'
    
    RESUME_STATE_DICT = torch.load("models/model-state-resume_14342_chunks_02-02.bin", map_location=DEVICE)
    TOKENIZER = BertTokenizerFast.from_pretrained('C:\\Users\\Ujjawal\\Downloads\\docker_util\\BertBaseUncased')
    MAX_LEN = 512
    
    resume_model = BertForTokenClassification.from_pretrained(
        MODEL_PATH, state_dict=RESUME_STATE_DICT['model_state_dict'], num_labels=RESUME_NUM_LABELS)
    resume_model.to(DEVICE)
    resume_model.eval()
    output=Path('C:\\Users\\Ujjawal\\Downloads\\docker_util\\output\\onnx\\ner_model_v2.onnx').absolute()
    convert(
        pipeline='ner',
        framework="pt",
        model=resume_model,
        tokenizer=TOKENIZER,
        output=output,
        opset=11,
    )
    

    Here is the inference script you provided:

    import torch
    from time import time
    from onnxruntime import (
        InferenceSession, SessionOptions, GraphOptimizationLevel
    )
    from transformers import (
        TokenClassificationPipeline, AutoTokenizer, AutoModelForTokenClassification,BertTokenizerFast,BertForTokenClassification
    )
    
    options = SessionOptions() # initialize session options
    options.graph_optimization_level = GraphOptimizationLevel.ORT_ENABLE_ALL
    
    session = InferenceSession(
        "output/onnx/ner_model_v1.onnx", sess_options=options, providers=["CPUExecutionProvider"]
    )
    
    # disable the session.run() fallback mechanism; this prevents a silent reset of the execution provider
    session.disable_fallback()
    
    class OnnxTokenClassificationPipeline(TokenClassificationPipeline):
    
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            
        
        def _forward(self, model_inputs):
            """
            Forward pass through the model. This method is not to be called by the user directly and is only used
            by the pipeline to perform the actual predictions.
    
            This is where we will define the actual process to do inference with the ONNX model and the session created
            before.
            """
    
            # This comes from the original implementation of the pipeline
            special_tokens_mask = model_inputs.pop("special_tokens_mask")
            offset_mapping = model_inputs.pop("offset_mapping", None)
            sentence = model_inputs.pop("sentence")
    
            inputs = {k: v.cpu().detach().numpy() for k, v in model_inputs.items()} # dict of numpy arrays
            outputs_name = session.get_outputs()[0].name # get the name of the output tensor
    
            logits = session.run(output_names=[outputs_name], input_feed=inputs)[0] # run the session
            logits = torch.tensor(logits) # convert to torch tensor to be compatible with the original implementation
    
            return {
                "logits": logits,
                "special_tokens_mask": special_tokens_mask,
                "offset_mapping": offset_mapping,
                "sentence": sentence,
                **model_inputs,
            }
    
        # We need to override the preprocess method because the ONNX model expects the attention mask
        # as an input, along with the embeddings.
        def preprocess(self, sentence, offset_mapping=None):
            truncation = True if self.tokenizer.model_max_length and self.tokenizer.model_max_length > 0 else False
            model_inputs = self.tokenizer(
                sentence,
                return_attention_mask=True, # This is the only difference from the original implementation
                return_tensors=self.framework,
                truncation=truncation,
                return_special_tokens_mask=True,
                return_offsets_mapping=self.tokenizer.is_fast,
            )
            if offset_mapping:
                model_inputs["offset_mapping"] = offset_mapping
    
            model_inputs["sentence"] = sentence
    
            return model_inputs
    
    RESUME_NUM_LABELS = 14
    DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    MODEL_PATH =  'BertBaseUncased'                              #'YituTech/conv-bert-base'          #'bert-base-uncased'
    
    RESUME_STATE_DICT = torch.load("models/model-state-resume_14342_chunks_02-02.bin", map_location=DEVICE)
    TOKENIZER = AutoTokenizer.from_pretrained('C:\\Users\\Ujjawal\\Downloads\\docker_util\\BertBaseUncased')
    MAX_LEN = 512
    
    resume_model = AutoModelForTokenClassification.from_pretrained(
        MODEL_PATH, state_dict=RESUME_STATE_DICT['model_state_dict'], num_labels=RESUME_NUM_LABELS)
    
    onnx_pipeline = OnnxTokenClassificationPipeline(
        task="ner", 
        model=resume_model,
        tokenizer=TOKENIZER,
        framework="pt",
        aggregation_strategy="simple",
    )
    
    text="Abhishek Kumar  Education & Credentials MSc (Information Technology Management) from University of Bradford, UK in 2017; secured 74% (Distinction) Senior Certificate in Computer Science & Engineering from University of Florida, US in 2013; secured 3.1/4.0  Bachelor of Technology in Computer Science Engineering from Jaypee University of Information & Technology, HP, India in 2012;"
    t1=time()
    res=onnx_pipeline(text)
    t2=time()
    print('result',res)
    print('Time Taken : ',(t2-t1))
    

    Can you guide me on where I'm going wrong? Thanks.

    opened by ujjawalcse 2
  • Saving pipeline

    First, thank you very much for this!

    There is one final piece that I suggest adding to your notebook, which is to save the pipeline itself to ONNX. I haven't seen this done anywhere for NLP, but I gather it is possible to save a pipeline and not just a model.

    The great thing about that is that any device could then load the pipeline and feed it text directly. As it is, the Hugging Face model itself gets saved and so can be run from, say, C#, but it also needs the tokenizer, and the Hugging Face tokenizers are available with Rust, Python, and Node.js interfaces only. It would be really convenient to save the entire pipeline so that it just takes text as input.

    opened by alunap 1
  • GPT2 text generation pipeline

    Hello, thank you for this tutorial. I have tried to modify the code to use the text-generation pipeline with the GPT-2 model. The problem is that vanilla PyTorch performs better than the ONNX-optimized models. This is true for my home setup and also on Colab Pro with T4 and P100 GPUs.

    I have also tried the text-generation pipeline in the https://github.com/AlekseyKorshuk/optimum-transformers library, but the results are similar: the ONNX performance is still slower.

    Do you have any idea what could be the problem?

    opened by C00reNUT 1
Owner
Thomas Chaigneau
Passionate ML Engineer | Docker aficionado 🐳