Deploying Custom LLMs: A Hugging Face End-to-End Guide


If you are new to the world of LLMs, I suggest reading the previous article, “Exploring LLM Platforms and Models: Unpacking OpenAI, Azure, and Hugging Face”, before going further.

Custom LLMs (Large Language Models) have become indispensable tools in the field of Natural Language Processing (NLP). They let developers and researchers tailor language models to their specific needs. In this guide, we’ll look at three popular options for deploying custom LLMs and then walk through the process of deploying a model as a Hugging Face Inference Endpoint. We will cover the other options in upcoming articles.

Overview of Deployment Options

Before we embark on our journey to deploy custom LLMs, it’s essential to understand the options available:

  1. Hugging Face Inference Endpoints: Hugging Face, known for its vast model repository, offers not only pre-trained models but also a user-friendly platform for deployment. It simplifies the deployment process, making it an excellent choice for experimentation and small-scale deployment.
  2. Amazon SageMaker: Amazon SageMaker, a part of AWS, provides a comprehensive machine-learning platform. It’s ideal for larger-scale deployments, offering additional capabilities for data preprocessing, training, and monitoring.
  3. Azure Machine Learning: Microsoft’s Azure ML platform is another robust option for deploying custom LLMs. It provides a cloud-based environment for building, training, and deploying machine learning models, including LLMs.

The Role of Hugging Face

Hugging Face is a central hub for all things related to NLP and language models. It plays a pivotal role in both sourcing models and facilitating their deployment.

Sourcing Models from Hugging Face

Before delving into deployment specifics, let’s discuss Hugging Face’s significance as a model source:

  1. Pre-Trained Models: Hugging Face’s Model Hub hosts an extensive collection of pre-trained models. Models such as Falcon-40B and LLaMA-7B serve as foundational building blocks for custom LLM development.
  2. Custom Fine-Tuning: Researchers and developers can leverage pre-trained models from Hugging Face and fine-tune them on domain-specific data. This process allows for the creation of custom LLMs tailored to unique NLP tasks.

Deploying the LLM as a Hugging Face Inference Endpoint

Here, we deploy falcon-40b-instruct, a model hosted on the Hugging Face Hub.

  1. To start, you need a Hugging Face (HF) account, which you can create on the Hugging Face sign-up page. The sign-up process is straightforward: provide the email address for the account and a password of your choice.
Hugging Face sign-up and login screen

  2. Once the account is created, log in with the credentials you provided during registration. On the homepage, search for the model you need and select it to view its details.

Hugging Face home screen for searching required models
Hugging Face selected model details page

  3. The model page has an option called Deploy, which lists the available deployment options. Since this guide focuses on Hugging Face Inference Endpoints, choose that option and proceed.

  4. Selecting Inference Endpoints takes you to a page where you can create an inference endpoint for the selected model. The “Create Inference Endpoint” page looks as shown below:

Create a new Inference Endpoint page

There are several fields to fill in and options to select. This guide goes through the steps to deploy tiiuae/falcon-40b-instruct for text generation.

  • Enter the Hugging Face repository ID and your desired endpoint name.
  • Select your cloud provider and region.
  • Define the Security Level for the Endpoint.
  • Create your Endpoint by clicking Create Endpoint. By default, your Endpoint is created with a large GPU (4x Nvidia Tesla T4). The cost estimate assumes the Endpoint will be up for an entire month, and does not take autoscaling into account.
  • Wait for the Endpoint to build, initialize, and run, which can take 5 to 10 minutes.
  • Test your Endpoint by calling the API URL generated by the deployment, authenticating with your Hugging Face access token.
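
The last step, testing the Endpoint, can be sketched in Python using only the standard library. This is a minimal sketch, not a definitive client. The URL and token below are hypothetical placeholders, the helper names `build_request` and `query` are made up for illustration, and the payload shape (`inputs` plus a `parameters` object with `max_new_tokens`) is an assumption based on the usual format for text-generation endpoints. Check the endpoint's overview page for the exact URL and request format.

```python
import json
import urllib.request

# Hypothetical placeholders: copy the real URL from the endpoint's overview
# page and the access token from your Hugging Face account settings.
ENDPOINT_URL = "https://example.invalid/falcon-40b-instruct"
HF_TOKEN = "hf_xxx"


def build_request(endpoint_url: str, token: str, prompt: str,
                  max_new_tokens: int = 64) -> urllib.request.Request:
    """Build (but do not send) a POST request for a text-generation endpoint."""
    # Assumed payload shape: the prompt under "inputs", generation options
    # under "parameters".
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # The access token is sent as a bearer token.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def query(endpoint_url: str, token: str, prompt: str, timeout: float = 60.0):
    """Send the prompt to the endpoint and return the decoded JSON response."""
    request = build_request(endpoint_url, token, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Once the Endpoint is running, a call such as `query(ENDPOINT_URL, HF_TOKEN, "Write a short product description.")` should return the generated text as JSON. Building the request separately from sending it makes it easy to inspect the headers and body before any network traffic occurs.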

In conclusion, this guide gave an overview of deploying Hugging Face models, focusing on creating Inference Endpoints for text generation. For more in-depth coverage of deploying Hugging Face models on cloud platforms such as Azure and AWS, stay tuned for future articles, where we will explore these topics in greater detail.

Author: Sriram C
