How to Access Free Open-Source LLMs Like LLaMA 3 from Hugging Face Using Python API: Step-by-Step Guide

Yash Paddalwar

What is an LLM?

A Large Language Model (LLM) is an artificial intelligence model trained on a vast amount of text data, enabling it to understand, generate, and even interact with human language. These models have become essential tools in natural language processing (NLP) tasks such as answering questions, summarizing text, or generating creative content.

Expensive LLMs and Free Alternatives

Many popular LLMs, like GPT-4 or Claude, are paid services, which can be quite costly depending on your usage. However, there are excellent open-source alternatives available for free, such as LLaMA 3 and other models hosted on Hugging Face. These open-source models provide a cost-effective way to integrate advanced AI into your projects without worrying about huge expenses.

This article will guide you through the process of accessing these open-source LLMs from Hugging Face using Python, with step-by-step explanations.

Step-by-Step Implementation

1. Create an Account on Hugging Face

First, you’ll need an account on Hugging Face to access their hosted models. Here’s how to do it:

  • Go to the Hugging Face website.
  • Click on “Sign Up” and create an account.
  • After signing up, navigate to your profile -> Settings -> Access Tokens -> click “Create new token” to generate an API token. This token is essential for authenticating API requests, so copy it and keep it somewhere safe.

2. Search for a Free LLM Model

Once your account is ready, you can search for available LLMs on Hugging Face. For example, Meta’s LLaMA 3 is an open-source LLM you can use.

To find a model:

  • Use the search bar on the Hugging Face Models page and search for “llama”.
  • You’ll find models like meta-llama/Meta-Llama-3-8B-Instruct. This is the model we’ll use for our demonstration.
  • Note that Meta’s Llama models are gated: you’ll need to accept the license terms on the model’s page before your token can access it.

3. Get the Model Name/Path

Once you find the desired model, note the model path. In this case, the path for LLaMA 3 is meta-llama/Meta-Llama-3-8B-Instruct.
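This path is all you need to build the request URL: the serverless Inference API endpoint is simply a fixed base URL with the model path appended. A quick illustration:

API_BASE = "https://api-inference.huggingface.co/models/"
model_path = "meta-llama/Meta-Llama-3-8B-Instruct"

# The request URL is just the base endpoint plus the model path.
url = API_BASE + model_path
print(url)  # https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-8B-Instruct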

4. Python Code to Use the LLM via API

Now that you have the model path, you can interact with it via Hugging Face’s Inference API using Python. Below is a simple Python script that shows how to send a query to the model and receive a response.

Import the requests Library

We will use the requests library to interact with the Hugging Face API.

import requests

Define the API URL and Token

You will need the API URL for the specific model and the token you obtained from your Hugging Face account.

url = "https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-8B-Instruct"
token = "YOUR_HF_TOKEN"  # Replace with your Hugging Face token
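A note of caution: avoid committing a real token to source control. A safer pattern (a minimal sketch, assuming you have exported an HF_TOKEN environment variable in your shell) is to read it at runtime:

import os

# Assumes you have run: export HF_TOKEN="hf_..." in your shell beforehand.
token = os.environ["HF_TOKEN"]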

Function to Send a Query to the LLM

def llm(query):
    parameters = {
        "max_new_tokens": 5000,
        "temperature": 0.01,
        "top_k": 50,
        "top_p": 0.95,
        "return_full_text": False
    }

    prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>You are a helpful and smart assistant. You accurately provide answer to the provided user query.<|eot_id|><|start_header_id|>user<|end_header_id|> Here is the query: ```{query}```.
Provide precise and concise answer.<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

    headers = {
        'Authorization': f'Bearer {token}',
        'Content-Type': 'application/json'
    }

    prompt = prompt.replace("{query}", query)

    payload = {
        "inputs": prompt,
        "parameters": parameters
    }

    response = requests.post(url, headers=headers, json=payload)
    response_text = response.json()[0]['generated_text'].strip()

    return response_text

We define a function llm() that sends a query to the LLaMA 3 model. Let’s break it down step by step:

1. Function Definition: llm(query)

def llm(query):
  • The function llm(query) is defined with one input parameter, query, which represents the text (or question) you want to ask the LLM.

2. Defining Parameters for the Model

  parameters = {
      "max_new_tokens": 5000,
      "temperature": 0.01,
      "top_k": 50,
      "top_p": 0.95,
      "return_full_text": False
  }

Here, the parameters dictionary contains configuration options that influence how the LLM generates its response:

  • max_new_tokens: Limits the maximum number of tokens (sub-word units of text, not exactly words) the model can generate in its response. Setting this to 5000 allows the model to generate up to 5000 tokens.
  • temperature: Controls the randomness of the output. A low value like 0.01 makes the model’s responses more predictable and deterministic.
  • top_k: Limits the number of token choices the model considers at each step, which narrows down the possible responses.
  • top_p: A probability-based method to filter which tokens can be chosen by the model, balancing randomness and meaningfulness in output.
  • return_full_text: If set to True, the response includes your prompt followed by the generated text. Since it is False, only the newly generated part of the response is returned.
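To get a feel for these settings, you can keep two presets and swap between them. The first matches the values used in this guide; the second is an illustrative variant of my own with a higher temperature for more varied output:

# Near-deterministic settings, good for factual answers (used in this guide).
precise = {
    "max_new_tokens": 5000,
    "temperature": 0.01,
    "top_k": 50,
    "top_p": 0.95,
    "return_full_text": False
}

# A more creative preset: a higher temperature produces more varied phrasing.
creative = {**precise, "temperature": 0.8}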

3. Formatting the Prompt

  prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>You are a helpful and smart assistant. You accurately provide answer to the provided user query.<|eot_id|><|start_header_id|>user<|end_header_id|> Here is the query: ```{query}```.
Provide precise and concise answer.<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

The prompt is the instruction sent to the LLM. It contains:

  • System message: This sets the role of the AI. Here, the system instructs the model to act as a “helpful and smart assistant.”
  • User message: The user’s input is inserted into the prompt via the {query} placeholder, which is filled in dynamically (see step 5 below).
  • Assistant message: The assistant’s response comes after this.
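If you plan to reuse this template, it can be cleaner to assemble it with a small helper instead of hand-editing the string. Here is a minimal sketch (build_llama3_prompt is a hypothetical helper of my own, built from the same special tokens shown above):

def build_llama3_prompt(system_message, user_message):
    # Assemble the Llama 3 chat template from its special tokens.
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>"
        f"{system_message}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>"
    )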

4. Preparing the Request Headers

  headers = {
      'Authorization': f'Bearer {token}',
      'Content-Type': 'application/json'
  }

Here, we define the headers required for the API request:

  • Authorization: This contains the Hugging Face API token in the format Bearer {token}. The token variable must hold the actual API token from your Hugging Face account.
  • Content-Type: The content type is set to application/json, as we are sending JSON data in our API request.

5. Replacing the Placeholder in the Prompt

  prompt = prompt.replace("{query}", query)

This line replaces the {query} placeholder in the prompt with the actual user input (query), creating a personalized instruction for the model.

6. Preparing the Payload for the API Request

  payload = {
      "inputs": prompt,
      "parameters": parameters
  }

The payload is the actual data sent to the API. It consists of:

  • inputs: The fully formatted prompt that includes the user’s query and system instructions.
  • parameters: The generation settings (like max tokens, temperature) that define how the model should behave when generating the response.

7. Sending the API Request

  response = requests.post(url, headers=headers, json=payload)

This line sends an HTTP POST request to the Hugging Face model API using the requests.post() function:

  • url: This is the endpoint for the model, which points to the specific LLaMA 3 model hosted on Hugging Face.
  • headers: Includes the authorization token and content type.
  • json: The payload (prompt and parameters) is sent in JSON format.
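One practical caveat: the serverless API sometimes returns an error status while the model is being loaded onto a server. A more defensive version of this call might look like the following sketch (the retry logic and status check are my additions, not part of the original code):

import time

response = requests.post(url, headers=headers, json=payload)
if response.status_code == 503:
    # The model may still be loading on Hugging Face's side; wait, then retry once.
    time.sleep(20)
    response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()  # Raise an exception for any remaining HTTP error.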

8. Extracting the Response

  response_text = response.json()[0]['generated_text'].strip()

Once the API responds, this line extracts the generated_text from the JSON response:

  • response.json() parses the API's JSON response into a Python object (for a successful call, a list of dictionaries).
  • [0] accesses the first response item in case there are multiple (Hugging Face models typically return an array).
  • ['generated_text'] accesses the text generated by the model.
  • .strip() removes any leading or trailing whitespace from the result.
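Note that when a request fails, the API typically returns a JSON object with an "error" key instead of a list, so the indexing above would raise an exception. A defensive variant (my own addition, assuming error responses take that shape):

data = response.json()
if isinstance(data, dict) and "error" in data:
    # Failed requests return a dict such as {"error": "..."} rather than a list.
    raise RuntimeError(f"Hugging Face API error: {data['error']}")
response_text = data[0]["generated_text"].strip()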

9. Returning the Final Answer

  return response_text

Finally, the function returns the model’s generated text, which is the answer to the user’s query.

Example Usage

print(llm('write a python program to generate fibonacci series'))

The function will:

  • Send this query to the LLaMA 3 model on Hugging Face.
  • Get a response with Python code to generate a Fibonacci series.
  • Print the model’s response.

When you run this, the model will return a Python program that generates the Fibonacci series.


Full Code

Here is the full code block for reference:

import requests

url = "https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-8B-Instruct"
token = "YOUR_HF_TOKEN"  # Replace with your Hugging Face token

def llm(query):
    parameters = {
        "max_new_tokens": 5000,
        "temperature": 0.01,
        "top_k": 50,
        "top_p": 0.95,
        "return_full_text": False
    }

    prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>You are a helpful and smart assistant. You accurately provide answer to the provided user query.<|eot_id|><|start_header_id|>user<|end_header_id|> Here is the query: ```{query}```.
Provide precise and concise answer.<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

    headers = {
        'Authorization': f'Bearer {token}',
        'Content-Type': 'application/json'
    }

    prompt = prompt.replace("{query}", query)

    payload = {
        "inputs": prompt,
        "parameters": parameters
    }

    response = requests.post(url, headers=headers, json=payload)
    response_text = response.json()[0]['generated_text'].strip()

    return response_text

print(llm('write a python program to generate fibonacci series'))
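As an alternative, the official huggingface_hub library wraps the same endpoint and saves some boilerplate. A minimal sketch (assuming the library is installed with pip install huggingface_hub):

from huggingface_hub import InferenceClient

client = InferenceClient(model="meta-llama/Meta-Llama-3-8B-Instruct", token="YOUR_HF_TOKEN")

# text_generation sends a raw prompt string, like the requests-based code above.
answer = client.text_generation(
    "write a python program to generate fibonacci series",
    max_new_tokens=500,
    temperature=0.01,
)
print(answer)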

Conclusion

Using open-source models like LLaMA 3 from Hugging Face allows you to leverage the power of large language models for free. In this guide, we walked through setting up an account, finding a model, and using Python code to send queries to the model and retrieve intelligent responses. This is a cost-effective way to integrate cutting-edge AI into your projects.

Follow Me for More Updates!

For more tutorials and updates on the latest in AI and technology, follow me on:

Instagram: https://www.instagram.com/at_a_glance_official/

YouTube: https://www.youtube.com/@ataglanceofficial

Happy coding!

— Yash Paddalwar
