How to Access Free Open-Source LLMs Like LLaMA 3 from Hugging Face Using Python API: Step-by-Step Guide
What is an LLM?
A Large Language Model (LLM) is an artificial intelligence model trained on a vast amount of text data, enabling it to understand, generate, and even interact with human language. These models have become essential tools in natural language processing (NLP) tasks such as answering questions, summarizing text, or generating creative content.
Expensive LLMs and Free Alternatives
Many popular LLMs, like GPT-4 or Claude, are paid services, which can be quite costly depending on your usage. However, there are excellent open-source alternatives available for free, such as LLaMA 3 and other models hosted on Hugging Face. These open-source models provide a cost-effective way to integrate advanced AI into your projects without worrying about huge expenses.
This article will guide you through the process of accessing these open-source LLMs from Hugging Face using Python, with step-by-step explanations.
Step-by-Step Implementation
1. Create an Account on Hugging Face
First, you’ll need an account on Hugging Face to access their hosted models. Here’s how to do it:
- Go to the Hugging Face website.
- Click on “Sign Up” and create an account.
- After signing up, navigate to your profile -> Settings -> Access Tokens -> click “Create new token” to generate an API token. This token is required to authenticate your API requests.
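Once you have a token, it is good practice to keep it out of your source code. Here is a minimal sketch that reads it from an environment variable instead; the variable name `HF_TOKEN` and the helper name are my own choices, not a Hugging Face requirement:

```python
import os

def get_token(env_var="HF_TOKEN"):
    """Read the Hugging Face token from an environment variable.

    Keeps the secret out of source control and raises a clear error if unset.
    """
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your Hugging Face token")
    return token
```

You would then export the variable in your shell (e.g. `export HF_TOKEN=hf_...`) before running your script.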
2. Search for a Free LLM Model
Once your account is ready, you can search for available LLMs on Hugging Face. For example, Meta’s LLaMA 3 is an open-source LLM you can use. Note that Meta’s Llama models are gated: you must request access and accept the license on the model page before the API will serve them to your account.
To find a model:
- Visit the Hugging Face model hub.
- Type the name of the model in the search bar (e.g., “Llama 3”).
- You’ll find models like meta-llama/Meta-Llama-3-8B-Instruct. This is the model we’ll use for our demonstration.
3. Get the Model Name/Path
Once you find the desired model, note the model path. In this case, the path for LLaMA 3 is meta-llama/Meta-Llama-3-8B-Instruct.
4. Python Code to Use the LLM via API
Now that you have the model path, you can interact with it via Hugging Face’s API using Python. Below is a simple Python code that shows how to send a query to the model and receive a response.
Import the `requests` Library
We will use the `requests` library to interact with the Hugging Face API.

```python
import requests
```
Define the API URL and Token
You will need the API URL for the specific model and the token you obtained from your Hugging Face account.
```python
url = "https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-8B-Instruct"
token = "hf_xxxxxxxxxxxxxxxxxxxx"  # Replace with your Hugging Face token
```
Function to Send a Query to the LLM
```python
def llm(query):
    # Generation settings for the model
    parameters = {
        "max_new_tokens": 5000,
        "temperature": 0.01,
        "top_k": 50,
        "top_p": 0.95,
        "return_full_text": False
    }
    # Llama 3 chat template with a {query} placeholder
    prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>You are a helpful and smart assistant. You accurately provide answer to the provided user query.<|eot_id|><|start_header_id|>user<|end_header_id|> Here is the query: ```{query}```.
Provide precise and concise answer.<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""
    headers = {
        'Authorization': f'Bearer {token}',
        'Content-Type': 'application/json'
    }
    prompt = prompt.replace("{query}", query)
    payload = {
        "inputs": prompt,
        "parameters": parameters
    }
    response = requests.post(url, headers=headers, json=payload)
    response_text = response.json()[0]['generated_text'].strip()
    return response_text
```
We define a function `llm()` that sends a query to the LLaMA 3 model. Let’s break it down step by step:
1. Function Definition: `llm(query)`

```python
def llm(query):
```

- The function `llm(query)` is defined with one input parameter, `query`, which represents the text (or question) you want to ask the LLM.
2. Defining Parameters for the Model

```python
parameters = {
    "max_new_tokens": 5000,
    "temperature": 0.01,
    "top_k": 50,
    "top_p": 0.95,
    "return_full_text": False
}
```
Here, the `parameters` dictionary contains configuration options that influence how the LLM generates its response:
- max_new_tokens: Limits the maximum number of tokens (word pieces, not whole words) the model can generate in its response. Setting this to 5000 allows up to 5000 new tokens, though the hosted Inference API may enforce a lower cap.
- temperature: Controls the randomness of the output. A low value like 0.01 makes the model’s responses more predictable and deterministic.
- top_k: Limits the number of candidate tokens the model considers at each step, which narrows down the possible responses.
- top_p: Nucleus sampling; keeps only the most probable tokens whose cumulative probability reaches this threshold, balancing randomness and coherence in the output.
- return_full_text: If set to `True`, the response includes the prompt together with the generated text. Since it is `False`, only the newly generated part of the response is returned.
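To build intuition for how `top_k` and `top_p` interact, here is a toy sketch that applies both filters to a small probability table. This is illustrative only; real implementations operate on the model’s logits at each decoding step, not on a hand-written dictionary:

```python
def sample_filter(probs, top_k=50, top_p=0.95):
    """Illustrative top-k / top-p (nucleus) filtering over a toy distribution.

    probs: dict mapping token -> probability. Returns the surviving tokens
    with their probabilities renormalized to sum to 1.
    """
    # top-k: keep only the k most probable tokens
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for token, p in items:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}
```

With a very low temperature on top of this filtering, the model almost always picks the single most probable surviving token, which is why the article’s settings produce near-deterministic output.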
3. Formatting the Prompt
```python
prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>You are a helpful and smart assistant. You accurately provide answer to the provided user query.<|eot_id|><|start_header_id|>user<|end_header_id|> Here is the query: ```{query}```.
Provide precise and concise answer.<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""
```
The `prompt` is the instruction sent to the LLM. It follows Llama 3’s chat template and contains:
- System message: This sets the role of the AI. Here, the system instructs the model to act as a “helpful and smart assistant.”
- User message: The user’s input is inserted into the prompt via the `{query}` placeholder, which is filled in dynamically.
- Assistant message: The assistant’s response is generated after this final header.
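The template above can also be assembled from its special tokens with a small helper. This is a sketch of my own, not part of the article’s code; it simply makes the structure of the Llama 3 chat format easier to see:

```python
def build_prompt(query, system="You are a helpful and smart assistant."):
    """Assemble a Llama 3 chat prompt from its special tokens (sketch)."""
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>"
        f"{system}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>"
        f"{query}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>"
    )
```

The prompt deliberately ends at the assistant header, so the model’s continuation is the assistant’s reply.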
4. Preparing the Request Headers
```python
headers = {
    'Authorization': f'Bearer {token}',
    'Content-Type': 'application/json'
}
```
Here, we define the headers required for the API request:
- Authorization: This contains the Hugging Face API token in the format `Bearer {token}`. You must set `token` to your actual Hugging Face API key.
- Content-Type: The content type is set to `application/json`, as we are sending JSON data in our API request.
5. Replacing the Placeholder in the Prompt
```python
prompt = prompt.replace("{query}", query)
```
This line replaces the `{query}` placeholder in the `prompt` with the actual user input (`query`), creating a personalized instruction for the model.
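For example, the substitution behaves like this (the template here is shortened for illustration):

```python
# Shortened stand-in for the article's full chat template
template = "Here is the query: {query}. Provide precise and concise answer."
filled = template.replace("{query}", "What is an LLM?")
# The placeholder is gone and the user's question is inlined
```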
6. Preparing the Payload for the API Request
```python
payload = {
    "inputs": prompt,
    "parameters": parameters
}
```
The payload is the actual data sent to the API. It consists of:
- inputs: The fully formatted `prompt` that includes the user’s query and system instructions.
- parameters: The generation settings (like max tokens, temperature) that define how the model should behave when generating the response.
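Under the hood, passing `json=payload` to `requests.post` serializes this dictionary to a JSON string, roughly equivalent to (the `"inputs"` value here is a stand-in, not the real prompt):

```python
import json

payload = {
    "inputs": "<formatted prompt>",  # stands in for the real prompt text
    "parameters": {"max_new_tokens": 5000, "temperature": 0.01},
}
body = json.dumps(payload)  # requests does this for you when you pass json=payload
```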
7. Sending the API Request
```python
response = requests.post(url, headers=headers, json=payload)
```
This line sends an HTTP POST request to the Hugging Face model API using the `requests.post()` function:
- url: This is the endpoint for the model, which points to the specific LLaMA 3 model hosted on Hugging Face.
- headers: Includes the authorization token and content type.
- json: The payload (prompt and parameters) is sent in JSON format.
8. Extracting the Response
```python
response_text = response.json()[0]['generated_text'].strip()
```
Once the API responds, this line extracts the `generated_text` from the JSON response:
- `response.json()` converts the API’s response from JSON into Python objects.
- `[0]` accesses the first item; the Inference API returns a list of results.
- `['generated_text']` accesses the text generated by the model.
- `.strip()` removes any leading or trailing whitespace from the result.
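This indexing assumes a successful response. The serverless Inference API can instead return an error object (for example, while the model is still loading), so a slightly more defensive extraction, sketched here without making a network call, might look like:

```python
def extract_generated_text(response_json):
    """Pull 'generated_text' out of an Inference API style response, with basic checks."""
    if isinstance(response_json, dict) and "error" in response_json:
        # The API returns {'error': ...} e.g. while the model is loading
        raise RuntimeError(f"API error: {response_json['error']}")
    return response_json[0]["generated_text"].strip()
```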
9. Returning the Final Answer
```python
return response_text
```
Finally, the function returns the model’s generated text, which is the answer to the user’s query.
Example Usage
```python
print(llm('write a python program to generate fibonacci series'))
```
The function will:
- Send this query to the LLaMA 3 model on Hugging Face.
- Get a response with Python code to generate a Fibonacci series.
- Print the model’s response.
When you run this, the model will return a Python program that generates the Fibonacci series.
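The exact code the model returns will vary from run to run, so the snippet below is only a typical example of the kind of Fibonacci generator it might produce, not the model’s actual output:

```python
def fibonacci(n):
    """Return the first n Fibonacci numbers."""
    sequence = []
    a, b = 0, 1
    for _ in range(n):
        sequence.append(a)
        a, b = b, a + b
    return sequence

print(fibonacci(10))
```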
Full Code
Here is the full code block for reference:
```python
import requests

url = "https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-8B-Instruct"
token = "hf_xxxxxxxxxxxxxxxxxxxx"  # Replace with your Hugging Face token

def llm(query):
    # Generation settings for the model
    parameters = {
        "max_new_tokens": 5000,
        "temperature": 0.01,
        "top_k": 50,
        "top_p": 0.95,
        "return_full_text": False
    }
    # Llama 3 chat template with a {query} placeholder
    prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>You are a helpful and smart assistant. You accurately provide answer to the provided user query.<|eot_id|><|start_header_id|>user<|end_header_id|> Here is the query: ```{query}```.
Provide precise and concise answer.<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""
    headers = {
        'Authorization': f'Bearer {token}',
        'Content-Type': 'application/json'
    }
    prompt = prompt.replace("{query}", query)
    payload = {
        "inputs": prompt,
        "parameters": parameters
    }
    response = requests.post(url, headers=headers, json=payload)
    response_text = response.json()[0]['generated_text'].strip()
    return response_text

print(llm('write a python program to generate fibonacci series'))
```
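One practical note: the serverless endpoint sometimes responds with an error while the model is being loaded into memory, so the first request after a period of inactivity can fail. A simple retry wrapper, sketched here with names of my own choosing, could look like:

```python
import time

def with_retries(call, attempts=3, delay=2.0):
    """Retry a zero-argument callable, backing off between attempts.

    Useful when the hosted model needs a moment to load before serving.
    """
    last_err = None
    for i in range(attempts):
        try:
            return call()
        except RuntimeError as err:
            last_err = err
            time.sleep(delay * (i + 1))  # linear backoff: 2s, 4s, ...
    raise last_err
```

You would then wrap the call, e.g. `answer = with_retries(lambda: llm("What is an LLM?"))`, assuming `llm` raises a `RuntimeError` on API errors.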
Conclusion
Using open-source models like LLaMA 3 from Hugging Face allows you to leverage the power of large language models for free. In this guide, we walked through setting up an account, finding a model, and using Python code to send queries to the model and retrieve intelligent responses. This is a cost-effective way to integrate cutting-edge AI into your projects.
Follow Me for More Updates!
For more tutorials and updates on the latest in AI and technology, follow me on:
YouTube: https://www.youtube.com/@ataglanceofficial
Instagram: https://www.instagram.com/at_a_glance_official/
Happy coding!
— Yash Paddalwar