Natural Language Processing on Tesla P40

Anooshiravan Ahmadi
11 min read · Jul 5, 2022


Introduction

Natural Language Processing (NLP) is a subfield of computer science concerned with the interaction between humans and computers through natural human language. Sitting at the intersection of linguistics and artificial intelligence (AI), it offers potentials that were not available to us until a few years ago.

GPT-J-6B is a large NLP model with 6 billion parameters, built by EleutherAI [1] and open-sourced via the Hugging Face community [2].

The model has impressive capabilities, creativity, and power of prediction. However, loading a large NLP model like GPT-J-6B demands a challenging hardware setup and at least basic knowledge of its internals and inference methods.

In this article, I will explain the basic steps of building consumer-level hardware to load the model and how to interact with it. It is not meant to be a step-by-step tutorial; still, it may provide helpful suggestions even if you are already familiar with the concepts.

Running GPT-J-6B model on Nvidia Tesla P40

Loading the model on the CPU alone is impractical due to the loss of accuracy and the time it takes. The 6B model requires a significant number of CUDA cores to run at a productive speed, at least 35GB of CPU memory, and 23GB of GPU memory, which must be contiguous rather than fragmented.
The GPU memory must also be directly accessible to the CUDA cores, without interruptions from other video processing.
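As a rough sanity check on those numbers, the weight memory of a 6-billion-parameter model can be estimated from the bytes per parameter alone. This is a back-of-envelope sketch: activations, the KV cache, and framework overhead come on top, which is why the practical requirement is higher than the raw weight size.

```python
# Back-of-envelope memory estimate for a 6-billion-parameter model.
# Only the raw weights are counted here; activations and framework
# overhead push the real requirement higher.

PARAMS = 6_000_000_000

def weight_memory_gb(n_params: int, bytes_per_param: int) -> float:
    """Memory needed to hold the raw weights, in GiB."""
    return n_params * bytes_per_param / 1024**3

fp32 = weight_memory_gb(PARAMS, 4)  # float32: 4 bytes per parameter
fp16 = weight_memory_gb(PARAMS, 2)  # float16: 2 bytes per parameter

print(f"fp32 weights: {fp32:.1f} GiB")  # ~22.4 GiB
print(f"fp16 weights: {fp16:.1f} GiB")  # ~11.2 GiB
```

This explains why a 24GB card is the practical minimum for fp16 inference, and why fp32 would not fit at all.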

To my knowledge, this is the first time such a project has been demonstrated on an Nvidia Tesla P40 accelerator. Three CUDA-compatible accelerators are available on the used-products market, sold as 24GB GPUs; they are:

  • Nvidia Tesla K80 Kepler architecture, 24GB GDDR5 memory
  • Nvidia Tesla M40 Maxwell architecture, 24GB GDDR5 memory
  • Nvidia Tesla P40 Pascal architecture, 24GB GDDR5 memory [3]

A common mistake would be to try a Tesla K80 with 24GB of memory. It does not work with larger models like GPT-J-6B because the K80 is not a single GPU with a contiguous 24GB: it is two GPUs embedded on one board, each with 12GB of memory. In practice, a Tesla K80 is two GPUs sharing a single PCIe socket.

In other words, the Tesla K80 (2 x 12GB) does not provide 24GB of contiguous CUDA memory, which leaves the choice between M40 and P40. Both should be able to load large models. For this project I tried a refurbished P40 in combination with consumer PC hardware.

Refurbished Nvidia Tesla P40 accelerator

Hardware setup

Be aware that the Tesla P40 draws up to 250W of power, requires a PCIe 3.0 x16 socket, and requires “Above 4G Decoding” in BIOS. It has no video output and no cooling fan.

The accelerator is designed for dedicated GPU servers, such as an HPE DL380 Gen10 with a specific PCIe riser and cooling tunnel. It is not designed to run on a consumer PC mainboard.

After some research, I decided to try a mainboard with an Intel Z270 chipset and went with an Asus Z270-P full-ATX mainboard with an LGA1151 CPU socket; I then loaded it with an Intel i3 7200 CPU, 48GB of RAM, and a Kingston 500GB NVMe M.2 disk for persistent data.

It is possible to calculate the maximum power draw, but relying on that figure is risky. Therefore an 850W modular PSU (probably overpowered) feeds the mainboard. A decommissioned aluminum mining-rig frame serves as the system chassis.

A decommissioned mining rig frame

The first step is to test the PSU and confirm it works; a malfunctioning PSU can be the cause of otherwise “unexplainable” hardware failures. With the power switch on, the PSU can be jump-started by shorting the PS_ON pin (the green wire) to an adjacent ground pin (a black wire) on the ATX 24-pin connector.
All sockets are then tested for the correct voltages, following the pinout schemes, to ensure the PSU is in good shape.

Check your PSU before everything else

The next step was installing a refurbished Intel i3 7200 CPU (two cores, four threads) on the mainboard, along with 48GB of DDR4:

CPU and RAM on board
Installed PSU and mainboard on the aluminum rig

The mainboard has an onboard HDMI output which is used for normal video output. However, two problems still need to be solved:

1- Cooling: The Tesla P40 is passively cooled and designed to rely on a GPU server’s air tunnel.

2- Weight: It is a 2 kg brick. Without a proper bracket, it will sag to the right, exert pressure on the PCIe socket, and probably break it.

After watching some helpful videos, I decided to try a StarTech PCIe fan. I removed the PCIe metal bracket and cut a 2 cm opening for the Tesla power connector.

Note: the Tesla P40 power connector is a modular 8-pin CPU (EPS) connector, not a PCIe or GPU connector, and the voltage pinout is different. Connecting the Tesla to a PCIe or GPU modular output can be fatal for the device.

The result was better than expected. This particular cooling fan is a perfect fit for the Tesla P40, with no need for tape or glue. A Molex-to-modular adapter plugged into a PCIe socket on the PSU powers the fan:

Installed cooling fan on Tesla P40

Caution: Molex-to-modular adapter cables follow a variety of pinouts even when the socket appears to fit. A wrong voltage pinout can damage the mainboard and/or PSU. Always double-check that the adapter pinout matches the modular socket pinout on the PSU!

Modular PSU pinouts

To manage the weight, the factory P40 bracket is removed, rotated 90 degrees clockwise, and reinstalled horizontally. Now the device presses against the aluminum frame instead of the PCIe socket:

Bracket to control sideways pressure on PCIe socket

The reason not to raise the P40 on standard mining PCIe risers was the assumption that the PCIe bandwidth difference between a riser’s x4 link and the regular x16 would matter for NLP inference with large models. That assumption still needs to be benchmarked.

Large NLP models are loaded into CPU RAM and then offloaded to the GPU. That memory transfer may be slower when the GPU is attached over an x1 or x4 link. I am uncertain whether it significantly affects GPU processing or returning the NLP results over the PCIe bus.
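A back-of-envelope estimate illustrates why the link width could matter for the initial offload. Assuming roughly 12 GB of fp16 weights and a theoretical PCIe 3.0 peak of about 0.985 GB/s per lane (real-world throughput is lower, so these are optimistic lower bounds):

```python
# Rough transfer-time estimate for moving ~12 GB of fp16 weights
# from CPU RAM to the GPU over PCIe 3.0 at different lane widths.
# Bandwidth figures are theoretical peaks; real throughput is lower.

MODEL_GB = 12.0    # approximate fp16 weight size of GPT-J-6B
LANE_GBPS = 0.985  # PCIe 3.0 effective bandwidth per lane

def transfer_seconds(size_gb: float, lanes: int) -> float:
    """Best-case time to push `size_gb` over a link with `lanes` lanes."""
    return size_gb / (LANE_GBPS * lanes)

for lanes in (16, 4, 1):
    print(f"x{lanes:<2}: ~{transfer_seconds(MODEL_GB, lanes):.1f} s")
```

Even on an x1 riser the one-time load would take on the order of seconds, which suggests the penalty mostly affects startup rather than per-token generation; only a benchmark would settle it.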

Finished hardware configuration

Software configuration

After enabling “Above 4G Decoding” in Asus Z270 BIOS configuration, Windows 10 Pro is installed on the onboard NVMe M.2 disk, on a single partition:

The NVMe disk must always be installed in Socket 1 (the M2_1 socket)

I installed the latest version of the Nvidia drivers for the Tesla P40 [4]; after a reboot, the Windows 10 Device Manager shows the Tesla P40 as an available graphics card. I then installed MSI Afterburner [5] to monitor the GPU temperature in real time.

To ensure that Nvidia drivers are working properly and the GPU is recognized, use the nvidia-smi command line utility:
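For scripted checks, nvidia-smi can also emit machine-readable output via its --query-gpu and --format=csv options. Below is a minimal sketch that parses such a line to confirm the card exposes a single contiguous 24GB pool; the sample line is illustrative, not captured from this machine:

```python
# Parse a line of `nvidia-smi --query-gpu=name,memory.total
# --format=csv,noheader` output and check the memory pool size.
# The sample string below is illustrative only.

sample_output = "Tesla P40, 24576 MiB"

def parse_gpu_line(line: str) -> tuple[str, int]:
    """Return (gpu_name, total_memory_mib) from one CSV output line."""
    name, mem = (part.strip() for part in line.split(","))
    return name, int(mem.split()[0])

name, mem_mib = parse_gpu_line(sample_output)
print(name, mem_mib)
assert mem_mib >= 23 * 1024  # enough contiguous room for GPT-J-6B in fp16
```

A K80 would instead show up as two separate 12 GiB entries here, one per on-board GPU, which is exactly the problem described above.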

The minimum version of Python to run the Hugging Face transformers for GPT-J-6B is 3.10.x [6]. The available tutorials for installing PyTorch on Windows did not work for me; PyTorch needs to be installed with the command generated by the PyTorch website:
https://pytorch.org/get-started/locally/

Build the installation command for PyTorch

The installation command is:

> pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

To install Jupyter notebook and the Hugging Face transformers:

pip install --upgrade pip
pip install jupyter
pip install transformers

At this stage the setup is ready to test:

> jupyter notebook

Inference

In the introduction, I claimed that “NLP opens up linguistic potentials that were not available to us until a few years ago.”

It is time to put that bold statement to the test. To examine the creativity of the GPT-J-6B model, I used the iconic opening line of one of the most famous novels in human history, “One Hundred Years of Solitude” by Gabriel García Márquez.

The prompt gives the GPT model a short cue that the continuation should be in the “magical realist” style. The sentence is left unfinished at the end, which tests whether GPT can predict the correct words.

Prompt:
“Many years later, as he faced the firing squad, Colonel Aureliano Buendia was to remember that distant afternoon when his father took him to”

It is now possible to answer three questions:

1- Does the GPT-J-6B model understand the unfinished opening line and confirm it by predicting the legendary words “discover ice”?

2- Does it continue the text without plagiarizing or exactly repeating the novel’s text?

3- Most importantly, does it generate text that is not only written in the “magical realist” style, but also shows a level of creativity that could be mistaken for a human’s?

The code below is used for inference:

import torch
from transformers import GPTJForCausalLM, AutoTokenizer

# Load GPT-J-6B in half precision and move it to the GPU
model = GPTJForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", torch_dtype=torch.float16
).cuda()
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

prompt = "Many years later, as he faced the firing squad, Colonel Aureliano Buendia was to remember that distant afternoon when his father took him to"
max_tokens = 1024

# Tokenize the prompt and generate up to 1024 tokens by sampling
input_ids = tokenizer.encode(prompt, return_tensors="pt").cuda()
output = model.generate(
    input_ids,
    do_sample=True,
    max_length=max_tokens,
    temperature=0.8,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
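The temperature=0.8 argument controls how sharply the model samples: logits are divided by the temperature before the softmax, so values below 1.0 concentrate probability on the most likely tokens, while values above 1.0 flatten the distribution. A minimal sketch with made-up logits:

```python
import math

# Temperature scaling of a softmax, as used in sampling-based text
# generation. The logits below are made up for illustration only.

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw logits to probabilities at the given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three tokens
for t in (1.0, 0.8):
    probs = softmax_with_temperature(logits, t)
    # the top token's probability rises as the temperature drops
    print(t, [round(p, 3) for p in probs])
```

With do_sample=True, 0.8 therefore biases generation toward the model's preferred continuations while still leaving room for variety.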

Conclusion

When talking about literature and creativity, judgments are subjective and depend highly on the reader’s expectations. Therefore I simply copy the output here, without any modification or edit, and leave the judgment to the reader.
The text below, after the initial prompt, was created by the GPT-J-6B model:

Many years later, as he faced the firing squad, Colonel Aureliano Buendia was to remember that distant afternoon when his father took him to discover ice.

Near the village of Chapul, in the state of Michoacan in Mexico, Aureliano and his father were approaching the ruins of an extinct volcano. The air was still and close, full of the smell of flowers and resin. When he was very young, Aureliano had seen the first snow, and the beauty of that pale, silent world had left a deep impression on him. For a moment, he felt his father’s hand on his shoulder, and it seemed to him that they were walking slowly through a great silence, a silence that seemed to be waiting for the sound of their footsteps.

“My father said he still remembered when he was a boy the first time he saw snow, and the first time he heard the wind,” Aureliano wrote in his notebooks. “Now, as he walks beside me, I can feel in his hand the same sense of mystery and expectation.”

His father stopped at the edge of a small park and pointed in the direction of the volcano. Aureliano understood at once that he was going to see ice. “The landscape has changed,” he wrote, “the volcano has disappeared, and in its place, as if in payment for all the volcanoes in our century, is a park. It is a garden, with trees, flowers and the smell of flowers, and one cannot tell that it is surrounded by a city, by a forest of concrete, by machines and men. The same quiet light fills the spaces between the trees as in the past. But the silence is different, for it is not the silence that is waiting for sound, that waits for the wind to begin, for the birds to start singing in the trees, but the silence of a place that has once been a volcano, that is still in the process of becoming a jungle.”

It was in his notebooks that Aureliano Buendia was to remember that afternoon when his father took him to discover ice.

The volcano had been extinct for a long time. No one knew in what year it had last exploded, no one knew why it had become extinct. Some said that it was a dormant volcano, while others said that it was an active volcano, and that a new eruption every thirty years was the secret of its extraordinary longevity. What no one could deny was that the peak of the volcano was covered with ice.

When the children from the village were little, they played among the rocks near the foot of the volcano. When they were older, they climbed the sides of the peak. They were a small band of mountaineers. They climbed the rocks at night, or during the mornings when the moon was nearly full. When they were very little, they used to throw stones at the ice, and when they were older, they hurled them in great numbers at the peak. The ice was hard as rock, and the rocks bounced off without breaking.

Aureliano Buendia and his father climbed to the top of the peak one night and looked down at the city. It was a clear night, the moon was almost full, and the stars were very bright. The city was black. The houses, the buildings, the trees were all as still as the shadows of the statues that stood on the edge of the park. His father pointed to the peak. “You need a long telescope,” he said, “and a clear night to see it.”

Aureliano held his father’s hand, and they walked together in silence to the edge of the park. Then Aureliano’s father took a telescope from his backpack. It was a large glass tube with a wide base that could easily be carried. He placed it in Aureliano’s hands and looked through it. He looked for a long time.

“Can you see it?” he asked.
“I can see it,” Aureliano answered.
“Do you see the ice?”
“Yes, I see it.”
“Can you recognize it?”
“The ice is like rock, and the rocks are like ice.”

“The volcano is a living thing,” his father said. “It has a heart that is cold, and its breath is cold. The ice is like the breath of the volcano. It is the breath of a creature that is alive. It has a heart, and it is like living beings, like men, like women, like children. The volcanoes are a part of nature.

The volcanoes are born, they live and they die. They are very old; they are older than time.”

The generated text ends here, as the model was asked to predict only 1024 tokens. However, the creativity of GPT-J-6B exceeded my expectations.

In this article I explained a simple hardware setup for loading large Hugging Face NLP models. It is, of course, yet another tiny step toward learning about NLP and being amazed by its capabilities.

References and links:

[1] https://www.eleuther.ai/
[2] https://huggingface.co/EleutherAI/gpt-j-6B
[3] https://www.techpowerup.com/gpu-specs/tesla-p40.c2878
[4] https://www.nvidia.com/Download/driverResults.aspx/189904/en-us/
[5] https://www.msi.com/Landing/afterburner/graphics-cards
[6] https://www.python.org/downloads/windows/
