Resource Monitoring Tool with Prometheus + Grafana

Published in VUNO SW Dev

9 min read · Jul 23, 2020

Hello, this is 최우혁 from the VUNO SW development team.

VUNO develops AI-based medical solutions and uses a GPU cluster to train its AI models. A research lab that runs a GPU cluster tries to allocate GPUs efficiently so that idle resources are kept to a minimum.

As a first step toward efficient allocation of cluster resources, this post introduces Prometheus + Grafana, one of several tools for observing server resources.

It covers how to install and run Prometheus and Grafana, and then shows how to build simple resource monitoring for a cluster through an example resource reporter written with Python's python-prometheus-client module.

Prometheus

Prometheus is an open-source application used for resource monitoring and alerting. It stores the data from resource reporters as time series, and the stored data can be queried with PromQL and visualized.

Prometheus can be run with Docker. The Docker image for Prometheus is prepared as follows.

$ docker pull prom/prometheus

Once the Prometheus Docker image has been pulled with the command above, the next thing to prepare is what Prometheus collects data from and how often.

Create the following file:
prometheus.yml

global:
  scrape_interval:     15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'node0'
    static_configs:
    - targets: ['192.168.0.1:5002']
  - job_name: 'node1'
    static_configs:
    - targets: ['192.168.0.2:5002']

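In the global section, scrape_interval sets how often data is collected from the targets, and evaluation_interval sets how often rules are evaluated.
In the scrape_configs section, you specify the job_name of each resource reporter and the address and port on which the reporter exposes its data.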
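
When the cluster has more than a couple of nodes, the scrape_configs entries can be generated instead of typed by hand. The following is a minimal sketch and not part of the original setup: the helper name build_prometheus_config and the node dictionary are hypothetical, and it simply emits the same layout as the file above with plain string formatting.

```python
def build_prometheus_config(nodes, port=5002, interval="15s"):
    """Return prometheus.yml text with one scrape job per node.

    `nodes` maps a job name to the IP address of its resource reporter.
    """
    lines = [
        "global:",
        f"  scrape_interval:     {interval}",
        f"  evaluation_interval: {interval}",
        "",
        "scrape_configs:",
    ]
    for job_name, ip in nodes.items():
        lines += [
            f"  - job_name: '{job_name}'",
            "    static_configs:",
            f"    - targets: ['{ip}:{port}']",
        ]
    return "\n".join(lines) + "\n"


# Node list taken from the configuration shown above
config_text = build_prometheus_config({"node0": "192.168.0.1", "node1": "192.168.0.2"})
print(config_text)
```

Writing config_text to prometheus.yml would give a file equivalent to the hand-written one.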

Once the Prometheus configuration is ready, the container can be started with the following command.

#!/bin/bash
# The host side of a bind mount must be an absolute path
docker run -d \
  --name prometheus \
  -h prometheus \
  -p 5001:9090 \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
  prom/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --web.enable-lifecycle
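
The --web.enable-lifecycle flag in the command above turns on the HTTP lifecycle endpoints, so an edited prometheus.yml can be reloaded with a POST to /-/reload rather than by restarting the container. A rough sketch using only the standard library follows; the function names and the localhost address are assumptions for illustration.

```python
import urllib.request


def build_reload_request(base_url):
    """Build the POST request that asks Prometheus to re-read its config."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/-/reload", data=b"", method="POST"
    )


def reload_prometheus(base_url, timeout=5):
    """Send the reload request and return the HTTP status code."""
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status


# 5001 is the host port mapped in the docker run command above;
# the host name is a placeholder.
request = build_reload_request("http://localhost:5001")
print(request.get_method(), request.full_url)
```

Calling reload_prometheus("http://localhost:5001") against a running container would trigger the actual reload.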

Once setup is complete, opening the configured IP:PORT in a browser such as Chrome shows a screen like the following.

Grafana

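Grafana is an open-source application for analyzing and monitoring data from a variety of databases. It lets you work interactively with the time-series data pulled from Prometheus, which makes it a good fit as a management tool.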

Grafana is also run with Docker. The Docker image is prepared as follows.

$ docker image pull grafana/grafana

After pulling the Grafana Docker image, start the container with the following command.

docker run -d \
  -p 5000:3000 \
  --name grafana \
  grafana/grafana:latest

Once setup is complete, opening the configured IP and PORT shows the following initial screen.

The default username and password are admin/admin.

After logging in, Prometheus can be connected through Data Sources in the Configuration menu on the left.

Click the “Add data source” button on the right, select Prometheus, and enter its URL to register the data source.
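
The same registration can also be scripted through Grafana's HTTP API (POST /api/datasources) instead of the UI. The sketch below only builds the request; the helper name, the data source name "Prometheus", and both addresses are placeholders chosen for illustration, and the credentials are the defaults mentioned above.

```python
import base64
import json
import urllib.request


def build_datasource_request(grafana_url, prometheus_url, user="admin", password="admin"):
    """Build a POST /api/datasources request that registers Prometheus."""
    payload = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",  # Grafana's backend issues the queries
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        grafana_url.rstrip("/") + "/api/datasources",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json", "Authorization": f"Basic {token}"},
        method="POST",
    )


# Placeholder addresses: 5000 and 5001 are the host ports used earlier in this post.
request = build_datasource_request("http://localhost:5000", "http://192.168.0.1:5001")
# urllib.request.urlopen(request) would send it to a running Grafana instance.
print(request.full_url)
```

Because Grafana runs inside its own container, the Prometheus URL should be an address reachable from that container, not localhost.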

Python Resource Exporter

With Python's python-prometheus-client module, you can extract and report whatever data you want.
python-prometheus-client can report four metric types in total:
- Gauge
- Counter
- Summary
- Histogram
Note that the implementation described here mainly uses Gauge.
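
To make the difference between the four types concrete, here is a small sketch that creates one metric of each type and prints the text a scraper would receive. All metric names and values are hypothetical, and a private CollectorRegistry is used so the sketch does not touch the default registry.

```python
from prometheus_client import (
    CollectorRegistry, Counter, Gauge, Histogram, Summary, generate_latest,
)

# A private registry keeps this sketch separate from the default global one.
registry = CollectorRegistry()

# Gauge: a value that can go up and down (e.g. a ratio).
gpu_ratio = Gauge('example_gpu_ratio', 'Hypothetical ratio of busy GPUs', registry=registry)
gpu_ratio.set(0.5)

# Counter: a value that only ever increases.
polls = Counter('example_polls', 'Hypothetical number of polling rounds', registry=registry)
polls.inc()

# Summary: tracks the count and the sum of observed values.
poll_seconds = Summary('example_poll_seconds', 'Hypothetical time spent per poll',
                       registry=registry)
poll_seconds.observe(0.2)

# Histogram: counts observations in configurable buckets.
job_minutes = Histogram('example_job_minutes', 'Hypothetical job length in minutes',
                        buckets=(10, 60, 600), registry=registry)
job_minutes.observe(42)

# Text in the exposition format that Prometheus scrapes from a reporter.
exposition = generate_latest(registry).decode()
print(exposition)
```

A Gauge fits the GPU examples in this post because quantities such as the number of GPUs in use rise and fall over time.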

The following example uses prometheus_client to report the total number of GPUs in a server.

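from prometheus_client import start_http_server, Gauge
import subprocess
import time

# Assumed helper (not defined in the original post):
# run a shell command and return its standard output as text
def run_shell_script(cmd):
  return subprocess.check_output(cmd, shell=True).decode()

# Get the number of GPUs in the server
def get_the_num_of_gpus():
  cmd = 'nvidia-smi --query-gpu=name --format=csv,noheader | wc -l'
  return int(run_shell_script(cmd).split('\n')[0])

# Gauge(VALUE_NAME, DESCRIPTION_OF_THE_VALUE)
num_gpu = Gauge('the_number_of_gpu', 'Shows the number of GPUs in the server')
num_gpu.set(get_the_num_of_gpus())

if __name__ == '__main__':
  # Expose the metrics on port 5002 and keep the process alive
  start_http_server(5002)
  while True:
    time.sleep(15)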

The example above reports a value that does not change.

The following example uses prometheus_client to report the ratio of GPUs that are currently in use.

from prometheus_client import start_http_server, Gauge
import subprocess
import time

# Gauge(VALUE_NAME, DESCRIPTION_OF_THE_VALUE)
occupied_gpus = Gauge('occupancy_gpus_ratio', 'Shows the ratio of occupied GPU')

# Assumed helper (not defined in the original post):
# run a shell command and return its standard output as text
def run_shell_script(cmd):
  return subprocess.check_output(cmd, shell=True).decode()

# Return the ratio of GPUs that have at least one compute process
def get_occupied_gpus():
  cmd = 'nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader'
  pci_ids = run_shell_script(cmd).split('\n')
  pci_ids = [p for p in pci_ids if p != '']

  cmd = 'nvidia-smi --query-compute-apps=gpu_bus_id --format=csv,noheader'
  process_bus_id = run_shell_script(cmd).split('\n')
  process_bus_id = [i for i in process_bus_id if i != '']

  # Mark every GPU as idle (0), then flag the ones that own a process (1)
  pci_bus_id_dict = {}
  for pci_id in pci_ids:
    pci_bus_id_dict[pci_id] = 0
  for bus_id in process_bus_id:
    pci_bus_id_dict[bus_id] = 1

  num_of_working_gpu = 0
  for pci_id in pci_bus_id_dict.keys():
    if pci_bus_id_dict[pci_id] == 1:
      num_of_working_gpu += 1
  return num_of_working_gpu / len(pci_bus_id_dict.keys())

# Update the gauge with the current occupancy ratio
# (renamed from the duplicate get_occupied_gpus, which called itself)
def report_occupied_gpus():
  value = get_occupied_gpus()
  occupied_gpus.set(value)

if __name__ == '__main__':
  start_http_server(5002)
  while True:
    report_occupied_gpus()
    time.sleep(15)

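As in the two examples above, you can write a reporter that reports the data you want at a fixed interval, and you can control the load on the system by adjusting the reporting interval.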

The examples above can be extended to monitor the power consumption and utilization of each GPU.
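
One possible shape for that extension is sketched below. It assumes the values come from nvidia-smi with the query fields index, power.draw and utilization.gpu; the function names, the dictionary keys, and the sample text are made up for illustration, and the gauges are expected to be created with a gpu label.

```python
def parse_gpu_stats(csv_text):
    """Parse `index, power.draw, utilization.gpu` rows into a list of dicts.

    Fields that nvidia-smi cannot report (e.g. "[N/A]") become None.
    """
    def to_float(field):
        try:
            return float(field)
        except ValueError:
            return None

    stats = []
    for row in csv_text.splitlines():
        if not row.strip():
            continue
        index, power, util = [field.strip() for field in row.split(',')]
        stats.append({
            'gpu': index,
            'power_watts': to_float(power),
            'utilization_percent': to_float(util),
        })
    return stats


def update_gauges(stats, power_gauge, util_gauge):
    """Set one labelled sample per GPU, skipping values that were unavailable."""
    for entry in stats:
        if entry['power_watts'] is not None:
            power_gauge.labels(gpu=entry['gpu']).set(entry['power_watts'])
        if entry['utilization_percent'] is not None:
            util_gauge.labels(gpu=entry['gpu']).set(entry['utilization_percent'])


# Illustrative text in the shape produced by:
#   nvidia-smi --query-gpu=index,power.draw,utilization.gpu --format=csv,noheader,nounits
sample = "0, 61.25, 87\n1, [N/A], 0\n"
parsed = parse_gpu_stats(sample)
print(parsed)
```

In a reporter, the two gauges could be created as Gauge('gpu_power_watts', '...', ['gpu']) and Gauge('gpu_utilization_percent', '...', ['gpu']) (hypothetical names), and update_gauges would be called inside the same 15-second loop as in the examples above.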

Conclusion

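This post described how to visualize the data you want by combining Prometheus + Grafana with a custom resource reporter. You can report and visualize only the resources you care about in a cluster, and you can query the data stored in Prometheus with PromQL to be notified in real time of situations that may cause problems. Building on this, our goal is to automate tasks such as observing idle resources and allocating the cluster to new members, and to build a cluster whose resources can be shared.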
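
As a rough illustration of the idle-resource idea, the sketch below queries Prometheus through its HTTP API (/api/v1/query) and lists the jobs whose average GPU occupancy over the last hour is below a threshold. The function names, the 0.1 threshold, the localhost address, and the hand-written response are assumptions; the metric name is the gauge from the second reporter example.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    return base_url.rstrip("/") + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def idle_jobs(response, threshold=0.1):
    """Return job names whose value in an instant-query response is below `threshold`."""
    return [
        sample["metric"].get("job", "")
        for sample in response["data"]["result"]
        if float(sample["value"][1]) < threshold
    ]


def find_idle_jobs(base_url, promql, threshold=0.1):
    """Query a running Prometheus server and list the idle jobs."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=5) as resp:
        return idle_jobs(json.load(resp), threshold)


# Average GPU occupancy per job over the last hour
QUERY = "avg_over_time(occupancy_gpus_ratio[1h])"
print(build_query_url("http://localhost:5001", QUERY))

# Hand-written response in the API's documented shape, used instead of a live server
fake_response = {"status": "success", "data": {"resultType": "vector", "result": [
    {"metric": {"job": "node0"}, "value": [1595462400, "0.75"]},
    {"metric": {"job": "node1"}, "value": [1595462400, "0.0"]},
]}}
print(idle_jobs(fake_response))
```

Against a live server, find_idle_jobs("http://localhost:5001", QUERY) would run the same check on real data.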

References

https://prometheus.io/
https://grafana.com/
Figure 1. from https://prometheus.io/docs/introduction/overview/

Written by 최우혁