To improve the observability of a system, we can use Prometheus to monitor resource consumption at different levels: system-level disk I/O and network bandwidth; tenant-level CPU and memory usage; component-level Ray cluster resource utilization, backend platform system call counts, and garbage collection status; and operator-level object sizes and execution times. Accumulating this data helps developers troubleshoot issues and points the way for subsequent system optimization.
There are two ways to deploy Prometheus. The first is a lightweight deployment based on Docker-Compose, which is mainly used when there is only a single machine or a small number of machines. The second is a deployment based on Kubernetes. We provide detailed introductions to both approaches, along with concrete operational steps for deployment personnel.
Docker-Compose-based deployment#
When deploying with Docker-Compose, the resources of the client's servers are often limited; for some clients, having the Prometheus component occupy 500 MB of memory is not acceptable. Therefore, the Prometheus component needs to be designed to be pluggable. In addition, since clients may have several servers and Prometheus is relatively independent of the system's core functions, it can be deployed separately on an idle machine without occupying Ray cluster resources. For these reasons, we use a separate deployment file for Prometheus instead of merging it into the existing deployment file.
Assume that Docker and Docker-Compose are already installed on each server. With N servers in total, the Prometheus monitoring module is divided into 1 Master machine and N-1 Slave machines. Deploying Prometheus based on Docker-Compose consists of the following two steps.
- Create a service on each Slave server.
- Create a service on the Master server.
Create a service on each Slave server#
Create a configuration file for the service. The service occupies port 9100 on the Slave server. Assuming the server's IP address is 192.168.88.101, node-exporter will run on 192.168.88.101:9100. In the next step, 192.168.88.101:9100 needs to be added to the prometheus.yml configuration file on the Master.
```yaml
version: '3'
services:
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node_exporter
    hostname: node-exporter
    restart: always
    volumes:
      # Mount the host's /proc, /sys, and root filesystem read-only so that
      # node-exporter can report host-level rather than container-level metrics.
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
      - /etc/hostname:/etc/hostname:ro
    command:
      # Point node-exporter at the host paths mounted above.
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
    ports:
      - "9100:9100"
      # If port 9100 is occupied, you can use
      # - "9200:9100"
    networks:
      - monitor

networks:
  monitor:
    driver: bridge
    ipam:
      config:
        - subnet: 172.16.102.0/24
```
Start the service by running `docker-compose up -d` in the same directory.
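Once the container is running, a quick sanity check confirms that metrics are being exposed. The IP and port below follow the example above; adjust them if you mapped a different port:

```bash
# Fetch the first few lines of the metrics endpoint exposed by node-exporter.
curl -s http://192.168.88.101:9100/metrics | head
```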
Create a service on the Master server#
Assume the IP address of the Master host is 192.168.88.13. First, create `docker-compose.yml` and `prometheus.yml` in the same directory. As shown in `docker-compose.yml`:

- The `node-exporter` for the Master machine will be created at 192.168.88.13:9100.
- The `prometheus` for the Master machine will be created at 192.168.88.13:9090.
- The `grafana` for the Master machine will be created at 192.168.88.13:3000.
- The `alertmanager` for the Master machine will be created at 192.168.88.13:9093.
- The `cadvisor` for the Master machine will be created at 192.168.88.13:8080.
version: "3.7"
services:
# Service1: Node monitoring
node-exporter:
image: prom/node-exporter:latest
container_name: "node-exporter"
ports:
- "9100:9100"
restart: always
# Service2: Node monitoring
prometheus:
image: prom/prometheus:latest
container_name: "prometheus0"
restart: always
ports:
- "9090:9090"
volumes:
- "./prometheus.yml:/etc/prometheus/prometheus.yml"
- "./prometheus_data:/prometheus"
# Service3: Data dashboard
grafana:
image: grafana/grafana
container_name: "grafana"
ports:
- "3000:3000"
restart: always
volumes:
- "./grafana_data:/var/lib/grafana"
- "./grafana_log:/var/log/grafana"
- "./grafana_data/crypto_data:/crypto_data" # The host address is before the colon and the container address is after the colon. This is used to specify the location of the sqlite database.
# Service4: Alert processing
alertmanager:
image: prom/alertmanager:latest
container_name: Myalertmanager
hostname: alertmanager
restart: always
ports:
- '9093:9093'
volumes:
- './prometheus/config:/config'
- './prometheus/data/alertmanager:/alertmanager/data'
# Service5: Docker monitoring
cadvisor:
image: lagoudocker/cadvisor:v0.37.0
container_name: cadvisor
restart: always
volumes:
- /:/rootfs:ro
- /var/run:/var/run:rw
- /sys:/sys:ro
- /dev/disk/:/dev/disk:ro
- /var/lib/docker/:/var/lib/docker:ro
command:
- "--disable_metrics=udp,tcp,percpu,sched"
- "--storage_duration=15s"
- "-docker_only=true"
- "-housekeeping_interval=30s"
- "-disable_metrics=disk"
ports:
- 8080:8080
networks:
- monitor
networks:
monitor:
name: monitor
driver: bridge
As shown in `prometheus.yml`, Prometheus scrapes the node-exporter on the Slave, the node-exporter on the Master, and cAdvisor on the Master:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # Must be consistent with the 'alertmanager' service in the
            # Master machine's docker-compose.yml.
            - 192.168.88.13:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  #- "app/prometheus/rules/*.yml"
  - "rule.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing the endpoints to scrape.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'slave101NodeExporter'
    static_configs:
      - targets: ['192.168.88.101:9100']
        labels:
          host: slave101
  - job_name: 'masterNodeExporter'
    static_configs:
      - targets: ['192.168.88.13:9100']
        labels:
          host: master
  - job_name: 'masterCadvisor'
    static_configs:
      - targets: ['192.168.88.13:8080']
        labels:
          host: master
  # Add the NodeExporters of other servers here, for example:
  # - job_name: 'slave21NodeExporter'
  #   static_configs:
  #     - targets: ['192.168.88.21:9100']
  #       labels:
  #         host: slave21NodeExporter
```
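The `rule_files` section above references a `rule.yml` that is not shown here; it must also be mounted into the Prometheus container (for example as `./rule.yml:/etc/prometheus/rule.yml`, alongside `prometheus.yml`). A minimal sketch of such a rule file is shown below; the alert name and threshold are only placeholders:

```yaml
# rule.yml: a minimal example alerting rule (placeholder values).
groups:
  - name: node-alerts
    rules:
      - alert: InstanceDown
        # 'up' is 1 when a scrape target is reachable, 0 otherwise.
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
```

With both configuration files in place, start the stack on the Master by running `docker-compose up -d` in the same directory, then open http://192.168.88.13:9090/targets to confirm that the slave101NodeExporter, masterNodeExporter, and masterCadvisor targets are in the UP state.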
Deployment Based on K8S#
Because KubeSphere ships with Prometheus and node-exporter natively installed, only Grafana needs to be installed in KubeSphere. The steps consist of two parts: deploying Grafana using Helm and adding a persistent volume to Grafana.
Deploying Grafana Using Helm#
The K8S-based deployment uses Helm. Use the following commands to create Grafana in the kubesphere-monitoring-system namespace, which is the default namespace for KubeSphere's monitoring components.
```bash
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install grafana grafana/grafana -n kubesphere-monitoring-system
```
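After the chart is installed, the auto-generated admin password can be read from the secret the chart creates, and the service can be forwarded locally for a quick look. This assumes the release name `grafana` used above and the chart's default service port of 80:

```bash
# Read the auto-generated admin password from the secret created by the chart.
kubectl get secret grafana -n kubesphere-monitoring-system \
  -o jsonpath="{.data.admin-password}" | base64 --decode; echo

# Temporarily forward the Grafana service to localhost:3000.
kubectl port-forward svc/grafana 3000:80 -n kubesphere-monitoring-system
```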
Adding a Persistent Volume to Grafana#
Next, we add a persistent volume to Grafana so that it can persistently store dashboards and user information. First, create a PVC named grafana-storage in the kubesphere-monitoring-system namespace by applying a manifest.
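A minimal PVC manifest for this, assuming 1 Gi of storage and the cluster's default StorageClass, might look as follows:

```yaml
# grafana-pvc.yaml: a 1Gi PVC for Grafana, using the cluster's default StorageClass.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-storage
  namespace: kubesphere-monitoring-system
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```

Apply it with `kubectl apply -f grafana-pvc.yaml`.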
Then, modify the volumes section in the YAML of the Grafana Deployment as shown below.
```yaml
volumes:
  - configMap:
      defaultMode: 420
      name: grafana
    name: config
  - name: grafana-storage
    persistentVolumeClaim:
      claimName: grafana-storage
```
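For the claim to take effect, the Grafana container in the same Deployment also needs a corresponding volumeMounts entry pointing at Grafana's data directory. A sketch of that fragment is shown below, assuming the default data path /var/lib/grafana; the existing mount name in the chart-generated Deployment may differ:

```yaml
# Container spec fragment: mount the PVC-backed volume at Grafana's data directory.
containers:
  - name: grafana
    volumeMounts:
      - name: grafana-storage
        mountPath: /var/lib/grafana
```

After saving the change, Kubernetes rolls out a new Grafana pod whose dashboards and user information survive pod restarts.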