Prometheus

At GDP Labs, we use the Grafana stack with Prometheus as our main monitoring platform. This open-source stack lets us query, visualize, alert on, and analyze metrics such as CPU, memory, network, and disk usage.

Glossary

Prometheus: Open source systems monitoring toolkit that collects and stores metrics as time series data, with a multi-dimensional data model.
Metric: Numerical measurement recorded over time, such as CPU or memory usage.
Alert: Condition defined with a Prometheus expression language (PromQL) expression; when the condition fires, a notification is sent to an external service.
Kubernetes: Portable, extensible, open source platform for managing containerized workloads and services, that facilitates both declarative configuration and automation.
Kubernetes Cluster: Group of nodes, managed by a control plane, that run containerized applications in Pods.
Kubernetes Namespace: Mechanism for isolating groups of resources within a single cluster.
Kubernetes Workload: Application running on Kubernetes inside a set of Pods.
Kubernetes Deployment: Manages Pods for an application, usually without state.
Kubernetes StatefulSet: Manages Pods for stateful applications.
Kubernetes Pod: Smallest deployable unit in Kubernetes.
Kubernetes Node: Virtual or physical machine on which Pods run their containers.
Memory Leak: Condition in which an application's memory usage keeps growing because memory that is no longer needed is not released. In Kubernetes, memory limits are enforced by the kernel with out-of-memory (OOM) kills, so a leaking container is eventually killed when it reaches its limit.
Persistent Volume: Piece of storage in the cluster that has been provisioned.

Prerequisites

Web Browser (e.g., Google Chrome, Microsoft Edge, Mozilla Firefox)
GDP Labs Google Account (@gdplabs.id)

Simple Walkthrough

Open https://grafana-kube-explore.obrol.id/
Click Sign in with Google
Choose your @gdplabs.id Google account
You are now ready to explore the visualizations and dashboards:
1. Kubernetes / Compute Resources / Namespace (Workloads) - Dashboards - Staging
2. Kubernetes / Compute Resources / Namespace (Workloads) - Dashboards - Production

Workflow: Investigating Memory Leak Issue

Identify the Incident Scope
Determine the affected application/system, environment (production/staging), and time window.
Select Relevant Dashboard
Open Kubernetes / Compute Resources / Namespace (Workloads) - Dashboards. Focus on the relevant application or project. For example, choose "cluster: eks-gl-production, namespace: ai-agent-platform-prod" for AIP production issues.
Set the Time Range: Filter by Incident Period
1. Use the time picker to narrow the query to the timeframe of the incident.
2. Choose between relative dates (e.g., "Last 24 hours") or absolute values (specific start/end).
Narrow to a Specific Widget Group
The selection above is only an example; in it, AIP usage does not appear to exceed its limit.
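Once the time range is narrowed to the incident, one rough way to judge whether a memory panel shows a leak is to compare usage early in the window with usage late in the window. The following is a minimal sketch, not part of the dashboards themselves: the function name `looks_like_leak`, the 20% growth threshold, and the quarter-based comparison are all assumptions chosen for illustration. It cannot tell a leak apart from legitimate load growth, so treat a positive result only as a reason to look closer.

```python
from statistics import mean


def looks_like_leak(samples, min_growth=0.2):
    """Heuristic: did memory grow by at least `min_growth` over the window?

    `samples` is a list of (timestamp_seconds, memory_bytes) tuples ordered
    by time, e.g. values read from a memory usage panel.
    The threshold and the quarter-based comparison are illustrative
    assumptions, not a rule used by Grafana or Prometheus.
    """
    if len(samples) < 4:
        return False  # too little data to compare early vs. late usage

    values = [value for _, value in samples]
    quarter = max(1, len(values) // 4)

    early = mean(values[:quarter])   # average of the first quarter
    late = mean(values[-quarter:])   # average of the last quarter
    if early <= 0:
        return False

    return (late - early) / early >= min_growth


# Hypothetical data: one sample per minute, growing by 5 MB each minute.
growing = [(i * 60, 100e6 + i * 5e6) for i in range(20)]
print(looks_like_leak(growing))
```

A workload whose memory rises and falls around a stable baseline (for example, because of garbage collection) should not be flagged by this comparison, while steadily growing usage that never returns to the baseline should.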

Dashboard Usage Scenarios

Below is a list of dashboards you can use to investigate your application and its resource usage.

Kubernetes / Compute Resources / Namespace (Workloads) - Dashboards

Investigate application resource (CPU, Memory, Network) usage
Application-specific usage

Relevant when this kind of alert shows up:

✅ Alert: KubePodMemoryHigh-SEV1-Prod
Details: Pod bosa-platform-prod/bosa-api-worker-865f74786c-chrp8 (bosa-api-worker) has been using more than 90% of its memory limit for the last 5 minutes on cluster eks-gl-production.
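The alert text above describes a condition that must hold continuously ("more than 90% of its memory limit for the last 5 minutes"). The sketch below illustrates that kind of sustained-threshold check on a list of samples. It is only an approximation for reasoning about the alert: the function name `sustained_above` and the sample format are assumptions, and the actual alert rule is evaluated by Prometheus, not by code like this.

```python
def sustained_above(samples, limit_bytes, threshold=0.9, window_seconds=300):
    """Return True if usage stayed above threshold * limit for the whole window.

    `samples` is a list of (timestamp_seconds, usage_bytes) tuples ordered
    by time. The window ends at the most recent sample.
    """
    if not samples or limit_bytes <= 0:
        return False

    latest_ts = samples[-1][0]
    window_start = latest_ts - window_seconds

    # Without data going back to the start of the window we cannot claim
    # the condition held for the full duration.
    if samples[0][0] > window_start:
        return False

    recent = [usage for ts, usage in samples if ts >= window_start]
    return all(usage / limit_bytes > threshold for usage in recent)


# Hypothetical pod with a 1000-byte limit, sampled every 60 seconds.
samples = [(t, 950) for t in range(0, 601, 60)]
print(sustained_above(samples, limit_bytes=1000))
```

A single dip below the threshold inside the window makes the check return False, which is why a brief spike does not trigger an alert of this kind.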

Kubernetes / Compute Resources / Node (Pods) - Dashboards - Grafana
1. Investigate node resource (CPU, Memory, Network) usage
2. Node to application specific usage
3. Relevant when this kind of alert shows up:
  🔥 Alert: NodeCPUHighUsage-SEV2-nonProdCluster
  Details: CPU usage at 10.10.14.193:9100 has been above 90% for the last 10 minutes, is currently at 96.13%.
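Port 9100 in the alert is the default port of the Prometheus node_exporter, which reports CPU time as counters (`node_cpu_seconds_total`, split by mode). A CPU usage percentage is commonly derived from how fast the idle counter grows. The sketch below shows that arithmetic for two readings of the idle counter; it is an illustration only, and it is an assumption that this alert is computed this way. The function name and parameters are hypothetical, and counter resets are not handled.

```python
def cpu_usage_percent(idle_start, idle_end, elapsed_seconds, cpu_count):
    """Estimate node CPU usage from two readings of the idle-time counter.

    `idle_start` and `idle_end` are idle seconds summed across all CPUs at
    the start and end of the interval. With `cpu_count` CPUs, the node can
    accumulate at most elapsed_seconds * cpu_count idle seconds.
    """
    if elapsed_seconds <= 0 or cpu_count <= 0:
        raise ValueError("elapsed_seconds and cpu_count must be positive")

    idle_fraction = (idle_end - idle_start) / (elapsed_seconds * cpu_count)
    # Clamp to [0, 1] to absorb small timing inaccuracies.
    idle_fraction = min(max(idle_fraction, 0.0), 1.0)
    return (1.0 - idle_fraction) * 100.0


# Hypothetical node: 4 CPUs, 10-minute interval, 240 idle seconds in total.
print(cpu_usage_percent(1000.0, 1240.0, elapsed_seconds=600, cpu_count=4))
```

In this hypothetical case the node was idle for 240 of a possible 2400 CPU-seconds, so usage is about 90%.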

Kubernetes Container Images - Apps - Dashboards - Grafana
1. Investigate which application versions are currently deployed

Kubernetes / Persistent Volume - Dashboards - Grafana

Investigate application persistent volume usage

Relevant when this kind of alert shows up:

✅ Based on recent sampling, the PersistentVolume claimed by prometheus-prometheus-stack-kube-prom-prometheus-db-prometheus-prometheus-stack-kube-prom-prometheus-0 in Namespace kube-addons on Cluster eks-gl-staging is expected to fill up within four days. Currently 5.045% is available.
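Alerts worded like this are typically based on a linear extrapolation of recent samples, which PromQL provides as the `predict_linear` function. The sketch below reproduces the idea with a least-squares fit over samples of available bytes. It is a simplified illustration: the Python function names are hypothetical, the prediction is measured from the last sample rather than from the rule evaluation time, and it is an assumption that this particular alert uses a four-day horizon in exactly this way.

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares linear prediction, `seconds_ahead` after the last sample.

    `samples` is a list of (timestamp_seconds, value) tuples ordered by time.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")

    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must have distinct timestamps")

    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + seconds_ahead)


def will_fill_within(samples, days=4):
    """True if the fitted trend predicts no available bytes within `days`."""
    return predict_linear(samples, days * 24 * 3600) < 0


# Hypothetical volume: 50 GB available, shrinking by 1 GB per hour.
available = [(h * 3600, 50e9 - h * 1e9) for h in range(7)]
print(will_fill_within(available))
```

Because the prediction follows the recent trend, a short burst of writes can trigger the alert even when the volume is far from full, so check the longer-term trend on the Persistent Volume dashboard before resizing.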

Best Practices from Grafana

The Grafana documentation provides tutorials and best practices for using Grafana. See the Grafana tutorials for dashboards and alerts.

Quick-Reference Dashboard Filter

The dashboard features a filter bar at the top, enabling you to refine the data shown across all panels. Utilize these controls to focus on specific clusters, namespaces, or workloads and set the desired time range for your analysis. Each filter's description is detailed in the table below.

| Filter | Description |
| --- | --- |
| @timestamp | The date and time the metric was collected |
| cluster | Kubernetes cluster to inspect (e.g., eks-gl-production) |
| namespace | Kubernetes namespace to inspect (e.g., ai-agent-platform-prod) |
| workload_type | Workload type (statefulset or deployment) |
| workload | Specific workload in the selected namespace (e.g., ai-agent-platform-runner-worker) |
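Dashboard filters like these usually end up as label matchers in the PromQL queries behind each panel. The sketch below shows how a set of filter values could be turned into a PromQL label selector. It is illustrative only: the helper name `build_selector` is hypothetical, and it is an assumption that the underlying metrics carry labels with exactly these names.

```python
def build_selector(filters):
    """Build a PromQL label selector such as {cluster="a", namespace="b"}.

    `filters` maps label names to values; empty or None values are skipped,
    which mirrors leaving a dashboard filter unset.
    """
    parts = []
    for label, value in filters.items():
        if value is None or value == "":
            continue
        # Escape characters that are special inside a PromQL string literal.
        escaped = (
            value.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )
        parts.append(f'{label}="{escaped}"')
    return "{" + ", ".join(parts) + "}"


selector = build_selector(
    {
        "cluster": "eks-gl-production",
        "namespace": "ai-agent-platform-prod",
        "workload": "",  # left unset, so it is omitted from the selector
    }
)
print(selector)
```

The resulting selector can be appended to a metric name when exploring data manually, for example a container memory metric restricted to one cluster and namespace.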
