Prometheus
At GDP Labs, we use the Grafana stack with Prometheus as our main monitoring platform. This open-source stack lets us query, visualize, alert on, and analyze metrics such as CPU, memory, network, and disk usage.
Glossary
Prometheus: Open source systems monitoring toolkit that collects and stores metrics as time series data, with a multi-dimensional data model.
Metric: In layperson terms, a numerical measurement (e.g., CPU usage) recorded over time.
Alert: Condition defined with a Prometheus expression language (PromQL) expression. When the condition is met, the alert fires and a notification is sent to an external service.
Kubernetes: Portable, extensible, open source platform for managing containerized workloads and services that facilitates both declarative configuration and automation.
Kubernetes Cluster: Set of one or more nodes that run containerized applications in Pods, managed by a control plane.
Kubernetes Namespace: Mechanism for isolating groups of resources within a single cluster.
Kubernetes Workload: Application running on Kubernetes, run inside a set of pods.
Kubernetes Deployment: Manages Pods for an application, usually without state.
Kubernetes StatefulSet: Manages Pods for stateful applications.
Kubernetes Pod: Smallest deployable unit in Kubernetes; a group of one or more containers.
Kubernetes Node: Virtual or physical machines where containers run in Pods.
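Memory Leak: Condition in which an application keeps allocating memory without releasing it, so its usage grows over time. In Kubernetes, memory limits are enforced by the kernel with out of memory (OOM) kills, so a leaking container is eventually killed once it exceeds its limit.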
Persistent Volume: Piece of storage in the cluster that has been provisioned.
Prerequisites
Web Browser (e.g., Google Chrome, Microsoft Edge, Mozilla Firefox, etc.)
GDP Labs Google Account (@gdplabs.id)
Simple Walkthrough
Click Sign in with Google
Choose your @gdplabs.id Google account
You are now ready to explore the visualizations and dashboards.
Workflow: Investigating Memory Leak Issue
Identify the Incident Scope
Determine the affected application/system, environment (production/staging), and time window.
Select Relevant Dashboard
Open the Kubernetes / Compute Resources / Namespace (Workloads) dashboard. Focus on the relevant application or project. For example, choose “cluster: eks-gl-production, namespace: ai-agent-platform-prod” for AIP production issues.

Set the Time Range: Filter by Incident Period
Use the time picker to narrow the query to the timeframe of the incident.
Choose between relative dates (e.g., "Last 24 hours") or absolute values (specific start/end).
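The difference between a relative and an absolute range can be sketched in a few lines of Python. This is only an illustration of what the time picker resolves to, not how Grafana implements it. The function name `relative_range` and the fixed "now" value are made up for the example.

```python
from datetime import datetime, timedelta, timezone


def relative_range(hours, now=None):
    """Return (start, end) as UTC datetimes covering the last `hours` hours.

    Hypothetical helper: a relative range such as "Last 24 hours" is just an
    absolute range whose end is the current time.
    """
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    return start, end


# Made-up review time, fixed so the example is reproducible.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
start, end = relative_range(24, now=now)
print(start.isoformat(), end.isoformat())
```

A relative range moves with the clock on every refresh, while an absolute range stays pinned to the incident window, which is usually what you want when sharing a dashboard link in an incident report.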

Narrow to a Specific Widget Group

The above is only an example. It shows that AIP usage does not appear to exceed its limit.
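When reading the memory panels, the question is whether usage keeps climbing toward the limit or stays roughly flat. The sketch below expresses that judgment as a simple heuristic: a least-squares slope over evenly spaced memory samples, compared against a fraction of the limit. The function names, the threshold, and the sample values are all invented for illustration; this is not an existing alert rule or a dashboard feature.

```python
def slope_per_sample(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def looks_like_leak(values, limit, min_growth_fraction=0.01):
    """Hypothetical heuristic: flag a series whose average growth per sample
    exceeds min_growth_fraction of the memory limit."""
    return slope_per_sample(values) > min_growth_fraction * limit


# Invented samples in MiB, taken at equal intervals.
steady = [510, 495, 505, 500, 498, 502]
growing = [400, 450, 500, 550, 600, 650]
limit_mib = 1024

print(looks_like_leak(steady, limit_mib))
print(looks_like_leak(growing, limit_mib))
```

With these invented samples, only the second series is flagged. A steady upward slope that resets after each pod restart is the typical shape of a leak; a flat series with occasional spikes usually points to load rather than a leak.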
Dashboard Usage Scenarios
Below is a list of dashboards you can use to investigate your app and its usage.
Kubernetes / Compute Resources / Namespace (Workloads)
Investigate application resource (CPU, Memory, Network) usage
Application specific usage
Relevant when these kinds of alerts show up:
Kubernetes / Compute Resources / Node (Pods)
Investigate node resource (CPU, Memory, Network) usage
Node to application specific usage
Relevant when these kinds of alerts show up:
Kubernetes Container Images - Apps
Investigate which application versions are currently deployed
Kubernetes / Persistent Volume
Investigate application persistent volume usage
Relevant when these kinds of alerts show up:
Best Practices from Grafana
The Grafana documentation provides best-practice tutorials on how to use Grafana. See the tutorials for Grafana dashboards and alerts.
Quick-Reference Dashboard Filter
The dashboard features a filter bar at the top, enabling you to refine the data shown across all panels. Utilize these controls to focus on specific clusters, namespaces, or workloads and set the desired time range for your analysis. Each filter's description is detailed in the table below.

@timestamp: The date and time the metric was collected
cluster: Kubernetes cluster to inspect (e.g., eks-gl-production)
namespace: Kubernetes namespace to inspect (e.g., ai-agent-platform-prod)
workload_type: Workload type (statefulset or deployment)
workload: Specific workload in a particular namespace (e.g., ai-agent-platform-runner-worker)
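Behind the filter bar, each selection typically ends up as a label matcher in the Prometheus query of every panel. The sketch below shows that idea by turning a set of filter values into a PromQL label selector. The helper `build_selector` is hypothetical, and the use of `container_memory_working_set_bytes` with a `cluster` label is an assumption about this setup rather than the dashboard's actual query.

```python
def build_selector(filters):
    """Render a dict of label filters as a PromQL label selector.

    Hypothetical helper. Filters set to None (nothing selected) are skipped.
    """
    parts = []
    for label, value in filters.items():
        if value is None:
            continue
        # Escape backslashes and double quotes inside the label value.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        parts.append(f'{label}="{escaped}"')
    return "{" + ",".join(parts) + "}"


# Filter values taken from the examples in this page; workload left unset.
filters = {
    "cluster": "eks-gl-production",
    "namespace": "ai-agent-platform-prod",
    "workload": None,
}
selector = build_selector(filters)

# Assumed metric name; the real panels may use a different expression.
query = "sum(container_memory_working_set_bytes" + selector + ")"
print(query)
```

Narrowing a filter adds one more matcher to the selector, which is why every panel on the dashboard changes together when you pick a different cluster or namespace.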