Prometheus

At GDP Labs, we use the Grafana stack with Prometheus as our main monitoring platform. This open-source stack lets us query, visualize, alert on, and analyze metrics such as CPU, memory, network, and disk usage.

Glossary

  1. Prometheus: Open source systems monitoring toolkit that collects and stores metrics as time series data, with a multi-dimensional data model.

  2. Metric: A numerical measurement, in layperson's terms.

  3. Alert: A condition defined with a Prometheus expression language (PromQL) expression. When the condition fires, a notification is sent to an external service.

  4. Kubernetes: Portable, extensible, open source platform for managing containerized workloads and services that facilitates both declarative configuration and automation.

  5. Kubernetes Cluster: Group of one or more nodes that run containerized applications in pods, managed by a control plane.

  6. Kubernetes Namespace: Mechanism for isolating groups of resources within a single cluster.

  7. Kubernetes Workload: Application running on Kubernetes inside a set of pods.

  8. Kubernetes Deployment: Manages Pods for an application, usually without state.

  9. Kubernetes StatefulSet: Manages stateful applications.

  10. Kubernetes Pod: Smallest deployable unit in Kubernetes.

  11. Kubernetes Node: Virtual or physical machines where containers run in Pods.

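  12. Memory Leak: Condition where an application's memory usage keeps growing because memory that is no longer needed is not released. In Kubernetes, memory limits are enforced by the kernel with out of memory (OOM) kills, so a leaking container is eventually killed once it reaches its limit.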

  13. Persistent Volume: Piece of storage in the cluster that has been provisioned.
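To make the "multi-dimensional data model" in the Prometheus entry more concrete, here is a minimal sketch in Python. It assumes a hypothetical `TimeSeries` class (not part of any Prometheus client library) and uses the cAdvisor metric name `container_memory_working_set_bytes` purely as an illustration. The idea it shows: a time series is identified by a metric name plus a set of label key/value pairs, and holds timestamped samples.

```python
from dataclasses import dataclass, field


@dataclass
class TimeSeries:
    """Hypothetical model of one time series: metric name + labels + samples."""
    name: str
    labels: dict
    samples: list = field(default_factory=list)  # list of (unix_ts, value)

    def identity(self) -> str:
        # Render in the usual Prometheus notation: name{label="value",...}
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(self.labels.items()))
        return f"{self.name}{{{pairs}}}"


# Example values taken from this page; the metric name is an assumption.
series = TimeSeries(
    name="container_memory_working_set_bytes",
    labels={"namespace": "ai-agent-platform-prod", "cluster": "eks-gl-production"},
)
series.samples.append((1_700_000_000.0, 512 * 1024 * 1024))  # 512 MiB at one instant

print(series.identity())
```

Two series with the same metric name but different label values (for example, a different namespace) are distinct series, which is what lets the dashboards filter by cluster, namespace, and workload.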

Prerequisites

  1. Web Browser (e.g., Google Chrome, Microsoft Edge, Mozilla Firefox, etc.)

  2. GDP Labs Google Account (@gdplabs.id)

Simple Walkthrough

  1. Click Sign in with Google

  2. Choose your @gdplabs.id Google account

Workflow: Investigating a Memory Leak Issue

  1. Identify the Incident Scope

    Determine the affected application/system, environment (production/staging), and time window.

  2. Select Relevant Dashboard

    Open Kubernetes / Compute Resources / Namespace (Workloads) - Dashboards. Focus on the relevant application or project, e.g., choose "cluster: eks-gl-production, namespace: ai-agent-platform-prod" for AIP production issues.

  3. Set the Time Range: Filter by Incident Period

    1. Use the time picker to narrow the query to the timeframe of the incident.

    2. Choose between relative dates (e.g., "Last 24 hours") or absolute values (specific start/end).

  4. Narrow to a Specific Widget Group

    Focus on the group of widgets for the resource you are investigating (for a memory leak, the memory widgets). In the example used here, AIP memory usage does not appear to exceed its limit.
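The visual check in the last step can also be expressed as a number. The sketch below is one possible heuristic, not an established GDP Labs procedure: it fits a least-squares line through memory samples from the incident window and flags steady growth. The function names and the 10 MiB per hour threshold are assumptions chosen for illustration, and the samples are synthetic.

```python
def slope_per_second(samples):
    """Least-squares slope (value units per second) for (unix_ts, value) samples."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    return cov / var if var else 0.0


def looks_like_leak(samples, min_growth_bytes_per_hour=10 * 1024 * 1024):
    """Assumed heuristic: memory grows at least the threshold per hour on average."""
    return slope_per_second(samples) * 3600 >= min_growth_bytes_per_hour


MIB = 1024 * 1024

# Synthetic data: one sample per minute for an hour.
growing = [(i * 60, 100 * MIB + i * MIB) for i in range(60)]  # +1 MiB per minute
flat = [(i * 60, 100 * MIB) for i in range(60)]               # constant 100 MiB

print(looks_like_leak(growing))
print(looks_like_leak(flat))
```

A positive slope alone does not prove a leak: caches that warm up or garbage-collected runtimes with sawtooth usage can look similar over a short window, so it helps to compare against a longer time range before drawing conclusions.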

Dashboard Usage Scenarios

Below is a list of dashboards you can use to investigate your app and its usage.

  1. Kubernetes / Compute Resources / Namespace (Workloads) - Dashboards

    1. Investigate application resource (CPU, Memory, Network) usage

    2. Application specific usage

    3. Relevant when these kinds of alerts show up:

  2. Kubernetes / Compute Resources / Node (Pods) - Dashboards - Grafana

    1. Investigate node resource (CPU, Memory, Network) usage

    2. Node to application specific usage

    3. Relevant when these kinds of alerts show up:

  3. Kubernetes Container Images - Apps - Dashboards - Grafana

    1. Investigate which application versions are currently deployed

  4. Kubernetes / Persistent Volume - Dashboards - Grafana

    1. Investigate application persistent volume usage

    2. Relevant when these kinds of alerts show up:

Best Practices from Grafana

The Grafana documentation provides best-practice tutorials on how to use the app. See the tutorials for Grafana dashboards and alerts.

Quick-Reference Dashboard Filter

The dashboard features a filter bar at the top, enabling you to refine the data shown across all panels. Utilize these controls to focus on specific clusters, namespaces, or workloads and set the desired time range for your analysis. Each filter's description is detailed in the table below.

| Filter | Description |
| --- | --- |
| @timestamp | The date and time the metric was collected |
| cluster | Kubernetes cluster to inspect (e.g., eks-gl-production) |
| namespace | Kubernetes namespace to inspect (e.g., ai-agent-platform-prod) |
| workload_type | |
| workload | Specific workload in a particular namespace (e.g., ai-agent-platform-runner-worker) |
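For readers who want to reproduce a filtered view outside Grafana, the sketch below shows how the cluster and namespace filters plus a relative time range could translate into a Prometheus HTTP API `query_range` request. Several things here are assumptions: the base URL is a placeholder, the metric name `container_memory_working_set_bytes` is only an example, and whether a `cluster` label exists depends on how Prometheus is configured. The workload filters are left out because their mapping to labels is dashboard-specific.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# Placeholder address, not the real GDP Labs endpoint.
BASE_URL = "http://prometheus.example.internal:9090"


def label_selector(filters):
    """Turn {"label": "value"} filters into a PromQL selector, skipping unset ones."""
    parts = [f'{k}="{v}"' for k, v in filters.items() if v]
    return "{" + ",".join(parts) + "}"


def query_range_url(base_url, metric, filters, start, end, step="60s"):
    """Build a /api/v1/query_range URL for the given metric, filters, and window."""
    params = {
        "query": metric + label_selector(filters),
        "start": start.timestamp(),  # Unix timestamps are accepted by the API
        "end": end.timestamp(),
        "step": step,
    }
    return f"{base_url}/api/v1/query_range?{urlencode(params)}"


# "Last 24 hours" relative to a fixed end time, so the example is reproducible.
end = datetime(2024, 1, 2, tzinfo=timezone.utc)
start = end - timedelta(hours=24)

filters = {"cluster": "eks-gl-production", "namespace": "ai-agent-platform-prod"}
url = query_range_url(BASE_URL, "container_memory_working_set_bytes", filters, start, end)
print(url)
```

The sketch only builds the URL and does not send the request. Label values containing double quotes or backslashes would need escaping before being placed in a selector.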
