Using Grafana for Data provider usage audit: Prometheus + Grafana

Before we start

Nov 27, 2020

In this post, I want to describe our case of using Grafana + Prometheus. Our case stands out a bit from the common usage pattern: we tried to monitor business metrics rather than technical ones.

In the company, we use Grafana in the standard way: to visualize the monitoring of our infrastructure. It works fine and the engineers are happy with it.

However, one day our Product Manager came to us with a request that we found suitable to implement using Grafana + Prometheus: he asked us to monitor/audit the usage of data we consume from a 3rd party provider.

The story

Our feature team provides services based on data we consume from 3rd party vendors. To simplify the case: we buy data in bulk. Then we process it, enrich it, combine it with other data, and sell it in different dimensions (like retail). We pay data vendors per row of information we take. Moreover, different kinds of data cost different amounts. The vendors give us the means to see how much and what kind of data we have used per month. We had a kind of prepaid contract: we pay for a selected "month plan", a fixed price for a fixed amount of data. Leftovers do not carry over: each month the budget is reset to a predefined value, no matter what was left from the previous month. For a long time the contract gave us a budget that we were not able to burn through, even with exotic features or negligent dev testing. One day that changed, and we started running out of the monthly limit. We struggled to find the source of the overuse, and it was not easy.

The Goal

Our product people wanted a tool to see in real time where the budget is spent.

So the requirements at the beginning were:

We do not need precise monitoring (rough estimates are just fine): it is not going to be used as a billing system
Usage is sliced per system account (user) / feature / data type
It is possible to set alerts (per account (user) / feature / data type)
Minimal UX effort: the reporting was not supposed to be user-facing, just for internal usage. A command-line interface was acceptable

Why Prometheus and Grafana

The choice was easy: we already had the whole stack. We were using Prometheus for collecting metrics, and Grafana was already onboarded as the visualization tool for those metrics.

How it was implemented

The majority of the implementation time was spent on configuring our application to publish metrics with the proper value (remember, each row has a price, but the price differs depending on the data type or even on parameter values).

The metric we published was a counter. Each time we fetched data from our providers, we increased the counter according to the price, with tags relevant to our data slices:

accountId — our internal identification of the user's account
clientApp — our internal microservice (features) marker
reportType — provider data type
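
The publishing step described above can be sketched roughly as follows. This is only an illustration in plain Python, not our production code: the `SpentUnitsCounter` class is an in-memory stand-in for a real Prometheus client counter, and the `PRICE_PER_ROW` values and the `record_fetch` helper are hypothetical names and numbers chosen for the example.

```python
# Hypothetical price list: units charged per row for each provider data type.
PRICE_PER_ROW = {"basic": 1, "extended": 5}


class SpentUnitsCounter:
    """In-memory stand-in for a labelled counter that can only increase."""

    def __init__(self):
        self._values = {}

    def inc(self, amount, *, account_id, client_app, report_type):
        if amount < 0:
            raise ValueError("a counter can only increase")
        key = (account_id, client_app, report_type)
        self._values[key] = self._values.get(key, 0.0) + amount

    def value(self, *, account_id, client_app, report_type):
        return self._values.get((account_id, client_app, report_type), 0.0)


counter_spent_units_total = SpentUnitsCounter()


def record_fetch(account_id, client_app, report_type, rows):
    """Increase the counter by the price of the rows just fetched."""
    cost = PRICE_PER_ROW[report_type] * rows
    counter_spent_units_total.inc(
        cost,
        account_id=account_id,
        client_app=client_app,
        report_type=report_type,
    )
    return cost
```

The point of the design is that the counter grows by the cost of a fetch rather than by one per request, so any later sum over the metric is already expressed in budget units.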

The UI part was implemented entirely in Grafana. We created dashboards that represented expenditures in different areas.

Overview dashboard

The overall picture: the overall expenditure trend (aggregated per day) and the users that exceeded a particular threshold. This dashboard is where we start searching for expenditure leaks.

Per account dashboard

This dashboard represents expenditures for a particular account. Here we can drill down into a particular user account.

Pic 2. Account dashboard

Per client application dashboard

The same goes for a client application: all expenditures of one application.

Pic 3. Client application dashboard

Per report type dashboard

This dashboard shows where the data of a particular report type was used.

Query examples

The main query pattern was: get the overall increase filtered by one tag and grouped by another tag (aggregated per day):

sum by (clientApp) (increase(counter_spent_units_total{accountId="[[accountId]]"}[$__interval]))

Here $__interval is supplied by Grafana itself, based on the time range selected for the page, while accountId, clientApp and reportType are set as selectable dashboard parameters.
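
To make the "filter by one tag, group by another" pattern more concrete, here is a small sketch of the same idea in plain Python. It is a simplification and not what Prometheus actually executes: the real `increase()` also extrapolates to the window boundaries, which this sketch skips. The function names and the input shape (a list of label dictionaries paired with counter values) are assumptions made for the example.

```python
from collections import defaultdict


def simple_increase(samples):
    """Increase of one counter series over a window of ordered values.

    A drop in value is treated as a counter reset (restart from zero).
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def increase_by_client_app(series, account_id):
    """Keep series of one accountId, then sum their increases per clientApp."""
    result = defaultdict(float)
    for labels, samples in series:
        if labels["accountId"] == account_id:
            result[labels["clientApp"]] += simple_increase(samples)
    return dict(result)
```

Series that differ only in reportType fall into the same clientApp group here, which mirrors what `sum by (clientApp)` does with the labels it is not grouping by.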

What we achieved

Pros:

All requirements were met. We implemented a usage audit very quickly, without investing heavily in UX/UI development. Moreover, when an issue is escalated, we still have the flexibility to drill down to the minute: when a request happened, by whom, and what data was requested in which amount (although with some help from engineers).

Cons:

The single drawback we noticed was performance over time.

Every request is recorded in the counter metric, and the service is invoked a lot. As a result, our Prometheus could not handle queries that aggregate data over a couple of months. The fix was simple and based on Prometheus recording rules:

- record: counter_spent_units_total:increase_24h
  expr: increase(counter_spent_units_total[24h])

We created a recording rule for the counter metric aggregated per day. After that, we updated the dashboards to use this recording rule instead of the raw metric.
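
The idea behind the fix can be illustrated with a rough sketch in plain Python: collapse the raw samples into one value per day once, and let long-range reports sum those few daily values instead of rescanning every raw sample. This is only an analogy for what the recording rule gives us, not how Prometheus evaluates rules; the function names and the `(unix_timestamp, counter_value)` input shape are assumptions made for the example.

```python
from collections import defaultdict

SECONDS_PER_DAY = 24 * 60 * 60


def daily_increases(samples):
    """Collapse ordered (unix_timestamp, counter_value) samples into per-day increases.

    Each increase is attributed to the day of the later sample; a drop in
    value is treated as a counter reset (restart from zero).
    """
    per_day = defaultdict(float)
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        delta = (cur - prev) if cur >= prev else cur
        per_day[ts // SECONDS_PER_DAY] += delta
    return dict(per_day)


def total_over_days(per_day, first_day, last_day):
    """Sum the pre-aggregated daily values for an inclusive range of days."""
    return sum(v for day, v in per_day.items() if first_day <= day <= last_day)
```

A report over a couple of months then touches roughly sixty daily values per series, regardless of how many requests were made during that time.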