Google Cloud’s SQL-Based Alerting Changes What SREs Can Detect Before Production Breaks

Google Cloud’s new SQL-based alerting in Observability Analytics lets teams alert on complex log and trace patterns, high-cardinality data, p99 latency, and customer-specific error rates.
Hey Maria
Jun 29 

Google Cloud has introduced SQL-based alerting in Cloud Monitoring Observability Analytics, and on the surface, it sounds like a simple product update. Developers can now create alerts by writing SQL queries over observability data instead of relying only on predefined metrics or simple thresholds.
But the bigger story is this: Google Cloud is moving alerting closer to analytical debugging.
For SREs, platform engineers, and developers running distributed systems, this is a meaningful shift. Traditional alerts are good at telling you when a metric crosses a line. CPU is too high. Error count is above threshold. Latency exceeded a static limit. Those alerts still matter, but they often miss the messy reality of modern systems, where the real issue is hidden inside relationships between logs, traces, services, customers, regions, sessions, or request paths.
A single metric might say the system is healthy. But one enterprise customer could be failing at a 20% error rate. A specific AI agent workflow could be timing out only when it calls a certain external tool. A checkout flow could look fine globally while p99 latency quietly degrades for users in one region. These are the kinds of issues that basic threshold monitoring struggles to catch.
SQL-based alerting gives teams a way to describe those conditions directly.
Instead of asking, “Did this one metric cross a threshold?” teams can ask, “Did the percentage of failed requests for this customer exceed 5% over the last 10 minutes?” Or, “Did p99 latency for this orchestrator service exceed five seconds?” Or, “Are database timeout logs correlated with slow spans in checkout traces?”
That is a very different kind of alert.
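As a rough illustration of the p99 question above, the sketch below uses Python's built-in sqlite3 module as a stand-in SQL engine. The spans table, its columns, and the sample latencies are assumptions invented for this example, not the Observability Analytics schema. The percentile uses a nearest-rank calculation with window functions because SQLite has no percentile function; on BigQuery, an approximate quantile function such as APPROX_QUANTILES would be the more likely choice.

```python
import sqlite3

# Hypothetical trace data: 98 fast spans and 2 very slow ones.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spans (service TEXT, latency_ms INTEGER)")
rows = [("orchestrator", 200)] * 98 + [("orchestrator", 8000)] * 2
conn.executemany("INSERT INTO spans VALUES (?, ?)", rows)

# Nearest-rank p99: the smallest latency whose rank covers 99% of rows.
# Window functions require SQLite 3.25 or newer.
P99_SQL = """
WITH ranked AS (
  SELECT latency_ms,
         ROW_NUMBER() OVER (ORDER BY latency_ms) AS rn,
         COUNT(*) OVER () AS n
  FROM spans
  WHERE service = ?
)
SELECT MIN(latency_ms) FROM ranked WHERE rn * 100 >= n * 99
"""
AVG_SQL = "SELECT AVG(latency_ms) FROM spans WHERE service = ?"

avg_ms = conn.execute(AVG_SQL, ("orchestrator",)).fetchone()[0]
p99_ms = conn.execute(P99_SQL, ("orchestrator",)).fetchone()[0]

# Assumed alert condition: p99 above five seconds.
p99_alert = p99_ms > 5000
print(avg_ms, p99_ms, p99_alert)
```

With this sample data the average is 356 ms, which looks healthy, while p99 is 8,000 ms, so only the percentile condition trips.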
What Google Cloud Actually Launched
The new feature allows teams to create alerting policies from SQL queries inside Observability Analytics. Observability Analytics, formerly Log Analytics, is Google Cloud’s SQL-based interface for querying logs and traces. It allows developers and SREs to perform aggregate analysis directly on telemetry data and use those results for troubleshooting, dashboards, and now alerts.
A team writes a SQL query, validates it, chooses the BigQuery query engine, and then creates an alert from the query results. The alert query runs on a schedule, such as every 10 minutes, and each run evaluates a lookback window covering the data received since the previous run.
Google supports two main alert patterns.
The first is a row-count threshold. For example, open an incident if a query for payment-gateway timeout events returns more than 10 rows.
The second is a boolean condition. This is more powerful because the SQL query itself can calculate a percentage, percentile, or custom business condition, then return true when the alert should fire.
That means alert logic can move from rigid metric configuration into SQL.
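The two patterns can be sketched side by side. This is a minimal local mock-up using Python's sqlite3 module, not Google Cloud's API: the logs table, its columns, the sample rows, and the 5% cutoff are all assumptions chosen for illustration.

```python
import sqlite3

# Hypothetical log table: 12 timeouts out of 200 payment-gateway requests.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (service TEXT, message TEXT, ok INTEGER)")
rows = ([("payment-gateway", "timeout", 0)] * 12
        + [("payment-gateway", "ok", 1)] * 188)
conn.executemany("INSERT INTO logs VALUES (?, ?, ?)", rows)

# Pattern 1: row-count threshold. The query returns matching rows and
# the alert compares the number of rows against a fixed limit.
ROWS_SQL = """
SELECT message FROM logs
WHERE service = 'payment-gateway' AND message = 'timeout'
"""
timeout_rows = conn.execute(ROWS_SQL).fetchall()
row_count_alert = len(timeout_rows) > 10

# Pattern 2: boolean condition. The query computes the failure
# percentage itself and returns a single true/false value.
BOOLEAN_SQL = """
SELECT 100.0 * SUM(1 - ok) / COUNT(*) > 5.0 AS should_alert
FROM logs
WHERE service = 'payment-gateway'
"""
boolean_alert = bool(conn.execute(BOOLEAN_SQL).fetchone()[0])

print(len(timeout_rows), row_count_alert, boolean_alert)
```

In the first pattern the threshold lives outside the query. In the second, the threshold is part of the SQL, so changing the alert logic means editing the query rather than the alert configuration.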
Why This Matters for SREs
The most important benefit is context.
Many production incidents are not obvious at the infrastructure level. A service can be “up” while a business workflow is broken. A fleet can look healthy while a single tenant is experiencing severe degradation. Average latency can be acceptable while p99 latency is terrible. Error count can look normal while the error percentage for a high-value customer is unacceptable.
This is where SQL-based alerting becomes useful.
SREs often care less about raw events and more about ratios, distributions, joins, and grouped conditions. They need to know not just that errors exist, but whether errors are meaningful relative to traffic. They need to know not just that latency increased, but whether latency increased for the most important path, service, customer, or agent workflow.
SQL is naturally good at this kind of analysis.
For example, a simple threshold alert might say:
“Alert when error_count > 100.”
A SQL-based alert can say:
“Alert when error_rate for customer_id = X exceeds 5% and total traffic is above 500 requests in the last 10 minutes.”
That second version is more operationally useful. It reduces noise, avoids alerting on tiny sample sizes, and gives responders immediate context.
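A minimal sketch of that contrast, again using Python's sqlite3 module as a local stand-in. The requests table, the customer names, and the traffic numbers are hypothetical, and the 10-minute window is omitted to keep the example short. The query is grouped by customer rather than filtered to one customer_id, so it reports every customer that breaches the condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (customer_id TEXT, is_error INTEGER)")

# Hypothetical traffic: (customer, total requests, failed requests).
rows = []
for customer, total, errors in [("acme", 1000, 80),
                                ("globex", 2000, 20),
                                ("tiny", 10, 5)]:
    rows += [(customer, 1)] * errors + [(customer, 0)] * (total - errors)
conn.executemany("INSERT INTO requests VALUES (?, ?)", rows)

# Naive threshold: a fleet-wide error count with no customer context.
total_errors = conn.execute("SELECT SUM(is_error) FROM requests").fetchone()[0]
naive_alert = total_errors > 100

# Ratio-based condition with a traffic floor, evaluated per customer.
ALERT_SQL = """
SELECT customer_id,
       COUNT(*) AS total,
       100.0 * SUM(is_error) / COUNT(*) AS error_pct
FROM requests
GROUP BY customer_id
HAVING COUNT(*) > 500
   AND 100.0 * SUM(is_error) / COUNT(*) > 5.0
ORDER BY customer_id
"""
breaching = conn.execute(ALERT_SQL).fetchall()
print(naive_alert, breaching)
```

The naive count fires but says nothing about who is affected. The grouped query returns only acme, at 8% errors over 1,000 requests, and ignores tiny, whose 50% error rate comes from just 10 requests.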
The High-Cardinality Problem
One reason this feature is interesting is that it addresses a long-standing observability pain point: high-cardinality data.
High-cardinality data refers to fields with many possible values, such as customer IDs, user IDs, session IDs, IP addresses, transaction IDs, request paths, or trace IDs. These fields are extremely useful during debugging, but they can be expensive or difficult to model as traditional metrics.
For example, turning every customer ID into a metric label can explode the number of time series. That creates cost, performance, and governance problems. Many teams intentionally avoid putting unbounded values into metric labels for this reason.
But those same values are often present in logs and traces.
SQL-based alerting gives teams another path. Instead of converting every high-cardinality dimension into a metric, teams can query logs and traces directly when they need targeted detection. This is especially valuable for enterprise SaaS platforms, fintech systems, AI agent platforms, and marketplaces where customer-specific reliability matters.
You probably do not want a metric for every user session. But you might want an alert when a top customer’s checkout attempts are failing at an unusual rate.
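To put rough numbers on the time-series explosion described above, here is a back-of-the-envelope calculation in Python. The label names and their cardinalities are made-up assumptions, and the product is an upper bound, since not every label combination occurs in practice.

```python
from math import prod

def series_count(label_cardinalities):
    """Upper bound on time series: the product of distinct values per label."""
    return prod(label_cardinalities.values())

# Hypothetical metric labels with bounded cardinality.
base = {"endpoint": 20, "region": 5, "status_class": 5}

# The same metric after adding an assumed 2,000 distinct customer IDs.
with_customer = {**base, "customer_id": 2000}

print(series_count(base), series_count(with_customer))
```

Under these assumptions, one extra label takes the metric from at most 500 series to at most 1,000,000, which is why many teams leave such fields in logs and traces and query them there.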
Use Case 1: Customer-Specific Error Rate
A classic problem in SaaS reliability is that global metrics can hide tenant-level failures.
Imagine a B2B platform with 2,000 customers. Overall API success rate is 99.8%, which looks fine. But one strategic customer is seeing only 92% success on a key endpoint after a deployment. Traditional aggregat