Segments¶
When defining metrics it’s important to keep them simple and ideal – as close to their essence as possible. Filters and dimensions should always be applied at query time, instead of being part of the metric definition. This prevents metric proliferation and staleness. Segments are saved filters that can be used at query time to improve this workflow.
Database selector loading…
Imagine we have this table:
CREATE TABLE fact_orders (
id INTEGER PRIMARY KEY,
app_id INTEGER,
amount INTEGER,
payment_type STRING
)
CREATE TABLE fact_orders (
id INTEGER,
app_id INTEGER,
amount INTEGER,
payment_type TEXT
)
CREATE TABLE fact_orders (
id SERIAL PRIMARY KEY,
app_id INTEGER,
amount INTEGER,
payment_type TEXT
)
And we create a couple metrics:
CREATE VIEW total_orders AS
SELECT SUM(amount) AS "total_orders [type=INTEGER]"
FROM fact_orders;
CREATE VIEW total_cc_orders AS
SELECT SUM(amount) AS "total_cc_orders [type=INTEGER]"
FROM fact_orders
WHERE payment_type = 'card';
CREATE VIEW total_orders AS
SELECT SUM(amount) AS total_orders
FROM fact_orders;
CREATE VIEW total_cc_orders AS
SELECT SUM(amount) AS total_cc_orders
FROM fact_orders
WHERE payment_type = 'card';
Now imagine a data engineer tells us that app_id=1234 is actually a test application, and it shouldn’t be included in the metrics. We now have to update all the metrics, and risk leaving some of them producing incorrect results.
A better solution is to add what we call a transform, which is a view that doesn’t change cardinality, and point our metrics to it:
CREATE VIEW fact_orders_annotated AS
SELECT * FROM fact_orders
WHERE app_id <> 1234;
CREATE VIEW total_orders AS
SELECT SUM(amount) AS "total_orders [type=INTEGER]"
FROM fact_orders_annotated;
CREATE VIEW total_cc_orders AS
SELECT SUM(amount) AS "total_cc_orders [type=INTEGER]"
FROM fact_orders_annotated
WHERE payment_type = 'card';
CREATE VIEW fact_orders_annotated AS
SELECT * FROM fact_orders
WHERE app_id <> 1234;
CREATE VIEW total_orders AS
SELECT SUM(amount) AS total_orders
FROM fact_orders_annotated;
CREATE VIEW total_cc_orders AS
SELECT SUM(amount) AS total_cc_orders
FROM fact_orders_annotated
WHERE payment_type = 'card';
This still has the issue that we need to be sure to update all related metrics to read from fact_orders_annotated instead of fact_orders.
An even better approach is to define only the pure version of the metric:
CREATE VIEW fact_orders_annotated AS
SELECT * FROM fact_orders
WHERE app_id <> 1234;
CREATE VIEW total_orders AS
SELECT SUM(amount) AS "total_orders [type=INTEGER]"
FROM fact_orders_annotated;
CREATE VIEW fact_orders_annotated AS
SELECT * FROM fact_orders
WHERE app_id <> 1234;
CREATE VIEW total_orders AS
SELECT SUM(amount) AS total_orders
FROM fact_orders_annotated;
Instead of having to define a total_cc_orders metric, we simply pass the filters we want at query time. For credit card only orders we can pass filters:!('fact_orders_annotated.payment_type=card'):
% curl "http://localhost:1974/db/run?q=metrics:!(total_orders),filters:!('fact_orders_annotated.payment_type=card')" | jq
{
"database": "db",
"metrics": [
{
"name": "total_orders",
"type": "INTEGER",
"description": null,
"joins": null
}
],
"dimensions": [],
"rows": [
{
"total_orders": 10
}
]
}
One problem with this approach is that it only works for simple filters – we lost the expressiveness of SQL! To solve this, Cantrip implements a flow for reusing saved filters, which are called segments. Segments are simply transforms that filter fact tables, and they can be applied at query time to any metric that reads from the same fact tables. Here’s an example:
CREATE VIEW credit_card_only AS
-- cantrip: description=Orders paid with credit card
SELECT * FROM fact_orders
WHERE payment_type = 'card';
CREATE VIEW credit_card_only AS
SELECT * FROM fact_orders
WHERE payment_type = 'card';
COMMENT ON VIEW credit_card_only IS 'description=Orders paid with credit card';
We can use the /segments endpoint to find which segments are compatible with given metrics:
% curl 'http://localhost:1974/db/segments?q=metrics:!(total_orders)' | jq
{
"database": "db",
"segments": [
{
"name": "credit_card_only",
"description": "Orders paid with credit card"
},
{
"name": "fact_orders_annotated",
"description": null
}
]
}
Note that fact_orders_annotated is returned as a valid segment, since technically it’s a transform reading from a fact table connected to the metric. In the future we might change the logic so that trivial segments like this are ignored.
We can then use the segment when generating SQL:
% curl 'http://localhost:1974/db/sql?q=metrics:!(total_orders),segments:!(credit_card_only)' \
> | jq -r '.query.sql'
SELECT
"__annotated"."total_orders" AS "total_orders [INTEGER]"
FROM (
SELECT
"__final"."total_orders" AS "total_orders"
FROM (
SELECT
SUM("fact_orders_annotated"."amount") AS "total_orders"
FROM "fact_orders_annotated" AS "fact_orders_annotated"
WHERE
fact_orders_annotated.payment_type = 'card'
) AS __final
) AS __annotated
SELECT
"__final"."total_orders" AS "total_orders"
FROM (
SELECT
SUM("fact_orders_annotated"."amount") AS "total_orders"
FROM "fact_orders_annotated" AS "fact_orders_annotated"
WHERE
"fact_orders_annotated"."payment_type" = 'card'
) AS __final
It’s important to keep segments simple! Segments are applied to the query by copying their predicates, so they should read from a single fact table, without any column transformations other than filtering. In the future we might support more complex use cases but, for now, keep segments as simple views that filter fact tables.
Next let’s look at populations, since they are very similar to segments.