.. _segments:

.. role:: rison(code)
   :language: o-rison

Segments
========

When defining metrics it's important to keep them simple and ideal -- as close to their essence as possible. Filters and dimensions should always be applied at query time, instead of being part of the metric definition. This prevents metric proliferation and staleness. **Segments** are saved filters that can be used at query time to improve this workflow.

----

.. container:: db-selector

   Database selector loading...

Imagine we have this table:

.. code-block:: sql
   :class: db-example db-sqlite

   CREATE TABLE fact_orders (
     id INTEGER PRIMARY KEY,
     app_id INTEGER,
     amount INTEGER,
     payment_type STRING
   )

.. code-block:: sql
   :class: db-example db-duckdb

   CREATE TABLE fact_orders (
     id INTEGER,
     app_id INTEGER,
     amount INTEGER,
     payment_type TEXT
   )

.. code-block:: sql
   :class: db-example db-postgres

   CREATE TABLE fact_orders (
     id SERIAL PRIMARY KEY,
     app_id INTEGER,
     amount INTEGER,
     payment_type TEXT
   )

And we create a couple metrics:

.. code-block:: sql
   :class: db-example db-sqlite

   CREATE VIEW total_orders AS
   SELECT SUM(amount) AS "total_orders [type=INTEGER]"
   FROM fact_orders;

   CREATE VIEW total_cc_orders AS
   SELECT SUM(amount) AS "total_cc_orders [type=INTEGER]"
   FROM fact_orders
   WHERE payment_type = 'card';

.. code-block:: sql
   :class: db-example db-duckdb db-postgres

   CREATE VIEW total_orders AS
   SELECT SUM(amount) AS total_orders
   FROM fact_orders;

   CREATE VIEW total_cc_orders AS
   SELECT SUM(amount) AS total_cc_orders
   FROM fact_orders
   WHERE payment_type = 'card';

Now imagine a data engineer tells us that ``app_id=1234`` is actually a test application, and it shouldn't be included in the metrics. We now have to update all the metrics, and risk leaving some of them producing incorrect results.

A better solution is to add what we call a **transform**, which is **a view that doesn't change cardinality**, and point our metrics to it:

.. code-block:: sql
   :class: db-example db-sqlite

   CREATE VIEW fact_orders_annotated AS
   SELECT * FROM fact_orders
   WHERE app_id <> 1234;

   CREATE VIEW total_orders AS
   SELECT SUM(amount) AS "total_orders [type=INTEGER]"
   FROM fact_orders_annotated;

   CREATE VIEW total_cc_orders AS
   SELECT SUM(amount) AS "total_cc_orders [type=INTEGER]"
   FROM fact_orders_annotated
   WHERE payment_type = 'card';

.. code-block:: sql
   :class: db-example db-duckdb db-postgres

   CREATE VIEW fact_orders_annotated AS
   SELECT * FROM fact_orders
   WHERE app_id <> 1234;

   CREATE VIEW total_orders AS
   SELECT SUM(amount) AS total_orders
   FROM fact_orders_annotated;

   CREATE VIEW total_cc_orders AS
   SELECT SUM(amount) AS total_cc_orders
   FROM fact_orders_annotated
   WHERE payment_type = 'card';

This still has the issue that we need to be sure to update all related metrics to read from ``fact_orders_annotated`` instead of ``fact_orders``.

An even better approach is to define only the pure version of the metric:

.. code-block:: sql
   :class: db-example db-sqlite

   CREATE VIEW fact_orders_annotated AS
   SELECT * FROM fact_orders
   WHERE app_id <> 1234;

   CREATE VIEW total_orders AS
   SELECT SUM(amount) AS "total_orders [type=INTEGER]"
   FROM fact_orders_annotated;

.. code-block:: sql
   :class: db-example db-duckdb db-postgres

   CREATE VIEW fact_orders_annotated AS
   SELECT * FROM fact_orders
   WHERE app_id <> 1234;

   CREATE VIEW total_orders AS
   SELECT SUM(amount) AS total_orders
   FROM fact_orders_annotated;

Instead of having to define a ``total_cc_orders`` metric, we simply pass the filters we want at query time. For credit card only orders we can pass :rison:`filters:!('fact_orders_annotated.payment_type=card')`:

.. code-block:: bash

   % curl "http://localhost:1974/db/run?q=metrics:!(total_orders),filters:!('fact_orders_annotated.payment_type=card')" | jq

.. code-block:: json

   {
     "database": "db",
     "metrics": [
       {
         "name": "total_orders",
         "type": "INTEGER",
         "description": null,
         "joins": null
       }
     ],
     "dimensions": [],
     "rows": [
       {
         "total_orders": 10
       }
     ]
   }

One problem with this approach is that it only works for simple filters -- we lost the expressiveness of SQL! To solve this, Cantrip implements a flow for reusing saved filters, which are called **segments**. Segments are simply transforms that filter fact tables, and they can be applied at query time to any metric that reads from the same fact tables. Here's an example:

.. code-block:: sql
   :class: db-example db-sqlite

   CREATE VIEW credit_card_only AS
   -- cantrip: description=Orders paid with credit card
   SELECT * FROM fact_orders
   WHERE payment_type = 'card';

.. code-block:: sql
   :class: db-example db-duckdb db-postgres

   CREATE VIEW credit_card_only AS
   SELECT * FROM fact_orders
   WHERE payment_type = 'card';

   COMMENT ON VIEW credit_card_only IS 'description=Orders paid with credit card';

.. container:: only-light

   .. graphviz:: _graphs/segments/database.light.dot

.. container:: only-dark

   .. graphviz:: _graphs/segments/database.dark.dot

We can use the ``/segments`` endpoint to find which segments are compatible with given metrics:

.. code-block:: bash

   % curl 'http://localhost:1974/db/segments?q=metrics:!(total_orders)' | jq

.. code-block:: json

   {
     "database": "db",
     "segments": [
       {
         "name": "credit_card_only",
         "description": "Orders paid with credit card"
       },
       {
         "name": "fact_orders_annotated",
         "description": null
       }
     ]
   }

Note that ``fact_orders_annotated`` is returned as a valid segment, since technically it's a transform reading from a fact table connected to the metric. In the future we might change the logic so that trivial segments like this are ignored.

We can then use the segment when generating SQL:

.. code-block:: bash

   % curl 'http://localhost:1974/db/sql?q=metrics:!(total_orders),segments:!(credit_card_only)' \
   > | jq -r '.query.sql'

.. code-block:: sql
   :class: db-example db-sqlite

   SELECT
     "__annotated"."total_orders" AS "total_orders [INTEGER]"
   FROM (
     SELECT
       "__final"."total_orders" AS "total_orders"
     FROM (
       SELECT
         SUM("fact_orders_annotated"."amount") AS "total_orders"
       FROM "fact_orders_annotated" AS "fact_orders_annotated"
       WHERE
         fact_orders_annotated.payment_type = 'card'
     ) AS __final
   ) AS __annotated

.. code-block:: sql
   :class: db-example db-duckdb db-postgres

   SELECT
     "__final"."total_orders" AS "total_orders"
   FROM (
     SELECT
       SUM("fact_orders_annotated"."amount") AS "total_orders"
     FROM "fact_orders_annotated" AS "fact_orders_annotated"
     WHERE
       "fact_orders_annotated"."payment_type" = 'card'
   ) AS __final

.. container:: only-light

   .. graphviz:: _graphs/segments/query.light.dot

.. container:: only-dark

   .. graphviz:: _graphs/segments/query.dark.dot

It's important to keep segments simple! Segments are applied to the query by copying their predicates, so they should read from a single fact table, without any column transformations other than filtering. In the future we might support more complex use cases but, for now, keep segments as simple views that filter fact tables.

Next let's look at :ref:`populations <populations>`, since they are very similar to segments.