Maximizing productivity with selectors

blog
new feature
productivity
duckdb
Author

Phillip Cloud

Published

February 27, 2023

Before Ibis 5.0 it’s been challenging to concisely express whole-table operations with ibis. Happily this is no longer the case in ibis 5.0.

Let’s jump right in!

We’ll look at selectors examples using the palmerpenguins data set with the DuckDB backend.

Setup

from ibis.interactive import *

t = ex.penguins.fetch()
t
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat64float64int64int64stringint64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie Torgersen39.118.71813750male  2007 │
│ Adelie Torgersen39.517.41863800female2007 │
│ Adelie Torgersen40.318.01953250female2007 │
│ Adelie TorgersenNULLNULLNULLNULLNULL2007 │
│ Adelie Torgersen36.719.31933450female2007 │
│ Adelie Torgersen39.320.61903650male  2007 │
│ Adelie Torgersen38.917.81813625female2007 │
│ Adelie Torgersen39.219.61954675male  2007 │
│ Adelie Torgersen34.118.11933475NULL2007 │
│ Adelie Torgersen42.020.21904250NULL2007 │
│  │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

Examples

Normalization

Let’s say you want to compute the z-score of every numeric column and replace the existing data with that normalized value. Here’s how you’d do that with selectors:

t.mutate(s.across(s.numeric(), (_ - _.mean()) / _.std()))
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┓
┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year      ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━┩
│ stringstringfloat64float64float64float64stringfloat64   │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────────┤
│ Adelie Torgersen-0.8832050.784300-1.416272-0.563317male  -1.257484 │
│ Adelie Torgersen-0.8099390.126003-1.060696-0.500969female-1.257484 │
│ Adelie Torgersen-0.6634080.429833-0.420660-1.186793female-1.257484 │
│ Adelie TorgersenNULLNULLNULLNULLNULL-1.257484 │
│ Adelie Torgersen-1.3227991.088129-0.562890-0.937403female-1.257484 │
│ Adelie Torgersen-0.8465721.746426-0.776236-0.688012male  -1.257484 │
│ Adelie Torgersen-0.9198370.328556-1.416272-0.719186female-1.257484 │
│ Adelie Torgersen-0.8648881.240044-0.4206600.590115male  -1.257484 │
│ Adelie Torgersen-1.7990250.480471-0.562890-0.906229NULL-1.257484 │
│ Adelie Torgersen-0.3520291.543873-0.7762360.060160NULL-1.257484 │
│  │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────────┘

What’s Up With the year Column?

Whoops, looks like we included year in our normalization because it’s an int64 column (and therefore numeric) but normalizing the year doesn’t make sense.

We can exclude year from the normalization using another selector:

t.mutate(s.across(s.numeric() & ~s.cols("year"), (_ - _.mean()) / _.std()))
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat64float64float64float64stringint64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie Torgersen-0.8832050.784300-1.416272-0.563317male  2007 │
│ Adelie Torgersen-0.8099390.126003-1.060696-0.500969female2007 │
│ Adelie Torgersen-0.6634080.429833-0.420660-1.186793female2007 │
│ Adelie TorgersenNULLNULLNULLNULLNULL2007 │
│ Adelie Torgersen-1.3227991.088129-0.562890-0.937403female2007 │
│ Adelie Torgersen-0.8465721.746426-0.776236-0.688012male  2007 │
│ Adelie Torgersen-0.9198370.328556-1.416272-0.719186female2007 │
│ Adelie Torgersen-0.8648881.240044-0.4206600.590115male  2007 │
│ Adelie Torgersen-1.7990250.480471-0.562890-0.906229NULL2007 │
│ Adelie Torgersen-0.3520291.543873-0.7762360.060160NULL2007 │
│  │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

cols selects one or more columns, and the ~ means “negate”. Combining those we get “every column except for ‘year’”!

Pretty neat right?

Composable Group By

The power of this approach comes in when you want the grouped version. Perhaps we think some of these columns vary by species.

With selectors, all you need to do is slap a .group_by("species") onto t:

t.group_by("species").mutate(
    s.across(s.numeric() & ~s.cols("year"), (_ - _.mean()) / _.std())
)
┏━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species  island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat64float64float64float64stringint64 │
├─────────┼────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Gentoo Biscoe-0.455854-1.816223-0.954050-1.142626female2007 │
│ Gentoo Biscoe-0.975022-0.287513-0.491442-0.448342female2009 │
│ Gentoo Biscoe0.387793-0.898997-1.108253-1.241809female2007 │
│ Gentoo Biscoe0.8096160.2220560.1253681.237778male  2007 │
│ Gentoo Biscoe0.030865-0.491341-0.3372400.642677male  2007 │
│ Gentoo Biscoe-0.326062-1.510481-1.108253-1.043442female2007 │
│ Gentoo Biscoe-0.682990-0.389427-0.954050-0.547525female2007 │
│ Gentoo Biscoe-0.2611670.3239700.2795710.245943male  2007 │
│ Gentoo Biscoe-1.364397-1.612395-1.262455-1.340993female2007 │
│ Gentoo Biscoe-0.2287190.425884-0.3372400.146759male  2007 │
│  │
└─────────┴────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

Since ibis translates this into a run-of-the-mill selection as if you had called select or mutate without selectors, nothing special is needed for a backend to work with these new constructs.

Let’s look at some more examples.

Min-max Normalization

Grouped min/max normalization? Easy:

t.group_by("species").mutate(
    s.across(s.numeric() & ~s.cols("year"), (_ - _.min()) / (_.max() - _.min()))
)
┏━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species  island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat64float64float64float64stringint64 │
├─────────┼────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Gentoo Biscoe0.2780750.0238100.2857140.234043female2007 │
│ Gentoo Biscoe0.1925130.3809520.3928570.382979female2009 │
│ Gentoo Biscoe0.4171120.2380950.2500000.212766female2007 │
│ Gentoo Biscoe0.4866310.5000000.5357140.744681male  2007 │
│ Gentoo Biscoe0.3582890.3333330.4285710.617021male  2007 │
│ Gentoo Biscoe0.2994650.0952380.2500000.255319female2007 │
│ Gentoo Biscoe0.2406420.3571430.2857140.361702female2007 │
│ Gentoo Biscoe0.3101600.5238100.5714290.531915male  2007 │
│ Gentoo Biscoe0.1283420.0714290.2142860.191489female2007 │
│ Gentoo Biscoe0.3155080.5476190.4285710.510638male  2007 │
│  │
└─────────┴────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

Casting and Munging

How about casting every column whose name ends with any of the strings "mm" or "g" to a float32? No problem!

t.mutate(s.across(s.endswith(("mm", "g")), _.cast("float32")))
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat32float32float32float32stringint64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie Torgersen39.09999818.700001181.03750.0male  2007 │
│ Adelie Torgersen39.50000017.400000186.03800.0female2007 │
│ Adelie Torgersen40.29999918.000000195.03250.0female2007 │
│ Adelie TorgersenNULLNULLNULLNULLNULL2007 │
│ Adelie Torgersen36.70000119.299999193.03450.0female2007 │
│ Adelie Torgersen39.29999920.600000190.03650.0male  2007 │
│ Adelie Torgersen38.90000217.799999181.03625.0female2007 │
│ Adelie Torgersen39.20000119.600000195.04675.0male  2007 │
│ Adelie Torgersen34.09999818.100000193.03475.0NULL2007 │
│ Adelie Torgersen42.00000020.200001190.04250.0NULL2007 │
│  │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

We can make all string columns have the same case too!

t.mutate(s.across(s.of_type("string"), _.lower()))
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ stringstringfloat64float64int64int64stringint64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ adelie torgersen39.118.71813750male  2007 │
│ adelie torgersen39.517.41863800female2007 │
│ adelie torgersen40.318.01953250female2007 │
│ adelie torgersenNULLNULLNULLNULLNULL2007 │
│ adelie torgersen36.719.31933450female2007 │
│ adelie torgersen39.320.61903650male  2007 │
│ adelie torgersen38.917.81813625female2007 │
│ adelie torgersen39.219.61954675male  2007 │
│ adelie torgersen34.118.11933475NULL2007 │
│ adelie torgersen42.020.21904250NULL2007 │
│  │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

Multiple Computations per Column

What if I want to compute multiple things? Heck yeah!

t.group_by("sex").mutate(
    s.across(
        s.numeric() & ~s.cols("year"),
        dict(centered=_ - _.mean(), zscore=(_ - _.mean()) / _.std()),
    )
).select("sex", s.endswith(("_centered", "_zscore")))
┏━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ sex     bill_length_mm_centered  bill_depth_mm_centered  flipper_length_mm_centered  body_mass_g_centered  bill_length_mm_zscore  bill_depth_mm_zscore  flipper_length_mm_zscore  body_mass_g_zscore ┃
┡━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ stringfloat64float64float64float64float64float64float64float64            │
├────────┼─────────────────────────┼────────────────────────┼────────────────────────────┼──────────────────────┼───────────────────────┼──────────────────────┼──────────────────────────┼────────────────────┤
│ male  0.445238-2.09107110.494048504.3154760.082960-1.1222100.7213460.640296 │
│ male  2.245238-2.7910714.494048954.3154760.418349-1.4978780.3089141.211631 │
│ male  -6.2547620.208929-18.505952-95.684524-1.1654340.112125-1.272072-0.121484 │
│ male  -5.0547621.0089293.494048-245.684524-0.9418410.5414590.240176-0.311929 │
│ male  -11.2547623.208929-6.505952-145.684524-2.0970711.722128-0.447210-0.184966 │
│ male  -3.3547622.808929-7.505952-45.684524-0.6250841.507461-0.515948-0.058003 │
│ male  0.1452383.608929-10.505952-345.6845240.0270621.936795-0.722164-0.438893 │
│ male  -8.1547620.808929-24.505952-945.684524-1.5194560.434126-1.684504-1.200673 │
│ male  -7.6547620.208929-19.505952-595.684524-1.4262920.112125-1.340811-0.756301 │
│ male  -7.054762-0.691071-24.505952-745.684524-1.314496-0.370876-1.684504-0.946746 │
│  │
└────────┴─────────────────────────┴────────────────────────┴────────────────────────────┴──────────────────────┴───────────────────────┴──────────────────────┴──────────────────────────┴────────────────────┘

Don’t like the naming convention?

Pass a function to make your own name!

t.select(s.startswith("bill")).mutate(
    s.across(
        s.all(),
        dict(x=_ - _.mean(), y=_.max()),
        names=lambda col, fn: f"{col}_{fn}_improved",
    )
)
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ bill_length_mm  bill_depth_mm  bill_length_mm_x_improved  bill_depth_mm_x_improved  bill_length_mm_y_improved  bill_depth_mm_y_improved ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ float64float64float64float64float64float64                  │
├────────────────┼───────────────┼───────────────────────────┼──────────────────────────┼───────────────────────────┼──────────────────────────┤
│           39.118.7-4.821931.5488359.621.5 │
│           39.517.4-4.421930.2488359.621.5 │
│           40.318.0-3.621930.8488359.621.5 │
│           NULLNULLNULLNULL59.621.5 │
│           36.719.3-7.221932.1488359.621.5 │
│           39.320.6-4.621933.4488359.621.5 │
│           38.917.8-5.021930.6488359.621.5 │
│           39.219.6-4.721932.4488359.621.5 │
│           34.118.1-9.821930.9488359.621.5 │
│           42.020.2-1.921933.0488359.621.5 │
│               │
└────────────────┴───────────────┴───────────────────────────┴──────────────────────────┴───────────────────────────┴──────────────────────────┘

Don’t like lambda functions? We support a format string too!

t.select(s.startswith("bill")).mutate(
    s.across(
        s.all(),
        func=dict(x=_ - _.mean(), y=_.max()),
        names="{col}_{fn}_improved",
    )
).head(2)
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ bill_length_mm  bill_depth_mm  bill_length_mm_x_improved  bill_depth_mm_x_improved  bill_length_mm_y_improved  bill_depth_mm_y_improved ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ float64float64float64float64float64float64                  │
├────────────────┼───────────────┼───────────────────────────┼──────────────────────────┼───────────────────────────┼──────────────────────────┤
│           39.118.7-4.821931.5488359.621.5 │
│           39.517.4-4.421930.2488359.621.5 │
└────────────────┴───────────────┴───────────────────────────┴──────────────────────────┴───────────────────────────┴──────────────────────────┘

Working with other Ibis APIs

We’ve seen lots of mutate use, but selectors also work with .agg:

t.group_by("year").agg(s.across(s.numeric() & ~s.cols("year"), _.mean())).order_by("year")
┏━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ year   bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g ┃
┡━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ int64float64float64float64float64     │
├───────┼────────────────┼───────────────┼───────────────────┼─────────────┤
│  200743.74036717.427523196.8807344124.541284 │
│  200843.54122816.914035202.7982464266.666667 │
│  200944.45294117.125210202.8067234210.294118 │
└───────┴────────────────┴───────────────┴───────────────────┴─────────────┘

Naturally, selectors work in grouping keys too, for even more convenience:

t.group_by(~s.numeric() | s.cols("year")).mutate(
    s.across(s.numeric() & ~s.cols("year"), dict(centered=_ - _.mean(), std=_.std()))
).select("species", s.endswith(("_centered", "_std")))
┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ species  bill_length_mm_centered  bill_depth_mm_centered  flipper_length_mm_centered  body_mass_g_centered  bill_length_mm_std  bill_depth_mm_std  flipper_length_mm_std  body_mass_g_std ┃
┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ stringfloat64float64float64float64float64float64float64float64         │
├─────────┼─────────────────────────┼────────────────────────┼────────────────────────────┼──────────────────────┼────────────────────┼───────────────────┼───────────────────────┼─────────────────┤
│ Adelie -3.18-1.33-10.6-202.51.9274621.0604516.310485372.015905 │
│ Adelie 0.52-0.53-4.6-202.51.9274621.0604516.310485372.015905 │
│ Adelie -1.181.677.447.51.9274621.0604516.310485372.015905 │
│ Adelie -1.580.571.4-152.51.9274621.0604516.310485372.015905 │
│ Adelie -0.58-0.33-4.6547.51.9274621.0604516.310485372.015905 │
│ Adelie 0.42-1.036.4-202.51.9274621.0604516.310485372.015905 │
│ Adelie 3.720.277.4297.51.9274621.0604516.310485372.015905 │
│ Adelie -0.78-0.631.4497.51.9274621.0604516.310485372.015905 │
│ Adelie 0.72-0.43-6.6-677.51.9274621.0604516.310485372.015905 │
│ Adelie 1.921.772.447.51.9274621.0604516.310485372.015905 │
│  │
└─────────┴─────────────────────────┴────────────────────────┴────────────────────────────┴──────────────────────┴────────────────────┴───────────────────┴───────────────────────┴─────────────────┘

Filtering Selectors

You can also express complex filters more concisely.

Let’s say we only want to keep rows where all the bill size z-score related columns’ absolute values are greater than 2.

t.drop("year").group_by("species").mutate(
    s.across(s.numeric(), dict(zscore=(_ - _.mean()) / _.std()))
).filter(s.if_all(s.startswith("bill") & s.endswith("_zscore"), _.abs() > 2))
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     bill_length_mm_zscore  bill_depth_mm_zscore  flipper_length_mm_zscore  body_mass_g_zscore ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ stringstringfloat64float64int64int64stringfloat64float64float64float64            │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────────────────────┼──────────────────────┼──────────────────────────┼────────────────────┤
│ Gentoo Biscoe   59.617.02306050male  3.9246212.0565081.9757991.932062 │
│ Gentoo Biscoe   55.917.02285600male  2.7240462.0565081.6673941.039411 │
│ Adelie Torgersen46.021.51944200male  2.7065392.5920710.6187601.088911 │
│ Adelie Dream    32.115.51883050female-2.512345-2.339505-0.298747-1.418906 │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────────────────────┴──────────────────────┴──────────────────────────┴────────────────────┘

Bonus: Generated SQL

The SQL for that last expression is pretty gnarly:

ibis.to_sql(
    t.drop("year")
    .group_by("species")
    .mutate(s.across(s.numeric(), dict(zscore=(_ - _.mean()) / _.std())))
    .filter(s.if_all(s.startswith("bill") & s.endswith("_zscore"), _.abs() > 2))
)
SELECT
  *
FROM (
  SELECT
    "t1"."species",
    "t1"."island",
    "t1"."bill_length_mm",
    "t1"."bill_depth_mm",
    "t1"."flipper_length_mm",
    "t1"."body_mass_g",
    "t1"."sex",
    (
      "t1"."bill_length_mm" - AVG("t1"."bill_length_mm") OVER (PARTITION BY "t1"."species" ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
    ) / STDDEV_SAMP("t1"."bill_length_mm") OVER (PARTITION BY "t1"."species" ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS "bill_length_mm_zscore",
    (
      "t1"."bill_depth_mm" - AVG("t1"."bill_depth_mm") OVER (PARTITION BY "t1"."species" ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
    ) / STDDEV_SAMP("t1"."bill_depth_mm") OVER (PARTITION BY "t1"."species" ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS "bill_depth_mm_zscore",
    (
      "t1"."flipper_length_mm" - AVG("t1"."flipper_length_mm") OVER (PARTITION BY "t1"."species" ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
    ) / STDDEV_SAMP("t1"."flipper_length_mm") OVER (PARTITION BY "t1"."species" ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS "flipper_length_mm_zscore",
    (
      "t1"."body_mass_g" - AVG("t1"."body_mass_g") OVER (PARTITION BY "t1"."species" ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
    ) / STDDEV_SAMP("t1"."body_mass_g") OVER (PARTITION BY "t1"."species" ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS "body_mass_g_zscore"
  FROM (
    SELECT
      "t0".*
      EXCLUDE ("year")
    FROM "penguins" AS "t0"
  ) AS "t1"
) AS "t2"
WHERE
  ABS("t2"."bill_length_mm_zscore") > 2 AND ABS("t2"."bill_depth_mm_zscore") > 2

Good thing you didn’t have to write that by hand!

Summary

This blog post illustrates the ability to apply computations to many columns at once and the power of ibis as a composable, expressive library for analytics.

Back to top