Testing dbplyr packages

Author: Gordon Shotwell
Published: August 4, 2022
Categories: R, Data Science
Keywords: tech

SQL generators are good actually

Data science students are often told that SQL is the most important tool to learn. This advice makes some sense given how ubiquitous SQL is in industry, but I think it’s a bit overrated. I’ve been working as a data scientist for eight or so years, and I’m now a product manager on a large data platform product, but I’m still not entirely sure I really “know SQL” in any meaningful way. I can sort of read and write simple SQL statements, but I almost always use tools like dbplyr instead of writing SQL directly. Initially I thought this was a big gap in my knowledge, and I felt that relying on SQL generators was a kind of crutch, but over time I’ve become convinced that this is a better way to work.

“Good SQL” is contextual

The motivation for learning SQL is that it will allow you to write effective queries. The idea is that if you know SQL very well you’ll be able to express your ideas more quickly, and your queries will execute faster. The problem is that databases differ in how they store data and in which dialect of SQL they use, so there isn’t really a general way to write an effective SQL query. A query which is optimal for a row-oriented database like Postgres may do poorly on a partitioned columnar database like Snowflake. Moreover, how a specific database is set up will dramatically change query execution speed.

This was driven home for me recently when I was writing wrapper functions for Socure’s new data platform. I asked the engineers on the project for advice about how to query the database, and they explained that it’s not really possible to predict which query will run faster without testing it. The reason is that modern databases do a lot of query planning and optimization to turn the query the user sends into something that can run efficiently, and these optimizers are complex enough that it’s hard to predict which types of queries will run fastest.

Consider the advice to use common table expressions (CTEs) instead of subqueries in a SQL statement. This is usually good advice because it leads to more readable SQL code, but it turns out that it can lead to 600% higher costs when run against some databases. There’s no single set of best practices for writing SQL because what counts as good or bad SQL depends on which specific database you’re querying.

As a result of all of this, most data engineers have more or less abandoned trying to modify user behaviour. Instead of asking the user to send just the right SQL, they modify the database to respond well to the queries the user is actually sending. For example, if a query is running too slowly, they may set up some kind of view, or change how the data is partitioned, to make the query fast. The idea is that the user should write a query which is understandable to them, and the database should take care of executing it efficiently.

Writing dbplyr wrappers

A big chunk of my job over the last few years has been writing database wrappers to perform common queries, and whenever possible I try to use dbplyr for these functions. In general I have found that this approach makes the functions easier to write and maintain, and the queries they generate typically perform as well as or better than SQL written by hand. There are a few main reasons why I prefer this pattern.

I have to write less code

The main reason I like using dbplyr functions is that I can leverage the rest of the dbplyr ecosystem to write less code. For example, let’s take a look at the nycflights13 data and imagine that we want to write a function to get the mean arrival time per destination airport. Here are the dplyr and SQL functions I would write:

library("DBI")
library("dplyr")
library("duckdb")
con <- dbConnect(duckdb::duckdb(dbdir = "flights.duckdb", read_only = TRUE))

mean_dest_time_dbplyr <- function(con) {
  tbl(con, "flights") |>
    group_by(dest) |> 
    summarize(mean_time = mean(arr_time))
}

mean_dest_time_sql <- function(con) {
  qry <- 'SELECT "dest", AVG("arr_time") AS "mean_time"
FROM "flights"
GROUP BY "dest"'
  DBI::dbGetQuery(con, qry)
}
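
If you want to double-check what dbplyr will send to the database, show_query() prints the generated SQL without executing anything. The output below is roughly what recent dbplyr versions emit for the duckdb backend; adding na.rm = TRUE just silences dbplyr’s reminder that SQL aggregate functions always drop missing values.

tbl(con, "flights") |>
  group_by(dest) |>
  summarize(mean_time = mean(arr_time, na.rm = TRUE)) |>
  show_query()
<SQL>
SELECT "dest", AVG("arr_time") AS "mean_time"
FROM "flights"
GROUP BY "dest"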

When you start out these functions are about the same, but what happens when you start getting requests from users to add arguments to the function? For example, maybe someone wants the function to let them filter by air time. If you’re patching SQL together you have to do something like this:

mean_dest_time_sql <- function(min_time, max_time, con) {
  where_clause <- paste0('WHERE ("air_time" >= ', 
                         min_time, 
                         ') AND ("air_time" <= ', 
                         max_time, 
                         ')')
  
  qry <- paste0(c('SELECT "dest", AVG("arr_time") AS "mean_time"',
                  'FROM "flights"',
                  where_clause,
                  'GROUP BY "dest"'
                  ),
                collapse = "\n"
  )
  DBI::dbGetQuery(con, qry)
}

Writing functions like this is annoying because you spend a lot of time pasting queries together, but the bigger issue is that users will never stop asking for functionality. Accommodating these requests will lead to a bloated, complex function with a lot of arguments. Using dbplyr relieves us of this complication by letting the user pass any filter they want to our wrapper function:

my_tbl <- tbl(con, "flights") 
mean_dest_time_dbplyr <- function(tbl = my_tbl) {
  tbl |>
    group_by(dest) |> 
    summarize(mean_time = mean(arr_time)) 
}

# They can filter by arr_time
my_tbl |> 
  filter(arr_time >= 100,
         arr_time <= 200) |> 
  mean_dest_time_dbplyr()

# But also by other stuff!

my_tbl |> 
  filter(month == 2) |> 
  mean_dest_time_dbplyr() 

This pattern lets you leverage all of the dbplyr infrastructure, which means that you have less code to maintain and less user education to do.

Composable database wrappers

The second main reason to use dbplyr is that it lets you write composable SQL functions. One of the great things about dbplyr is that it is smart enough to generate correct SQL regardless of the order in which you call the functions. For example, putting the filter and the mutate in different places will generate different SQL, but both queries will work:

 tbl(con, "flights")  |> 
  filter(month == 1) |> 
  mutate(long_flight = ifelse(air_time > 100, "long", "short")) |> 
  show_query()
<SQL>
SELECT
  *,
  CASE WHEN ("air_time" > 100.0) THEN 'long' WHEN NOT ("air_time" > 100.0) THEN 'short' END AS "long_flight"
FROM "flights"
WHERE ("month" = 1.0)
 tbl(con, "flights")  |> 
  mutate(long_flight = ifelse(air_time > 100, "long", "short")) |> 
  filter(month == 1) |> 
  show_query()
<SQL>
SELECT *
FROM (
  SELECT
    *,
    CASE WHEN ("air_time" > 100.0) THEN 'long' WHEN NOT ("air_time" > 100.0) THEN 'short' END AS "long_flight"
  FROM "flights"
) "q01"
WHERE ("month" = 1.0)

Composable functions are amazing because they let the user build complex expressions out of simple, easy-to-understand components. For example, let’s say that we wrote a function by_day that groups the flights data by day:

by_day <- function(tbl) {
  tbl |> 
    mutate(date = paste0(year, "-", month, "-", day)) |> 
    group_by(date)
}

Since this function can be stacked along with other dbplyr functions it gives the user a lot of flexibility. They can stack it with other dplyr verbs in an arbitrary order, or even use it on an entirely different table, and everything will still work.
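
For instance, here is a sketch of how a user might combine it with other pieces. It assumes the connection from earlier and that the nycflights13 weather table (which also has year, month, and day columns) has been loaded into the same database.

# by_day() can sit anywhere in a pipeline of dplyr verbs
tbl(con, "flights") |>
  filter(origin == "JFK") |>
  by_day() |>
  summarize(mean_delay = mean(dep_delay))

# ...or be pointed at an entirely different table
tbl(con, "weather") |>
  by_day() |>
  summarize(mean_temp = mean(temp))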

SQL is, generally speaking, not composable. You can’t write small fragments of queries and easily insert them into other queries, so your wrapper tends to need to do more work. When you write wrappers with SQL string construction, you end up trying to build a comprehensive function that limits the user to the things you had in mind when you wrote it. Additionally, you can’t easily share fragments across functions, which means that you end up with repetitive code.

Backend-agnostic functions

Finally, one of the benefits of building dbplyr database wrappers is that your functions will run on a variety of data sources. For example, a common pattern at my job is pulling a large set of data into an Apache Arrow Dataset for further analysis. Functions built around dbplyr will tend to work on these datasets without modification, which reduces the number of things that the user has to learn or remember.
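
As a rough sketch (the Parquet path is hypothetical, and this assumes the flights data has been exported with arrow), the wrapper from earlier runs on an Arrow Dataset unchanged because arrow implements the dplyr verbs it uses:

library(arrow)

# Open a directory of Parquet files as an Arrow Dataset
flights_ds <- open_dataset("data/flights_parquet/")

# The same wrapper works without modification
flights_ds |>
  filter(month == 2) |>
  mean_dest_time_dbplyr() |>
  collect()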

Testing dbplyr functions

I’ve been writing dbplyr wrappers for some time, but I’ve only recently come up with a testing pattern which I really like. There are four main things that I want when testing database functions:

  1. Tests should run without access to the actual database

  2. They should allow me to test the output R object

  3. They should include SQL assertions that I can use to communicate with the database owner

  4. I don’t want to regenerate mocks every time I change the function

Previously I would test database functions with dittodb, which allows you to record mocks for particular SQL queries and cache their results. This accomplished goals 1-3, but over time I found the upkeep difficult. Because dittodb mocks the particular query, you end up with a lot of mocks, and you need to regenerate them whenever the function changes.

My new approach is to record a mock of a few records from the whole database and store it as an on-disk duckdb database. In the test files I point my functions at this mock database and run two types of tests:

  1. Test that the function produces the right output

  2. Test that the function generates the expected SQL
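
Creating the mock is a one-time step. The sketch below shows one way to do it; the file name mock.duckdb, the sample size, and the use of dplyr::slice_sample() are all illustrative, and whatever sample you store determines the expected values in the tests that follow.

library(DBI)
library(duckdb)
library(nycflights13)

# Write a small sample of the real data to an on-disk duckdb file
mock_con <- dbConnect(duckdb::duckdb(dbdir = "mock.duckdb"))
dbWriteTable(mock_con, "flights", dplyr::slice_sample(flights, n = 500))
dbDisconnect(mock_con, shutdown = TRUE)

# In the test files, reconnect read-only and point the functions at it
mock_con <- dbConnect(duckdb::duckdb(dbdir = "mock.duckdb", read_only = TRUE))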

For example, I would test that the by_day function produces the right output with a test like this:

library(testthat)
test_that("by_day function generates the right output", {
  day_counts <- tbl(mock_con, "flights") |> 
    by_day() |> 
    count() |> 
    collect(n = 5)
  
  expect_s3_class(day_counts, "data.frame")
  expect_equal(dim(day_counts), c(5, 2))
  expect_equal(day_counts$date,
               c("2013-6-26", "2013-6-27", "2013-6-28", "2013-6-29", "2013-6-30")
               )
})

This gives future developers a clear understanding of what this function is supposed to do, which lets them make changes with the confidence that they won’t violate user expectations. I also want to capture the SQL that the function generates so that I can use it to communicate with the people who maintain the database.

test_that("by_day function generates the right SQL", {
  expect_snapshot({
    tbl(mock_con, "flights") |> 
      by_day() |> 
      count() |> 
      show_query()
  })
})

When you run the test suite for the first time, this records a snapshot of the expected SQL query, which future runs of the test are checked against:

Code
  show_query(count(by_day(tbl(mock_con, "flights"))))
Output
  <SQL>
  SELECT "date", COUNT(*) AS "n"
  FROM (
    SELECT *, CONCAT_WS('', "year", '-', "month", '-', "day") AS "date"
    FROM "flights"
  ) "q01"
  GROUP BY "date"

This is an extremely useful test fixture for two reasons. First, it gives you something which can be easily shared with the database team. For example, if you notice odd results when running the query against the actual database, you can send them the specific query which used to work but now fails. You can even share this fixture with the database team to use in their tests. Second, it lets you lock down the expected query. This is useful if you do find out that some types of queries run better on your particular database and want to ensure that future developers don’t introduce bad query patterns.

Conclusion

No programming framework is comprehensive, and there are plenty of times when it’s important to move past dbplyr and optimize the actual SQL that your functions generate. In general, though, I’ve found that starting with dbplyr saves me time and energy, and produces a better experience for the people who use my functions. Nine times out of ten, dbplyr writes better SQL than I do.