Procedural SQL

Created by Ben Deverman, Modified on Wed, 6 Mar, 2024 at 1:21 AM by Alsabana Sahubarali

Sometimes, you’ll find yourself wanting to complete a task in BigQuery that requires running several different queries in a row or needs more functionality than SQL natively provides. In many cases, we can accomplish these tasks by using procedural SQL, an expanded version of regular SQL. Procedural SQL lets us chain different queries together, and it also lets us use some additional attributes such as IF statements, loops, having a query change course if an error occurs, and setting variables. Below are some examples of tasks you can perform with procedural SQL. You can also see Google’s full documentation here, with many additional options to explore. Google has implemented some safe gaps in what you can and cannot do with procedural SQL to avoid security risks - you can read more about best practices for security in the documentation!

Cost

BigQuery typically won’t be able to give you an estimate of what your procedural SQL statement will cost to run before you run it, so be mindful of the work you’re doing, especially if it’s a large query or a long loop. If you would like a partial estimate for the script, highlight just the SQL queries with some sample values and see what BigQuery estimates for those on their own. You can also create a setting in BigQuery that blocks queries over a set limit of bytes - see here for more information and other cost-cutting recommendations from Google. Additionally, your project will also have a set monthly budget cap that CTA monitors and will alert you about if you’ve gone over or are close to doing so. Your administrator can answer any questions about the set budget cap.

Creating and Dropping Tables and Views

Often, you’ll need to drop and create tables in tandem as part of regular data management. You can do this in one query with procedural SQL.

For example, if we wanted to create a test table and then drop a test table in one query, we would run:

BEGIN
  CREATE TABLE <yourtablehere> (<your column> columntype, <your column> columntype)
;
  DROP TABLE <yourtablehere>;
END;

If we wanted to make it even simpler for ourselves, we can also set a default variable for project and dataset instead of typing it out over and over again. To do that, we would run:

SET @@dataset_project_id = 'MyProject';
SET @@dataset_id = 'MyDataset';

BEGIN
  CREATE TABLE <yourtablehere> (<yourcolumnnamehere> <columndatatypehere>, <yourcolumnnametwohere> <columntwodatatypehere>);
  DROP TABLE <yourtablehere>;
END;

Note that when we set a default variable, it will apply to every instance of when we call that variable in the query (e.g., every time we create a table in this query, it will be in the specified project and dataset) so make sure that you do want that to be your default.

Loops in queries

In our above example, we are relying on one default variable to fill in every project name and ID. But sometimes, we’ll have a list of options for what that variable can be, and we’ll want to run queries based on that list.

To make this happen without having to write out a lot of SQL manually, we can use a loop. Loops are a common concept in programming, and you can also utilize them in SQL procedural language. A loop, much like it sounds, allows you to run through a list of predefined variables and use each of them in a query. For example, if we had a master table that combined data from several different organizations, and we wanted to split that up to run a query against the information for each organization's data separately, we could run:

DECLARE looplist ARRAY<STRING> DEFAULT ["'org_id_1'","'org_id_2'", "'org_id_3'"];

FOR x IN

 (SELECT * FROM UNNEST(looplist) as looplist2)

DO

EXECUTE IMMEDIATE CONCAT('SELECT <yourcolumnofinterestnamehere> FROM <yourtablenamehere> WHERE <yourorganizationidcolumnnamehere> = ',x.looplist2);

END FOR;

This code is doing a few different things. First, at the top, the DECLARE statement is naming the list of variables to loop through between the brackets. You can change these to whatever you need for your loop; in this case, since we want to run the same query for different organizations in our data, our variables are the organization names.

After the declare statement, we have a FOR statement. The FOR statement creates and runs the loop in several steps:

First, it names the list of variables to loop through in the FOR statement - ours is named looplist2 - and defines that list as a SELECT * (which means everything) from the variable we already declared, looplist.
From there, it starts running the loop with DO. Under DO is the actual SQL you want to run; in this case, we’re running a query. Our SELECT statement includes the name of the list to loop through - looplist2 - everywhere we otherwise would have had to type out each organization’s name. So here, we’re running a query that contains data from a master table limited to just where the organization name column equals the organization’s name in the loop list. BigQuery will run this query multiple times - one for each organization id in the loop.
Lastly, we have to tell the FOR statement to stop running, which we do with END FOR.

There are many other applications of this. For example, if you want to create views in multiple projects, you could loop through a list of project IDs, or if you need to deprecate a series of old tables, you could loop through a list of criteria (though we recommend caution when using loops or other automated approaches to dropping tables!). Note that for loops that require using project ID, dataset name or table name as the variable to loop through, you'll need to take some extra steps beyond the sample code above because those applications are considered dynamic SQL!

Using IF statements in queries

SQL procedural language allows you to use IF statements in queries, which has a number of useful applications. If you want a query to run if certain conditions are met and do something else if they are not, or if you want to check your data for a specific condition, you can do that with an IF statement in your query!

For example, if we wanted to check if a table had data from today loaded yet, we could run the below query:

DECLARE target_date DATE;

SET target_date = CURRENT_DATE;

IF EXISTS (SELECT 1 FROM <yourtablehere> WHERE target_date = <yourdatecolumnhere>) THEN SELECT ('found data from today');
ELSEIF EXISTS (SELECT 1 FROM <yourtablehere> WHERE <yourdatecolumnhere>  IS NOT NULL)THEN
SELECT ('found data, but not from today');
ELSE SELECT ('error - no data');
END IF;

This code is doing a few different things. First, we DECLARE and SET the variable we’ll be using in our query - today’s date, or CURRENT_DATE. Then, our IF statement follows several steps:

First, it checks if there is any data in the table that matches today’s date. If there is, then it returns ‘found data from today’.
However, if it doesn’t find any data in the table that matches today’s date, it will move on to the ELSEIF. This checks if there are any non-null dates in the data column at all. If it does find dates, it returns ‘found data, but not from today’.
Lastly, there is an ELSE statement that runs if neither of the two above conditions are met. Since the only scenario in which this should be possible is an entirely NULL column of dates, it returns ‘error - no data’ so that we know something is wrong!

You can include more than one ELSEIF if you need to check more conditions. ELSE should always be the last part of your statement before the END IF. If you don’t include an ELSE, you won’t get an output for that outcome, so we recommend including ELSE to avoid ambiguity.

To fully leverage IF statements, explore connecting them to more complex SQL, such as table creation or queries!