This post provides an example of a stored procedure which A) identifies the last two partitions of an Azure Synapse Analytics SQL pool table (which uses the columnstore index (default)) and B) rebuilds the index for these two partitions. Additionally, a sample PowerShell-based Azure Automation runbook is included for scheduling the execution of this procedure.
This post follows up on the previous post regarding a Power BI template to be used to analyze the health or quality of a columnstore index. For example, the template shared may help you find that the last one or two partitions such as partition numbers 39 and 40 out of 40 partitions may have many open (uncompressed) and/or not-optimized rowgroups. The cause of these low quality partitions could be that recent and ongoing data processing events are impacting these partitions (inserts,updates). Perhaps partitions 39 and 40 refer to the current and prior month for example.
Why Automate Index Rebuilds?
Large fact tables in data warehouses containing hundreds of millions or billions of rows are processed incrementally. For example, a nightly data warehouse load/processing job may impact only the last 15 days of data whereas older data does not change. Queries accessing the partitions for this recent data, such as Power BI in DirectQuery storage mode, will perform better if the indexes for the given partitions are rebuilt/optimized. Additionally, automating the index rebuilds avoids the manual work of periodically finding partitions to rebuild and running rebuild scripts.
*The relatively rare scenarios in which all historical data or large parts of it need to be reloaded or processed would still require a one-time rebuild of all the partitions impacted.
Index Partition Rebuild Procedure
A (.sql) file containing a sample stored procedure as well as a (.ps1) file containing the sample Azure Automation runbook script are available on my GitHub repository.
The stored procedure contains the following logic:
- Retrieve the last (maximum) partition number and the one prior (-1) containing rowgroups and store these integer values as variables
- A query against the sys.pdw_nodes_column_store_row_groups catalog view and other system views is filtered for a particular schema and table name is used to retrieve these values.
- Retrieving from the row groups system view ensures that the partitions retrieved have row groups (data)
- If you were to query sys.partitions, you would retrieve the latest/highest partition numbers for the table but these may reflect future time frames and thus not contain data
- Unicode string variables are defined and set to contain two ALTER INDEX SQL statements, one for each partition
- Notice that the integer variables containing the partition numbers is cast to nvarchar(3) within the concatenation of three strings. Without this data type conversion, there would be a type conversion error.
- A unicode string is required (N’…..’) for the sp_executesql system procedure
- The sp_executesql system procedure is used to execute the two ALTER INDEX SQL statements
Why dynamic SQL?
The ALTER INDEX statement requires the partition number to be a positive integer literal. Therefore, you can’t simply pass in a variable (or an expression that casts this variable to a different type) to an ALTER INDEX statement. The dynamic SQL approach addresses this requirement by building up an ALTER INDEX SQL statement containing the partition number as though it was a normal, hand-entered SQL statement.
Additionally, dynamic SQL leaves open the potential for greater reusability as described in the next section.
More dynamic and reusable?
In the example shared, the name of the schema and table are hard coded into the stored procedure. You could make the procedure more flexible and reusable by parameterizing the names of the table and schema. With these two parameters defined and required as part of the procedure, you could have several small scripts which simply call the procedure with different parameter values reflecting different tables rather than having multiple procedures with each procedure dedicated to one table.
The decision (one or many procedures) may be driven by how many large fact tables you need this for. If you only need to automate the index rebuilds for 2-3 tables, procedures dedicated to a single table or one one procedure which rebuilds partitions for all 2-3 tables may be sufficient.
Beyond the partition numbers containing rowgroups, you could modify the system view query to retrieve the partition numbers containing the highest percentage of open or not-optimized rowgroups. With this more advanced logic, you’d know that your index rebuilds are constantly targeting partitions that would benefit the most from a rebuild. You could go further with conditional (IF..ELSE) logic that first determines if a partition actually needs to be rebuilt and only then rebuild the partition if this is the case.
Azure Automation Runbook
There are several tools available for scheduling and orchestrating the SQL stored procedure. Probably the most common tool in Azure Synapse environments is Azure Data Factory (ADF). Within an ADF pipeline, the Store Procedure Activity could be executed following regular data loading and processing activities. If ADF pipelines are used to load the Syanpse SQL pool, this integrated approach would generally be preferable.
If data warehousing and other Azure environment processes are being handled via Azure Automation runbooks, scheduling runbooks which call the SQL procedure may make sense too. The example for this post contains the following three steps:
- Retrieve the values for four automation variables
- This includes the server, database, SQL user and password
- Two PowerShell SqlClient objects are defined – a SQL connection and a SQL command
- The connection object is essentially just a standard connection string that references the variables from Step #1
- The command object is based on both a SQLQuery variable containing the EXECUTE statement as well the connection object.
- Setting the CommandTimeout property to 0 ensures the command will wait to execute indefinitely. The default is 30 seconds.
- The connection is opened and the SQL command is executed
- With the stored procedure executed, the connection is closed
Thanks for visiting Insight Quest. If automatically rebuilding the indexes for certain partitions of your tables is relevant to you, I hope this post was useful. With healthy/optimized indexes being automatically maintained, users can get more value out of the data via greater performance and teams responsible for maintaining the data warehouse can focus on other value added activities.
If interested in email notification of future blog posts, you can click the ‘Follow’ button on the Subscribe widget.