Data Profiling with Power Query

Did you know that you can use Power Query as a data profiling tool? While the great majority of Power Query (M) functions are centered on accessing data sources and implementing transformations, a few functions exist to help analyze the structure and data characteristics of source tables. This profiling capability can inform the query developer (or data modeler, or both) of data integrity and quality issues that may need to be addressed, either when retrieving from these sources with Power Query or further upstream in the architecture (e.g. Azure Data Factory).

In this blog post I’ll present an example of leveraging the Table.Schema() and Table.Profile() Power Query functions to help assess the data quality and structure of three dimension tables in a SQL Server database. As these functions accept table values as their input parameters, you could also use them against various other data sources.

Table Schema and Profile

Before we get into the example, let's first take a quick look at the two functions individually, as you could simply call them at design time on an ad hoc basis:

Table.Schema

The Table.Schema function returns 16 columns of metadata, with each row reflecting a distinct column/field of the source table. In the following example I'm only displaying several rows (columns in the Product table) and nine (9) of the more commonly used metadata columns, such as NativeTypeName, IsNullable, and NumericScale:

Table.Schema Sample: Product Table in SQL Server

Per the image, you could use this function to quickly identify incorrect data types (e.g. numbers stored as text) or numbers with an unnecessarily high precision. For example, though not visible in the image, the Weight column is stored as a float data type in SQL Server and thus, without any transformation, will ultimately be a Decimal Number when loaded to the data model. To improve compression and query performance, it might make sense to change the data type from a decimal number to a fixed decimal number type in Power Query.
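
As a minimal sketch of that change (assuming a dimProduct query already references the Product table; Currency.Type is the M type that loads as a Fixed Decimal Number):

//Hypothetical step: convert Weight from Decimal Number to Fixed Decimal Number
WeightFixedDecimal = Table.TransformColumnTypes(dimProduct, {{"Weight", Currency.Type}})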

More technical details on using the function as part of a more robust profiling query are provided later in this post, but to get started you can simply pass the source table to the function (e.g. Table.Schema(dimProduct)).
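
For example, a quick ad hoc schema query against hypothetical server, database, and table names might look like the following sketch, which selects nine of the more commonly used metadata columns:

let
    //Hypothetical source names; replace with your own server and database
    Source = Sql.Database("MyServer", "MyDatabase"),
    dimProduct = Source{[Schema = "dbo", Item = "DimProduct"]}[Data],
    ProductSchema = Table.Schema(dimProduct),
    CommonColumns = Table.SelectColumns(ProductSchema,
        {"Name", "Position", "TypeName", "Kind", "IsNullable",
        "NumericPrecision", "NumericScale", "MaxLength", "NativeTypeName"})
in
    CommonColumns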

Table.Profile

The Table.Profile function likewise returns one row for each column in the source table, but returns seven columns (in addition to the column name) that describe the contents or values of those columns:

Table.Profile Sample: Product Table in SQL Server

Per the image above, the Product table has 606 rows (Count = 606) and there are 4 distinct values for the Class column and 211 null values on DealerPrice. The Min, Max, Average, and Standard Deviation columns also give a good idea of the distribution of the values in the columns. These details might guide a query developer to advise business users and/or Data Warehouse stakeholders that certain columns are currently unfit for inclusion in the new BI solution. Additionally, the distinct count metadata gives the modeler an early idea of performance challenges with modeling relationships between fact tables and large dimension tables with millions of unique rows.
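
As a quick ad hoc illustration (again with hypothetical server and database names), you could sort the profile output by null counts so that the most sparsely populated columns, such as DealerPrice above, surface first:

let
    Source = Sql.Database("MyServer", "MyDatabase"),
    dimProduct = Source{[Schema = "dbo", Item = "DimProduct"]}[Data],
    ProductProfile = Table.Profile(dimProduct),
    //Columns with the highest null counts appear at the top
    SortedByNulls = Table.Sort(ProductProfile, {{"NullCount", Order.Descending}})
in
    SortedByNulls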

There are alternative profiling options, such as the Data Profiling Task in SQL Server Integration Services (SSIS) or pre-built queries to return this data, but Table.Profile() is much more lightweight: you only need access to the source and an updated version of the Power Query (M) engine in a common client application like Power BI Desktop or Excel.

A Quick Step Back: Why Profile?

A few weeks ago I blogged (twice actually) about an Excel-based semantic model profiling tool – Tabular Model Schema Reference (v1.1). Similar to that simple tool, which I've found useful on multiple projects thus far, the idea with data profiling is to have quick and concrete technical metadata available to guide design and development activities.

Note: You may also want to look at the data profiling features built into Azure Data Catalog if you haven't already.

Here are a few use cases for data profiling ranging from simple to a bit more elaborate:

  1. Requirements gathering with business users
    • Walk through the columns currently supported in dimension tables and obtain business user input on which columns they use or need and what transformations, if any, they're already applying to these columns.
    • The business may not be aware that certain columns even exist or they may be persuaded, given existing null values or other profiling results, to keep certain columns out of the solution.
  2. Get up to speed quickly
    • If you're new to a data source, such as when you're a consultant or contractor on a new project, or if a data source is new to the IT team, schema and profiling query results can supplement other metadata (e.g. entity-relationship diagrams) to help you gain an understanding of the source.
  3. Comparing data layers and testing data processing tasks 
    • Typically a data warehouse has multiple layers, often separated by schemas, such as an Operational Data Store (ODS), a staging layer, and the final presentation layer of dimension and fact tables.
      • The schema and profiling queries could be useful in comparing the differences between these layers, such as the nulls removed from columns, the columns added, or the data type changes implemented.
      • One related option is to compare the schema and profile of a dimension or fact table to the schema and profile of the SQL view object for that table which is utilized by data models. In many cases, the SQL view object itself implements significant business logic representing a variance from the data warehouse tables; a minimal sketch of this comparison follows this list.
  4. Data Quality Analysis
    • It wouldn’t be that much work to leverage data alerts via Power BI dashboards or just conditional formatting to notify stakeholders that a row count variance now exists between two tables or that null values have been loaded to the production layer of the environment.
    • Conditional formatting and a DAX measure are used to help call attention to null values in the example below.
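
Here is a minimal sketch of the table-versus-view profile comparison referenced in the third use case above. The DimProduct and vDimProduct object names are hypothetical, as is the choice to flag null count variances; you would adjust the compared columns to whichever profile metrics matter for your scenario:

let
    Source = Sql.Database("MyServer", "MyDatabase"),
    TableProfile = Table.Profile(Source{[Schema = "dbo", Item = "DimProduct"]}[Data]),
    ViewProfile = Table.Profile(Source{[Schema = "dbo", Item = "vDimProduct"]}[Data]),
    //Rename the view's profile columns to avoid name collisions in the join
    ViewRenamed = Table.RenameColumns(
        Table.SelectColumns(ViewProfile, {"Column", "Count", "NullCount"}),
        {{"Column", "ViewColumn"}, {"Count", "ViewCount"}, {"NullCount", "ViewNullCount"}}),
    Joined = Table.Join(TableProfile, "Column", ViewRenamed, "ViewColumn", JoinKind.Inner),
    //Columns whose null counts differ between the warehouse table and its view
    NullVariances = Table.SelectRows(Joined, each [NullCount] <> [ViewNullCount])
in
    NullVariances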

Data Profiling Example

I'll show you an end result example first and then describe the development. You can see in the following link and image that a data integration process has retrieved schema and profiling metadata for three dimension tables (Customer, Employee, and Product):

Publish to Web Example Report

In the following image, the data profiling report is filtered to one of three dimension tables (Customer) and only for numeric and text data types (kinds):

Data Profile Report Sample

Only a single table, composed of the columns from the Table.Schema() and Table.Profile() functions (for each of the three tables), is loaded to the data model in Power BI Desktop for this example. You could, of course, split the profiling and schema columns (and the dimension tables) into separate queries, or use an alternative application that supports Power Query such as Excel, Analysis Services, PowerApps, and, soon, more (Microsoft Flow, dataflows).

For example, you could load a query for each table being profiled to a separate worksheet in Excel. With relatively small dimension tables, though, it makes sense to me to consolidate everything into one profiling table.

Data Profiling Query

Essentially the following query joins dimension-specific schema and profiling tables, lightly extends these tables, and then unions these merged tables before applying final transformations to the consolidated table.

let
    //Source Tables
    ProductTbl = SqlServerObjects{[Schema = "dbo", Item = "DimProduct"]}[Data],
    CustomerTbl = SqlServerObjects{[Schema = "dbo", Item = "DimCustomer"]}[Data],
    EmployeeTbl = SqlServerObjects{[Schema = "dbo", Item = "DimEmployee"]}[Data],
    //Product Table Profile
    ProductTblSchema = Table.Schema(ProductTbl),
    ProductTblProfile = Table.Profile(ProductTbl),
    ProductTblJoin = Table.Join(ProductTblSchema, "Name", ProductTblProfile, "Column", JoinKind.Inner),
    ProductTableName = Table.AddColumn(ProductTblJoin, "Table", each "Product", type text),
    //Customer Table Profile
    CustomerTblSchema = Table.Schema(CustomerTbl),
    CustomerTblProfile = Table.Profile(CustomerTbl),
    CustomerTblJoin = Table.Join(CustomerTblSchema, "Name", CustomerTblProfile, "Column", JoinKind.Inner),
    CustomerTableName = Table.AddColumn(CustomerTblJoin, "Table", each "Customer", type text),
    //Employee Table Profile
    EmployeeTblSchema = Table.Schema(EmployeeTbl),
    EmployeeTblProfile = Table.Profile(EmployeeTbl),
    EmployeeTblJoin = Table.Join(EmployeeTblSchema, "Name", EmployeeTblProfile, "Column", JoinKind.Inner),
    EmployeeTableName = Table.AddColumn(EmployeeTblJoin, "Table", each "Employee", type text),
    //Combine Tables
    DWDimProfileTbl = Table.Combine({ProductTableName, CustomerTableName, EmployeeTableName}),
    //Enhance Combined Table
    RemoveValueTypes = Table.SelectRows(DWDimProfileTbl, each
        [TypeName] <> "Table.Type" and [TypeName] <> "Record.Type"),
    ProfileTypeChanges = Table.TransformColumnTypes(RemoveValueTypes,
        {
            {"Count", Int64.Type}, {"NullCount", Int64.Type},
            {"DistinctCount", Int64.Type}
        })
in
    ProfileTypeChanges

The SqlServerObjects table variable referenced by the three source table variables is nothing more than the following:

= Sql.Database(Server, Database)

The Server and Database inputs for this staging query are provided via parameters as described in previous posts. The full Power BI Desktop (.PBIX) file is available for download from a new Insight Quest repo on GitHub.
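
For reference, here is a minimal sketch of that staging pattern, with hypothetical literal values standing in for the actual query parameters:

let
    //In the PBIX these two values come from query parameters, not literals
    Server = "MyServer\MyInstance",
    Database = "AdventureWorksDW2017",
    SqlServerObjects = Sql.Database(Server, Database)
in
    SqlServerObjects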

The Table.Join() function brings together the Schema and Profile columns, and the Table.Combine() function unions together the three different tables. The JoinKind.Inner argument is optional (it's the default), but I prefer to be explicit with join kinds across query languages when possible.

*To refresh the query and report, you would need to revise the queries and parameters to use your own SQL Server database and tables.

The PBIX file also includes a few DAX measures but the only one with any complexity is the Row Count:

Row Count =
IF (
    HASONEVALUE ( DWDimProfiling[Table] ),
    MAX ( 'DWDimProfiling'[Count] ),
    SUMX (
        VALUES ( 'DWDimProfiling'[Table] ),
        CALCULATE ( MAX ( 'DWDimProfiling'[Count] ) )
    )
)

The measure simply returns the max count for whichever single table is filtered on. However, if multiple tables are in the filter context, SUMX() iterates over the distinct tables and computes the max count for each one. Wrapping the expression passed to SUMX() in CALCULATE() invokes context transition, so each max count is evaluated in the filter context of that individual table.

Wrapping Up

More tables and data, and much richer visualization and analytics, could always be added, but the core data profiling, access, and integration via Power Query was the topic of this post. Again, in many cases you may not need your own dedicated profiling report/tool such as this example; you can always use the out-of-the-box M functions on an ad hoc basis. However, as rapidly as data sources and BI projects move and change, it might not be a bad idea to invest a little time into something along these lines.

Next week’s blog will likely either be on Dataflows or in-model currency conversion. Be sure to subscribe if you want to be notified of future posts.

SQL Saturday Presentation

Just a quick note that I'll be speaking on Power BI at SQL Saturday #767 near Lincoln, Nebraska, on October 27, 2018.

Comments

  1. I really love your articles; they always cover something that is not available on any other blog.
    Thanks for your great contribution.
    Also, congrats on your new book. It is next on my reading list; hopefully I will be buying it as soon as I am finished with the one I am reading these days. Hope to find some great content in the book 🙂


  2. Hi Brett,

    Great post and thanks for sharing the PBIX file! I found your M query needs the tables to be specified individually, so based on your base query I modified it a little by using a table function for dynamic result rendering. Let me know if you want me to share that in your GitHub entry.


    1. Hi Annie,

      Thanks for your note and I'm glad you found it useful. Yes, feel welcome to upload a PBIX file containing the table function to GitHub.
      Regards,
      Brett


    2. Hello, I am also looking for a dynamic solution instead of manually specifying each and every table. If you could share the table function, it would help me a lot.


  3. Hi,

    This is really amazing but truly beyond my skill set. If you can answer one question for me, I would truly appreciate it…I am invoking a custom function using Table.Profile and I only want to perform a table profile if the district name = "Richmond". Can you tell me how to modify the M code to look at the hardware table and give me a table profile only for rows where the district name equals "Richmond"?

    Here is my present code, but it gives me results for the whole table. Any assistance is appreciated.

    let
        Source = Query1(v_dom_cur_hardware, null)
    in
        Source
