Data Model Anti-Patterns

I’ll be leading a two-hour a webinar on Data Modeling at the Power Up! virtual event on June 13th. Most of this session will focus on the ‘how’ and ‘why’ of building effective Power BI data models but I’ll also describe a number of anti-patterns to avoid.

I plan to walk through this list early in the webinar since, in many cases, simply avoiding these practices can result in a functional, usable model. Increased scale and analytical complexity dictates the need for more specialized skills (DAX, M) and design techniques.

Data Model Anti-Patterns

By ‘anti-pattern’, I mean a common data model design or practice that fails to deliver top objectives of analytical data models including usability, version control, scalability, performance, manageability, and availability.

  1.  Report Scoped
    • A Power BI dataset can and generally should should support multiple Power BI reports via live connection to the Power BI service.
      • Just like an Analysis Services model, the re-use of a Power BI dataset ensures version control across the reports and only a single data model needs to be refreshed.
    • Beginners with Power BI often create entirely new and separate datasets for each new Power BI report thus requiring separate data refresh processes and version control issues.
  2. Fragile
    • This encompasses a number of design decisions (or lack thereof) including directly retrieving from source table objects, hard coded dependencies (such as server, database, and column names), and insufficient testing.
      • For example, a revised column name is a source table could cause the data model refresh to fail.
      • As another example, a DAX measure doesn’t return accurate or even logical results in certain filter contexts.
      • A good read related to fragile data models is the series of pitfalls with M/Power Query by Gil Raviv
  3. Limited Usability
    • A user-friendly interface is one of the main reasons to build a model yet you still find production models with system/source table names (e.g. Dim_Product) and surrogate key columns exposed to the user.
      • This design is somewhat understandable if the model is only used by those familiar with the source system but generally the intent is to enable self-service reporting by business users.
    • Beyond arcane system names, failing to implement hierarchies and neglecting to hide attributes that drive default sort orders or fact columns which are supported by measures also contribute to a less usable model.
      • Why should the user have to see or navigate around that?
      • How difficult is it for the user to build a simple visual?
        • If hunting through the Fields list and many clicks are involved even for simple report visuals it’s time to consider some design changes.
  4. Data Silo
    • The data modeler made no effort to utilize a corporate data warehouse source or identify gaps between requirements and this standard source.
    • Alternatively, the data modeler utilized the corporate data warehouse but, without any conversation, implemented data transformations or logic to stray from these source definitions.
      • As a consequence, reports and dashboards built against this data model introduce version control issues with other reports and dashboards.
  5. Inappropriate Storage Mode
    • Rather than thinking through the pros and cons of the ‘DirectQuery vs. Import’ decision, the data modeler hastily (or naively) chose one of the two options.
    • Let’s say the organization has invested in a powerful data warehouse database (e.g. columnstore index) that fully supports all the requirements of the model yet the modeler chose the default import mode.
      • As a consequence, a separate data refresh is now required and the model’s scalability is limited in the absence of Power BI Premium capacity.
    • Let’s say that some data transformations and complex DAX calculations are required yet the modeler chose DirectQuery mode.
      • As a consequence, the required logic generates complex and inefficient query plans resulting in inadequate performance for users.
  6. No DAX Measures
    • The model author relies on implicit measures via the Default Summarization property of columns.
      • As a consequence, the analytical power of the model is severely limited and report authors are required to manually select the aggregation logic (ie Sum, Average) to apply against the column over and over again.
    • It’s difficult for experienced model authors or readers of this blog to imagine models without measures but less experienced modelers often see DAX as unnecessary and prefer a combination of graphical user interface (GUI) features including implicit measures, Quick Measures, and Show Value as.
      • It’s best to give up on the dependency on GUI tools (both DAX and M) early on – they will be insufficient at some point.
  7. Inefficient DAX Measures
    • The model author has created DAX measures and they return the expected or correct results. However, the measure expressions are unnecessarily complex and inefficient resulting in poor performance.
    • Common examples of this include the use of FILTER() when a filter parameter to CALCULATE() or CALCULATETABLE() would achieve the same results.
      • A related example is the use of FILTER(), SUMX(), and other iterating functions when it’s simply not necessary to iterate over the rows of a table.
  8. Unnecessary columns
    • Columns which serve no purpose for the model are often loaded into the model for no other reason than they share the same source table or view object with columns which are needed.
    • In a best case scenario, the unnecessary columns have low cardinality (and thus don’t consume resources) and are hidden from the user.
    • In a worst case scenario, the unnecessary columns significantly impair the scalability and usability of the data model.
  9. Incorrect data types
    • A date or a number column is stored as a text data type thus limiting the functionality of the column (e.g. date and numeric slicers, math/logic via DAX expressions).
    • A relationship is created between columns of different types (number to text).
    • A decimal number data type is used when a fixed decimal number would provide sufficient precision and avoid rounding issues.
    • In many cases the model author just accepts the default data types driven by the source system and Power BI’s detection of these types.
      • As a foundational component of the model, the data types should be a conscious decision. This is particularly true for the numeric columns of large fact tables.
  10. Misuse of bidirectional cross-filtering
    • The model author defines or attempts to define every relationship as a bidirectional cross-filtering relationship.
    • As a consequence, report users get unexpected results and/or query performance is negatively impacted.
    • Examples of bidirectional relationship misuse described in the bidirectional cross-filtering whitepaper include bidirectional relationships to date dimension tables and bidirectional relationships between a fact table and a dimension table which is also related to a separate fact table.
  11. Calculated columns on fact tables
    • Calculated columns can and should be avoided (generally) when the equivalent logic can be implemented further upstream in the solution such as the data warehouse, a SQL view object, or the M query (Power Query).
      • Common examples (beyond the RELATED() function mentioned in #14 below) of fact table calculated columns include IF/SWITCH conditional columns and derived columns that implement simple math based on other columns (Sales Net of Discounts = FactInternetSales[SalesAmount] – FactInternetSales[DiscountAmount])
    • Calculated columns on large fact tables are particularly inefficient as they’re not nearly as compressed as imported columns and their computation during data refresh/processing requires more time and resources (CPU).
  12. Multiple layers of data transformations
    • The model author started with a native SQL query for a table then added on some data transformations via Power Query (M) and then added on a few DAX calculated columns.
    • At a minimum, the data model is more difficult to support or manage as you have to trace back through each layer to understand the full logic being implemented.
    • Beyond manageability, the secondary data transformation layers also erode scalability as M transformations on top of native queries require local resources and, as previously mentioned, DAX calculated columns are not compressed like standard columns.
  13. Normalized dimension tables
    • Rather than presenting the user with a single, de-normalized dimension table, the model author retrieves multiple normalized tables describing the same entity (Product, Product Subcategory, Product Category).
    • This of course reduces usability as a hierarchy can’t be created, more tables are exposed on the Fields list, and the user has to know or learn the relationships between the tables.
      • Moreover, a filter applied on the parent tables in this snowflake schema results in less efficient queries as the multiple relationships have to be read at query time.
  14. Flattened fact tables
    • The model author uses the RELATED() function on fact tables similar to the lookup function in Excel.
      • The single, flat table is more intuitive and the retrieved dimension columns are often used as a source of still further calculated columns.
    • Now the model has copies of columns in separate table and has wasted compute and storage resources.
  15. Deficient date table
    • The date dimension table is missing many common and helpful columns such as a sequentially increasing integer for each grain (week, month, quarter, year), Day Number in Year, Prior Fiscal Year Date, End of Month Date, Holidays, Daylight Savings Time Indicators, and dynamic columns relative to the current date.
      • For example, a date table or SQL view used by the dataset may provide Calendar Month Status column with the values ‘Current Calendar Month’, ‘Prior Calendar Month’, 2 Mo Prior Calendar’, and ‘Other Calendar Month’. Such a column and other similar columns make it easy to define specific date interval filter conditions for reports.
      • The sequentially increasing integer columns are used for time intelligence calculations.

Wrapping Up

Data model anti-patterns is a very big and important topic worthy of multiple blog posts, papers, and more. The only intent of this post is to introduce a number of the anti-patterns I see in the field which you may want to avoid (or at least limit). I’ll describe these anti-patterns as well as better and best data modeling practices in detail at the Power Up! virtual event.

*Given this event and other projects I probably won’t blog again for a few weeks.

One comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s