This is the fourth in a seven-blog series on generating data insights to drive business decision-making. In our first blog, we discussed why data management is important, followed by Step 1 (data ingestion) and Step 2 (data transformation). In this blog, we examine the elements of Step 3—data modeling.
After ingesting data from your various sources and transforming it so it can be analyzed, the next step is to create data models that generate the desired reports. The data modeling process includes four phases:
- Designing the base elements of the data model.
- Developing the model to optimize the analysis.
- Creating measures for the report calculations.
- Optimizing model performance.
You also need to optimize the models to strike the right balance between reporting granularity and analysis performance.
Phase 1 - Designing the Data Model
A data model consists of tables with various columns, such as a list of customers, products, or vendors. A customer table might include columns like company name, primary contact, and address. A product table could have columns describing the specs or characteristics of each product, while a vendor table might list contact info and the type of products each vendor provides.
The tables in a model break down into two major types:
- Fact tables contain numerical data, such as the quantities and dollar values of sales transactions and financial ledger entries.
- Dimension tables describe the data (customers, vendors, and products).
A sales fact table, for example, could show what the company sold over various time periods: day, week, month, quarter, and year. Another model element to consider during the design phase is the relationships among the tables. The sales table needs columns that connect to the customer table and the product table so you can see which customers purchased which products.
You might also want to have a vendor column in the sales table, so you know which vendor components went into each sale. If you manufactured bicycles, the sales table could show you sold 100 bicycles today to 10 retail store customers brought in from the customer table. You would also see a description of the bicycles brought in from the product table. The column connected to the vendor table would show the vendors that provided the components to build the bicycles—frame, pedals, handlebars, brakes, and other parts.
Another important model design concept is the cardinality of the relationship between tables. You can have one-to-one or one-to-many relationships. A one-to-many example is a product group that includes many products. For instance, one brand would have many products, or one state would have many cities and towns.
The ideal model is a star schema. In the middle of the star sits the fact table (sales), and branching out, the points of the star are the dimension tables, showing which products were sold to particular customers along with other pertinent data, such as vendor information.
In some one-to-one relationships, you can consider combining the two tables. Going back to the bike example, if all red bikes come from a single vendor, and the same is true for all blue and all green bikes, you could store those vendors in a column of the product table rather than relating each product to a separate vendor table.
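To make the star-schema idea concrete, here is a minimal sketch in plain Python (table names, columns, and values are all illustrative, not from an actual system): a fact table of sales rows holds keys that resolve against customer and product dimension tables.

```python
# Hypothetical dimension tables, keyed by ID (all names are illustrative).
customers = {1: {"company": "City Cycles"}, 2: {"company": "Bike Barn"}}
products = {
    10: {"name": "Red Bike", "vendor": "Acme Frames"},
    11: {"name": "Blue Bike", "vendor": "Blue Sky Parts"},
}

# Fact table: each row stores dimension keys plus numeric measures.
sales = [
    {"customer_id": 1, "product_id": 10, "qty": 60, "price": 250.0},
    {"customer_id": 2, "product_id": 11, "qty": 40, "price": 300.0},
]

def enrich(sale):
    """Follow the relationships from a fact row out to its dimension tables."""
    product = products[sale["product_id"]]
    return {
        **sale,
        "customer": customers[sale["customer_id"]]["company"],
        "product": product["name"],
        "vendor": product["vendor"],
    }

report = [enrich(s) for s in sales]
```

Because every dimension joins directly to the central fact table, a report can slice sales by customer, product, or vendor without chained lookups.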
Phase 2 - Developing the Data Model
After designing the base elements of a data model, you will usually discover you need to build additional functionality. You might want a column, for example, that multiplies the price of a product by the quantity sold to calculate the total revenue.
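That kind of calculated column can be sketched in plain Python (product names and figures are invented for illustration):

```python
# Illustrative sales rows; "revenue" is the calculated column.
sales = [
    {"product": "Red Bike", "price": 250.0, "qty": 60},
    {"product": "Blue Bike", "price": 300.0, "qty": 40},
]

# Multiply price by quantity sold to get total revenue per row.
for row in sales:
    row["revenue"] = row["price"] * row["qty"]
```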
Another way to enhance a model is to create dimension hierarchies, such as a product group that includes many products within one brand.
There’s also the date hierarchy: days within weeks, weeks within months, and months within years. Hierarchies such as this allow for drill-down analysis. Starting with a sales report for an entire year, you can drill into a specific month, then a specific week, and finally a single day. You can also define geographical hierarchies (city, state, region, country).
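A rough Python sketch of that drill-down idea (dates and amounts are made up): the same fact rows roll up to whichever level of the date hierarchy the report needs.

```python
from collections import defaultdict
from datetime import date

# Illustrative fact rows: (sale date, amount).
sales = [
    (date(2023, 3, 8), 120.0),
    (date(2023, 3, 9), 80.0),
    (date(2023, 4, 1), 200.0),
]

def roll_up(rows, level):
    """Aggregate amounts at one level of the date hierarchy:
    'year', 'month', or 'day'."""
    totals = defaultdict(float)
    for d, amount in rows:
        key = {
            "year": (d.year,),
            "month": (d.year, d.month),
            "day": (d.year, d.month, d.day),
        }[level]
        totals[key] += amount
    return dict(totals)

roll_up(sales, "year")   # {(2023,): 400.0}
roll_up(sales, "month")  # {(2023, 3): 200.0, (2023, 4): 200.0}
```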
In addition to calculated tables and hierarchies, another important concept is row-level security. This pertains to the visibility of the data in a specific row and is driven by the role of each person looking at that data. A sales rep could see the details of every row of sales they completed, while the sales manager might see only a summary of each rep’s territory but can then drill down to sales-rep granularity.
Closely related is Role-Based Access Control (RBAC): sales reps can see their own territories but not their colleagues’ territories. The system detects each person’s role and presents only the authorized data.
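Conceptually, both row-level security and RBAC boil down to filtering rows by the viewer's role before the data reaches the report. Here is a minimal sketch in Python (role names and data are hypothetical; Power BI implements this declaratively with security roles and filters rather than code like this):

```python
# Illustrative sales rows owned by different reps.
sales = [
    {"rep": "alice", "territory": "West", "amount": 500.0},
    {"rep": "bob", "territory": "East", "amount": 700.0},
]

def visible_rows(rows, user, role):
    """Return only the rows this user is authorized to see."""
    if role == "manager":
        return rows  # managers can drill into every rep's rows
    return [r for r in rows if r["rep"] == user]  # reps see only their own rows
```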
Phase 3 - Creating Measures
When building data models, the data by itself is often insufficient. You will need to build measures (calculations) across various fields. For example, you might calculate revenue goals and actuals for specific periods of time.
A handy tool within Microsoft Power BI for creating complex measures is DAX (Data Analysis Expressions), a formula language. For example, you can build in time intelligence to compare year-to-date sales to the previous year, or the same month from one year to another, and then see the percentage increase or decrease.
A quick warning: if your DAX code is complex, analysis performance can suffer. You need to follow certain DAX patterns and best practices, which comes with experience in building measures. If DAX code is not performing well, tools such as DAX Studio can help you monitor and improve its performance.
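In DAX itself, this kind of time intelligence is typically built with functions such as SAMEPERIODLASTYEAR; the underlying arithmetic can be sketched language-neutrally in Python (the figures are invented):

```python
# Monthly sales totals keyed by (year, month); values are illustrative.
monthly_sales = {
    (2022, 6): 10000.0,
    (2023, 6): 12500.0,
}

def yoy_change(totals, year, month):
    """Percentage change versus the same month one year earlier."""
    current = totals[(year, month)]
    prior = totals[(year - 1, month)]
    return (current - prior) / prior * 100

yoy_change(monthly_sales, 2023, 6)  # 25.0 (a 25% increase)
```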
Phase 4 - Optimizing Model Performance
The star schema discussed above is one key optimization technique. Another best practice is removing unnecessary rows and columns. When tapping into data sources, your initial thought might be to use all the data, but that is not always necessary depending on the analysis you want to run.
To improve performance, it’s better to start with the minimum number of rows and columns needed to build a report. If necessary, you can always go back and add rows or columns. If you find a table is too big, such as a product table with thousands of products, it’s probably best to split the table into product groups, which will improve analysis performance.
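Pruning unneeded columns before loading can be sketched like this (column names are illustrative):

```python
# Source rows often carry columns the report never uses.
products = [
    {"sku": "BK-R1", "name": "Red Bike", "color": "red",
     "warehouse_bin": "A7", "legacy_code": "X1"},
]

REPORT_COLUMNS = {"sku", "name", "color"}

# Keep only the columns the report actually needs.
slim = [{k: v for k, v in row.items() if k in REPORT_COLUMNS}
        for row in products]
```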
You should also take a close look at the granularity of the analysis each table needs to produce. For example, sales analysis might be required daily, but for budgets and costs, a monthly view usually suffices. Similarly, sales reps need to view their individual data daily, but managers typically want weekly and monthly views while the CEO will want to see all sales rep activity rolled up to the business unit level.
A Tool for Data Modeling: Power BI Has You Covered
As companies start to manage all their data sources, they end up with large volumes of data—sometimes terabytes. If you don’t optimize the data model, you will run into performance problems.
A great tool to guide you is the Model view in Power BI. It will assist you across all four phases of data modeling: design, develop, calculate, and optimize. Power BI also includes DAX, which is the native formula and query language for other Microsoft data analysis tools such as Power Pivot and SQL Server Analysis Services.
In our next data insights blog, we will present the process for data visualization and analysis. And if you need help now in managing your company’s data to generate business insights, Western Computer is here to assist. Contact us today to learn more about our Microsoft solutions and how they can help you aggregate, analyze and govern all your data sources.
About the Author
More Content by Rasvan Grigorescu