This is the third in a seven-blog series on data management. In our first blog, we discussed why data management is important; in our second, we covered Step 1 in the process: data ingestion. In this blog, we examine the elements of Step 2: data transformation.
Each data source that feeds into your data management system follows its own rules for which record fields it includes and how data is entered into each field. In some cases, the fields and rules will be the same across sources, but in other cases, they will vary. There will also be situations where data is entered incorrectly, or a record field is left blank.
All this makes it necessary to cleanse the data that arrives from the multiple sources of information. Only then can you generate valid business insights.
The Three Layers of Data Transformation
The process of data cleansing detects and corrects errors, and then reconciles data records. As you set up your cleansing process, you will likely maintain data sets across three transformation layers:
- Raw Layer, also known as the bronze layer, is where the data is centralized and stored as-is when ingested from each data source. No cleansing occurs at this stage, but you now have all the data from multiple sources in one place. This is useful when source systems are retired and in cases where you need to refer to the original data after cleansing, which occurs in the next two layers. You can also retain point-in-time historical snapshots of your raw data. But this raw layer shouldn't be accessible to everyone. It's usually limited to developers, both because technical skills are required to access the data and because you want to ensure the data is never tampered with.
- Curated Layer, also known as the silver layer, is where data cleansing starts. Missing data fields are replaced or deleted, and duplicates are removed or reconciled. You also correct misspellings and resolve multiple versions of the same field: rows representing the same record are merged into one record. Cleansing also validates the data that's relevant for analysis and creates the single source of truth that is essential to accurate analysis. Any data that is not valid, such as a field that is supposed to contain numbers but contains letters, is corrected and converted to the valid data type. Additional transformations that can occur in the Curated Layer include adding, renaming, splitting, merging, and parsing columns. Data at this layer is usually handled by business analysts and data engineers.
- Aggregated Layer, also known as the gold layer, is the final form of data transformation where you apply business rules. For example, you might identify the need to merge tables or append two tables that have the same structure. You might also want to aggregate (sum) fields. Aggregating fields for a table will improve the performance of the analysis phase and generate more meaningful data. Data at this level is usually handled by data scientists and subject matter experts—users who understand the operations of your business.
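The three layers above can be sketched in a few lines of pandas. This is a minimal illustration, not a production pipeline; the customer and sales fields are hypothetical examples, not data from any real system.

```python
import pandas as pd

# Raw (bronze) layer: data stored exactly as ingested, errors and all.
raw = pd.DataFrame({
    "customer": ["Acme", "acme", "Beta Co", None],   # case variants, a blank
    "region":   ["West", "West", "East", "East"],
    "amount":   ["100", "100", "250", "175"],        # numbers stored as text
})

# Curated (silver) layer: cleanse without applying business rules.
curated = raw.dropna(subset=["customer"]).copy()       # drop blank records
curated["customer"] = curated["customer"].str.title()  # resolve case variants
curated["amount"] = pd.to_numeric(curated["amount"])   # convert to valid type
curated = curated.drop_duplicates()                    # remove duplicates

# Aggregated (gold) layer: apply business rules, e.g. sum sales per region.
aggregated = curated.groupby("region", as_index=False)["amount"].sum()
print(aggregated)
```

Note that deduplication happens after the case variants are resolved; otherwise "Acme" and "acme" would survive as two separate records.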
All three layers are usually stored in the same data lake. In cases where a report needs greater granularity, the data lake makes it easier to go back and extract data from the Curated Layer or the Raw Layer.
Tools for Streamlining Data Transformation
Two key tools that streamline data transformations are available in Microsoft Power BI:
- Power Query is a lightweight editor for extracting, transforming, and loading data. It's intuitive and essentially a graphical user interface that streamlines the process of data transformations.
- Dataflows allow you to create reusable transformation logic, so you don't have to repeat the same steps every time you set up a transformation process. Pre-configured dataflows help tie analyses to the single source of truth because users connect to the dataflow instead of directly to the data. This prevents users from setting up their own transformations, which could vary from one user to another. With everyone reusing the same dataflow, you benefit from report consistency, and you can scale your analysis capacity.
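Dataflows are a Power BI feature, but the underlying idea—every report reusing one shared piece of transformation logic rather than re-implementing its own—can be sketched in plain Python. The `clean_orders` function and its columns are hypothetical stand-ins for a dataflow's transformation steps.

```python
import pandas as pd

def clean_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Shared cleansing logic: every report reuses these exact steps."""
    out = df.copy()
    out["product"] = out["product"].str.strip().str.title()
    out["quantity"] = pd.to_numeric(out["quantity"], errors="coerce").fillna(0)
    return out.drop_duplicates()

# Two reports connect to the same cleansed output instead of each
# applying its own (possibly divergent) transformations.
orders = pd.DataFrame({"product": [" widget", "Widget "], "quantity": ["5", "5"]})
report_a = clean_orders(orders)
report_b = clean_orders(orders)
assert report_a.equals(report_b)  # report consistency
```

Because both reports call the same function, a fix to the cleansing logic propagates everywhere at once—the same maintenance benefit a shared dataflow provides.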
Other tools that come in handy for data transformation include:
- Azure Data Factory and Azure Synapse Integration orchestrate data transformations into multiple actions tied together in pipelines. When combining these tools with Power Query and Dataflows, you can execute complex transformations and calculations.
- Spark/Databricks Notebooks, a web-based document interface—with runnable code, narrative text, and visualizations—allow you to validate, transform and move datasets and invoke machine learning models to enrich those datasets and generate business insights.
- Data Analysis Expressions (DAX) is a formula expression language used in Analysis Services, Power BI, and Power Pivot in Excel. DAX formulas include functions, operators, and values to perform advanced calculations, queries, and filters on data in related tables and columns in tabular data models.
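DAX itself runs inside a tabular model, but the kind of calculation a DAX measure expresses—summing a column filtered through a relationship between tables—can be illustrated in pandas. The table and column names here are hypothetical, and the DAX measure in the comment is only a rough analogue.

```python
import pandas as pd

# Two related tables, as in a tabular data model.
products = pd.DataFrame({"product_id": [1, 2], "category": ["Toys", "Tools"]})
sales = pd.DataFrame({"product_id": [1, 1, 2], "amount": [10.0, 20.0, 5.0]})

# Roughly analogous to a measure such as:
#   Toy Sales = CALCULATE(SUM(Sales[Amount]), Products[Category] = "Toys")
joined = sales.merge(products, on="product_id")  # follow the relationship
toy_sales = joined.loc[joined["category"] == "Toys", "amount"].sum()
print(toy_sales)  # 30.0
```

The filter on one table (`Products`) constrains the sum over another (`Sales`) through their relationship—the pattern at the heart of most DAX measures.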
As you interact with these tools, set up your data transformations as far upstream as possible (preferably within the data sources). This will increase performance while also improving usability and scalability.
For example, databases and data warehouses usually come with tools that can handle transformations for large volumes of data. From there, you can extract the data after the transformation and ingest it into a data lake—as described in the second blog of this series.
That's where you can use the Azure Data Factory and Azure Synapse pipelines to handle any complex transformations and to augment the data with additional data sources. Last but not least, you can use Power Query, Dataflows, and DAX in Power BI to further transform your data.
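The upstream-first approach above can be sketched with an in-memory SQLite database standing in for a source warehouse: the aggregation runs inside the database, so only the summarized rows are extracted downstream. The `sales` table and its values are illustrative assumptions.

```python
import sqlite3

# An in-memory database as a stand-in for a source data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("West", 100.0), ("West", 50.0), ("East", 250.0)])

# The transformation (aggregation) runs in the source system; downstream
# tools ingest only the already-summarized result.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('East', 250.0), ('West', 150.0)]
conn.close()
```

Pushing the `GROUP BY` into the source means less data moves across the wire and downstream tools work with smaller, cleaner inputs—the performance and scalability gain described above.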
In our next data insights blog, we will present the process for data modeling. And if you need help now in managing your company’s data to generate valuable business insights, Western Computer is here to help. Contact us today to learn more about our Microsoft solutions and how they can help you aggregate, analyze and govern your data.
About the Author: Rasvan Grigorescu