Very often the construction of a Data Warehousing / Business Intelligence (DW / BI) system is performed following the flow of traditional engineering: analysis, design, construction, testing and implementation. Communication between developers and people interested in the business is almost nonexistent, and developers are interested in technologies for working with data but often neglect to ask this most important question: "What business questions do we want to answer with the available data in order to support the decision-making process?"
In this article, I will discuss how to build DW / BI systems following many of the practices and agile principles mentioned in the book of Ken Collier: Agile Analytics: A Value-Driven Approach to Business Intelligence and Data Warehousing.
Data Warehousing / Business Intelligence (DW / BI) system
A system has inputs, processes and outputs.
Considering this approach, the inputs are all sources from which we need to extract data. In the graph above we can observe: relational databases (RDBMS), CSV files, Excel files, flat files and Web services (REST / SOAP).
The processes of data extraction, transformation and loading to a repository are known as ETL (extract, transform, load). This processes make it possible to move data to a temporary repository known as Staging and finally bring them to the Data Warehouse.
A Data Warehouse is a repository of historical data that is the main source for data analysis activities. The set of activities performed to move data from source to the Data Warehouse is known as Data Warehousing.
Finally, the output encompasses all information that can be obtained from the Data Warehouse through various Business Intelligence activities.
A DW / BI system is the result of orchestrating the activities of Data Warehousing and Business Intelligence to answer business questions and support the decision-making process in an organization.
DW / BI Systems Development Manifesto
We are uncovering better ways of building DW / BI systems by doing it and helping others do it. Through this work, we have come to value:
Individuals and interactions over processes and tools
Working DW / BI systems over comprehensive documentation
Collaboration with end users and stakeholders over contract negotiation
Responding to change over following a detailed plan
Although we know that the elements on the right have value, we value the ones on the left more.
Agile practices are not prescriptive, they should be adaptive and help us to fulfill the main goal of building DW / BI systems that work, add value to the organization and to be built with high quality.
Individuals, interactions and collaboration
Sometimes the technology area acquires tools or platforms to build DW / BI systems and decides on technologies to be used, leaving aside the understanding of the business value that end users want to achieve.
Agile emphasizes close cooperation among all stakeholders: managers, business experts, developers, project managers, sponsors, consultants, among others. This allows a common understanding among all stakeholders to be achieved. But, how we did it when we have time and budget constraints?
Running an inception before such projects can help us to:
- Understand the business questions to be answered through the DW / BI system
- Discover the available data sources
- Understand the expected information delivery mechanisms: reports, dashboards, ad hoc reporting, infographies, etc.
- Train project stakeholders in agile fundamentals and the agile way of working
With the understanding of what we are looking for with the DW / BI system, the next step is to prioritize the work in order to adjust it to the time and budget constraints.
One technique that can be used is to write user stories with the business questions and use a frequency vs. difficulty quadrant to prioritize them.
The quadrant of more frequent and more difficulty will have user stories with the business questions that are more frequently made and more difficult to answer with the available data. Said stories might be considered as the highest priority and work can begin with them .
The common understanding among all stakeholders of the project requires continuous interaction, excellent communication, empowerment and especially close cooperation.
DW / BI system working
Within Agile the best measure of progress is to see the DW / BI system running as early as possible, this is achieved by dividing the work into iterations.
Iterations are usually two to four weeks long and at the end of each iteration a demonstration, known as showcase, is done with the results. But, how to build a DW / BI system using agile practices?
Before starting to work on the DW / BI system user stories, it is very helpful to start with what is known as the Iteration 0. This iteration can last from one to two weeks and its main objective is to create everything you need from the technical point of view to start building the DW / BI system. This includes:
- Version control system (SCM)
- Work environments provisioning: developer (developer sandbox), staging, preproduction and production with their respective software installation and configuration base, RDBMS, platforms and tools that will be used for working
- Installing and configuring the continuous integration server
- Installation and configuration of the agile project management and collaboration tools
It is common to have little versioning in DW / BI systems or almost no versioning. It is of utmost importance to use a version control system (SCM) to version all artifacts created during the project. Some of the most widely used Open Source SCM are git, SVN, CVS.
The following base structure can be used as a reference for project artifacts versioning:
Data Warehousing / Business Intelligence System
Environment provisioning code
System Source Code
BI Applications code
System Static data (.csv, .txt, .xml, .sql)
SQL scripts for databases change management
ETL Code (Data Warehousing)
Dashboards and reports source files
Metadata schemes or models
One of the agile success factors is automating repetitive tasks, so that development teams can focus on issues that add value to the DW / BI system.
With regard to environments where the DW / BI system will run, automation implies creating code that allows provision of the operating system, base software, database server, settings, tools, etc. This usually is called IaC (Infrastructure as Code). Ansible is a platform that allows you to create YAML code for provisioning environments and works together with Vagrant which is a virtual environment manager.
For a DW / BI system we can have four environments:
- Production Environment (PROD): DW / BI system production-ready
- Developer Environment (Developer Sandbox): this is the DW / BI system’s development environment
- Quality Assurance Environment (QA): all developers’ changes are integrated in this environment and the DW / BI system quality controls are performed
- Pre-production environment (PRE-PROD): is a production-like environment, tests and end-user demonstrations run on this environment
An important feature of Ansible is that the same infrastructure code can be used to provision any of the environments (DEV, QA, PRE-PROD, PROD). In addition, it accomplishes to maintain consistency of software versions used in all environments.
The infrastructure code is versioned it in the /provisioning directory mentioned in the versioning section above.
The provisioning code for Pentaho v5.4 CE platform using the PostgreSQL v9.4 RDBMS and running on CentOS v7.1 operating system can be found in this GitHub repository.
Iterative / Incremental development
Evolutionary Dimensional Modeling
A well-written user story is the work unit to begin building and evolving the DW / BI system. Assuming you have the following user story:
As a Financial Analyst
I need the ability to see the profit margin per year
In order to identify the least profitable years
In the user story you can identify quantitative data such as profit margin and qualitative data such as date (year). It leads to having a first version of the dimensional model, as shown below:
As it can be seen, it is important to build the minimum model that satisfies the user history. A second user story might say:
As a Financial Analyst
I need the ability to see the profit margin per branch and year
In order to identify the least profitable branches
The second dimensional model version would include the Branch Office dimension, as shown below:
As can be seen, the dimensional model evolves in each user story in progress. This is known as Evolutionary Dimensional Modeling.
To keep track of incremental changes in the structure of the dimensional model, it is suitable to use a database change management tool. Some of the best known Open Source tools are Flyway, Liquibase or DBDeploy.
Each new change in the structure of the dimensional model is versioned through what is known as a delta or migration. A migration is a file with SQL instructions named in the format the database change management tool requires.
Migration files are versioned in the /db_migrations directory mentioned in versioning section above.
The main advantage of a change management tool database is that it allows the versioning of all structure-level changes, something rarely or almost never done in such projects. In addition, migration can create the dimensional model to the latest version in any of the environments: DEV, QA, PRE-PROD, PROD.
Dimensional model loading processes
Once you have built a first version of the dimensional model and identified the data sources, you can begin building the processes to load the dimensional model. These processes are commonly known as ETL (Extraction, Transformation and Load) or data ingestion.
ETL processes allow us to move, transform and load the data into the temporary repository (staging) and then to the dimensional model. These processes can be programmed using a programming language, or constructed in a data integration tool. Some Open Source tools for building ETL processes and data warehousing activities are: Pentaho Data Integration, Talend, Jaspersoft ETL.
It is important that the ETL processes’ metadata is based on files so that they can be versioned in the to the /etls directory mentioned in versioning section above.
It is important to perform unit tests on the ETL processes to ensure they serve their purpose and that the data is consistent between repositories it gets moved and transformed to. To generate test data (fake data) you can use a tool like Mockaroo.
Continuous integration used in the context of DW / BI system allows two main activities:
- Provisioning environments
- Execution and scheduling of data loading processes (ETLs)
There are several continuous integration servers, some known Open Source are: go.cd, Jenkins, TravisCI. In the next picture contains two pipelines configured in the go.cd tool for the DW / BI system:
In the Provisioning pipeline you can launch provisioning code execution and provision the QA, PRE-PROD and PROD environments.
In the Deployment pipeline you can schedule the execution of the main ETLs processes to run first in the QA environment. Then, if everything is successful they run in the PRE-PROD environment and finally in PROD.
Information delivery mechanisms
Once the dimensional model has been populated with data through ETLs processes, different solutions are built for business intelligence activities. These solutions can be categorized into:
- Reporting: institutional reports, on-demand reports (ad hoc), dashboards
- OLAP Solutions: data analysis cubes and pivot tables
- Custom: web portals, visualization applications, infographies
The delivery mechanism that business users decide to have should add the maximum value to answer their business questions and support their decision-making process.
Building a DW / BI system using agile practices brings visible benefits at each iteration. The close interaction and collaboration with business users, the prioritization of questions to answer and the correct application of engineering practices, such as provisioning automation environments, evolutionary dimensional modeling and continuous integration allow us to deliver results early, minimize risks through continuous learning, avoid unnecessary investments, evolve and satisfy changing requirements.
Thank you to Maria José Ormaza for translating the article to English from its original Spanish.