In the world of data and analytics engineering there is a platform that has been gaining followers and generating great interest for some time: dbt (Data Build Tool).
In this post I’ll try to briefly describe what dbt is, what it is used for, which roles use the platform, and its main advantages.
I also want to make it clear that I’m not an expert in dbt. In fact, I’ve only been learning about the platform for a few weeks. My first impressions have been very good, and one of my short- to medium-term goals is to expand my knowledge of the platform, as I find it a very interesting way to keep building bridges between data analysts and data engineers, and also because of its growing popularity across different companies and industries.
What is dbt (Data Build Tool)?
dbt (Data Build Tool) is a platform designed to facilitate data transformation within data warehouses. It’s focused on scalability and ease of use, making it possible to take raw data and turn it into structured, analysis-ready data more efficiently and quickly.
Everything in dbt is based on SQL and YAML files. This might be a barrier for people not used to writing SQL, but on the other hand it is very useful for companies and teams looking for a centralized, high-performing data preparation platform that also lets them run data quality tests while documenting the entire process, all in a single place. dbt also gives a complete view of the data’s lineage, from the original source to its final, fine-tuned state for analysts and data consumers.
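To make that concrete, here is a minimal sketch of what a dbt model looks like: a plain SQL file that selects from a raw table and lightly cleans it. The schema, table, and column names below are hypothetical and purely for illustration.

```sql
-- models/staging/stg_orders.sql
-- A staging model: a plain SELECT that renames and lightly cleans raw columns.
-- The raw schema, table, and column names here are hypothetical.

select
    id                       as order_id,
    customer_id,
    cast(order_date as date) as order_date,
    status
from raw.jaffle_shop.orders
where status is not null
```

When you run `dbt run`, dbt materializes this query as a view or table in the warehouse, so downstream models and analysts can query `stg_orders` instead of the raw table.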
What is dbt used for?
dbt is mainly used for transforming data, running tests, and centralizing the documentation of those processes. It connects directly to a data warehouse and, starting from the raw data, performs the transformations that are common in data engineering: aggregations, calculations, and data cleaning.
dbt’s approach is to move from traditional ETL (Extract, Transform, Load) processes, in which data was extracted from a system, transformed, and then loaded into a data warehouse, to an ELT (Extract, Load, Transform) approach.
The idea of this ELT approach is to make it easier for businesses to use data. Thanks to the capabilities of modern data lakes and data warehouses, data doesn’t need to be transformed before reaching the warehouse. Instead, raw data is loaded into the data warehouse directly from the source systems. Then analytics engineers use dbt to transform (the T in ELT) the raw data through SQL models that can be reused via reference functions, reducing redundancy and complexity.
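As a sketch of those reference functions, the model below builds on the hypothetical `stg_orders` staging model shown earlier using dbt’s `ref()` function, so the dependency is resolved by dbt rather than hard-coded and appears in the lineage graph.

```sql
-- models/marts/fct_customer_orders.sql
-- Aggregates the staging model into one row per customer.
-- ref() makes the dependency on stg_orders explicit and reusable.

select
    customer_id,
    count(*)        as number_of_orders,
    min(order_date) as first_order_date,
    max(order_date) as most_recent_order_date
from {{ ref('stg_orders') }}
group by customer_id
```

Because `stg_orders` is referenced rather than rewritten, the same cleaning logic is reused by every model that needs it.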
Why perform this transformation directly in the data warehouse, after extraction and loading? Because it gives more control over the entire process, and keeping the raw data at hand allows more flexibility for future use cases. With the raw data always available in the warehouse, transformations can be adapted, updated, and documented as needs change within the organization.
Who uses dbt?
dbt is being used more and more by data engineers, data analysts, data scientists, and by a professional role that is becoming increasingly popular: the analytics engineer. Analytics engineers sit halfway between data engineers and data analysts. These professionals use dbt to perform complex data transformation tasks more efficiently, making it easier to manage data and make it available to the business with greater agility.
Main advantages of dbt
The main advantages of dbt are:
- Simplicity: dbt is based on SQL, a language very familiar to most data professionals, making it accessible and easy to learn.
- Version control: dbt integrates with version control systems such as Git, making it easy to collaborate and track changes in data transformation processes.
- Scalability: the SQL- and YAML-based approach lets data professionals start with very specific use cases and grow from there as the company’s analytics grows.
- Modularity: dbt follows a modular development philosophy, breaking data transformations into modules that can be reused. This makes maintenance and quality control of all processes easier.
- Documentation and testing: dbt automatically generates documentation of the data transformation models and lets you run data quality tests on sources and individual columns, ensuring the data meets the company’s requirements and needs (see the SQL sketch after this list).
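On the testing side, column-level checks such as not_null or unique are usually declared in a YAML file next to the model. A singular test, sketched below in SQL against the hypothetical `stg_orders` model, is simply a query that returns the rows violating a rule; dbt fails the test if any rows come back.

```sql
-- tests/assert_no_future_orders.sql
-- A singular dbt test: the test fails if this query returns any rows.

select *
from {{ ref('stg_orders') }}
where order_date > current_date
```

`dbt test` runs these checks, and `dbt docs generate` builds the documentation site, including the lineage graph, from the same project files.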
Conclusion
In short, dbt centralizes data transformation processes through SQL, allowing greater control, better management, and faster documentation of those transformations. This speeds up the delivery of data that meets business requirements for analysis and use.
It’s also growing fast due to the increasing need among data professionals and companies for greater control and centralization of data quality and transformation processes, as well as the rise of professionals specialized in these tasks who sit between data engineers and data analysts: analytics engineers. dbt also has a very active community on Slack that anyone can join to learn from others and get support.