Cut the costs of your data processing pipelines
Data engineering builds ETL pipelines that automate data transformation and value-creation workflows. The proliferation of data sources, ever-growing data volumes, migration to the cloud, the rising demand for near real-time data, and the digital transformation of organizations are among the major drivers of rising data engineering costs. The goal of this article is to identify ways to reduce costs, facilitate maintenance and operations, and align data engineering investments to maximize business value for the organization.
1. Optimize storage
- Automatically archive old data, not to be confused with deleting it (a tier-based archiving sketch follows this list).
- Structure data according to a proven modeling approach to optimize storage and use.
- Delete data that is obsolete or has no business value.
- If a text format is used, prefer compressed columnar formats (Parquet, Delta, Iceberg) over CSV and JSON (a conversion sketch follows this list).
- Differentiate between “hot”, “warm”, and “cold” storage tiers where the platform offers them, as their costs differ.
- Avoid data duplication by using virtualization when possible and leveraging “CLONE” functions if the platform allows it.
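As an illustration of the archiving and tiering points above, here is a minimal Python sketch that moves blobs untouched for a year to Azure's cheaper Archive tier. It assumes an Azure Blob Storage account; the connection string and container name are placeholders. Most platforms can also do this declaratively through lifecycle-management policies, which is usually preferable in production.

```python
from datetime import datetime, timedelta, timezone

from azure.storage.blob import BlobServiceClient

# Placeholder connection string and container name.
service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("raw-data")

cutoff = datetime.now(timezone.utc) - timedelta(days=365)

for blob in container.list_blobs():
    # Archive, do not delete: the data stays recoverable at a lower
    # storage cost, with slower and costlier retrieval.
    if blob.last_modified < cutoff:
        container.get_blob_client(blob.name).set_standard_blob_tier("Archive")
```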
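And a minimal sketch of the text-to-columnar conversion, using pandas (the file names are hypothetical, and Parquet support requires pyarrow or fastparquet). The Parquet output is compressed and columnar, so analytical engines read only the columns they need:

```python
import os

import pandas as pd  # requires pyarrow or fastparquet for Parquet support

# Hypothetical file names.
df = pd.read_csv("events.csv")
df.to_parquet("events.parquet", compression="snappy")

# Compare on-disk footprint: Parquet is typically a fraction of the CSV size.
print(os.path.getsize("events.csv"), "->", os.path.getsize("events.parquet"))
```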
2. Optimize processing
- Isolate the key stages of the pipeline so that a failure does not force a restart of the entire pipeline (see the checkpoint sketch after this list).
- Automate the development of data pipelines with a metadata-driven approach built on shared modules.
- Break down the pipelines into modular steps/activities. Add conditions so that activities only run when necessary.
- Prioritize incremental pipelines (Change Data Capture (CDC), Change Tracking (CT), watermarking); loading a full source should be the exception. A watermarking sketch follows this list.
- Some low-value, single-use, or low-volatility data could be excluded from processing or handled differently. For example, existing data that is 10 years old may be relevant to retain, but it may be unnecessary to check whether it has changed.
- Adapt latency to actual needs (not all data needs to be near real-time).
- There is no shortcut to finding the right balance between parallelism, computing power, and processing time: it takes trial and error. The key is to measure and adjust the parameters accordingly. For example, in some contexts, increasing computing power won’t significantly reduce processing time.
- Dynamically adjust computing power based on workload to reduce processing costs, for example by programmatically increasing or decreasing the number of Azure SQL vCores around the most intensive tasks (sketched after this list).
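To illustrate stage isolation, here is a minimal, platform-agnostic sketch: each stage records its success in a checkpoint file, so a rerun skips completed stages instead of restarting the whole pipeline. The stage names and checkpoint path are hypothetical; orchestrators such as Airflow or Data Factory provide this natively.

```python
import json
from pathlib import Path

# Hypothetical checkpoint file listing the stages that already succeeded.
CHECKPOINT = Path("pipeline_checkpoint.json")

def completed_stages() -> set:
    return set(json.loads(CHECKPOINT.read_text())) if CHECKPOINT.exists() else set()

def run_stage(name: str, func) -> None:
    if name in completed_stages():
        print(f"skipping {name}: already completed")
        return
    func()  # may raise; the checkpoint is only updated on success
    CHECKPOINT.write_text(json.dumps(sorted(completed_stages() | {name})))

# Hypothetical stages: each one is an independent, restartable unit of work.
run_stage("extract", lambda: print("extracting"))
run_stage("transform", lambda: print("transforming"))
run_stage("load", lambda: print("loading"))
```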
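The watermarking variant of incremental loading can be as simple as remembering the highest modification timestamp seen so far and querying only past it. A minimal sketch, using SQLite as a stand-in source (the table and column names are hypothetical; the pattern is engine-agnostic):

```python
import sqlite3

def incremental_extract(conn, last_watermark: str):
    """Pull only rows modified since the previous run, instead of the full table."""
    rows = conn.execute(
        "SELECT id, amount, modified_at FROM orders WHERE modified_at > ?",
        (last_watermark,),
    ).fetchall()
    # The new watermark would be persisted between runs (here simply returned).
    new_watermark = max((row[2] for row in rows), default=last_watermark)
    return rows, new_watermark

# Tiny demo source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, modified_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-05"), (2, 25.0, "2024-03-15")],
)

rows, watermark = incremental_extract(conn, last_watermark="2024-02-01")
print(rows, watermark)  # only the row modified after the stored watermark
```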
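For the Azure SQL example, scaling can be scripted with a plain T-SQL ALTER DATABASE statement. The sketch below assumes pyodbc and placeholder connection details; the service-objective names are examples of Azure SQL general-purpose tiers (e.g. 'GP_Gen5_2' and 'GP_Gen5_8' for 2 and 8 vCores), and the scaling operation itself completes asynchronously.

```python
import pyodbc

# Placeholder connection string pointing at the logical server's master database.
CONN_STR = "Driver={ODBC Driver 18 for SQL Server};Server=<server>;Database=master;..."

def scale_database(db_name: str, service_objective: str) -> None:
    # ALTER DATABASE cannot run inside a transaction, hence autocommit=True.
    # db_name is assumed to come from trusted configuration, not user input.
    with pyodbc.connect(CONN_STR, autocommit=True) as conn:
        conn.execute(
            f"ALTER DATABASE [{db_name}] "
            f"MODIFY (SERVICE_OBJECTIVE = '{service_objective}')"
        )

scale_database("analytics", "GP_Gen5_8")  # scale up before the nightly batch
# ... run the intensive workload, then scale back down ...
scale_database("analytics", "GP_Gen5_2")
```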
3. Make the costs visible
- Data engineering expenditure is variable, not fixed; it must be measured.
- Activate the monitoring features of your platform, at least until you fully understand the costs associated with its use.
- Measure costs by pipeline, business area, cluster, workspace, environment, etc.
- Identify rarely used but expensive pipelines (the sketch after this list shows one way to surface them).
- Identify costly queries and optimize them where possible.
- Eliminate or reduce the frequency of low value-added pipelines.
- Set up a dashboard that surfaces all the metrics needed to monitor costs.
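Once per-run costs are exported (most platforms can tag runs or emit billing data), surfacing the “rarely used but expensive” pipelines is a simple aggregation. A minimal pandas sketch over hypothetical run data:

```python
import pandas as pd

# Hypothetical export: one row per pipeline run with its measured cost.
runs = pd.DataFrame({
    "pipeline": ["sales_etl", "sales_etl", "iot_ingest", "iot_ingest", "legacy_sync"],
    "cost_usd": [1.20, 1.35, 0.40, 0.45, 95.00],
})

summary = runs.groupby("pipeline")["cost_usd"].agg(
    total_cost="sum", run_count="count", cost_per_run="mean"
)
# High cost, few runs: prime candidates for elimination or a lower frequency.
print(summary.sort_values("cost_per_run", ascending=False))
```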
4. Best practices to avoid missteps
- Always start with the smallest (least expensive) computing cluster, and document the patterns for increasing capacity.
- Prohibit persistent manual clusters.
- Require automatic shutdown of idle computing clusters (the sketch after this list shows one way to enforce it at creation time).
- Implement quotas and consumption alerts.
- Identify a person responsible for this monitoring.
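Automatic shutdown can be enforced at cluster-creation time rather than left to discipline. Here is a sketch against the Databricks Clusters REST API, assuming a Databricks workspace; the URL, token, runtime version, and node type are placeholders, and "autotermination_minutes" terminates the cluster after the given idle time:

```python
import requests

HOST = "https://<workspace>.azuredatabricks.net"   # placeholder workspace URL
TOKEN = "<personal-access-token>"                  # placeholder credential

cluster_spec = {
    "cluster_name": "etl-small",
    "spark_version": "15.4.x-scala2.12",   # example runtime version
    "node_type_id": "Standard_DS3_v2",     # example (small) node type
    "num_workers": 1,                      # start with the smallest cluster
    "autotermination_minutes": 20,         # automatic shutdown when idle
}

resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```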
These elements are fundamental principles that allow the engineering team to initiate cost monitoring from the outset of the project. To establish a cost management culture within a company, it may be helpful to learn about the key principles and workings of FinOps through the article FinOps: 10 Key Tactics to Optimize Your Cloud Investments.
FinOps is an operational framework and cultural practice that aims to maximize the business value of the cloud by creating shared financial accountability among technical, financial, and business teams. It enables faster, data-driven decision-making through improved visibility, continuous optimization, and organizational collaboration around cloud costs and usage.