dbt (data build tool) made data engineering accessible to analysts, but "spaghetti SQL" is still a massive problem. After working with over 100 data teams at Futureaiit, we have identified the patterns that separate world-class analytics engineering from chaos. Here is how to structure your dbt projects for scale, maintainability, and team collaboration.
The dbt Revolution and Its Growing Pains
When dbt emerged in 2016, it fundamentally changed how data teams work. Suddenly, analysts could apply software engineering best practices to SQL: version control, testing, documentation, and modular code. The promise was compelling: transform your data warehouse into a well-organized, tested, and documented asset.
However, as dbt adoption exploded, a new problem emerged. Teams would start with enthusiasm, building models rapidly, only to find themselves six months later with an unmaintainable mess. Models with circular dependencies, 500 line SQL files with no clear purpose, and tests that fail silently in production. At Futureaiit, we have helped dozens of companies rescue their dbt projects from this fate.
1. Project Structure: The Foundation of Scale
The single most important decision in your dbt project is how you structure your models. A well organized project makes it immediately clear where new models should live and how data flows through your transformation pipeline.
The Three Layer Architecture
Strictly separate your models into three distinct layers. This is not optional for teams that want to scale beyond 50 models.
Staging Layer (Views)
Staging models have exactly one job: create a clean, consistent interface to your raw source data. They should be 1:1 with source tables, meaning if you have a `raw.users` table, you should have exactly one `stg_users` model.
What staging models should do:
- Rename columns to snake_case: Convert `FirstName` to `first_name` for consistency
- Cast data types: Ensure dates are dates, not strings
- Basic cleaning: Trim whitespace, standardize null representations
- Light filtering: Remove test data or deleted records
What staging models should NOT do:
- Join to other tables
- Aggregate data
- Apply complex business logic
- Create derived metrics
Staging models should always be configured as views, not tables. They are lightweight transformations that should execute quickly, and materializing them wastes warehouse resources.
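Putting those rules together, a staging model might look like the following sketch (source, table, and column names are illustrative, and the exact cleaning steps depend on your raw data):

```sql
-- models/staging/stg_users.sql
-- Configured as a view: a lightweight, 1:1 interface to the raw users table.
{{ config(materialized='view') }}

select
    id                       as user_id,
    trim(FirstName)          as first_name,   -- rename to snake_case, trim whitespace
    trim(LastName)           as last_name,
    cast(created_at as date) as created_date, -- cast strings to proper dates
    nullif(email, '')        as email         -- standardize empty strings to null
from {{ source('raw', 'users') }}
where is_test_account = false                 -- light filtering only: no joins, no aggregates
```

Note that the model renames, casts, cleans, and filters, but never joins or aggregates.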
Intermediate Layer (Ephemeral or Views)
This is where the heavy lifting happens. Intermediate models contain your complex joins, window functions, and business logic. They are the "private functions" of your dbt project: not exposed to end users, but essential for building your final data products.
A common pattern at Futureaiit is to create intermediate models for each major entity in your business. For example:
- `int_users_with_first_order`: Joins users to their first order
- `int_orders_with_line_items`: Aggregates line items to the order level
- `int_subscriptions_with_mrr`: Calculates monthly recurring revenue per subscription
Configure intermediate models as ephemeral when they are only used by one downstream model, or as views when multiple models depend on them. Never materialize intermediate models as tables unless you have a specific performance reason.
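As a sketch, an intermediate model with a single downstream consumer could be configured as ephemeral like this (the upstream model names are illustrative):

```sql
-- models/intermediate/int_users_with_first_order.sql
-- Ephemeral: compiled into a CTE inside its single downstream model,
-- so nothing is materialized in the warehouse.
{{ config(materialized='ephemeral') }}

select
    u.user_id,
    min(o.ordered_at) as first_order_at   -- each user's earliest order
from {{ ref('stg_users') }} u
left join {{ ref('stg_orders') }} o
    on o.user_id = u.user_id
group by u.user_id
```

If a second model later needs this logic, switching `materialized='ephemeral'` to `'view'` is a one-line change.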
Marts Layer (Tables)
Marts are your final, polished data products. These are the tables that your BI tool queries, that your data scientists analyze, and that your executives trust. They should be denormalized, well documented, and optimized for query performance.
Organize marts by business domain:
- `marts/finance/`: Revenue, expenses, cash flow
- `marts/marketing/`: Campaign performance, attribution, funnel metrics
- `marts/product/`: User engagement, feature adoption, retention
Always materialize marts as tables (or incremental models for large datasets). These are queried frequently, and the compute cost of materializing them is far less than the cost of running complex transformations on every query.
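The layer-level materialization defaults can be set once in `dbt_project.yml` rather than per model (the project name and folder layout here are illustrative):

```yaml
# dbt_project.yml (excerpt) -- default materializations per layer.
models:
  my_project:
    staging:
      +materialized: view
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table
```

Individual models can still override these defaults with a `{{ config(...) }}` block when there is a specific reason.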
2. Testing: Your Safety Net
If a model is not tested, assume it is broken. This is not hyperbole. In our experience at Futureaiit, untested models fail silently, producing incorrect results that propagate through your entire analytics stack.
The Minimum Viable Test Suite
At an absolute minimum, every model must have these tests on its primary key:
- `unique`: Ensures no duplicate records
- `not_null`: Ensures the key is always present
This catches the most common data quality issues: accidental cartesian joins (which create duplicates) and missing join keys (which create nulls).
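In a `schema.yml` file, this minimum viable suite looks like the following (model, column, and file names are illustrative):

```yaml
# models/marts/finance/schema.yml (excerpt)
version: 2

models:
  - name: fct_orders
    columns:
      - name: order_id
        description: Primary key -- one row per order.
        tests:
          - unique    # catches accidental cartesian joins
          - not_null  # catches missing join keys
```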
Advanced Testing with dbt_expectations
The built-in dbt tests are a good start, but real-world data quality requires more sophisticated checks. The `dbt_expectations` package provides dozens of additional tests modeled after the Great Expectations library.
Examples we use at Futureaiit:
- `expect_column_values_to_be_between`: Ensure percentages are between 0 and 100
- `expect_column_values_to_match_regex`: Validate email formats or SKU patterns
- `expect_table_row_count_to_be_between`: Alert if daily records drop suspiciously
- `expect_column_pair_values_to_be_equal`: Ensure calculated totals match source totals
These tests catch subtle data quality issues that would otherwise go unnoticed until a stakeholder questions a dashboard.
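A sketch of how these tests are declared in `schema.yml` (assuming `dbt_expectations` is installed via `packages.yml`; model and column names, thresholds, and the regex are all illustrative):

```yaml
# schema.yml excerpt using dbt_expectations tests.
models:
  - name: fct_daily_signups
    tests:
      - dbt_expectations.expect_table_row_count_to_be_between:
          min_value: 100            # alert if daily volume drops suspiciously
    columns:
      - name: discount_pct
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 100
      - name: email
        tests:
          - dbt_expectations.expect_column_values_to_match_regex:
              regex: ".+@.+"        # simplistic check; real patterns are stricter
```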
Custom Tests for Business Logic
Some business rules are too specific for generic tests. For example, "revenue should never decrease month over month for active subscriptions" or "user cohorts should sum to total users." Write custom singular tests for these cases.
Store custom tests in the `tests/` directory as SQL files that return failing rows. If the query returns zero rows, the test passes.
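For example, the "cohorts should sum to total users" rule from above could be expressed as a singular test like this sketch (model and column names are illustrative):

```sql
-- tests/assert_cohorts_sum_to_total_users.sql
-- Singular test: returns rows only when cohort counts disagree with the
-- overall user count. Zero rows returned means the test passes.
with cohort_total as (
    select sum(user_count) as total_from_cohorts
    from {{ ref('fct_user_cohorts') }}
),

overall as (
    select count(*) as total_users
    from {{ ref('dim_users') }}
)

select *
from cohort_total
cross join overall
where total_from_cohorts != total_users
```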
3. CI/CD for Analytics Engineering
Do not merge code blindly. Set up continuous integration to run dbt on every pull request, catching errors before they reach production.
Slim CI: Test Only What Changed
The naive approach is to run `dbt build` on every PR, which rebuilds your entire project. This is slow and expensive. Instead, use dbt's state comparison feature to run only the models you changed and their downstream dependencies.
In your GitHub Actions workflow:
```shell
dbt run --select state:modified+
dbt test --select state:modified+
```
This "Slim CI" approach runs in minutes instead of hours, making it practical to test every commit.
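A minimal workflow sketch, assuming a production `manifest.json` can be fetched for state comparison (the adapter, the `download_prod_manifest.sh` helper, and all paths are hypothetical placeholders for however your team stores production artifacts):

```yaml
# .github/workflows/slim_ci.yml (sketch)
name: slim-ci
on: pull_request

jobs:
  dbt-ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install dbt-snowflake   # swap in your warehouse adapter
      - name: Fetch production state
        run: ./scripts/download_prod_manifest.sh prod-artifacts/  # hypothetical helper
      - name: Build only changed models and their children
        run: |
          dbt run  --select state:modified+ --defer --state prod-artifacts
          dbt test --select state:modified+ --defer --state prod-artifacts
```

The `--defer` flag lets unchanged upstream models resolve against production relations instead of being rebuilt in CI.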
Production Deployment
When code merges to main, run a full `dbt build` in production. Schedule this to run during off-peak hours to minimize warehouse costs and query contention.
For large projects, consider incremental deployment strategies:
- Blue/Green Deployment: Build models in a staging schema, then swap schemas atomically
- Canary Deployment: Deploy to a subset of models first, monitor for errors, then roll out fully
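On Snowflake, the blue/green swap can be performed atomically with `ALTER SCHEMA ... SWAP WITH`; a sketch, with illustrative schema names (other warehouses need a different mechanism, such as view repointing):

```sql
-- 1. Build the full project into a staging schema,
--    e.g. via a dedicated dbt target: dbt build --target staging
-- 2. Swap the freshly built schema with production in one atomic statement,
--    so consumers never see a half-built state.
alter schema analytics swap with analytics_staging;
```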
4. Documentation: Your Future Self Will Thank You
Six months from now, you will not remember why you built that intermediate model or what that cryptic column name means. Document everything.
Model Documentation
Every model should have a description in its schema.yml file explaining:
- What business question it answers
- What grain it is at (one row per what?)
- Any important caveats or known limitations
Column Documentation
Document every column, especially:
- Calculated metrics: How is this computed?
- Categorical fields: What are the possible values?
- Timestamps: What timezone? What does this timestamp represent?
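A `schema.yml` entry covering both model-level and column-level documentation might look like this sketch (all names, values, and descriptions are illustrative):

```yaml
# schema.yml excerpt -- model and column descriptions.
version: 2

models:
  - name: fct_subscription_mrr
    description: >
      Monthly recurring revenue per subscription.
      Grain: one row per subscription per month.
      Caveat: excludes one-time charges and refunds.
    columns:
      - name: mrr_usd
        description: Normalized monthly price in USD (annual plans divided by 12).
      - name: plan_tier
        description: "Subscription tier. Possible values: free, pro, enterprise."
      - name: billed_at
        description: Billing timestamp in UTC, as reported by the payment processor.
```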
Auto Generated Documentation
Use `dbt docs generate` to create a searchable, interactive documentation site. Deploy this to a static hosting service (Netlify, S3, etc.) so your entire team can explore the data lineage and understand how models connect.
5. Performance Optimization
As your dbt project grows, build times will increase. Here is how to keep them manageable.
Incremental Models
For large fact tables, use incremental materialization to process only new data. This is essential for tables with billions of rows.
Key considerations:
- Unique key: Define how to identify and update existing records
- Incremental strategy: Choose between merge, delete+insert, or append
- Full refresh: Schedule periodic full refreshes to catch late-arriving data
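These three considerations come together in a config block and an `is_incremental()` filter; a sketch with illustrative model and column names (the `interval` syntax varies by warehouse):

```sql
-- models/marts/fct_events.sql -- incremental model that processes new rows only.
{{ config(
    materialized='incremental',
    unique_key='event_id',          -- how existing records are identified and updated
    incremental_strategy='merge'    -- alternatives: delete+insert, append
) }}

select
    event_id,
    user_id,
    event_type,
    occurred_at
from {{ ref('stg_events') }}

{% if is_incremental() %}
  -- On incremental runs, only scan rows newer than what is already loaded.
  -- The lookback window helps catch late-arriving data between full refreshes.
  where occurred_at > (select max(occurred_at) - interval '3 days' from {{ this }})
{% endif %}
```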
Partitioning and Clustering
On BigQuery, Snowflake, and Redshift, use partitioning and clustering to reduce query costs and improve performance. Partition by date for time series data, and cluster by frequently filtered columns.
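On BigQuery, for example, partitioning and clustering are declared directly in the model config; a sketch with illustrative column names (Snowflake and Redshift use their own options, such as `cluster_by` and `sort`/`dist` keys):

```sql
-- BigQuery: partition the table by date, cluster by a frequently filtered column.
{{ config(
    materialized='table',
    partition_by={'field': 'event_date', 'data_type': 'date'},
    cluster_by=['customer_id']
) }}

select * from {{ ref('stg_events') }}
```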
Parallel Execution
dbt runs models in parallel based on their dependency graph. Ensure your models have clear dependencies (via
ref() and source()) so dbt can maximize parallelism.
How Futureaiit Can Help
At Futureaiit, we specialize in building production-grade analytics infrastructure. Our team has implemented dbt for companies ranging from early stage startups to Fortune 500 enterprises. We can help you:
- Rescue failing dbt projects: Refactor spaghetti SQL into maintainable, tested models
- Establish best practices: Set up project structure, testing, and CI/CD from day one
- Train your team: Onboard analysts and engineers to dbt best practices
- Optimize performance: Reduce build times and warehouse costs through incremental models and smart materialization
- Integrate with your stack: Connect dbt to your data warehouse, orchestrator, and BI tools
We have built reusable dbt packages for common patterns like SCD Type 2 dimensions, attribution modeling, and cohort analysis. Our clients typically see 10x improvement in data team productivity and 50% reduction in data quality incidents.
Common Pitfalls to Avoid
Based on our experience with over 100 dbt projects, here are the mistakes we see most often:
1. Skipping the Staging Layer
Teams often jump straight to building marts, pulling directly from raw tables. This creates tight coupling to your source schema. When the source changes (and it will), you have to update dozens of downstream models. Staging models provide a stable interface that absorbs these changes.
2. Overusing Ephemeral Models
Ephemeral models are compiled into CTEs in downstream models. This seems efficient, but it can create massive, unreadable SQL and prevent dbt from parallelizing execution. Use views instead for models with multiple downstream dependencies.
3. Ignoring Test Failures
Tests are only useful if you act on failures. Set up alerting (via dbt Cloud, Slack, or email) so test failures are immediately visible. Treat test failures like production incidents: investigate and fix promptly.
4. Not Using Sources
Always define your raw tables in `sources.yml` and reference them with `source('schema', 'table')` instead of hardcoding table names. This enables source freshness checks and makes your lineage graph complete.
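A minimal sources file with freshness checks might look like this sketch (source, schema, table, and column names are illustrative):

```yaml
# models/staging/sources.yml
version: 2

sources:
  - name: raw                      # referenced as source('raw', 'users')
    schema: raw
    loaded_at_field: _loaded_at    # timestamp column used for freshness checks
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: users
      - name: orders
```

Running `dbt source freshness` then alerts you when an upstream pipeline stalls, before stale data reaches your marts.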
5. Monolithic Models
If a model is more than 200 lines of SQL, it is probably doing too much. Break it into smaller intermediate models. This improves readability, testability, and reusability.
Conclusion
dbt is a powerful tool, but like any tool, it requires discipline and best practices to use effectively. The patterns we have outlined—layered architecture, comprehensive testing, CI/CD, thorough documentation, and performance optimization—are not optional niceties. They are the foundation of a scalable, maintainable analytics platform.
At Futureaiit, we have seen firsthand how these practices transform data teams. Analysts spend less time debugging and more time delivering insights. Data quality improves dramatically. New team members onboard in days instead of weeks. And most importantly, stakeholders trust the data.
If your dbt project is becoming unwieldy, or if you are just getting started and want to avoid common pitfalls, we can help. Our team has the battle-tested experience to set you up for long term success.
Ready to level up your analytics engineering? Contact Futureaiit to discuss how we can help you build a world class dbt project.
Futureaiit
AI & Technology Experts