dbt (data build tool) made data engineering accessible to analysts, but "spaghetti SQL" is still a massive problem. After working with over 100 data teams at Futureaiit, we have identified the patterns that separate world-class analytics engineering from chaos. Here is how to structure your dbt projects for scale, maintainability, and team collaboration.
The dbt Revolution and Its Growing Pains
When dbt emerged in 2016, it fundamentally changed how data teams work. Suddenly, analysts could apply software engineering best practices to SQL: version control, testing, documentation, and modular code. The promise was compelling: transform your data warehouse into a well-organized, tested, and documented asset.
However, as dbt adoption exploded, a new problem emerged. Teams would start with enthusiasm, building models rapidly, only to find themselves six months later with an unmaintainable mess. Models with circular dependencies, 500 line SQL files with no clear purpose, and tests that fail silently in production. At Futureaiit, we have helped dozens of companies rescue their dbt projects from this fate.
1. Project Structure: The Foundation of Scale
The single most important decision in your dbt project is how you structure your models. A well organized project makes it immediately clear where new models should live and how data flows through your transformation pipeline.
The Three Layer Architecture
Strictly separate your models into three distinct layers. This is not optional for teams that want to scale beyond 50 models.
Staging Layer (Views)
Staging models have exactly one job: create a clean, consistent interface to your raw source data. They should be 1:1 with source tables, meaning if you have a `raw.users` table, you should have exactly one `stg_users` model.
What staging models should do:
- Rename columns to snake_case: Convert `FirstName` to `first_name` for consistency
- Cast data types: Ensure dates are dates, not strings
- Basic cleaning: Trim whitespace, standardize null representations
- Light filtering: Remove test data or deleted records
What staging models should NOT do:
- Join to other tables
- Aggregate data
- Apply complex business logic
- Create derived metrics
Staging models should always be configured as views, not tables. They are lightweight transformations that should execute quickly, and materializing them wastes warehouse resources.
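Putting those rules together, a staging model might look like the following sketch (source, table, and column names are illustrative, and the exact cleaning steps depend on your raw data):

```sql
-- models/staging/stg_users.sql
-- Configured as a view: a lightweight, 1:1 interface to the raw users table.
{{ config(materialized='view') }}

select
    id                       as user_id,
    trim(FirstName)          as first_name,   -- rename to snake_case, trim whitespace
    trim(LastName)           as last_name,
    cast(created_at as date) as created_date, -- cast strings to proper dates
    nullif(email, '')        as email         -- standardize empty strings to null
from {{ source('raw', 'users') }}
where is_test_account = false                 -- light filtering only: no joins, no aggregates
```

Note that the model renames, casts, cleans, and filters, but never joins or aggregates.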
Intermediate Layer (Ephemeral or Views)
This is where the heavy lifting happens. Intermediate models contain your complex joins, window functions, and business logic. They are the "private functions" of your dbt project: not exposed to end users, but essential for building your final data products.
A common pattern at Futureaiit is to create intermediate models for each major entity in your business. For example:
- `int_users_with_first_order`: Joins users to their first order
- `int_orders_with_line_items`: Aggregates line items to the order level
- `int_subscriptions_with_mrr`: Calculates monthly recurring revenue per subscription
Configure intermediate models as ephemeral when they are only used by one downstream model, or as views when multiple models depend on them. Never materialize intermediate models as tables unless you have a specific performance reason.
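As a sketch, an intermediate model with a single downstream consumer could be configured as ephemeral like this (the upstream model names are illustrative):

```sql
-- models/intermediate/int_users_with_first_order.sql
-- Ephemeral: compiled into a CTE inside its single downstream model,
-- so nothing is materialized in the warehouse.
{{ config(materialized='ephemeral') }}

select
    u.user_id,
    min(o.ordered_at) as first_order_at   -- each user's earliest order
from {{ ref('stg_users') }} u
left join {{ ref('stg_orders') }} o
    on o.user_id = u.user_id
group by u.user_id
```

If a second model later needs this logic, switching `materialized='ephemeral'` to `'view'` is a one-line change.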
Marts Layer (Tables)
Marts are your final, polished data products. These are the tables that your BI tool queries, that your data scientists analyze, and that your executives trust. They should be denormalized, well documented, and optimized for query performance.
Organize marts by business domain:
- `marts/finance/`: Revenue, expenses, cash flow
- `marts/marketing/`: Campaign performance, attribution, funnel metrics
- `marts/product/`: User engagement, feature adoption, retention
Always materialize marts as tables (or incremental models for large datasets). These are queried frequently, and the compute cost of materializing them is far less than the cost of running complex transformations on every query.
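The layer-level materialization defaults can be set once in `dbt_project.yml` rather than per model (the project name and folder layout here are illustrative):

```yaml
# dbt_project.yml (excerpt) -- default materializations per layer.
models:
  my_project:
    staging:
      +materialized: view
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table
```

Individual models can still override these defaults with a `{{ config(...) }}` block when there is a specific reason.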
2. Testing: Your Safety Net
If a model is not tested, assume it is broken. This is not hyperbole. In our experience at Futureaiit, untested models fail silently, producing incorrect results that propagate through your entire analytics stack.
The Minimum Viable Test Suite
At an absolute minimum, every model must have these tests on its primary key:
- `unique`: Ensures no duplicate records
- `not_null`: Ensures the key is always present
This catches the most common data quality issues: accidental cartesian joins (which create duplicates) and missing join keys (which create nulls).
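In a `schema.yml` file, this minimum viable suite looks like the following (model, column, and file names are illustrative):

```yaml
# models/marts/finance/schema.yml (excerpt)
version: 2

models:
  - name: fct_orders
    columns:
      - name: order_id
        description: Primary key -- one row per order.
        tests:
          - unique    # catches accidental cartesian joins
          - not_null  # catches missing join keys
```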
Advanced Testing with dbt_expectations
The built-in dbt tests are a good start, but real-world data quality requires more sophisticated checks. The `dbt_expectations` package provides dozens of additional tests modeled after the Great Expectations library.
Examples we use at Futureaiit:
- `expect_column_values_to_be_between`: Ensure percentages are between 0 and 100
- `expect_column_values_to_match_regex`: Validate email formats or SKU patterns
- `expect_table_row_count_to_be_between`: Alert if daily records drop suspiciously
- `expect_column_pair_values_to_be_equal`: Ensure calculated totals match source totals
These tests catch subtle data quality issues that would otherwise go unnoticed until a stakeholder questions a dashboard.
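A sketch of how these tests are declared in `schema.yml` (assuming `dbt_expectations` is installed via `packages.yml`; model and column names, thresholds, and the regex are all illustrative):

```yaml
# schema.yml excerpt using dbt_expectations tests.
models:
  - name: fct_daily_signups
    tests:
      - dbt_expectations.expect_table_row_count_to_be_between:
          min_value: 100            # alert if daily volume drops suspiciously
    columns:
      - name: discount_pct
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 100
      - name: email
        tests:
          - dbt_expectations.expect_column_values_to_match_regex:
              regex: ".+@.+"        # simplistic check; real patterns are stricter
```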
Custom Tests for Business Logic
Some business rules are too specific for generic tests. For example, "revenue should never decrease month over month for active subscriptions" or "user cohorts should sum to total users." Write custom singular tests for these cases.
Store custom tests in the `tests/` directory as SQL files that return failing rows. If the query returns zero rows, the test passes.
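For example, the "cohorts should sum to total users" rule from above could be expressed as a singular test like this sketch (model and column names are illustrative):

```sql
-- tests/assert_cohorts_sum_to_total_users.sql
-- Singular test: returns rows only when cohort counts disagree with the
-- overall user count. Zero rows returned means the test passes.
with cohort_total as (
    select sum(user_count) as total_from_cohorts
    from {{ ref('fct_user_cohorts') }}
),

overall as (
    select count(*) as total_users
    from {{ ref('dim_users') }}
)

select *
from cohort_total
cross join overall
where total_from_cohorts != total_users
```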
3. CI/CD for Analytics Engineering
Do not merge code blindly. Set up continuous integration to run dbt on every pull request, catching errors before they reach production.
Slim CI: Test Only What Changed
The naive approach is to run `dbt build` on every PR, which rebuilds your entire project. This is slow and expensive. Instead, use dbt's state comparison feature to run only the models you changed and their downstream dependencies.
In your GitHub Actions workflow:
```shell
dbt run --select state:modified+
dbt test --select state:modified+
```
This "Slim CI" approach runs in minutes instead of hours, making it practical to test every commit.
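A minimal workflow sketch, assuming a production `manifest.json` can be fetched for state comparison (the adapter, the `download_prod_manifest.sh` helper, and all paths are hypothetical placeholders for however your team stores production artifacts):

```yaml
# .github/workflows/slim_ci.yml (sketch)
name: slim-ci
on: pull_request

jobs:
  dbt-ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install dbt-snowflake   # swap in your warehouse adapter
      - name: Fetch production state
        run: ./scripts/download_prod_manifest.sh prod-artifacts/  # hypothetical helper
      - name: Build only changed models and their children
        run: |
          dbt run  --select state:modified+ --defer --state prod-artifacts
          dbt test --select state:modified+ --defer --state prod-artifacts
```

The `--defer` flag lets unchanged upstream models resolve against production relations instead of being rebuilt in CI.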
Production Deployment
When code merges to main, run a full `dbt build` in production. Schedule this to run during off-peak hours to minimize warehouse costs and query contention.
For large projects, consider incremental deployment strategies:
- Blue/Green Deployment: Build models in a staging schema, then swap schemas atomically
- Canary Deployment: Deploy to a subset of models first, monitor for errors, then roll out fully
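On Snowflake, the blue/green swap can be performed atomically with `ALTER SCHEMA ... SWAP WITH`; a sketch, with illustrative schema names (other warehouses need a different mechanism, such as view repointing):

```sql
-- 1. Build the full project into a staging schema,
--    e.g. via a dedicated dbt target: dbt build --target staging
-- 2. Swap the freshly built schema with production in one atomic statement,
--    so consumers never see a half-built state.
alter schema analytics swap with analytics_staging;
```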
4. Documentation: Your Future Self Will Thank You
Six months from now, you will not remember why you built that intermediate model or what that cryptic column name means. Document everything.
Model Documentation
Every model should have a description in its schema.yml file explaining:
- What business question it answers
- What grain it is at (one row per what?)
- Any important caveats or known limitations
Column Documentation
Document every column, especially:
- Calculated metrics: How is this computed?
- Categorical fields: What are the possible values?
- Timestamps: What timezone? What does this timestamp represent?
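A `schema.yml` entry covering both model-level and column-level documentation might look like this sketch (all names, values, and descriptions are illustrative):

```yaml
# schema.yml excerpt -- model and column descriptions.
version: 2

models:
  - name: fct_subscription_mrr
    description: >
      Monthly recurring revenue per subscription.
      Grain: one row per subscription per month.
      Caveat: excludes one-time charges and refunds.
    columns:
      - name: mrr_usd
        description: Normalized monthly price in USD (annual plans divided by 12).
      - name: plan_tier
        description: "Subscription tier. Possible values: free, pro, enterprise."
      - name: billed_at
        description: Billing timestamp in UTC, as reported by the payment processor.
```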
Auto Generated Documentation
Use `dbt docs generate` to create a searchable, interactive documentation site. Deploy this to a static hosting service (Netlify, S3, etc.) so your entire team can explore the data lineage and understand how models connect.
5. Performance Optimization
As your dbt project grows, build times will increase. Here is how to keep them manageable.
Incremental Models
For large fact tables, use incremental materialization to process only new data. This is essential for tables with billions of rows.
Key considerations:
- Unique key: Define how to identify and update existing records
- Incremental strategy: Choose between merge, delete+insert, or append
- Full refresh: Schedule periodic full refreshes to catch late-arriving data
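These three considerations come together in a config block and an `is_incremental()` filter; a sketch with illustrative model and column names (the `interval` syntax varies by warehouse):

```sql
-- models/marts/fct_events.sql -- incremental model that processes new rows only.
{{ config(
    materialized='incremental',
    unique_key='event_id',          -- how existing records are identified and updated
    incremental_strategy='merge'    -- alternatives: delete+insert, append
) }}

select
    event_id,
    user_id,
    event_type,
    occurred_at
from {{ ref('stg_events') }}

{% if is_incremental() %}
  -- On incremental runs, only scan rows newer than what is already loaded.
  -- The lookback window helps catch late-arriving data between full refreshes.
  where occurred_at > (select max(occurred_at) - interval '3 days' from {{ this }})
{% endif %}
```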
Partitioning and Clustering
On BigQuery, Snowflake, and Redshift, use partitioning and clustering to reduce query costs and improve performance. Partition by date for time series data, and cluster by frequently filtered columns.
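On BigQuery, for example, partitioning and clustering are declared directly in the model config; a sketch with illustrative column names (Snowflake and Redshift use their own options, such as `cluster_by` and `sort`/`dist` keys):

```sql
-- BigQuery: partition the table by date, cluster by a frequently filtered column.
{{ config(
    materialized='table',
    partition_by={'field': 'event_date', 'data_type': 'date'},
    cluster_by=['customer_id']
) }}

select * from {{ ref('stg_events') }}
```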
Parallel Execution
dbt runs models in parallel based on their dependency graph. Ensure your models have clear dependencies (via
ref() and source()) so dbt can maximize parallelism.
How Futureaiit Can Help
At Futureaiit, we specialize in building production-grade analytics infrastructure. Our team has implemented dbt for companies ranging from early stage startups to Fortune 500 enterprises. We can help you:
- Rescue failing dbt projects: Refactor spaghetti SQL into maintainable, tested models
- Establish best practices: Set up project structure, testing, and CI/CD from day one
- Train your team: Onboard analysts and engineers to dbt best practices
- Optimize performance: Reduce build times and warehouse costs through incremental models and smart materialization
- Integrate with your stack: Connect dbt to your data warehouse, orchestrator, and BI tools
We have built reusable dbt packages for common patterns like SCD Type 2 dimensions, attribution modeling, and cohort analysis. Our clients typically see 10x improvement in data team productivity and 50% reduction in data quality incidents.
Common Pitfalls to Avoid
Based on our experience with over 100 dbt projects, here are the mistakes we see most often:
1. Skipping the Staging Layer
Teams often jump straight to building marts, pulling directly from raw tables. This creates tight coupling to your source schema. When the source changes (and it will), you have to update dozens of downstream models. Staging models provide a stable interface that absorbs these changes.
2. Overusing Ephemeral Models
Ephemeral models are compiled into CTEs in downstream models. This seems efficient, but it can create massive, unreadable SQL and prevent dbt from parallelizing execution. Use views instead for models with multiple downstream dependencies.
3. Ignoring Test Failures
Tests are only useful if you act on failures. Set up alerting (via dbt Cloud, Slack, or email) so test failures are immediately visible. Treat test failures like production incidents: investigate and fix promptly.
4. Not Using Sources
Always define your raw tables in `sources.yml` and reference them with `source('schema', 'table')` instead of hardcoding table names. This enables source freshness checks and makes your lineage graph complete.
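A minimal sources file with freshness checks might look like this sketch (source, schema, table, and column names are illustrative):

```yaml
# models/staging/sources.yml
version: 2

sources:
  - name: raw                      # referenced as source('raw', 'users')
    schema: raw
    loaded_at_field: _loaded_at    # timestamp column used for freshness checks
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: users
      - name: orders
```

Running `dbt source freshness` then alerts you when an upstream pipeline stalls, before stale data reaches your marts.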
5. Monolithic Models
If a model is more than 200 lines of SQL, it is probably doing too much. Break it into smaller intermediate models. This improves readability, testability, and reusability.
Conclusion
dbt is a powerful tool, but like any tool, it requires discipline and best practices to use effectively. The patterns we have outlined—layered architecture, comprehensive testing, CI/CD, thorough documentation, and performance optimization—are not optional niceties. They are the foundation of a scalable, maintainable analytics platform.
At Futureaiit, we have seen firsthand how these practices transform data teams. Analysts spend less time debugging and more time delivering insights. Data quality improves dramatically. New team members onboard in days instead of weeks. And most importantly, stakeholders trust the data.
If your dbt project is becoming unwieldy, or if you are just getting started and want to avoid common pitfalls, we can help. Our team has the battle-tested experience to set you up for long term success.
Ready to level up your analytics engineering? Contact Futureaiit to discuss how we can help you build a world class dbt project.
Futureaiit
AI & Technology Experts