Lakehouse / warehouse build
Snowflake, Databricks, BigQuery, or Redshift — picked by workload fit. With dbt for transformations, Airflow / Dagster for orchestration, and explicit producer / consumer contracts at every boundary.
Platform & Cloud · FOUNDATION + SKYWAY
Lakehouse and warehouse platforms with explicit producer / consumer contracts, ELT pipelines that don’t silently break, data quality monitoring, and lineage that survives an audit. Built so the first AI use case lands on a substrate the next ten can reuse.
The problem
The familiar pattern: a Hadoop-era data lake that became a data swamp; a downstream warehouse with conflicting truth; ETL jobs that fail silently and surface as 'why is the dashboard wrong?' three weeks later; a data catalog that's perpetually 60% accurate; ML teams writing the same enrichment three times because nobody trusts the canonical version; and a procurement spreadsheet for vendor data tools that no longer matches reality. The data isn't broken — the engineering around it is.
FOUNDATION builds data platforms the way other engineering practices build systems: with explicit interface contracts, versioned schemas, documented producer / consumer relationships, continuous quality monitoring, and lineage that survives an auditor's inquiry. The goal is a data substrate AI use cases land on without re-engineering — not another iteration of the lake-vs-warehouse argument.
Where it ships
Specific applications we’ve built and operated. Not speculative — every example below is grounded in a real shipped engagement.
12x pipeline reliability
Replace hand-written ETL with managed extract (Fivetran, Airbyte) plus dbt-driven transformation. Idempotent, retriable, observable; broken jobs page on-call rather than failing silently.
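The two load properties named above can be sketched in a few lines. This is an illustration, not Fivetran or Airbyte code: `upsert` and `run_with_retries` are hypothetical names standing in for what a managed extract tool plus an orchestrator's retry policy give you.

```python
import time

def upsert(target: dict, rows: list, key: str) -> None:
    """Idempotent load: MERGE-style upsert keyed on the primary key.

    Re-running the same batch leaves the target unchanged, so a
    retried or replayed job never duplicates rows.
    """
    for row in rows:
        target[row[key]] = row  # insert or overwrite, never append blindly

def run_with_retries(task, max_attempts: int = 3, backoff_s: float = 1.0):
    """Retry wrapper: transient failures are retried with linear backoff;
    exhausted retries re-raise so the orchestrator pages on-call
    instead of letting the job fail silently."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # surface loudly; silent failure is the anti-pattern
            time.sleep(backoff_s * attempt)
```

Because the load is idempotent, the retry wrapper is safe to apply to it: replaying a batch after a mid-run failure cannot double-count rows.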
Continuous data-quality monitoring (Great Expectations, Soda, Monte Carlo), schema-change detection, freshness SLAs per dataset, and alerting on the indicators that matter to downstream consumers.
Lineage tracked end-to-end (source → transform → consumer) through OpenLineage or a vendor catalog. Governance integrated with the identity layer; access reviewed quarterly per dataset.
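What end-to-end lineage buys you is impact analysis. A minimal sketch, assuming a toy edge list rather than real OpenLineage run events (which carry jobs, runs, and facets, not bare edges); the dataset names are illustrative.

```python
from collections import defaultdict

# Toy lineage graph: each edge runs producer -> consumer, the same
# source -> transform -> consumer chain a lineage backend stores.
EDGES = [
    ("raw.orders", "stg_orders"),
    ("stg_orders", "fct_revenue"),
    ("fct_revenue", "dashboard.weekly_revenue"),
]

def downstream(dataset: str, edges=EDGES) -> set:
    """Everything transitively affected by a change to `dataset` --
    the question an auditor or an impact analysis actually asks."""
    children = defaultdict(set)
    for src, dst in edges:
        children[src].add(dst)
    seen, stack = set(), [dataset]
    while stack:
        for nxt in children[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen
```

The same traversal, run over real catalog metadata, is what turns "a source schema changed" into a concrete list of dashboards and models to check.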
Feature stores, vector stores, embedding pipelines, and labeled dataset management built on the same substrate analytical workloads use. AI workloads inherit the lineage and governance the analytics platform produces.
How we engage
Each phase has a deliverable, an owner, and an acceptance criterion. Not slogans — operating rules.
We don't build data platforms in the abstract. Discovery starts with the use cases — analytical, operational, or AI — that the platform has to serve. Producer / consumer contracts, freshness requirements, and access patterns are inferred from those use cases, not from a generic blueprint.
Lakehouse / warehouse architecture with explicit data contracts at every boundary. Schemas are versioned; producer changes that break consumers are surfaced in CI before deployment. Storage tiers calibrated to access patterns and cost.
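A contract check that fails CI before a breaking producer change deploys can be small. A minimal sketch, assuming schemas are available as column-to-type mappings; the function name and the rule (dropped columns and type changes break, additive columns pass) are illustrative of the approach, not a specific tool.

```python
def breaking_changes(producer: dict, contract: dict) -> list:
    """Compare a producer's current schema against the consumer contract.

    Returns the violations that should fail CI: columns the contract
    requires that were dropped, and columns whose type changed.
    Columns the producer adds beyond the contract are non-breaking.
    """
    problems = []
    for column, expected_type in contract.items():
        if column not in producer:
            problems.append(f"dropped column: {column}")
        elif producer[column] != expected_type:
            problems.append(
                f"type change on {column}: {expected_type} -> {producer[column]}"
            )
    return problems
```

Wired into the producer's CI, a non-empty result blocks the merge, so the consumer finds out in a pull request rather than in a broken dashboard.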
ELT pipelines with idempotent operators, retry semantics, and observability. Data quality checks (completeness, freshness, distribution drift) run continuously and alert on the right channel. Tests live alongside the transformation code.
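The three check families named above (completeness, freshness, distribution drift) reduce to short predicates. A minimal sketch in plain Python; in practice these are expressed as Great Expectations suites or dbt tests, and the thresholds here are illustrative defaults, not recommendations.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean, pstdev

def check_completeness(rows, column, max_null_rate=0.01):
    """Completeness: null rate in `column` stays under the threshold."""
    nulls = sum(1 for r in rows if r.get(column) is None)
    return nulls / max(len(rows), 1) <= max_null_rate

def check_freshness(last_loaded_at, sla=timedelta(hours=1)):
    """Freshness: the dataset was loaded within its SLA window."""
    return datetime.now(timezone.utc) - last_loaded_at <= sla

def check_drift(baseline, current, z_threshold=3.0):
    """Distribution drift: flag when the current mean sits more than
    `z_threshold` baseline standard deviations from the baseline mean."""
    sd = pstdev(baseline) or 1e-9  # guard a constant baseline
    return abs(mean(current) - mean(baseline)) / sd <= z_threshold
```

Run on every load and routed to the producing team's channel, a failing predicate becomes an alert with a named owner instead of a stale dashboard.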
Catalog with lineage that survives an audit, access reviews by dataset and role, and quarterly governance reviews against the regulatory frame the data sits in. We build platforms that compound trust over quarters, not platforms that decay between audits.
Capabilities
Stack
Selected work
Common questions
By workload. Warehouses (Snowflake, BigQuery, Redshift) win for analytical / BI workloads with structured data and well-defined query patterns. Lakehouses (Databricks, Snowflake with Iceberg) win when ML workloads share the substrate, when semi-structured data is dominant, or when storage / compute economics favor decoupled scaling. We assess and recommend honestly — many estates work best as a warehouse with a small lakehouse extension.
Both. Most enterprise workloads are best served by micro-batch (15-minute or hourly) — streaming infrastructure adds operational complexity that the use case rarely justifies. We deploy true streaming (Kafka, Flink, Kinesis) when the freshness SLA genuinely demands it: real-time fraud detection, dispatch routing, certain trading workloads.
Quality checks are code that lives alongside the transformations. Great Expectations or Soda for declarative checks, dbt tests for SQL-native validation, freshness SLAs enforced through monitoring (Monte Carlo, custom). Broken data alerts the team that owns the producer, not the team that consumes the bad data.
Lineage tracked through OpenLineage emitters in dbt and Airflow / Dagster, surfaced in DataHub, Atlan, or Collibra. Governance integrated with identity (Snowflake row-level security, Databricks Unity Catalog, BigQuery IAM). Access reviews quarterly per dataset and role. Audit-defensible by design.
Yes — Teradata, Netezza, on-prem Hadoop, and aging Redshift are common starting points. We design the migration as a strangler-fig: producer migration first, then transformation migration, then consumer cutover, with dual-running windows on critical reports. Most migrations land in 9–18 months with no analytical-reporting downtime.
Lakehouse / warehouse build with first use cases: 6–9 months, $400K–$1.5M. Pipeline modernization at scale: 4–8 months, $300K–$1M. Data platform migration off legacy: 9–18 months, $1M–$4M. Managed Services for ongoing operations: $30K–$120K monthly retainer. Brackets published honestly so visitors self-qualify before the first call.
Within Platform & Cloud
Talk to us
A senior engineer plus the FOUNDATION + SKYWAY department lead joins the first call. No discovery gauntlet, no junior reps.