Examples

Reference examples showing the range of difficulty, domains, and use cases this dataset covers. Each domain includes real-world business scenarios with full KPI definitions, chain of thought, and SQL. Use these as a guide when writing your own submission.

Retail 

Category Revenue YoY Comparison

Difficulty: Medium | Domain: Retail | Tables: fact_sales, dim_product, dim_date

Business Question

Which product categories generated the most revenue last quarter, and how does that compare to the same quarter last year?

Business Context

Merchandising leadership preparing for annual planning meeting. Need to identify growing vs declining categories YoY to guide inventory investment, promotional spend, and category expansion decisions.

KPIs

KPI	Formula
Category Revenue (Current Quarter)	`SUM(net_sales) WHERE year=2024 AND quarter=4, grouped by category`
Category Revenue (Prior Year Quarter)	`SUM(net_sales) WHERE year=2023 AND quarter=4, grouped by category`
YoY Revenue Change (Absolute)	`revenue_q4_2024 - revenue_q4_2023`
YoY Revenue Change (%)	`ROUND(100.0 * delta / NULLIF(prior_year, 0), 2)`

Key Technique

Conditional aggregation with CASE WHEN pivots both years into one row per category. NULLIF guards division-by-zero for new categories with no prior year sales.

Raw JSON Submission

{
  "q_id": 1,
  "difficulty": "Medium",
  "db_type": "Relational (SQL)",
  "domain": "Retail",
  "instruction": "Which product categories generated the most revenue last quarter, and how does that compare to the same quarter last year?",
  "context": "Merchandising leadership preparing for annual planning meeting. Need to identify growing vs declining categories YoY to guide inventory investment, promotional spend, and category expansion decisions.",
  "metrics_and_aggregation": [
    {"kpi_metric_name": "Category Revenue (Current Quarter)", "aggregation_formula": "SUM(net_sales) WHERE year=2024 AND quarter=4, grouped by category"},
    {"kpi_metric_name": "Category Revenue (Prior Year Quarter)", "aggregation_formula": "SUM(net_sales) WHERE year=2023 AND quarter=4, grouped by category"},
    {"kpi_metric_name": "YoY Revenue Change (Absolute)", "aggregation_formula": "revenue_q4_2024 - revenue_q4_2023"},
    {"kpi_metric_name": "YoY Revenue Change (%)", "aggregation_formula": "ROUND(100.0 * delta / NULLIF(prior_year, 0), 2)"}
  ],
  "chain_of_thought": [
    "Step 1: Rank product categories by net revenue for Q4 2024 and compare against Q4 2023.",
    "Step 2: Need fact_sales, dim_product, dim_date.",
    "Step 3: Filter Q4 only, years 2023 and 2024. Exclude returns (is_return = FALSE).",
    "Step 4: Use conditional aggregation CASE WHEN to pivot both years into same row per category.",
    "Step 5: Order by Q4 2024 revenue descending to surface top performers first.",
    "Step 6: NULLIF guards division-by-zero for new categories with no prior year sales.",
    "Step 7: Only completed transactions counted. Returns excluded at filter level."
  ],
  "schema_tables": {
    "fact_tables": ["fact_sales"],
    "dimension_tables": ["dim_product", "dim_date"]
  },
  "data_model_layers": {
    "hierarchies": "Date > Quarter > Year, Product > Category",
    "aggregations": "agg_quarterly_revenue_by_category",
    "snapshots": ""
  },
  "sql": "SELECT p.category, SUM(CASE WHEN d.year = 2024 AND d.quarter = 4 THEN s.net_sales ELSE 0 END) AS revenue_q4_2024, SUM(CASE WHEN d.year = 2023 AND d.quarter = 4 THEN s.net_sales ELSE 0 END) AS revenue_q4_2023, SUM(CASE WHEN d.year = 2024 AND d.quarter = 4 THEN s.net_sales ELSE 0 END) - SUM(CASE WHEN d.year = 2023 AND d.quarter = 4 THEN s.net_sales ELSE 0 END) AS revenue_change, ROUND(100.0 * (SUM(CASE WHEN d.year = 2024 AND d.quarter = 4 THEN s.net_sales ELSE 0 END) - SUM(CASE WHEN d.year = 2023 AND d.quarter = 4 THEN s.net_sales ELSE 0 END)) / NULLIF(SUM(CASE WHEN d.year = 2023 AND d.quarter = 4 THEN s.net_sales ELSE 0 END), 0), 2) AS yoy_change_pct FROM fact_sales s JOIN dim_product p ON s.product_id = p.product_id JOIN dim_date d ON s.date_id = d.date_id WHERE s.is_return = FALSE AND d.year IN (2023, 2024) AND d.quarter = 4 GROUP BY p.category ORDER BY revenue_q4_2024 DESC;"
}

Slow Moving Inventory Stock Value

Difficulty: Medium | Domain: Retail | Tables: fact_sales, dim_product, dim_date

Business Question

Which products have been sitting in inventory for more than 60 days without a sale, and what is their estimated stock value?

Business Context

Inventory management team runs monthly review to identify stagnant products. Zero sales in 60+ days ties up capital, consumes warehouse space, risks obsolescence. Report drives decisions on markdowns, clearance promotions, or supplier returns.

KPIs

KPI	Formula
Days Since Last Sale	`DATEDIFF(CURRENT_DATE, MAX(last_sale_date))` per product_id
Units Sold Last 90 Days (Stock Proxy)	`SUM(quantity)` last 90 days, non-returns, per product_id
Estimated Stagnant Stock Value	`ROUND(unit_cost * COALESCE(units_sold_last_90d, 0), 2)`

Key Technique

Two LEFT JOIN subqueries — one for all-time last sale date, one for 90-day volume. HAVING filters to products with no sale ever OR >60 days stagnant. COALESCE handles zero-sales products cleanly.

Raw JSON Submission

{
  "q_id": 2,
  "difficulty": "Medium",
  "db_type": "Relational (SQL)",
  "domain": "Retail",
  "instruction": "Which products have been sitting in inventory for more than 60 days without a sale, and what is their estimated stock value?",
  "context": "Inventory management team runs monthly review to identify stagnant products. Zero sales in 60+ days ties up capital, consumes warehouse space, risks obsolescence.",
  "metrics_and_aggregation": [
    {"kpi_metric_name": "Days Since Last Sale", "aggregation_formula": "DATEDIFF(CURRENT_DATE, MAX(last_sale_date)) per product_id"},
    {"kpi_metric_name": "Units Sold Last 90 Days (Stock Proxy)", "aggregation_formula": "SUM(quantity) WHERE full_date >= CURRENT_DATE - 90 days AND is_return = FALSE, grouped by product_id"},
    {"kpi_metric_name": "Estimated Stagnant Stock Value", "aggregation_formula": "ROUND(unit_cost * COALESCE(units_sold_last_90d, 0), 2) per product"}
  ],
  "chain_of_thought": [
    "Step 1: Surface all active products with no sales in 60+ days and estimate tied-up capital value.",
    "Step 2: Need dim_product, fact_sales, dim_date. No inventory table — use 90-day sales as stock proxy.",
    "Step 3: Filter non-returns only. Two date windows needed.",
    "Step 4: LEFT JOINs include zero-history products. HAVING filters last_sale_date IS NULL OR days > 60.",
    "Step 5: Order by estimated_stagnant_value DESC — highest capital-at-risk first."
  ],
  "schema_tables": {
    "fact_tables": ["fact_sales"],
    "dimension_tables": ["dim_product", "dim_date"]
  },
  "data_model_layers": {
    "hierarchies": "Product > Sub-Category > Category > Brand",
    "aggregations": "agg_slow_moving_inventory_monthly",
    "snapshots": "snap_inventory_stagnant_60d"
  },
  "sql": "SELECT p.product_id, p.product_name, p.category, p.unit_cost, COALESCE(recent.units_sold_last_90d, 0) AS units_sold_last_90d, MAX(last_sale.last_sale_date) AS last_sale_date, DATEDIFF(CURRENT_DATE, MAX(last_sale.last_sale_date)) AS days_since_last_sale, ROUND(p.unit_cost * COALESCE(recent.units_sold_last_90d, 0), 2) AS estimated_stagnant_value FROM dim_product p LEFT JOIN (SELECT s.product_id, MAX(d.full_date) AS last_sale_date FROM fact_sales s JOIN dim_date d ON s.date_id = d.date_id WHERE s.is_return = FALSE GROUP BY s.product_id) last_sale ON p.product_id = last_sale.product_id LEFT JOIN (SELECT s.product_id, SUM(s.quantity) AS units_sold_last_90d FROM fact_sales s JOIN dim_date d ON s.date_id = d.date_id WHERE s.is_return = FALSE AND d.full_date >= CURRENT_DATE - INTERVAL '90 days' GROUP BY s.product_id) recent ON p.product_id = recent.product_id GROUP BY p.product_id, p.product_name, p.category, p.unit_cost, recent.units_sold_last_90d HAVING MAX(last_sale.last_sale_date) IS NULL OR DATEDIFF(CURRENT_DATE, MAX(last_sale.last_sale_date)) > 60 ORDER BY estimated_stagnant_value DESC;"
}

Healthcare 

Hospital 30-Day Readmission Rate by Department

Difficulty: Medium | Domain: Healthcare | Tables: fact_encounters, dim_patient, dim_department, dim_date

Business Question

Which hospital departments have the highest 30-day patient readmission rates, and how does this compare to the national benchmark of 15%?

Business Context

Quality assurance team running quarterly clinical outcomes review. High readmission rates indicate gaps in discharge planning, post-care follow-up, or treatment effectiveness. Departments above benchmark are flagged for intervention.

KPIs

KPI	Formula
Total Discharges per Department	`COUNT(DISTINCT encounter_id) WHERE encounter_type = 'discharge'`
30-Day Readmissions	`COUNT(DISTINCT patient_id)` readmitted within 30 days of discharge
Readmission Rate (%)	`ROUND(100.0 * readmissions / total_discharges, 2)` per department
Variance from Benchmark	`readmission_rate_pct - 15.0`

Key Technique

Self-join on fact_encounters to detect readmissions within 30-day window. DATEDIFF between discharge date and next admission date per patient. National benchmark (15.0) passed as literal constant for variance calc.

Raw JSON Submission

{
  "q_id": 3,
  "difficulty": "Medium",
  "db_type": "PostgreSQL",
  "domain": "Healthcare",
  "instruction": "Which hospital departments have the highest 30-day patient readmission rates, and how does this compare to the national benchmark of 15%?",
  "context": "Quality assurance team running quarterly clinical outcomes review. High readmission rates indicate gaps in discharge planning. Departments above benchmark are flagged for intervention.",
  "metrics_and_aggregation": [
    {"kpi_metric_name": "Total Discharges per Department", "aggregation_formula": "COUNT(DISTINCT encounter_id) WHERE encounter_type = 'discharge', grouped by department_id"},
    {"kpi_metric_name": "30-Day Readmissions", "aggregation_formula": "COUNT(DISTINCT patient_id) WHERE next_admission_date <= discharge_date + 30 days"},
    {"kpi_metric_name": "Readmission Rate (%)", "aggregation_formula": "ROUND(100.0 * readmissions / total_discharges, 2) per department"},
    {"kpi_metric_name": "Variance from Benchmark", "aggregation_formula": "readmission_rate_pct - 15.0"}
  ],
  "chain_of_thought": [
    "Step 1: Find departments with 30-day readmission rate above 15% national benchmark.",
    "Step 2: Need fact_encounters, dim_department, dim_patient, dim_date.",
    "Step 3: Self-join fact_encounters on patient_id to find next admission after each discharge.",
    "Step 4: DATEDIFF between discharge_date and next admission_date — flag if <= 30 days.",
    "Step 5: Aggregate by department. Compute rate. Subtract 15.0 for benchmark variance.",
    "Step 6: Order by readmission_rate_pct DESC — worst performers first."
  ],
  "schema_tables": {
    "fact_tables": ["fact_encounters"],
    "dimension_tables": ["dim_patient", "dim_department", "dim_date"]
  },
  "data_model_layers": {
    "hierarchies": "Department > Division > Hospital, Date > Month > Quarter",
    "aggregations": "agg_readmission_rate_quarterly",
    "snapshots": "snap_department_outcomes_monthly"
  },
  "sql": "WITH discharges AS (SELECT e.encounter_id, e.patient_id, e.department_id, d.full_date AS discharge_date FROM fact_encounters e JOIN dim_date d ON e.date_id = d.date_id WHERE e.encounter_type = 'discharge'), readmits AS (SELECT e.patient_id, MIN(d.full_date) AS next_admission_date FROM fact_encounters e JOIN dim_date d ON e.date_id = d.date_id WHERE e.encounter_type = 'admission' GROUP BY e.patient_id), dept_rates AS (SELECT dis.department_id, COUNT(DISTINCT dis.encounter_id) AS total_discharges, COUNT(DISTINCT CASE WHEN DATEDIFF(ra.next_admission_date, dis.discharge_date) <= 30 THEN dis.patient_id END) AS readmissions FROM discharges dis LEFT JOIN readmits ra ON dis.patient_id = ra.patient_id AND ra.next_admission_date > dis.discharge_date GROUP BY dis.department_id) SELECT dp.department_name, dp.division, dr.total_discharges, dr.readmissions, ROUND(100.0 * dr.readmissions / NULLIF(dr.total_discharges, 0), 2) AS readmission_rate_pct, ROUND(100.0 * dr.readmissions / NULLIF(dr.total_discharges, 0), 2) - 15.0 AS variance_from_benchmark FROM dept_rates dr JOIN dim_department dp ON dr.department_id = dp.department_id ORDER BY readmission_rate_pct DESC;"
}

Insurance Claims Denial Rate by Diagnosis Code

Difficulty: Hard | Domain: Healthcare | Tables: fact_claims, dim_patient, dim_provider, dim_date

Business Question

Which diagnosis codes have the highest insurance claim denial rates, and which providers are submitting the most denied claims?

Business Context

Revenue cycle management team investigating claim denial patterns. High denial rates on specific diagnosis codes suggest documentation gaps, coding errors, or payer-specific policy mismatches.

KPIs

KPI	Formula
Total Claims per Diagnosis Code	`COUNT(claim_id)` grouped by diagnosis_code
Denied Claims	`COUNT(claim_id) WHERE claim_status = 'denied'`
Denial Rate (%)	`ROUND(100.0 * denied / total, 2)` per diagnosis_code
Avg Denied Claim Value	`ROUND(AVG(claim_amount) WHERE status = 'denied', 2)`
Top Denying Provider Count	`COUNT(claim_id) WHERE status = 'denied'` grouped by provider_id

Key Technique

Two-level aggregation — diagnosis code level first, then provider drill-down. RANK() window function to surface top 5 providers per diagnosis code. HAVING denial_rate > 20 filters to actionable codes only.

Raw JSON Submission

{
  "q_id": 4,
  "difficulty": "Hard",
  "db_type": "Snowflake",
  "domain": "Healthcare",
  "instruction": "Which diagnosis codes have the highest insurance claim denial rates, and which providers are submitting the most denied claims?",
  "context": "Revenue cycle management team investigating denial patterns to target documentation improvement and provider training.",
  "metrics_and_aggregation": [
    {"kpi_metric_name": "Total Claims per Diagnosis Code", "aggregation_formula": "COUNT(claim_id) grouped by diagnosis_code"},
    {"kpi_metric_name": "Denied Claims", "aggregation_formula": "COUNT(claim_id) WHERE claim_status = 'denied'"},
    {"kpi_metric_name": "Denial Rate (%)", "aggregation_formula": "ROUND(100.0 * denied_claims / total_claims, 2)"},
    {"kpi_metric_name": "Avg Denied Claim Value", "aggregation_formula": "ROUND(AVG(claim_amount) WHERE claim_status = 'denied', 2)"},
    {"kpi_metric_name": "Top Denying Provider Count", "aggregation_formula": "COUNT(claim_id) WHERE claim_status = 'denied', grouped by provider_id, RANK() per diagnosis_code"}
  ],
  "chain_of_thought": [
    "Step 1: Find diagnosis codes with >20% denial rate and surface top providers driving denials.",
    "Step 2: Need fact_claims, dim_provider, dim_patient, dim_date.",
    "Step 3: CTE1 — denial rate per diagnosis_code. Filter HAVING denial_rate > 20%.",
    "Step 4: CTE2 — provider-level denied claim counts for those diagnosis codes.",
    "Step 5: RANK() OVER (PARTITION BY diagnosis_code ORDER BY denied_count DESC) — top 5 providers.",
    "Step 6: Join provider dim for name and specialty. Order by denial_rate DESC."
  ],
  "schema_tables": {
    "fact_tables": ["fact_claims"],
    "dimension_tables": ["dim_patient", "dim_provider", "dim_date"]
  },
  "data_model_layers": {
    "hierarchies": "Provider > Specialty > Division, Date > Month > Year",
    "aggregations": "agg_denial_rate_by_diagnosis, agg_provider_denial_count",
    "snapshots": "snap_claims_denial_monthly"
  },
  "sql": "WITH diagnosis_denial AS (SELECT diagnosis_code, COUNT(claim_id) AS total_claims, COUNT(CASE WHEN claim_status = 'denied' THEN 1 END) AS denied_claims, ROUND(100.0 * COUNT(CASE WHEN claim_status = 'denied' THEN 1 END) / NULLIF(COUNT(claim_id), 0), 2) AS denial_rate_pct, ROUND(AVG(CASE WHEN claim_status = 'denied' THEN claim_amount END), 2) AS avg_denied_value FROM fact_claims GROUP BY diagnosis_code HAVING ROUND(100.0 * COUNT(CASE WHEN claim_status = 'denied' THEN 1 END) / NULLIF(COUNT(claim_id), 0), 2) > 20), provider_denials AS (SELECT fc.diagnosis_code, fc.provider_id, COUNT(fc.claim_id) AS provider_denied_count, RANK() OVER (PARTITION BY fc.diagnosis_code ORDER BY COUNT(fc.claim_id) DESC) AS denial_rank FROM fact_claims fc WHERE fc.claim_status = 'denied' AND fc.diagnosis_code IN (SELECT diagnosis_code FROM diagnosis_denial) GROUP BY fc.diagnosis_code, fc.provider_id) SELECT dd.diagnosis_code, dd.total_claims, dd.denied_claims, dd.denial_rate_pct, dd.avg_denied_value, p.provider_name, p.specialty, pd.provider_denied_count FROM diagnosis_denial dd JOIN provider_denials pd ON dd.diagnosis_code = pd.diagnosis_code AND pd.denial_rank <= 5 JOIN dim_provider p ON pd.provider_id = p.provider_id ORDER BY dd.denial_rate_pct DESC, pd.denial_rank;"
}

HighTech (SaaS)

Monthly Churn Rate by Subscription Plan

Difficulty: Medium | Domain: HighTech (SaaS) | Tables: fact_subscriptions, dim_customer, dim_plan, dim_date

Business Question

What is the monthly churn rate for each subscription plan tier over the past 6 months, and which plan has the worst retention?

Business Context

Product and growth team tracking retention health across Free, Pro, and Enterprise tiers. High churn in a specific tier signals pricing, onboarding, or feature-fit issues. Monthly tracking enables early intervention before churn compounds.

KPIs

KPI	Formula
Churned Customers per Month	`COUNT(customer_id) WHERE status = 'churned'` grouped by plan_tier, month
Active Customers Start of Month	`COUNT(customer_id) WHERE status = 'active'` at month start, per plan_tier
Monthly Churn Rate (%)	`ROUND(100.0 * churned / active_start, 2)` per plan_tier, per month
Avg Days to Churn	`AVG(DATEDIFF(churn_date, subscription_start_date))` per plan_tier

Key Technique

LAG() window function to get prior-month active count as denominator. Filter subscription_end_date within each calendar month to identify churns. Avg Days to Churn reveals whether churn is early (onboarding) or late (value).

Raw JSON Submission

{
  "q_id": 5,
  "difficulty": "Medium",
  "db_type": "BigQuery",
  "domain": "HighTech (SaaS)",
  "instruction": "What is the monthly churn rate for each subscription plan tier over the past 6 months, and which plan has the worst retention?",
  "context": "Product and growth team tracking retention health. High churn in a specific tier signals pricing or onboarding issues.",
  "metrics_and_aggregation": [
    {"kpi_metric_name": "Churned Customers per Month", "aggregation_formula": "COUNT(customer_id) WHERE subscription_status = 'churned' AND churn_date within month, grouped by plan_tier, month"},
    {"kpi_metric_name": "Active Customers Start of Month", "aggregation_formula": "COUNT(customer_id) WHERE status = 'active' at first day of month, per plan_tier"},
    {"kpi_metric_name": "Monthly Churn Rate (%)", "aggregation_formula": "ROUND(100.0 * churned_count / NULLIF(active_start_count, 0), 2)"},
    {"kpi_metric_name": "Avg Days to Churn", "aggregation_formula": "AVG(DATEDIFF(churn_date, subscription_start_date)) per plan_tier"}
  ],
  "chain_of_thought": [
    "Step 1: Compute monthly churn rate per plan tier for last 6 months.",
    "Step 2: Need fact_subscriptions, dim_customer, dim_plan, dim_date.",
    "Step 3: CTE1 — active customers at start of each month per plan. CTE2 — churned per month per plan.",
    "Step 4: Join CTE1 and CTE2 on plan_tier + month. Compute churn rate.",
    "Step 5: Include avg_days_to_churn to distinguish early vs late churn patterns.",
    "Step 6: Order by month ASC, churn_rate_pct DESC."
  ],
  "schema_tables": {
    "fact_tables": ["fact_subscriptions"],
    "dimension_tables": ["dim_customer", "dim_plan", "dim_date"]
  },
  "data_model_layers": {
    "hierarchies": "Plan > Tier > Product, Date > Month > Quarter",
    "aggregations": "agg_monthly_churn_by_plan",
    "snapshots": "snap_subscription_health_monthly"
  },
  "sql": "WITH monthly_active AS (SELECT dp.plan_tier, FORMAT_DATE('%Y-%m', DATE_TRUNC(d.full_date, MONTH)) AS month, COUNT(DISTINCT fs.customer_id) AS active_start_count FROM fact_subscriptions fs JOIN dim_plan dp ON fs.plan_id = dp.plan_id JOIN dim_date d ON fs.date_id = d.date_id WHERE fs.subscription_status = 'active' AND d.full_date = DATE_TRUNC(d.full_date, MONTH) AND d.full_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH) GROUP BY dp.plan_tier, month), monthly_churn AS (SELECT dp.plan_tier, FORMAT_DATE('%Y-%m', fs.churn_date) AS month, COUNT(DISTINCT fs.customer_id) AS churned_count, ROUND(AVG(DATE_DIFF(fs.churn_date, fs.subscription_start_date, DAY)), 1) AS avg_days_to_churn FROM fact_subscriptions fs JOIN dim_plan dp ON fs.plan_id = dp.plan_id WHERE fs.subscription_status = 'churned' AND fs.churn_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH) GROUP BY dp.plan_tier, month) SELECT ma.plan_tier, ma.month, ma.active_start_count, COALESCE(mc.churned_count, 0) AS churned_count, ROUND(100.0 * COALESCE(mc.churned_count, 0) / NULLIF(ma.active_start_count, 0), 2) AS churn_rate_pct, COALESCE(mc.avg_days_to_churn, 0) AS avg_days_to_churn FROM monthly_active ma LEFT JOIN monthly_churn mc ON ma.plan_tier = mc.plan_tier AND ma.month = mc.month ORDER BY ma.month, churn_rate_pct DESC;"
}

Feature Adoption Funnel by Customer Segment

Difficulty: Hard | Domain: HighTech (SaaS) | Tables: fact_feature_usage, dim_customer, dim_feature, dim_date

Business Question

What percentage of customers in each segment have adopted our core features, and where in the adoption funnel are they dropping off?

Business Context

Product team running quarterly feature adoption review. Low adoption of core features correlates strongly with churn. Segment-level funnel reveals whether onboarding, UX, or feature-market fit is the root cause.

KPIs

KPI	Formula
Customers Who Tried Feature	`COUNT(DISTINCT customer_id)` per feature, per segment
Adoption Rate (%)	`ROUND(100.0 * tried / total_segment_customers, 2)`
Repeat Usage Rate (%)	`COUNT(DISTINCT customer_id WHERE usage_count >= 3) / tried`
Funnel Drop-off Rate (%)	`100 - repeat_usage_rate_pct`

Key Technique

3-CTE pattern. CTE1 total customers per segment baseline. CTE2 feature trial counts. CTE3 repeat users (usage_count >= 3 threshold). Final join computes full funnel. RANK() surfaces features with worst drop-off per segment.

Raw JSON Submission

{
  "q_id": 6,
  "difficulty": "Hard",
  "db_type": "Snowflake",
  "domain": "HighTech (SaaS)",
  "instruction": "What percentage of customers in each segment have adopted our core features, and where in the adoption funnel are they dropping off?",
  "context": "Product team running quarterly feature adoption review. Low adoption of core features correlates with churn.",
  "metrics_and_aggregation": [
    {"kpi_metric_name": "Customers Who Tried Feature", "aggregation_formula": "COUNT(DISTINCT customer_id) per feature_id, per customer_segment"},
    {"kpi_metric_name": "Adoption Rate (%)", "aggregation_formula": "ROUND(100.0 * tried_count / total_segment_customers, 2)"},
    {"kpi_metric_name": "Repeat Usage Rate (%)", "aggregation_formula": "ROUND(100.0 * COUNT(DISTINCT customer_id WHERE usage_count >= 3) / tried_count, 2)"},
    {"kpi_metric_name": "Funnel Drop-off Rate (%)", "aggregation_formula": "100.0 - repeat_usage_rate_pct"}
  ],
  "chain_of_thought": [
    "Step 1: Build feature adoption funnel per customer segment.",
    "Step 2: Need fact_feature_usage, dim_customer, dim_feature, dim_date.",
    "Step 3: CTE1 — total active customers per segment (baseline denominator).",
    "Step 4: CTE2 — distinct customers who used each feature at least once.",
    "Step 5: CTE3 — customers with usage_count >= 3 (repeat/sticky usage threshold).",
    "Step 6: RANK() per segment by drop-off_rate DESC — worst funnel leaks first."
  ],
  "schema_tables": {
    "fact_tables": ["fact_feature_usage"],
    "dimension_tables": ["dim_customer", "dim_feature", "dim_date"]
  },
  "data_model_layers": {
    "hierarchies": "Feature > Feature_Group > Product, Customer > Segment > Region",
    "aggregations": "agg_feature_adoption_by_segment, agg_repeat_usage_rate",
    "snapshots": "snap_feature_funnel_quarterly"
  },
  "sql": "WITH segment_baseline AS (SELECT c.customer_segment, COUNT(DISTINCT c.customer_id) AS total_customers FROM dim_customer c WHERE c.customer_status = 'active' GROUP BY c.customer_segment), feature_trials AS (SELECT fu.feature_id, c.customer_segment, COUNT(DISTINCT fu.customer_id) AS tried_count FROM fact_feature_usage fu JOIN dim_customer c ON fu.customer_id = c.customer_id GROUP BY fu.feature_id, c.customer_segment), repeat_users AS (SELECT fu.feature_id, c.customer_segment, COUNT(DISTINCT fu.customer_id) AS repeat_count FROM fact_feature_usage fu JOIN dim_customer c ON fu.customer_id = c.customer_id GROUP BY fu.feature_id, c.customer_segment HAVING COUNT(fu.usage_event_id) >= 3) SELECT df.feature_name, ft.customer_segment, sb.total_customers, ft.tried_count, ROUND(100.0 * ft.tried_count / NULLIF(sb.total_customers, 0), 2) AS adoption_rate_pct, COALESCE(ru.repeat_count, 0) AS repeat_count, ROUND(100.0 * COALESCE(ru.repeat_count, 0) / NULLIF(ft.tried_count, 0), 2) AS repeat_usage_rate_pct, ROUND(100.0 - ROUND(100.0 * COALESCE(ru.repeat_count, 0) / NULLIF(ft.tried_count, 0), 2), 2) AS dropoff_rate_pct FROM feature_trials ft JOIN segment_baseline sb ON ft.customer_segment = sb.customer_segment JOIN dim_feature df ON ft.feature_id = df.feature_id LEFT JOIN repeat_users ru ON ft.feature_id = ru.feature_id AND ft.customer_segment = ru.customer_segment ORDER BY ft.customer_segment, dropoff_rate_pct DESC;"
}

Finance 

Portfolio Performance vs Benchmark

Difficulty: Hard | Domain: Finance | Tables: fact_transactions, dim_asset, dim_portfolio, dim_date

Business Question

How has each investment portfolio performed against its benchmark index over the past 12 months, and which asset classes are driving over- or under-performance?

Business Context

Investment operations team running quarterly portfolio review. Portfolios consistently underperforming their benchmark trigger rebalancing decisions. Asset-class breakdown identifies whether under-performance is concentrated or broad.

KPIs

KPI	Formula
Portfolio Return (%)	`ROUND(100.0 * (end_value - start_value) / NULLIF(start_value, 0), 2)`
Benchmark Return (%)	From `dim_portfolio.benchmark_return_12m`
Alpha (Excess Return)	`portfolio_return_pct - benchmark_return_pct`
Asset Class Contribution	`SUM(asset_return * weight)` per asset_class within portfolio

Key Technique

Start/end value via FIRST_VALUE and LAST_VALUE window functions partitioned by portfolio_id ordered by date. Alpha = portfolio return minus benchmark constant from dim table. Asset class contribution uses weighted return aggregation.

Raw JSON Submission

{
  "q_id": 7,
  "difficulty": "Hard",
  "db_type": "Redshift",
  "domain": "Finance",
  "instruction": "How has each investment portfolio performed against its benchmark index over the past 12 months, and which asset classes are driving over- or under-performance?",
  "context": "Investment operations team running quarterly portfolio review. Portfolios underperforming benchmark trigger rebalancing decisions.",
  "metrics_and_aggregation": [
    {"kpi_metric_name": "Portfolio Return (%)", "aggregation_formula": "ROUND(100.0 * (end_value - start_value) / NULLIF(start_value, 0), 2) per portfolio_id"},
    {"kpi_metric_name": "Benchmark Return (%)", "aggregation_formula": "dim_portfolio.benchmark_return_12m constant per portfolio"},
    {"kpi_metric_name": "Alpha (Excess Return)", "aggregation_formula": "portfolio_return_pct - benchmark_return_pct"},
    {"kpi_metric_name": "Asset Class Contribution", "aggregation_formula": "SUM(asset_return_pct * asset_weight) grouped by asset_class within portfolio"}
  ],
  "chain_of_thought": [
    "Step 1: Compute 12-month return per portfolio vs benchmark, drill into asset class contribution.",
    "Step 2: Need fact_transactions, dim_asset, dim_portfolio, dim_date.",
    "Step 3: CTE1 — start and end portfolio value using FIRST_VALUE/LAST_VALUE over 12-month window.",
    "Step 4: CTE2 — portfolio return % and alpha vs benchmark from dim_portfolio.",
    "Step 5: CTE3 — asset class weighted return contribution within each portfolio.",
    "Step 6: Order portfolios by alpha ASC — worst underperformers first."
  ],
  "schema_tables": {
    "fact_tables": ["fact_transactions"],
    "dimension_tables": ["dim_asset", "dim_portfolio", "dim_date"]
  },
  "data_model_layers": {
    "hierarchies": "Asset > Asset_Class > Portfolio, Date > Month > Quarter > Year",
    "aggregations": "agg_portfolio_return_12m, agg_asset_class_contribution",
    "snapshots": "snap_portfolio_performance_quarterly"
  },
  "sql": "WITH portfolio_values AS (SELECT t.portfolio_id, FIRST_VALUE(t.portfolio_value) OVER (PARTITION BY t.portfolio_id ORDER BY d.full_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS start_value, LAST_VALUE(t.portfolio_value) OVER (PARTITION BY t.portfolio_id ORDER BY d.full_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS end_value FROM fact_transactions t JOIN dim_date d ON t.date_id = d.date_id WHERE d.full_date >= CURRENT_DATE - INTERVAL '12 months' GROUP BY t.portfolio_id, t.portfolio_value, d.full_date), portfolio_returns AS (SELECT pv.portfolio_id, MAX(pv.start_value) AS start_value, MAX(pv.end_value) AS end_value, ROUND(100.0 * (MAX(pv.end_value) - MAX(pv.start_value)) / NULLIF(MAX(pv.start_value), 0), 2) AS portfolio_return_pct FROM portfolio_values pv GROUP BY pv.portfolio_id), asset_contribution AS (SELECT t.portfolio_id, a.asset_class, ROUND(SUM(t.asset_return_pct * t.asset_weight), 4) AS class_contribution FROM fact_transactions t JOIN dim_asset a ON t.asset_id = a.asset_id JOIN dim_date d ON t.date_id = d.date_id WHERE d.full_date >= CURRENT_DATE - INTERVAL '12 months' GROUP BY t.portfolio_id, a.asset_class) SELECT p.portfolio_name, pr.portfolio_return_pct, p.benchmark_return_12m AS benchmark_return_pct, ROUND(pr.portfolio_return_pct - p.benchmark_return_12m, 2) AS alpha, ac.asset_class, ac.class_contribution FROM portfolio_returns pr JOIN dim_portfolio p ON pr.portfolio_id = p.portfolio_id JOIN asset_contribution ac ON pr.portfolio_id = ac.portfolio_id ORDER BY alpha ASC, ac.class_contribution ASC;"
}

Manufacturing 

Overall Equipment Effectiveness (OEE) by Production Line

Difficulty: Hard | Domain: Manufacturing | Tables: fact_production, dim_equipment, dim_shift, dim_date

Business Question

What is the Overall Equipment Effectiveness (OEE) score for each production line over the past quarter, and which component — availability, performance, or quality — is dragging the score down?

Business Context

Plant operations manager running monthly equipment utilization review. OEE below 85% (world-class threshold) triggers targeted maintenance or process audits. Component breakdown directs the right fix.

KPIs

KPI	Formula
Availability (%)	`(planned_time - downtime) / planned_time * 100`
Performance (%)	`(actual_output / theoretical_max_output) * 100`
Quality (%)	`(good_units / total_units_produced) * 100`
OEE Score (%)	`ROUND(availability * performance * quality / 10000, 2)`
Gap to World-Class (85%)	`85.0 - oee_score`

Key Technique

All three OEE components computed in single CTE using conditional aggregation. OEE = product of three ratios (multiplied, not averaged). CASE WHEN identifies the weakest component per line for root cause routing.

Raw JSON Submission

{
  "q_id": 8,
  "difficulty": "Hard",
  "db_type": "Azure Synapse",
  "domain": "Manufacturing",
  "instruction": "What is the OEE score for each production line over the past quarter, and which component is dragging the score down?",
  "context": "Plant operations manager reviewing equipment utilization. OEE below 85% triggers maintenance or process audit.",
  "metrics_and_aggregation": [
    {"kpi_metric_name": "Availability (%)", "aggregation_formula": "ROUND(100.0 * SUM(planned_time - downtime_minutes) / NULLIF(SUM(planned_time), 0), 2) per equipment_line"},
    {"kpi_metric_name": "Performance (%)", "aggregation_formula": "ROUND(100.0 * SUM(actual_output) / NULLIF(SUM(theoretical_max_output), 0), 2)"},
    {"kpi_metric_name": "Quality (%)", "aggregation_formula": "ROUND(100.0 * SUM(good_units) / NULLIF(SUM(total_units), 0), 2)"},
    {"kpi_metric_name": "OEE Score (%)", "aggregation_formula": "ROUND(availability_pct * performance_pct * quality_pct / 10000.0, 2)"},
    {"kpi_metric_name": "Gap to World-Class", "aggregation_formula": "85.0 - oee_score"}
  ],
  "chain_of_thought": [
    "Step 1: Compute OEE per production line for last quarter. Flag lines below 85%.",
    "Step 2: Need fact_production, dim_equipment, dim_shift, dim_date.",
    "Step 3: CTE1 — compute availability, performance, quality per equipment_line in one pass.",
    "Step 4: CTE2 — multiply three ratios for OEE score. Compute gap to 85% world-class.",
    "Step 5: CASE WHEN identifies weakest component for root cause routing.",
    "Step 6: Order by oee_score ASC — worst performing lines first."
  ],
  "schema_tables": {
    "fact_tables": ["fact_production"],
    "dimension_tables": ["dim_equipment", "dim_shift", "dim_date"]
  },
  "data_model_layers": {
    "hierarchies": "Equipment > Production_Line > Plant, Shift > Date > Quarter",
    "aggregations": "agg_oee_quarterly_by_line, agg_oee_components",
    "snapshots": "snap_equipment_oee_monthly"
  },
  "sql": "WITH oee_components AS (SELECT e.equipment_line, e.plant_name, ROUND(100.0 * SUM(p.planned_time_minutes - p.downtime_minutes) / NULLIF(SUM(p.planned_time_minutes), 0), 2) AS availability_pct, ROUND(100.0 * SUM(p.actual_output) / NULLIF(SUM(p.theoretical_max_output), 0), 2) AS performance_pct, ROUND(100.0 * SUM(p.good_units) / NULLIF(SUM(p.total_units_produced), 0), 2) AS quality_pct FROM fact_production p JOIN dim_equipment e ON p.equipment_id = e.equipment_id JOIN dim_date d ON p.date_id = d.date_id WHERE d.quarter = DATEPART(QUARTER, DATEADD(QUARTER, -1, CURRENT_DATE)) GROUP BY e.equipment_line, e.plant_name), oee_scores AS (SELECT equipment_line, plant_name, availability_pct, performance_pct, quality_pct, ROUND(availability_pct * performance_pct * quality_pct / 10000.0, 2) AS oee_score, ROUND(85.0 - ROUND(availability_pct * performance_pct * quality_pct / 10000.0, 2), 2) AS gap_to_worldclass, CASE WHEN LEAST(availability_pct, performance_pct, quality_pct) = availability_pct THEN 'Availability' WHEN LEAST(availability_pct, performance_pct, quality_pct) = performance_pct THEN 'Performance' ELSE 'Quality' END AS weakest_component FROM oee_components) SELECT plant_name, equipment_line, availability_pct, performance_pct, quality_pct, oee_score, gap_to_worldclass, weakest_component FROM oee_scores ORDER BY oee_score ASC;"
}

Supply Chain 

Supplier Lead Time Variance Analysis

Difficulty: Medium | Domain: Supply Chain | Tables: fact_purchase_orders, dim_supplier, dim_product, dim_date

Business Question

Which suppliers consistently deliver outside their agreed lead time windows, and what is the financial impact of late deliveries on production schedules?

Business Context

Procurement team running quarterly supplier performance review. Suppliers with high lead time variance cause stock-outs, production delays, and emergency procurement costs. SLA breach rate drives contract renegotiation decisions.

KPIs

KPI	Formula
Agreed Lead Time (days)	`dim_supplier.agreed_lead_time_days`
Actual Lead Time (days)	`DATEDIFF(actual_delivery_date, order_date)` per PO
Lead Time Variance (days)	`actual_lead_time - agreed_lead_time`
SLA Breach Rate (%)	`ROUND(100.0 * late_orders / total_orders, 2)` per supplier
Avg Late Delivery Cost	`AVG(emergency_procurement_cost) WHERE is_late = TRUE`

Key Technique

DATEDIFF per PO for actual lead time. Compare to agreed_lead_time_days from dim. CASE WHEN actual > agreed THEN 1 END flags late orders. HAVING breach_rate > 20 filters chronic offenders.

Raw JSON Submission

{
  "q_id": 9,
  "difficulty": "Medium",
  "db_type": "PostgreSQL",
  "domain": "Supply Chain",
  "instruction": "Which suppliers consistently deliver outside their agreed lead time windows, and what is the financial impact of late deliveries?",
  "context": "Procurement team quarterly supplier review. High lead time variance causes stock-outs and emergency procurement costs.",
  "metrics_and_aggregation": [
    {"kpi_metric_name": "Actual Lead Time (days)", "aggregation_formula": "DATEDIFF(actual_delivery_date, order_date) per purchase_order_id"},
    {"kpi_metric_name": "Lead Time Variance (days)", "aggregation_formula": "actual_lead_time_days - dim_supplier.agreed_lead_time_days"},
    {"kpi_metric_name": "SLA Breach Rate (%)", "aggregation_formula": "ROUND(100.0 * COUNT(CASE WHEN actual > agreed THEN 1 END) / COUNT(*), 2) per supplier_id"},
    {"kpi_metric_name": "Avg Late Delivery Cost", "aggregation_formula": "AVG(emergency_procurement_cost) WHERE is_late = TRUE, per supplier_id"}
  ],
  "chain_of_thought": [
    "Step 1: Find suppliers with >20% SLA breach rate and quantify financial impact.",
    "Step 2: Need fact_purchase_orders, dim_supplier, dim_product, dim_date.",
    "Step 3: Compute actual_lead_time = DATEDIFF(actual_delivery_date, order_date) per PO.",
    "Step 4: Compare vs dim_supplier.agreed_lead_time_days. Flag late = actual > agreed.",
    "Step 5: Aggregate per supplier — breach_rate, avg_variance, avg_late_cost.",
    "Step 6: HAVING breach_rate > 20. Order by avg_late_cost DESC."
  ],
  "schema_tables": {
    "fact_tables": ["fact_purchase_orders"],
    "dimension_tables": ["dim_supplier", "dim_product", "dim_date"]
  },
  "data_model_layers": {
    "hierarchies": "Supplier > Category > Region, Date > Month > Quarter",
    "aggregations": "agg_supplier_lead_time_quarterly, agg_sla_breach_rate",
    "snapshots": "snap_supplier_performance_monthly"
  },
  "sql": "WITH po_lead_times AS (SELECT po.purchase_order_id, po.supplier_id, s.supplier_name, s.agreed_lead_time_days, DATEDIFF(po.actual_delivery_date, po.order_date) AS actual_lead_time_days, DATEDIFF(po.actual_delivery_date, po.order_date) - s.agreed_lead_time_days AS lead_time_variance, CASE WHEN DATEDIFF(po.actual_delivery_date, po.order_date) > s.agreed_lead_time_days THEN 1 ELSE 0 END AS is_late, po.emergency_procurement_cost FROM fact_purchase_orders po JOIN dim_supplier s ON po.supplier_id = s.supplier_id JOIN dim_date d ON po.date_id = d.date_id WHERE d.full_date >= CURRENT_DATE - INTERVAL '90 days') SELECT supplier_name, agreed_lead_time_days, COUNT(purchase_order_id) AS total_orders, SUM(is_late) AS late_orders, ROUND(100.0 * SUM(is_late) / COUNT(purchase_order_id), 2) AS sla_breach_rate_pct, ROUND(AVG(lead_time_variance), 1) AS avg_lead_time_variance_days, ROUND(AVG(CASE WHEN is_late = 1 THEN emergency_procurement_cost END), 2) AS avg_late_delivery_cost FROM po_lead_times GROUP BY supplier_id, supplier_name, agreed_lead_time_days HAVING ROUND(100.0 * SUM(is_late) / COUNT(purchase_order_id), 2) > 20 ORDER BY avg_late_delivery_cost DESC;"
}

Examples

Retail

Category Revenue YoY Comparison

Slow Moving Inventory Stock Value

Healthcare

Hospital 30-Day Readmission Rate by Department

Insurance Claims Denial Rate by Diagnosis Code

HighTech (SaaS)

Monthly Churn Rate by Subscription Plan

Feature Adoption Funnel by Customer Segment

Finance

Portfolio Performance vs Benchmark

Manufacturing

Overall Equipment Effectiveness (OEE) by Production Line

Supply Chain

Supplier Lead Time Variance Analysis

Ask Your Doubt

Retail 

Healthcare 

Finance 

Manufacturing 

Supply Chain 