Examples ======== Reference examples showing the range of difficulty, domains, and use cases this dataset covers. Each domain includes real-world business scenarios with full KPI definitions, chain of thought, and SQL. Use these as a guide when writing your own submission. .. contents:: :local: :depth: 1 Retail ------ .. _example-category-revenue-yoy: Category Revenue YoY Comparison ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Difficulty:** Medium | **Domain:** Retail | **Tables:** fact_sales, dim_product, dim_date **Business Question** Which product categories generated the most revenue last quarter, and how does that compare to the same quarter last year? **Business Context** Merchandising leadership preparing for annual planning meeting. Need to identify growing vs declining categories YoY to guide inventory investment, promotional spend, and category expansion decisions. **KPIs** .. list-table:: :header-rows: 1 * - KPI - Formula * - Category Revenue (Current Quarter) - ``SUM(net_sales) WHERE year=2024 AND quarter=4, grouped by category`` * - Category Revenue (Prior Year Quarter) - ``SUM(net_sales) WHERE year=2023 AND quarter=4, grouped by category`` * - YoY Revenue Change (Absolute) - ``revenue_q4_2024 - revenue_q4_2023`` * - YoY Revenue Change (%) - ``ROUND(100.0 * delta / NULLIF(prior_year, 0), 2)`` **Key Technique** Conditional aggregation with ``CASE WHEN`` pivots both years into one row per category. ``NULLIF`` guards division-by-zero for new categories with no prior year sales. **Raw JSON Submission** .. code-block:: json { "q_id": 1, "difficulty": "Medium", "db_type": "Relational (SQL)", "domain": "Retail", "instruction": "Which product categories generated the most revenue last quarter, and how does that compare to the same quarter last year?", "context": "Merchandising leadership preparing for annual planning meeting. Need to identify growing vs declining categories YoY to guide inventory investment, promotional spend, and category expansion decisions.", "metrics_and_aggregation": [ {"kpi_metric_name": "Category Revenue (Current Quarter)", "aggregation_formula": "SUM(net_sales) WHERE year=2024 AND quarter=4, grouped by category"}, {"kpi_metric_name": "Category Revenue (Prior Year Quarter)", "aggregation_formula": "SUM(net_sales) WHERE year=2023 AND quarter=4, grouped by category"}, {"kpi_metric_name": "YoY Revenue Change (Absolute)", "aggregation_formula": "revenue_q4_2024 - revenue_q4_2023"}, {"kpi_metric_name": "YoY Revenue Change (%)", "aggregation_formula": "ROUND(100.0 * delta / NULLIF(prior_year, 0), 2)"} ], "chain_of_thought": [ "Step 1: Rank product categories by net revenue for Q4 2024 and compare against Q4 2023.", "Step 2: Need fact_sales, dim_product, dim_date.", "Step 3: Filter Q4 only, years 2023 and 2024. Exclude returns (is_return = FALSE).", "Step 4: Use conditional aggregation CASE WHEN to pivot both years into same row per category.", "Step 5: Order by Q4 2024 revenue descending to surface top performers first.", "Step 6: NULLIF guards division-by-zero for new categories with no prior year sales.", "Step 7: Only completed transactions counted. Returns excluded at filter level." ], "schema_tables": { "fact_tables": ["fact_sales"], "dimension_tables": ["dim_product", "dim_date"] }, "data_model_layers": { "hierarchies": "Date > Quarter > Year, Product > Category", "aggregations": "agg_quarterly_revenue_by_category", "snapshots": "" }, "sql": "SELECT p.category, SUM(CASE WHEN d.year = 2024 AND d.quarter = 4 THEN s.net_sales ELSE 0 END) AS revenue_q4_2024, SUM(CASE WHEN d.year = 2023 AND d.quarter = 4 THEN s.net_sales ELSE 0 END) AS revenue_q4_2023, SUM(CASE WHEN d.year = 2024 AND d.quarter = 4 THEN s.net_sales ELSE 0 END) - SUM(CASE WHEN d.year = 2023 AND d.quarter = 4 THEN s.net_sales ELSE 0 END) AS revenue_change, ROUND(100.0 * (SUM(CASE WHEN d.year = 2024 AND d.quarter = 4 THEN s.net_sales ELSE 0 END) - SUM(CASE WHEN d.year = 2023 AND d.quarter = 4 THEN s.net_sales ELSE 0 END)) / NULLIF(SUM(CASE WHEN d.year = 2023 AND d.quarter = 4 THEN s.net_sales ELSE 0 END), 0), 2) AS yoy_change_pct FROM fact_sales s JOIN dim_product p ON s.product_id = p.product_id JOIN dim_date d ON s.date_id = d.date_id WHERE s.is_return = FALSE AND d.year IN (2023, 2024) AND d.quarter = 4 GROUP BY p.category ORDER BY revenue_q4_2024 DESC;" } ---- .. _example-slow-moving-inventory: Slow Moving Inventory Stock Value ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Difficulty:** Medium | **Domain:** Retail | **Tables:** fact_sales, dim_product, dim_date **Business Question** Which products have been sitting in inventory for more than 60 days without a sale, and what is their estimated stock value? **Business Context** Inventory management team runs monthly review to identify stagnant products. Zero sales in 60+ days ties up capital, consumes warehouse space, risks obsolescence. Report drives decisions on markdowns, clearance promotions, or supplier returns. **KPIs** .. list-table:: :header-rows: 1 * - KPI - Formula * - Days Since Last Sale - ``DATEDIFF(CURRENT_DATE, MAX(last_sale_date))`` per product_id * - Units Sold Last 90 Days (Stock Proxy) - ``SUM(quantity)`` last 90 days, non-returns, per product_id * - Estimated Stagnant Stock Value - ``ROUND(unit_cost * COALESCE(units_sold_last_90d, 0), 2)`` **Key Technique** Two LEFT JOIN subqueries — one for all-time last sale date, one for 90-day volume. ``HAVING`` filters to products with no sale ever OR >60 days stagnant. ``COALESCE`` handles zero-sales products cleanly. **Raw JSON Submission** .. code-block:: json { "q_id": 2, "difficulty": "Medium", "db_type": "Relational (SQL)", "domain": "Retail", "instruction": "Which products have been sitting in inventory for more than 60 days without a sale, and what is their estimated stock value?", "context": "Inventory management team runs monthly review to identify stagnant products. Zero sales in 60+ days ties up capital, consumes warehouse space, risks obsolescence.", "metrics_and_aggregation": [ {"kpi_metric_name": "Days Since Last Sale", "aggregation_formula": "DATEDIFF(CURRENT_DATE, MAX(last_sale_date)) per product_id"}, {"kpi_metric_name": "Units Sold Last 90 Days (Stock Proxy)", "aggregation_formula": "SUM(quantity) WHERE full_date >= CURRENT_DATE - 90 days AND is_return = FALSE, grouped by product_id"}, {"kpi_metric_name": "Estimated Stagnant Stock Value", "aggregation_formula": "ROUND(unit_cost * COALESCE(units_sold_last_90d, 0), 2) per product"} ], "chain_of_thought": [ "Step 1: Surface all active products with no sales in 60+ days and estimate tied-up capital value.", "Step 2: Need dim_product, fact_sales, dim_date. No inventory table — use 90-day sales as stock proxy.", "Step 3: Filter non-returns only. Two date windows needed.", "Step 4: LEFT JOINs include zero-history products. HAVING filters last_sale_date IS NULL OR days > 60.", "Step 5: Order by estimated_stagnant_value DESC — highest capital-at-risk first." ], "schema_tables": { "fact_tables": ["fact_sales"], "dimension_tables": ["dim_product", "dim_date"] }, "data_model_layers": { "hierarchies": "Product > Sub-Category > Category > Brand", "aggregations": "agg_slow_moving_inventory_monthly", "snapshots": "snap_inventory_stagnant_60d" }, "sql": "SELECT p.product_id, p.product_name, p.category, p.unit_cost, COALESCE(recent.units_sold_last_90d, 0) AS units_sold_last_90d, MAX(last_sale.last_sale_date) AS last_sale_date, DATEDIFF(CURRENT_DATE, MAX(last_sale.last_sale_date)) AS days_since_last_sale, ROUND(p.unit_cost * COALESCE(recent.units_sold_last_90d, 0), 2) AS estimated_stagnant_value FROM dim_product p LEFT JOIN (SELECT s.product_id, MAX(d.full_date) AS last_sale_date FROM fact_sales s JOIN dim_date d ON s.date_id = d.date_id WHERE s.is_return = FALSE GROUP BY s.product_id) last_sale ON p.product_id = last_sale.product_id LEFT JOIN (SELECT s.product_id, SUM(s.quantity) AS units_sold_last_90d FROM fact_sales s JOIN dim_date d ON s.date_id = d.date_id WHERE s.is_return = FALSE AND d.full_date >= CURRENT_DATE - INTERVAL '90 days' GROUP BY s.product_id) recent ON p.product_id = recent.product_id GROUP BY p.product_id, p.product_name, p.category, p.unit_cost, recent.units_sold_last_90d HAVING MAX(last_sale.last_sale_date) IS NULL OR DATEDIFF(CURRENT_DATE, MAX(last_sale.last_sale_date)) > 60 ORDER BY estimated_stagnant_value DESC;" } Healthcare ---------- .. _example-hospital-readmission-rate: Hospital 30-Day Readmission Rate by Department ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Difficulty:** Medium | **Domain:** Healthcare | **Tables:** fact_encounters, dim_patient, dim_department, dim_date **Business Question** Which hospital departments have the highest 30-day patient readmission rates, and how does this compare to the national benchmark of 15%? **Business Context** Quality assurance team running quarterly clinical outcomes review. High readmission rates indicate gaps in discharge planning, post-care follow-up, or treatment effectiveness. Departments above benchmark are flagged for intervention. **KPIs** .. list-table:: :header-rows: 1 * - KPI - Formula * - Total Discharges per Department - ``COUNT(DISTINCT encounter_id) WHERE encounter_type = 'discharge'`` * - 30-Day Readmissions - ``COUNT(DISTINCT patient_id)`` readmitted within 30 days of discharge * - Readmission Rate (%) - ``ROUND(100.0 * readmissions / total_discharges, 2)`` per department * - Variance from Benchmark - ``readmission_rate_pct - 15.0`` **Key Technique** Self-join on ``fact_encounters`` to detect readmissions within 30-day window. ``DATEDIFF`` between discharge date and next admission date per patient. National benchmark (15.0) passed as literal constant for variance calc. **Raw JSON Submission** .. code-block:: json { "q_id": 3, "difficulty": "Medium", "db_type": "PostgreSQL", "domain": "Healthcare", "instruction": "Which hospital departments have the highest 30-day patient readmission rates, and how does this compare to the national benchmark of 15%?", "context": "Quality assurance team running quarterly clinical outcomes review. High readmission rates indicate gaps in discharge planning. Departments above benchmark are flagged for intervention.", "metrics_and_aggregation": [ {"kpi_metric_name": "Total Discharges per Department", "aggregation_formula": "COUNT(DISTINCT encounter_id) WHERE encounter_type = 'discharge', grouped by department_id"}, {"kpi_metric_name": "30-Day Readmissions", "aggregation_formula": "COUNT(DISTINCT patient_id) WHERE next_admission_date <= discharge_date + 30 days"}, {"kpi_metric_name": "Readmission Rate (%)", "aggregation_formula": "ROUND(100.0 * readmissions / total_discharges, 2) per department"}, {"kpi_metric_name": "Variance from Benchmark", "aggregation_formula": "readmission_rate_pct - 15.0"} ], "chain_of_thought": [ "Step 1: Find departments with 30-day readmission rate above 15% national benchmark.", "Step 2: Need fact_encounters, dim_department, dim_patient, dim_date.", "Step 3: Self-join fact_encounters on patient_id to find next admission after each discharge.", "Step 4: DATEDIFF between discharge_date and next admission_date — flag if <= 30 days.", "Step 5: Aggregate by department. Compute rate. Subtract 15.0 for benchmark variance.", "Step 6: Order by readmission_rate_pct DESC — worst performers first." ], "schema_tables": { "fact_tables": ["fact_encounters"], "dimension_tables": ["dim_patient", "dim_department", "dim_date"] }, "data_model_layers": { "hierarchies": "Department > Division > Hospital, Date > Month > Quarter", "aggregations": "agg_readmission_rate_quarterly", "snapshots": "snap_department_outcomes_monthly" }, "sql": "WITH discharges AS (SELECT e.encounter_id, e.patient_id, e.department_id, d.full_date AS discharge_date FROM fact_encounters e JOIN dim_date d ON e.date_id = d.date_id WHERE e.encounter_type = 'discharge'), readmits AS (SELECT e.patient_id, MIN(d.full_date) AS next_admission_date FROM fact_encounters e JOIN dim_date d ON e.date_id = d.date_id WHERE e.encounter_type = 'admission' GROUP BY e.patient_id), dept_rates AS (SELECT dis.department_id, COUNT(DISTINCT dis.encounter_id) AS total_discharges, COUNT(DISTINCT CASE WHEN DATEDIFF(ra.next_admission_date, dis.discharge_date) <= 30 THEN dis.patient_id END) AS readmissions FROM discharges dis LEFT JOIN readmits ra ON dis.patient_id = ra.patient_id AND ra.next_admission_date > dis.discharge_date GROUP BY dis.department_id) SELECT dp.department_name, dp.division, dr.total_discharges, dr.readmissions, ROUND(100.0 * dr.readmissions / NULLIF(dr.total_discharges, 0), 2) AS readmission_rate_pct, ROUND(100.0 * dr.readmissions / NULLIF(dr.total_discharges, 0), 2) - 15.0 AS variance_from_benchmark FROM dept_rates dr JOIN dim_department dp ON dr.department_id = dp.department_id ORDER BY readmission_rate_pct DESC;" } ---- .. _example-claims-denial-rate: Insurance Claims Denial Rate by Diagnosis Code ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Difficulty:** Hard | **Domain:** Healthcare | **Tables:** fact_claims, dim_patient, dim_provider, dim_date **Business Question** Which diagnosis codes have the highest insurance claim denial rates, and which providers are submitting the most denied claims? **Business Context** Revenue cycle management team investigating claim denial patterns. High denial rates on specific diagnosis codes suggest documentation gaps, coding errors, or payer-specific policy mismatches. **KPIs** .. list-table:: :header-rows: 1 * - KPI - Formula * - Total Claims per Diagnosis Code - ``COUNT(claim_id)`` grouped by diagnosis_code * - Denied Claims - ``COUNT(claim_id) WHERE claim_status = 'denied'`` * - Denial Rate (%) - ``ROUND(100.0 * denied / total, 2)`` per diagnosis_code * - Avg Denied Claim Value - ``ROUND(AVG(claim_amount) WHERE status = 'denied', 2)`` * - Top Denying Provider Count - ``COUNT(claim_id) WHERE status = 'denied'`` grouped by provider_id **Key Technique** Two-level aggregation — diagnosis code level first, then provider drill-down. ``RANK()`` window function to surface top 5 providers per diagnosis code. ``HAVING denial_rate > 20`` filters to actionable codes only. **Raw JSON Submission** .. code-block:: json { "q_id": 4, "difficulty": "Hard", "db_type": "Snowflake", "domain": "Healthcare", "instruction": "Which diagnosis codes have the highest insurance claim denial rates, and which providers are submitting the most denied claims?", "context": "Revenue cycle management team investigating denial patterns to target documentation improvement and provider training.", "metrics_and_aggregation": [ {"kpi_metric_name": "Total Claims per Diagnosis Code", "aggregation_formula": "COUNT(claim_id) grouped by diagnosis_code"}, {"kpi_metric_name": "Denied Claims", "aggregation_formula": "COUNT(claim_id) WHERE claim_status = 'denied'"}, {"kpi_metric_name": "Denial Rate (%)", "aggregation_formula": "ROUND(100.0 * denied_claims / total_claims, 2)"}, {"kpi_metric_name": "Avg Denied Claim Value", "aggregation_formula": "ROUND(AVG(claim_amount) WHERE claim_status = 'denied', 2)"}, {"kpi_metric_name": "Top Denying Provider Count", "aggregation_formula": "COUNT(claim_id) WHERE claim_status = 'denied', grouped by provider_id, RANK() per diagnosis_code"} ], "chain_of_thought": [ "Step 1: Find diagnosis codes with >20% denial rate and surface top providers driving denials.", "Step 2: Need fact_claims, dim_provider, dim_patient, dim_date.", "Step 3: CTE1 — denial rate per diagnosis_code. Filter HAVING denial_rate > 20%.", "Step 4: CTE2 — provider-level denied claim counts for those diagnosis codes.", "Step 5: RANK() OVER (PARTITION BY diagnosis_code ORDER BY denied_count DESC) — top 5 providers.", "Step 6: Join provider dim for name and specialty. Order by denial_rate DESC." ], "schema_tables": { "fact_tables": ["fact_claims"], "dimension_tables": ["dim_patient", "dim_provider", "dim_date"] }, "data_model_layers": { "hierarchies": "Provider > Specialty > Division, Date > Month > Year", "aggregations": "agg_denial_rate_by_diagnosis, agg_provider_denial_count", "snapshots": "snap_claims_denial_monthly" }, "sql": "WITH diagnosis_denial AS (SELECT diagnosis_code, COUNT(claim_id) AS total_claims, COUNT(CASE WHEN claim_status = 'denied' THEN 1 END) AS denied_claims, ROUND(100.0 * COUNT(CASE WHEN claim_status = 'denied' THEN 1 END) / NULLIF(COUNT(claim_id), 0), 2) AS denial_rate_pct, ROUND(AVG(CASE WHEN claim_status = 'denied' THEN claim_amount END), 2) AS avg_denied_value FROM fact_claims GROUP BY diagnosis_code HAVING ROUND(100.0 * COUNT(CASE WHEN claim_status = 'denied' THEN 1 END) / NULLIF(COUNT(claim_id), 0), 2) > 20), provider_denials AS (SELECT fc.diagnosis_code, fc.provider_id, COUNT(fc.claim_id) AS provider_denied_count, RANK() OVER (PARTITION BY fc.diagnosis_code ORDER BY COUNT(fc.claim_id) DESC) AS denial_rank FROM fact_claims fc WHERE fc.claim_status = 'denied' AND fc.diagnosis_code IN (SELECT diagnosis_code FROM diagnosis_denial) GROUP BY fc.diagnosis_code, fc.provider_id) SELECT dd.diagnosis_code, dd.total_claims, dd.denied_claims, dd.denial_rate_pct, dd.avg_denied_value, p.provider_name, p.specialty, pd.provider_denied_count FROM diagnosis_denial dd JOIN provider_denials pd ON dd.diagnosis_code = pd.diagnosis_code AND pd.denial_rank <= 5 JOIN dim_provider p ON pd.provider_id = p.provider_id ORDER BY dd.denial_rate_pct DESC, pd.denial_rank;" } HighTech (SaaS) --------------- .. _example-saas-churn-rate: Monthly Churn Rate by Subscription Plan ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Difficulty:** Medium | **Domain:** HighTech (SaaS) | **Tables:** fact_subscriptions, dim_customer, dim_plan, dim_date **Business Question** What is the monthly churn rate for each subscription plan tier over the past 6 months, and which plan has the worst retention? **Business Context** Product and growth team tracking retention health across Free, Pro, and Enterprise tiers. High churn in a specific tier signals pricing, onboarding, or feature-fit issues. Monthly tracking enables early intervention before churn compounds. **KPIs** .. list-table:: :header-rows: 1 * - KPI - Formula * - Churned Customers per Month - ``COUNT(customer_id) WHERE status = 'churned'`` grouped by plan_tier, month * - Active Customers Start of Month - ``COUNT(customer_id) WHERE status = 'active'`` at month start, per plan_tier * - Monthly Churn Rate (%) - ``ROUND(100.0 * churned / active_start, 2)`` per plan_tier, per month * - Avg Days to Churn - ``AVG(DATEDIFF(churn_date, subscription_start_date))`` per plan_tier **Key Technique** ``LAG()`` window function to get prior-month active count as denominator. Filter ``subscription_end_date`` within each calendar month to identify churns. ``Avg Days to Churn`` reveals whether churn is early (onboarding) or late (value). **Raw JSON Submission** .. code-block:: json { "q_id": 5, "difficulty": "Medium", "db_type": "BigQuery", "domain": "HighTech (SaaS)", "instruction": "What is the monthly churn rate for each subscription plan tier over the past 6 months, and which plan has the worst retention?", "context": "Product and growth team tracking retention health. High churn in a specific tier signals pricing or onboarding issues.", "metrics_and_aggregation": [ {"kpi_metric_name": "Churned Customers per Month", "aggregation_formula": "COUNT(customer_id) WHERE subscription_status = 'churned' AND churn_date within month, grouped by plan_tier, month"}, {"kpi_metric_name": "Active Customers Start of Month", "aggregation_formula": "COUNT(customer_id) WHERE status = 'active' at first day of month, per plan_tier"}, {"kpi_metric_name": "Monthly Churn Rate (%)", "aggregation_formula": "ROUND(100.0 * churned_count / NULLIF(active_start_count, 0), 2)"}, {"kpi_metric_name": "Avg Days to Churn", "aggregation_formula": "AVG(DATEDIFF(churn_date, subscription_start_date)) per plan_tier"} ], "chain_of_thought": [ "Step 1: Compute monthly churn rate per plan tier for last 6 months.", "Step 2: Need fact_subscriptions, dim_customer, dim_plan, dim_date.", "Step 3: CTE1 — active customers at start of each month per plan. CTE2 — churned per month per plan.", "Step 4: Join CTE1 and CTE2 on plan_tier + month. Compute churn rate.", "Step 5: Include avg_days_to_churn to distinguish early vs late churn patterns.", "Step 6: Order by month ASC, churn_rate_pct DESC." ], "schema_tables": { "fact_tables": ["fact_subscriptions"], "dimension_tables": ["dim_customer", "dim_plan", "dim_date"] }, "data_model_layers": { "hierarchies": "Plan > Tier > Product, Date > Month > Quarter", "aggregations": "agg_monthly_churn_by_plan", "snapshots": "snap_subscription_health_monthly" }, "sql": "WITH monthly_active AS (SELECT dp.plan_tier, FORMAT_DATE('%Y-%m', DATE_TRUNC(d.full_date, MONTH)) AS month, COUNT(DISTINCT fs.customer_id) AS active_start_count FROM fact_subscriptions fs JOIN dim_plan dp ON fs.plan_id = dp.plan_id JOIN dim_date d ON fs.date_id = d.date_id WHERE fs.subscription_status = 'active' AND d.full_date = DATE_TRUNC(d.full_date, MONTH) AND d.full_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH) GROUP BY dp.plan_tier, month), monthly_churn AS (SELECT dp.plan_tier, FORMAT_DATE('%Y-%m', fs.churn_date) AS month, COUNT(DISTINCT fs.customer_id) AS churned_count, ROUND(AVG(DATE_DIFF(fs.churn_date, fs.subscription_start_date, DAY)), 1) AS avg_days_to_churn FROM fact_subscriptions fs JOIN dim_plan dp ON fs.plan_id = dp.plan_id WHERE fs.subscription_status = 'churned' AND fs.churn_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH) GROUP BY dp.plan_tier, month) SELECT ma.plan_tier, ma.month, ma.active_start_count, COALESCE(mc.churned_count, 0) AS churned_count, ROUND(100.0 * COALESCE(mc.churned_count, 0) / NULLIF(ma.active_start_count, 0), 2) AS churn_rate_pct, COALESCE(mc.avg_days_to_churn, 0) AS avg_days_to_churn FROM monthly_active ma LEFT JOIN monthly_churn mc ON ma.plan_tier = mc.plan_tier AND ma.month = mc.month ORDER BY ma.month, churn_rate_pct DESC;" } ---- .. _example-saas-feature-adoption: Feature Adoption Funnel by Customer Segment ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Difficulty:** Hard | **Domain:** HighTech (SaaS) | **Tables:** fact_feature_usage, dim_customer, dim_feature, dim_date **Business Question** What percentage of customers in each segment have adopted our core features, and where in the adoption funnel are they dropping off? **Business Context** Product team running quarterly feature adoption review. Low adoption of core features correlates strongly with churn. Segment-level funnel reveals whether onboarding, UX, or feature-market fit is the root cause. **KPIs** .. list-table:: :header-rows: 1 * - KPI - Formula * - Customers Who Tried Feature - ``COUNT(DISTINCT customer_id)`` per feature, per segment * - Adoption Rate (%) - ``ROUND(100.0 * tried / total_segment_customers, 2)`` * - Repeat Usage Rate (%) - ``COUNT(DISTINCT customer_id WHERE usage_count >= 3) / tried`` * - Funnel Drop-off Rate (%) - ``100 - repeat_usage_rate_pct`` **Key Technique** 3-CTE pattern. CTE1 total customers per segment baseline. CTE2 feature trial counts. CTE3 repeat users (usage_count >= 3 threshold). Final join computes full funnel. ``RANK()`` surfaces features with worst drop-off per segment. **Raw JSON Submission** .. code-block:: json { "q_id": 6, "difficulty": "Hard", "db_type": "Snowflake", "domain": "HighTech (SaaS)", "instruction": "What percentage of customers in each segment have adopted our core features, and where in the adoption funnel are they dropping off?", "context": "Product team running quarterly feature adoption review. Low adoption of core features correlates with churn.", "metrics_and_aggregation": [ {"kpi_metric_name": "Customers Who Tried Feature", "aggregation_formula": "COUNT(DISTINCT customer_id) per feature_id, per customer_segment"}, {"kpi_metric_name": "Adoption Rate (%)", "aggregation_formula": "ROUND(100.0 * tried_count / total_segment_customers, 2)"}, {"kpi_metric_name": "Repeat Usage Rate (%)", "aggregation_formula": "ROUND(100.0 * COUNT(DISTINCT customer_id WHERE usage_count >= 3) / tried_count, 2)"}, {"kpi_metric_name": "Funnel Drop-off Rate (%)", "aggregation_formula": "100.0 - repeat_usage_rate_pct"} ], "chain_of_thought": [ "Step 1: Build feature adoption funnel per customer segment.", "Step 2: Need fact_feature_usage, dim_customer, dim_feature, dim_date.", "Step 3: CTE1 — total active customers per segment (baseline denominator).", "Step 4: CTE2 — distinct customers who used each feature at least once.", "Step 5: CTE3 — customers with usage_count >= 3 (repeat/sticky usage threshold).", "Step 6: RANK() per segment by drop-off_rate DESC — worst funnel leaks first." ], "schema_tables": { "fact_tables": ["fact_feature_usage"], "dimension_tables": ["dim_customer", "dim_feature", "dim_date"] }, "data_model_layers": { "hierarchies": "Feature > Feature_Group > Product, Customer > Segment > Region", "aggregations": "agg_feature_adoption_by_segment, agg_repeat_usage_rate", "snapshots": "snap_feature_funnel_quarterly" }, "sql": "WITH segment_baseline AS (SELECT c.customer_segment, COUNT(DISTINCT c.customer_id) AS total_customers FROM dim_customer c WHERE c.customer_status = 'active' GROUP BY c.customer_segment), feature_trials AS (SELECT fu.feature_id, c.customer_segment, COUNT(DISTINCT fu.customer_id) AS tried_count FROM fact_feature_usage fu JOIN dim_customer c ON fu.customer_id = c.customer_id GROUP BY fu.feature_id, c.customer_segment), repeat_users AS (SELECT fu.feature_id, c.customer_segment, COUNT(DISTINCT fu.customer_id) AS repeat_count FROM fact_feature_usage fu JOIN dim_customer c ON fu.customer_id = c.customer_id GROUP BY fu.feature_id, c.customer_segment HAVING COUNT(fu.usage_event_id) >= 3) SELECT df.feature_name, ft.customer_segment, sb.total_customers, ft.tried_count, ROUND(100.0 * ft.tried_count / NULLIF(sb.total_customers, 0), 2) AS adoption_rate_pct, COALESCE(ru.repeat_count, 0) AS repeat_count, ROUND(100.0 * COALESCE(ru.repeat_count, 0) / NULLIF(ft.tried_count, 0), 2) AS repeat_usage_rate_pct, ROUND(100.0 - ROUND(100.0 * COALESCE(ru.repeat_count, 0) / NULLIF(ft.tried_count, 0), 2), 2) AS dropoff_rate_pct FROM feature_trials ft JOIN segment_baseline sb ON ft.customer_segment = sb.customer_segment JOIN dim_feature df ON ft.feature_id = df.feature_id LEFT JOIN repeat_users ru ON ft.feature_id = ru.feature_id AND ft.customer_segment = ru.customer_segment ORDER BY ft.customer_segment, dropoff_rate_pct DESC;" } Finance ------- .. _example-finance-portfolio-performance: Portfolio Performance vs Benchmark ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Difficulty:** Hard | **Domain:** Finance | **Tables:** fact_transactions, dim_asset, dim_portfolio, dim_date **Business Question** How has each investment portfolio performed against its benchmark index over the past 12 months, and which asset classes are driving over- or under-performance? **Business Context** Investment operations team running quarterly portfolio review. Portfolios consistently underperforming their benchmark trigger rebalancing decisions. Asset-class breakdown identifies whether under-performance is concentrated or broad. **KPIs** .. list-table:: :header-rows: 1 * - KPI - Formula * - Portfolio Return (%) - ``ROUND(100.0 * (end_value - start_value) / NULLIF(start_value, 0), 2)`` * - Benchmark Return (%) - From ``dim_portfolio.benchmark_return_12m`` * - Alpha (Excess Return) - ``portfolio_return_pct - benchmark_return_pct`` * - Asset Class Contribution - ``SUM(asset_return * weight)`` per asset_class within portfolio **Key Technique** Start/end value via ``FIRST_VALUE`` and ``LAST_VALUE`` window functions partitioned by portfolio_id ordered by date. Alpha = portfolio return minus benchmark constant from dim table. Asset class contribution uses weighted return aggregation. **Raw JSON Submission** .. code-block:: json { "q_id": 7, "difficulty": "Hard", "db_type": "Redshift", "domain": "Finance", "instruction": "How has each investment portfolio performed against its benchmark index over the past 12 months, and which asset classes are driving over- or under-performance?", "context": "Investment operations team running quarterly portfolio review. Portfolios underperforming benchmark trigger rebalancing decisions.", "metrics_and_aggregation": [ {"kpi_metric_name": "Portfolio Return (%)", "aggregation_formula": "ROUND(100.0 * (end_value - start_value) / NULLIF(start_value, 0), 2) per portfolio_id"}, {"kpi_metric_name": "Benchmark Return (%)", "aggregation_formula": "dim_portfolio.benchmark_return_12m constant per portfolio"}, {"kpi_metric_name": "Alpha (Excess Return)", "aggregation_formula": "portfolio_return_pct - benchmark_return_pct"}, {"kpi_metric_name": "Asset Class Contribution", "aggregation_formula": "SUM(asset_return_pct * asset_weight) grouped by asset_class within portfolio"} ], "chain_of_thought": [ "Step 1: Compute 12-month return per portfolio vs benchmark, drill into asset class contribution.", "Step 2: Need fact_transactions, dim_asset, dim_portfolio, dim_date.", "Step 3: CTE1 — start and end portfolio value using FIRST_VALUE/LAST_VALUE over 12-month window.", "Step 4: CTE2 — portfolio return % and alpha vs benchmark from dim_portfolio.", "Step 5: CTE3 — asset class weighted return contribution within each portfolio.", "Step 6: Order portfolios by alpha ASC — worst underperformers first." ], "schema_tables": { "fact_tables": ["fact_transactions"], "dimension_tables": ["dim_asset", "dim_portfolio", "dim_date"] }, "data_model_layers": { "hierarchies": "Asset > Asset_Class > Portfolio, Date > Month > Quarter > Year", "aggregations": "agg_portfolio_return_12m, agg_asset_class_contribution", "snapshots": "snap_portfolio_performance_quarterly" }, "sql": "WITH portfolio_values AS (SELECT t.portfolio_id, FIRST_VALUE(t.portfolio_value) OVER (PARTITION BY t.portfolio_id ORDER BY d.full_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS start_value, LAST_VALUE(t.portfolio_value) OVER (PARTITION BY t.portfolio_id ORDER BY d.full_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS end_value FROM fact_transactions t JOIN dim_date d ON t.date_id = d.date_id WHERE d.full_date >= CURRENT_DATE - INTERVAL '12 months' GROUP BY t.portfolio_id, t.portfolio_value, d.full_date), portfolio_returns AS (SELECT pv.portfolio_id, MAX(pv.start_value) AS start_value, MAX(pv.end_value) AS end_value, ROUND(100.0 * (MAX(pv.end_value) - MAX(pv.start_value)) / NULLIF(MAX(pv.start_value), 0), 2) AS portfolio_return_pct FROM portfolio_values pv GROUP BY pv.portfolio_id), asset_contribution AS (SELECT t.portfolio_id, a.asset_class, ROUND(SUM(t.asset_return_pct * t.asset_weight), 4) AS class_contribution FROM fact_transactions t JOIN dim_asset a ON t.asset_id = a.asset_id JOIN dim_date d ON t.date_id = d.date_id WHERE d.full_date >= CURRENT_DATE - INTERVAL '12 months' GROUP BY t.portfolio_id, a.asset_class) SELECT p.portfolio_name, pr.portfolio_return_pct, p.benchmark_return_12m AS benchmark_return_pct, ROUND(pr.portfolio_return_pct - p.benchmark_return_12m, 2) AS alpha, ac.asset_class, ac.class_contribution FROM portfolio_returns pr JOIN dim_portfolio p ON pr.portfolio_id = p.portfolio_id JOIN asset_contribution ac ON pr.portfolio_id = ac.portfolio_id ORDER BY alpha ASC, ac.class_contribution ASC;" } Manufacturing ------------- .. _example-manufacturing-oee: Overall Equipment Effectiveness (OEE) by Production Line ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Difficulty:** Hard | **Domain:** Manufacturing | **Tables:** fact_production, dim_equipment, dim_shift, dim_date **Business Question** What is the Overall Equipment Effectiveness (OEE) score for each production line over the past quarter, and which component — availability, performance, or quality — is dragging the score down? **Business Context** Plant operations manager running monthly equipment utilization review. OEE below 85% (world-class threshold) triggers targeted maintenance or process audits. Component breakdown directs the right fix. **KPIs** .. list-table:: :header-rows: 1 * - KPI - Formula * - Availability (%) - ``(planned_time - downtime) / planned_time * 100`` * - Performance (%) - ``(actual_output / theoretical_max_output) * 100`` * - Quality (%) - ``(good_units / total_units_produced) * 100`` * - OEE Score (%) - ``ROUND(availability * performance * quality / 10000, 2)`` * - Gap to World-Class (85%) - ``85.0 - oee_score`` **Key Technique** All three OEE components computed in single CTE using conditional aggregation. OEE = product of three ratios (multiplied, not averaged). ``CASE WHEN`` identifies the weakest component per line for root cause routing. **Raw JSON Submission** .. code-block:: json { "q_id": 8, "difficulty": "Hard", "db_type": "Azure Synapse", "domain": "Manufacturing", "instruction": "What is the OEE score for each production line over the past quarter, and which component is dragging the score down?", "context": "Plant operations manager reviewing equipment utilization. OEE below 85% triggers maintenance or process audit.", "metrics_and_aggregation": [ {"kpi_metric_name": "Availability (%)", "aggregation_formula": "ROUND(100.0 * SUM(planned_time - downtime_minutes) / NULLIF(SUM(planned_time), 0), 2) per equipment_line"}, {"kpi_metric_name": "Performance (%)", "aggregation_formula": "ROUND(100.0 * SUM(actual_output) / NULLIF(SUM(theoretical_max_output), 0), 2)"}, {"kpi_metric_name": "Quality (%)", "aggregation_formula": "ROUND(100.0 * SUM(good_units) / NULLIF(SUM(total_units), 0), 2)"}, {"kpi_metric_name": "OEE Score (%)", "aggregation_formula": "ROUND(availability_pct * performance_pct * quality_pct / 10000.0, 2)"}, {"kpi_metric_name": "Gap to World-Class", "aggregation_formula": "85.0 - oee_score"} ], "chain_of_thought": [ "Step 1: Compute OEE per production line for last quarter. Flag lines below 85%.", "Step 2: Need fact_production, dim_equipment, dim_shift, dim_date.", "Step 3: CTE1 — compute availability, performance, quality per equipment_line in one pass.", "Step 4: CTE2 — multiply three ratios for OEE score. Compute gap to 85% world-class.", "Step 5: CASE WHEN identifies weakest component for root cause routing.", "Step 6: Order by oee_score ASC — worst performing lines first." ], "schema_tables": { "fact_tables": ["fact_production"], "dimension_tables": ["dim_equipment", "dim_shift", "dim_date"] }, "data_model_layers": { "hierarchies": "Equipment > Production_Line > Plant, Shift > Date > Quarter", "aggregations": "agg_oee_quarterly_by_line, agg_oee_components", "snapshots": "snap_equipment_oee_monthly" }, "sql": "WITH oee_components AS (SELECT e.equipment_line, e.plant_name, ROUND(100.0 * SUM(p.planned_time_minutes - p.downtime_minutes) / NULLIF(SUM(p.planned_time_minutes), 0), 2) AS availability_pct, ROUND(100.0 * SUM(p.actual_output) / NULLIF(SUM(p.theoretical_max_output), 0), 2) AS performance_pct, ROUND(100.0 * SUM(p.good_units) / NULLIF(SUM(p.total_units_produced), 0), 2) AS quality_pct FROM fact_production p JOIN dim_equipment e ON p.equipment_id = e.equipment_id JOIN dim_date d ON p.date_id = d.date_id WHERE d.quarter = DATEPART(QUARTER, DATEADD(QUARTER, -1, CURRENT_DATE)) GROUP BY e.equipment_line, e.plant_name), oee_scores AS (SELECT equipment_line, plant_name, availability_pct, performance_pct, quality_pct, ROUND(availability_pct * performance_pct * quality_pct / 10000.0, 2) AS oee_score, ROUND(85.0 - ROUND(availability_pct * performance_pct * quality_pct / 10000.0, 2), 2) AS gap_to_worldclass, CASE WHEN LEAST(availability_pct, performance_pct, quality_pct) = availability_pct THEN 'Availability' WHEN LEAST(availability_pct, performance_pct, quality_pct) = performance_pct THEN 'Performance' ELSE 'Quality' END AS weakest_component FROM oee_components) SELECT plant_name, equipment_line, availability_pct, performance_pct, quality_pct, oee_score, gap_to_worldclass, weakest_component FROM oee_scores ORDER BY oee_score ASC;" } Supply Chain ------------ .. _example-supplychain-supplier-lead-time: Supplier Lead Time Variance Analysis ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Difficulty:** Medium | **Domain:** Supply Chain | **Tables:** fact_purchase_orders, dim_supplier, dim_product, dim_date **Business Question** Which suppliers consistently deliver outside their agreed lead time windows, and what is the financial impact of late deliveries on production schedules? **Business Context** Procurement team running quarterly supplier performance review. Suppliers with high lead time variance cause stock-outs, production delays, and emergency procurement costs. SLA breach rate drives contract renegotiation decisions. **KPIs** .. list-table:: :header-rows: 1 * - KPI - Formula * - Agreed Lead Time (days) - ``dim_supplier.agreed_lead_time_days`` * - Actual Lead Time (days) - ``DATEDIFF(actual_delivery_date, order_date)`` per PO * - Lead Time Variance (days) - ``actual_lead_time - agreed_lead_time`` * - SLA Breach Rate (%) - ``ROUND(100.0 * late_orders / total_orders, 2)`` per supplier * - Avg Late Delivery Cost - ``AVG(emergency_procurement_cost) WHERE is_late = TRUE`` **Key Technique** ``DATEDIFF`` per PO for actual lead time. Compare to ``agreed_lead_time_days`` from dim. ``CASE WHEN actual > agreed THEN 1 END`` flags late orders. ``HAVING breach_rate > 20`` filters chronic offenders. **Raw JSON Submission** .. code-block:: json { "q_id": 9, "difficulty": "Medium", "db_type": "PostgreSQL", "domain": "Supply Chain", "instruction": "Which suppliers consistently deliver outside their agreed lead time windows, and what is the financial impact of late deliveries?", "context": "Procurement team quarterly supplier review. High lead time variance causes stock-outs and emergency procurement costs.", "metrics_and_aggregation": [ {"kpi_metric_name": "Actual Lead Time (days)", "aggregation_formula": "DATEDIFF(actual_delivery_date, order_date) per purchase_order_id"}, {"kpi_metric_name": "Lead Time Variance (days)", "aggregation_formula": "actual_lead_time_days - dim_supplier.agreed_lead_time_days"}, {"kpi_metric_name": "SLA Breach Rate (%)", "aggregation_formula": "ROUND(100.0 * COUNT(CASE WHEN actual > agreed THEN 1 END) / COUNT(*), 2) per supplier_id"}, {"kpi_metric_name": "Avg Late Delivery Cost", "aggregation_formula": "AVG(emergency_procurement_cost) WHERE is_late = TRUE, per supplier_id"} ], "chain_of_thought": [ "Step 1: Find suppliers with >20% SLA breach rate and quantify financial impact.", "Step 2: Need fact_purchase_orders, dim_supplier, dim_product, dim_date.", "Step 3: Compute actual_lead_time = DATEDIFF(actual_delivery_date, order_date) per PO.", "Step 4: Compare vs dim_supplier.agreed_lead_time_days. Flag late = actual > agreed.", "Step 5: Aggregate per supplier — breach_rate, avg_variance, avg_late_cost.", "Step 6: HAVING breach_rate > 20. Order by avg_late_cost DESC." ], "schema_tables": { "fact_tables": ["fact_purchase_orders"], "dimension_tables": ["dim_supplier", "dim_product", "dim_date"] }, "data_model_layers": { "hierarchies": "Supplier > Category > Region, Date > Month > Quarter", "aggregations": "agg_supplier_lead_time_quarterly, agg_sla_breach_rate", "snapshots": "snap_supplier_performance_monthly" }, "sql": "WITH po_lead_times AS (SELECT po.purchase_order_id, po.supplier_id, s.supplier_name, s.agreed_lead_time_days, DATEDIFF(po.actual_delivery_date, po.order_date) AS actual_lead_time_days, DATEDIFF(po.actual_delivery_date, po.order_date) - s.agreed_lead_time_days AS lead_time_variance, CASE WHEN DATEDIFF(po.actual_delivery_date, po.order_date) > s.agreed_lead_time_days THEN 1 ELSE 0 END AS is_late, po.emergency_procurement_cost FROM fact_purchase_orders po JOIN dim_supplier s ON po.supplier_id = s.supplier_id JOIN dim_date d ON po.date_id = d.date_id WHERE d.full_date >= CURRENT_DATE - INTERVAL '90 days') SELECT supplier_name, agreed_lead_time_days, COUNT(purchase_order_id) AS total_orders, SUM(is_late) AS late_orders, ROUND(100.0 * SUM(is_late) / COUNT(purchase_order_id), 2) AS sla_breach_rate_pct, ROUND(AVG(lead_time_variance), 1) AS avg_lead_time_variance_days, ROUND(AVG(CASE WHEN is_late = 1 THEN emergency_procurement_cost END), 2) AS avg_late_delivery_cost FROM po_lead_times GROUP BY supplier_id, supplier_name, agreed_lead_time_days HAVING ROUND(100.0 * SUM(is_late) / COUNT(purchase_order_id), 2) > 20 ORDER BY avg_late_delivery_cost DESC;" }