Getting Started

Note

📊 Open Dataset · SQL · Multi-Domain · Community Contributed

Help build the largest open Text-to-SQL dataset for real-world business analytics. Every query you submit becomes training data that helps AI models answer real business questions in any domain.


Who Should Contribute

  • Data Analysts writing SQL against business data warehouses

  • BI Engineers building dashboards and reports

  • Data Scientists working on NLP and text-to-SQL research

  • Domain Experts (Healthcare, Finance, SaaS, Manufacturing, Supply Chain)

  • Anyone who has turned a business question into a SQL query


Supported Domains

Submissions are accepted across all domains:

| Domain | Example Use Cases |
|---|---|
| Retail | Sales revenue, inventory turnover, return rates, customer segmentation |
| Healthcare | Readmission rates, claims denial, patient outcomes, utilization |
| HighTech (SaaS) | Churn rate, feature adoption, ARR, funnel conversion, DAU/MAU |
| Finance | Portfolio performance, risk exposure, transaction anomalies, alpha |
| Manufacturing | OEE, downtime analysis, yield rate, defect tracking |
| Supply Chain | Supplier lead time, SLA breach rate, stock-out frequency |
| Other | Any domain with structured SQL data and real business questions |


Supported Databases

Submissions are accepted for all major SQL engines:

BigQuery · Snowflake · Redshift · PostgreSQL · MySQL · Oracle · Azure Synapse · Other

Specify your db_type accurately — dialect differences (e.g. DATE_DIFF vs DATEDIFF) are expected and valuable for the dataset.
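To make the dialect point concrete, the sketch below shows how the same "days between two dates" calculation is usually written in four of the engines listed above. The column names (order_date, ship_date) and the helper day_diff_expression are illustrative assumptions, not part of the dataset spec.

```python
# Illustrative only: one calculation, four dialect spellings.
# Column names (order_date, ship_date) are hypothetical.
DAY_DIFF_BY_DIALECT = {
    "BigQuery": "DATE_DIFF(ship_date, order_date, DAY)",
    "Snowflake": "DATEDIFF(day, order_date, ship_date)",
    "MySQL": "DATEDIFF(ship_date, order_date)",
    "PostgreSQL": "ship_date - order_date",  # date minus date yields whole days
}


def day_diff_expression(db_type: str) -> str:
    """Return the example day-difference expression for a db_type."""
    try:
        return DAY_DIFF_BY_DIALECT[db_type]
    except KeyError:
        raise ValueError(f"No example recorded for db_type={db_type!r}") from None


if __name__ == "__main__":
    for name in DAY_DIFF_BY_DIALECT:
        print(f"{name:<11} {day_diff_expression(name)}")
```

Note that the argument order differs between engines as well as the function name, which is why an accurate db_type matters.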


How to Submit (~5 minutes per entry)

All contributions go through the Query Entry Form. The dataset repository is private: contributors submit through the form, and maintainers handle ingestion.

Step 1 — Open the Form

Open the Query Entry Form in your browser. No login or account needed.

Step 2 — Select Your Domain

Choose your domain from the dropdown: Retail, Healthcare, HighTech (SaaS), Finance, Manufacturing, Supply Chain, or Other.

Step 3 — Fill Required Fields

Complete all required sections. The Live JSON panel on the right updates in real time as you type. See Data Spec for the full field reference.

Required sections:

  1. Meta — difficulty, db_type, domain

  2. Business Question — how a real user would ask it

  3. Business Context — who needs this and why

  4. Metrics & Aggregation — KPI names + formulas

  5. Schema Tables — fact and dimension tables used

  6. Data Model Layers — hierarchies, aggregations, snapshots

Optional but strongly recommended:

  1. Chain of Thought — step-by-step reasoning

  2. SQL Answer — actual query (significantly improves quality)
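As a rough picture of what a complete entry might look like, the sketch below builds one as a Python dictionary and checks that the six required sections are filled in. Every key name here (meta, business_question, and so on), the fact_returns table, and the missing_fields helper are assumptions made for illustration. The Data Spec defines the actual field names the form produces.

```python
from __future__ import annotations

import json

# Hypothetical key names: the real ones are defined in the Data Spec.
REQUIRED_SECTIONS = (
    "meta",
    "business_question",
    "business_context",
    "metrics",
    "schema_tables",
    "data_model_layers",
)
REQUIRED_META_FIELDS = ("difficulty", "db_type", "domain")


def missing_fields(entry: dict) -> list[str]:
    """List required sections or meta fields that are absent or empty."""
    missing = [name for name in REQUIRED_SECTIONS if not entry.get(name)]
    meta = entry.get("meta") or {}
    missing += [f"meta.{name}" for name in REQUIRED_META_FIELDS if not meta.get(name)]
    return missing


# Optional sections (chain of thought, SQL answer) are omitted for brevity.
example_entry = {
    "meta": {"difficulty": "Medium", "db_type": "BigQuery", "domain": "Retail"},
    "business_question": "What was our return rate by product category last quarter?",
    "business_context": "The merchandising lead uses return rates to decide "
                        "which categories need supplier follow-up.",
    "metrics": [
        {"name": "Return Rate", "formula": "returned units divided by units sold"}
    ],
    "schema_tables": {
        "fact": ["fact_sales", "fact_returns"],
        "dimension": ["dim_product", "dim_date"],
    },
    "data_model_layers": {
        "hierarchies": ["category > product"],
        "aggregations": ["quarterly"],
        "snapshots": [],
    },
}

if __name__ == "__main__":
    print(json.dumps(example_entry, indent=2))
    print("Missing:", missing_fields(example_entry))
```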

Step 4 — Submit

Once all fields are complete, click the Submit button in the form. Your entry will be sent directly to the maintainers for review.

Tip

The best submissions have clear business context, realistic KPIs, and SQL that actually runs. See Examples before writing your first entry.


What Makes a Good Submission

Strong submissions have:

  • A business question written the way a non-technical stakeholder would ask it

  • Context that explains who needs this data and what decision it drives

  • KPIs with clear, plain-English aggregation formulas (avoid SQL jargon in KPI names)

  • SQL that runs cleanly against the standard schema or your specified db_type

  • Chain of thought that walks through the reasoning step by step

Weak submissions often have:

  • Vague instructions like “get sales data” with no business context

  • KPI names that are just SQL expressions (SUM(net_sales)) instead of business terms

  • Missing or mismatched metric/formula pairs

  • SQL with syntax errors or non-standard table names without explanation
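Two of the weak patterns above, SQL-style KPI names and missing metric/formula pairs, are easy to catch before submitting. The following is a minimal self-check sketch. It assumes each metric is a dictionary with hypothetical name and formula keys, and its regular expression is a heuristic written for this example, not a rule the maintainers apply.

```python
import re

# Heuristic: an aggregate call such as SUM(...) or a snake_case identifier
# suggests the KPI name is a SQL expression rather than a business term.
SQL_LIKE = re.compile(r"\b(SUM|AVG|COUNT|MIN|MAX)\s*\(|\w+_\w+", re.IGNORECASE)


def kpi_warnings(metrics):
    """Return human-readable warnings for a list of metric dictionaries."""
    warnings = []
    for i, metric in enumerate(metrics):
        name = (metric.get("name") or "").strip()
        formula = (metric.get("formula") or "").strip()
        if not name or not formula:
            warnings.append(f"metric {i}: name and formula must both be present")
        elif SQL_LIKE.search(name):
            warnings.append(
                f"metric {i}: {name!r} looks like a SQL expression, use a business term"
            )
    return warnings


if __name__ == "__main__":
    drafts = [
        {"name": "SUM(net_sales)", "formula": "sum of net sales"},
        {"name": "Net Sales", "formula": "sum of sales after discounts and returns"},
        {"name": "Return Rate"},  # formula missing
    ]
    for line in kpi_warnings(drafts):
        print(line)
```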


Difficulty Guide

Use this as a reference when selecting difficulty:

| Level | Typical Pattern | Examples |
|---|---|---|
| Easy | Single table, simple filter + aggregate | Total sales last month, top 10 products by revenue |
| Medium | 2–3 table joins, date windows, basic window functions | YoY revenue comparison, customer segment breakdown, return rate |
| Hard | Multi-CTE, advanced window functions, complex conditions | Seasonality index, OEE score, churn funnel, portfolio alpha |
| Expert | Recursive CTEs, nested window functions, multi-step derivations | Cohort retention curves, graph traversal, multi-period attribution |
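If you are unsure which level fits, counting a query's structural features can give a first guess. The function below is a rough heuristic sketch, loosely based on the patterns in the guide above. Its thresholds are assumptions chosen for illustration, it is not how maintainers assign difficulty, and your own judgment of the business logic should take priority.

```python
import re


def suggest_difficulty(sql: str) -> str:
    """Suggest a difficulty level from simple structural counts (heuristic)."""
    text = sql.upper()
    joins = len(re.findall(r"\bJOIN\b", text))
    windows = len(re.findall(r"\bOVER\s*\(", text))
    # A CTE definition looks like "name AS (SELECT ...".
    ctes = len(re.findall(r"\b\w+\s+AS\s*\(\s*SELECT\b", text))
    # Some dialects write recursive CTEs without the RECURSIVE keyword;
    # this check will miss those.
    recursive = bool(re.search(r"\bWITH\s+RECURSIVE\b", text))

    if recursive:
        return "Expert"
    if ctes >= 2 or windows >= 2:
        return "Hard"
    if joins >= 1 or windows == 1 or ctes == 1:
        return "Medium"
    return "Easy"


if __name__ == "__main__":
    # Table and column names below are hypothetical.
    print(suggest_difficulty("SELECT SUM(net_sales) FROM fact_sales"))
    print(suggest_difficulty(
        "SELECT p.category, SUM(s.net_sales) FROM fact_sales s "
        "JOIN dim_product p ON p.product_id = s.product_id GROUP BY p.category"
    ))
```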


Examples

Click any example below to jump to the full walkthrough:

Retail

Healthcare

HighTech (SaaS)

Finance

Manufacturing

Supply Chain


FAQ

Do I need to know SQL to contribute?

SQL is optional but strongly recommended. Entries with SQL are higher quality and more useful for model training. See Examples for reference before writing.

Can I submit from non-retail domains?

Yes — all domains are welcome. Use equivalent fact/dimension table naming conventions (e.g. fact_claims, dim_patient) following the star schema pattern in Schema Reference.
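For instance, a healthcare entry might pair a fact_claims table with a dim_patient dimension. The sketch below builds a tiny version of that star schema in an in-memory SQLite database and runs a claims denial rate query against it, which is also a quick way to confirm that a query executes. The column names (age_band, claim_status) and the sample rows are invented for this example, and SQLite stands in for whichever engine you actually target.

```python
import sqlite3

# In-memory database with a hypothetical two-table star schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_patient (
        patient_id INTEGER PRIMARY KEY,
        age_band   TEXT
    );
    CREATE TABLE fact_claims (
        claim_id     INTEGER PRIMARY KEY,
        patient_id   INTEGER REFERENCES dim_patient(patient_id),
        claim_status TEXT
    );
    INSERT INTO dim_patient VALUES (1, '18-39'), (2, '40-64');
    INSERT INTO fact_claims VALUES
        (10, 1, 'approved'), (11, 1, 'denied'),
        (12, 2, 'approved'), (13, 2, 'approved');
""")

# Denial rate = denied claims divided by all claims, per patient age band.
DENIAL_RATE_SQL = """
    SELECT p.age_band,
           AVG(CASE WHEN c.claim_status = 'denied' THEN 1.0 ELSE 0.0 END) AS denial_rate
    FROM fact_claims AS c
    JOIN dim_patient AS p ON p.patient_id = c.patient_id
    GROUP BY p.age_band
    ORDER BY p.age_band
"""

rows = conn.execute(DENIAL_RATE_SQL).fetchall()
for age_band, denial_rate in rows:
    print(f"{age_band}: {denial_rate:.0%}")
```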

How many submissions can I make?

No limit. Each unique business question counts as one entry. Bulk submissions with diverse domains and difficulty levels are especially valued.

What if my SQL has dialect-specific syntax?

Specify your db_type correctly (e.g. BigQuery, Snowflake). Dialect-specific functions like DATE_DIFF, DATEADD, FORMAT_DATE are expected and kept as-is.

Can I submit without the SQL?

Yes — SQL is optional. Entries without SQL are still accepted if all other required fields are complete and high quality.

What if my schema differs from the standard one?

Note the variation in your submission. Non-standard tables are accepted as long as they follow fact/dimension naming conventions.

How long does review take?

Target turnaround is 7 days. Complex or ambiguous entries may take longer. You’ll receive feedback directly in your Discussion thread.