Getting Started
Note
📊 Open Dataset · SQL · Multi-Domain · Community Contributed
Help build the largest open Text-to-SQL dataset for real-world business analytics. Every query you submit trains better AI for real business problems — across any domain.
Who Should Contribute
Data Analysts writing SQL against business data warehouses
BI Engineers building dashboards and reports
Data Scientists working on NLP and text-to-SQL research
Domain Experts (Healthcare, Finance, SaaS, Manufacturing, Supply Chain)
Anyone who has turned a business question into a SQL query
Supported Domains
Submissions accepted across all domains:
| Domain | Example Use Cases |
|---|---|
| Retail | Sales revenue, inventory turnover, return rates, customer segmentation |
| Healthcare | Readmission rates, claims denial, patient outcomes, utilization |
| HighTech (SaaS) | Churn rate, feature adoption, ARR, funnel conversion, DAU/MAU |
| Finance | Portfolio performance, risk exposure, transaction anomalies, alpha |
| Manufacturing | OEE, downtime analysis, yield rate, defect tracking |
| Supply Chain | Supplier lead time, SLA breach rate, stock-out frequency |
| Other | Any domain with structured SQL data and real business questions |
Supported Databases
Submissions accepted for all major SQL engines:
BigQuery · Snowflake · Redshift · PostgreSQL · MySQL · Oracle · Azure Synapse · Other
Specify your db_type accurately — dialect differences (e.g. DATE_DIFF vs DATEDIFF)
are expected and valuable for the dataset.
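To illustrate the kind of dialect difference mentioned above, here is a minimal sketch that builds a "days between two dates" expression for a few engines. The helper name `days_between_expr` and the column names are hypothetical and not part of the form or the dataset tooling; the PostgreSQL template assumes both columns are of type DATE.

```python
# Hypothetical helper: names here are illustrative only, not part of the
# submission form or any dataset tooling.
DAY_DIFF_TEMPLATES = {
    "BigQuery": "DATE_DIFF({end}, {start}, DAY)",
    "Snowflake": "DATEDIFF(day, {start}, {end})",
    "MySQL": "DATEDIFF({end}, {start})",
    "PostgreSQL": "({end} - {start})",  # assumes both columns are of type DATE
}


def days_between_expr(db_type: str, start: str, end: str) -> str:
    """Return a SQL expression for the whole days from `start` to `end`."""
    try:
        template = DAY_DIFF_TEMPLATES[db_type]
    except KeyError:
        raise ValueError(f"no template for db_type {db_type!r}") from None
    return template.format(start=start, end=end)


if __name__ == "__main__":
    for engine in DAY_DIFF_TEMPLATES:
        print(engine, "->", days_between_expr(engine, "order_date", "ship_date"))
```

Note that the argument order differs between engines, which is one reason an accurate db_type matters for anyone reading or validating the query.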
How to Submit (~5 minutes per entry)
All contributions go through the UI Form. The dataset repository is managed privately by maintainers — contributors submit via the form and maintainers handle ingestion.
Step 1 — Open the Form
Open the Query Entry Form in your browser. No login or account needed.
Step 2 — Select Your Domain
Choose your domain from the dropdown: Retail, Healthcare, HighTech (SaaS), Finance, Manufacturing, Supply Chain, or Other.
Step 3 — Fill Required Fields
Complete all required sections. The Live JSON panel on the right updates in real time as you type. See Data Spec for full field reference.
Required sections:
Meta — difficulty, db_type, domain
Business Question — how a real user would ask it
Business Context — who needs this and why
Metrics & Aggregation — KPI names + formulas
Schema Tables — fact and dimension tables used
Data Model Layers — hierarchies, aggregations, snapshots
Optional but strongly recommended:
Chain of Thought — step-by-step reasoning
SQL Answer — actual query (significantly improves quality)
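As a rough picture of what a draft entry might look like before submission, the sketch below models the required sections as a Python dict and reports which ones are still empty. The key names (`meta`, `business_question`, and so on) are assumptions made for illustration; the actual field names are defined in the Data Spec and may differ.

```python
# Assumed key names for the six required sections listed above.
# The real field names live in the Data Spec and may differ.
REQUIRED_SECTIONS = [
    "meta",
    "business_question",
    "business_context",
    "metrics",
    "schema_tables",
    "data_model_layers",
]


def missing_sections(entry: dict) -> list[str]:
    """Return the required sections that are absent or empty in `entry`."""
    return [name for name in REQUIRED_SECTIONS if not entry.get(name)]


# Illustrative draft; all values are made up for this example.
draft_entry = {
    "meta": {"difficulty": "Easy", "db_type": "PostgreSQL", "domain": "Retail"},
    "business_question": "What were our total sales last month?",
    "business_context": "Finance lead preparing the monthly revenue review.",
    "metrics": [{"name": "Total Sales", "formula": "sum of net sales"}],
    "schema_tables": ["fact_sales", "dim_date"],
    # "data_model_layers" is intentionally left out to show the check firing.
}

if __name__ == "__main__":
    print(missing_sections(draft_entry))
```

A check like this only mirrors the required/optional split described above; it says nothing about the quality of what is filled in.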
Step 4 — Copy and Submit
Once all fields are complete, click the Submit button in the form. Your entry will be sent directly to the maintainers for review.
Tip
Best submissions have clear business context, realistic KPIs, and SQL that actually runs. See Examples before writing your first entry.

What Makes a Good Submission
Strong submissions have:
A business question written the way a non-technical stakeholder would ask it
Context that explains who needs this data and what decision it drives
KPIs with clear, plain-English aggregation formulas (avoid SQL jargon in KPI names)
SQL that runs cleanly against the standard schema or your specified db_type
Chain of thought that walks through the reasoning step by step
Weak submissions often have:
Vague instructions like “get sales data” with no business context
KPI names that are just SQL expressions (SUM(net_sales)) instead of business terms
Missing or mismatched metric/formula pairs
SQL with syntax errors or non-standard table names without explanation
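One of the weak patterns above, KPI names that are really SQL expressions, can be caught with a simple self-check before submitting. The sketch below is a heuristic under stated assumptions: it only looks for common aggregate function calls, and the function name `looks_like_sql` is hypothetical rather than anything the form or the maintainers use.

```python
import re

# Heuristic: an aggregate function name followed by "(" suggests the KPI
# name is a SQL expression rather than a business term.
_SQL_CALL = re.compile(r"\b(SUM|AVG|COUNT|MIN|MAX)\s*\(", re.IGNORECASE)


def looks_like_sql(kpi_name: str) -> bool:
    """Return True if `kpi_name` appears to contain a SQL aggregate call."""
    return bool(_SQL_CALL.search(kpi_name))


if __name__ == "__main__":
    for name in ["SUM(net_sales)", "Net Sales", "Return Rate"]:
        print(f"{name!r}: {looks_like_sql(name)}")
```

A flagged name such as `SUM(net_sales)` would usually read better as a business term like "Net Sales", with the aggregation described in the formula field instead.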
Difficulty Guide
Use this as a reference when selecting difficulty:
| Level | Typical Pattern | Examples |
|---|---|---|
| Easy | Single table, simple filter + aggregate | Total sales last month, top 10 products by revenue |
| Medium | 2–3 table joins, date windows, basic window functions | YoY revenue comparison, customer segment breakdown, return rate |
| Hard | Multi-CTE, advanced window functions, complex conditions | Seasonality index, OEE score, churn funnel, portfolio alpha |
| Expert | Recursive CTEs, nested window functions, multi-step derivations | Cohort retention curves, graph traversal, multi-period attribution |
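For a sense of scale, an Easy entry such as "top 10 products by revenue" might look like the query below: one table, one aggregate, one sort. This is a sketch only. The table `fact_sales`, its columns, and the sample rows are invented for illustration, and SQLite is used here just so the query can be run locally without a warehouse.

```python
import sqlite3

# In-memory database with a hypothetical single fact table and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (product_name TEXT, net_sales REAL)")
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?)",
    [("Widget", 120.0), ("Gadget", 80.0), ("Widget", 30.0), ("Gizmo", 200.0)],
)

# Easy pattern: single table, aggregate, order, limit.
EASY_SQL = """
SELECT product_name, SUM(net_sales) AS total_revenue
FROM fact_sales
GROUP BY product_name
ORDER BY total_revenue DESC
LIMIT 10
"""

top_products = conn.execute(EASY_SQL).fetchall()

if __name__ == "__main__":
    for product_name, total_revenue in top_products:
        print(product_name, total_revenue)
```

Once a query needs joins to dimension tables or a date window, it most likely belongs at Medium or above.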
Examples
Click any example below to jump to the full walkthrough:
Retail
Healthcare
HighTech (SaaS)
Finance
Manufacturing
Supply Chain
FAQ
Do I need to know SQL to contribute?
SQL is optional but strongly recommended. Entries with SQL are higher quality and more useful for model training. See Examples for reference before writing.
Can I submit from non-retail domains?
Yes — all domains are welcome. Use equivalent fact/dimension table naming conventions
(e.g. fact_claims, dim_patient) following the star schema pattern in Schema Reference.
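A quick way to check table names against this convention is sketched below. It assumes the convention means a `fact_` or `dim_` prefix, which is inferred from the examples above rather than from the Schema Reference itself; the function name `unprefixed_tables` is hypothetical.

```python
# Assumption: "fact/dimension naming" means a fact_ or dim_ prefix,
# as in the fact_claims and dim_patient examples.
STAR_SCHEMA_PREFIXES = ("fact_", "dim_")


def unprefixed_tables(tables: list[str]) -> list[str]:
    """Return the table names that do not start with fact_ or dim_."""
    return [t for t in tables if not t.lower().startswith(STAR_SCHEMA_PREFIXES)]


if __name__ == "__main__":
    print(unprefixed_tables(["fact_claims", "dim_patient", "claims_raw"]))
```

Any name this returns is a candidate for renaming or for a short note in the submission explaining the variation.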
How many submissions can I make?
No limit. Each unique business question counts as one entry. Bulk submissions with diverse domains and difficulty levels are especially valued.
What if my SQL has dialect-specific syntax?
Specify your db_type correctly (e.g. BigQuery, Snowflake). Dialect-specific
functions like DATE_DIFF, DATEADD, FORMAT_DATE are expected and kept as-is.
Can I submit without the SQL?
Yes — SQL is optional. Entries without SQL are still accepted if all other required fields are complete and high quality.
What if my schema differs from the standard one?
Note the variation in your submission. Non-standard tables are accepted as long as they follow fact/dimension naming conventions.
How long does review take?
Target turnaround is 7 days. Complex or ambiguous entries may take longer. You’ll receive feedback directly in your Discussion thread.
See also