Large Language Models (LLMs) are transforming how we interact with structured data, enabling users to query databases using natural language. This shift is particularly evident in the Text-to-SQL task, where models translate human questions into SQL queries. However, translating complex queries accurately remains a challenge for current models.
My master's thesis investigates how to boost LLM performance on the Text-to-SQL task not by increasing model size or pre-training data, but by scaling inference-time compute. Specifically, I explore techniques such as Best-of-N sampling, Majority Voting, and using an LLM as a judge to select the best SQL query from multiple candidates.
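As a minimal sketch of what this selection loop looks like for SQL, here is Best-of-N sampling combined with execution-based Majority Voting: candidates whose execution results agree form a vote, and one query from the largest group wins. The `generate` and `execute` helpers are placeholders for illustration, not the thesis code.

```python
import collections

def best_of_n(question: str, schema: str, generate, execute, n: int = 8) -> str:
    """Sample n candidate SQL queries and select one by majority voting
    over execution results.

    `generate` is assumed to call the LLM with temperature > 0 so the
    candidates differ; `execute` runs a query against the database and
    returns a hashable result set, or None if the query fails.
    """
    candidates = [generate(question, schema) for _ in range(n)]

    # Group candidates by the result they produce; discard failed queries.
    buckets = collections.defaultdict(list)
    for sql in candidates:
        result = execute(sql)
        if result is not None:
            buckets[result].append(sql)

    if not buckets:
        return candidates[0]  # every candidate failed; fall back to the first

    # The most common execution result wins; return one query that produced it.
    winning_group = max(buckets.values(), key=len)
    return winning_group[0]
```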
The thesis is structured around:
- An overview of Text-to-SQL systems and their limitations;
- A detailed comparison of decoding strategies (greedy, beam search, random sampling);
- Evaluation of inference-time techniques like Best-of-N with heuristic filters, LLM judges, and progressive refinement;
- A custom implementation of an Outcome Reward Model (ORM) to judge SQL output quality (a minimal sketch follows this list);
- Extensive benchmarking on the BIRD dataset using state-of-the-art models like OMNI-SQL and IBM Granite.
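To make the ORM judging step concrete, the sketch below scores a (question, SQL) pair with an encoder carrying a single-logit classification head, assumed to be fine-tuned on correct/incorrect query examples. The checkpoint name is a placeholder and the architecture is an assumption for illustration; the thesis implementation may differ.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoint: in practice this would be an encoder fine-tuned
# to classify (question, SQL) pairs as correct or incorrect.
MODEL_NAME = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)
model.eval()

def orm_score(question: str, sql: str) -> float:
    """Score one (question, SQL) pair; higher means more likely correct."""
    inputs = tokenizer(question, sql, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logit = model(**inputs).logits.squeeze()
    return torch.sigmoid(logit).item()

def select_best(question: str, candidates: list[str]) -> str:
    """Return the candidate the ORM considers most likely correct."""
    return max(candidates, key=lambda sql: orm_score(question, sql))
```

This mirrors the Best-of-N loop above, but replaces majority voting with a learned judge, which also works when candidates cannot be executed cheaply.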
The results demonstrate that small models, when enhanced with inference-time scaling, can rival much larger models, highlighting a promising direction for cost-effective NLP applications.
Acknowledgments
I would like to thank my supervisor Prof. Fedelucio Narducci and co-supervisor Dr. Dario Di Palma for their continuous support and guidance throughout this research.