Large Language Models (LLMs) are transforming how we interact with structured data, enabling users to query databases using natural language. This shift is particularly evident in the Text-to-SQL task, where models translate human questions into SQL queries. However, translating complex queries accurately remains a challenge for current models.
My master's thesis investigates how to boost LLM performance on the Text-to-SQL task not by increasing model size or pre-training data, but by scaling inference-time compute. Specifically, I explore techniques such as Best-of-N sampling, Majority Voting, and using an LLM as a judge to select the best SQL query from multiple candidates.
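As a minimal sketch of what this selection loop looks like for SQL, here is Best-of-N sampling combined with execution-based Majority Voting: candidates whose execution results agree form a vote, and one query from the largest group wins. The `generate` and `execute` helpers are placeholders for illustration, not the thesis code.

```python
import collections

def best_of_n(question: str, schema: str, generate, execute, n: int = 8) -> str:
    """Sample n candidate SQL queries and select one by majority voting
    over execution results.

    `generate` is assumed to call the LLM with temperature > 0 so the
    candidates differ; `execute` runs a query against the database and
    returns a hashable result set, or None if the query fails.
    """
    candidates = [generate(question, schema) for _ in range(n)]

    # Group candidates by the result they produce; discard failed queries.
    buckets = collections.defaultdict(list)
    for sql in candidates:
        result = execute(sql)
        if result is not None:
            buckets[result].append(sql)

    if not buckets:
        return candidates[0]  # every candidate failed; fall back to the first

    # The most common execution result wins; return one query that produced it.
    winning_group = max(buckets.values(), key=len)
    return winning_group[0]
```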
The thesis is structured around:
- An overview of Text-to-SQL systems and their limitations;
- A detailed comparison of decoding strategies (greedy, beam search, random sampling);
- Evaluation of inference-time techniques like Best-of-N with heuristic filters, LLM judges, and progressive refinement;
- A custom implementation of an Outcome Reward Model (ORM) to judge SQL output quality (a minimal sketch follows this list);
- Extensive benchmarking on the BIRD dataset using state-of-the-art models like OMNI-SQL and IBM Granite.
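To make the ORM judging step concrete, the sketch below scores a (question, SQL) pair with an encoder carrying a single-logit classification head, assumed to be fine-tuned on correct/incorrect query examples. The checkpoint name is a placeholder and the architecture is an assumption for illustration; the thesis implementation may differ.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoint: in practice this would be an encoder fine-tuned
# to classify (question, SQL) pairs as correct or incorrect.
MODEL_NAME = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)
model.eval()

def orm_score(question: str, sql: str) -> float:
    """Score one (question, SQL) pair; higher means more likely correct."""
    inputs = tokenizer(question, sql, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logit = model(**inputs).logits.squeeze()
    return torch.sigmoid(logit).item()

def select_best(question: str, candidates: list[str]) -> str:
    """Return the candidate the ORM considers most likely correct."""
    return max(candidates, key=lambda sql: orm_score(question, sql))
```

This mirrors the Best-of-N loop above, but replaces majority voting with a learned judge, which also works when candidates cannot be executed cheaply.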
The results demonstrate that small models, when enhanced with inference-time scaling, can rival much larger models, highlighting a promising direction for cost-effective NLP applications.
Acknowledgments
I would like to thank my supervisor Prof. Fedelucio Narducci and co-supervisor Dr. Dario Di Palma for their continuous support and guidance throughout this research.