Thompson Sampling Via Fine-Tuning (ToSFiT) of LLMs Achieves Scalable Bayesian Optimization in Complex Discrete Spaces
Researchers from ETH Zürich and IBM Research, Zurich have introduced Thompson Sampling via Fine-Tuning (ToSFiT), a notable advance in optimization algorithms. The strategy harnesses the power of large language models (LLMs) to search enormous, complex spaces where conventional gradient-based approaches typically fall short, and it offers a scalable approach to Bayesian optimization (BO) that sidesteps the computationally costly acquisition-function maximization step.
The method retains strong theoretical performance guarantees while substantially improving efficiency in real-world applications, gradually adapting an LLM to reflect the growing understanding of the search space. The team behind the work comprises Nicolas Menet, Aleksandar Terzić, and Andreas Krause of ETH Zürich, together with Abbas Rahimi of IBM Research, Zurich.
Overcoming the Optimization Hurdle in Discrete Domains
Bayesian optimization is a central algorithmic framework for automated discovery and large-scale experimental design when reward-function evaluations are expensive or time-consuming. BO maintains a posterior distribution over the unknown reward and uses this statistical model to direct the search toward promising configurations. Traditionally, new candidates are chosen by maximizing an acquisition function that strikes a balance between exploitation (refining current solutions) and exploration (trying out new options).
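As a generic illustration of this loop (a minimal sketch with hypothetical `posterior` and `evaluate` functions, using a UCB-style acquisition rather than anything from the paper):

```python
import numpy as np

def ucb(mean, std, beta=2.0):
    # Upper confidence bound: favors points with a high posterior mean
    # (exploitation) or high posterior uncertainty (exploration).
    return mean + beta * std

def bo_step(candidates, posterior, evaluate):
    # posterior(candidates) -> (mean, std) arrays from the reward model.
    mean, std = posterior(candidates)
    x = candidates[int(np.argmax(ucb(mean, std)))]  # acquisition maximization
    y = evaluate(x)  # expensive reward evaluation (lab experiment, simulation)
    return x, y      # the pair (x, y) is then used to update the posterior
```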
Thompson sampling (TS) stands out among acquisition procedures because of its robust empirical performance and state-of-the-art convergence guarantees. TS draws a realization of the reward function from the posterior and chooses the point that maximizes it, in effect treating the sampled realization as the acquisition function.
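In code, the distinguishing feature of TS is that the acquisition function is itself random: one sampled reward function per step (again a hedged sketch; `posterior_sample` is a stand-in for drawing a function realization):

```python
import numpy as np

def thompson_step(candidates, posterior_sample, evaluate):
    # Draw ONE realization of the reward function from the posterior;
    # this random function plays the role of the acquisition function.
    f = posterior_sample()                    # f: candidate -> sampled reward
    scores = np.array([f(x) for x in candidates])
    x = candidates[int(np.argmax(scores))]    # argmax of the sampled function
    return x, evaluate(x)
```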
This maximization step, however, poses a fundamental problem in huge unstructured discrete domains, such as the space of amino acid sequences or of valid quantum-circuit code, where gradients are unavailable. Exhaustive search is out of the question: a protein search space with 20 amino acids and a maximum sequence length of 100 already surpasses the number of atoms in the observable universe. In these combinatorial spaces, conventional gradient-based techniques do not apply, and naive maximization would require iterating over every point.
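The arithmetic is easy to verify; counting only the sequences of exactly length 100:

```python
# 20 amino acids, length-100 sequences alone: 20**100 possibilities.
space = 20 ** 100
atoms = 10 ** 80   # common order-of-magnitude estimate for atoms in the universe
print(f"{space:.2e}")   # ~1.27e+130
print(space > atoms)    # True: exhaustive enumeration is hopeless
```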
ToSFiT: LLMs as Generative Optimizers
The researchers created ToSFiT to scale BO to these complex, high-dimensional domains. Rather than maximizing an acquisition function, ToSFiT uses a generative LLM to directly parameterize the probability of maximality (PoM), the probability that a candidate solution is optimal. Treating proposals sampled from this model as Thompson samples avoids costly acquisition-function maximization altogether.
ToSFiT is built on the Variational Bayesian Optimistic Sampling (VBOS) paradigm. Importantly, it begins the optimization process with a prompt-conditioned pre-trained language model, which provides a solid base of prior knowledge and speeds up learning. Through online fine-tuning, it then carefully adjusts the model parameters toward the posterior PoM using the VBOS objective.
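A highly simplified sketch of what one online fine-tuning step could look like, assuming the objective is approximated by a REINFORCE-style estimate of the KL divergence between the sampling policy and the posterior PoM (the `policy.sample` and `posterior_log_pom` interfaces are hypothetical, and the paper's exact VBOS objective differs in detail):

```python
import torch

def tosfit_step(policy, prompt, posterior_log_pom, optimizer, k=8):
    # Sample k candidate sequences from the prompt-conditioned LLM policy.
    seqs, logp = policy.sample(prompt, num_samples=k)   # logp: differentiable
    with torch.no_grad():
        target = posterior_log_pom(seqs)   # log prob. each sample is optimal
        weights = target - logp.detach()   # REINFORCE weights for the KL term
    # Nudge the policy toward the posterior PoM; a low learning rate in
    # `optimizer` preserves the pre-trained prior, as the paper stresses.
    loss = -(weights * logp).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```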
To compute the reward posterior in closed form and condition it on observations, the researchers implemented scalable Gaussian process (GP) inference with linear kernels over learned features. As a result, memory and computational complexity scale as Θ(dim(H)²) in the feature dimension rather than with the number of previous observations.
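Concretely, with features φ(x) ∈ R^d where d = dim(H), a GP with a linear kernel reduces to Bayesian linear regression, whose posterior can be kept as Θ(d²) sufficient statistics no matter how many observations have arrived. A minimal sketch (not the paper's code):

```python
import numpy as np

class LinearFeatureGP:
    """GP with a linear kernel over learned features = Bayesian linear regression."""

    def __init__(self, d, noise=0.1, prior_var=1.0):
        self.A = np.eye(d) / prior_var  # weight-precision matrix: Theta(d^2) memory
        self.b = np.zeros(d)
        self.noise_var = noise ** 2

    def update(self, phi, y):
        # Rank-one update per observation: Theta(d^2) cost, independent
        # of how many points have been observed so far.
        self.A += np.outer(phi, phi) / self.noise_var
        self.b += phi * y / self.noise_var

    def posterior(self, phi):
        # Closed-form predictive mean and variance at features phi.
        cov = np.linalg.inv(self.A)     # O(d^3), still independent of data size
        mean = cov @ self.b
        return float(phi @ mean), float(phi @ cov @ phi)
```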
Reinforcement learning techniques, notably the REINFORCE Leave-One-Out (RLOO) baseline, were used to stabilize the gradient estimates needed for fine-tuning the LLM. Notably, Group Relative Policy Optimization's (GRPO) advantage function is technically identical to standardized RLOO.
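The leave-one-out baseline compares each sampled candidate's reward with the mean of the other samples in its group; standardizing by the group's statistics on top of that yields GRPO's advantage, which is why the two coincide up to a positive scaling. A small illustration:

```python
import numpy as np

def rloo_advantages(rewards):
    # Each sample is baselined against the mean of the OTHER k-1 samples.
    r = np.asarray(rewards, dtype=float)
    k = len(r)
    return r - (r.sum() - r) / (k - 1)   # equals (k/(k-1)) * (r - r.mean())

def grpo_advantages(rewards, eps=1e-8):
    # GRPO standardizes rewards within the group: zero mean, unit std.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

rewards = [1.0, 0.5, 2.0, 0.1]
print(rloo_advantages(rewards))   # same direction as GRPO, scaled differently
print(grpo_advantages(rewards))
```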
Theoretical Guarantees and Policy Initialization
The study provides substantial theoretical support for ToSFiT. The researchers derived a novel regret bound for a variational formulation of Thompson sampling, showing that the cumulative regret scales with the maximal information gain (γT) rather than with the size of the search space (|X|). This markedly improves on earlier bounds for exact VBOS, which scaled as Õ(√(T|X|)) and are therefore vacuous in combinatorially huge domains. For a linear kernel in d dimensions, the new bound scales gracefully as O(d log T).
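Schematically (exact theorem statements and constants are in the paper; γT = O(d log T) is the standard information-gain rate for a linear kernel):

```latex
\text{prior bound (exact VBOS):}\quad
\tilde{O}\!\big(\sqrt{T\,\lvert\mathcal{X}\rvert}\big)
\qquad\longrightarrow\qquad
\text{new: regret controlled by } \gamma_T,
\quad \gamma_T = O(d \log T)\ \text{(linear kernel, } d \text{ dimensions)}.
```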
This theoretical analysis underscores how important careful adaptation is. The approximation error between the exact VBOS maximizer (πt) and the sampling policy (π̃t) risks overwhelming the cumulative regret. To address this, it is crucial to initialize ToSFiT through pre-training and prompt context, which ensures that the policy begins in the appropriate region of the probability simplex. Empirically, a strong initial policy yields markedly better performance, and careful adaptation (using low learning rates) is necessary to preserve this prior knowledge and prevent performance stagnation.
Validation Across Diverse Tasks
Empirical validation across three very different search problems confirmed ToSFiT’s sample efficiency, with minimal impact on computational cost.
- FAQ Response Refinement: A natural-language task, tackled with a Qwen3-1.7B model, that optimizes answer content for semantic alignment with an unknown ground-truth response.
- Thermally Stable Protein Search: The challenge here is to generate amino acid sequences that maximize thermal stability, a property crucial for drug development. Sequences were sampled with ProtGPT2, and the search space is exponentially large.
- Quantum Circuit Design: This task requires a Qwen2.5-Coder-1.5B model to navigate a large, discrete space of valid quantum programs and produce Qiskit circuits that prepare low-energy quantum states in unknown settings.
Across all experimental conditions, Unguided Generation, which uses no feedback, quickly plateaus at an unsatisfactory reward level. Post-Generation TS, a traditional BO technique over a predetermined subset of candidates, finds effective solutions quickly but is limited to its starting pool and saturates too soon. ToSFiT, by contrast, performs BO over the whole solution space and keeps finding candidates with higher rewards. It also demonstrated better exploration efficiency through optimism in the face of uncertainty, outperforming baselines such as Actor Critic and Soft Actor Critic.
Thompson sampling is also well suited to batched optimization, since it naturally produces diverse candidates. ToSFiT demonstrates this ability: batching greatly improves iteration efficiency, reaching target performance in fewer rounds, even though it slightly reduces sample efficiency. This matters when observations are time-consuming or delayed.
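Since each draw from the fine-tuned policy acts as an (approximate) Thompson sample, producing a batch amounts to repeated sampling; a schematic round with hypothetical interfaces:

```python
def batched_round(policy, prompt, evaluate_parallel, batch_size=16):
    # Repeated draws from the policy give a diverse batch of Thompson samples.
    batch, _ = policy.sample(prompt, num_samples=batch_size)
    rewards = evaluate_parallel(batch)  # slow or delayed observations run together
    return list(zip(batch, rewards))    # fed back into the posterior update
```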
The results confirm that complex, discrete search problems can be solved by combining principled Bayesian optimization with strong foundation models. To further lower computational cost, future work will explore jointly learned task-adaptive embeddings, more expressive reward models such as Bayesian neural networks, and restricting updates to just a subset of the generative model.