Current Position
I’m a Research Engineer at Meta (2022-present) working on Large Language Models. Previous work includes Llama 2, 3, and 4, and Galactica. I’ve worked across numerous areas, primarily post-training and alignment.
Hallucination Reduction and Factuality
- I’ve worked extensively on separating hallucination research from factuality research. In HalluLens: LLM Hallucination Benchmark we argued that work on hallucinations was unhelpfully entangled with factuality, and that few works separate extrinsic from intrinsic hallucinations. We released a series of novel extrinsic hallucination evaluations, which we combined with existing work to create a comprehensive benchmark.
- I was one half of the core factuality and hallucination reduction post-training team for Llama 3 (mainly the 3.1 release). We worked on the hallucination technique described in section 4.3.6 of The Llama 3 Herd of Models, which encourages faithfulness of Llama’s answers to its training data.
- I worked on a paper investigating a mechanistic technique to reduce hallucinations: Calibrating Verbal Uncertainty as a Linear Feature to Reduce Hallucinations (a toy sketch of the general linear-feature idea follows this list).
- I’ve recently been working on agentic retrieval with web search to improve factuality.
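To make the linear-feature idea above concrete, here is a minimal, self-contained sketch of the general recipe (illustrative only, not the paper’s exact method): estimate a “verbal uncertainty” direction as the difference of mean hidden states between hedged and confident generations, then read that feature off new activations and shift them toward a target level. The random data, dimensions, and steering rule are all stand-ins.

```python
import numpy as np

# Toy sketch of linear-feature steering (illustrative, not the paper's exact recipe).
# Pretend these are hidden states (n_examples, d_model) collected at one layer
# for generations that verbalize uncertainty vs. ones that sound confident.
rng = np.random.default_rng(0)
d_model = 64
h_uncertain = rng.normal(0.5, 1.0, size=(200, d_model))  # stand-in activations
h_confident = rng.normal(0.0, 1.0, size=(200, d_model))

# 1) Estimate the "verbal uncertainty" direction as a difference of means.
direction = h_uncertain.mean(axis=0) - h_confident.mean(axis=0)
direction /= np.linalg.norm(direction)

# 2) Read the feature: project an activation onto the direction to measure
#    how strongly the model is expressing uncertainty.
def uncertainty_score(h: np.ndarray) -> float:
    return float(h @ direction)

# 3) Steer: shift the activation along the direction so the verbalized
#    uncertainty matches a target level (e.g. the model's actual confidence).
def steer(h: np.ndarray, target: float) -> np.ndarray:
    return h + (target - uncertainty_score(h)) * direction

h_new = rng.normal(0.0, 1.0, size=d_model)
print(uncertainty_score(h_new), uncertainty_score(steer(h_new, target=2.0)))
```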
Reasoning
- I have worked on long chain-of-thought (CoT) reasoning:
- 2023: reward models (outcome reward models, ORMs, vs. process reward models, PRMs) for rejection sampling and RL.
- late 2024/early 2025: Structured (or parsed) rewards for rejection sampling and RL.
- Recently we released the paper What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT, which investigates the overall structure of CoT and which sub-structures increase the likelihood of reaching the correct answer. We also released Don’t Waste Mistakes: Leveraging Negative RL-Groups via Confidence Reweighting, which makes use of all-negative groups in GRPO via a confidence-weighted penalty on incorrect responses (a toy sketch follows this list).
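As a concrete illustration of the negative-groups idea, here is a toy numpy sketch (my own illustrative numbers and weighting, not the paper’s exact formulation). In GRPO, advantages are group-normalized rewards, so a group where every sampled response is wrong carries no gradient signal; a confidence-weighted penalty reinstates one by pushing down confidently wrong responses hardest.

```python
import numpy as np

# Toy sketch (illustrative, not the paper's exact formulation).
rewards = np.array([0.0, 0.0, 0.0, 0.0])     # an all-incorrect GRPO group
confidence = np.array([0.9, 0.6, 0.3, 0.1])  # e.g. mean token probability per response

def grpo_advantages(r: np.ndarray) -> np.ndarray:
    # Standard GRPO: advantage = group-normalized reward.
    return (r - r.mean()) / (r.std() + 1e-8)

print(grpo_advantages(rewards))  # -> all zeros: the group is wasted

# Confidence-weighted penalty (hypothetical weighting): rather than discard
# the group, give each wrong response a negative advantage proportional to
# how confident the model was in it.
def negative_group_advantages(conf: np.ndarray, scale: float = 1.0) -> np.ndarray:
    return -scale * conf / conf.sum()

print(negative_group_advantages(confidence))  # confidently wrong answers penalized most
```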
Previous Experience
- I’ve advised startups on Machine Learning:
- I was the Machine Learning Lead at Genie AI (2018-2022).
- Led a 2-year InnovateUK collaboration with Marta Kwiatkowska’s team at Oxford on explainability and robustness in NLP. We developed a novel way of explaining transformer models based on a game-based framework, published a method for assessing the robustness of machine learning models to adversarial examples using Monte Carlo tree search, and developed techniques to causally explain machine learning models’ decisions.
- Led a 2-year InnovateUK collaboration with the Computational Privacy Group at Imperial College London. We developed techniques to assess vulnerability to authorship attribution attacks on text (automatically determining the author of a piece of text from their writing style), and potential ways to mitigate these risks.
- Developed numerous proofs of concept for potential intelligent features and productionized the most promising into the Genie product.
- Worked with great advisors such as Jun Wang, former Supreme Court President Lord Neuberger, and Adam Ziegler, who created case.law.
- Led the technical due diligence aspects of several funding rounds.
- Previously founded a legal tech company through EF (LD10). We raised a bit of money and got some traction, but it was ultimately unsuccessful and we pulled the plug. I learned a lot, and remain plugged into the startup ecosystem.
- Did my Machine Learning master’s at Signal AI, where I researched reported speech detection in news (detecting quotations that are indirect and not within quotation marks).
- Cut my software teeth as a Java developer at IBlocks and as a software engineer at TNG Technology Consulting in Germany.
I’m an all-round technologist with extensive experience taking products from conception through development to production. I’m a polyglot software engineer who can write well-engineered software, handle DevOps, and deploy to production.
Education
- MSc Machine Learning, University College London (2016-2017) - Distinction
- MEng Engineering Science, Magdalen College, Oxford University (2010-2014) - 2.1
Publications
- Don’t Waste Mistakes: Leveraging Negative RL-Groups via Confidence Reweighting
- What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT
- Calibrating Verbal Uncertainty as a Linear Feature to Reduce Hallucinations
- HalluLens: LLM Hallucination Benchmark
- The Llama 3 Herd of Models
- Llama 2: Open Foundation and Fine-Tuned Chat Models
- Improving clinical trial design using interpretable machine learning based prediction of early trial termination
- Galactica: A Large Language Model for Science
- Assessing Robustness of Text Classification through Maximal Safe Radius Computation