Inference Time Tactics
A podcast exploring the emerging field of inference-time compute—the next frontier in AI performance. Hosted by the Neurometric team, we unpack how models reason, make decisions, and perform at runtime. For developers, researchers, and operators building AI infrastructure.
Episodes

4 days ago
In this episode of Inference Time Tactics, Rob, Cooper, and Byron sit down with Shawn Rogers, CEO of BARC US, to unpack fresh data from 421 organizations actively deploying AI in production. Shawn shares what separates the 20% of AI leaders from everyone else, why cost surprises are hitting harder than expected, and how the pressure to "just do AI" is causing companies to skip critical foundations—often to their detriment.
We talked about:
Why multi-model strategies and small language models are becoming essential for enterprise AI.
The seven foundational areas that help AI leaders deploy twice as many projects as everyone else.
Why 51% of deployments face unexpected cost overruns—and which expenses hit hardest.
Data quality jumping to the #1 challenge, affecting 44% of production deployments.
The IT satisfaction paradox: top resource at the start, lowest satisfaction scores at scale.
How responsible AI priorities shifted as human-in-the-loop dropped from 36% to 21%.
Resources Mentioned:
Lessons from the Leading Edge: Successful Delivery of AI/GenAI
https://barc.com/research/successful-ai-genai-delivery/
Connect with BARC:
Website: https://barc.com/
LinkedIn (Shawn Rogers): https://www.linkedin.com/in/shawnrogers/
Connect with Neurometric:
Website: https://www.neurometric.ai/
Substack: https://neurometric.substack.com/
X: https://x.com/neurometric/
Bluesky: https://bsky.app/profile/neurometric.bsky.social
Hosts:
Rob May
https://x.com/robmay
https://www.linkedin.com/in/robmay
Calvin Cooper
https://x.com/cooper_nyc_
https://www.linkedin.com/in/coopernyc
Byron Galbraith
https://x.com/bgalbraith
https://www.linkedin.com/in/byrongalbraith

Tuesday Dec 16, 2025
In this episode of Inference Time Tactics, Cooper and Byron break down NeuroMetric's Thinking Algorithm Leaderboard and what it reveals about building production-ready AI agents. They share why prompt engineering with a single model won't cut it for enterprise use cases, explore the impact of inference-time compute strategies, and discuss what they learned from testing 10 models across real CRM tasks—from surprising token inefficiency to catastrophic failures in SQL generation.
We talked about:
Why NeuroMetric built the first leaderboard combining models with inference-time compute strategies.
How Salesforce's CRMArena-Pro reflects real multi-step business tasks better than pure reasoning benchmarks.
The jagged frontier: no single model or technique dominates across all tasks.
Why GPT 20B was surprisingly token inefficient—twice as slow as GPT 120B for similar accuracy.
How GPT-5 nano's conversational style broke SQL generation tasks completely.
Trading accuracy for speed: two-model ensembles versus five-model ones, and saving 20+ seconds per task.
Throughput constraints as a hidden bottleneck when scaling to production volumes.
Future directions: LLM-guided search, task clustering, and compression to specialized small models.
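The ensemble tradeoff discussed above can be sketched in a few lines. This is a hypothetical illustration only: `ask_model`, the model names, and the canned answers are stand-ins for real API calls, not anything from the leaderboard. The point is that a two-model majority vote costs two calls where a five-model vote costs five.

```python
from collections import Counter

def ask_model(name: str, question: str) -> str:
    """Stand-in for a real model call; returns canned answers."""
    canned = {"model-a": "42", "model-b": "42", "model-c": "41",
              "model-d": "42", "model-e": "40"}
    return canned[name]

def ensemble(models: list[str], question: str) -> str:
    """Query each model and return the majority answer."""
    answers = [ask_model(m, question) for m in models]
    return Counter(answers).most_common(1)[0][0]

# Two calls vs. five calls for (here) the same majority answer.
small = ensemble(["model-a", "model-b"], "6 * 7?")
large = ensemble(["model-a", "model-b", "model-c", "model-d", "model-e"], "6 * 7?")
```

When the smaller ensemble agrees with the larger one often enough, the extra calls buy little accuracy at a real cost in latency.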
Resources Mentioned:
CRMArena-Pro from Salesforce:
https://www.salesforce.com/blog/crmarena-pro/
Thinking Algorithm Leaderboard:
https://leaderboard.neurometric.ai/
Connect with Neurometric:
Website: https://www.neurometric.ai/
Substack: https://neurometric.substack.com/
X: https://x.com/neurometric/
Bluesky: https://bsky.app/profile/neurometric.bsky.social
Hosts:
Calvin Cooper
https://x.com/cooper_nyc_
https://www.linkedin.com/in/coopernyc
Guest:
Byron Galbraith
https://x.com/bgalbraith
https://www.linkedin.com/in/byrongalbraith

Wednesday Nov 05, 2025
In this episode of Inference Time Tactics, Rob and Cooper from Neurometric sit down with Yash Sharma, an AI researcher whose work is reshaping how we understand model generalization. Yash recently completed his PhD at the Max Planck Institute for Intelligent Systems and has held research roles at Google Brain, Meta AI, Amazon, Borealis AI, and IBM Research. His studies on compositional generalization, adversarial robustness, and long-tail benchmarks reveal when and why models succeed—or fail—at reasoning beyond their training data.
If you’re designing inference-time systems, building agents that need reliability, or just want to understand what “generalization” actually means in practice, this conversation bridges deep theory with actionable insight—clear, technical, and strategically grounded.
Key Topics
What it really means for AI systems to generalize beyond their training data
Why large language models still fail in novel or unpredictable scenarios
How inference-time compute can both amplify and reveal generalization limits
What these limits mean for building reliable, agentic AI systems
How to benchmark generalization in real-world settings
Yash’s “Let It Wag!” benchmark for testing long-tail and under-represented concepts
Why genuine scientific breakthroughs (like curing cancer) require more than scaling test-time compute
Connect with Yash Sharma:
Yash Sharma
Let It Wag! Benchmark
Paper: Pretraining Frequency Predicts Compositional Generalization of CLIP (NeurIPS 2024 Workshop)
Connect with Neurometric:
Website: https://www.neurometric.ai/
Substack: https://neurometric.substack.com/
X: https://x.com/neurometric/
Bluesky: https://bsky.app/profile/neurometric.bsky.social
Rob May
https://x.com/robmay
https://www.linkedin.com/in/robmay
Calvin Cooper
https://x.com/cooper_nyc_
https://www.linkedin.com/in/coopernyc

Friday Oct 03, 2025
In this episode of Inference Time Tactics, Rob, Cooper, and Byron sit down with Prashanth Velidandi, co-founder of InferX, to explore how serverless inference is tackling the AI “cold start problem.” They dig into why 90% of the model lifecycle happens at inference—not training—and how cold starts and idle GPUs are crippling efficiency. Prashanth explains InferX’s snapshot technology, what it takes to deliver sub-second cold starts, and why inference infrastructure—not just models—will define the next era of AI.
We talked about:
Why inference represents 90% of the model lifecycle, compared to the training focus most of the industry has.
How cold starts and idle GPUs create massive inefficiencies in AI infrastructure.
InferX’s snapshot technology that enables sub-second model loading and higher GPU utilization.
The challenges of explaining and selling deeply technical infrastructure to the market.
Why enterprises care about inference efficiency, cost, and reliability more than model size.
How serverless inference abstracts away infrastructure complexity for developers.
The coming explosion of multi-agent systems and billions of specialized models.
Why sustainable innovation in AI will come from inference infrastructure.
Connect with InferX
Prashanth Velidandi
https://inferx.net
https://x.com/pmv_inferx
https://www.linkedin.com/in/prashanth-velidandi-98629b115
Connect with Neurometric:
Website: https://www.neurometric.ai/
Substack: https://neurometric.substack.com/
X: https://x.com/neurometric/
Bluesky: https://bsky.app/profile/neurometric.bsky.social
Rob May
https://x.com/robmay
https://www.linkedin.com/in/robmay
Calvin Cooper
https://x.com/cooper_nyc_
https://www.linkedin.com/in/coopernyc
Byron Galbraith
https://x.com/bgalbraith
https://www.linkedin.com/in/byrongalbraith

Tuesday Sep 30, 2025
Check out the latest episode of Inference Time Tactics. Our guest is Pawan Deshpande, founder, product leader, and angel investor in companies like Anthropic and Toast, with roles at Google, Scale AI, and Domino Data Lab.
Hosts Rob May & Calvin Cooper sit down with Pawan to cover:
Early MIT NLP research applied to today’s inference-time tradeoffs
How to evaluate enterprise agents in practice
Training data plus inference filtering in real deployments
Open source adoption realities in the enterprise
Where durable value lives in the stack
Connect with Pawan Deshpande
Website: https://pawandeshpande.com/
Academic / Research Works & Thesis:
Decoding Algorithms for Complex Natural Language Tasks (MIT thesis, 2007)
Randomized Decoding for Selection-and-Ordering Problems
Connect with Neurometric:
Website: https://www.neurometric.ai/
Substack: https://neurometric.substack.com/
X: https://x.com/neurometric/
Bluesky: https://bsky.app/profile/neurometric.bsky.social
Hosts:
Rob May
https://x.com/robmay
https://www.linkedin.com/in/robmay
Calvin Cooper
https://x.com/cooper_nyc_
https://www.linkedin.com/in/coopernyc

Monday Sep 22, 2025
In this episode of Inference Time Tactics, Rob, Cooper, Byron, and Dave share product updates for Neurometric’s Inference Time Compute Studio and what they reveal about the shift from single models to full AI systems. They discuss why wiring models together at scale is so challenging, how a drag-and-drop interface can make experimenting with inference strategies easier, and why open source, benchmarking, and community feedback are key to building the next generation of composable AI systems.
We talked about:
Why AI is shifting from single models to full systems and what that means for builders.
The challenges of wiring multiple models together at scale and running them in production.
How Neurometric’s drag-and-drop interface simplifies testing inference strategies without code.
Why open-source models are becoming increasingly competitive with commercial solutions.
The lack of standardization in AI stacks and why the industry still feels like the “early web” era.
How inference-time compute can balance performance, cost, and latency across different tasks.
Why benchmarks alone are insufficient and how domain-specific evaluations can fill the gap.
The role of community feedback in shaping priorities for benchmarks and new primitives.
Connect with Neurometric:
Website: https://www.neurometric.ai/
Substack: https://neurometric.substack.com/
X: https://x.com/neurometric/
Bluesky: https://bsky.app/profile/neurometric.bsky.social
Hosts:
Rob May
https://x.com/robmay
https://www.linkedin.com/in/robmay
Calvin Cooper
https://x.com/cooper_nyc_
https://www.linkedin.com/in/coopernyc
Guests:
Byron Galbraith
https://x.com/bgalbraith
https://www.linkedin.com/in/byrongalbraith
Dave Rauchwerk
https://x.com/elevenarms
https://www.linkedin.com/in/dave-rauchwerk-0ba82822

Monday Sep 08, 2025
In this episode of Inference Time Tactics, Rob, Cooper, and Byron explore Salesforce’s CRMArena-Pro benchmark and what it reveals about the limits of enterprise AI agents. They share why benchmark scores often fail in production, how inference-time tactics like best-of-N can improve reliability, and what NeuroMetric is building to make eval easier—from an ITC Test Engine to a drag-and-drop interface for rapid visualization and experimentation.
We talked about:
Why Salesforce’s CRMArena-Pro benchmark highlights the gap between lab benchmarks and real-world agent reliability.
How leading models perform inconsistently across single-turn and multi-turn enterprise tasks.
Why benchmark scores are weak predictors of operational success in production.
The role of inference-time tactics in reducing variance and improving stability.
NeuroMetric’s new platform: ITC Test Engine and drag-and-drop interface for experimentation.
Challenges in building agentic systems, from database integration to managing multi-prompt complexity.
Why large language models’ stochastic nature conflicts with business demands for reliability.
Latency, cost, and rate limits as major bottlenecks in scaling agentic workflows.
The limits of “vibe testing” and why rigorous evaluation frameworks are essential.
How Google’s Stacks tool speeds up evaluation with LLM-as-judge, and why it still falls short for enterprise needs.
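The inference-time tactic named above, best-of-N, pairs naturally with the LLM-as-judge idea also mentioned in this episode. The sketch below is a minimal illustration under stated assumptions: `generate` stands in for a stochastic model call and `judge` for a verifier or LLM judge; neither is a real API.

```python
import random

def generate(prompt: str, seed: int) -> str:
    """Stand-in for a stochastic model call: output varies with the seed."""
    rng = random.Random(seed)
    return f"{prompt} -> candidate quality {rng.randint(0, 9)}"

def judge(answer: str) -> float:
    """Stand-in for an LLM-as-judge or verifier scoring a candidate."""
    return float(answer.rsplit(" ", 1)[-1])

def best_of_n(prompt: str, n: int = 5) -> str:
    """Sample n candidates and keep the one the judge scores highest."""
    candidates = [generate(prompt, seed) for seed in range(n)]
    return max(candidates, key=judge)

answer = best_of_n("summarize this case record", n=5)
```

Because each extra sample adds cost and latency, best-of-N trades compute for a reduction in the output variance the episode identifies as the core reliability problem.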
Resources Mentioned:
CRMArena-Pro from Salesforce:
https://www.salesforce.com/blog/crmarena-pro/
Connect with Neurometric:
Website: https://www.neurometric.ai/
Substack: https://neurometric.substack.com/
X: https://x.com/neurometric/
Bluesky: https://bsky.app/profile/neurometric.bsky.social
Hosts:
Rob May
https://x.com/robmay
https://www.linkedin.com/in/robmay
Calvin Cooper
https://x.com/cooper_nyc_
https://www.linkedin.com/in/coopernyc
Guest:
Byron Galbraith
https://x.com/bgalbraith
https://www.linkedin.com/in/byrongalbraith

Friday Aug 29, 2025
In this episode of Inference Time Tactics, Rob and Cooper unpack the launch of GPT-5 and what OpenAI’s new routing layer signals about the shifting AI landscape. They explore the tradeoffs of cost, latency, and accuracy, zoom out to programmable inference in an agent-driven world, and track the ripple effects on chips, data centers, and energy use.
We talked about:
Why GPT-5’s launch felt more like refinement than a revolution in AI progress.
How OpenAI’s new routing layer reframes the race around inference control.
The tradeoffs routing enables between cost, latency, and accuracy across models.
Why the “one model to rule them all” view is giving way to multi-model orchestration.
The strategic role of programmable inference in an agent-driven world.
How router companies are becoming a strategic layer in the AI technology stack.
The impact of inference compute on chips, accelerators, and data center design.
Why energy use at scale is driving a push for more efficient AI systems.
Why inference optimization may be the next big competitive edge.
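The routing tradeoff described above can be made concrete with a toy dispatcher. The model table, thresholds, and numbers below are illustrative assumptions for this sketch, not OpenAI's actual routing logic: the router sends hard queries to a stronger, slower model only when the latency budget allows.

```python
# Toy routing layer: per request, pick a model by cost/latency/accuracy.
# All figures are made up for illustration.
MODELS = {
    "fast":   {"cost_per_call": 0.001, "latency_ms": 200,  "accuracy": 0.80},
    "strong": {"cost_per_call": 0.010, "latency_ms": 1500, "accuracy": 0.93},
}

def route(difficulty: float, latency_budget_ms: int) -> str:
    """Route hard queries to the strong model when the budget allows."""
    if difficulty > 0.5 and MODELS["strong"]["latency_ms"] <= latency_budget_ms:
        return "strong"
    return "fast"

choice = route(difficulty=0.9, latency_budget_ms=2000)
```

Even this crude policy shows why routing reframes the race: the interesting decisions move from model weights to the dispatch rules sitting in front of them.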
Connect with Neurometric:
Website: https://www.neurometric.ai/
Substack: https://neurometric.substack.com/
X: https://x.com/neurometric/
Bluesky: https://bsky.app/profile/neurometric.bsky.social
Hosts:
Rob May
https://x.com/robmay
https://www.linkedin.com/in/robmay
Calvin Cooper
https://x.com/cooper_nyc_
https://www.linkedin.com/in/coopernyc

Monday Aug 18, 2025
In this episode of Inference Time Tactics, Rob, Cooper, and CTO Byron unpack Apple’s “Illusion of Thinking” paper—why it split the AI community, what it reveals about reasoning model limits, and how hidden thinking traces shape performance. They share insights from building an open-source tool to reproduce the study, explain why models loop, overthink, or stall, and outline what it will take to build more reliable reasoning systems for real-world use.
We talked about:
Why Apple’s Illusion of Thinking paper sparked heated debate in the AI community.
How reasoning models work, including hidden “thinking” phases and token budget limits.
Key findings on when reasoning improves results, when it degrades them, and where it stalls.
Reasons models loop, overthink, or abandon tasks.
Building an open-source tool to replicate the study and test local reasoning models.
What real-time reasoning traces reveal about model behavior and limits.
Challenges in scoring reasoning quality and treating “I don’t know” as a valid output.
Why reasoning models must be matched carefully to specific tasks.
The ongoing debate over scaling vs. new architectures for advancing reasoning.
Developing a benchmarking platform to help enterprises choose models for IP-sensitive applications.
Resources Mentioned:
Illusion of Thinking Paper
https://machinelearning.apple.com/research/illusion-of-thinking
Neurometric Illusion of Thinking Tool
https://github.com/NeurometricAI/illusion-of-thinking
Connect with Neurometric:
Website: https://www.neurometric.ai/
Substack: https://neurometric.substack.com/
X: https://x.com/neurometric/
Bluesky: https://bsky.app/profile/neurometric.bsky.social
Hosts:
Rob May
https://x.com/robmay
https://www.linkedin.com/in/robmay
Calvin Cooper
https://x.com/cooper_nyc_
https://www.linkedin.com/in/coopernyc
Guest:
Byron Galbraith
https://x.com/bgalbraith
https://www.linkedin.com/in/byrongalbraith

Tuesday Aug 12, 2025
In this episode of Inference Time Tactics, Rob and Cooper dig into the strategic trade-offs driving a major shift in AI: why some enterprises start with closed models like OpenAI or Anthropic, then move to open-source stacks. The team breaks down the challenges of switching and how inference-time compute is becoming a competitive differentiator. They also unpack why pricing is shifting, how governance will evolve for this new layer, and what Rob learned from reviewing 250 research papers on reasoning algorithms.
We talked about:
Insights from reviewing 250 research papers on reasoning algorithms.
Why enterprises start with closed models like OpenAI or Anthropic before moving to open-source stacks.
Challenges of switching stacks, including model fragmentation, capability gaps, and hardware choices.
Cost-performance trade-offs when choosing inference architectures.
How inference-time configuration can become a competitive differentiator.
The role of pricing shifts and vendor lock-in in AI adoption.
Emerging governance considerations for inference workflows.
The growing variety and complexity of inference-time techniques.
Benchmarking challenges for multi-step and reasoning tasks.
Why the lack of best practices makes inference optimization harder to operationalize.
Connect with Neurometric:
Website: https://www.neurometric.ai/
Substack: https://neurometric.substack.com/
X: https://x.com/neurometric/
Bluesky: https://bsky.app/profile/neurometric.bsky.social
Hosts:
Rob May
https://x.com/robmay
https://www.linkedin.com/in/robmay
Calvin Cooper
https://x.com/cooper_nyc_
https://www.linkedin.com/in/coopernyc



