Inference Index
Intelligence

Signal, not noise.

Releases, benchmarks, and analysis from a curated list of labs, newsletters, and community feeds. Updated every four hours.

analysis · OpenAI

New OpenAI Academy courses for the next era of work

OpenAI introduces three Academy courses that help people build practical AI skills, create repeatable workflows, and apply agents in everyday work.

openai.com ↗
analysis · Hugging Face

Thousand Token Wood: shipping a multi-agent economy on a 3B model

analysis · Hugging Face

CyberSecQwen-4B: Why Defensive Cyber Needs Small, Specialized, Locally-Runnable Models

analysis · Simon Willison

OpenAI WebRTC Audio Session, now with document context

I built the first version of this tool in December 2024 to try out the then-new OpenAI WebRTC API for interacting with their realtime audio…

analysis · Microsoft Research

Ire identifies another LOTUSLITE specimen

Project Ire examined a timely malware sample and determined its intent through reverse engineering, identifying LOTUSLITE characteristics even as most major EDR tools did not detect it.

benchmark · Hugging Face

olmo-eval: An evaluation workbench for the model development loop

analysis · OpenAI

BBVA puts AI at the core of banking with OpenAI

Learn how BBVA scaled ChatGPT Enterprise to 100,000 employees and partnered with OpenAI to accelerate AI-powered banking transformation worldwide.

analysis · OpenAI

OpenAI to acquire Ona

OpenAI plans to acquire Ona to expand Codex with secure, persistent cloud environments, enabling long-running AI agents across enterprise workflows.

analysis · OpenAI

Access OpenAI models and Codex through your Oracle cloud commitment

Access OpenAI models and Codex through Oracle Cloud, using existing commitments to build and deploy AI with enterprise security and governance.

analysis · Hugging Face

Amazing Digital Dentures (a failed project)

analysis · Hugging Face

Five labs, five minds: building a multi-model finance drama on small models

analysis · Hugging Face

MachinaCheck: Building a Multi-Agent CNC Manufacturability System on AMD MI300X

analysis · Hugging Face

OncoAgent: A Dual-Tier Multi-Agent Framework for Privacy-Preserving Oncology Clinical Decision Support

analysis · Hugging Face

Adaptive Ultrasound Imaging with Physics-Informed NV-Raw2Insights-US AI

analysis · Hugging Face

How to Ground a Korean AI Agent in Real Demographics with Synthetic Personas

analysis · arXiv cs.CL

Rigel: Reverse-Engineering the Metal 4.1 Tensor Compute Path on the Apple M4 Max GPU

arXiv:2606.12765v1 · Apple's Metal 4.1 exposes a tensor compute path: the Metal Performance Primitives (MPP) matmul2d operation over cooperative_tensor fragments, whose…

analysis · arXiv cs.CL

Does AI Reviewer See the Full Picture? Attacking and Defending Multimodal Peer Review

arXiv:2606.12716v1 · The integration of Large Language Models (LLMs) and Multimodal LLMs (MLLMs) into scientific peer-review workflows introduces novel and significant…

analysis · arXiv cs.CL

Direct Preference Optimization for Chatbot Fine-Tuning: An Empirical Study

arXiv:2606.12881v1 · We present an approach to fine-tuning large language models using Direct Preference Optimization (DPO), a reinforcement learning technique. Our…

benchmark · arXiv cs.CL

Shopping Reasoning Bench: An Expert-Authored Benchmark for Multi-Turn Conversational Shopping Assistants

arXiv:2606.12608v1 · Conversational shopping assistants now serve hundreds of millions of customers, yet no existing benchmark jointly evaluates the open-ended multi-turn…

analysis · arXiv cs.CL

Helping Figures Tell their Story! Paper-Grounded Video Generation Explaining Complex Scientific Figures

arXiv:2606.12576v1 · Scientific figures compress complex pipelines into a single canvas, yet understanding them requires paper-grounded, step-by-step narration aligned with…

analysis · arXiv cs.CL

MIRAGE: A Polarity-Flipping Encoding Subspace in LLM Agents

arXiv:2606.10304v1 · When LLM agents are coerced into covertly encoding sensitive data (Base64, ROT13, acrostic, synonym chains, and beyond), the resulting outputs evade…

analysis · arXiv cs.CL

The Order Matters: Sequential Fine-Tuning of LLaMA for Coherent Automated Essay Scoring

arXiv:2606.10327v1 · Automated Essay Scoring (AES) systems must judge interdependent discourse elements (e.g., lead, claim, evidence, conclusion), yet most approaches treat…

analysis · arXiv cs.CL

Does Topic Sentiment Cause Perceived Ideology? Comparing Human and LLM Annotations in Political News Articles

arXiv:2606.06715v1 · We ask whether topic sentiment has a causal effect on perceived political ideology, and whether the answer depends on who assigns the ideology label.…

analysis · arXiv cs.CL

The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment

arXiv:2606.06667v1 · The mechanisms behind LLMs' broad over-generalization beyond training examples remain unclear. Emergent misalignment (EM) offers a striking case study:…

analysis · arXiv cs.CL

Korean Culture into LLM Alignment: Toward Cultural Coherence

arXiv:2606.06797v1 · Cultural-aspect work on large language models is dominated by a negative target: which outputs to suppress. We argue that a constructive counterpart is…

analysis · arXiv cs.CL

What Do People Actually Want From AI? Mapping Preference Plurality

arXiv:2606.06674v1 · Large Language Models (LLMs) are often fine-tuned through Reinforcement Learning from Human Feedback (RLHF) to align with people's preferences and…

analysis · arXiv cs.CL

Epidemiology of Model Collapse: Modeling Synthetic Data Contamination via Bilayer SIR Dynamics

arXiv:2606.05168v1 · Training on synthetic data causes model collapse, but existing analyses treat this as single-chain degradation. In reality, the AI ecosystem involves…

benchmark · arXiv cs.CL

LANTERN: Layered Archival and Temporal Episodic Retrieval Network for Long-Context LLM Conversations

arXiv:2606.05182v1 · Large language models discard critical details when conversation history is compacted to fit within finite context windows. We present LANTERN (Layered…

analysis · arXiv cs.CL

Topics as Proxies for Sociodemographics: How Conversational Context Affects LLM Answers

arXiv:2606.02776v1 · When large language models (LLMs) are used in high-stakes scenarios, such as legal, medical and financial advice, even a single conversation history is…

analysis · arXiv cs.CL

Cognitive-Linguistic Indicators of Depression in Online Communities: Analysed by DistilBERT and Holographic Reduced Representation

arXiv:2606.00026v1 · This paper investigates whether combining cognitively grounded linguistic features with transformer-based embeddings improves automated detection of…

analysis · arXiv cs.CL

Knowledge Graph-Enhanced Zero-Shot Topic Classification: A Multi-Strategy Comparative Study

arXiv:2605.30465v1 · Multi-label topic classification without labeled training data is a challenging task, especially when documents contain complex relational information.…

analysis · arXiv cs.CL

Generalistic or Specific Embeddings, Which is Better? An Empirical Study on Search for Clinical Coding in Non-English Languages

arXiv:2605.30529v1 · Sentence-embedding models for semantic search are overwhelmingly developed and evaluated on English corpora. When applied to clinical retrieval in…

analysis · arXiv cs.CL

A comparative study of transformer-based embeddings for topic coherence

arXiv:2605.28832v1 · Topic modeling is a branch of Natural Language Processing (NLP) that aims to organize large collections of texts into coherent groups according to word…

analysis · arXiv cs.CL

Assessing Dutch Syllabification Algorithms and Improving Accuracy by Combining Phonetic and Orthographic Information through Deep Learning

arXiv:2605.28834v1 · Syllabification describes the task of dividing words into syllables. Due to many rules and exceptions, training an algorithm to perform syllabification…

benchmark · arXiv cs.CL

SERC: LDPC-Inspired Semantic Error Correction for Retrieval-Augmented Generation

arXiv:2605.28837v1 · While Large Language Models (LLMs) have demonstrated remarkable capabilities, their reliability is significantly compromised by hallucinations.…

analysis · arXiv cs.CL

Transcribing Children's Speech: ASR Performance and Obtaining Reliable Orthographic Transcriptions

arXiv:2605.28833v1 · Automatic speech recognition (ASR) has the potential to substantially reduce manual annotation effort in child speech research by generating automatic…

analysis · arXiv cs.CL

How Consistent Are LLM Agents? Measuring Behavioral Reproducibility in Multi-Step Tool-Calling Pipelines

arXiv:2605.28840v1 · Large language model (LLM) agents with tool-calling capabilities are increasingly deployed in production systems, yet a fundamental reliability…

analysis · arXiv cs.CL

Self-Verified Distillation: Your Language Model Is Secretly Its Own Synthetic Data Pipeline

arXiv:2605.26132v1 · Can post-trained large language models (LLMs) further improve themselves using only unlabeled prompts, without external teachers or feedback from…

benchmark · arXiv cs.CL

Memory Architectures for Multi-Turn Text-to-SQL: A Benchmark and Empirical Study

arXiv:2605.26394v1 · Multi-turn Text-to-SQL is central to enterprise analytics yet remains predominantly evaluated in single-turn settings. We introduce…

analysis · arXiv cs.CL

Slide Deck Q&A Quality Assurance App: A Multi-Stage Pipeline for Pedagogical Question Generation

arXiv:2605.26428v1 · Generating high-quality, pedagogically useful questions from lecture slide decks is difficult because important instructional content is distributed…