Research papers
Reinforcement Learning from Human Feedback
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, by Anthropic [Apr 2022] - source
Agent Frameworks:
- Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering, Ridnik et al. (Jan 2024) - paper
- Reflexion: Language Agents with Verbal Reinforcement Learning, Shinn et al. (Oct 2023) - paper
- MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action, Yang et al. (March 2023) - paper
Chain of Thought / Thinking
- Understanding Before Reasoning: Enhancing Chain-of-Thought with Iterative Summarization Pre-Prompting, by Zhu et al. [Jan 2025] - paper
- CoT can improve the performance of LLMs on reasoning tasks, but existing methods often overlook the important step of extracting key information early in the reasoning process
- they propose Iterative Summarization Pre-Prompting (ISP^2) to enhance CoT
- refines LLM reasoning when key information is missing
- ISP^2 first extracts entities and their descriptions to form potential key-information pairs, which are scored with a rating system
- improves performance by 7.1% compared to existing CoT methods
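The prompting step of ISP^2 can be sketched as follows, assuming the entity/description pairs have already been extracted and rated; `isp2_prompt` and the scores are illustrative stand-ins, not the paper's implementation.

```python
# Sketch: prepend the top-rated key-information pairs to a CoT prompt.
# (entity, description, score) triples are assumed pre-extracted and pre-rated.

def isp2_prompt(question: str, pairs: list[tuple[str, str, float]], top_k: int = 3) -> str:
    """Build a CoT prompt from pre-rated (entity, description, score) pairs."""
    key_info = sorted(pairs, key=lambda p: p[2], reverse=True)[:top_k]
    summary = "; ".join(f"{e}: {d}" for e, d, _ in key_info)
    return (f"Key information: {summary}\n"
            f"Question: {question}\n"
            f"Let's think step by step.")

pairs = [("first train", "leaves at 9am at 60 mph", 0.9),
         ("second train", "leaves at 10am at 80 mph", 0.8),
         ("station", "a generic location", 0.1)]
prompt = isp2_prompt("When does the second train catch up with the first?", pairs, top_k=2)
```

With `top_k=2`, only the two highest-rated pairs survive into the "Key information" summary; the low-rated filler entity is dropped.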
- Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought, Xiang et al. [Jan 2025] - paper
- propose Meta Chain-of-Thought (Meta-CoT), which extends traditional Chain-of-Thought (CoT)
Human and Agent interaction
- Agents Are Not Enough, Shah, et al. [Dec 2024] - paper
- Gen AI alone is insufficient to make new generations of agents more successful.
- a more effective and sustainable ecosystem needs to include:
- Agents: Agents are narrow and purpose-driven modules that are trained to do a specific task. Each agent can be autonomous, but with an ability to interface with other agents.
- Sims: Sims are representations of a user. Each Sim is created using a combination of user profile, preferences, and behaviors, and captures an aspect of who the user is. Different Sims can have different privacy and personalization settings. (user persona)
- Assistant: An Assistant is a program that directly interacts with the user, has a deep understanding of that user, and has an ability to call Sims and Agents as needed to reactively or proactively accomplish tasks and sub-tasks for the user.
The Assistant, with its comprehensive understanding of the user, co-creates and manages Sims under the user's supervision.
Agent Computer Interfaces (ACI)
- SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering by Yang et al. [Nov 2024] - paper - website - github
- investigate how interface design affects the performance of language model agents.
- Inspired by human-computer interaction (HCI) studies on the efficacy of user interfaces for humans, we investigate whether LM agents could similarly benefit from better-designed interfaces for performing software engineering tasks.
Multi-Agent
Debate
- On scalable oversight with weak LLMs judging strong LLMs, Kenton et al. [Jul 2024] - paper
- debate consistently outperforms consultancy across all tasks, previously only shown on a single extractive QA task in Khan et al. (2024).
- Comparing debate to direct question answering baselines, the results depend on the type of task. In extractive QA tasks with information asymmetry, debate outperforms QA without article as in the single task of Khan et al. (2024), but not QA with article. For other tasks, when the judge is weaker than the debaters (but not too weak), we find either small or no advantage to debate over QA without article.
- Changes to the setup (number of turns, best-of-N sampling, few-shot, chain-of-thought) seem to have little effect on results.
- In open consultancy, the judge is equally convinced by the consultant, whether or not the consultant has chosen to argue for the correct answer. Thus, using weak judges to provide a training signal via consultancy runs the risk of amplifying the consultant’s incorrect behavior.
- In open debate, in contrast, the judge follows the debater’s choice less frequently than in open
consultancy. When the debater chooses correctly, the judge does a bit worse than in open
consultancy. But when the debater chooses incorrectly, the judge does a lot better at discerning
this. Thus, the training signal provided by the weak judge in open debate is less likely to amplify
incorrect answers than in open consultancy.
- They calculate Elo scores and show that stronger debaters lead to higher judge accuracy (including for a weaker judge) across a range of tasks.
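As a refresher on how such Elo scores arise, here is a single logistic-Elo update from one judged debate outcome. This is a toy sketch of the standard Elo formula, not the paper's exact fitting procedure.

```python
# Standard logistic Elo: expected win probability and a single rating update.

def expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Update both ratings after one judged debate between A and B."""
    e_a = expected(r_a, r_b)
    score = 1.0 if a_won else 0.0
    return r_a + k * (score - e_a), r_b - k * (score - e_a)

# Two equally rated debaters; A wins the judge's verdict.
r1, r2 = elo_update(1200, 1200, a_won=True)
```

Repeating this update over many debates yields the per-debater Elo scores that the paper then correlates with judge accuracy.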
- Multi-LLM Debate: Framework, Principals, and Interventions, Estornell et al. [Jul 2024] - paper
- Introduces 3 interventions in multi-LLM debate:
- Diversity pruning: delete responses whose distributions over latent concepts are too similar to others
- Quality pruning: delete responses whose distributions over latent concepts are dissimilar to those of high-quality responses
- Misconception pruning: modify the distribution of latent concepts in a response
- Applying interventions: at time t-1, modify responses before they are used in the next round
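The diversity-pruning intervention can be sketched as greedy deduplication by cosine similarity over latent-concept distributions; the concept vectors below are illustrative, not the paper's actual estimator.

```python
import math

# Sketch of diversity pruning: treat each response as a distribution over
# latent concepts and drop near-duplicates of an already-kept response.

def cosine(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

def diversity_prune(responses, threshold=0.95):
    """Keep a response only if its concept distribution differs enough from all kept ones."""
    kept = []
    for resp, dist in responses:
        if all(cosine(dist, d) < threshold for _, d in kept):
            kept.append((resp, dist))
    return [r for r, _ in kept]

responses = [("A", [0.7, 0.2, 0.1]),
             ("A'", [0.69, 0.21, 0.1]),   # near-duplicate of A, gets pruned
             ("B", [0.1, 0.2, 0.7])]
survivors = diversity_prune(responses)
```

Quality and misconception pruning would operate on the same concept distributions, filtering or editing them rather than deduplicating.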
Evaluation
LLM evaluators (LLM-as-a-Judge)
LLM-as-a-Judge, whether as a self-evaluator or as an evaluator of other LLMs' generations, has proven useful in the following scenarios:
- benchmarking LLM’s performance
- reward modeling
- Constitutional AI (Bai et al.)
- self-refinement
Here are some interesting papers on this topic:
- A Survey on LLM-as-a-Judge, Gu et al. [Dec 2024] - paper
- LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods, Li et al. [Dec 2024] - paper
- From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge, Li et al. [Jan 2025] - paper
- in the “LLM-as-a-judge” paradigm, LLMs are leveraged to perform scoring, ranking, or selection across various tasks and applications
- explore LLM-as-a-judge from three dimensions: what to judge, how to judge and where to judge
- Attribute: What to judge? helpfulness, harmlessness, reliability, relevance, feasibility and overall quality
- Methodology: How to judge? prompting techniques for LLM-as-a-judge systems, including manually-labeled data, synthetic feedback, supervised fine-tuning, preference learning, swapping operation, rule augmentation, multi-agent collaboration, demonstration, multi-turn interaction and comparison acceleration
- Application: Where to judge? applications in which LLM-as-a-judge has been employed, including evaluation, alignment, retrieval and reasoning
- LLM Evaluators Recognize and Favor Their Own Generations, Panickssery et al. [Apr 2024] - paper
- biases are introduced due to the same LLM acting as both the evaluator and the evaluatee
- self-preference, where an LLM evaluator scores its own outputs higher than others’ while human annotators consider them of equal quality
- findings:
- LLMs such as GPT-4 and Llama 2 have non-trivial accuracy at distinguishing themselves from other LLMs and humans
- They discover a linear correlation between self-recognition capability and the strength of self-preference bias; using controlled experiments, they show that the causal explanation resists straightforward confounders
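The reported linear correlation can be checked with a plain Pearson computation over per-model (self-recognition accuracy, self-preference rate) pairs; the numbers below are made up for illustration, not the paper's measurements.

```python
import math

# Pearson correlation between self-recognition accuracy and
# self-preference rate across models (illustrative data only).

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

recognition = [0.55, 0.65, 0.80, 0.90]   # self-recognition accuracy per model
preference = [0.52, 0.58, 0.70, 0.78]    # self-preference rate per model
r = pearson(recognition, preference)     # close to 1 for near-linear data
```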
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena by Zheng et al [Dec 2023] - paper
- Motivation / contribution:
- Challenges in Evaluating LLM-Based Chat Assistants: Traditional benchmarks are inadequate for assessing the broad capabilities of LLM-based chat assistants, especially in measuring human preferences.
- Scalability and Explainability: Human evaluations are expensive and time-consuming. Utilizing LLMs as judges offers a scalable and explainable alternative to approximate human preferences.
- Alignment with Human Preferences: There’s a need to ensure that LLMs align with human preferences in open-ended tasks, such as multi-turn dialogues, which traditional benchmarks fail to assess effectively.
- Mitigating Biases in LLM Judgments: The research identifies potential biases in LLM judgments, such as position, verbosity, and self-enhancement biases, and proposes solutions to mitigate them.
- Development of New Benchmarks: The introduction of MT-Bench and Chatbot Arena aims to provide platforms for evaluating the alignment between LLM judgments and human preferences.
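The position-bias mitigation from the MT-Bench paper (judging each pair twice with the answer order swapped, and counting a win only when the verdicts agree) can be sketched as follows; `judge` is a hypothetical callable returning which displayed answer it prefers.

```python
# Swap-based debiasing: query the judge twice with answer order reversed
# and declare a winner only when both verdicts are consistent.

def debiased_verdict(judge, question, answer_a, answer_b):
    v1 = judge(question, answer_a, answer_b)   # A shown first
    v2 = judge(question, answer_b, answer_a)   # B shown first
    if v1 == "first" and v2 == "second":
        return "A"
    if v1 == "second" and v2 == "first":
        return "B"
    return "tie"   # inconsistent verdicts -> treat as a tie

# A purely position-biased judge always prefers whatever is shown first,
# so its two verdicts disagree and the result collapses to a tie.
positional = lambda q, a, b: "first"
result = debiased_verdict(positional, "q?", "ans A", "ans B")
```

A judge with a genuine (position-independent) preference gives consistent verdicts across both orderings and still produces a winner.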
Evaluation of LLMs
- A Survey on Evaluation of Large Language Models, by Yupeng Chang et al. [Dec 2023] - paper
AI CUDA Engineer
- The AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition [Feb 2025] - paper
models
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models by DeepSeek [Jan 2024]
Bite: How DeepSeek R1 was trained by Philipp Schmid [Jan 2025]
The Illustrated DeepSeek-R1 by Jay Alammar [Jan 2025]
Recent Trends in Retrieval Augmented Generation (RAG) (2024)
Based on “The Rise and Evolution of RAG in 2024 A Year in Review” by RAGFlow (Dec 2024) and other observations:
- Core Themes & Debates:
- RAG’s role as an indispensable component for LLMs in enterprise scenarios has solidified.
- Focus on overcoming challenges like ineffective Q&A for unstructured multimodal documents, low recall with pure vector databases, and semantic gaps in search.
- Key Technological Advancements:
- Multimodal Document Parsing: Tools and techniques for understanding complex documents (PDFs, PPTs) beyond text, incorporating layout and visual elements (e.g., RAGFlow’s DeepDoc, MinerU, Docling, generative AI for OCR like Nougat).
- Hybrid Search: Shift from pure vector search to hybrid approaches combining vector search with traditional methods like BM25 for better precision and recall. RAGFlow and its database Infinity, along with OpenAI’s acquisition of Rockset, highlight this trend.
- GraphRAG: Utilizing knowledge graphs to bridge semantic gaps and enable more complex reasoning. Microsoft’s open-source GraphRAG is a key example, with other variations like KAG, Nebula GraphRAG, Fast/Light/LazyGraphRAG, HippoRAG, and Triplex emerging.
- Advanced Ranking Models: Development of more sophisticated reranking models, including LLM-based rerankers (e.g., gte-Qwen2-7B) and tensor-based late interaction models (e.g., ColBERT, RAGatouille, Jina-colbert-v2).
- Improved Chunking Strategies: Moving beyond naive text chunking to methods that preserve context or add semantic meaning (e.g., Late Chunking, Contextual Chunking, Meta-Chunking).
- Agentic RAG: Integrating agentic capabilities into RAG systems for more adaptive and complex problem-solving (e.g., Self RAG, Adaptive RAG, LangGraph, RARE).
- Multimodal RAG: Systems that can retrieve and reason over combined text and visual information, leveraging Vision-Language Models (VLMs) like PaliGemma and techniques like ColPali.
- Noteworthy Tools/Platforms: RAGFlow, GraphRAG (Microsoft), LangGraph, Infinity DB.
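One common way to fuse the hybrid-search rankings mentioned above is Reciprocal Rank Fusion (RRF), which combines BM25 and vector-search result lists without needing to calibrate their raw scores. A minimal sketch with illustrative doc ids:

```python
# Reciprocal Rank Fusion: each ranked list contributes 1/(k + rank) per
# document; documents ranked highly by both retrievers float to the top.

def rrf(rankings, k=60):
    """Fuse ranked lists of doc ids; higher fused score = better."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d1", "d2", "d3"]      # keyword (BM25) ranking
vector_hits = ["d3", "d1", "d4"]    # dense vector ranking
fused = rrf([bm25_hits, vector_hits])
```

The constant `k` damps the influence of top ranks; 60 is the value commonly used in practice. Production systems (e.g. hybrid search in vector databases) often pair fusion like this with a reranker as a final stage.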
Recent Trends in Large Language Models (LLMs) (late 2024 / early 2025)
Based on “Recent advancements in large language models (LLMs) and their applications” (LinkedIn, Apr 2025 - likely referencing late 2024/early 2025 developments):
- Key Capabilities & Enhancements:
- Enhanced Multimodal Capabilities: Seamless processing of diverse inputs (text, images, audio, video) within unified architectures.
- Sophisticated Reasoning & Tool Use: Significant improvements in multi-step reasoning, coding, mathematics, and scientific analysis.
- Expanded Context Windows: Models supporting much larger context windows (e.g., up to 1 million tokens), allowing for processing of entire documents or large codebases.
- Refined RAG Integration: Better integration with external knowledge bases and real-time data for improved accuracy and currency.
- Agentic Frameworks & Autonomous Operation: LLMs evolving into autonomous agents capable of planning, task decomposition, and complex workflow execution.
- Efficiency and Optimization: A push towards smaller, faster, and more cost-effective models without sacrificing performance.
- Real-time & Streaming Interaction: Capabilities to process streaming data in real-time for dynamic applications.
- Prominent Models Showcasing these Trends (availability/versions as of early 2025 in source article):
- OpenAI: GPT-4.1 family (superior coding, 1M token context, agent-like).
- Anthropic: Claude 3.7 Sonnet (hybrid reasoning, “extended thinking mode” for RAG).
- Google: Gemini 2.5 Pro & Flash (multimodal, 1M token context, Live API).
- xAI: Grok-3 family (real-time data via X platform, image generation).
- Mistral AI: Mistral Small 3.1 (open-weight, efficient, multimodal).