Research papers
Reinforcement Learning from Human Feedback
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, by Anthropic [Apr 2022] - source
Agent Frameworks:
- Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering, Ridnik et al. (Jan 2024) - paper
- Reflexion: Language Agents with Verbal Reinforcement Learning, Shinn et al. (Oct 2023) - paper
- MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action, Yang et al. (March 2023) - paper
Chain of Thought / Thinking
- Understanding Before Reasoning: Enhancing Chain-of-Thought with Iterative Summarization Pre-Prompting, by Zhu et al. [Jan 2025] - paper
- CoT can improve the performance of LLMs on reasoning tasks, but existing methods often overlook the important step of extracting key information early in the reasoning process
- they propose Iterative Summarization Pre-Prompting (ISP^2) to enhance CoT
- refines LLM reasoning when key information is missing
- ISP^2 first extracts entities and their descriptions to form potential key-information pairs, which are scored with a rating system
- improves performance by 7.1% compared to existing CoT methods
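The prompting step of ISP^2 can be sketched as follows, assuming the entity/description pairs have already been extracted and rated; `isp2_prompt` and the scores are illustrative stand-ins, not the paper's implementation.

```python
# Sketch: prepend the top-rated key-information pairs to a CoT prompt.
# (entity, description, score) triples are assumed pre-extracted and pre-rated.

def isp2_prompt(question: str, pairs: list[tuple[str, str, float]], top_k: int = 3) -> str:
    """Build a CoT prompt from pre-rated (entity, description, score) pairs."""
    key_info = sorted(pairs, key=lambda p: p[2], reverse=True)[:top_k]
    summary = "; ".join(f"{e}: {d}" for e, d, _ in key_info)
    return (f"Key information: {summary}\n"
            f"Question: {question}\n"
            f"Let's think step by step.")

pairs = [("first train", "leaves at 9am at 60 mph", 0.9),
         ("second train", "leaves at 10am at 80 mph", 0.8),
         ("station", "a generic location", 0.1)]
prompt = isp2_prompt("When does the second train catch up with the first?", pairs, top_k=2)
```

With `top_k=2`, only the two highest-rated pairs survive into the "Key information" summary; the low-rated filler entity is dropped.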
- Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought, Xiang et al. [Jan 2025] - paper
- propose Meta Chain-of-Thought (Meta-CoT), which extends traditional Chain-of-Thought (CoT)
Human and Agent interaction
- Agents Are Not Enough, Shah, et al. [Dec 2024] - paper
- Gen AI alone is insufficient to make new generations of agents more successful.
- a more effective and sustainable ecosystem needs to include:
- Agents: Agents are narrow and purpose-driven modules that are trained to do a specific task. Each agent can be autonomous, but with an ability to interface with other agents.
- Sims: Sims are representations of a user. Each Sim is created using a combination of user profile, preferences, and behaviors, and captures an aspect of who the user is. Different Sims can have different privacy and personalization settings. (user persona)
- Assistant: An Assistant is a program that directly interacts with the user, has a deep understanding of that user, and has an ability to call Sims and Agents as needed to reactively or proactively accomplish tasks and sub-tasks for the user.
The Assistant, with its comprehensive understanding of the user, co-creates and manages Sims under the user's supervision.
Agent Computer Interfaces (ACI)
- SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering by Yang et al. [Nov 2024] - paper - website - github
- investigate how interface design affects the performance of language model agents.
- Inspired by human-computer interaction (HCI) studies on the efficacy of user interfaces for humans, we investigate whether LM agents could similarly benefit from better-designed interfaces for performing software engineering tasks.
Multi-Agent
Debate
- On scalable oversight with weak LLMs judging strong LLMs, Kenton et al. [Jul 2024] - paper
- debate consistently outperforms consultancy across all tasks, previously only shown on a single extractive QA task in Khan et al. (2024).
- Comparing debate to direct question answering baselines, the results depend on the type of task. In extractive QA tasks with information asymmetry, debate outperforms QA without article as in the single task of Khan et al. (2024), but not QA with article. For other tasks, when the judge is weaker than the debaters (but not too weak), we find either small or no advantage to debate over QA without article.
- Changes to the setup (number of turns, best-of-N sampling, few-shot, chain-of-thought) seem to have little effect on results.
- In open consultancy, the judge is equally convinced by the consultant, whether or not the consultant has chosen to argue for the correct answer. Thus, using weak judges to provide a training signal via consultancy runs the risk of amplifying the consultant’s incorrect behavior.
- In open debate, in contrast, the judge follows the debater’s choice less frequently than in open
consultancy. When the debater chooses correctly, the judge does a bit worse than in open
consultancy. But when the debater chooses incorrectly, the judge does a lot better at discerning
this. Thus, the training signal provided by the weak judge in open debate is less likely to amplify
incorrect answers than in open consultancy.
- They calculate Elo scores and show that stronger debaters lead to higher judge accuracy (including for a weaker judge) across a range of tasks.
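As a refresher on how such Elo scores arise, here is a single logistic-Elo update from one judged debate outcome. This is a toy sketch of the standard Elo formula, not the paper's exact fitting procedure.

```python
# Standard logistic Elo: expected win probability and a single rating update.

def expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Update both ratings after one judged debate between A and B."""
    e_a = expected(r_a, r_b)
    score = 1.0 if a_won else 0.0
    return r_a + k * (score - e_a), r_b - k * (score - e_a)

# Two equally rated debaters; A wins the judge's verdict.
r1, r2 = elo_update(1200, 1200, a_won=True)
```

Repeating this update over many debates yields the per-debater Elo scores that the paper then correlates with judge accuracy.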
- Multi-LLM Debate: Framework, Principals, and Interventions, Estornell et al. [Jul 2024] - paper
- Introduces 3 interventions in multi-LLM debate:
- Diversity pruning: delete responses whose distributions over latent concepts are too similar to others
- Quality pruning: delete responses whose distributions over latent concepts are dissimilar to those of high-quality responses
- Misconception pruning: modify the distribution of latent concepts in a response
- Applying interventions: at time t-1, modify responses before they are used in the next round
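The diversity-pruning intervention can be sketched as greedy deduplication by cosine similarity over latent-concept distributions; the concept vectors below are illustrative, not the paper's actual estimator.

```python
import math

# Sketch of diversity pruning: treat each response as a distribution over
# latent concepts and drop near-duplicates of an already-kept response.

def cosine(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

def diversity_prune(responses, threshold=0.95):
    """Keep a response only if its concept distribution differs enough from all kept ones."""
    kept = []
    for resp, dist in responses:
        if all(cosine(dist, d) < threshold for _, d in kept):
            kept.append((resp, dist))
    return [r for r, _ in kept]

responses = [("A", [0.7, 0.2, 0.1]),
             ("A'", [0.69, 0.21, 0.1]),   # near-duplicate of A, gets pruned
             ("B", [0.1, 0.2, 0.7])]
survivors = diversity_prune(responses)
```

Quality and misconception pruning would operate on the same concept distributions, filtering or editing them rather than deduplicating.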
Evaluation
LLM evaluators (LLM-as-a-Judge)
LLM-as-a-Judge, whether as a self-evaluator or as an evaluator of other LLMs' generations, has proven useful in the following scenarios:
- benchmarking LLM’s performance
- reward modeling
- Constitutional AI (Bai et al.)
- self-refinement
Here are some interesting papers on this topic:
- A Survey on LLM-as-a-Judge, Gu et al. [Dec 2024] - paper
- LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods, Li et al. [Dec 2024] - paper
- From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge, Li et al. [Jan 2025] - paper
- in the “LLM-as-a-judge” paradigm, LLMs are leveraged to perform scoring, ranking, or selection across various tasks and applications
- explore LLM-as-a-judge from three dimensions: what to judge, how to judge and where to judge
- Attribute: What to judge? helpfulness, harmlessness, reliability, relevance, feasibility and overall quality
- Methodology: How to judge? prompting techniques for LLM-as-a-judge systems, including manually-labeled data, synthetic feedback, supervised fine-tuning, preference learning, swapping operation, rule augmentation, multi-agent collaboration, demonstration, multi-turn interaction and comparison acceleration
- Application: Where to judge? applications in which LLM-as-a-judge has been employed, including evaluation, alignment, retrieval and reasoning
- LLM Evaluators Recognize and Favor Their Own Generations, Panickssery et al. [Apr 2024] - paper
- biases are introduced due to the same LLM acting as both the evaluator and the evaluatee
- self-preference, where an LLM evaluator scores its own outputs higher than others’ while human annotators consider them of equal quality
- findings:
- LLMs such as GPT-4 and Llama 2 have non-trivial accuracy at distinguishing themselves from other LLMs and humans
- They discover a linear correlation between self-recognition capability and the strength of self-preference bias; using controlled experiments, they show that the causal explanation resists straightforward confounders
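The reported linear correlation can be checked with a plain Pearson computation over per-model (self-recognition accuracy, self-preference rate) pairs; the numbers below are made up for illustration, not the paper's measurements.

```python
import math

# Pearson correlation between self-recognition accuracy and
# self-preference rate across models (illustrative data only).

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

recognition = [0.55, 0.65, 0.80, 0.90]   # self-recognition accuracy per model
preference = [0.52, 0.58, 0.70, 0.78]    # self-preference rate per model
r = pearson(recognition, preference)     # close to 1 for near-linear data
```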
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena by Zheng et al [Dec 2023] - paper
- Motivation / contribution:
- Challenges in Evaluating LLM-Based Chat Assistants: Traditional benchmarks are inadequate for assessing the broad capabilities of LLM-based chat assistants, especially in measuring human preferences.
- Scalability and Explainability: Human evaluations are expensive and time-consuming. Utilizing LLMs as judges offers a scalable and explainable alternative to approximate human preferences.
- Alignment with Human Preferences: There’s a need to ensure that LLMs align with human preferences in open-ended tasks, such as multi-turn dialogues, which traditional benchmarks fail to assess effectively.
- Mitigating Biases in LLM Judgments: The research identifies potential biases in LLM judgments, such as position, verbosity, and self-enhancement biases, and proposes solutions to mitigate them.
- Development of New Benchmarks: The introduction of MT-Bench and Chatbot Arena aims to provide platforms for evaluating the alignment between LLM judgments and human preferences.
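The position-bias mitigation from the MT-Bench paper (judging each pair twice with the answer order swapped, and counting a win only when the verdicts agree) can be sketched as follows; `judge` is a hypothetical callable returning which displayed answer it prefers.

```python
# Swap-based debiasing: query the judge twice with answer order reversed
# and declare a winner only when both verdicts are consistent.

def debiased_verdict(judge, question, answer_a, answer_b):
    v1 = judge(question, answer_a, answer_b)   # A shown first
    v2 = judge(question, answer_b, answer_a)   # B shown first
    if v1 == "first" and v2 == "second":
        return "A"
    if v1 == "second" and v2 == "first":
        return "B"
    return "tie"   # inconsistent verdicts -> treat as a tie

# A purely position-biased judge always prefers whatever is shown first,
# so its two verdicts disagree and the result collapses to a tie.
positional = lambda q, a, b: "first"
result = debiased_verdict(positional, "q?", "ans A", "ans B")
```

A judge with a genuine (position-independent) preference gives consistent verdicts across both orderings and still produces a winner.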
Evaluation of LLMs
- A Survey on Evaluation of Large Language Models, by Yupeng Chang et al. [Dec 2023] - paper
AI CUDA Engineer
- The AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition [Feb 2025] - paper
models
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models by DeepSeek [Jan 2024]
Bite: How DeepSeek R1 was trained by Philipp Schmid [Jan 2025]
The Illustrated DeepSeek-R1 by Jay Alammar [Jan 2025]
Recent Trends in Retrieval Augmented Generation (RAG) (2024)
Based on “The Rise and Evolution of RAG in 2024 A Year in Review” by RAGFlow (Dec 2024) and other observations:
- Core Themes & Debates:
- RAG’s role as an indispensable component for LLMs in enterprise scenarios has solidified.
- Focus on overcoming challenges like ineffective Q&A for unstructured multimodal documents, low recall with pure vector databases, and semantic gaps in search.
- Key Technological Advancements:
- Multimodal Document Parsing: Tools and techniques for understanding complex documents (PDFs, PPTs) beyond text, incorporating layout and visual elements (e.g., RAGFlow’s DeepDoc, MinerU, Docling, generative AI for OCR like Nougat).
- Hybrid Search: Shift from pure vector search to hybrid approaches combining vector search with traditional methods like BM25 for better precision and recall. RAGFlow and its database Infinity, along with OpenAI’s acquisition of Rockset, highlight this trend.
- GraphRAG: Utilizing knowledge graphs to bridge semantic gaps and enable more complex reasoning. Microsoft’s open-source GraphRAG is a key example, with other variations like KAG, Nebula GraphRAG, Fast/Light/LazyGraphRAG, HippoRAG, and Triplex emerging.
- Advanced Ranking Models: Development of more sophisticated reranking models, including LLM-based rerankers (e.g., gte-Qwen2-7B) and tensor-based late interaction models (e.g., ColBERT, RAGatouille, Jina-colbert-v2).
- Improved Chunking Strategies: Moving beyond naive text chunking to methods that preserve context or add semantic meaning (e.g., Late Chunking, Contextual Chunking, Meta-Chunking).
- Agentic RAG: Integrating agentic capabilities into RAG systems for more adaptive and complex problem-solving (e.g., Self RAG, Adaptive RAG, LangGraph, RARE).
- Multimodal RAG: Systems that can retrieve and reason over combined text and visual information, leveraging Vision-Language Models (VLMs) like PaliGemma and techniques like ColPali.
- Noteworthy Tools/Platforms: RAGFlow, GraphRAG (Microsoft), LangGraph, Infinity DB.
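One common way to fuse the hybrid-search rankings mentioned above is Reciprocal Rank Fusion (RRF), which combines BM25 and vector-search result lists without needing to calibrate their raw scores. A minimal sketch with illustrative doc ids:

```python
# Reciprocal Rank Fusion: each ranked list contributes 1/(k + rank) per
# document; documents ranked highly by both retrievers float to the top.

def rrf(rankings, k=60):
    """Fuse ranked lists of doc ids; higher fused score = better."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d1", "d2", "d3"]      # keyword (BM25) ranking
vector_hits = ["d3", "d1", "d4"]    # dense vector ranking
fused = rrf([bm25_hits, vector_hits])
```

The constant `k` damps the influence of top ranks; 60 is the value commonly used in practice. Production systems (e.g. hybrid search in vector databases) often pair fusion like this with a reranker as a final stage.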
Recent Trends in Large Language Models (LLMs) (late 2024 / early 2025)
Based on “Recent advancements in large language models (LLMs) and their applications” (LinkedIn, Apr 2025 - likely referencing late 2024/early 2025 developments):
- Key Capabilities & Enhancements:
- Enhanced Multimodal Capabilities: Seamless processing of diverse inputs (text, images, audio, video) within unified architectures.
- Sophisticated Reasoning & Tool Use: Significant improvements in multi-step reasoning, coding, mathematics, and scientific analysis.
- Expanded Context Windows: Models supporting much larger context windows (e.g., up to 1 million tokens), allowing for processing of entire documents or large codebases.
- Refined RAG Integration: Better integration with external knowledge bases and real-time data for improved accuracy and currency.
- Agentic Frameworks & Autonomous Operation: LLMs evolving into autonomous agents capable of planning, task decomposition, and complex workflow execution.
- Efficiency and Optimization: A push towards smaller, faster, and more cost-effective models without sacrificing performance.
- Real-time & Streaming Interaction: Capabilities to process streaming data in real-time for dynamic applications.
- Prominent Models Showcasing these Trends (availability/versions as of early 2025 in source article):
- OpenAI: GPT-4.1 family (superior coding, 1M token context, agent-like).
- Anthropic: Claude 3.7 Sonnet (hybrid reasoning, “extended thinking mode” for RAG).
- Google: Gemini 2.5 Pro & Flash (multimodal, 1M token context, Live API).
- xAI: Grok-3 family (real-time data via X platform, image generation).
- Mistral AI: Mistral Small 3.1 (open-weight, efficient, multimodal).