Update, 4/14/25: We are no longer accepting applications for this RFP. If you have a proposal that meets the following criteria, you are welcome to submit it to the AI Governance and Policy General RFP.

1. Capability evaluations are not on track for the role they are expected to play

Experts significantly disagree on the likely future capabilities of large language models (LLMs). Some experts believe LLM-based agents will soon outperform human professionals in almost every task,[1]E.g., in Managing AI Risks in an Era of Rapid Progress, Bengio, Hinton, and other leading AI researchers write “Combined with the ongoing growth and automation in AI R&D, we must take seriously the possibility that generalist AI systems will outperform human …” while others think the impact will be more modest and limited to specific areas.[2]For example, ML experts surveyed in 2023 had a 72-year gap between median timelines for fully automating all tasks (50% probability by 2048) and fully automating all occupations (50% probability by 2120). Disagreements about likely AI progress underpinning disagreements about AI risk are discussed … These disagreements often underpin larger debates about potential risks, whether from misuse or loss of control.[3]See, e.g., What Are the Real Questions in AI? and What the AI debate is really about.

These disagreements have contributed to growing interest in governance approaches that accommodate different predictions about AI progress. One prominent example is “if-then commitments,” where AI developers agree to take specific actions (like improving security, or pausing training of more powerful models until safety measures are implemented) if their systems meet certain capability thresholds.[4]For example, at the 2024 AI Seoul Summit, a network of AI Safety Institutes committing to information sharing about capabilities evaluation was established, and 16 AI companies signed frontier safety commitments. See also statements from IDAIS-Beijing – International Dialogues on AI Safety … Since these actions are conditional on certain thresholds being met, this approach is compatible with a range of beliefs about AI progress.
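To make the shape of such commitments concrete, here is a purely illustrative sketch (in Python) of an if-then commitment encoded as a machine-checkable policy. The capability names, thresholds, and actions are hypothetical and not drawn from any existing framework.

```python
from dataclasses import dataclass

@dataclass
class IfThenCommitment:
    """A conditional commitment: if an evaluation score crosses a
    threshold, a pre-agreed action is triggered."""
    capability: str       # e.g., "autonomous_cyberoffense" (hypothetical)
    threshold: float      # score above which the commitment triggers
    required_action: str  # pre-agreed response if the threshold is met

def triggered_actions(eval_results: dict[str, float],
                      commitments: list[IfThenCommitment]) -> list[str]:
    """Return the actions whose capability thresholds have been met."""
    return [
        c.required_action
        for c in commitments
        if eval_results.get(c.capability, 0.0) >= c.threshold
    ]

# Hypothetical example: thresholds and scores are placeholders.
commitments = [
    IfThenCommitment("autonomous_cyberoffense", 0.5,
                     "pause deployment pending independent red-teaming"),
    IfThenCommitment("ai_rnd_automation", 0.7,
                     "raise model-weight security to a higher standard"),
]
print(triggered_actions({"autonomous_cyberoffense": 0.62}, commitments))
```

The point of the sketch is simply that the trigger is a measured capability level, which is why the reliability of the underlying evaluations matters so much.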

Both if-then commitments and other governance approaches based on AI capabilities rely on accurate AI evaluations.[5]In this RFP, we use the terms “evaluation,” “benchmark,” “test,” and “task” as follows. “Evaluation” refers to the complete process of measuring model performance (including, e.g., test conditions, the benchmark or tasks used, grading metrics, and how you report results). … Governments, regulators, and society at large need robust, reliable measurements to understand and respond appropriately to advances in AI capabilities. We are worried that the current evaluations paradigm isn’t mature enough to play this role.

Capability evaluation currently faces three major challenges:

  1. Existing benchmarks for risk-relevant capabilities are inadequate. We need more demanding evaluations that can meaningfully measure frontier models’ performance on tasks relevant to catastrophic risks, resist saturation even as capabilities advance, and rule in (not just rule out) serious risks.[6]See, e.g., International AI Safety Report 2025, which states (pg 169): “An ‘evaluation gap’ for safety persists: Despite ongoing progress, current risk assessment and evaluation methods for general-purpose AI systems are immature. Even if a model passes current risk evaluations, it can be …”
  2. The science of capability evaluation remains underdeveloped. We don’t yet understand how many capabilities scale, the relationships between different capabilities, or how post-training enhancements[7]By “post-training enhancements,” we mean techniques that improve model performance after pretraining. This includes methods that modify model weights, such as fine-tuning, and methods that change how the model is used (such as scaffolding, tool use, agent architectures, and prompt engineering). … will affect performance.[8]See, e.g., A Path for Science- and Evidence-based AI Policy, a proposal co-authored by Li, Liang, and Song, among others, which states that “our understanding of how these models function and their possible negative impacts on society remains very limited.” The same point is made in the … This makes interpreting current evaluation results and predicting future results challenging.
  3. Third-party evaluators already face significant access constraints, and increasing security requirements will make access harder. Maintaining meaningful independent scrutiny will require advances in technical infrastructure, evaluation and audit protocols, and access frameworks.[9]A similar point is made in the International AI Safety Report 2025, which states (pg 181): “The absence of clear risk assessment standards and rigorous evaluations is creating an urgent policy challenge, as AI models are being deployed faster than their risks can be evaluated. Policymakers face …”

We are looking to fund projects that can make progress on any of these challenges. 

The window to submit an expression of interest (EOI) has closed. We are reviewing submitted EOIs on a rolling basis; see Section 5.3 for expected response times.

 

Below, we expand on each of the three main areas we’re seeking proposals for. Each section includes examples of valuable work, open questions we’re interested in, and key requirements for proposals. We expect that most strong proposals will focus on just one area, so there is no need to read the whole page.

2. Global Catastrophic Risk (GCR)-relevant capability benchmarks for AI agents

In November 2023, we launched an RFP for LLM agent benchmarks. Though that RFP did not focus solely on GCR-relevant capabilities,[10]By “global catastrophic risk,” we mean a risk that has the potential to cause harm on an enormous scale (e.g., threaten billions of lives). See Potential Risks from Advanced Artificial Intelligence for more details. multiple benchmarks funded through that RFP have already been used in pre-deployment testing of frontier models,[11]For example, LAB-Bench and Cybench were used in UK AISI’s and US AISI’s pre-deployment testing of Claude 3.5 Sonnet (report) and o1 (report). Other tasks have been used privately. and others are forthcoming.

Despite that progress, we still think we urgently need more demanding tests of AI agents’ capabilities[12]By “AI agent” or “agentic AI system,” we mean AI systems capable of pursuing complex goals with limited supervision. Examples include systems which could, e.g., identify and exploit an elite zero-day vulnerability with no instances of human intervention. We borrow this definition (though …) that are directly relevant to global catastrophic risks. We’re particularly interested in such benchmarks because:

  1. Some of the existing relevant benchmarks are already saturated, or close to saturation.[13] See, e.g., LAB-Bench, on which Sonnet 3.5 (new) achieved human-level or greater performance in 2 out of 5 categories (pg. 7).
  2. Existing benchmarks only cover some potential risks, and not necessarily at a difficulty level relevant to catastrophic risk.[14]For example, while Cybench measures a precursor of GCR-level cyber capabilities, it does not directly test for, e.g., the ability to discover elite zero-day vulnerabilities. Similarly, LAB-Bench measures capabilities like designing scientific protocols, reasoning about tables and figures, and …
  3. Ideally, we want evaluations to be able to rule in risks. By this, we mean that sufficiently strong performance on the evaluation would provide compelling evidence of the capability to cause serious harm.[15]For evaluation results to provide enough evidence for us to rule in risks, we’ll likely need: more comprehensive testing, harder and more relevant tasks, adversarial testing and other concerted efforts to upper-bound model performance, post-deployment testing, and close collaboration with domain …

We focus on agent capabilities because they’re central to the main risks we’re concerned about. Capable AI agents could pursue their own objectives, be used to automate dangerous R&D, and accelerate overall AI development beyond our ability to handle it safely. Measuring these capabilities reliably is a prerequisite for mitigating these risks.

We recognize the technical challenges in building benchmarks and running evaluations. As a result, we are ready to provide substantial funding for well-designed proposals that demonstrate sufficient ambition.

2.1 Necessary criteria for new benchmarks

2.1.1 Relevance to global catastrophic risk

Tasks should be designed to measure capabilities directly relevant to plausible threat models for global catastrophic risks. We’re particularly interested in testing for the following capabilities:

  • AI research and development (R&D), i.e., automating many/all of the tasks currently done by top AI researchers and engineers
    • AI systems that could competently perform AI R&D could dramatically accelerate the pace of AI capabilities development, potentially outpacing society’s ability to adapt and respond. 
    • AI R&D also serves as a useful proxy for AI capabilities in other kinds of R&D, e.g., in biology, chemistry, or weapons development. We think AI systems will likely be better at AI R&D than other kinds of R&D,[16]ML has relatively tight feedback loops, requires little physical interaction, and has large amounts of relevant training data compared to other sciences. Also, AI developers have strong economic incentives to improve their models’ AI R&D capabilities, since this would accelerate AI … so AI R&D capabilities can act as a leading indicator of general R&D capabilities.
  • Rapid adaptation to, and mastery of, novel adversarial environments
    • The ability to efficiently develop and execute winning strategies in complex, competitive environments is a key capability for loss of control (or “rogue AI”) scenarios.
    • We’re particularly interested in the ability to autonomously produce such strategic behavior across multiple domains without domain-specific optimization — whether competing in novel games, executing or responding to cyber threats, or acquiring real-world resources.
  • Capabilities that are directly relevant to undermining human oversight, such as scheming, situational awareness, or persuasion[17]For further discussion of capabilities relevant to undermining human oversight, see Section 2.2.3 (and in particular Table 2.4, pg 104) in International AI Safety Report. For examples of useful work in this category, see, e.g., Scheming reasoning evaluations, and Me, Myself, and AI: The Situational …
  • Cyberoffense
    • We’re particularly interested in benchmarks that cover the entire kill chain. Scores should be benchmarked to the operational capacities[18]See Chapter 4 of Securing AI Model Weights for a definition of operational capacity categories. of specific threat actors.

We’ll consider proposals for other capability benchmarks if they are supported by a specific threat model showing that the capability is necessary to realize a catastrophic risk.[19]This list draws upon Karnofsky’s prioritization in A Sketch of Potential Tripwire Capabilities for AI. For other views on what capabilities to test for, see, e.g., Early lessons from evaluating frontier AI systems, IDAIS-Beijing – International Dialogues on AI Safety, and Common Elements of …

In general, we prefer evaluations that measure specific dangerous capabilities over those that measure more general precursors. For example, cyberoffense evaluations are better than general coding proficiency evaluations, and cyberoffense evaluations focusing on the key bottlenecks are better still.[20] See our discussion of construct validity for more detail on this point.

2.1.2 Evaluating agentic capabilities

The majority of catastrophic risks we’re concerned about stem from agentic AI systems.[21]By “AI agent” or “agentic AI system,” we mean AI systems capable of pursuing complex goals with limited supervision. Examples include systems that could, e.g., identify and exploit an elite zero-day vulnerability with no instances of human intervention. We borrow this definition (though not …) In part because of this, we are only inviting proposals about benchmarks for AI agent capabilities. This means testing AI systems on decomposing complex tasks into smaller subtasks, autonomously pursuing objectives across multiple steps, and responding to novel situations without human guidance.

2.1.3 Construct validity

Evaluations need to measure what they claim to measure, not just superficially similar tasks. Good construct validity entails that high scores on an evaluation would correspond to high real-world performance in the relevant domain, and that low scores would correspond to poor real-world performance.

Construct validity is often difficult to assess. All else equal, we prefer benchmarks where tasks are critical bottlenecks or necessary difficult steps in a GCR threat model, and which mirror tasks done by human professionals.[22] Giving models access to the same environments and tools as human professionals (where possible) helps to mirror tasks. Alternative task designs are acceptable if they are justified by an argument for their construct validity. While testing performance on discrete subtasks is often useful to distinguish between low performance levels, high scores should require models to identify necessary subtasks, determine their relationships, and chain them together without supervision. Benchmarks should include the most difficult tasks required to realize a given threat. 

2.1.4 Difficulty

We want to fund evaluations that are difficult enough to resist saturation, and provide unambiguous evidence of serious risks if models perform well on them.

This will likely require identifying tasks that are challenging for world-class human experts, and which take days or more to complete. Difficulty is particularly important given the recent rate of AI progress and the effects of post-training enhancements[23]By “post-training enhancements,” we mean techniques that improve model performance after pretraining. This includes methods that modify model weights, such as fine-tuning, and methods that change how the model is used (such as scaffolding, tool use, agent architectures, and prompt engineering). … on model performance. For example, when FrontierMath — a benchmark of math problems ranging from PhD-level to open research questions — was first released, frontier models scored less than 2%; now, o3 scores 25%.[24]As reported by OpenAI. Note that o3 used significantly more inference compute, that it was tested by OpenAI, and that OpenAI had access to the FrontierMath benchmark. Similarly, o3 has exceeded human expert performance on GPQA Diamond, a benchmark of unpublished PhD-level science questions, scoring 87.7%.[25]OpenAI recruited experts with PhDs to answer GPQA Diamond questions, and found that they scored 69.7%. The previous highest score from an AI model was 56%.[26] Mean score achieved by Claude 3.5 Sonnet in Epoch’s testing; see AI Benchmarking Dashboard for details.

2.1.5 Follows best scientific practice

Evaluations should follow scientific best practice in how they are conducted and reported.[27]For relevant work here, see Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations. We expect there’s significant additional work to be done here. For example, to mitigate test set contamination, representative private test sets should be held out from published benchmarks, and performance differences between the public and private sets should be measured. The experimental setup, including the base model, agent scaffolding, prompting, and any other post-training enhancements used, should be clearly documented.
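As one illustration of the kind of statistical reporting we have in mind, here is a minimal sketch (not a prescribed method) that computes bootstrap confidence intervals over per-task scores and compares a public split with a private held-out split to flag possible contamination. The scores below are randomly generated placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap confidence interval for the mean per-task score."""
    scores = np.asarray(scores, dtype=float)
    means = [rng.choice(scores, size=len(scores), replace=True).mean()
             for _ in range(n_boot)]
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)

# Placeholder per-task scores (0/1 pass-fail, but continuous scores work too).
public_scores = rng.binomial(1, 0.55, size=200)   # published split
private_scores = rng.binomial(1, 0.48, size=100)  # held-out split

for name, s in [("public", public_scores), ("private", private_scores)]:
    mean, (lo, hi) = bootstrap_ci(s)
    print(f"{name}: {mean:.3f} (95% CI {lo:.3f}-{hi:.3f})")

# A large public-private gap is one warning sign of test-set contamination.
print(f"public-private gap: {public_scores.mean() - private_scores.mean():.3f}")
```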

2.2 Nice-to-have criteria for new benchmarks 

While the criteria above are necessary, the following are nice to have:

  • Where possible, tasks should be compatible with widely used libraries to make running evaluations easier.
    • We encourage using Inspect, an open-source framework for model evaluations developed by the UK AI Safety Institute (a minimal example task is sketched after this list).
  • Task grading should be fine-grained or ideally continuous, rather than binary (pass/fail). 
  • Tasks should provide opportunities for feedback and iteration, e.g., from critic models, automatic graders, or performance on discrete subtasks.
  • Where relevant, evaluations should include appropriate comparison baselines. 
    • What counts as an “appropriate” baseline will depend on the threat model used. In many cases, we’re interested in both the performance of human professionals and novices, given adequate tool access, incentives, time, and training.
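For concreteness, here is a minimal sketch of an Inspect task, following the publicly documented Task/Sample/solver/scorer pattern. The samples are trivial placeholders and the exact API may vary between Inspect versions, so treat this as illustrative rather than authoritative; a real GCR-relevant benchmark would use an agentic solver, sandboxed tools, and a fine-grained scorer.

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import generate

@task
def toy_benchmark():
    # Placeholder samples; a real benchmark would load a curated dataset.
    dataset = [
        Sample(input="What is 17 * 24?", target="408"),
        Sample(input="Name the capital of Australia.", target="Canberra"),
    ]
    return Task(
        dataset=dataset,
        solver=[generate()],  # a single model turn; agent scaffolds add tools and loops
        scorer=includes(),    # binary check that the target string appears in the output
    )
```

Tasks defined this way can be run with Inspect’s command-line runner against different models and scaffolds, which is part of what makes a shared framework attractive for reproducibility and for porting evaluations between evaluators.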

2.3 Useful previous work

  • Examples of challenging benchmarks include: 
    • FrontierMath, a benchmark of unpublished and extremely difficult math problems.
    • Cybench, a benchmark of challenging, multi-step capture-the-flag tasks in cybersecurity. The hardest tasks in Cybench take humans days to solve.
  • Examples of GCR-relevant benchmarks include:
    • Lab-MM, which measures LLM capabilities at assisting novices with wet lab research. (Though note that Lab-MM tests human uplift, rather than AI agent capabilities.)
    • RE-Bench, which measures LLM agents’ capabilities at AI R&D given a fixed time budget, and features a human expert baseline.

Other work we think is useful includes:

3. Advancing the science of evaluations and capabilities development

Capabilities evaluations are closer to snapshots of current model performance than guides to future development: evaluators can run tests and see what models can do today, but struggle to predict what they’ll be capable of tomorrow.

While building hard, GCR-relevant benchmarks will help, we also need more work in several areas:

  1. Measuring and predicting performance gains from post-training enhancements,[29]By “post-training enhancements,” we mean techniques that improve model performance after pretraining. This includes methods that modify model weights, such as fine-tuning, and methods that change how the model is used (such as scaffolding, tool use, agent architectures, and prompt engineering). … like prompt engineering, scaffolding, and fine-tuning
  2. Research informing how we conduct and interpret evaluations
  3. Understanding and predicting the relationships between different capabilities, and how capabilities emerge

Below, we go into more detail on some open questions we’re particularly interested in. For this area, we will consider proposals on understanding capabilities and evaluation broadly (i.e., not just GCR-relevant capabilities), provided such research could transfer to risk-relevant domains.

3.1 Scaling trends for everything

Capability evaluations should aim to establish reliable upper bounds on AI agent performance within available budgets. To both improve the quality of our estimates and understand when we need to re-evaluate models, we need to better understand how different enhancements affect capabilities (a toy curve-fitting sketch follows the list below). Open questions include:

  • How do different capabilities scale with different variables, e.g., model size, compute resources, or post-training enhancements?[30]See, e.g., AI capabilities can be significantly improved without expensive retraining and AI Benchmarking Dashboard for examples of useful work in this category.
  • How can we best quantify post-training interventions that affect model performance, such as elicitation, scaffolding, or tooling effort? 
  • What do the marginal return curves for investing in different post-training enhancements look like? How do these returns vary across different model architectures and sizes?
  • For post-deployment models, what kinds of advances in post-training enhancements should trigger re-evaluation? How can we reliably tell when these advances have been realized?
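As a toy illustration of the first question, the sketch below fits a saturating curve to benchmark scores as a function of log training compute. The data points are invented, the functional form is only one of several candidates, and the same pattern could be applied to elicitation effort or other post-training investment.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(log_compute, k, x0):
    """Logistic curve: score saturates at 1.0 as compute grows."""
    return 1.0 / (1.0 + np.exp(-k * (log_compute - x0)))

# Hypothetical (log10 FLOP, benchmark score) pairs for one model family.
log_compute = np.array([22.0, 23.0, 24.0, 25.0, 26.0])
scores      = np.array([0.05, 0.12, 0.35, 0.62, 0.81])

(k, x0), _ = curve_fit(sigmoid, log_compute, scores, p0=[1.0, 24.0])
print(f"fitted slope k={k:.2f}, midpoint at 10^{x0:.1f} FLOP")
print(f"predicted score at 10^27 FLOP: {sigmoid(27.0, k, x0):.2f}")
```

In practice we expect proposals to go well beyond simple curve-fitting, for example by quantifying uncertainty in the extrapolation and modelling post-training enhancements explicitly.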

3.2 Understanding relationships between capabilities

Many important AI capabilities are expensive and difficult to reliably evaluate. Understanding the relationships between different capabilities could help us to identify when to run costly evaluations based on results from cheaper evaluations, identify surprising results that warrant more comprehensive testing, and, most ambitiously, predict important capability levels from simpler evaluations (a toy regression illustrating this is sketched after the list below). Open questions include:

  • How robust is the predictive relationship between simpler, low-dimensional capability measures and complex, emergent capabilities? 
  • How should we expect this relationship to change with model size, model architecture, or the kinds of capabilities studied?
  • Can we predict how smoothly performance on particular tasks will scale? How is this predictability related to the capability being measured?[31]This question is taken from UK AISI’s “Priority research areas for academic collaborations.”
  • Can we decompose dangerous capabilities into meaningful components that can be tracked separately?[32]This question is taken from UK AISI’s “Priority research areas for academic collaborations.” For relevant work, see, e.g., Burnell et al. (2023).
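A toy version of the prediction problem, with invented numbers: regress a costly benchmark’s scores on scores from cheaper evaluations across a set of models, then check how well the fit holds on held-out models. The open questions above are largely about when and why relationships like this break down.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Rows = models; columns = scores on two cheap benchmarks (invented data).
cheap = rng.uniform(0.2, 0.9, size=(12, 2))
# Costly benchmark score, loosely related to the cheap scores plus noise.
costly = 0.6 * cheap[:, 0] + 0.3 * cheap[:, 1] + rng.normal(0, 0.05, size=12)

train, test = slice(0, 9), slice(9, 12)
model = LinearRegression().fit(cheap[train], costly[train])

print("held-out R^2:", model.score(cheap[test], costly[test]))
print("predictions :", model.predict(cheap[test]).round(2))
print("actual      :", costly[test].round(2))
```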

3.3 Improving baselines and measurement

  • What is the best way to build continuous grading metrics for complex, longer-horizon tasks (one simple milestone-weighted approach is sketched after this list)?[33] This question is taken from UK AISI’s “Priority research areas for academic collaborations.”
  • How can we accurately measure the human effort involved in model evaluations, e.g., in prompt engineering and scaffolding design?
  • How can we establish clear, measurable thresholds for:
    • World-class human performance
    • Where relevant, minimum capability levels sufficient to enable serious harm
  • How can we be confident that high scores on dangerous capability evaluations justify taking meaningful actions on safety?[34] Possible research directions here include threat modelling, building consensus on “red lines” for AI deployment, and assessing the validity, reliability and robustness of different capabilities tests. For more details, see section 3.3 of International AI Safety Report.
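On the first question above, one simple pattern is milestone-weighted partial credit: decompose a long-horizon task into weighted milestones and score the fraction of weight achieved, optionally discounted for hints or interventions. This is a sketch rather than a recommendation, and the task, milestone names, and weights are hypothetical.

```python
def milestone_score(completed: dict[str, bool],
                    weights: dict[str, float],
                    hint_penalty: float = 0.0) -> float:
    """Continuous score in [0, 1]: weighted fraction of milestones completed,
    reduced by a penalty for any hints or interventions used."""
    total = sum(weights.values())
    earned = sum(w for m, w in weights.items() if completed.get(m, False))
    return max(0.0, earned / total - hint_penalty)

# Hypothetical long-horizon task: reproduce a published ML result.
weights = {
    "environment_setup": 0.1,
    "dataset_prepared": 0.2,
    "training_run_completes": 0.3,
    "reported_metric_matched": 0.4,
}
completed = {"environment_setup": True, "dataset_prepared": True,
             "training_run_completes": True, "reported_metric_matched": False}
print(milestone_score(completed, weights, hint_penalty=0.05))  # 0.55
```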

3.4 Understanding how evaluation results may be misleading 

  • How easy is it to fine-tune a model to significantly change its performance on a particular evaluation in a hard-to-detect way, e.g., to artificially degrade model performance on a dangerous capabilities evaluation?
  • Do performance changes from fine-tuning transfer across different evaluations?
  • Can we reliably detect when a model has been fine-tuned for specific evaluations?
  • Can supervised fine-tuning reliably mitigate under-elicitation and sandbagging?[35]This question is taken from UK AISI’s “Priority research areas for academic collaborations.” For relevant work, see, e.g., Stress-Testing Capability Elicitation With Password-Locked Models.
  • How do model jailbreaks affect performance? 
    • Can we reliably estimate the upper bound of model performance when testing jailbroken models?
  • How can we account for, or prevent, data contamination when using public evaluation datasets?

We’re also interested in supporting:

  • More rigorous measurements of model performance on top benchmarks
  • Uncontaminated versions of top benchmarks (e.g., GSM1k for more challenging benchmarks)
  • Policy and/or planning work toward:
    • Bounties for beating the evaluation scores that companies report
    • RCTs and human uplift studies for human novices
    • Tracking and discovering “in the wild” examples of frontier capabilities, e.g., Big Sleep, Project Naptime, LLM forecasting 

3.5 Useful previous work 

Work in this area that we think is useful includes:

4. Improving third-party access and evaluations infrastructure

Reliable assessment of frontier AI capabilities requires independent verification, but conducting meaningful external evaluations is already challenging, and will become harder as security requirements grow more stringent.[37]This point is made in International AI Safety Report 2025, which states (pg 181): “Rigorous risk assessment requires combining multiple evaluation approaches, significant resources, and better access. Key risk indicators include evaluations of systems themselves, how people apply them, as well as …”

We’re looking for work that can help manage the tension between security and independent oversight. This includes:

  • Improving model access for external evaluations 
  • Approaches to verifiable model auditing
  • Improving evaluation infrastructure 

4.1 Improving model access for external evaluators

As security requirements increase, we need to understand exactly what access auditors need to evaluate model capabilities and safety. This requires understanding the relationship between access levels, evaluation quality, and security risks; understanding practical implementation challenges; and developing protocols for third-party evaluators that work within labs’ security constraints while enabling meaningful oversight. Open questions include:

  • What minimum information about model training do auditors need to know to evaluate a safety case (i.e., a structured argument that an AI system is safe to deploy in a given environment)?[38]By “safety case,” we mean “a structured argument, supported by a body of evidence, that provides a compelling, comprehensible, and valid case that a system is safe for a given application in a given environment.”; we borrow this definition from Safety cases at AISI. For more details and …
  • Which information about model training would be sufficient for auditors to evaluate safety cases? Which information would be most helpful? 
  • What are the strongest safety cases we could make and evaluate at each level of model access?
  • What are the practical barriers to third-party evaluation at SL-3, SL-4, and above?[39] For a definition of SL-3 and SL-4, see Securing AI Model Weights: Preventing Theft and Misuse of Frontier Models.
  • How should capabilities evaluations for models deployed for internal use only be carried out? 
  • What are the benefits of different levels of access for different kinds of model evaluations (e.g., standard API, helpful-only models, fine-tuning, logprobs, intermediate activations, full weights)? Which are necessary, and which are beneficial?
  • What are the trade-offs between different levels of model access in terms of security risks and audit effectiveness, and how can the corresponding security costs be mitigated?
  • What governance frameworks, structured access commitments, and evaluation protocols would enable robust external evaluation while managing security risks? This includes:
    • Allocation of evaluation rights and responsibilities across organizations
    • Legal frameworks and liability protections
    • Governance structures and oversight mechanisms for evaluation organizations
    • “Safe channels” or other protocols for secure evaluator-developer collaboration

4.2 Improving evaluations infrastructure

To make evaluations easier, quicker, and more useful to run, the field should work toward establishing common standards and best practices for conducting evaluations. Work we are particularly interested in includes:

  • Building on Inspect, e.g., by:
    • Implementing realistic sandboxes for agentic evaluations
    • Porting existing high-quality evaluations to Inspect
    • Building tools for designing model graders
  • Guidance on how to run evaluations, such as:
    • Clear reporting of test conditions (pass-at-k, best-of-k; scaffolding and elicitation effort; post-training enhancements; inference and time budget); a sketch of the standard pass@k estimator follows this list
    • Standards and best practices for human uplift studies
    • Incorporation of statistical best practices and insights from metrology[40]See, e.g., A statistical approach to model evaluations.
  • Guidance on responsible disclosure of evaluations results 
    • This also includes guidance on:
      • Avoiding turning dangerous capabilities into optimization targets
      • Handling test sets or fine-tuning datasets that may be classified and/or dangerous
      • Managing and sharing evaluation results (especially on deployed models) that might be classified and/or dangerous
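As an example of what precise pass-at-k reporting involves: when n samples are drawn per task and c of them pass, the standard unbiased estimator (introduced in the Codex evaluation paper) is pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k given n samples with c successes."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples per task, 3 of which passed.
for k in (1, 5, 10):
    print(f"pass@{k}: {pass_at_k(20, 3, k):.3f}")
```

Reporting n, k, and the sampling settings alongside the estimate matters, since pass@k values are not comparable across different test conditions.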

4.3 Verifiable model auditing

Current model auditing approaches are insufficient for high-stakes evaluations under strict security requirements. Relying solely on developers’ internal evaluation claims introduces conflicts of interest, as they have incentives to downplay model risks. The alternative — providing external auditors with direct model access — can create security concerns. External evaluators also face practical constraints of limited time and model access, leading to rushed and uncertain evaluation results.[41] See, e.g., Details about METR’s preliminary evaluation of OpenAI o1-preview, US AISI and UK AISI Joint Pre-Deployment Test, and International AI Safety Report 2025 (pg 189).
While some AI companies have provided pre-deployment access to third-party evaluators, this cannot be relied on as security requirements grow more stringent.

Although significant breakthroughs would be required, verifiable model auditing techniques could be a better way forward. If developers can prove claims about, or verifiably run externally developed evaluations on, their models without providing direct access, security requirements would present a much smaller barrier to independent verification. However, many questions remain before such methods can be used, including:

  • What are the likely costs of and main barriers to scaling different options for verifiable model auditing (e.g., ZKPs, homomorphic encryption, secure enclaves) to frontier models? 
  • How does this picture change if we vary our trust assumptions or operational constraints, for example assuming that:
    • Auditors must verify claims about model attestation, version verification, and output integrity
    • Auditors must not know the model architecture 
    • Labs must not know the evaluations being run
  • How can verification handle models that use external tools or retrieval?
  • What statistical claims can we make about model properties via sampling? How confident can we be in these claims? (A simple example follows this list.)
  • What are the minimum hardware requirements for verifiable auditing at the trillion-parameter scale? What specialized hardware architectures might enable faster verification?
  • What are the security trade-offs between different verification approaches for frontier models, including secure enclaves, ZKPs, and homomorphic encryption?
  • How might these approaches fail?
    • How can we verify that a particular model version was evaluated?
    • Which approaches work well for distributed training, pre-training, or post-training?
    • How do verification costs compare across different model architectures?
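On the sampling question above, here is a simple example of the kind of claim that is possible under strong assumptions (queries sampled i.i.d. from the relevant distribution, and the property reliably detectable): an exact binomial (Clopper-Pearson) upper bound on the rate at which a model exhibits some property, given that it was observed in c of n sampled interactions.

```python
from scipy.stats import beta

def clopper_pearson_upper(c: int, n: int, alpha: float = 0.05) -> float:
    """One-sided (1 - alpha) upper confidence bound on a binomial proportion."""
    if c >= n:
        return 1.0
    return beta.ppf(1 - alpha, c + 1, n - c)

# Example: the audited property was observed in 0 of 1,000 sampled queries.
print(f"95% upper bound on the true rate: {clopper_pearson_upper(0, 1000):.4f}")
```

Real audits are harder: queries are rarely i.i.d., the property of interest may be hard to detect automatically, and a model might behave differently when it can tell it is being tested, which is part of why these remain open questions.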

4.4 Useful previous work

Work in this area we think is useful includes:

4.5 Other kinds of proposals

If you have a strong proposal that doesn’t fit this RFP, consider applying to our AI governance RFP, or our technical AI safety RFP.

5. Application process

5.1 Time suggestion

We suggest that you aim to spend no longer than one hour filling out the Expression of Interest (EOI) form, assuming you already have a plan you are excited about. Our application process deliberately starts with an EOI rather than a longer intake form to save time for both applicants and program staff.

5.2 Feedback

We do not plan to provide feedback for EOIs in most instances. We expect a high volume of submissions and want to focus our limited capacity on evaluating the most promising proposals and ensuring applicants hear back from us as promptly as possible.

5.3 Next steps after submitting an EOI

We aim to respond to all applicants within three weeks of receiving their EOI. In some cases we may need additional time to respond, for example if your proposal requires consultation with external advisors who have limited bandwidth, or if we receive an unexpected surge of EOIs when we are low on capacity.

If your EOI is successful, you will then typically be asked to fill out a full proposal form. Assuming you have already figured out the details of what you would like to propose, we expect this to take 2-6 hours to complete, depending on the complexity and scale of your proposal.

Once we receive your full proposal, we’ll aim to respond within three weeks about whether we’ve decided to proceed with a grant investigation (though most applicants will hear back much sooner). If so, we will introduce you to the grant investigator. At this stage, you’ll have the opportunity to clarify and evolve the proposal in dialogue with the grant investigator, and to develop a finalized budget. See this page for more details on the grantmaking process from this stage.

6. Acknowledgments

This RFP text was largely drafted by Catherine Brewer, in collaboration with Alex Lawsen. 

We’d like to thank Asher Brass, Ben Garfinkel, Charlie Griffin, Marius Hobbhahn, Geoffrey Irving, Ole Jorgensen, and Kamilė Lukošiūtė[42] All names are listed alphabetically by last name. for providing useful feedback on drafts at various stages.[43] People having given feedback does not necessarily mean they endorse the final version of this text. We’re also grateful to our Open Philanthropy colleagues — particularly Ajeya Cotra, Isabel Juniewicz, Jake Mendel, Max Nadeau, Luca Righetti, and Eli Rose[44] Again, all names are listed alphabetically by last name. — for valuable discussions and input.
