Why the Foundation Practice Rating does not use AI to gather data about foundations
This blog post is written by Caroline Fiennes, Giving Evidence
The Foundation Practice Rating (FPR) does not use artificial intelligence (AI) to gather data about foundations because AI is not yet accurate enough. That is the position as of now: AI may improve, and we are open to using it in future.
Accuracy is essential to FPR. Generative AI (GenAI, which is what we would be using) can “hallucinate”, to use the industry term, i.e., produce answers which are plausible but not actually correct. FPR does not need information which is plausible but wrong.
For example, the new ChatGPT5 seems unable to count accurately the number of times the letter ‘b’ appears in the word ‘blueberry’. (See below, but do read the whole story: it’s quite funny.) This is despite the CEO of the firm which runs ChatGPT, OpenAI, claiming that ChatGPT5 is like having “access to a PhD-level expert in your pocket”.
There is a growing academic literature about inaccuracies in AI. For instance, this paper by academics at Glasgow University is entitled “ChatGPT is bullshit”: it uses “bullshit in the sense explored by Frankfurt (On Bullshit, Princeton, 2005)… because they [large language models such as those used in ChatGPT] are designed to produce text that looks truth-apt without any actual concern for truth, it seems appropriate to call their outputs bullshit”.
Predictions
GenAI (i.e., AI which produces material, normally words) draws on existing published material. These systems use language models (LMs), described by Bender et al (2021) as “a system for haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning: a stochastic parrot.”
Such AI is particularly likely to be inaccurate when the new material differs from the existing published material (i.e., the training material). For example, Professor Trish Greenhalgh asked “AI” (she didn’t specify which service) to summarise a new “outlier” academic paper: because it drew on existing material, “[t]he AI summary assumed the study had found what everyone else had found”. So LLMs are unlikely to detect that a foundation has changed its practices, and hence that the answers to FPR’s questions about it have changed.
GenAI’s main job is to predict a likely answer. When Gmail suggests endings for your sentence, that is what it is doing: “based on the start of the sentence, and given the patterns of text in zillions of previous emails, what is likely to come next?” You’ll have noticed that sometimes Gmail’s prediction of your intended ending is correct and sometimes it isn’t.
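To make that idea concrete, here is a toy sketch of our own (not anything Gmail or any real language model actually uses, and the tiny corpus is invented for illustration): a minimal “predictor” simply counts which word most often follows another in its training text and suggests that, regardless of whether it is what you meant to write.

```python
from collections import Counter, defaultdict

# Tiny invented corpus standing in for the "zillions of previous emails".
corpus = [
    "thank you for your email",
    "thank you for your email",
    "thank you for your time",
    "thank you for your patience",
]

# Count how often each word follows each other word.
following = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current_word, next_word in zip(words, words[1:]):
        following[current_word][next_word] += 1

def predict_next(word):
    """Return the most frequent follower of `word` in the corpus, if any."""
    counts = following.get(word)
    return counts.most_common(1)[0][0] if counts else None

# The prediction is whatever was most common in the training text -
# plausible, but not necessarily what the writer actually intended.
print(predict_next("your"))  # -> 'email'
```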
AI and the Foundation Practice Rating
In principle, FPR could use GenAI to find whether it is likely – based on other materials – that a particular foundation publishes a particular thing (e.g., gender pay gap data). But FPR is not interested in whether a foundation is likely to publish that thing: rather, FPR is interested in whether it does publish that thing.
A year ago, we asked ChatGPT how many community foundations there are in the UK: FPR’s method needs this number. It returned an answer which was plausible – but wrong. Today, we asked the same question of Claude (a different AI service). It said 47. UKGrantmaking lists 49. Claude’s answer cites the source it used: a page on UKGrantmaking’s website carrying the label below, which Claude had clearly not ‘read’:
Archived analysis
Please note that comparisons cannot be made between this analysis and the current edition of UKGrantmaking due to changes in the availability and quality of data, and to the methodology. Reinstated figures for 2022-23 have been included in the 2025 edition.
Some of FPR’s findings are ‘rare events’: for example, of the more than 400 foundation assessments we have done so far, only one has achieved an ‘A’ on diversity. Mathematician Adam Kucharski wrote recently about the particular dangers around rare events: “even a small hallucination risk can soon produce an overwhelming number of false positives at scale. And it will no doubt fall to a human to check them.”
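Kucharski’s point can be illustrated with some rough arithmetic (the hallucination rate below is an assumption chosen purely for illustration, not a measured figure for any particular tool):

```python
# Illustrative only: the hallucination rate is an assumed figure, not a measurement.
assessments = 400          # roughly the number of FPR assessments to date
genuine_As = 1             # only one 'A' on diversity found so far
hallucination_rate = 0.02  # assumed 2% chance a tool wrongly reports an 'A'

expected_false_As = (assessments - genuine_As) * hallucination_rate
print(f"Expected false 'A's: {expected_false_As:.0f} vs genuine 'A's: {genuine_As}")
# Even a small error rate swamps the single genuine result -
# and a human would still have to check every flagged case.
```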
Worse, there is a good chance that if we asked GenAI to answer FPR’s questions, it would draw on FPR’s own previous reports – which we have now published in each of our four years – and hence produce answers which simply re-state our previous findings, and don’t reflect real changes in foundation practices.
The only way for us to discover those inaccuracies would be to do the research ourselves: in other words, to do the research method that we already use.
Other reasons for not using AI for our research
First, FPR deliberately uses only three data sources: the foundation’s website, its most recent annual report, and other information published by the relevant regulator. We’re not interested in material from elsewhere, such as social media or reports by other entities, which any LLM may well draw on. (One could in principle write a detailed request, specifying the sole sources to use, but the only way to ascertain whether the request was properly obeyed is to do the research again.)
Second, surprisingly often the research involves reading scanned PDF documents: many annual reports are published on the regulator’s website in that format. As yet, most AI tools cannot read these because, to the software, they are simply pictures.
Third, an advantage of doing the research ourselves is that we know where the data came from, and thus can be accountable. Given that one of FPR’s three domains is accountability, this is important. We currently know precisely which two researchers first looked at any given foundation, how they each answered each question/criterion, the source that they used, what any moderator found, and hence the provenance of all of our decisions. With AI, we would cede all of that.
Fourth, a surprising number of FPR’s criteria require judgement, e.g., whether a document is a plan to improve diversity or merely a statement of commitment to do so. We even found an instance recently where it was unclear whether a foundation should be counted as having a website or not. We endeavour to be consistent in our judgements, which an automated system might not be.
Fifth, in FPR, we encounter odd situations amazingly frequently. For example, a foundation which has paused grant-making for a stated period (so it’s asleep); a foundation which has paused grant-making for a non-stated period (so it’s dead?); a foundation whose website is only accessible if you register; a foundation which has closed completely. We have to make judgements about handling these, and apply those ‘policies’ consistently each year. It’s unclear that an AI would do that reliably.
FPR does use some automation, e.g., it now pulls from UKGrantmaking the income and net asset values for each assessed foundation, which reduces some scope for human transcription error.
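For readers curious what that automation amounts to, a minimal sketch is below; the file name and column names are hypothetical stand-ins, since the real UKGrantmaking data may well be structured differently.

```python
import csv

# Hedged sketch only: 'ukgrantmaking_foundations.csv' and its column names are
# hypothetical; the real UKGrantmaking export may use different fields.
def lookup_financials(csv_path, foundation_name):
    """Return (income, net_assets) for a named foundation, or None if not found."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row.get("name", "").strip().lower() == foundation_name.lower():
                return float(row["income"]), float(row["net_assets"])
    return None

# Values flow straight from the source file into the assessment,
# removing the manual transcription step where typos could creep in.
# print(lookup_financials("ukgrantmaking_foundations.csv", "Example Foundation"))
```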
For sure, our researchers are human and make mistakes. But they don’t hallucinate or bullshit. For all these reasons, AI is not sufficiently accurate or consistent to be a reliable tool for the FPR without significant human checking.
For now, the Foundation Practice Rating is sticking with the humans.