How the dataset is built
Plain-English summary of the data pipeline, sourcing approach, and the safeguards we apply before anything reaches a public page.
Source: Freedom of Information requests
UK universities are public authorities under the Freedom of Information Act 2000 (Scottish universities fall under the equivalent Freedom of Information (Scotland) Act 2002). They are required to disclose, on request, statistical data they hold, including module-level grade distributions, subject to a small set of grounds for refusal: principally the exemptions for personal data (s40) and commercial interests (s43), and the provision on vexatious requests (s14) of the 2000 Act.
We file requests in a standardised form across UK institutions and archive both the requests and the responses on WhatDoTheyKnow, the public FOI archive. Universities respond on their own timelines (typically within the statutory 20 working days, sometimes longer) and in their own formats: CSV, Excel, PDF, and occasionally scanned images.
Normalisation
Each response is processed through our internal data CLI, which applies programmatic schema-mapping first and falls back to LLM-assisted column inference only when the schema is unrecognisable. Repeated submissions from the same university reuse a cached schema mapping, so we don't re-run inference unnecessarily. Every transformation is logged.
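The order of operations described above (cached mapping, then programmatic rules, then inference as a last resort) can be sketched roughly as follows. This is a minimal illustration, not our CLI's actual code: the canonical column names, the alias table, the cache key, and the `infer` callback standing in for the LLM-assisted step are all assumptions made for the example.

```python
from typing import Callable, Optional

# Hypothetical canonical columns and the header spellings we recognise for each.
CANONICAL_ALIASES = {
    "module_code": {"module code", "module_code", "mod code"},
    "academic_year": {"academic year", "year", "session"},
    "cohort_size": {"cohort", "cohort size", "students"},
}

Mapping = dict[str, str]  # source header -> canonical column name


def map_by_rules(headers: list[str]) -> Optional[Mapping]:
    """Programmatic mapping; returns None if any canonical column is unmatched."""
    mapping: Mapping = {}
    for header in headers:
        key = header.strip().lower()
        for canonical, aliases in CANONICAL_ALIASES.items():
            if key in aliases:
                mapping[header] = canonical
    if set(mapping.values()) == set(CANONICAL_ALIASES):
        return mapping
    return None


def resolve_mapping(
    university: str,
    headers: list[str],
    cache: dict,
    infer: Callable[[list[str]], Mapping],
    log: list[str],
) -> Mapping:
    """Cache first, then rules, then the inference fallback; log every path taken."""
    cache_key = (university, tuple(headers))  # assumed key: institution + header row
    if cache_key in cache:
        log.append(f"{university}: reused cached mapping")
        return cache[cache_key]
    mapping = map_by_rules(headers)
    if mapping is None:
        log.append(f"{university}: schema unrecognised, used inference fallback")
        mapping = infer(headers)
    else:
        log.append(f"{university}: mapped by rules")
    cache[cache_key] = mapping
    return mapping
```

Passing the inference step in as a callback keeps the expensive path explicit and easy to audit: in this sketch it runs at most once per university and header layout.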
Suppression
Universities apply different cohort-size thresholds when suppressing rows under the personal-data exemption — typically 5–7 students, sometimes 10. We apply a uniform suppression threshold of cohort < 10 to all public-facing data, regardless of what the FOI reply contains. This is deliberately conservative.
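As a rough sketch of the uniform rule, assuming each normalised row is a dictionary with a hypothetical `cohort_size` field (set to `None` where the university had already withheld the figure):

```python
SUPPRESSION_THRESHOLD = 10  # rows with fewer students than this are never published


def suppress_small_cohorts(rows: list[dict]) -> list[dict]:
    """Keep only rows whose cohort size is known and at least the threshold."""
    kept = []
    for row in rows:
        cohort = row.get("cohort_size")
        # A missing cohort size is treated as suppressed, never as publishable.
        if cohort is None or cohort < SUPPRESSION_THRESHOLD:
            continue
        kept.append(row)
    return kept
```

Because the filter runs on our side, a row that a university released with a cohort of 7 is still dropped before it reaches a public page.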
On public cluster pages we never publish exact percentages or counts, only banded descriptors (low / mid / high First-class rate, mean-mark band, cohort-size band) and the year range covered. Exact distributions are visible only to logged-in users of the paid product.
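To illustrate what banding means in practice, here is a small sketch that turns exact figures into public descriptors. The cut-off values, field names, and band labels below are invented for the example and are not the thresholds we actually use.

```python
def band_first_rate(first_pct: float) -> str:
    """Map an exact First-class percentage to a coarse band (illustrative cut-offs)."""
    if first_pct < 15:
        return "low"
    if first_pct < 30:
        return "mid"
    return "high"


def band_cohort_size(cohort_size: int) -> str:
    """Map an exact cohort size to a coarse band; small cohorts must not reach here."""
    if cohort_size < 10:
        raise ValueError("cohorts below 10 are suppressed, not banded")
    if cohort_size < 50:
        return "10-49"
    if cohort_size < 150:
        return "50-149"
    return "150+"


def year_range(years: list[int]) -> str:
    """Describe the span of years covered without listing per-year figures."""
    first, last = min(years), max(years)
    return str(first) if first == last else f"{first}-{last}"


def public_summary(row: dict) -> dict:
    """Build the public view: bands and year range only, no exact numbers."""
    return {
        "first_rate_band": band_first_rate(row["first_pct"]),
        "cohort_size_band": band_cohort_size(row["cohort_size"]),
        "year_range": year_range(row["years"]),
    }
```

The public summary is built as a fresh dictionary rather than by deleting fields from the full row, so an exact figure cannot leak through a forgotten key.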
Validation
Before any module's data goes live, every row must pass these checks:
- Cohort size ≥ 10
- Sum of grade-band percentages within ±0.5 percentage points of 100
- At least two non-suppressed years for trend signals
- Schema validation against our canonical model
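The four checks above could be expressed along these lines. This is a sketch under assumptions: the field names and the simplified type-based schema check are hypothetical stand-ins for our canonical model, and the count of non-suppressed years is passed in rather than derived.

```python
# Hypothetical stand-in for the canonical model: required fields and their types.
REQUIRED_FIELDS = {
    "module_code": str,
    "academic_year": str,
    "cohort_size": int,
    "grade_pcts": dict,  # grade band -> percentage of the cohort
}

MIN_COHORT = 10
PCT_TOLERANCE = 0.5  # allowed deviation of the percentage sum from 100
MIN_TREND_YEARS = 2


def validate_row(row: dict, non_suppressed_years: int) -> list[str]:
    """Return the names of failed checks; an empty list means the row passes."""
    failures = []
    # Schema check first: the remaining checks assume well-typed fields.
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(row.get(field), expected_type):
            failures.append(f"schema:{field}")
    if failures:
        return failures
    if row["cohort_size"] < MIN_COHORT:
        failures.append("cohort_size")
    total = sum(row["grade_pcts"].values())
    # Small epsilon so float rounding does not reject a sum of exactly 99.5 or 100.5.
    if abs(total - 100.0) > PCT_TOLERANCE + 1e-9:
        failures.append("percent_sum")
    if non_suppressed_years < MIN_TREND_YEARS:
        failures.append("trend_years")
    return failures
```

Returning the list of failed checks, rather than a single pass/fail flag, makes it possible to log exactly why a row was held back.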
Limitations
FOI data has known limits. Universities sometimes round to the nearest 5 even above the disclosure threshold. Some institutions decline to disclose at module level, citing commercial-interest concerns. Coursework rubrics drift over time, so historical mean marks are less predictive of future ones than they appear. We surface these caveats on every page where they matter.
Ethics
This dataset is, by design, about structural outcomes — not individual students or individual academics. We don't draw teaching-quality conclusions from grade data; we don't publish names; we don't republish raw FOI rows. Aggregated signals only.