Methodology

How the dataset is built

Plain-English summary of the data pipeline, sourcing approach, and the safeguards we apply before anything reaches a public page.

Source: Freedom of Information requests

UK universities are public authorities under the Freedom of Information Act 2000. They are required to disclose, on request, statistical data they hold, including module-level grade distributions. Disclosure is subject to a set of exemptions, principally personal data (s40) and commercial interests (s43), and a request can also be refused as vexatious (s14).

We file requests in a standardised form across UK institutions and archive both the requests and the responses on WhatDoTheyKnow, the public FOI archive. Universities respond on their own timelines (the statutory limit is 20 working days, though some take longer) and in their own formats: CSV, Excel, PDF, and occasionally scanned images.

Normalisation

Each response is processed through our internal data CLI, which applies programmatic schema-mapping first and falls back to LLM-assisted column inference only when the schema is unrecognisable. Repeated submissions from the same university reuse a cached schema mapping, so we don't re-run inference unnecessarily. Every transformation is logged.
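The rules-first, inference-as-fallback flow with a per-university cache could be sketched roughly as follows. This is an illustrative outline, not the actual CLI: the alias table, the canonical field names, the `SchemaMapper` class and the `infer_fallback` callable (a stand-in for the LLM-assisted step) are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical header aliases mapped to hypothetical canonical field names.
KNOWN_ALIASES: Dict[str, str] = {
    "module code": "module_code",
    "module": "module_code",
    "academic year": "year",
    "year": "year",
    "cohort": "cohort_size",
    "number of students": "cohort_size",
    "mean mark": "mean_mark",
}

# Assumed minimum set of fields for a schema to count as "recognised".
REQUIRED_FIELDS = {"module_code", "year", "cohort_size"}


def rule_based_mapping(headers: List[str]) -> Optional[Dict[str, str]]:
    """Map headers via the alias table; return None if the schema is unrecognised."""
    mapping = {}
    for header in headers:
        key = header.strip().lower()
        if key in KNOWN_ALIASES:
            mapping[header] = KNOWN_ALIASES[key]
    if REQUIRED_FIELDS <= set(mapping.values()):
        return mapping
    return None


class SchemaMapper:
    """Programmatic mapping first, fallback inference second, cached per university."""

    def __init__(self, infer_fallback: Callable[[List[str]], Dict[str, str]]):
        self._infer = infer_fallback
        self._cache: Dict[Tuple[str, Tuple[str, ...]], Dict[str, str]] = {}
        self.log: List[Tuple[str, str]] = []  # (university, how the mapping was obtained)

    def map_headers(self, university: str, headers: List[str]) -> Dict[str, str]:
        cache_key = (university, tuple(headers))
        if cache_key in self._cache:
            self.log.append((university, "cache"))
            return self._cache[cache_key]

        mapping = rule_based_mapping(headers)
        source = "rules"
        if mapping is None:
            # Only reached when the programmatic mapping cannot recognise the schema.
            mapping = self._infer(headers)
            source = "inference"

        self._cache[cache_key] = mapping
        self.log.append((university, source))
        return mapping
```

Keying the cache on both the university and the exact header row is one possible design choice: a university that changes its export format would then miss the cache and go through mapping again.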

Suppression

Universities apply different cohort-size thresholds when suppressing rows under the personal-data exemption — typically 5–7 students, sometimes 10. We apply a uniform suppression threshold of cohort < 10 to all public-facing data, regardless of what the FOI reply contains. This is deliberately conservative.
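As a minimal sketch of the uniform rule, the filter below drops every row whose cohort is under 10, whatever threshold the university itself used. The row layout and the `cohort_size` field name are assumptions, as is the choice to treat a missing or non-numeric cohort (for example a cell the university already redacted) as suppressed.

```python
SUPPRESSION_THRESHOLD = 10  # uniform public-facing threshold stated above


def apply_suppression(rows, threshold=SUPPRESSION_THRESHOLD):
    """Return only the rows that are safe to use in public-facing data."""
    kept = []
    for row in rows:
        cohort = row.get("cohort_size")  # hypothetical field name
        # Assumption: anything that is not a plain integer count (None, "<5", ...)
        # was already suppressed upstream and stays suppressed here.
        if not isinstance(cohort, int) or cohort < threshold:
            continue
        kept.append(row)
    return kept
```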

On public cluster pages we never publish exact percentages or counts, only banded descriptors (a low / mid / high band for the rate of Firsts, a mean-mark band, a cohort-size band) and a year range. Exact distributions are visible only to logged-in users via the paid product.
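Banding of this kind can be illustrated with a small lookup. The cut-off values and band labels below are invented purely for the example and are not the thresholds used on the site; only the idea of replacing an exact figure with a coarse label comes from the text above.

```python
import bisect

# Hypothetical cut-offs, chosen only to make the example concrete.
FIRST_RATE_CUTOFFS = [15.0, 30.0]        # percent of cohort awarded a First
FIRST_RATE_LABELS = ["low", "mid", "high"]

COHORT_CUTOFFS = [25, 100]               # students
COHORT_LABELS = ["10-24", "25-99", "100+"]


def band(value, cutoffs, labels):
    """Return the label of the band that value falls into.

    cutoffs must be sorted ascending and labels must have one more entry
    than cutoffs; a value equal to a cut-off goes into the higher band.
    """
    return labels[bisect.bisect_right(cutoffs, value)]


def public_descriptor(first_rate_percent, cohort_size):
    """Build the coarse descriptor shown publicly instead of exact numbers."""
    return {
        "first_rate_band": band(first_rate_percent, FIRST_RATE_CUTOFFS, FIRST_RATE_LABELS),
        "cohort_band": band(cohort_size, COHORT_CUTOFFS, COHORT_LABELS),
    }
```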

Validation

Before any module's data goes live, each row must pass the following checks:

  • Cohort size ≥ 10
  • Sum of grade-band percentages within ±0.5 of 100
  • At least two non-suppressed years for trend signals
  • Schema validation against our canonical model
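The checks above might look something like the following in code. This is a sketch under assumptions: the field names, the grade-band keys, and the reduction of "schema validation" to a simple presence-and-type check are all hypothetical stand-ins for the canonical model, which is not described here.

```python
MIN_COHORT = 10
PERCENT_TOLERANCE = 0.5  # allowed deviation of the band total from 100

# Hypothetical stand-in for the canonical model: field name -> expected type.
REQUIRED_FIELDS = {
    "module_code": str,
    "year": int,
    "cohort_size": int,
    "grade_bands": dict,  # band name -> percentage of cohort
}


def row_problems(row):
    """Return a list of reasons the row fails validation (empty if it passes)."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(row.get(field), expected):
            problems.append(f"schema: {field} missing or not {expected.__name__}")
    if problems:
        return problems  # the remaining checks need a well-formed row

    if row["cohort_size"] < MIN_COHORT:
        problems.append("cohort below 10")

    total = sum(row["grade_bands"].values())
    if abs(total - 100.0) > PERCENT_TOLERANCE:
        problems.append(f"grade bands sum to {total}, not 100 +/- 0.5")
    return problems


def module_can_show_trend(rows):
    """True if at least two distinct years have a row that passes every check."""
    valid_years = {row["year"] for row in rows if not row_problems(row)}
    return len(valid_years) >= 2
```

Returning the list of failure reasons, rather than a bare boolean, is a choice made here so that a rejected row can be logged with an explanation.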

Limitations

FOI data has known limits. Universities sometimes round figures to the nearest 5 even when the cohort is above the disclosure threshold. Some institutions decline to disclose at module level, citing commercial-interest concerns. Coursework rubrics drift over time, which means historical means are less predictive of future means than they appear. We surface these caveats on every page where they matter.

Ethics

This dataset is, by design, about structural outcomes — not individual students or individual academics. We don't draw teaching-quality conclusions from grade data; we don't publish names; we don't republish raw FOI rows. Aggregated signals only.