The Other 99% of Being Human in the Loop
Lessons on meaningful human oversight from the people who actually do it
Ever since the European AI Act’s initial set of requirements became effective earlier this year, I’ve also been more involved in discussions around human understanding of automated decisions.
And something I think about a lot more now is what “meaningful human oversight” actually means. What does it look like? What factors define good oversight versus checkbox compliance?
You’ve seen me criticise the wording of Article 14 against the backdrop of automation bias and the AI-alignment concept of scalable oversight. This sort of critical analysis may give you the impression that I’m pessimistic about “Human in the Loop” (HITL) frameworks. And I am the right amount of doubtful, yes.
But today I want to tell you about the situations that have made me think most deeply and productively about what it means to be the Human in the Loop.
It wasn’t in the office or while conducting AI governance research…
I love traveling. I’m that annoying person who arrives early at the airport just to watch other planes take off. I love the feeling of looking out the window during takeoff, and I actually enjoy long-hauls!
But even for travel lovers, going through airport security is a stressful part of the trip.
I’ve had the fortune of traveling several times this year, and two particular experiences at airport security stayed with me for somewhat unusual reasons.
Early memories of “dehumanisation”: A 5-year-old’s POV
Have you ever had your bags inspected at TSA? I remember this happening to my mother when I was five years old.
She was an exhausted mother, handling two children on less than two hours of sleep. But to the security officers, she was still a Colombian woman on a long-haul flight.
Her bags were placed on the floor at TSA (which wasn’t very clean), and our belongings were torn apart just to find what triggered the alarm.
As my mum had already insisted, it was the baby formula.
At the end of this, she had to kneel down and patiently put all our stuff back, while my brother and I watched.
Yes: Sometimes, it’s just powder.
A few months ago, I was travelling via London Gatwick and something in my handbag triggered the alarm.
I didn’t know what it could have been. Instinctively, I got uneasy about potential delays. But I also suddenly remembered this dehumanising feeling of seeing my mum kneeling in a packed airport, putting our things back, and muttering about good people being treated as criminals.
My stomach tightened at the thought, and I wondered why it had come to mind.
I went to the secondary screening area. The officer didn’t need to open the bag: it was already going through a CT scanner that could analyse the chemical composition of its contents.
The officer smiled at me and exclaimed: “Wow, seems some people on this team have never seen makeup in their life! Sorry about this, madam, have a good trip!”
Somehow, I felt a strange weight lift off my shoulders. Twenty-four years after my mother’s incident, I reminded my adult brain that this technology exists and that bags are no longer opened as often, sparing people the feeling of being treated like criminals for an honest, human mistake.
I thanked my luck and felt truly blessed for the opportunity to reflect on this. But what stuck with me was how the security officer handled the incident: with a smile, with the bearing of someone who carries the task of humanising something that might otherwise feel like an invasion of privacy.
And an idea for this post started to form.
“This. This is why,” I thought, walking towards my gate.
Early memories of “humanisation”: A 3-year-old’s POV
On a more recent trip, I witnessed something interesting again, this time in the security area at Manchester Airport. After placing my hand luggage in the trays, I joined the queue for the full-body scanner.
In front of me were a single mother and her child, who must have been about 3 years old. I realised I’d never seen a child go through this process, and naively thought for a moment they’d allow the child to remain in her mother’s arms.
Apparently not.
The security officer looked extremely apologetic, probably moved by the scene herself. The mother let go of the child and instructed her to stand at the centre of the scanner where the foot drawings were: arms up, feet apart, imitating the silhouette in the illustration — and as required by aviation security regulations.
For a couple of minutes, a three-year-old had to adopt the same pose criminals are instructed to adopt while being searched, because this is what the rule dictated.
I felt a weird lump in my throat, and I suspect the people around me did too.
The security officer clapped and congratulated the child for doing such a good job, and then grabbed her little hand while waiting for the mother to go through the scanner too. Relieved, the mother received her child back in her arms once they were both cleared, and carried her to the next checkpoint. The child looked so happy about the reaction she had received from the officer that she kept imitating this pose and laughing for a while.
I kept picturing the gravity and ceremony in the officer’s expression, and I remembered what had happened to me at Gatwick.
I was moved by the thought of how these professionals (at least the ones who truly care) carry the task of humanising experiences that may otherwise feel dehumanising.
I felt a sense of hope and gratefulness for these experiences.
Later that day, on my way to meet my family, I decided that I could make peace with the wording in Article 14… but only as long as we prioritise preserving this kind of human presence in our critical infrastructures.
Beautiful, but: What do you mean by “Human in the Loop”?
Okay, this was more anecdotal writing than usual. But I wanted to ground my thoughts on these experiences before getting legal-technical.
So, in AI regulatory compliance, what do we usually mean by Human in the Loop?
At its core, it’s the premise that a human remains actively involved in an AI system’s decision-making process: able to intervene, override, or at least understand what’s happening when machines make decisions that affect our lives. The European AI Act is quite specific about this.
Article 14 mandates that high-risk AI systems be designed for “effective oversight” by natural persons during use. It requires that humans can fully understand the system’s capabilities and limitations, properly interpret its outputs, and decide not to use the system, or to disregard, override, or reverse its outputs.
Article 13 sets out transparency obligations, requiring that these systems come with instructions clear enough that people can actually exercise this meaningful oversight.
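For readers who like to see legal obligations as interfaces, here is a minimal sketch of the kind of hooks a high-risk system might expose so that a natural person can actually exercise those capabilities. It is purely illustrative: the names and structure are mine, not the Act’s, and real conformity work goes far beyond a code snippet.

```python
from dataclasses import dataclass
from enum import Enum


class ReviewerAction(Enum):
    """Some of the options Article 14 expects the human to genuinely have."""
    ACCEPT = "accept"          # act on the system's output
    DISREGARD = "disregard"    # ignore the output in this case
    OVERRIDE = "override"      # substitute the reviewer's own decision
    STOP_USE = "stop_use"      # decide not to use the system at all


@dataclass
class SystemOutput:
    decision: str       # e.g. "flag_bag_for_secondary_screening"
    confidence: float   # model confidence, 0.0 to 1.0
    rationale: str      # human-readable basis for the output (Article 13 territory)


@dataclass
class OversightRecord:
    output: SystemOutput
    action: ReviewerAction
    reviewer_id: str
    justification: str  # the human's own reasoning, logged alongside the machine's


def human_review(output: SystemOutput, reviewer_id: str,
                 action: ReviewerAction, justification: str) -> OversightRecord:
    """The oversight step: the person sees the output *and* its rationale,
    and it is their decision, not the model's, that the process records."""
    print(f"System proposes: {output.decision} (confidence {output.confidence:.0%})")
    print(f"Stated basis: {output.rationale}")
    return OversightRecord(output, action, reviewer_id, justification)
```

None of this is hard to build. The hard part, as the next sections argue, is keeping the person behind `human_review` genuinely engaged.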
Scalability and automation bias: How “oversight” struggles to actually oversee
What the AI Act envisions as “meaningful oversight” might be fundamentally at odds with how oversight actually scales. After all, how do you oversee something that processes millions of data points in ways your brain cannot replicate? How can humans supervise systems that increasingly operate beyond human-comprehensible complexity?¹
Arguably, most AI systems deployed in critical infrastructures are still comprehensible enough that the AI Act’s mandates are not unreasonable or technically unfeasible (yet).
But a big challenge for effective oversight of current systems is our terrible habit of over-relying on automated systems even when we’re supposedly supervising them: what we call “automation bias”.
Have you ever accepted your GPS’s instructions, even though you suspected it was making a mistake? When a system is right most of the time, our brains naturally start acting on its decisions before our internal alarm starts ringing. That is why, sometimes, you only course-correct after you’ve already taken the wrong turn.
Under the AI Act, the security scanner at Gatwick must allow the officer to override its algorithm, understand why it flagged my cosmetics, and choose to dismiss the alert. The officer needs sufficient training to know when the system might fail, clear information about what it’s detecting, and the actual authority to make the final call.
In banking, when an AI flags a transaction as suspicious, the compliance officer reviewing it must be able to understand the flag’s basis, assess its validity against their own judgment, and override it if they disagree.
What we perhaps don’t appreciate enough is the extent to which that compliance officer is battling their own cognitive tendency to simply trust the machine.
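One pattern sometimes suggested for blunting that tendency is to have the reviewer commit to an independent read of the case before the model’s verdict and rationale are revealed, so that the comparison is between two judgments rather than a blank the machine fills in. The sketch below is my own toy illustration of that idea, not something the AI Act or any banking rulebook prescribes:

```python
from dataclasses import dataclass


@dataclass
class TransactionFlag:
    transaction_id: str
    ai_verdict: str      # e.g. "suspicious" or "legitimate"
    ai_rationale: str    # explanation surfaced to the compliance officer


def calibrated_review(flag: TransactionFlag,
                      officer_verdict: str,
                      officer_reason: str) -> dict:
    """'Judge first, then peek': the officer's verdict is recorded before
    the AI's verdict and rationale are shown, so the comparison is between
    two independent judgments rather than a rubber stamp."""
    agrees = officer_verdict.strip().lower() == flag.ai_verdict.strip().lower()
    return {
        "transaction_id": flag.transaction_id,
        "officer_verdict": officer_verdict,
        "officer_reason": officer_reason,
        "ai_verdict": flag.ai_verdict,
        "ai_rationale": flag.ai_rationale,
        "agreement": agrees,
        # Disagreement is not an error condition: it is evidence that a
        # human actually exercised judgment, and it should trigger review
        # rather than be silently overwritten by either side.
        "escalate": not agrees,
    }


# Hypothetical usage: the officer calls it "legitimate" before seeing the flag.
record = calibrated_review(
    TransactionFlag("TX-4471", "suspicious", "amount deviates from customer profile"),
    officer_verdict="legitimate",
    officer_reason="matches the client's documented seasonal payroll run",
)
print(record["escalate"])  # True: the disagreement gets a second look
```

What matters here is the ordering, not the data model: the officer’s judgment exists before the explanation does, which is one imperfect way to keep “assess its validity against their own judgment” from collapsing into a click-through.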
Cognitive Calibration
One of the best articles I’ve seen lately on this issue is “Cognitive Calibration” by James Kavanagh and Dr. Alberto Chierici.
Coincidentally, James grounds his piece in a real-life anecdote about an airplane crash (very much in line with my airport musings!): Air France Flight 447, which crashed into the Atlantic Ocean in 2009, killing all 228 people aboard.
The tragedy wasn’t caused by a common mechanical failure. It happened because experienced pilots had become so accustomed to the autopilot that they had, quite literally, forgotten how to execute basic stall recovery procedures.
When ice crystals blocked the airspeed sensors and the autopilot disconnected at 37,000 feet, the crew had four minutes to perform a manoeuvre any experienced pilot can do: nose down, power up. Instead, they pulled back on the stick for the entire descent, unable to recognise or recover from the stall despite continuous alarms:
Because years of automation had atrophied their manual flying skills.
Reading this actually made me jump off the sofa a few times. As the authors put it:
Air France 447 reveals a disturbing truth: as our systems become more automated and capable, humans become increasingly vulnerable at the moment those systems fail. Our cognitive abilities are not calibrated to reliably understand and act in those moments.
If AI handles 99.9% of decisions correctly, humans gradually lose the ability to intervene effectively in that critical 0.1% when systems fail. And, to make matters worse, it turns out that increasing explainability can lead to even more automation bias, because plausible-sounding explanations make it harder for the human to detect when they’re not true.
And, if automation continues to improve (and we overcome scalability concerns), that 0.1% may turn into more of a 0.0001%… Can this be helped at all?
Maybe, if standards for meaningful human oversight actively train the Human in the Loop for “cognitive calibration”: the ability to maintain appropriate scepticism toward AI systems, knowing when to trust their outputs and when to challenge them.
James and Dr. Alberto frame it as “building the muscle memory that keeps judgment calibrated and intact precisely when the system needs a human”². They also propose a practical framework called “the CATCH protocol”, which you can read here. I really encourage Stress Testing Reality readers to engage with their research!
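To make “calibration” slightly less abstract, here is a crude way one could monitor it over time; this is my own toy illustration, not part of the CATCH protocol. It compares how often the reviewer challenges the system with how often those challenges turn out to be justified.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReviewOutcome:
    ai_was_correct: bool      # ground truth, established after the fact
    human_challenged: bool    # did the reviewer push back on the AI?


def calibration_report(history: List[ReviewOutcome]) -> dict:
    """Three crude signals of cognitive calibration:
    - challenge_rate: how often the human pushes back at all
      (near zero suggests rubber-stamping);
    - challenge_precision: how often pushing back was justified
      (near zero suggests contrarianism rather than judgment);
    - error_catch_rate: how many of the system's actual errors the human caught."""
    challenges = [r for r in history if r.human_challenged]
    ai_errors = [r for r in history if not r.ai_was_correct]
    caught = [r for r in ai_errors if r.human_challenged]
    return {
        "challenge_rate": len(challenges) / len(history) if history else 0.0,
        "challenge_precision": (
            sum(not r.ai_was_correct for r in challenges) / len(challenges)
            if challenges else 0.0
        ),
        "error_catch_rate": len(caught) / len(ai_errors) if ai_errors else 1.0,
    }


# Hypothetical history: the system was wrong twice; the reviewer caught one of those.
history = [
    ReviewOutcome(ai_was_correct=True, human_challenged=False),
    ReviewOutcome(ai_was_correct=False, human_challenged=True),
    ReviewOutcome(ai_was_correct=False, human_challenged=False),
    ReviewOutcome(ai_was_correct=True, human_challenged=True),
]
print(calibration_report(history))
```

A challenge rate near zero is the rubber-stamp failure mode; a high challenge rate with low precision is its contrarian mirror image.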
So, What Does This Have to Do with the Previous Rant?
Yes, we need humans who can correct AI’s mistakes before they affect people, and who make it possible to provide meaningful explanations to natural persons as to how these decisions are made.
But long hours at the airport had me thinking: what will actually happen when AI gets so good that mistakes become rare? When, most of the time, the Human in the Loop is… just there?
I picture that three-year-old being scanned with her hands up. If a false alarm were triggered, I cannot imagine the helplessness a mother must feel when she doesn’t understand what’s happening and cannot reassure her toddler. But even if nothing goes wrong, letting a small child out of her arms to be independently scanned (to be cleared of the “maybe there’s something illegal here” presumption) is already uncomfortable.
It would feel even more alienating without the empathetic presence of another person, reassuring you and giving dignity back while this intrusive moment passes.
As we solve the scalability issues of automation and reduce errors, the human in the loop won’t spend most of their time catching mistakes. They may well become the only “humanisers” of experiences that would otherwise feel soulless. And maybe that’s the point.
Customer service AI agents and chatbots fail for the same reasons human customer service fails: poor rapport, no empathy, the feeling of a robotic interaction.
The human in the loop won’t just have to combat automation bias; they’ll also have to resist becoming just another robotic cog in the decision chain.
Right now, oversight frameworks focus on one trait the HITL must have: the necessary technical expertise. But isn’t the ability to preserve humanity in automated processes also part of meaningful human oversight?
We know that the critical part of being the Human in the Loop will always depend on our capacity to remain alert, in control, and cognitively calibrated.
But while nothing goes wrong, while oversight turns tedious because there are no mistakes to correct, the human in the loop must, simply, humanise.
If I ever have children and I have to let them out of my arms to be scanned at airport security, an LLM output saying everything’s fine won’t remove the knot from my throat.
It will be the kind smile of the human in the loop who isn’t just checking for glitches, but knowingly looking into my eyes, thinking: “it’s okay, this feels weird, and it won’t be too long now”.
¹ This problem is commonly known as “scalable oversight” in AI Safety: the challenge of ensuring humans can provide accurate feedback on AI outputs even when those outputs involve tasks beyond human expertise or comprehension. Scalable oversight research focuses on developing methods (such as task decomposition, debate between AI systems, or recursive reward modeling) that allow humans to maintain meaningful supervisory capacity even as AI capabilities exceed human performance in specific domains.
While this article focuses mostly on current state-of-the-art systems deployed in high-risk settings rather than on general-purpose AI, I believe the concept is still illustrative of the limitations of human oversight.
I’ve never liked the saying “it’s like riding a bike”. If steering and balance were automated and you only had to stay alert for that 0.1% moment when they fail, I’d expect most of those moments to end in an accident, simply because of slow reflexes and reaction times.