You ask why they chose that comparator. Silence.
You ask what the confidence interval means for the clinical claim. A guess.
You ask which covariates they adjusted for and why. A confused look.
The draft is real. The understanding behind it is not.
This is your problem now, not your trainee’s. For decades the artifact was your evidence. If your fellow could write the paper, they had probably thought through the science. If they could run the analysis, they had probably understood the data. AI cut that link. A strong manuscript no longer tells you the person you are training can do the work, which means the way you have always assessed them no longer measures what you think it measures.
This is not only a cheating problem. It is a measurement problem, and it is yours to solve. The fix is not better detection. It is better assessment.
1. The artifact stopped being your evidence
Research training already gives you the right concept for this moment. It is the entrustment decision: your judgment that a trainee can be trusted to do a task without you watching, once they have shown they actually can.
Every milestone you sign is an entrustment decision. Can I trust my research fellow to run the analysis alone. To answer reviewers without me rewriting every line. To design the next protocol. The manuscript, the abstract, the analysis script were never the point. They were the evidence you used to decide.
AI degrades that evidence. When a professor at NYU started cold-calling students about their own polished submissions, many could not explain basic choices after two follow-up questions. The work was excellent. The ownership was not.
Your trainees’ artifacts now carry the same ambiguity. So your question has to change. Not “did they use AI,” which you mostly cannot answer in 2026 (AI detectors do not work – period. And are the wrong focus). Instead, “can I entrust this person with this skill,” which you can still answer, if you change what you put in front of them.
Dr. Philippa Hardman, frames the shift as: AI did not break assessment. It broke the economics of bad assessment. The methods that were too expensive to run at scale are suddenly cheap, and most of them are better. Four of them belong in your lab.

2. The oral defense, now affordable at every milestone
The oral defense is the oldest assessment you have and the most honest. Nobody can outsource a follow-up question they did not see coming. You reserved it for the dissertation and the occasional qualifying exam because your time is the scarcest resource in the program.
AI changes that math. Two instructors ran structured AI oral exams for about forty-five cents per student, using a voice agent to question students on their own work and a panel of models to grade the transcripts. Seventy percent said it measured their understanding better than a written exam. Most also found it more stressful, which is the point. Defending your reasoning out loud is harder than producing a clean document, because it happens in real time and cannot be edited.
How to put this to work as the mentor:
- Pair every major written product with a short defense. 10 minutes. The trainee submits the manuscript or analysis, then answers your targeted questions about 2 or 3 specific decisions. Why this model. Why this comparator. What single result would have flipped the conclusion.
- Write your milestones as observable verbs. Defends, justifies, critiques, troubleshoots. Not understands, knows, is familiar with. If a competency can be shown by recalling something, AI can fake it. If it can only be shown by doing it or defending it live, AI cannot, and neither can your trainee without the underlying skill.
- Treat the defense as the assessment. The document becomes the thing the conversation is about, not the thing you grade. That single move re-orders your integrity question into one you can answer.
You do not need a voice agent to start. You need 10 minutes and 3 good questions per trainee.
3. Simulated study sections and a tireless Reviewer 2
The best research training has always used simulation. The mock study section. The practice job talk. The lab meeting where a senior colleague plays the hostile reviewer. The method works because it tests behavior under pressure, not recall. Your constraint was always supply. You can stage only so many mock panels, and the most useful adversaries are the busiest people in the building, usually you.
AI personas remove that ceiling, and the approach is now validated next door. In a two-site randomized trial, medical trainees who practiced against AI standardized patients scored higher on real faculty-graded exams, and nearly 3/4 said the simulated patient felt like a real one. The persona was specific, available on demand, and more consistent across learners than any human actor can be. The lesson generalizes cleanly. If the skill lives in a conversation, you can now build a behavioral assessment around that conversation, at scale, for the first time.

Port it to research training. Instead of a patient, build the counterpart your trainees actually have to survive:
- A skeptical Reviewer 2 who attacks the weakest assumption in the analysis.
- A program officer who asks why the aims matter and what happens if Aim 1 fails.
- A study-section member who probes the power calculation and the missing-data plan.
- A journal-club adversary who defends the very paper your trainee is trying to take apart.
Your design move is specificity. Not “a tough reviewer.” A reviewer with a known bias toward causal inference who distrusts observational claims. The sharper the persona, the more precisely you are assessing the behavior you care about.
Then grade the transcript, not the impression. Build a rubric of observable moves. Did they concede the valid critique and hold the line on the invalid one. Did they offer a sensitivity analysis when cornered. Did they name the limitation before the reviewer did. The rubric is the hard part, and it is where your real assessment lives. Spend your effort there.
4. Grade the process, not only the product
When the product no longer proves the thinking, the process becomes your evidence. AI can generate any artifact. It cannot yet fake the trail of how a specific person produced one.
Ask your trainees to submit that trail with the paper. Keep it to a single page:
- Which prompts did you use, and which outputs did you reject.
- Where did you push back on the model, and why.
- What did you verify yourself, and how.
- What would you do differently next time.
This is not a fringe idea, and you are not the first to ask for it. Universities are already formalizing it. The University of Sydney now splits assessment into secure in-person tasks and open tasks where AI is allowed, and on the open ones students must acknowledge how they used AI. Where programs require that kind of disclosure, the conversation moves from policing to coaching.
Two notes decide whether this works in your lab. Make the prompts specific, or you get mush. “Describe your AI use” produces a vague paragraph. “Describe one moment you disagreed with the model and what you did about it” produces something you can actually grade. And design the process memo as the more assessable half of the submission, because it often is. The reasoning about what to keep, the moments of pushback, the decision to verify a suspicious citation, all surface judgment the polished manuscript hides.
This is also where you catch the failure mode that ends careers before they start. A trainee who can account for their process can show you the verified DOI behind every citation. A trainee who cannot is the one who lets a fabricated reference reach a submission with your name two authors down. This risk is measured, not theoretical. When researchers audited reference lists from chatbots, even with GPT-4 about one in five of the references were fabricated outright, and many of the real ones carried the wrong volume, pages, or DOI.
5. Assessment that never stops
The first 3 shifts are better versions of the moments you already run. This one changes what counts as a moment at all.

Your assessment has always been event-based for a simple reason. Every data point cost you time. So you settled for endpoints. The committee meeting, the qualifying exam, the submitted paper. A fellow or postdoc quietly drowning in month two does not surface as a problem until the committee meeting in month seven, by which point the window to intervene has closed. You were measuring at the endpoint because watching continuously was unaffordable, not because the endpoint was the right place to look.
AI changes the economics of measurement too. Every time a trainee works through an analysis or a draft with an AI tool, that interaction is a potential signal. What they attempted, what they got right unaided, where the model had to carry them, how their reasoning moved across sessions. The endpoint stops being your only read.
Your move as the mentor is to pick one or two leading indicators you watch across the training, not just at the end. Choose something concrete. Can they spot the flaw in a methods section without prompting. Do they catch the fabricated citation. Does their unaided reasoning hold when you take the AI model away. Decide the threshold that triggers your intervention before you deploy any of this, so you are not staring at a stream of transcripts later wondering what to do with it.
The further edge of this, drawn from how some organizations are already rethinking workplace training, is that assessment stops being a thing you schedule and becomes a property of the work itself. For a PI, that means building the measurement into the AI tools your lab already uses, rather than bolting a separate exam onto the end.
One hard caveat holds the whole thing together. If your trainee leans on AI through every step, your continuous signal measures the AI model, not the person. The risk is documented. A 2025 study of 666 people found frequent AI use correlated with weaker critical thinking, mediated by cognitive offloading. A small MIT EEG study went further, finding that people who wrote with an LLM showed weaker brain connectivity and could barely quote the work they had just produced. So your continuous signal has to include unaided checkpoints, and you have to supervise the AI use itself. The goal is a trainee who directs the tool, not one the tool quietly carries. The capacity that matters has a name in the assessment literature, evaluative judgment, the ability to tell whether what the model produced is any good, and it is exactly what your checkpoints should test. The worst place for any trainee to sit is scattered, half-committed AI use, which produces weaker learning than disciplined use and weaker learning than none at all.
Move one assessment one rung
AI played two roles in your lab at once. It was the destabilizer, exposing how weak artifact-grading always was. It was also the enabler, making the better methods cheap for the first time. The same voice models that let your trainee generate a flawless draft also let you run an oral defense for the price of a coffee. The same technology that broke your take-home assessment handed you the tools to replace it.
So let’s stop pretending that our trainees are not using AI. Assume they are, and design accordingly. The mentors doing the most useful work have stopped treating AI as the adversary in the room and started using it as a collaborator in building the assessment itself. Use it to draft your rubric. Generate the Reviewer 2 persona. Stress-test your own design by simulating how a strong trainee, a struggling trainee, and a coasting trainee would each move through your assessment, then fix the gaps before a real trainee ever touches it.
You do not need to rebuild your program this quarter. Take one thing you assess. A manuscript draft, a journal-club presentation, a mock grant. Ask two questions:
What is the artifact, and what is the entrustment decision it is supposed to support.
If you are grading the artifact alone, you are using a method AI has already outdated. Pick one of the four moves above and apply it to that single assessment. Then do the next one.
Start with the 10-minute defense this week. It costs you almost nothing, and it tells you fast which of your trainees can defend the work that carries their name. Try it on the next manuscript draft that lands in your inbox, then tell me what you found. I am collecting what works across programs, and the patterns are getting clearer.
Top Papers on AI in research this week:
- General-purpose LLMs beat FDA-cleared clinical AI – A Nature Medicine study ran a head-to-head test on physicians’ real-world questions. GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6 beat two cleared tools, OpenEvidence and UpToDate Expert AI, across every benchmark. The results carry regulatory implications. The special purpose tools passed a clinical decision-support pathway where as the general purpose AI (that outperformed) never did, which exposes a validation gap the FDA has not closed.
- AIPatient Arena – This preprint evaluates LLMs on real EHR data across full clinical consultations, not isolated quiz items. Models scored well on interview technique, ethics, and clear explanations. They faltered where it counts: diagnostic reasoning, ambiguous patient replies, and complete information coverage. Richer context improved reasoning but yielded limited gains in treatment planning.
- LLM-as-an-Investigator – The authors name a quiet failure mode they call user-driven sycophancy. A model reinforces the user’s hunch instead of testing competing explanations. Their fix is an evidence-first agentic method that gathers data before proposing answers. It is a sharp reminder for anyone leaning on chatbots during research or clinical problem-solving.
Top Papers on AI in education this week:
- Learning to Prompt: Adaptive High-School Tutoring – Static tutoring prompts rarely fit every subject. This team built a router that picks prompts using 14 pedagogical features extracted from raw transcripts. They trained it in simulation, then deployed it for online adaptation with actual high-school students. An A/B test with 359 students showed the model switching from analytical to scaffolding strategies.
- Do LLM Tutors Teach or Just Solve? – A stronger problem-solver is not automatically a better teacher. Using public MathTutorBench results, the authors found solving and pedagogy scores correlate at only 0.421 across eight models, with several shifting rank when evaluation moves from solving to pedagogy. They argue benchmarks should report solving-oriented and pedagogy-oriented scores separately.
- Spotlight on AI in Education, June 2026 – Sussex publishes a monthly read on how universities actually fold AI into teaching. The headline reads as University of Surrey will embed AI in discipline-specific ways in every degree from September 2026. The bulletin also weighs a Gallup finding that Gen Z worries AI use risks replacing human skills rather than enhancing them.
P.S. Research Boost is now live. You can use it as an academic writing partner for writing your next manuscript or grant. Try it FREE: http://researchboost.com/
