
While building an agentic research system on top of my database of Medicaid audits, I started to worry that I was rebuilding, in a narrower and weaker form, capabilities the model and its tooling already provided. Every Python function I added to validate data before a handoff, every branch to handle a new kind of error, felt like it was constraining what the model could do rather than enabling it.
The most constraining part of the system was the agent harness, the layer around the model that makes the agent able to perform real work over time. Every agent has some kind of harness, and it can determine what step comes next in the workflow, retrieve the information the model needs for each step, enforce output requirements, save artifacts throughout the workflow, retry failed steps, maintain state across longer or paused workflows, and apply guardrails around what the model is allowed to do.
The key idea is that the harness is not there to hold intelligence itself. It is the framework and structure that helps model intelligence operate reliably over time and produce results and audit trails that humans can inspect and trust.
So I wanted to test a different approach: instead of writing code to orchestrate each step, what happens if the harness is mostly just instructions, specification files, and a workspace that a coding model like Codex can work inside?
A baseline test for a minimal harness: Data reliability
I picked a workflow familiar from nearly every audit I've worked on.
When an audit has an objective like "determine whether managed care payments went to eligible members," one of the first steps is pulling a dataset of those payments and checking that the key fields are complete, consistent, and aligned with the team's expectations. Only then can a dataset support a real conclusion that will be published in an audit report.
For the test, I wanted an agent that could run this check on any dataset for any project, a Medicaid audit or criminal justice audit, with no changes. Throw a dataset at it, tell it which fields matter, and get a report on each field that I can quickly review and evaluate. Crucially, the agent only makes the judgments about the data that I ask it to make. It profiles and reports. The auditor then compares the data profile against expectations to decide whether the data is fit for the audit's objective and scope.
Why this is a good candidate for a minimal agent
This data reliability process is highly structured, follows predictable steps, and produces a standard deliverable every run. The auditor's judgment is in knowing what the results mean and what the next steps should be, not in the mechanical act of running the checks. That makes it, to me, an ideal candidate for a text-instruction agent. The methodology can be written down. The output format can be described in plain language.
Building the agent with a minimal harness
My goal was to keep the task as close to the model as possible and let the model decide which tools to use and when to revise, rather than programming every agent decision point in code. I also wanted to set up the agent so that it worked well in both the Codex command-line tool and the Codex app. Both allow an AI coding agent to work inside a folder-based workspace.
To be precise about what "minimal harness" means here: I'm not eliminating orchestration, I'm delegating it. Codex already handles tool selection, sequencing, and retry logic inside its environment. If my workflow fits inside what Codex natively does, I don't need to build a second orchestration layer on top of the one that's already there. What I have to provide the model is the judgment layer: what counts as an acceptable result, what analysis to run, and what the final deliverable should look like.
The resulting harness had exactly zero lines of orchestration code. It consisted of a 100-line instruction document with a few schema examples to show the model what the output should look like and what analysis to run for the key fields.
The filesystem + model method boiled down to four core files:
AGENTS.md- the instruction file that defines how the run should work.artifact_contracts.md- the required output for each run.profiling_rules.md- the dataset-level and field-level checks to run.report_spec.md- the format and required information for the final report.
It's just text in those files, no orchestration code at all. The instructions are detailed and specific about the way I want the model to analyze the key fields in the data.
Here's the code: data-reliability-agent
AGENTS.md - The conductor
When Codex finds an AGENTS.md file in a folder, it treats that file as the overarching instructions for any work done there. That makes AGENTS.md the conductor. It tells the model what the folder is for and how work gets done inside it.
Having the AGENTS.md file as the conductor in this project means that the Codex environment itself handles tool selection, sequencing, and retry logic. I don't have to code any of that. What I do have to provide is the judgment layer: what counts as an acceptable result, what analysis to run, what the final deliverable should look like.
When deciding what belongs in AGENTS.md versus a skill or reference file, I apply a simple test to each instruction: "Must the agent remember this for every task?"
Here are the questions I kept asking as I created the AGENTS.md file:
- What is this folder or project for?
- What is the required order of operations?
- What must always be produced?
- What rules must always be followed?
- What must never be done?
- What standard should the output meet?
Here's an excerpt from the file:
AGENTS.md
## Purpose
This repository is a lightweight Codex workspace for dataset-specific reliability analysis.
Each run must produce inspectable generated code and explicit artifacts so results are easy to audit, review, and revise.
## Operating Model
Treat each dataset as a run-specific task.
For each new analysis run, the agent must:
- inspect the dataset and key-field config
- generate run-specific analysis code
- save generated code in the run folder
- save run artifacts for review
[...]
## Definition of Done
A run is complete only when it contains required artifacts defined in `schemas/artifact_contracts.md`, including:
- `work/generated_analysis.py`
- `outputs/field_results.csv`
- `outputs/dataset_summary.json`
- `outputs/reliability_report.md`
Reference Files
Each reference file answers a different question the model will face during the workflow:
artifact_contracts.mdanswers: what must be produced? It defines the required output files for every run. If the model finishes a run and hasn't produced everything listed in this file, the work isn't done.
profiling_rules.mdanswers: how should the data be analyzed? It defines the specific checks to run at the dataset level, such as row counts, duplicate detection, and date range validation, and at the field level, such as type checks, blanks, duplicates, and min/max values. This is where the audit methodology lives. When I want the agent to check for negative values in payment fields or flag dates outside the audit scope, those instructions go here.
report_spec.mdanswers: how should the results be presented? It defines the structure, formatting, and required narrative elements of the final report: section order, table layouts, how to describe findings, and what context to include.
Keeping these separate means the model only needs to load the reference file that's relevant to whatever stage it's currently working on, which keeps the context focused.
Example of results
I tried a mix of custom datasets and open datasets downloaded from Kaggle.
Example #1
I edited a CSV file from a previous project that showed salary expense and allocation amounts from different areas of Louisiana, adding missing values, mismatched data types like text in date fields, and outlier values. When I ran it through the Codex app, the agent returned a report that identified each issue in the table of results and included a discussion of what didn't look right.
It correctly identified the year_end column as a date-formatted column with one record containing a text value. It also correctly flagged the salary column as numeric even though one record contained text data. One surprising note was that the retirement_expense column was called out for a negative value.

Example #2
I found an open dataset on Kaggle showing monetary performance for music tours and ran the workflow again. It produced a solid report that called out missing values in key fields and flagged a note to review whether duplicate values in the Rank and Peak columns are appropriate for this dataset.

Example #3
For a larger dataset, I found an Audible books dataset on Kaggle that included text in the price field for more than 300 of the 87,489 records. The report identified the same records I had spotted manually when I reviewed the data myself.

Example #4
For the fourth run, I used another dataset that I knew contained a few duplicate rows and missing values, both of which were correctly identified by the final report.
Here are the final findings from that run.

And here's the dataset summary table the agent produced.

Verdict
The experiment convinced me that many agent workflows might need less custom orchestration code than I first assumed. At minimum, I think this kind of text-based setup should be the first step before building a more complex agent.
For a structured task like data reliability, a thin harness worked surprisingly well. The important ingredients were defined context, explicit instructions, durable artifacts, and a filesystem the model could read from and write to as it worked. During the research-agent project, every time I added a Python function to validate data before a task handoff or another method for handling a new kind of error, it felt like the best version of the agent was slipping further and further away.
The most useful feedback loop was also the simplest one. If something was missing or implemented incorrectly, all I had to do was adjust a sentence or two. I would tighten an instruction, clarify a rule, or add to the spec file, then run it again. I was surprised by how consistently that worked.
Iterating on the agent output felt more like training a new team member. I would review the work, identify what was missing, make the instructions clearer or more precise, and the next run would improve.
For workflows that are structured, inspectable, and artifact-driven, that may be the thinnest useful harness: a small, explicit, text-based one. For more complex workflows, starting with this approach may help you identify what can be left to the model and what actually needs to be orchestrated in code.