Claude Code for Scientists
Abundant code without the sharp edges
Claude Code hit a cultural moment over the holidays; “what did you vibecode over the break” became a meme. I’ve been using AI code tools since 2022, back when Codex was the only game in town. In 2024, I had a brief viral moment building ismy.blue in Cursor, a color perception test that landed me on the pages of The Guardian and WSJ. I’m not very proficient in HTML/JS, so AI-assisted coding extended my reach. But things are different now: the models are quite a bit better, and they can run autonomously for much longer stretches.
The tools write a ton of code very rapidly; but that code is generated faster than it can be verified, can contain subtle bugs that slip past inspection, and accumulates layers of cruft and complexity that make it unmaintainable. With great power comes great responsibility. In this post, I cover how scientists who already use Claude Code, or are considering jumping on the bandwagon, can develop better habits to write code that works well, is verifiable, and is pleasant to write.
A word of warning: this guide assumes you have enough experience coding and evaluating research to have good metacognition. By that I mean the sense of knowing when you’re on thin ice, which usually comes from having made mistakes in the past and knowing when they are likely to recur. Vibecoding without that metacognition is, IMO, quite dangerous.
Generic advice
Boris Cherny, the creator of Claude Code, has a thread on how he uses Claude Code. His tips are good jumping-off points:
Use plan mode. Claude Code has a plan mode; instead of immediately trying to write code, it reflects on a plan to accomplish the task in question, often asking clarifying questions. Use Shift+Tab to iterate on a plan before execution. Get into the habit of running a Plan-Execute-Evaluate loop.
Use voice to specify instructions. Models work better when they have extensive instructions. Plan mode pairs well with voice; monologue about what you are trying to accomplish. You can even pair-instruct: have a pilot specify their requirements and an interviewer actively guide them when instructions are unclear. Use a transcription service rather than the bad default Apple one; I use Wispr.
Don’t just run autonomously. Claude can go down unproductive rabbit holes: burning through tokens, wasting your time, writing trash code, implementing bad plans. That becomes negative productivity. Especially at the start of a task, use Esc with abandon to stop Claude and give some mid-conversation nudging toward the solution. Paired with plan mode, it eliminates a lot of back and forth.
Manage your task switching judiciously. Boris runs 5-10 Claude Code windows in parallel, which eliminates downtime while models generate code. I find that excessive for science work. When you’re working at the very edge of human knowledge, narrow your focus. It’s ok to babysit one Claude Code instance for a while. Write your backlog, and when you feel you have things figured out, then run multiple things in parallel if it suits you.
Test-driven development matters more than ever. Russ Poldrack has been writing about this in his Better Code, Better Science series. When the AI writes the code, tests are how you verify it’s doing what you think (a minimal sketch follows these tips).
Use git + GitHub. Learn about branches. When Claude goes off the rails, you need to be able to roll back cleanly. When refactoring and deleting dead code, delete with impunity, knowing you have infinite backups.
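To make the test-driven-development point concrete, here is a minimal pytest sketch; the module, function name, and expected behavior are hypothetical stand-ins for whatever you ask Claude to implement:

```python
# test_preprocess.py -- written *before* asking Claude to implement
# (or refactor) a hypothetical normalize_rates() in src/preprocess.py.
import numpy as np

from src.preprocess import normalize_rates


def test_normalize_rates_is_unit_scaled():
    rates = np.array([0.0, 5.0, 10.0])
    out = normalize_rates(rates)
    assert out.shape == rates.shape
    assert out.min() == 0.0
    assert out.max() == 1.0
```

Run pytest after every Claude session; if a refactor silently changes behavior, the failing test tells you before a downstream figure does.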
Scientific data analysis
How should you manage a classic scientific workflow, one in which you loop between data acquisition → data processing → analysis → visualization? A lot of the advice I wrote in goodresearch.dev still holds.
I presented a framework for managing code complexity in the Good Research Code Guide: minimize the load the code places on short-term memory (what was I doing just now?) and long-term memory (what did I do six months ago?). Now the concern is different: how can I onboard rapidly into code generated by the AI, so I can validate it, change it, and reproduce it? Here are a few tips:
Separate data processing from visualization. Use pure Python/R for data processing, and keep visualization code separate (notebooks or custom scripts). Don’t let the AI write a ton of code that mixes data processing and graphing: that code is brittle, and the two concerns live on different timescales (you iterate on plots constantly; you shouldn’t have to iterate on data processing constantly).
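As a rough sketch of what the separation looks like (the file, function, and column names are hypothetical), processing writes artifacts to disk and plotting only reads them:

```python
# src/preprocess.py -- pure data processing, no plotting.
import pandas as pd

def clean_sessions(raw_path: str, out_path: str) -> None:
    df = pd.read_csv(raw_path)
    df = df.dropna(subset=["subject_id"])
    df.to_parquet(out_path)


# scripts/plot_sessions.py -- visualization only, reads the processed file.
import matplotlib.pyplot as plt
import pandas as pd

def plot_session_counts(processed_path: str, fig_path: str) -> None:
    df = pd.read_parquet(processed_path)
    df.groupby("subject_id").size().plot(kind="bar")
    plt.tight_layout()
    plt.savefig(fig_path)
```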
Give Claude rails with folder structure. My true neutral cookiecutter is a decent starting point, and an easy way to separate data processing and visualization:
Put raw data under data/raw
Put processed data under data/processed
Put package code (for data preprocessing) under src
Put notebooks under notebooks
I’d add one thing to this list: a data/generated/ folder for generated data, that is, data that was generated by an AI.
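If you want Claude to create (or respect) this layout, a tiny scaffold script is enough; the folder list below simply mirrors the structure above:

```python
# scaffold.py -- create the project skeleton described above.
from pathlib import Path

for folder in ["data/raw", "data/processed", "data/generated",
               "src", "notebooks"]:
    Path(folder).mkdir(parents=True, exist_ok=True)
```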
Use an orchestrator. Your codebase will grow much faster than when you write things by hand. Pretty soon it will be hard to tell 1) where data came from, 2) which data is stale, and 3) how certain scripts are supposed to be invoked. That’s why you need the equivalent of make, a tool that specifies a computational DAG (directed acyclic graph), tracks inputs and outputs, and allows you to run your whole pipeline. I find vanilla Makefiles hard to read, so I’m using Snakemake, which is popular in bioinformatics. Airflow is another option if you want something more heavyweight. At the very least, use a single shell script that runs the pipeline and the visualization in order.
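To give a flavor, here is a minimal Snakefile sketch (Snakemake rules are written in a Python-based DSL); the data files and scripts are hypothetical placeholders:

```
# Snakefile -- a toy two-step pipeline: preprocess, then plot.
rule all:
    input:
        "figures/session_counts.png"

rule preprocess:
    input:
        "data/raw/sessions.csv"
    output:
        "data/processed/sessions.parquet"
    shell:
        "python -m src.preprocess {input} {output}"

rule plot:
    input:
        "data/processed/sessions.parquet"
    output:
        "figures/session_counts.png"
    shell:
        "python scripts/plot_sessions.py {input} {output}"
```

Running snakemake rebuilds only what is stale, and snakemake -n shows a dry run of what would be regenerated, which answers the “where did this file come from?” question for free.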
Use a package manager and specify it in CLAUDE.md. These days, I gravitate towards mamba for system-wide packages and uv virtual environments for Python. Docker remains relevant for more heavyweight management or for cloud deployment. The key is making dependency management explicit and automatic.
Clean up your code. Use ruff. Be aggressive about culling dead code (git means you don’t lose anything). Name things well, and rename them when their function changes. Ask Claude to identify dead code. Tell it to refactor. Have it write tests beforehand so it knows whether it did the refactoring correctly.
These tips apply primarily to new code. In addition, Claude can be used to read, understand, and improve existing repositories that don’t follow these best practices. Indeed, one good test of its capabilities is to upgrade an existing project: writing READMEs and putting some order into poorly documented repositories, e.g. ones with dead code, hardcoded paths, and unknown dependencies.
Make lots of cheap visualizations
It is very easy to generate lots of code that reads data incorrectly, or processes it incorrectly, and comes up with wrong conclusions. This is especially likely if the data format is poorly documented or if you don’t have experience with this particular type of data.
Do you remember, as a junior grad student, going to your PI with a bunch of printed-out plots, only to be told 30 seconds in “these plots don’t make any sense, you’re misreading the data”? One clear skill (some would say a curse) that you picked up during your scientific training is the ability to call bullshit. So have the AI generate lots of plots. Individually low-utility plots collectively help you convince yourself that the data is correct. Turn these hunches into formal tests as your analysis matures.
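One way to graduate a hunch into a formal check, sketched with hypothetical file names, columns, and ranges:

```python
# test_data_sanity.py -- eyeballed plausibility checks made explicit.
import pandas as pd


def test_processed_sessions_look_plausible():
    df = pd.read_parquet("data/processed/sessions.parquet")
    assert not df.empty
    assert df["subject_id"].notna().all()
    # The range a "this plot looks weird" moment taught us to enforce.
    assert df["reaction_time_ms"].between(50, 5000).all()
```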
Jupyter notebooks don’t play well with Claude. That’s because 1) plots are embedded as base64 in the JSON file underlying the notebook, which eats up large chunks of the context window; and 2) notebooks are stateful, and Claude doesn’t know the state of your kernel. There are several potential solutions:
Don’t use notebooks: have the AI write scripts that generate plots and dump PNGs into a folder (see the sketch after this list).
Use jupyter notebooks cheekily: every time the notebook is changed, close the notebook, restart the kernel, and re-run from scratch. You can even ask Claude to do that for you.
Use text-based notebooks: jupytext (Python), quarto (Python & R), Rmarkdown (R), or marimo (Python). Because they are text-based, they fully solve the issue with Jupyter’s heavyweight JSON format. They alleviate the statefulness issue by making it easy to re-run everything from the top.
marimo, in particular, fully solves the statefulness issue by imposing the rather minimal restriction that the code be organized as a well-defined DAG (similar to Observable). I’ve been trialing it for a few weeks and it is now my preferred solution. Because it is quite a recent framework, and the models haven’t fully picked up on its capabilities, you will need to add this text to CLAUDE.md for best results.
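For the first option above, the pattern is just a plotting script that writes image files rather than rendering inline; a minimal sketch with hypothetical paths and columns:

```python
# scripts/make_figures.py -- regenerate all diagnostic plots as PNGs.
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # headless backend: write files, never open a window
import matplotlib.pyplot as plt
import pandas as pd

FIG_DIR = Path("figures")
FIG_DIR.mkdir(exist_ok=True)

df = pd.read_parquet("data/processed/sessions.parquet")
for column in ["reaction_time_ms", "trial_count"]:
    fig, ax = plt.subplots()
    df[column].hist(ax=ax, bins=50)
    ax.set_title(column)
    fig.savefig(FIG_DIR / f"{column}.png", dpi=150)
    plt.close(fig)
```

Claude can then open the PNGs directly instead of dragging base64 blobs through its context window.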
Beyond static visualizations, the tools make it far easier to create interactive plots. You don’t have to limit yourself to matplotlib anymore; plotly, streamlit, leaflet and webgl are all at your fingertips.
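For example, a small plotly sketch (using plotly’s bundled iris sample data) that writes a self-contained interactive HTML file you can open in any browser:

```python
# interactive_scatter.py -- an interactive alternative to a static PNG.
import plotly.express as px

df = px.data.iris()  # bundled sample dataset; swap in your own dataframe
fig = px.scatter(df, x="sepal_width", y="sepal_length",
                 color="species", hover_data=["petal_length"])
fig.write_html("iris_scatter.html")
```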
Odds and ends
Manage your context judiciously. Use /clear and /compact with abandon. Have the AI write handoff documents and write to CLAUDE.md. Have it remember the solutions to difficult bugs so they don’t get reintroduced.
Use it a lot. Maximize your learning through sheer volume of use; for fun, you can try to max out your plan. Usage begets skill. Check out /usage to see where you are.
Write down what Claude Code can’t do. Keep a Notion doc or a Markdown file to write down the sharp edges of the models. Make benchmarks out of difficult problems. This will allow you to keep track of AI development and notice when things move from impossible to possible.
Configure alerts. Set alerts to know when Claude needs your permission or input. On Mac, use iTerm2 for maximum flexibility in notifications, and set them via /config.
Why Claude Code matters
For scientists, Claude Code (and other related agentic coding tools) 1) makes it possible to remain productive and somewhat in the weeds as a senior researcher or PI, and 2) makes it much more feasible to be a solo researcher. It lowers the cost of both exploration and exploitation. Chasing down a promising lead, building a quick analysis pipeline, generating diagnostic plots… it makes the cost of satisfying one’s curiosity far lower.
That’s a big deal. But it also means you can produce wrong results faster than ever before. The fundamentals, including validation, reproducibility, knowing where your data comes from, and knowing when you’re out of your depth, are the same as when you write code by hand. If anything, they matter more now.
In my opinion, using the tools proficiently requires having gone through the hard work of writing your own code by yourself, failing repeatedly, and picking yourself back up; I don’t know of another way of getting to that level of metacognition. How do you develop it when the AI is making the mistakes for you, invisibly? This is likely to be a challenge for junior researchers as the tools get commoditized. I don’t have a good solution yet; if you have tips for how, as a senior PI or research lead, you can help novices develop good metacognition about code, please comment.



Thanks for this article, very interesting!
A major takeaway I've kept with me since the good research code handbook (despite a complete overhaul of the way we code since then) has been the philosophy of externalizing as much as possible with good structure, so there is less decision fatigue and you can devote as much of your working memory as possible to the task at hand.
I've been working with a different set of tools, but brushing up against some fairly similar challenges and issues. I use VScode mostly via Copilot agents (ChatGPT usually), but much of the structure is similar. My codebase was getting out of hand, so I found it very useful to codify the rules for what kind of functions should go into which submodules into a set of agent instructions, and to have specific agents for certain kinds of tasks (core, exploratory, hygiene), etc. I've found it useful to keep a document with all of the function docstrings, so every time an agent is prompted to do something, it first follows a set of steps to check the existing codebase for reusable code, decides based on our rules where a new function should go, and automatically updates the code index each time a change is made to the core codebase. I finally have my working memory back.
I'm curious to hear your thoughts on 'chain of prompts', logging user prompts to be able to reconstruct the thought process that went into whatever was generated. Ultimately that's what a scientist is being trained to do, ask the right questions. If I were a PI and my students were using LLMs for scientific coding, I would want to be able to check that - as if it were a lab book. It would also incentivize students to prompt as clearly and unambiguously as possible - a net positive for good outcomes, and good thinking.
Using text-based notebooks was a great idea! Context matters. But have you ever been worried about "taste" in some sense? I align with your metacognition concerns as good safety rails, but more and more, scientific Claude code inserts design decisions into your code, which looks fine to some extent, but brings with it "code smell" and a very synthetic taste. Considering that code quality in some sense feeds into downstream inference during metacognition, do you propose some methods to tackle taste? How do we protect taste with increasing usage? Senior folks develop their research/code taste over time, but what about the junior population: do you foresee some serious homogeneity (which may not always be great) in code infrastructure and critical thinking in scientific workspaces?