How to Safely Unit Test Shell Scripts from LLMs

So, you just got a shiny new shell script from ChatGPT (or Copilot, or your favorite AI buddy). It looks legit. It even feels right. But then that creeping doubt sets in:

"Wait… is this thing safe to run on production?"

Welcome to the world of unit testing shell scripts generated by LLMs — where the stakes are high, sudo is dangerous, and one wrong rm -rf can ruin your whole day.

In this post, we'll walk through a battle-tested way to safely test and validate scripts that manage real services like PM2, Docker, Nginx, or anything that touches system state.

The Problem With Trusting LLM Shell Scripts

Large Language Models like ChatGPT are awesome at generating quick shell scripts. But even the best LLM:

  • Can make assumptions about your environment
  • Might use the wrong binary name (like pgrep -x PM2 instead of pm2)
  • Can forget that systemctl restart docker is disruptive, bouncing the daemon and (by default) every running container with it

Even if the logic is 90% correct, that 10% can:

  • Restart your services at the wrong time
  • Write to incorrect log paths
  • Break idempotency (runs that shouldn't change state do)

According to a recent study on AI-generated code, about 15% of LLM-generated shell scripts contain potentially dangerous commands when run in production environments.
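
To see why the wrong-binary bullet above bites, compare the check an LLM typically emits with one closer to how the PM2 daemon actually appears in a process list. A quick sketch; exact process names vary by PM2 version, so treat this as an illustration, not gospel:

# Plausible-looking, but -x demands an exact process-name match,
# and the PM2 daemon isn't literally named "PM2":
pgrep -x PM2 && echo "PM2 is up"

# Closer to reality: match against the full command line instead
pgrep -f 'PM2' > /dev/null && echo "PM2 is up"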

Strategy 1: Add a --dry-run Mode

Every LLM-generated script should support a --dry-run flag. This lets you preview what the script would do — without actually doing it.

Here's how you add it:

DRY_RUN=false
[[ "$1" == "--dry-run" ]] && DRY_RUN=true

log_action() {
  echo "$(date): $1"        # always log what we're about to do
  if $DRY_RUN; then
    echo "[DRY RUN] $1"     # preview only
  else
    eval "$1"               # actually execute
  fi
}

# Example usage
log_action "sudo systemctl restart nginx"

This pattern gives you traceable, previewable operations: every action is logged first, and nothing executes until you drop the flag.
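
For example, assuming the script above is saved as myscript.sh (a made-up name), a dry run prints the plan without touching nginx:

$ ./myscript.sh --dry-run
Mon May  5 12:00:00 UTC 2025: sudo systemctl restart nginx
[DRY RUN] sudo systemctl restart nginx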

Strategy 2: Mock External Commands

You don't want docker restart or pm2 resurrect running during testing. You can override them like this:

mkdir -p mock-bin
printf '#!/bin/bash\necho "[MOCK] $0 $@"\n' > mock-bin/docker
chmod +x mock-bin/docker
export PATH="$(pwd)/mock-bin:$PATH"   # mocks now shadow the real binaries

Now, any call to docker will echo a harmless line instead of nuking your containers. Because the mock prints $0, you can simply symlink other dangerous binaries like systemctl, pm2, and rm to the same script as needed.

This PATH-override technique is a staple of the Bash Automated Testing System (BATS) ecosystem, where mocks like these are used extensively.
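
If you also want to assert on what the script did, have the mock record every invocation to a call log. A minimal sketch, where MOCK_LOG is a variable invented for this example:

export MOCK_LOG="$(pwd)/mock-calls.log"

cat > mock-bin/docker <<'EOF'
#!/bin/bash
# Record the call for later assertions, then echo instead of acting
echo "docker $@" >> "${MOCK_LOG:?MOCK_LOG must be set}"
echo "[MOCK] docker $@"
EOF
chmod +x mock-bin/docker

# After running the script under test:
grep -q "docker restart my-app" "$MOCK_LOG" && echo "PASS: docker restart was invoked"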

Strategy 3: Use shellcheck

LLMs sometimes mess up quoting, variables, or command usage. ShellCheck is your best friend.

Just run:

shellcheck myscript.sh

And it'll tell you:

  • If variables are unquoted ("$var" vs $var)
  • If commands are used incorrectly
  • If your if conditions are malformed

It's like a linter, but for your shell's sanity.
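
As a taste of what it catches, feed it this contrived two-liner:

# cleanup.sh -- contrived example of a common LLM slip
target=$1
rm -rf $target/logs

ShellCheck flags the unquoted $target with SC2086 ("Double quote to prevent globbing and word splitting"), which is exactly the kind of bug that turns rm -rf into a disaster when $target contains a space or comes in empty.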

Strategy 4: Use Functions, Not One Big Blob

Break your script into testable chunks:

check_pm2() {
  ps aux | grep '[P]M2' > /dev/null   # [P] trick keeps grep from matching itself
}

restart_all() {
  pm2 resurrect
  docker restart my-app
  systemctl restart nginx
}

Now you can mock and call these functions directly in a test harness without running the whole script. This modular approach mirrors modern software testing principles.
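
Here's a minimal harness sketch. It assumes the functions live in a hypothetical watchdog.sh that guards its entry point (for example with [[ "${BASH_SOURCE[0]}" == "$0" ]] && main "$@"), so sourcing it defines the functions without running anything:

# test_harness.sh -- minimal sketch; watchdog.sh is a made-up filename
export PATH="$(pwd)/mock-bin:$PATH"   # mocks from Strategy 2
source ./watchdog.sh

if restart_all; then
  echo "PASS: restart_all ran cleanly against the mocks"
else
  echo "FAIL: restart_all exited non-zero"
fi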

Strategy 5: Log Everything. Seriously.

Log every decision point. Why? Because "works on my machine" isn't helpful when the container didn't restart or PM2 silently failed.

log() {
  echo "$(date '+%F %T') [LOG] $1" >> /var/log/pm2_watchdog.log
}

Strategy 6: Test in a Sandbox

If you've got access to Docker or a VM, spin up a replica and try running the script in that environment. Better to break a fake server than your actual one.

Try:

docker run --rm -it ubuntu:20.04
# Inside the container, install what the script expects:
# nginx via apt, Node.js plus pm2 via npm, etc.
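
To go a step further, mount the script read-only so the container can run it but never modify your copy (myscript.sh is again a stand-in name):

docker run --rm -it -v "$(pwd)/myscript.sh:/myscript.sh:ro" ubuntu:20.04 bash
# inside the container:
apt-get update && apt-get install -y nginx nodejs npm
npm install -g pm2
bash /myscript.sh --dry-run   # rehearse first, then run for real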

Bonus: Tools You Might Love

  • BATS: Bash unit testing framework
  • shunit2: xUnit-style testing for POSIX shell
  • assert.sh: dead-simple shell assertion helper
  • shellspec: full-featured, RSpec-like shell test framework
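
To make the first one concrete, here's roughly what a tiny BATS test could look like, assuming the hypothetical watchdog.sh and mock-bin/ from the earlier strategies sit next to the test file:

#!/usr/bin/env bats
# watchdog.bats -- minimal sketch

setup() {
  export PATH="$BATS_TEST_DIRNAME/mock-bin:$PATH"   # mocks for docker, pm2, systemctl
  source "$BATS_TEST_DIRNAME/watchdog.sh"
}

@test "restart_all hits the mocks, not real services" {
  run restart_all
  [ "$status" -eq 0 ]
  [[ "$output" == *"[MOCK]"* ]]
}

Run it with bats watchdog.bats and you get pass/fail output per test.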

Final Thoughts: Don't Just Run It — Test It

It's tempting to copy-paste that LLM-generated shell script and run it. But in production environments, especially ones with critical services like PM2 and Nginx, the safer path is to test before you trust.

Use dry-run flags. Mock your commands. Run scripts through shellcheck. Add logging. Test in Docker. Break things in safe places.

With these strategies, you can confidently validate AI-generated shell scripts and ensure they behave as expected before hitting your production servers.

Nife, a hybrid cloud platform, offers a seamless solution for deploying and managing applications across edge, cloud, and on-premise infrastructure. If you're validating shell scripts that deploy services via Docker, PM2, or Kubernetes, it's worth exploring how Nife can simplify and secure that pipeline.

Its containerized app deployment capabilities allow you to manage complex infrastructure with minimal configuration. Moreover, through features like OIKOS Deployments, you gain automation, rollback support, and a centralized view of distributed app lifecycles — all crucial for testing and observability.