Ansible Automation Platform Jobs Stuck in Pending: Root Cause and Fix
Today I ran into one of those issues that looks complex, but turns out to be beautifully simple.
Every job I launched in Ansible Automation Platform (Tower / Controller 4.5) just sat there.
No output. No errors. No movement. Just PENDING.
And sometimes, silence is the loudest signal.
What I Saw
Everything looked healthy on the surface:
- Job templates launched successfully
- Jobs stayed in PENDING forever
- No logs, no failures, no hints
- Execution Environments looked perfectly fine
At first, I suspected Ansible itself, playbooks, environments, something deep.
But this didn’t feel like a playbook problem.
This felt like something wasn’t even starting.
The Turning Point
When jobs don’t start at all, it’s usually not Ansible… it’s scheduling.
So I went one level lower: to the services.
That’s when I checked Receptor, the quiet engine behind job execution in AAP 4.x.
And there it was:
systemctl status receptor

● receptor.service - Receptor
   Loaded: loaded (/usr/lib/systemd/system/receptor.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/receptor.service.d
           └─override.conf
   Active: failed (Result: exit-code) since Tue 2026-03-31 12:49:06 CDT; 4s ago
  Process: 946196 ExecStart=/usr/bin/receptor -c /etc/receptor/receptor.conf (code=exited, status=1/FAILURE)
 Main PID: 946196 (code=exited, status=1/FAILURE)

Mar 31 12:49:06 xxxx systemd[1]: receptor.service: Service RestartSec=100ms expired, scheduling restart.
Mar 31 12:49:06 xxxx systemd[1]: receptor.service: Scheduled restart job, restart counter is at 5.
Mar 31 12:49:06 xxxx systemd[1]: Stopped Receptor.
Mar 31 12:49:06 xxxx systemd[1]: receptor.service: Start request repeated too quickly.
Mar 31 12:49:06 xxxx systemd[1]: receptor.service: Failed with result 'exit-code'.
Mar 31 12:49:06 xxxx systemd[1]: Failed to start Receptor.

Now we were getting somewhere.
The Real Error
Running Receptor manually, with the same command systemd uses (/usr/bin/receptor -c /etc/receptor/receptor.conf), revealed the truth:
error opening Unix socket: could not acquire lock on socket file: no such file or directory
That one line explained everything.
What Actually Broke
Receptor relies on a Unix socket located here:
/var/run/receptor
Here’s the subtle part:
- /var/run (linked to /run) is temporary: it gets cleared on reboot or system cleanup
- The directory /var/run/receptor was simply… gone
And Receptor?
It doesn’t recreate it.
So it failed silently.
And when Receptor is down:
- No execution capacity
- No scheduling
- Jobs stay in PENDING forever
No errors in UI, because nothing even reached that layer.
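Since nothing in the UI surfaces this, a quick shell check for the missing runtime directory is worth keeping around. This is my own little helper sketch, not an AAP or Receptor tool; the path is the one from above.

```shell
#!/bin/sh
# check_rundir: report whether a runtime directory a service depends on
# still exists. (My own helper, not part of AAP or Receptor.)
check_rundir() {
  dir="$1"
  if [ -d "$dir" ]; then
    echo "present: $dir"
  else
    echo "MISSING: $dir"
    return 1
  fi
}

# On the broken controller, this is the check that would have said it all:
check_rundir /var/run/receptor || true
```

On a healthy node it prints "present: /var/run/receptor"; on mine it reported the directory missing.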
The Fix
Sometimes the fix feels almost too simple.
I recreated the directory:
mkdir -p /var/run/receptor
chown receptor:receptor /var/run/receptor
chmod 755 /var/run/receptor
Then restarted services:
systemctl start receptor
automation-controller-service restart
And just like that—
PENDING → RUNNING
The system came back to life.
Making It Permanent
Because /var/run is temporary, this would happen again after reboot.
So I made it persistent using systemd:
Create:
/etc/tmpfiles.d/receptor.conf
Add:
d /var/run/receptor 0755 receptor receptor -
Then apply:
systemd-tmpfiles --create
Now the directory is recreated automatically on every boot.
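For anyone new to tmpfiles.d, here is that same line annotated field by field (format per man 5 tmpfiles.d):

```
# type  path               mode  user      group     age
d       /var/run/receptor  0755  receptor  receptor  -
```

The `d` type creates the directory at boot (and whenever systemd-tmpfiles --create runs) if it does not already exist, with the given mode and ownership; the trailing `-` means the directory is never removed by age-based cleanup.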
What This Taught Me
When everything looks fine, but nothing moves:
- Don’t start with playbooks
- Don’t chase UI clues
- Go deeper
Sometimes the failure isn’t in automation…
It’s in the foundation that enables it.
A missing directory.
A silent service.
A system waiting for something that no longer exists.
Final Thought
Not every problem announces itself.
Some just sit there quietly…
like a job stuck in PENDING,
waiting for you to look where no one else does.
Jay