Ansible Automation Platform Jobs Stuck in Pending: Root Cause and Fix


Today I ran into one of those issues that looks complex, but turns out to be beautifully simple.

Every job I launched in Ansible Automation Platform (Tower / Controller 4.5) just sat there.
No output. No errors. No movement. Just PENDING.

And sometimes, silence is the loudest signal.

What I Saw

Everything looked healthy on the surface:

  • Job templates launched successfully
  • Jobs stayed in PENDING forever
  • No logs, no failures, no hints
  • Execution Environments looked perfectly fine

At first, I suspected Ansible itself: playbooks, environments, something deep.

But this didn’t feel like a playbook problem.
This felt like something wasn’t even starting.

The Turning Point

When jobs don’t start at all, it’s usually not Ansible… it’s scheduling.

So I went one level lower: services.

That’s when I checked Receptor, the quiet engine behind job execution in AAP 4.x.

And there it was:

systemctl status receptor
● receptor.service - Receptor
   Loaded: loaded (/usr/lib/systemd/system/receptor.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/receptor.service.d
           └─override.conf
   Active: failed (Result: exit-code) since Tue 2026-03-31 12:49:06 CDT; 4s ago
  Process: 946196 ExecStart=/usr/bin/receptor -c /etc/receptor/receptor.conf (code=exited, status=1/FAILURE)
 Main PID: 946196 (code=exited, status=1/FAILURE)

Mar 31 12:49:06 xxxx systemd[1]: receptor.service: Service RestartSec=100ms expired, scheduling restart.
Mar 31 12:49:06 xxxx systemd[1]: receptor.service: Scheduled restart job, restart counter is at 5.
Mar 31 12:49:06 xxxx systemd[1]: Stopped Receptor.
Mar 31 12:49:06 xxxx systemd[1]: receptor.service: Start request repeated too quickly.
Mar 31 12:49:06 xxxx systemd[1]: receptor.service: Failed with result 'exit-code'.
Mar 31 12:49:06 xxxx systemd[1]: Failed to start Receptor.

Now we were getting somewhere.

The Real Error

Running Receptor manually revealed the truth:

error opening Unix socket: could not acquire lock on socket file: no such file or directory

That one line explained everything.
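The failure is generic, and worth internalizing: a process cannot create a socket (or the lock file guarding it) inside a parent directory that no longer exists. A quick illustration with a plain file in a throwaway temp directory (nothing here touches Receptor itself):

```shell
# Simulate Receptor's situation: the parent directory is gone.
demo=$(mktemp -d)

# Creating a file inside a missing directory fails the same way.
touch "$demo/receptor/receptor.sock" 2>/dev/null \
  && echo "created" || echo "failed: parent directory missing"

# Recreating the parent directory is the whole fix.
mkdir -p "$demo/receptor"
touch "$demo/receptor/receptor.sock" && echo "created after mkdir"
```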

What Actually Broke

Receptor relies on a Unix socket that lives under this directory:

/var/run/receptor

Here’s the subtle part:

  • /var/run (linked to /run) is temporary
  • It gets cleared on reboot or system cleanup
  • The directory /var/run/receptor was simply… gone

And Receptor?
It doesn’t recreate it.

So it failed silently.
And when Receptor is down:

  • No execution capacity
  • No scheduling
  • Jobs stay in PENDING forever

No errors in the UI, because nothing ever reached that layer.
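You can confirm the temporary nature of /var/run on any systemd-based host (exact output varies by distro):

```shell
# /var/run is just a symlink into /run on systemd distros...
ls -ld /var/run          # typically: lrwxrwxrwx ... /var/run -> /run

# ...and /run is a tmpfs (RAM-backed) mount, so its contents
# do not survive a reboot.
awk '$2 == "/run" { print $1, $2, $3 }' /proc/mounts
```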

The Fix

Sometimes the fix feels almost too simple.

I recreated the directory:

mkdir -p /var/run/receptor
chown receptor:receptor /var/run/receptor
chmod 755 /var/run/receptor

Then restarted services:

systemctl start receptor
automation-controller-service restart

And just like that—

PENDING → RUNNING

The system came back to life.
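If this bites you more than once, the fix is easy to wrap in a small guard function. This is my own sketch, not anything shipped with AAP; the default path and owner are the ones from this incident, so adjust them for your install:

```shell
# ensure_receptor_dir: recreate Receptor's socket directory if missing.
# Defaults ($dir, $owner) are assumptions from this incident.
ensure_receptor_dir() {
    dir="${1:-/var/run/receptor}"
    owner="${2:-receptor:receptor}"
    if [ ! -d "$dir" ]; then
        mkdir -p "$dir" || return 1
        chown "$owner" "$dir" 2>/dev/null || true  # needs root on a real box
        chmod 755 "$dir"
        echo "recreated $dir"
    else
        echo "$dir already present"
    fi
}
```

Run it (as root) before starting Receptor, or drop it into a pre-start hook; it is a no-op when the directory already exists.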

Making It Permanent

Because /var/run is temporary, this would happen again after reboot.

So I made it persistent using systemd:

Create:

/etc/tmpfiles.d/receptor.conf

Add:

d /var/run/receptor 0755 receptor receptor -

Then apply:

systemd-tmpfiles --create

Now the directory is recreated automatically every time.

What This Taught Me

When everything looks fine, but nothing moves:

  • Don’t start with playbooks
  • Don’t chase UI clues
  • Go deeper

Sometimes the failure isn’t in automation…
It’s in the foundation that enables it.

A missing directory.
A silent service.
A system waiting for something that no longer exists.

Final Thought

Not every problem announces itself.

Some just sit there quietly…
like a job stuck in PENDING,
waiting for you to look where no one else does.

Jay
