I've worked on the exact same project with DQN and can offer some insights. I'm typically able to achieve an average reward of 490+ over 100 consecutive episodes, well within a 500-episode training limit. Here's my analysis of your setup.

(A quick note: I can't comment on the hard update part specifically, as I use soft updates, but I believe the following points are the main bottlenecks.)
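In case the terminology is unfamiliar: a soft (Polyak) target-network update blends the online weights into the target weights a little at a time instead of copying them wholesale. A minimal sketch in PyTorch, assuming an online network and a target network of identical shape (the names online_net, target_net, and tau are placeholders, not taken from your code):

    import torch

    def soft_update(target_net: torch.nn.Module, online_net: torch.nn.Module, tau: float = 0.005) -> None:
        # Polyak averaging: target <- tau * online + (1 - tau) * target
        with torch.no_grad():
            for t_param, o_param in zip(target_net.parameters(), online_net.parameters()):
                t_param.mul_(1.0 - tau).add_(tau * o_param)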

The Primary Issues Causing Your Slow Training

  1. Your Replay Buffer is far too large.
  2. Your Batch Size is comparatively too small.
  3. Your episode termination condition is extremely lenient.
  4. Your network architecture is overly complex, with too many parameters to train.

1. Your Replay Buffer is Too Large

We generally think a larger replay buffer gives a more uniform sample distribution, and that is true to an extent: even though the buffer evicts old transitions on a FIFO (First-In, First-Out) basis, the distribution of stored experiences stays fairly stable.

However, this comes with significant risks:

  1. It accumulates too many stale experiences. When your model samples from the buffer to learn, it is overwhelmingly likely to draw old, outdated transitions. This severely hinders its ability to learn from recent, more relevant experiences, and thus to improve.

  2. It introduces significant feedback delay. After each update, the agent collects new experiences from the environment that reflect its current policy. These new, valuable samples are added to the replay buffer, but they get diluted in the vast sea of older transitions. That prevents the model from quickly finding out whether its current policy is actually effective.

In my experience, a buffer size between 1,000 and 5,000 is more than sufficient to achieve good results in this environment.
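As a minimal sketch (my own convention, not code from your project), a bounded FIFO buffer is just a collections.deque with a maxlen; once it is full, appending a new transition silently drops the oldest one:

    import random
    from collections import deque

    # Bounded FIFO replay buffer: the oldest transition is evicted automatically
    # once the buffer holds `maxlen` entries.
    replay_buffer = deque(maxlen=5000)

    def store(state, action, reward, next_state, done):
        replay_buffer.append((state, action, reward, next_state, done))

    def sample(batch_size):
        # Uniformly sample a minibatch of transitions for one learning step.
        return random.sample(replay_buffer, batch_size)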

2. Your Batch Size is Too Small in Comparison

Generally, a larger batch size provides a more stable and representative sample for each learning step. Imagine if your batch size were 1: every update would be driven by a single, high-variance transition, and the gradient estimate would swing wildly from step to step.

With a massive replay buffer of 100,000, sampling only 32 experiences per step is highly inefficient. Your model has a huge plate of valuable data, but it's only taking tiny bites. This makes it very difficult to absorb the value contained within the buffer.

A good rule of thumb is to scale your batch size with your buffer size. For a buffer of 1,000, a batch size of 32 is reasonable. If you increase the buffer to 2,000, consider a batch size of 64. For a 5,000-sized buffer, 128 could be appropriate. The ratio between your buffer (100,000) and batch size (32) is quite extreme.
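Purely to illustrate that rule of thumb (the pairings below are my own heuristic, not an established formula), the contrast with your current settings looks like this:

    # Buffer/batch pairings I have found workable on CartPole (a heuristic, not a hard rule):
    #   buffer  1,000 -> batch  32   (ratio ~31:1)
    #   buffer  2,000 -> batch  64   (ratio ~31:1)
    #   buffer  5,000 -> batch 128   (ratio ~39:1)
    BUFFER_SIZE = 5_000
    BATCH_SIZE = 128

    print(BUFFER_SIZE / BATCH_SIZE)   # ~39 buffer entries per sampled transition
    print(100_000 / 32)               # 3125 -- your current ratio, roughly two orders of magnitude larger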

3. Your Episode Termination Condition is Too Lenient

The standard for this environment is typically a maximum of 500 steps per episode, after which the episode terminates.

I noticed you set this to 100,000. This is an incredibly high value and makes you overly tolerant of your agent's failures. You're essentially telling it, "Don't worry, you have almost infinite time to try and balance, just get me that 500 score eventually." A stricter termination condition provides a clearer, more urgent learning signal and forces the agent to learn to achieve the goal efficiently.

I stick to the 500-step limit and don't grant any extensions. I expect the agent to stay balanced for the entire duration, or the episode ends. Trust me, the agent is capable of achieving it! Giving it 100,000 steps might be a major contributor to your slow training (unless, of course, your agent has actually learned to survive for 100,000 steps, which would result in game-breakingly high rewards).
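If you're on Gymnasium (I'm assuming the Gymnasium API here; classic gym differs slightly), the 500-step cap is already built into CartPole-v1, so the simplest fix is to stop overriding max_episode_steps and just respect the truncation signal:

    import gymnasium as gym

    # CartPole-v1 already truncates episodes at 500 steps; there is no need
    # to raise max_episode_steps to 100,000.
    env = gym.make("CartPole-v1")

    state, _ = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()  # stand-in for your epsilon-greedy policy
        state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated      # truncated fires at the 500-step limit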

4. Your Network Architecture is Overly Complex

I use only two hidden layers (32 and 64 neurons, respectively), and it works very effectively. You should always start with the simplest possible network and only increase complexity if the simpler model fails to solve the problem. Using 10 hidden layers for a straightforward project like CartPole is excessive.

With so many parameters to learn, your training will be significantly slower and convergence much harder to reach.
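For reference, the kind of network I mean looks like this in PyTorch (the framework choice is an assumption on my part, since I haven't seen your code; the layer widths are simply the ones I use):

    import torch.nn as nn

    class QNetwork(nn.Module):
        """Small MLP: 4 CartPole observations in, Q-values for the 2 actions out."""

        def __init__(self, state_dim: int = 4, n_actions: int = 2):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, 32),
                nn.ReLU(),
                nn.Linear(32, 64),
                nn.ReLU(),
                nn.Linear(64, n_actions),
            )

        def forward(self, x):
            return self.net(x)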

Additional Points

  1. Your set of hyperparameters is quite extreme compared to what I've found effective. I'm not sure how you arrived at them, but from an efficiency standpoint, it's usually best to start from a set of well-known, proven hyperparameters for the environment you're working on; you can find these in papers, popular GitHub repositories, or tutorials (I've put a rough example of a starting configuration after this list).

  2. You might worry that starting with a good set of hyperparameters will prevent you from learning anything yourself. Don't worry: due to the stochastic nature of RL, even with identical hyperparameters, results can vary based on other small details, so there will still be plenty to debug and understand. I would always recommend this approach to save time and avoid unnecessary tuning cycles.

  3. This reinforces a key principle: start simple, then gradually increase complexity. This applies to your network architecture, buffer size, and other parameters.

  4. Finally, I want to say that you've asked a great question. You provided plenty of information, including your own analysis and graphs, which is why I was motivated to give a detailed answer. Even without looking at your code, I believe your hyperparameters are the key issue. Good luck!
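Here is the rough starting configuration promised in point 1. These exact numbers are my own working values for CartPole-v1 (an assumption, not a published standard), so treat them as a baseline to tune rather than something to copy blindly:

    # Baseline DQN hyperparameters I would start from on CartPole-v1.
    HYPERPARAMS = {
        "buffer_size": 5_000,
        "batch_size": 128,
        "gamma": 0.99,             # discount factor
        "learning_rate": 1e-3,
        "tau": 0.005,              # soft-update rate
        "epsilon_start": 1.0,
        "epsilon_end": 0.01,
        "epsilon_decay": 0.995,    # multiplicative decay per episode
        "max_episode_steps": 500,  # do not extend this
        "hidden_layers": (32, 64),
    }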
