Case study on error “Worker connections aren’t enough while connecting to upstream”

Bibhuti Poudyal
3 min read · Aug 12, 2022

Scenario

I was running a server on EC2 that handled about 1 req/sec on average. Most of these requests had to call an external API and respond based on its result. The EC2 instance also exposed APIs that were consumed by AWS Lambdas.

12 pm was the peak time. A few days a week (around the peak hour), we got complaints from users that the service was down.

Server specs

  • Server: AWS EC2
  • Instance type: t2.medium
  • Tech stack: Node.js (Express), AWS RDS

Attempt 1

Our go-to solution was pretty straightforward: increase server resources. It had been a long time since the last server upgrade, and the user base had grown a lot since then. I upgraded the server to t2.large.

It worked!!

A month later, the same errors started showing up again, this time more frequently.

Attempt 2

This time we looked into it more deeply. I checked the PM2 logs for errors: none found. My first guess was that either we weren’t logging that error or the code had no mechanism to detect and handle it. Then I revisited the whole code base for possible loopholes. A few lines of code looked like probable culprits, but debugging those parts revealed they weren’t the problem.

The only place this error showed up was in the API response from the EC2 instance: a 500 Internal Server Error. So I planned to analyze the overall architecture once again.

These were the layers of services communicating with each other (a minimal sketch of the NGINX proxy layer follows the list):

  • External API
  • EC2 instance
    - Node.js application
    - PM2 process manager
    - NGINX
  • Lambda requesting the EC2 instance
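
For context, NGINX sits in front of the Node.js application and proxies requests to it; in NGINX terms, the Node.js app is the “upstream”. The block below is a hypothetical sketch of such a reverse-proxy setup, not our actual configuration; the server name and port 3000 are assumptions.

    server {
        listen 80;
        server_name api.example.com;   # placeholder domain

        location / {
            # NGINX keeps one connection to the client and opens another
            # to the upstream Node.js app, so a single proxied request
            # can occupy two worker connections.
            proxy_pass http://127.0.0.1:3000;
            proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }
    }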

The logs from Lambda were clean, and so were the logs from PM2. Then I checked the NGINX logs. There were a lot of them, but buried in all that text I found “768 worker connections aren’t enough while connecting to upstream”.

Solution

A quick search led me to NGINX’s official blog post, Tuning NGINX for Performance, which describes worker_connections as:

The maximum number of connections that each worker process can handle simultaneously. The default is 512, but most systems have enough resources to support a larger number. The appropriate setting depends on the size of the server and the nature of the traffic, and can be discovered through testing.

Then I checked the configuration file at /etc/nginx/nginx.conf. By default, worker_connections was set to 768 there, not the 512 mentioned in the docs. I looked at other sources covering a similar issue with a working solution; each matched my case a little, but none matched it completely.
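
For reference, this is roughly what that section of a stock Ubuntu nginx.conf looks like (the comments are mine):

    events {
        # Ubuntu's packaged NGINX ships with 768 here;
        # the upstream NGINX default is 512.
        worker_connections 768;
    }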

Most people were simply assigning some large number to it, and that solved their problem.

I looked deeper into the NGINX blog and found that this number needs to be considered together with the worker_rlimit_nofile directive.

As stated in the docs:

It should be kept in mind that this number includes all connections (e.g. connections with proxied servers, among others), not only connections with clients. Another consideration is that the actual number of simultaneous connections cannot exceed the current limit on the maximum number of open files, which can be changed by worker_rlimit_nofile.

These answers were helpful in understanding worker_rlimit_nofile better.

Looking at all those answers, I decided to update the worker_connections value to 10000. Since then, the server has been working perfectly fine, even during peak hours. If this approach causes any issues, I will definitely update this post.
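
For anyone facing the same error, here is a sketch of the relevant part of the final configuration. Only worker_connections 10000 comes from our actual fix; worker_processes auto and the worker_rlimit_nofile value of 20000 are illustrative, chosen so that the open-file limit comfortably covers the connections (each proxied request needs at least two).

    worker_processes auto;          # one worker per CPU core

    # Per-worker limit on open file descriptors; should be at least
    # as large as worker_connections (illustrative value).
    worker_rlimit_nofile 20000;

    events {
        # Maximum simultaneous connections per worker, counting both
        # client connections and connections to the upstream app.
        worker_connections 10000;
    }

After editing /etc/nginx/nginx.conf, the usual practice is to validate the configuration with nginx -t and then reload NGINX for the change to take effect.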

Thanks for reading.
