Troubleshooting workers

DannyA
@dannya
9 years ago
584 posts

For some reason now my queue server is getting backed up on ALL queue tasks, so I assume the worker is not picking up the jobs.

On worker:
- Queue client is installed and the appropriate modules are checked.
- All relevant worker modules are installed.
- Cron server is able to ping the worker.
- We have conversion server enabled on the queue server.
- We have conversion worker enabled on the worker server.
- Queue client is enabled on all servers.

How do I troubleshoot this?
updated by @dannya: 06/11/15 08:15:59AM

DannyA
@dannya
9 years ago
584 posts

Also, queue server shows queues but conversion server shows no jobs.

DannyA
@dannya
9 years ago
584 posts

Actually, we are seeing the following message in the activity log A LOT:

jrCore_load_url: http://tx1.cleartracks.com/cloudqueueserver/get/send_email returned code 100 - error #28 (Operation timed out after 30001 milliseconds with 0 bytes received)

@brian
9 years ago
10,148 posts

If you are on AWS and have changed servers, then you need to also update your security policies to allow the servers to communicate with each other. The error you see means one of 2 things:

- the network path between the servers is blocked
- the web server is not running on tx1.cleartracks.com
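
One rough way to tell these two cases apart from the server that logs the error is sketched below in Python. This is only an illustration and not part of Jamroom: the function names are made up for this example, and the 5-second cap on the TCP check is an assumption (the 30-second default mirrors the timeout in the log message above). It first opens a plain TCP connection to the port, then attempts a real HTTP request without going through any proxy.

```python
import http.client
import socket
import urllib.error
import urllib.request
from urllib.parse import urlsplit


def tcp_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def diagnose(url, timeout=30.0):
    """Classify a URL as unreachable, accepting connections but silent, or responding."""
    parts = urlsplit(url)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    if not tcp_reachable(parts.hostname, port, timeout=min(timeout, 5.0)):
        return "network path blocked or host unreachable"
    # An empty ProxyHandler makes the request go directly to the server.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return f"web server responded with HTTP {resp.status}"
    except urllib.error.HTTPError as err:
        # The server answered, just not with a success status.
        return f"web server responded with HTTP {err.code}"
    except (OSError, http.client.HTTPException):
        return "port open, but web server did not answer in time"
```

Calling diagnose("http://tx1.cleartracks.com/cloudqueueserver/get/send_email") on the server that records the timeouts should indicate which of the two cases applies.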

--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net

DannyA
@dannya
9 years ago
584 posts

But if the ping is successful, doesn't it mean it's not blocked?

If not, how do you set the security policy? I don't remember doing that when you helped me set up the last server.

And what about these activity log errors? There are many, many entries despite those items not being checked in the queue client.

Date IP Text
05/09/15 14:43:45 69.86.6.74 [Label Five]: jrCore_load_url: http://tx1.cleartracks.com/cloudqueueserver/get/send_email returned code 0 - error #28 (Operation timed out after 30000 milliseconds with 0 bytes received)
05/09/15 14:43:45 69.86.6.74 [Label Five]: jrCore_load_url: http://tx1.cleartracks.com/cloudlogserver/create returned code 0 - error #28 (Operation timed out after 30001 milliseconds with 0 bytes received)
05/09/15 14:43:26 69.86.6.74 [Label Five]: jrCore_load_url: http://tx1.cleartracks.com/cloudqueueserver/get/send_email returned code 0 - error #28 (Operation timed out after 30001 milliseconds with 0 bytes received)
05/09/15 14:43:26 69.86.6.74 [Label Five]: jrCore_load_url: http://tx1.cleartracks.com/cloudlogserver/create returned code 0 - error #28 (Operation timed out after 30001 milliseconds with 0 bytes received)
05/09/15 14:42:45 69.86.6.74 [Label Five]: jrCore_load_url: http://tx1.cleartracks.com/cloudqueueserver/get/search_index returned code 100 - error #28 (Operation timed out after 30000 milliseconds with 0 bytes received)

@brian
9 years ago
10,148 posts

Every single one of those errors is a network timeout, so you definitely have connectivity issues. I have never had access to your AWS account - you'll need to double check your security policies to ensure the servers can communicate with each other.
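
To confirm at a glance that the entries share one cause, the log lines can be grouped by URL and cURL error number. The Python sketch below assumes the message format shown in the log excerpt above; the function name is made up for illustration. cURL error 28 is its "operation timed out" code.

```python
import re
from collections import Counter

# Matches the "jrCore_load_url: <url> returned code <n> - error #<n>" part of a line.
ENTRY = re.compile(
    r"jrCore_load_url: (?P<url>\S+) returned code \d+ - error #(?P<err>\d+)"
)


def summarize(lines):
    """Count activity log entries per (url, cURL error number)."""
    counts = Counter()
    for line in lines:
        match = ENTRY.search(line)
        if match:
            counts[(match.group("url"), int(match.group("err")))] += 1
    return counts
```

If every key in the result carries error number 28, the entries are all timeouts rather than a mix of different failures.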


DannyA
@dannya
9 years ago
584 posts

ok. will check. I assume they just need port 80/443
updated by @dannya: 05/09/15 01:11:54PM

DannyA
@dannya
9 years ago
584 posts

sysadmin verified. no connection issues.

@brian
9 years ago
10,148 posts

DannyA:
sysadmin verified. no connection issues.

It could also be performance - I just tried to load the URL for one of your servers here and it timed out after 60 seconds and did not load, so something is up.


DannyA
@dannya
9 years ago
584 posts

Https responds immediately. And all the cloud modules are configured with https.

Also, we are still seeing modules sitting in queue server queue that we are not sending to the queue server. e.g. the queue server shows 32 entries and 32 workers in the send_email queue. But none of the queue clients have send email checked off. They should be processed by main server.

@brian
9 years ago
10,148 posts

DannyA:
Https responds immediately. And all the cloud modules are configured with https.

Also, we are still seeing modules sitting in queue server queue that we are not sending to the queue server. e.g. the queue server shows 32 entries and 32 workers in the send_email queue. But none of the queue clients have send email checked off. They should be processed by main server.

I followed up via email, but will follow up here as well:

- you have your cloud modules configured to use "http:" NOT "https:"
- "http:" takes 2 minutes to respond, so is misconfigured or not working right.

You need to:

- change your cloud config to use https: URLs instead
- fix http (port 80) so it responds far faster than the 2 minutes it takes now (the cloud modules will time out if there is no response within 30 seconds).


DannyA
@dannya
9 years ago
584 posts

We are looking into why http://tx1 is not redirecting.

We have updated the clients to point to https though. And the issues remain.

As far as performance, that still does not explain why there are mail items in the queue. And I'm sure the fact that there are 32 active workers on the mail queues isn't helping (although we were having problems before the mail queue backed up).

@brian
9 years ago
10,148 posts

DannyA:
As far as performance, that still does not explain why there are mail items in the queue. And I'm sure the fact that there are 32 active workers on the mail queues isn't helping (although we were having problems before the mail queue backed up).

Actually it does - your workers have been unable to communicate with the master queue server since the URLs were configured incorrectly.


@brian
9 years ago
10,148 posts

And just an FYI - even your https URL for tx1 takes almost 30 seconds to load here for me, so would highly recommend checking your DNS and web server config - it should be way faster.


DannyA
@dannya
9 years ago
584 posts

brian:

DannyA:
As far as performance, that still does not explain why there are mail items in the queue. And I'm sure the fact that there are 32 active workers on the mail queues isn't helping (although we were having problems before the mail queue backed up).

Actually it does - your workers have been unable to communicate with the master queue server since the URLs were configured incorrectly.

Why are they MAIL queues though? Those should not be sent to the queue server at all. There should be no need to communicate to tx1. Mail send queue is UNCHECKED on all queue clients.

DannyA
@dannya
9 years ago
584 posts

Questions:
1. If a queue is stuck but there are available queue workers, can they still process other queues?

2. Can you explain how to kill the pending queues that are consuming worker resources?

@brian
9 years ago
10,148 posts

DannyA:
Questions:
1. If a queue is stuck but there are available queue workers, can they still process other queues?

Yes - as long as the worker can get a queue entry from the queue master.

Quote:
2. Can you explain how to kill the pending queues that are consuming worker resources?

Currently you will need to delete them from the jr_jrcore_queue table - I'm working on an update to the Cloud Queue Server that will have a button to delete pending queue entries.
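
As a rough sketch of what that cleanup could look like, the Python below deletes all rows for one queue. Only the table name jr_jrcore_queue comes from this thread: the queue_id and queue_name columns are assumptions for illustration, so check the real schema (and take a backup) before running anything similar. It uses an in-memory SQLite table as a stand-in so it can be run safely; a MySQL driver would typically use %s placeholders instead of ?.

```python
import sqlite3


def delete_pending(conn, queue_name):
    """Delete every row for one queue and return how many rows were removed."""
    # "queue_name" is an assumed column name, not confirmed by the thread.
    cur = conn.execute(
        "DELETE FROM jr_jrcore_queue WHERE queue_name = ?", (queue_name,)
    )
    conn.commit()
    return cur.rowcount


def demo():
    """Build a stand-in table with 32 send_email rows, then clear that queue."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jr_jrcore_queue ("
        "queue_id INTEGER PRIMARY KEY, queue_name TEXT)"  # hypothetical schema
    )
    rows = [("send_email",)] * 32 + [("search_index",)]
    conn.executemany("INSERT INTO jr_jrcore_queue (queue_name) VALUES (?)", rows)
    removed = delete_pending(conn, "send_email")
    remaining = conn.execute("SELECT COUNT(*) FROM jr_jrcore_queue").fetchone()[0]
    conn.close()
    return removed, remaining
```

Filtering on a single queue name leaves entries in other queues untouched, which matters if only one queue is stuck.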


@brian
9 years ago
10,148 posts

DannyA:
Why are they MAIL queues though? Those should not be sent to the queue server at all. There should be no need to communicate to tx1. Mail send queue is UNCHECKED on all queue clients.

I'm not sure - are Queues paused on the server? Is the Cloud Queue Server installed on the same server?


@brian
9 years ago
10,148 posts

Just an FYI - I logged in to your main site and there are no email queue entries.


@brian
9 years ago
10,148 posts

And second FYI - you are going to have issues with the tx1 server until you improve the performance - it is too slow to function properly in the cluster.


DannyA
@dannya
9 years ago
584 posts

no, they are all on tx1. and they shouldn't be. mail should be sent by dev.

DannyA
@dannya
9 years ago
584 posts

Also, sysadmin has confirmed the performance problem is caused by a large number of the same process:

[12:31:51 PM] YSA Techs S: 52.0.10.70 - - [10/May/2015:16:24:51 +0000] "POST /cloudqueueserver/get/send_email HTTP/1.1" 404 5479 "-" "Jamroom v5.2.29"
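
To see how many such requests there are and which status codes they return, the access log can be tallied by path and status. This Python sketch assumes the common "combined" access log layout, which the line above appears to follow; the function name is made up for illustration.

```python
import re
from collections import Counter

# ip, two ignored fields, [timestamp], "METHOD path protocol", status, size
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def count_by_path_and_status(lines):
    """Count requests per (path, HTTP status); lines that do not match are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match:
            counts[(match.group("path"), int(match.group("status")))] += 1
    return counts
```

A large count for ("/cloudqueueserver/get/send_email", 404) would point at workers repeatedly asking for a queue the server is not serving, rather than at general load.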

@brian
9 years ago
10,148 posts

And third FYI

The reason the "send_email" queue is not being processed on tx1 is that you have the queue server installed and active on that server - queue workers do NOT run on the queue server - so you need to set one of your workers to process the send_email queue (check it in that server's queue client).


@brian
9 years ago
10,148 posts

DannyA:
no, they are all on tx1. and they shouldn't be. mail should be sent by dev.

There are a lot of processes in JR that will send mail - including processes on tx1. However, there is no send_email worker ON tx1 that can process these, so they will just build up. You need to have a worker (tx2?) that will work your email queue for you.


DannyA
@dannya
9 years ago
584 posts

So how do I get the dev server to pick up those queue tasks?

Additionally, cache1 (the worker for all the other queues) WAS configured to send mail. So that does not explain why the mail queue on tx1 is giving 404.
updated by @dannya: 05/10/15 09:56:56AM

DannyA
@dannya
9 years ago
584 posts

You need to do something about these names. too confusing.

Why is the conversion worker not called the conversion client like everything else?

And now you are calling them queue workers instead of clients.

At some point you had something called a master.

updated by @dannya: 05/10/15 10:03:34AM

DannyA
@dannya
9 years ago
584 posts

Ok. This is dragging out.

1. How do I clear out the current queues.
2. All changes have been made to https servers
3. Which modules need to be enabled on:
---main server
---queue server
---cloud client server

@brian
9 years ago
10,148 posts

DannyA:
You need to do something about these names. too confusing.

Why is the conversion worker not called the conversion client like everything else?

Because it is the WORKER process that is doing the actual work on a worker server. The Conversion CLIENT runs on the front end and submits jobs to the SERVER, as well as receives updates from the WORKER. It is a 3 module setup - not just 2 like other cloud modules.

Quote:
And now you are calling them queue workers instead of clients.

Queue Workers - functions that execute jobs based on queues. These can be part of any module.

Queue Client - the Cloud Queue Client module that sends/receives queue entries from the Queue SERVER.

Quote:
At some point you had something called a master.

I tend to use "master" for the server host name that will run the following modules:

Cloud Conversion Server
Cloud Queue Server
Cloud Log Server
Cloud Cron Server

But you can call it whatever you want.

Also - I believe I understand the root cause of processes building up - go into your Cron Server and make sure it is pinging servers every 30 seconds instead of every 10 seconds. With a 30 second network timeout, if there is a network issue, there's going to be process buildups as new ones are going to be added every 10 seconds, but timeout after 30. I will change the default in the cloud modules to use 10 second timeouts instead of 30.
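
The arithmetic behind that buildup can be sketched in a few lines of Python. It assumes one ping per interval per server and that each ping to an unreachable server hangs for the full timeout; the function name is made up for illustration.

```python
import math


def pings_in_flight(interval_s, timeout_s):
    """Upper bound on pings still waiting on one unreachable server at any moment."""
    return math.ceil(timeout_s / interval_s)


# Ping every 10 seconds with a 30 second timeout: requests overlap.
print(pings_in_flight(10, 30))  # 3
# Ping every 30 seconds with a 30 second timeout: no overlap.
print(pings_in_flight(30, 30))  # 1
# Ping every 30 seconds with a 10 second timeout: no overlap.
print(pings_in_flight(30, 10))  # 1
```

Under these assumptions the stuck requests stop piling up once the timeout is no longer than the ping interval.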

updated by @brian: 05/10/15 10:13:31AM

@brian
9 years ago
10,148 posts

DannyA:
Ok. This is dragging out.

1. How do I clear out the current queues.

I already covered this above - the jr_jrcore_queue table.

Quote:
2. All changes have been made to https servers

OK.

Quote:
3. Which modules need to be enabled on:
---main server
---queue server
---cloud client server

I think you are OK - you just need to DISABLE the Queue CLIENT on the same server you are running the Queue SERVER on. The server that the Queue SERVER module runs on has its queues "paused" permanently, since the Queue Server module handles all queues.


DannyA
@dannya
9 years ago
584 posts

Ok. All of these are done. Created a new upload and got new stuck queues.
Checked the cron server and saw the ping was not being responded to.

But if I go to url, it responds

https://cache1.cleartracks.com/cloudcore/ping

Something is still wrong

DannyA
@dannya
9 years ago
584 posts

As I said, connection from cache1 to tx1 was confirmed.
Url responds.
But tx1 is not getting a response.

DannyA
@dannya
9 years ago
584 posts

Also, the activity log error on tx1 is not a timeout; it's getting an "unable to load url" error.

@brian
9 years ago
10,148 posts

If you are getting "unable to load" errors, then there are connectivity issues between the servers - that error comes from cURL when the URL times out or is not reachable.


DannyA
@dannya
9 years ago
584 posts

As I said, sysadmins say they checked connectivity and there are no issues connecting on port 80 or 443.

They asked if there is a specific command to test.

It would be great if you could provide specific instructions for testing it. The JR system check also does not indicate any issues, but I don't know if it checks any cloud requirements.

@brian
9 years ago
10,148 posts

This has been fixed in the latest Cron Server module - you should be set.

Thanks!


solved Troubleshooting workers