Queue error logging

alt=
DannyA
@dannya
9 years ago
584 posts
We are having a hellish time troubleshooting conversion and queue errors.

Can you provide some guidance on troubleshooting things stuck in a queue for a really long time. Additionally, there is no logging anywhere for failed conversions. The activity log shows successful conversions but no failed ones and there is no messaging to find out why something is failing.
updated by @dannya: 04/20/15 06:08:07PM
michael
@michael
9 years ago
7,717 posts
Got an example of a file that fails so I can try here and see if it fails for me?
alt=
DannyA
@dannya
9 years ago
584 posts
It's not any one file in particular. It may fail once but then it would work another time. The problem is there is no way to troubleshoot pending queues or conversions that never happen. It's really frustrating.
brian
@brian
9 years ago
10,148 posts
If you are getting queue entries "stuck" for a long time, it means your worker is not cleaning up or returning properly. The Queue itself does not do error logging, checking, etc. - all of that is handled by your worker function (since the queue really has no way to "know" what is to be done with a queue entry).

Your worker function should:

- return TRUE if it was successful. The queue entry will be deleted
- return FALSE if there was an error AND it wants the queue to be handed again to the next available worker
- return a positive integer to SLEEP the queue entry for the number of seconds returned - i.e. if your function returns "200" it means "hide this queue entry from all workers for 200 seconds.

Any LOGGING needs to be handled by your worker function - the core will not do logging for you.

Also - make sure when you are registering your worker function, you pass in the values as needed - i.e.

jrCore_register_queue_worker('jrAudio', 'audio_sample', 'jrAudio_create_audio_sample', 0, 4, 3600);

This says:

- When a new queue entry comes into the "audio_sample" queue, run the jrAudio_create_audio_sample() function.

- When the jrAudio_create_audio_sample() function RETURNS, give it another queue entry until there are no queue entries left for that queue (that's the first "0" - it means keep working til no queue entries left - if it was set to "5" the worker would do 5 jobs then exit).

- At most, left 4 jrAudio_create_audio_sample() functions be running at the same time.

- If a worker has not returned after 3600 seconds, abort the job, kill the worker, and give the queue entry to a new worker.

So it is up to you and your developers to make sure your worker functions "play nice" with the Queue system.

Let me know if that helps.


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net

updated by @brian: 03/18/15 11:11:33AM
alt=
DannyA
@dannya
9 years ago
584 posts
So maybe what I am asking is for the worker function to provide some level of failure logging.

For example, if I get
"deleted queue entry running for 5586 seconds - 3rd time failing"

I have no idea what song this was for, what went wrong, or how to fix it. Was there a problem with the file format? Was the encoder unable to grab the file? were we unable to write a new file? etc,,,,

The function needs some error logging.
brian
@brian
9 years ago
10,148 posts
DannyA:
So maybe what I am asking is for the worker function to provide some level of failure logging.

For example, if I get
"deleted queue entry running for 5586 seconds - 3rd time failing"

I have no idea what song this was for, what went wrong, or how to fix it. Was there a problem with the file format? Was the encoder unable to grab the file? were we unable to write a new file? etc,,,,

The function needs some error logging.

The problem is that the Core has no idea what you are doing in your worker function - that's for you to develop. The Queue system in Jamroom is a framework for running queue entries - that's it. Your worker functions are going to need to have whatever logic they need them to figure out what went wrong and properly log it as well as exit properly.


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net
alt=
DannyA
@dannya
9 years ago
584 posts
Ok. But the worker function I'm talking about is a song conversion. That's your worker function. And its not logging anything.
updated by @dannya: 03/18/15 12:49:38PM
brian
@brian
9 years ago
10,148 posts
DannyA:
Ok. But the worker function I'm talking about is a song conversion. That's your worker function. And its not logging anything.

I'm not aware of issues with it, but if you're getting an error like this:

deleted queue entry running for 5586 seconds - 3rd time failing

then it likely means the process is "bombing" - something is causing it to crash so it cannot log or cleanup.

Is this on a 100% stock install? Or are you keying custom functions off of a conversion event?


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net
alt=
DannyA
@dannya
9 years ago
584 posts
Both. I think you're missing the point.

Assume a 100% stock install. How would you troubleshoot that? What I'm trying to say is that their is no error logging in the conversion functions and it makes it very difficult to troubleshoot when things go wrong. We don't know WHY its bombing.

We also have the same problems with our custom conversion functions.
brian
@brian
9 years ago
10,148 posts
DannyA:
Both. I think you're missing the point.

Assume a 100% stock install. How would you troubleshoot that? What I'm trying to say is that their is no error logging in the conversion functions and it makes it very difficult to troubleshoot when things go wrong. We don't know WHY its bombing.

I hear you. If there is an actual failure, then it is likely logged (the conversion function has quite a bit of logging in it). If it actually "bombs" out (this can happen for example if you make a system() call that bombs out), then unfortunately the process crashes and JR is not able to log anything - the process no longer exists.

So it is hard to tell the exact reason a process is bombing - if you can reproduce it, then add fdebug calls in at every step and see how far in the process you get.

We run a conversion server here for hosting and it's handled many tens of thousands of conversions with only a handful of issues, so the conversion functions are pretty solid. If you're keying off an event during conversion then add some fdebug calls in to your code to see where you are getting to.


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net
nate
@nate
9 years ago
917 posts
I understand I can log an error inside my queue functions. What his complaint is, is that the queue just hangs and we have to manually delete it. It reports nothing and if we don't check it, we won't even know something was stuck in the queue.

I'll add some error checking.
nate
@nate
9 years ago
917 posts
Now that I look there is error checking already written. So they aren't the problem.
brian
@brian
9 years ago
10,148 posts
Natedogg265:
Now that I look there is error checking already written. So they aren't the problem.

When you say "there is error checking already written" what do you mean? What type of error checking?


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net
nate
@nate
9 years ago
917 posts
 if (isset($_conf['jrAudio_conversion_format']) && strpos($_conf['jrAudio_conversion_format'], 'ogg')) {
        // Convert OGG sample file
        $nam = str_replace('.mp3', 'ogg', basename("xtDistribLicense_{$bid}_distriblicense_sample_file.mp3"));
        ob_start();
        system("{$sox} {$input_file} {$dir}/{$nam} trim {$start} {$sample_length} fade t 00:00:0{$fade}.0 00:00:{$sample_length}.0 00:00:0{$fade}.0 >/dev/null 2>{$tmp}", $ret);
        ob_end_clean();
        // See if we had errors
        if (is_file($tmp) && filesize($tmp) > 0) {
            $tmp = file_get_contents($tmp);
            if (stristr($tmp, 'FAIL')) {
                jrCore_logger('MAJ', "errors encountered creating OGG audio sample__2", $tmp);
                @unlink("{$input_file}.sample.ogg");
                @unlink($tmp);
                return false;
            }
        }
        if (is_file("{$dir}/{$nam}") && filesize("{$dir}/{$nam}") > 200) {
            jrCore_write_media_file($profile_id, $nam, "{$dir}/{$nam}");
        }
        @unlink($tmp);
        @unlink("{$dir}/{$nam}");
    }
brian
@brian
9 years ago
10,148 posts
Natedogg265:
I understand I can log an error inside my queue functions. What his complaint is, is that the queue just hangs and we have to manually delete it.

The queue will never just "hang" - that's not how it works or what it does. Again if a queue entry seems to be "stuck" it is a worker issue - it always is. The core queue functions themselves only check to see that a queue has not been held on to by a single worker for longer than you have it configured for, or that it has failed 3 consecutive tries - that's it - it does not do any other checking.


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net
brian
@brian
9 years ago
10,148 posts
The code you posted is from the Audio module and works no problem. Were you trying to highlight something?


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net
alt=
DannyA
@dannya
9 years ago
584 posts
brian:
Natedogg265:
I understand I can log an error inside my queue functions. What his complaint is, is that the queue just hangs and we have to manually delete it.

The queue will never just "hang" - that's not how it works or what it does. Again if a queue entry seems to be "stuck" it is a worker issue - it always is. The core queue functions themselves only check to see that a queue has not been held on to by a single worker for longer than you have it configured for, or that it has failed 3 consecutive tries - that's it - it does not do any other checking.

Forget about the queue. Again, after the conversion function has failed 3 consecutive times, we have not idea why.

We don't know why it failed, what users, what media.

So the a queue log entry telling me something has failed is useless. It does not help us identify and fix the probelem. So a user can potentially be re-submitting a failing job countless times and we can't identify it or fix it.

There are many thing that can cause the function to fail: unable to read/write, corrupt files, invalid formats, etc....
brian
@brian
9 years ago
10,148 posts
DannyA:
Forget about the queue. Again, after the conversion function has failed 3 consecutive times, we have not idea why.

We don't know why it failed, what users, what media.

So the a queue log entry telling me something has failed is useless. It does not help us identify and fix the probelem. So a user can potentially be re-submitting a failing job countless times and we can't identify it or fix it.

There are many thing that can cause the function to fail: unable to read/write, corrupt files, invalid formats, etc....

That's right - the core is not supposed to tell you what went wrong - that is what I keep saying. Only the worker function can tell you want went wrong. By the time you see the "3rd time failing" error message from the core, that can only tell you something went wrong, and to start investigating - it is a "last gasp" "oops something has gone really wrong and you should check it out".

Again - the CORE queue functions have no idea what you are doing, what a queue worker is doing, etc. It does NOT know that the worker could not "download a file", or whatever. In theory queue workers could be doing thousands of different things. It only provides the framework for handling queue entries.

So again - nothing is needed at the core queue level, nor will we be making any changes since no changes are needed. It is functioning 100% as it should.

So with that said, I think what you are getting at is that the conversion process is failing and you don't know WHY. Debugging it is just like debugging anything else:

- go into the conversion function
- add print lines - i.e. fdebug('made it to here'); all over the place
- run the function and see where it gets to.

There's no magical way to figure it out - you just have to debug it.


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net
alt=
DannyA
@dannya
9 years ago
584 posts
Yes. I keep saying it. Forget the core. Forget the Queue. I don't know how else to say it's the conversion function.

But it's not that it needs to be debugged. There is nothing to debug in a process that works. It's that we do not get an error when there is a failure. Do you understand that the conversion can fail even if the code is written correctly. I gave you some examples. e.g. unable to read/write, invalid format.

I'm being very specific about the conversion function and you keep telling me about the queue . I want to know how I can see a log of a FUNCTION'S error response. You can debug to a certain point but if you can't see what the error is or what is causing it you can't fix it.
brian
@brian
9 years ago
10,148 posts
Sorry Danny - most of this thread has been about how to get the queues to report errors. Now that we're on the same page that the queue is not an issue, we can move on :)

You state "we do not get an error when there is a failure". There's only 2 things that can be done:

1) If the error is NOT a fatal error (i.e. the process does not bomb) then we need to make sure an Activity Log entry is created.

2) If the error IS a fatal error (in other words the Apache process core dumps or whatever), then it is impossible to add an Activity Log entry. If this happens, then once the time out is reached the core will "reset" the queue entry and another worker will take over. If it ALSO bombs, and then it bombs again on the 3rd try then all you'll get is the core telling you after 3 tries it did not work.

I do not believe there is any issue with the Audio or Video conversion functions that would cause #1 - any actual errors (i.e. unable to open file, invalid file format, etc.) are all logged and return appropriately.

That leaves #2. Like I've mentioned above - IF you are triggered some custom functions off of events that are initiated in the Audio conversion function (part of the Audio module) then you will want to debug your functions. I'm not aware of any issues in the stock audio conversion function, and have not seen any problems here.

Let me know if that helps.


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net
alt=
DannyA
@dannya
9 years ago
584 posts
No, I am more focused on Case #1. But as you say, we need to make sure an activity log entry is created.

YES. Exactly my point. Right now the only thing I see in the activity log is success messages.

"any actual errors (i.e. unable to open file, invalid file format, etc.) are all logged and return appropriately."


This is what I want but I am not seeing. I can see that the queue executed the function 3 times, and then aborted. But then I should see an error in the activity log from the function as to why it didn't complete. There is nothing .
updated by @dannya: 03/18/15 02:33:16PM
brian
@brian
9 years ago
10,148 posts
DannyA:
This is what I want but I am not seeing. I can see that the queue executed the function 3 times, and then aborted. But then I should see an error in the activity log from the function as to why it didn't complete. There is nothing .

Unfortunately that's not always true - if the process is bombing then you'll get no activity log entry (like #2 above).


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net
alt=
DannyA
@dannya
9 years ago
584 posts
Well, then I would hope to see something in the apache logs. But also nothing.

But I'm not buying that it's #2 always. I've never seen any kind of error.

Anyway. I don't see this thread helping at this point. The simple point of this is, conversions randomly continue to fail, no way to tell why....
brian
@brian
9 years ago
10,148 posts
DannyA:
Well, then I would hope to see something in the apache logs. But also nothing.

But I'm not buying that it's #2 always. I've never seen any kind of error.

Anyway. I don't see this thread helping at this point. The simple point of this is, conversions randomly continue to fail, no way to tell why....

I have a feeling it is related to your other S3 thread - there is likely a file copy error, which in turn could be causing it to exit.

I'm going to add additional S3 API error logging in to the S3 module - maybe that will be able to help you out. I'm fairly certain the issue is not in the actual Audio conversion function, or we would have seen the issue very quickly. Most likely it is either in the S3 functions or your custom functions - hopefully the extra logging will help you out.


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net
brian
@brian
9 years ago
10,148 posts
I just pushed up version 1.0.7 of the S3 module - it does 2 things:

- adds some additional error logging (especially on reads)
- moves error logging that was being logged to the debug_log to the Activity Log

See if that helps pinpoint any issues.

Hope this helps!


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net
alt=
DannyA
@dannya
9 years ago
584 posts
Ok thanks. will update shortly

Tags