Bug #50040

local.delivery hangs due to another process holding a lock on the mailbox.

Added by Richard Hall almost 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Richard Hall
Category:
Scalix Server
Target version:
Scalix 12.6
Start date:
10/21/2015
Due date:
11/13/2015
% Done:

100%

Estimated time:
Operating System:
--

Description

Scalix 12.5 maintains users' calendar free/busy information in a different way: the server maintains this information for all users, rather than each user publishing their own free/busy info. The 12.5 server program that does this is 'sxfbs'.

Under some conditions it seems that the sxfbs process blocks access to one or more mailboxes so that local.delivery cannot deliver to the inbox. This causes the local.delivery process to wait indefinitely.
Restarting the Freebusy service:
omshut -d0 -w fbs; omon fbs
allows local.delivery to resume message delivery.

Another symptom of this problem can be the delivery of multiple copies of an email, together with an error-log warning of this type:
WARNING ..
[OM 7624] Duplicate message check in user's In Tray delivery log has exceeded timeout

This looks like some sort of timing/locking issue between sxfbs and local.delivery processes.

History

#1

Updated by Richard Hall almost 4 years ago

Here are the relevant bits from the email thread:

-----Original Message-----
From: Dirk Ahrnke
Sent: 20 July 2015 17:26

I am trying to put the pieces together...
After creating more instances of local.delivery (omsetsvc -x ld 4) I was
able to submit most of the messages in the queue.
Some are still left and still blocking delivery.

omshowlog now reveals a possible cause:
WARNING ..
[OM 7624] Duplicate message check in user's In Tray delivery log has exceeded timeout
...
User Name: Andrew G****

It appears that all stuck messages are to be delivered to his mailbox.

Will now try setting LD_INTRAY_DATA_MAX_SIZE=0 to see whether it can work around the problem.
The drawback: local delivery will not check for duplicates anymore.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
update:

setting LD_INTRAY_DATA_MAX_SIZE=0 has just one effect:
it removes the 2q-file from the list of open files for the stuck local.delivery process.

The next attempt gets stuck while accessing the v8 file of this user.
Other processes accessing the v8 file:
several ual.remote processes and sxfbs.

omtidyu on this mailbox will not finish, and neither does an attempt to open it with omcontain.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

I'm suspicious about the sxfbs process. Can you try restarting the
Freebusy service (and if this just hangs, then kill the sxfbs process),
and see if that unblocks access to that mailbox?

I had the same suspicion.
No stuck messages after restarting sxfbs.

I have no idea how to reproduce. The mailbox in question isn't
exceptionally large, just 1.4k messages in the In Tray.
Looking at the open files for the new sxfbs instance shows some more
files in the user domain being held open permanently.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

#2

Updated by Richard Hall almost 4 years ago

I've taken a look at the Scalix server that has the problem, but it is evening local time and therefore lightly loaded.
I can't see any obvious problems with the sxfbs process. However, in its idle state it does hold some user mailbox containers open. I'm not sure why this is; sending a meeting request to generate some calendar activity doesn't change the list of open files.

I need to familiarise myself with the Freebusy service design/mechanism and take a look at the sxfbs code where it accesses mailbox containers.

I'll also set up a 12.5 test environment with extra file i/o tracing.

ls -al /proc/20864/fd
total 0
dr-x------ 2 root root 0 Jul 21 13:03 .
dr-xr-xr-x 7 scalix scalix 0 Jul 21 13:02 ..
lr-x------ 1 root root 64 Jul 21 13:03 0 -> /dev/null
l-wx------ 1 root root 64 Jul 21 13:03 1 -> /dev/null
lrwx------ 1 root root 64 Jul 21 13:03 10 -> /var/opt/scalix/cm/s/dir/1/1/dir_256.dat
lrwx------ 1 root root 64 Jul 21 13:03 11 -> /var/opt/scalix/cm/s/dir/1/1/dir_128.dat
lrwx------ 1 root root 64 Jul 21 13:03 12 -> /var/opt/scalix/cm/s/dir/1/1/dir_40.key
lrwx------ 1 root root 64 Jul 21 13:03 13 -> /var/opt/scalix/cm/s/dir/1/1/dir_16.dat
lrwx------ 1 root root 64 Jul 21 13:03 14 -> /var/opt/scalix/cm/s/dir/1/1/dir_24.key
lrwx------ 1 root root 64 Jul 21 13:03 15 -> /var/opt/scalix/cm/s/user/g000035/00000v8
lrwx------ 1 root root 64 Jul 21 13:03 16 -> /var/opt/scalix/cm/s/dits/0
lrwx------ 1 root root 64 Jul 21 13:03 17 -> /var/opt/scalix/cm/s/user/g000035/0000004
lrwx------ 1 root root 64 Jul 21 13:03 18 -> /var/opt/scalix/cm/s/data/0000003/000011j
lrwx------ 1 root root 64 Jul 21 13:03 19 -> /var/opt/scalix/cm/s/user/g000041/00000v8
l-wx------ 1 root root 64 Jul 21 13:03 2 -> /var/opt/scalix/cm/s/logs/daemon.stderr
lr-x------ 1 root root 64 Jul 21 13:03 20 -> inotify
lrwx------ 1 root root 64 Jul 21 13:03 21 -> socket:[33197052]
lrwx------ 1 root root 64 Jul 21 13:03 22 -> /var/opt/scalix/cm/s/user/g000041/0000004
lrwx------ 1 root root 64 Jul 21 13:03 3 -> /var/opt/scalix/cm/s/dir/1/1/dir_8.key
lrwx------ 1 root root 64 Jul 21 13:03 4 -> /var/opt/scalix/cm/s/dir/1/1/dir_32.dat
l-wx------ 1 root root 64 Jul 21 13:03 5 -> /var/opt/scalix/cm/s/logs/ftlvis.log
lrwx------ 1 root root 64 Jul 21 13:03 6 -> /var/opt/scalix/cm/s/dir/1/1/vi_dir.lock
lrwx------ 1 root root 64 Jul 21 13:03 7 -> /var/opt/scalix/cm/s/dir/1/1/dir_64.dat
lrwx------ 1 root root 64 Jul 21 13:03 8 -> /var/opt/scalix/cm/s/dir/1/1/dir_4.key
lrwx------ 1 root root 64 Jul 21 13:03 9 -> /var/opt/scalix/cm/s/dir/1/1/dir_16.key

#3

Updated by Richard Hall almost 4 years ago

I've had a look at the server 'fleagle' but there is nothing there that I can see that helps pin-point the problem (but the machine is lightly loaded at the moment).

I did notice that the logging level for 'smtpd', 'local delivery' and 'service router' is set very high at 15. It should only be this high for brief periods of low-level tracing and should be set back to 7 (and the services restarted). This will not be the cause of the local.delivery hang but it may be a factor in making it more likely if it is a timing issue.
Can you let me know when this level has been reset to normal, and whether it has any effect on whether local.delivery hangs periodically? Thanks.

#4

Updated by Richard Hall almost 4 years ago

OK, Judi has reset the logging level back down to 7 and restarted local delivery; it will be interesting to see if the hang problem happens again.

@Judi - is it true that this has only become an issue since Friday (17th Jul)?

#5

Updated by Richard Hall almost 4 years ago

On 07/21/2015 02:33 PM, Judi Dey wrote:

Yes these aborts on the local delivery only started after Friday.

Cheers,
Judi

#6

Updated by Richard Hall almost 4 years ago

I can't reproduce this problem by running our mail server with a logging level of 15 for sxfbs and local.delivery.

If this occurs again then extra information can be obtained by checking /proc/locks to see which Scalix process is permanently holding a lock, and by using gdb on the sxfbs process and on the blocked local.delivery process to see which procedure is blocked.

A code review of the freebusy code showed why we occasionally get a WARNING from sxfbs when reading an update file, and also revealed an error in the order in which containers are closed if an error occurs while accessing a user's calendar. The second of these may explain the blocked local.delivery process.

Changes after code review:
1. user.c: close containers in the correct order in the RECOVER block
2. updatesInterface.c: retry the read if no data is read initially

#svn commit
Sending fbs/updatesInterface.c
Sending fbs/user.c
Transmitting file data ..
Committed revision 14744.
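
For reference, the shape of these two review changes can be illustrated with a short, hypothetical C sketch. This is not the Scalix source; the function names, paths and buffer sizes are invented, and it only shows the two patterns named above: retrying a read that initially returns no data, and closing containers in a consistent (here, reverse-of-open) order on the recover path.

/* Illustrative sketch only: hypothetical names, not the Scalix sources. */
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Change 2: if the first read() returns no data (a writer may not have
 * flushed the update file yet), back off briefly and retry once instead
 * of immediately logging a WARNING. */
static ssize_t read_update_record(int fd, void *buf, size_t len)
{
    ssize_t n = read(fd, buf, len);
    if (n == 0) {
        usleep(100 * 1000);          /* 100 ms back-off */
        n = read(fd, buf, len);
    }
    return n;
}

/* Change 1: on the recover/escape path, close the containers in a
 * well-defined order (here, the reverse of the order they were opened)
 * so no container is left open or locked behind the caller's back. */
int process_user_update(const char *calendar_path, const char *intray_path)
{
    char rec[512];
    int cal_fd = open(calendar_path, O_RDWR);
    if (cal_fd < 0)
        return -1;

    int intray_fd = open(intray_path, O_RDWR);
    if (intray_fd < 0) {
        close(cal_fd);               /* only one container open so far */
        return -1;
    }

    if (read_update_record(cal_fd, rec, sizeof rec) < 0) {
        close(intray_fd);            /* RECOVER: last opened, first closed */
        close(cal_fd);
        return -1;
    }

    close(intray_fd);
    close(cal_fd);
    return 0;
}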

#7

Updated by Volodymyr Tomash almost 4 years ago

  • Subject changed from https://bugzilla.scalix.com/show_bug.cgi?id=50040 to local.delivery hangs. Restarting FreeBusy service (sxfbs) is a temporarily fix for the problem,
#8

Updated by Richard Hall almost 4 years ago

  • Target version changed from Scalix 12.5 to Scalix 12.6
#9

Updated by Richard Hall almost 4 years ago

  • Due date set to 11/13/2015
  • Status changed from Answered to Resolved
  • % Done changed from 0 to 100

Investigation for bug #60096 shows there is an SFL locking issue when a Calendar item that is being modified is concurrently accessed (e.g. by SWA and Thunderbird).

Closing this bug as there is a 12.6 fix for bug #60096.

#10

Updated by Richard Hall almost 4 years ago

  • Subject changed from local.delivery hangs. Restarting FreeBusy service (sxfbs) is a temporarily fix for the problem, to local.delivery hangs due to another process holding a lock on the mailbox.

Some additional notes on this issue.

Using "cat /proc/locks" can identify the process that is holding a lock and blocking other processes.

The examples of this problem showed that an IMAP or UAL process was not releasing a lock on a container file. This could be seen by examining the system locks file (cat /proc/locks) while the problem was happening:

cat /proc/locks |grep 3542921

2: POSIX ADVISORY READ 5334 fd:02:3542921 956 956
2: -> POSIX ADVISORY WRITE 5402 fd:02:3542921 956 957
2: -> POSIX ADVISORY WRITE 5918 fd:02:3542921 956 957
2: -> POSIX ADVISORY WRITE 6361 fd:02:3542921 956 957
2: -> POSIX ADVISORY WRITE 6627 fd:02:3542921 956 957
2: -> POSIX ADVISORY WRITE 6307 fd:02:3542921 956 957
2: -> POSIX ADVISORY WRITE 2544 fd:02:3542921 956 957
2: -> POSIX ADVISORY WRITE 5145 fd:02:3542921 956 957

3: POSIX ADVISORY READ 5286 fd:02:3542903 956 956
3: -> POSIX ADVISORY WRITE 5334 fd:02:3542903 956 957

The "->" indicates that the process is waiting for the lock (i.e. blocked)

Using gdb we can see what process 5334 is currently doing:

gdb -p 5334 (IMAP CORE)

bt
#0 0x00fcb430 in __kernel_vsyscall ()
#1 0x001ef76b in fcntl () from /lib/libpthread.so.0
#2 0x00db0756 in mdc_fcntl_lock () from /opt/scalix/lib/libom_mdc.so
#3 0x00acb9c8 in os_TryLockPart () from /opt/scalix/lib/libom_os.so
#4 0x002a16f8 in sfl_LockSflItem () from /opt/scalix/lib/libom_sfl.so
#5 0x00acba1f in os_TryLockPart () from /opt/scalix/lib/libom_os.so
#6 0x00b60f26 in ApplyLockToFile () from /opt/scalix/lib/libom_ct.so
#7 0x00b69904 in LockCtner () from /opt/scalix/lib/libom_ct.so
#8 0x00b7ad89 in ct_RealOpenCtner () from /opt/scalix/lib/libom_ct.so
#9 0x00b7a643 in ct_OpenCtner () from /opt/scalix/lib/libom_ct.so
#10 0x0152e74f in ual_CreateList () from /opt/scalix/lib/libom_ual.so
#11 0x015b8ba0 in ual_PrepList () from /opt/scalix/lib/libom_ual.so
#12 0x015ba411 in ual_PrepareList () from /opt/scalix/lib/libom_ual.so
#13 0x01528933 in ual_RecvReply () from /opt/scalix/lib/libom_ual.so
#14 0x00dda18b in uall_read () from /opt/scalix/lib/libom_tucl.so
#15 0x00dd67bf in ual_recv () from /opt/scalix/lib/libom_tucl.so
#16 0x08088239 in ?? ()
.
.
#30 0x079ead36 in __libc_start_main () from /lib/libc.so.6
#31 0x0804e0f1 in ?? ()

Process 5334 is waiting for a write lock but is already holding a read lock. The investigation for bug #60096 discovered a problem in the SFL code where concurrent access could take an error escape path that did not release the lock. This could leave a process holding a container lock for a potentially long time.
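
This pattern is easy to reproduce with plain POSIX record locks. The following is a minimal C sketch, not the actual SFL code; the byte offset 956 and the function names are purely illustrative, chosen to line up with the /proc/locks regions shown above. It shows how an escape path that skips the unlock leaves a read lock held for as long as the process keeps the container file open, so every later F_SETLKW write-lock request from another process queues behind it (the "->" entries above) and blocks inside fcntl(), which is exactly where the backtrace stops.

/* Illustrative sketch only: not the Scalix SFL code. */
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Lock or unlock a one-byte region of the container file. */
static int set_item_lock(int fd, off_t off, short type, int wait)
{
    struct flock fl = {0};
    fl.l_type = type;                 /* F_RDLCK, F_WRLCK or F_UNLCK */
    fl.l_whence = SEEK_SET;
    fl.l_start = off;
    fl.l_len = 1;
    return fcntl(fd, wait ? F_SETLKW : F_SETLK, &fl);
}

/* Buggy shape: the early "escape" return skips the unlock. The read lock
 * on byte 956 stays in place until this process closes the file or exits,
 * and any other process calling set_item_lock(fd, 956, F_WRLCK, 1) blocks. */
int read_item_buggy(int fd, char *buf, size_t len)
{
    if (set_item_lock(fd, 956, F_RDLCK, 1) == -1)
        return -1;
    if (read(fd, buf, len) < 0)
        return -1;                    /* escape: lock never released */
    set_item_lock(fd, 956, F_UNLCK, 0);
    return 0;
}

/* Fixed shape: every exit path releases the lock before returning. */
int read_item_fixed(int fd, char *buf, size_t len)
{
    if (set_item_lock(fd, 956, F_RDLCK, 1) == -1)
        return -1;
    ssize_t n = read(fd, buf, len);
    set_item_lock(fd, 956, F_UNLCK, 0);
    return n < 0 ? -1 : 0;
}

Because fcntl locks are owned per process, the stuck lock is also dropped when the holding process exits or closes the file, which is why restarting the process holding the lock (sxfbs in the original report, or the IMAP/UAL process here) unblocks local.delivery.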
