UPDATE July 6th, 2011: Updated to include a *POSSIBLE* fix from Microsoft about the problem. I have no way of testing or confirming it, but it sounds like it has potential.
UPDATE March 2, 2010: Updated to include the now-known actual cause of the problem -JH
Recently I've had the exciting job of debugging a massive e-mail failure between kernel.org and a large hardware manufacturer (LHM). The LHM has been particularly nice about this, understands the problem and has been kind enough to do my evil bidding while I try to debug this whole problem with only one side of the entire equation to work with.
Here's the problem: The LHM has contracted all of their e-mail out to Microsoft's Exchange Hosted Service (EHS). Generally speaking this is fine: it saves LHM a lot of money in their IT budget, the mail servers are more up to date, generally more secure, etc. Really I don't have a lot to complain about with this choice - except - when it fails.
So the long version of this story is, back in December the couple of LHM employees who need to e-mail kernel.org on a regular basis started noticing that their e-mails weren't getting anywhere, and in fact were getting bounced back at them with some strange errors. They didn't think *TOO* much about it at the time, but the problem persisted and around mid-January they came to kernel.org and asked what we were seeing. I figured it was just a problem with our greylisting; we are a little more aggressive about it than others and it trips up some mail servers. So I gave my #include <STD_EMAIL_GREYLIST.h> response and expected the problem to go away. Sadly it didn't, so I opened up a proper look into what was going on.
LHM's employees were quite forthcoming with what they knew about how their e-mail worked, mainly that it was a giant black box that no one at LHM could see into because it was a Microsoft-run service (known as Bigfish). I asked kindly if they could get me the logs from their box, or at least all of the error messages, and I started delving into my logs.
That's when I noticed something very weird.
Feb XX YY:58:03 hera sendmail[32520]: STARTTLS=server, error: accept failed=-1, SSL_error=5, errno=104, retry=-1
Feb XX YY:58:03 hera sendmail[32520]: o1JKw24C032520: va3ehsobe005.messaging.microsoft.com [216.32.180.15] did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA
Hmmmmm ok, that's odd. No really that's very odd... This had been happening since Decemberish or so. Great. But I was having the employees from LHM e-mail both kernel.org and my personal domains, and the e-mails to my personal domains were going through - why?
I started checking the certs, I started monitoring more things, and then I paid slightly closer attention to my personal server's mail logs. My primary mail server was exhibiting the same problem that kernel.org was - ok at least I'm not insane, but why was I getting the e-mail? I looked at my secondary mail servers, and one of them - my longest running box
# uptime
13:02:45 up 1003 days, 15:27, 1 user, load average: 0.06, 0.11, 0.09
was actually receiving the mail, and actually accepting it. It would spool on that machine and then push to my primary without issue. Hmmmm......
So I turned up the debugging levels on kernel.org and had LHM's employees send me more e-mails. This is what I found:
Feb 4 07:17:37 hera sendmail[22509]: NOQUEUE: connect from va3ehsobe005.messaging.microsoft.com [216.32.180.15]
Feb 4 07:18:13 hera sendmail[22509]: AUTH: available mech=NTLM GSSAPI DIGEST-MD5 CRAM-MD5 ANONYMOUS, allowed mech=EXTERNAL GSSAPI DIGEST-MD5 CRAM-MD5 LOGIN PLAIN
Feb 4 07:18:13 hera sendmail[22509]: o147Hb7C022509: Milter (clamav): init success to negotiate
Feb 4 07:18:13 hera sendmail[22509]: o147Hb7C022509: Milter (spamassassin): init success to negotiate
Feb 4 07:18:13 hera sendmail[22509]: o147Hb7C022509: Milter (greylist): init success to negotiate
Feb 4 07:18:13 hera sendmail[22509]: o147Hb7C022509: Milter: connect to filters
Feb 4 07:18:13 hera sendmail[22509]: o147Hb7C022509: milter=clamav, action=connect, continue
Feb 4 07:18:13 hera sendmail[22509]: o147Hb7C022509: milter=spamassassin, action=connect, continue
Feb 4 07:18:13 hera sendmail[22509]: o147Hb7C022509: milter=greylist, action=connect, continue
Feb 4 07:18:13 hera sendmail[22509]: o147Hb7C022509: --- 220 hera.kernel.org ESMTP Sendmail 8.14.3/8.14.3; Thu, 4 Feb 2010 07:18:13 GMT
Feb 4 07:18:13 hera sendmail[22509]: o147Hb7C022509: <-- EHLO VA3EHSOBE006.bigfish.com
Feb 4 07:18:13 hera sendmail[22509]: o147Hb7C022509: milter=spamassassin, action=helo, continue
Feb 4 07:18:13 hera sendmail[22509]: o147Hb7C022509: milter=greylist, action=helo, continue
Feb 4 07:18:13 hera sendmail[22509]: o147Hb7C022509: --- 250-hera.kernel.org Hello va3ehsobe005.messaging.microsoft.com [216.32.180.15], pleased to meet you
Feb 4 07:18:13 hera sendmail[22509]: o147Hb7C022509: --- 250-ENHANCEDSTATUSCODES
Feb 4 07:18:13 hera sendmail[22509]: o147Hb7C022509: --- 250-PIPELINING
Feb 4 07:18:13 hera sendmail[22509]: o147Hb7C022509: --- 250-8BITMIME
Feb 4 07:18:13 hera sendmail[22509]: o147Hb7C022509: --- 250-SIZE
Feb 4 07:18:13 hera sendmail[22509]: o147Hb7C022509: --- 250-DSN
Feb 4 07:18:13 hera sendmail[22509]: o147Hb7C022509: --- 250-ETRN
Feb 4 07:18:13 hera sendmail[22509]: o147Hb7C022509: --- 250-AUTH GSSAPI DIGEST-MD5 CRAM-MD5
Feb 4 07:18:13 hera sendmail[22509]: o147Hb7C022509: --- 250-STARTTLS
Feb 4 07:18:13 hera sendmail[22509]: o147Hb7C022509: --- 250-DELIVERBY
Feb 4 07:18:13 hera sendmail[22509]: o147Hb7C022509: --- 250 HELP
Feb 4 07:18:13 hera sendmail[22509]: o147Hb7C022509: <-- STARTTLS
Feb 4 07:18:13 hera sendmail[22509]: o147Hb7C022509: --- 220 2.0.0 Ready to start TLS
Feb 4 07:18:14 hera sendmail[22509]: STARTTLS=server, info: fds=6/4, err=5
Feb 4 07:18:14 hera sendmail[22509]: STARTTLS=server, error: accept failed=-1, SSL_error=5, errno=104, retry=-1
Feb 4 07:18:14 hera sendmail[22509]: o147Hb7C022509: va3ehsobe005.messaging.microsoft.com [216.32.180.15] did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA
Doing some quick deciphering on the error returned by sendmail's attempted TLS connection, I get the following:
SSL_error=5 | SSL_ERROR_SYSCALL
See http://www.openssl.org/docs/ssl/SSL_get_error.html for more on what SSL_ERROR_SYSCALL entails. Short version is that it's an error reported by the underlying I/O layer, and you need to check which errno was returned.
errno=104 | ECONNRESET
... #define ECONNRESET 104 /* Connection reset by peer */ ...
from /usr/include/asm-generic/errno.h, which is provided by the Linux kernel itself (as it's handling the TCP/IP stack here).
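The same lookup can be scripted if you have a pile of these log lines to go through. This is only a rough sketch, not anything sendmail ships: the function name `decode_starttls_error` and the little `LINUX_ERRNO_NAMES` table are made up for illustration, and the table only holds the one Linux value quoted above (errno numbers differ between operating systems, so it is hard-coded rather than looked up on the local machine). The SSL error name comes from Python's standard `ssl` module.

```python
import re
import ssl

# Linux value quoted above from /usr/include/asm-generic/errno.h.
# Hard-coded on purpose: errno numbers are not the same on every OS.
LINUX_ERRNO_NAMES = {104: "ECONNRESET"}

# Matches the two numeric fields in sendmail's STARTTLS error line.
_TLS_ERROR_RE = re.compile(r"SSL_error=(\d+), errno=(\d+)")


def decode_starttls_error(log_line):
    """Return (ssl_error_name, errno_name) for a sendmail STARTTLS error
    line, or None if the line does not carry both fields."""
    match = _TLS_ERROR_RE.search(log_line)
    if match is None:
        return None
    ssl_code = int(match.group(1))
    errno_code = int(match.group(2))
    try:
        # ssl.SSLErrorNumber mirrors OpenSSL's SSL_ERROR_* constants.
        ssl_name = ssl.SSLErrorNumber(ssl_code).name
    except ValueError:
        ssl_name = "unknown SSL error %d" % ssl_code
    errno_name = LINUX_ERRNO_NAMES.get(errno_code, "errno %d" % errno_code)
    return ssl_name, errno_name
```

Feeding it the error line from the log above gives back the same two names worked out by hand.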
Long story short, this looks like bigfish opens up a connection to my mail server, attempts to start TLS (secure connection) and then bigfish sends a reset and sendmail goes "Ergh, well I guess we didn't want to chat after all".
I've sent all of that to LHM, and as weird as this is going to sound - I'm now working through Microsoft's support side of things (I think I'm in Layer-2 with this, the Layer-1 guy was keen enough to realize we were way outside the depth of his script and bumped us up) to get this resolved. It's kind of weird to be the Chief Administrator for kernel.org and be debugging Microsoft's Exchange mail server for them - brings a smile to my face in a lot of ways.
Some additional details should others be seeing this or want to know more:
- OpenSSL versions 0.9.8g and before seem to work fine; this would include my ancient Fedora Core 5 secondary mail server
- Red Hat Enterprise Linux (RHEL) / CentOS 5 ship with OpenSSL 0.9.8e (this seems to work)
- Debian Lenny is at 0.9.8g (which was confirmed by an employee from LHM to work)
- Fedora <= 10 ships with 0.9.8g or earlier (this is known to work)
- Fedora >= 11 ships with 0.9.8k (this is known to NOT work)
- Ubuntu 9.10 may have 0.9.8k in it - this is known NOT to work
- openSUSE has 0.9.8k in it - this is known NOT to work

- OpenSSL has not been updated on my Fedora 11 or Fedora 12 boxes; this is not a change on my server side, as there have been no updates to those packages pushed down through Fedora's update process.
- Disabling TLS negotiation in sendmail alleviates the problem, but I'm not keen to disable it universally if I can avoid it. I generally think that TLS is a good thing, and that MTA <-> MTA communications should at least have some modicum of security (even if it's not perfect).
- There are Microsoft domains that can send e-mail to these systems fine; they however do not attempt TLS/SSL negotiation.
I'm going to post this in the hopes that if someone Googles for some of this they have at least some explanation of what's going on. I would also argue that the only reason this isn't more widespread is that Systems Administrators have a tendency to be stodgy and slow moving with their mail servers, and I doubt there are really that many out there running something as new as Fedora 11. I would guess the vast majority are on something like RHEL/CentOS or Debian, which aren't affected - yet.
UPDATE:
Ok so I got lucky on this one and I can now point out the exact and specific issue with the whole thing. I've struck out a chunk of data above as it's not relevant any more (and actually wrong now that I know the exact cause).
So after going through all of the above, with the LHM helping push Microsoft on the issue: for the record, to date, I still have not seen a *SINGLE* line of log information from Microsoft, and I've seen no good analysis from them. To be perfectly honest, I ended up in a Copy/Paste war with them while they were trying to analyse the logs I had sent them. I had thoroughly walked through the whole thing with way more detail and insight, and Microsoft latched onto something completely unrelated and blamed my whole SSL negotiation issue on Sendmail's inability to write out statistical data. Needless to say I was unamused and very, very frustrated with the whole situation.
However, I got lucky. One of the other kernel.org administrators (Kees Cook specifically) happened to pipe up on the problem, and commented that it looked exactly like a problem he had only recently been debugging with OpenSSL and GnuTLS. After some quick checking I can confirm that the problem *IS* a bug on Microsoft's side, but it's one that's triggered by a specific setting on the end that Microsoft is talking to.
The problem, as excruciatingly detailed above, comes into play during the SSL/TLS negotiation, specifically in the Certificate Authority section. When you send the certificate you also send the Certificate Authority certificate, both for verification of the CA and for verification of the cert itself. The default on most systems (particularly Fedora) is to send the entire CA bundle that the OS ships with. This is a sledgehammer approach that should work for most everyone, and it's understandable why it ships this way: most CAs are included in the bundle and it just works. The problem comes in when the size of the bundle gets too large.
On my older Fedora secondary mail server (the one with the obscene uptime) the CA bundle is about 418K in size; on my F11 boxes it is 654K. Not a huge change, but it's enough to tickle this problem: when the CA bundle grows too large the remote side wigs out and can't handle it. This was seen by Kees with GnuTLS: when it got the bundle it would more or less exhaust the buffer and just die. Thankfully he had access to both sides of this problem and could reasonably debug it. I did a couple of small checks and lo and behold - that was the problem. So what I ended up doing was changing the CA cert that sendmail passes back to be our own server certificate. Why? It's a self-signed cert, and we don't really have a CA that's signed it, so I'm not terribly worried about it, and *VERY* few (probably no) mail servers try to verify the cert before sending the mail, mainly because mail is handled at a central server and there's no one to verify the cert manually.
So there you have it: the bug is present in EHS, and it's trivially fixable by the end that Microsoft is talking to (in my case kernel.org), but for this problem to really go away it needs to be fixed in Exchange, i.e. Microsoft needs to fix this. I've handed all of this information to Microsoft and the LHM. MS assures me there is a bug filed with the Exchange team on it but there's no ETA on when this will be fixed - my guess is somewhere around 2020, but I might be being pessimistic.
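If you want to check your own boxes, something like the following gives the size of a PEM bundle and a rough count of the certificates in it. It's a minimal sketch under a couple of assumptions: the helper name `ca_bundle_stats` is made up, the bundle is assumed to be PEM text, and the default path in the usage line (`/etc/pki/tls/certs/ca-bundle.crt`) is where I'd expect Fedora-style systems to keep it, so point it at whatever file your sendmail is actually configured to use.

```python
import sys

# Every PEM-encoded certificate starts with this marker line.
PEM_BEGIN = b"-----BEGIN CERTIFICATE-----"


def ca_bundle_stats(path):
    """Return (size_in_bytes, certificate_count) for a PEM bundle file."""
    with open(path, "rb") as handle:
        data = handle.read()
    return len(data), data.count(PEM_BEGIN)


if __name__ == "__main__":
    # Assumed default location; pass a different path as the first argument.
    bundle = sys.argv[1] if len(sys.argv) > 1 else "/etc/pki/tls/certs/ca-bundle.crt"
    size, count = ca_bundle_stats(bundle)
    print("%s: %d bytes (%.0fK), %d certificates" % (bundle, size, size / 1024.0, count))
```

Running it on a working box and a failing box side by side is a quick way to see whether the bundle size differs the way it did between my two servers.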
This now serves as a one stop source for this problem, hopefully others will find it should they run into this problem - I can only hope.
UPDATE July 6th, 2011:
I was just contacted by the company that had the problem originally, as a potential "fix" from Microsoft may have some impact on this:
support.microsoft.com/kb/2541763
What's interesting about it is that it talks about TLS/SSL fragmentation. Now that said, they do talk about the negotiation: "TLS/SSL handshake messages become too large to be contained in a single packet", which sounds a little odd, and I guarantee you that at the stage where we are getting the CA cert, it is being sent over many packets (the CA cert that worked being 418K and the max MTU being 1.5K). Also, TLS/SSL for most uses runs over TCP/IP, where you aren't particularly worried about the size of a single packet. But I begin to wander and digress.
If you find this and you are having this problem, give the KnowledgeBase article a read, possibly even give it a shot. If it works let me know, I would actually be fascinated to know.
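To put a number on "many packets", here is the back-of-the-envelope arithmetic as a tiny sketch. The function name `min_packets` is made up, "K" is assumed to mean 1024 bytes, and the result is only a lower bound since it ignores IP/TCP headers and TLS framing.

```python
def min_packets(payload_bytes, mtu_bytes=1500):
    """Lower bound on packets needed to carry payload_bytes, ignoring
    all header overhead (integer ceiling division)."""
    return -(-payload_bytes // mtu_bytes)


working_bundle = 418 * 1024   # the bundle size that worked
failing_bundle = 654 * 1024   # the bundle size on the F11 boxes

print(min_packets(working_bundle))  # 286
print(min_packets(failing_bundle))  # 447
```

So even the bundle that worked could never have fit in a single packet, which is why the "single packet" wording in the KB article reads oddly to me.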
Thanks to renormalist for giving me the heads up on the KB article!