Loose ROMs, busy signals, no sleep, and recursive troubleshooting

Path: sun.sirius.com!usenet
From: Don Hurter
Newsgroups: sirius.tech
Subject: Loose ROM's, busy signals, no sleep, and recursive  
troubleshooting [[very long]]
Date: 22 Sep 1995 05:10:43 GMT
Organization: Sirius Connections
Lines: 247

Where to begin? Just when we thought we had a handle on the new modem configurations, people started reporting busy signals. This of course was the last thing we wanted to hear, because the whole point of installing another 48 modems was to prevent exactly that. Two weeks ago we received a new shipment of Multitech 28.8k modems, with some confidence that they would work after the success of the first 24. We set them up, tested them, put them online, and a day later they all failed. This was described in a previous message.

Fine. We contacted Multitech, and through a few iterations of tech support phone tag we learned that the new ROM revision models we had just received need some secret DIP switch fiddling that isn't outlined in the manual. We could handle that, but on the day we tried to test them people reported busy signals, which was no surprise because the modems we had purchased to prevent that were offline, being prepped to be put in service. However, in addition to the busy signal reports we started getting connection problems on the 14.4s, which had been working almost flawlessly for a year. So that got added to the list of problems to look into. (Why would the 14.4s start acting up now, we wondered.)

The process of adding new modems is fairly involved, and I'll highlight some of the steps. First we order enough units to carry us through a projected growth curve for a month or so. We order anywhere from 24 to 30 units, 24 being a logical unit of phone lines and 30 being the number of ports in a standard Livingston Portmaster. (The USR modems come in blocks of 16, so go figure.) Then we order another Portmaster (30 ports), and keep a fresh spare in-house in case any existing units fail. The lead time on these can be anywhere from one week to three, so we have to plan ahead. Then we order phone lines from PacBell, which in the past took 7-10 days for delivery (although recently we maintained a surplus of 50-odd lines because PacBell ran out of wires in the streets and had to pull a new 900-pair cable to our building, which incidentally was planned back in May but was only finished last week.) Then we have to make sure that there are copper wires running from the phone room downstairs up to our modem room, but last Memorial Day I got tired of pulling wires all the time, so I ran 600 pairs up to our offices and terminated them all on punchdown blocks. Finally we have to make sure we have miscellaneous items like patch panels, 25-pair telco patch cables, room on the racks, serial cables, and a six-pack.

While we wait for everything to arrive, Arman prepares new IP addresses for the modems and updates all the appropriate routing entries. After we mount all the equipment and cable it up (typically a half day of work) we prepare each modem by writing in the appropriate AT command strings. Then we commence a time-consuming test procedure, which entails dialing each number and confirming that the modem picks up at the correct speed, reports a port address, issues a login prompt, authenticates the test user account, initiates a SLIP or PPP session, and then routes all further packets out to the rest of the Internet. After each call we make sure that the modem resets properly so that it is in a reliable state to answer the next call (the newer Multitechs didn't do this at first, so after a few hours of use they would get confused and no longer answer.) Each line takes anywhere from two to five minutes to test, depending on any glitches we may encounter along the way. Assuming that all this works, we then have to test line hunting by dialing each number twice in rapid succession, before the modem hangs up, to see if the second call rolls over to the appropriate line. Finally, if we are testing a new model of modem we run some long-duration transfer sessions and leave a connection up for 24 hours or more to make sure it can handle it. Then, after all that is done, we add the new modems to the tail of the hunt groups, jump the line forwarding to the first new modem, and wait for users to dial in. For a few days following that we monitor their behavior to see if there are any problems.
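(For a flavor of what one pass over a single line involves, here is a rough modern sketch of just the dial-and-check step, written in Python with the pyserial library and driving a locally attached test modem. We do all of this by hand; the device path, phone numbers, and prompt strings below are placeholders, and the speed report, authentication, SLIP/PPP, and routing checks are left out entirely.)

    # A sketch, not what we actually run: automate the dial-and-check step
    # with pyserial on a local test modem.  Device path, numbers, and
    # prompt strings are placeholders.
    import time
    import serial  # pip install pyserial

    NUMBERS = ["555-0101", "555-0102"]   # hypothetical hunt group members
    CONNECT = b"CONNECT"                 # modem result code on answer
    LOGIN = b"login:"                    # prompt the Portmaster should issue

    def test_line(device, number):
        """Dial one number, confirm an answer and a login prompt."""
        with serial.Serial(device, 38400, timeout=60) as m:
            m.write(b"ATZ\r")            # reset the local test modem
            time.sleep(2)
            m.reset_input_buffer()
            m.write(("ATDT" + number + "\r").encode())
            if CONNECT not in m.read_until(CONNECT):
                return False             # no carrier within the timeout
            return LOGIN in m.read_until(LOGIN)

    if __name__ == "__main__":
        for n in NUMBERS:
            print(n, "OK" if test_line("/dev/ttyS0", n) else "FAILED")

Even automated, a run like this only proves that a line answers and prompts; everything past the login prompt still has to be checked separately.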

In theory, that's how it's done (more or less.) However, two weeks ago we received a second batch of Multitech modems, but with a different ROM revision from the first, so that was one variable we needed to keep track of. Coincidentally, a 16-port USR modem failed at the same time (the fifth unit out of seven so far), so we had to put our reserve MP-16 online and go through the whole configuration/testing procedure one more time. By now our pool of dialup modems was close to saturation, so we had to hurry with the new Multitechs.

The first few days they rarely got hit, only during peak periods. No one reported any problems, so we thought we were finished for this round. But then customers started reporting routing problems, and occasional trouble while connected to the 14.4s, which we were already planning to address. The routing problems were fleeting and difficult to track down, and we couldn't duplicate them with our test machines. Then a day later the new modems were getting hit more and more, and people started reporting long ringing cycles before any connection, and even then the connection was sometimes unreliable. We checked the Portmaster activity and dialed the numbers ourselves, but observed intermittent success rates. Then one day it seemed that all of the new 28.8ks simply failed, and that turned out to be the inadequate reset problem.

While we were troubleshooting that with Multitech support, the routing problems got worse. We had to turn off the newest Portmaster entirely, since it seemed to be the source of the trouble. Now we had 48 new modems that we couldn't use, and 24 more older units attached to a shut-down Portmaster. This meant another lost day devoted to rewiring everything and re-jumping the hunt groups, plus a modem shortage for that evening.

Once we sorted out the reset problem we set out to tackle the Portmaster routing weirdness, as we needed that unit to get the modems back online and tested in their new configuration. It was then that we learned about a bug in Livingston's software which emerged only when one added a certain number of Portmasters to an existing Ethernet, and we had just exceeded that threshold. Now on top of everything else we had to re-arrange a block of IP addresses to revive the Portmaster, and the modems still weren't thoroughly tested on their own. Since we couldn't put questionable modems online we had to spill the excess 28.8 calls into the 14.4 pool until the following day.

Last Friday we thought we had all the 28.8ks sorted out and ready to put online, but then discovered that they dropped connections during large mail transfers or ftp sessions. Another call to Multitech revealed that a flow-control setting wasn't right for our configuration (apparently they mostly have experience with BBS customers, who don't normally use the same sort of communication servers that ISPs need.) The result was buffer overruns which would hang the modem, and the fix was to set some DIP switches that we never had to touch on the earlier ROM units. This of course required another complete test run through all the modems, 24 at a time. We finished this up on Friday evening and prayed that they would work over the weekend, since Multitech's support lines were closed until Monday.
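(For anyone who hasn't fought this particular battle: flow control is how the modem tells the computer to pause when its buffers fill up during a long transfer. On the Multitechs the setting lives on DIP switches, but here is a rough host-side illustration of the same idea, again a Python/pyserial sketch with a placeholder device and rates rather than our actual setup.)

    import serial  # pip install pyserial

    link = serial.Serial(
        "/dev/ttyS0",        # placeholder device
        baudrate=115200,     # DTE rate, higher than the 28.8k line rate
        rtscts=True,         # hardware (RTS/CTS) flow control on
        xonxoff=False,       # software (XON/XOFF) flow control off
        timeout=30,
    )
    # With rtscts=True the modem can drop CTS to pause the computer; without
    # it, sustained transfers can overrun its buffers and hang the session.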

Over the weekend people reported dropped connections before they even had a chance to log in, and through exhaustive troubleshooting we pinned it down to a bad 10-port card in one of the Portmasters, so we forwarded those lines out of service. Then on Monday the routing loops recurred, and we had to shut down a different Portmaster to look into it. Now we were down 34 out of 48 new modems until we could rewire and test everything. Before we could even get the Portmasters offline another unit failed and needed to be power-cycled before it would respond. Throughout all this we had to forward lines back and forth across each other to keep the surviving equipment in use, and we _still_ hadn't seen one uninterrupted 24-hour period where we could see how the newest fixes to the Multitechs worked out.

We pulled the spare Portmaster out of its box and attached a serial cable to assign an IP address. No go. It wouldn't boot up at all, and Livingston deduced that it had a bad motherboard or something (the last thing one wants to hear when already down two other units.) We ordered three more replacements for express delivery, but by now it was late in the afternoon so they wouldn't arrive until the following morning. We had no choice but to leave everything else alone and endure another night of busy signals. But now the 14.4k trouble reports were also mounting, and we thought there must be some sort of radioactive virus killing our equipment, since the 14.4 modems had _never_ failed before, except for two units.

For the past three days I've been coming in extra early to re-arrange the hunt groups and run tests on the existing modems, since a lot of the work requires multiple reboots/resets/uninterrupted dialing, trying to get to the bottom of the problems. Our tech support crew has borne the brunt of the customer complaints, and has continuously reported to us any modem ports they could identify as causing problems. Arman and I have been racking our brains trying to get a handle on the routing problems, but every time we tried to test things we either had some piece of equipment fail or had to delay everything because of heavy usage, since we couldn't take an entire Portmaster down for testing during peak periods. And every time we pulled equipment offline it added to the vicious cycle of busy signals, as customers had fewer and fewer modems to dial in to. This past week leading up to last night was probably the most discouraging time I've spent in this business, and we still hadn't found out why everything was failing.

This morning the first break in the storm occurred. I came in at 6 AM, when very few people were online, and pulled the Portmaster with the bad 10-port card out of service. I swapped in a good card from the DOA spare unit, wired everything back up, and started testing it. When Arman came in he looked around inside the open case and discovered that a ROM chip on the Portmaster motherboard was not entirely seated, and pushed it in. This revived that particular Portmaster, and now we had two units available again. I had to hurry to finish up my testing before 28.8k customers spilled over onto the group of modems that I had taken down, and Arman had to configure the other Portmaster for service. Then a replacement unit arrived from Livingston, and we now had enough units to get all of the modems back up, but of course that would take most of the day to accomplish. In the meantime a different set of modems started acting up, so we checked the ROM chip on their Portmaster, and sure enough it too wasn't seated. Now we had cracked the mystery of why all the units were failing, and when we checked inside the replacement Portmaster from Livingston we saw that they had secured the chip with a nylon strap, and added a support to the motherboard to boot. That alone indicated that Livingston knew about the loose chip problem, yet they never offered to tell us this on the phone (perhaps out of embarrassment.) This revelation occurred around noon today, so we haven't had a full day yet to observe the implications, but so far the equipment is finally behaving.

After we got all the 28.8k modems back on their feet we turned our attention to the aberrant 14.4s. Checking their activity reports we noted that a certain group was dropping lines more frequently than the rest, and they all happened to reside on the same 10-port card on yet another portmaster. Our first reaction was to suspect another bad card, but some ports were in use for over an hour, so that meant there wasn't a clean line of failure. We dialed into each modem with a terminal window to see if they'd pick up, and each one connected fine. However, there is a negotiation process that goes on behind the scenes that the portmasters and modems perform with the customer's computer, and it wasn't following through all the time. From the customer's viewpoint they simply see a failed link or some such, without many details of why the connection wouldn't work. However, the currently active users on these same modems indicated that they could work under certain conditions, which was perhaps more damning evidence than a total failure. After countless dialup tests we finally determined that the modems would connect, but our end wouldn't transmit the login prompt without the user first sending a carriage return, which most PPP software shouldn't have to do. If the customer's software happened to send the carriage return, everything else would follow through as intended, but such isn't necessarily the case.
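(The check that finally separated the bad ports from the good ones was simple but tedious: dial in, wait to see whether the login prompt shows up on its own, and if it doesn't, send a bare carriage return and see whether that shakes it loose. We did it by hand with a terminal window; below is a rough Python/pyserial sketch of the same probe, with a placeholder device, number, and prompt string. PPP software that happens to send a carriage return on connect never sees the difference, which is part of why this hid for as long as it did.)

    import serial  # pip install pyserial

    def probe_login_prompt(device, number):
        """Classify one dialup port by how it delivers the login prompt."""
        with serial.Serial(device, 38400, timeout=30) as m:
            m.write(("ATDT" + number + "\r").encode())
            if b"CONNECT" not in m.read_until(b"CONNECT"):
                return "no answer"
            if b"login:" in m.read_until(b"login:"):
                return "prompt arrives unasked (healthy)"
            m.write(b"\r")               # nudge the far end with a bare CR
            if b"login:" in m.read_until(b"login:"):
                return "prompt only after a CR (the bad 14.4k behavior)"
            return "no prompt at all"

    print(probe_login_prompt("/dev/ttyS0", "555-0114"))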

Now we had to figure out which 14.4k ports this problem occurred on, and why. It turned out, after another mindless session of dialup testing, that there were 9 contiguous Motorola modems at the tail end of the 14.4k pool which all shared the same ROM revision (hmm, a pattern is emerging). These particular modems had been performing well for over 6 months, yet now, of all weeks, they commit mass suicide? Perhaps they had been failing earlier and we never caught it, because they sit at the end of the pool that rarely gets hit except during brief peak periods. In any event, they have been jumped over by the hunt sequence, and we'll leave it to later troubleshooting episodes to get to the bottom of their troubles while we swap in replacements.

The lesson learned from all this is that you can't trust any piece of equipment without testing it to the point of absurdity. We assumed that if a given brand and model had worked reliably in the past, we could add more of the same units later and the new ones would behave the same way. However, even the most subtle details can blossom into full-blown disasters when one adds too many variables to the mix, as was the case with ROM revisions and new equipment. Furthermore, we got bitten big time by Livingston's lack of QA on their latest products, which is pretty indicative of the whole Internet industry, where everyone is scrambling to keep up with demand and the 'ship now, test later' mentality takes over.

We also got caught without enough reserve equipment during a cumulative set of system failures, but there again no one expects to lose one USR 16-port modem, three Portmasters (including a brand new unit still in the box) plus ten ports from a fourth, and nine Motorola modems, and then encounter trouble with 48 brand new Multitech modem cards and fall victim to one strategically timed PacBell mishap, all in the course of a two week period while attempting to cover the rest of normal business operations and deploy backlogged dedicated modem and ISDN lines over a new underground cable that has been anticipated since May but didn't get completely spliced until last Thursday and still have any time left over for planning things two weeks into the future.

This whole episode was a pretty big setback for us, both in terms of lost time and overall morale, but we know that we aren't alone in all this after reading reports of other providers' follies. One can get a certain perverse amusement reading the ba.internet group, where someone will naively ask what is involved in becoming an ISP, only to be overwhelmed by war stories from all the other frazzled service providers. The past two weeks have been a trying experience, and we aren't even sure if it's over yet, but at least we've managed to narrow down the variables somewhat and finally added the new modems to the pool.

Once again I thank everyone for whatever patience they still have, and will appreciate additional reports of modem behavior so we can tell if these damn things are finally working the way they should.

-- Don
