Intermittent Network Issues

Incident Report for Endless Group

Postmortem

On 1/26/20 at around 11:34 AM EST, we were informed that the site had become extremely slow to respond.

Max - Last Sunday at 11:34 AM

Can it be that the server has to handle a lot right now? Because responses are really slow right now (and I mean even slower than my connection normally is)

After this initial message, we noticed that our servers had gone completely offline. Over the next few days, the systems would come online for a short period, go offline again for a longer period, and then repeat the cycle. At first we were unable to get any onsite staff to look at the possible hardware problem, and during the limited periods of connectivity we could not diagnose any issue through our router interfaces or through Proxmox.

During this time we attempted many troubleshooting steps that we could perform remotely, including gateway resets and router reboots. None of these resolved the problem.

We were eventually able to get onsite staff to look at the problem, but no obvious fault presented itself. We reseated all networking connections, and the issue appears to have been resolved since then. We are unsure of the exact cause, but we suspect a power supply problem with our HP ProCurve switch. We will update this postmortem if any new information is discovered.

We have added many new monitoring systems through Datadog so that we are informed more quickly when an incident happens and can handle it better. Some of the Datadog stats are available on the homepage of this status site, and the rest are available on our public Datadog dashboard, which is linked at the top of this status site.
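To illustrate the kind of pattern this monitoring is meant to catch, the following is a minimal sketch of detecting the online/offline cycle described above from a series of reachability probes. It is not our Datadog configuration and does not use any Datadog API; the function names, the sample format, and the `min_transitions` threshold are assumptions made for illustration only.

```python
from itertools import groupby


def summarize_runs(samples):
    """Collapse (timestamp, is_up) probe samples into runs of equal state.

    Returns a list of (is_up, first_timestamp, last_timestamp, sample_count).
    The sample format is a hypothetical one chosen for this sketch.
    """
    runs = []
    for is_up, group in groupby(samples, key=lambda s: s[1]):
        group = list(group)
        runs.append((is_up, group[0][0], group[-1][0], len(group)))
    return runs


def is_flapping(samples, min_transitions=3):
    """Return True if the state changed at least min_transitions times.

    min_transitions is an assumed threshold, not a tuned value.
    """
    transitions = len(summarize_runs(samples)) - 1
    return transitions >= min_transitions


# Example: short online periods separated by longer offline periods.
example = [
    (0, True),
    (1, False), (2, False), (3, False),
    (4, True),
    (5, False), (6, False),
]
```

With the example data above, `summarize_runs(example)` yields four runs and `is_flapping(example)` is true, which matches the short-up, long-down cycle we saw during this incident.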

The following is some more information on the switch issue:

DJ Electro - Today at 7:53 AM

my thinking was when I connected over Wifi, I couldn't even reach local sites (192.168.1.1 timed out). Soooo it couldn't be the modem since I would still be able to reach local addresses, therefore it must be the switch because if that died then it would make sense I would drop connections to all addresses.
Now our switch does this weird thing sometimes where if there's a big voltage drop, which can happen if there's a power grid problem or if the cable is yanked or pulled in the wrong direction, the switch will power cycle instead of doing what anything else does during a brownout. So my guess was that the cable was pulled wayyyy off, and so every single time it would complete the boot cycle it would just power cycle again and we were sort of stuck in an infinite loop. I checked the modem and I reseated the power plug on the switch.
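The reasoning in the message above can be written down as a small decision rule. This is only a sketch of that reasoning, not a tool we ran during the incident. The function names are hypothetical, the result strings are our own wording, and probing the local address over TCP port 80 is an assumption about how the router at 192.168.1.1 could be reached.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable networks.
        return False


def suspect(local_ok, external_ok):
    """Map two reachability results to the component most likely at fault.

    If even a local address is unreachable, the modem cannot be the only
    problem, so the switch (or local network) is the first suspect.
    """
    if not local_ok:
        return "switch or local network"
    if not external_ok:
        return "modem or upstream link"
    return "no fault detected"


def diagnose(local_host="192.168.1.1", external_host="example.com", port=80):
    """Probe one local and one external address (both assumed values)."""
    return suspect(can_connect(local_host, port), can_connect(external_host, port))
```

Calling `diagnose()` from a machine on the affected network would return "switch or local network" in the situation described above, where 192.168.1.1 timed out.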

Thanks for sticking with us,
EH Administration

Posted Jan 28, 2020 - 08:29 EST

Resolved

At this time we have determined that the issue is most likely resolved. Expect a postmortem at a later time.

Posted Jan 28, 2020 - 08:17 EST

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jan 27, 2020 - 11:16 EST

Investigating

The issue appears to have returned, although network access was available for 30+ minutes.
We suspect an L1 (physical layer) problem at this point, possibly a pulled coax line, an ethernet problem, or a problem with our cable modem. It is also possible that we are having problems with our HP ProCurve network switch. We will likely look into replacing our cable modem after this, as it has usually been the source of similar problems.
Unfortunately, there are no staff available onsite, so we can only work on the problem during the short periods when the network comes online.

Posted Jan 27, 2020 - 09:07 EST

Monitoring

We were unable to locate an exact cause of the issue because we only have remote access to the datacenter. At this time, IPv4 access appears to be operational. IPv6 access appears to be non-functional, and we are working on restoring it now that we are able to access our router interface. As such, this issue has been reduced to a partial outage.

Posted Jan 27, 2020 - 08:51 EST

Investigating

We have identified an issue with our network stack. At this time, the cause of the issue is unknown. The issue appears to be intermittent and causes problems connecting to any Endless services. We will post updates as more information is found.

Posted Jan 27, 2020 - 07:20 EST

This incident affected: Customer Systems (Networking).