Telstra and the ‘whoops’ of AS1221

It served as a nice distraction from the over-hyped and bizarre ‘challenge’ to the leadership of Australian politics – a major issue with Telstra’s data network developed around 1:50pm AEDT, knocking over millions of customers on Telstra, Bigpond and ISPs who use Telstra’s backhaul links. I felt it myself – I had my Westnet (which is really an iiNet) internet connection stop working, before noticing the ever-present “3G” symbol on my iPhone had disappeared. It even made the front page of news.com.au – so you know it was a big one.

It became pretty clear very soon that something unusual was afoot. Once it was confirmed it wasn’t a ‘fail whale’ when the Twitter feed wasn’t updating, I took to the trusty command prompt in Windows 7, and did a Tracert. While I was foolish enough to *not* get a screen shot, it showed that the flow of data when trying to access http://www.google.com was stopping at one of the “10gigabitethernet” routers housed by Telstra in Sydney. This was all that was needed to confirm an issue with routing; as a result, DNS requests were going unanswered, and requests to pages where the DNS was already known were very flaky.

It was rumoured, and now confirmed, that it was in fact Dodo Internet (via Optus) that had advertised a stack of new routes to a Telstra BGP router, which effectively said that Dodo was the rest of the internet, and the Telstra router accepted them. So any traffic destined to go overseas was effectively being routed back towards Dodo. This meant that Telstra lost the ability to communicate data overseas, the local network became saturated with data and became unstable, and the connection of anyone using Telstra for international capacity suddenly stopped working.

The funny thing about the Internet, and the way it works at a hardware level, is this was entirely to be expected. The hardware bits that control the flow of information across networks (which form the Internet) can only work on the information given to them. It’s the old GIGO saying – “Garbage In, Garbage Out”. So naturally, when a Telstra hardware bit gets told some information about where traffic should go, and it accepts it on the assumption that it is right, things go a bit haywire when said assumption is wrong. What is surprising about all this is that Telstra was accepting this volume of BGP information from what must be seen as a lesser ISP in the Australian landscape.

Whoa whoa whoa. Not everyone knows about BGP, nor what a Tracert is. Or maybe you want an English explanation of what happened today? Right.

BGP stands for Border Gateway Protocol. Its name is rather descriptive – it’s a protocol to allow information to flow between the borders of two networks, via a gateway. Importantly, the information that flows between networks is information about where data should be going if it is destined for an outside network. This is what’s called routing information; it is information about where data should be routed. The importance of BGP is that this information is crucial when data needs to be sent outside of your own network, the layout of which is already known. So for example, if you jump on your computer which has a Bigpond ADSL connection, and browse to www.google.com.au, the Bigpond network uses information obtained via BGP to know where to get the information to display the page. This is overly simplified, of course, but shows how important BGP is. So of course, when this information is wrong, the Bigpond network no longer knows where to get the right information from. Which is what happened today. (Naturally, Wikipedia has a nice breakdown of how BGP works in some detail if you are interested.)
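To make the “wrong information in, wrong path out” idea a little more concrete, here is a rough Python sketch of a routing table lookup. It is a toy, not how a real router is implemented: the prefixes are reserved documentation ranges, and the next-hop names (`international-link`, `local-peer`, `small-customer-isp`) are made up for illustration. It only shows that the most specific matching route wins, so one bad announcement is enough to send traffic somewhere else.

```python
import ipaddress


def lookup(table, destination):
    """Return the next hop of the most specific prefix covering destination, or None."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in table.items() if addr in net]
    if not matches:
        return None
    # The longest prefix (the most specific route) wins.
    _, hop = max(matches, key=lambda match: match[0].prefixlen)
    return hop


# Hypothetical routing table: a default route plus one local route.
table = {
    ipaddress.ip_network("0.0.0.0/0"): "international-link",
    ipaddress.ip_network("203.0.113.0/24"): "local-peer",
}

before = lookup(table, "198.51.100.7")

# A neighbour wrongly announces that it is the way to reach 198.51.100.0/24,
# and the announcement is accepted without question.
table[ipaddress.ip_network("198.51.100.0/24")] = "small-customer-isp"

after = lookup(table, "198.51.100.7")

print(before, "->", after)
```

The router has done nothing “wrong” here; it simply followed the best information it was given.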

Doing a Tracert is a great way of seeing the path through the internet an ISP takes to get to a web site. There are other utilities on other operating systems that do similar, but this works under Windows, my current platform of choice. In fact, you can see how iiNet were routing traffic to the Telstra website on the afternoon of 23/2/2012:

[Image: Tracert from iiNet to Telstra]

You can do this for yourself on your Windows PC – open “Run” and type in ‘cmd’ without quotes (or just type that into the Search bar in the Start menu). From the Command Prompt, you can then type ‘tracert’ (again without the quotes) and the name of the website you want to check the path to. What Tracert will do is check each hop from your PC to the destination and tell you, where possible, where each hop is. Using the tracert above, we can see that the request to http://www.telstra.com starts on the iiNet network (being on an iiNet DSLAM), and gets sent across the Pacific to California, before being sent back to the Telstra network on another Pacific link. Remember that iiNet had manually changed how its traffic was routed so as to completely avoid the Telstra/Bigpond network.
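If you would rather script this than type it by hand, something like the following Python sketch should work. It assumes the `tracert` command on Windows and the `traceroute` command on other systems (which may need installing first); the function names `trace_command` and `run_trace` are just ones I have made up for this example.

```python
import platform
import subprocess


def trace_command(host, system=None):
    """Build the trace command for the given platform (defaults to the current one)."""
    system = system or platform.system()
    if system == "Windows":
        return ["tracert", host]
    # Assumption: other platforms provide a 'traceroute' binary on the PATH.
    return ["traceroute", host]


def run_trace(host):
    """Run the trace and return its raw text output."""
    result = subprocess.run(trace_command(host), capture_output=True, text=True)
    return result.stdout
```

Calling `print(run_trace("www.telstra.com"))` would then print the same hop-by-hop listing you get from the Command Prompt.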

So why did Telstra’s issue with routing cause problems with so many other ISPs? Naturally, the bad routing settings created a backlog that caused congestion on the Telstra network, so if your data was routed through a Telstra network, you would have congestion issues. On top of that, if your data was routed to use Telstra’s international link (via the router known as AS1221), then it had no hope of reaching its destination, leaving you floundering.

There remain a few unanswered questions, most of which probably will never be answered. Firstly, why on earth were Telstra’s BGP routers accepting information from Dodo on such a vast scale? BGP is very much a trust protocol; however, you would think that Telstra would have some sort of filtering to prevent this from happening. Secondly, how did a router on Dodo’s network broadcast such a drastic routing message?

Here’s hoping both parties learn a lesson from this. Network engineering is a tough gig, and while these incidents do happen, it’s been a while since something on this scale happened. Given our ‘connected society’, and reliance on services that are generally hosted in the US, these sorts of issues are very quickly found by the general public and often not quick to fix. Here’s hoping the ‘whoops’ that afflicted AS1221 is a singular occurrence.

Comments
4 Responses to “Telstra and the ‘whoops’ of AS1221”
  1. PJ Hunt says:

    I work for a multinational Pharma company, and we plan for this kind of thing every day. Now, if Telstra does not plan for it. What hope do we have???

    🙂

    by the way, Hello World.

  2. Possibly helpful says:

    A few technical corrections.

    AS1221 is Telstra. It is not a specific router. In the simplest terms, AS1221 is configured on all of Telstra’s internet-facing routers. (There may be more complicated AS structures within Telstra, but the outside world should just see them as AS1221.)

    And we already know how Dodo sent such a drastic routing update. Dodo is multi-homed to Optus as well as Telstra, so it took the internet routes it learned from Optus and handed them on to Telstra. During the time of the outage ‘internet’ routes were seen inside the network with AS7474 (Optus) in their path. See http://lists.ausnog.net/pipermail/ausnog/2012-February/012191.html

    You can get a better understanding of this stuff by looking at http://bgp.he.net/AS38285
    AS 38285 is Dodo’s AS number, and it shows the peering involved.

    What they did is more akin to bad manners than being technically wrong. They offered themselves up as a transit between Optus and Telstra. BGP is designed to allow for these kinds of relationships. This might have caused their systems to redline or fall over, depending on their capacity, or it might have racked up a massive traffic bill depending on their peering arrangement.

    Telstra on the other hand as a higher Tier carrier should have had security in place to prevent such behaviour. Your internet paths should be deterministic and not left to good faith. BGP is not a ‘trust’ protocol, it is a policy based protocol, and it has many tools to enforce different kinds of policies. The choice to have no secure policy on the peering with Dodo isn’t inherent in BGP, it was either a conscious decision or an oversight.
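As a rough illustration of the kind of policy being described, here is a minimal Python sketch of a per-customer prefix filter. It is not how Telstra (or any real router) is configured; the function name `filter_announcements` and the prefixes, which come from reserved documentation ranges, are assumptions for the example. The idea is simply that a customer session only accepts announcements that fall inside the address space that customer is known to hold.

```python
import ipaddress


def filter_announcements(allowed, announced):
    """Split announced prefixes into (accepted, rejected) against an allow-list."""
    allowed_nets = [ipaddress.ip_network(p) for p in allowed]
    accepted, rejected = [], []
    for prefix in announced:
        net = ipaddress.ip_network(prefix)
        # Accept only prefixes that sit inside the customer's registered space.
        if any(net.subnet_of(a) for a in allowed_nets):
            accepted.append(prefix)
        else:
            rejected.append(prefix)
    return accepted, rejected


# Hypothetical customer that is only registered to hold 203.0.113.0/24.
accepted, rejected = filter_announcements(
    allowed=["203.0.113.0/24"],
    announced=["203.0.113.0/25", "198.51.100.0/24", "0.0.0.0/0"],
)
print("accepted:", accepted)
print("rejected:", rejected)
```

With a filter like this on the session, a customer handing over the rest of the internet would have almost everything rejected at the door.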

    Also your data may always have been reaching its destination. Dodo offered a transit path to the internet via Optus. Depending on how Dodo’s routing is configured, traffic from Telstra may have been able to pass through Dodo, Optus and out to the internet at large.

    The issue will have come with return traffic. Telstra does have some forms of security between its internal networks, and it appears that some of those links may have shut down rather than let the routes compromise their performance. This would have stopped any return traffic from transiting those links.

    With the internal routing issues, Telstra’s BGP links to Optus also had issues and went down multiple times (up to 175). This caused Optus to invoke route dampening, which is designed to prevent misbehaving peers from affecting your network. This meant that Optus also blocked any paths back into Telstra. See http://lists.ausnog.net/pipermail/ausnog/2012-February/012206.html
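For anyone curious how route dampening works in principle, here is a small Python sketch. It assumes the usual scheme of a penalty that is added per flap and decays exponentially over time; the class name `FlapDamper` and all of the numbers (penalty, thresholds, half-life) are illustrative only and are not Optus’s or any vendor’s actual settings.

```python
class FlapDamper:
    """Toy flap-dampening state for one route; times are in seconds."""

    def __init__(self, penalty_per_flap=1000.0, suppress_at=2000.0,
                 reuse_at=750.0, half_life=900.0):
        self.penalty_per_flap = penalty_per_flap
        self.suppress_at = suppress_at
        self.reuse_at = reuse_at
        self.half_life = half_life
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def _decay(self, now):
        # The penalty halves every half_life seconds.
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / self.half_life)
        self.last_update = now

    def flap(self, now):
        """Record one flap (route withdrawn and re-announced) at time now."""
        self._decay(now)
        self.penalty += self.penalty_per_flap
        if self.penalty >= self.suppress_at:
            self.suppressed = True

    def is_suppressed(self, now):
        """Report whether the route is still being ignored at time now."""
        self._decay(now)
        if self.suppressed and self.penalty < self.reuse_at:
            self.suppressed = False
        return self.suppressed
```

A route that flaps a few times in quick succession gets suppressed, and it only becomes usable again once it has stayed quiet long enough for the penalty to decay below the reuse threshold. That is why paths can stay blocked for a while after the underlying fault is fixed.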

    And through all this Dodo was advertising the world to Telstra, but it wasn’t advertising Telstra to the world.

  3. Rob says:

    Telstra have never filtered inbound BGP announcements, and I’ve worked for various ISPs with peering to Telstra for many years

  4. Tom S says:

    In response to the article:

    “… if your data was routed to use Telstra’s international link (via the router known as AS1221) …”

    There is no “router” called AS1221. AS1221, or more specifically 1221, is Telstra’s AS (Autonomous System) number. The AS number is what all routes that originate from Telstra’s network are “tagged” with. Each time a route traverses another network (each of which has its own unique AS number) as it is propagated throughout the Internet, the tags of those other networks are added on as well to form what is known as the “AS path”. This allows you to determine how “far” a route has travelled across the Internet (AS path length), and whether or not you are also receiving the same route via a different, potentially shorter path.

    Once you know the path a route has taken, you can filter routes based on this, rejecting or preferring routes that have been through a particular network. It’s all up to you and your traffic engineering policies and how you want to send data out of your network – one of BGP’s benefits.
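The AS path idea can be sketched in a few lines of Python. This is only an illustration: the helper names (`propagate`, `passes_through`, `shortest`) are made up, AS 64500 is a documentation-range number standing in for some overseas network, and 7474 and 38285 are the Optus and Dodo AS numbers mentioned in the comments above.

```python
def propagate(as_path, local_as):
    """Return the path as advertised onward by local_as (its own number prepended)."""
    return [local_as] + as_path


def passes_through(as_path, asn):
    """True if the route has travelled through the given AS."""
    return asn in as_path


def shortest(paths):
    """Pick the path with the fewest AS hops."""
    return min(paths, key=len)


origin = [64500]                        # route as originated by the overseas network
via_optus = propagate(origin, 7474)     # as advertised by Optus
leaked = propagate(via_optus, 38285)    # as re-advertised by Dodo

print(leaked)
print(passes_through(leaked, 7474))
print(shortest([leaked, via_optus]))
```

A filter on a customer session could, for example, reject any route for which `passes_through(path, 7474)` is true, since a customer has no business offering paths that run through another large carrier.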

    In response to “Possibly helpful”:

    The consensus among professional network engineers is that Telstra’s connections to the outside world were severed following a breach of a “max-prefix” setting on its upstream BGP sessions – not because of some sort of security mechanism to protect performance (I’m not sure where that comes from…)

    Telstra prefers routes from customers, possibly for billing reasons. By preferring routes via their customers they can push more traffic to customers, and thus be able to bill more usage, or charge for higher capacity links.
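A rough Python sketch of what “preferring routes from customers” means in practice is below. It assumes the standard BGP behaviour that a higher local preference is compared before AS path length; the `Route` class, the `LOCAL_PREF` values and the AS numbers in the 64496 to 64511 documentation range are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Route:
    prefix: str
    learned_from: str  # "customer", "peer" or "upstream"
    as_path: list


# Hypothetical policy: customer routes always beat peer and upstream routes.
LOCAL_PREF = {"customer": 300, "peer": 200, "upstream": 100}


def best_route(routes):
    """Highest local preference wins; AS path length only breaks ties."""
    return max(routes, key=lambda r: (LOCAL_PREF[r.learned_from], -len(r.as_path)))


upstream = Route("198.51.100.0/24", "upstream", [64496, 64500])
customer = Route("198.51.100.0/24", "customer", [38285, 7474, 64497, 64500])

print(best_route([upstream, customer]).learned_from)
```

Even though the customer path is twice as long, it wins, which is how a leaked table from a small customer can pull in traffic that would otherwise have gone straight out via an upstream.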

    By trusting that their customers are sending routes to them legitimately, and accepting them freely, Telstra likely received the full routing table from Dodo and tried to push this onto its upstream carriers, such as Reach.

    Telstra’s upstreams were probably not configured to expect such a large routing table, and once the max-prefix limit is breached, they shut down their BGP sessions. And just like that, Telstra is cut off from the rest of the world.
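The max-prefix behaviour can be modelled very simply. The Python sketch below is a toy, not any vendor’s implementation: the class name `BgpSession`, the limit of 3 and the 10.0.x.0/24 prefixes are all made up for illustration. It shows a session being torn down, and everything learned over it discarded, as soon as the peer sends more prefixes than expected.

```python
class BgpSession:
    """Toy model of one BGP session with a max-prefix limit."""

    def __init__(self, peer, max_prefixes):
        self.peer = peer
        self.max_prefixes = max_prefixes
        self.prefixes = set()
        self.established = True

    def receive(self, prefix):
        """Accept a prefix; tear the session down if the limit is exceeded."""
        if not self.established:
            return False
        self.prefixes.add(prefix)
        if len(self.prefixes) > self.max_prefixes:
            # Limit breached: drop the session and everything learned over it.
            self.established = False
            self.prefixes.clear()
            return False
        return True


session = BgpSession(peer="hypothetical-customer", max_prefixes=3)
results = [session.receive(f"10.0.{i}.0/24") for i in range(5)]
print(results, session.established)
```

Scale the limit up to a few thousand prefixes and the announcement up to a full routing table, and you get the “cut off from the rest of the world” outcome described above.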

    Performance-wise, it would likely only have been outbound traffic that suffered, since Telstra’s routers believed the best path to the outside world was via Dodo, who I am sure have far less capacity than Telstra does. Had Telstra’s BGP sessions with their upstreams not been cut off, performance inbound to Telstra’s network would likely have been unaffected. The only way inbound traffic to Telstra could be affected is if Dodo passed all of Telstra’s routes to its upstreams, and they accepted them and passed them on, creating a transitable path through Dodo. This scenario is highly doubtful though…

    [Keating: Thanks for the extra info, both yourself and the other commenters. I’ll admit my knowledge of BGP and ISP routing is still very sketchy, and there seems to be more information and speculation about what happened now. Thanks for the input!]
