A Routing Engine
lwIP-Based Pseudocode
The famed BSD TCP/IP stack has been around long enough that its merits are known to nearly all. It is respected enough that it is used, in modified form, by many OSes, including Windows. It is fast, robust, and reliable. It has also demonstrated the ability to scale as both CPUs and connection counts grow. Unfortunately, its design relies on a simple premise: in the end, there is only one path to a given destination. This restriction does not make sense in a world where bandwidth is economically available from multiple sources right up to the end user's edge. What is needed now is a rethink: a new base routing infrastructure that recognizes the huge leaps in CPU performance and the rapid decline in the cost of memory, and plans from the start to capitalize on them for better routing on a multi-homed device.
The current routing implementation is based on a single route table with strict rules on its structure and contents. The table is a list of destinations (each a network and mask) and an associated next hop, or gateway. The main rule controlling its behavior is that there may not be multiple entries for a given destination. When a routing decision needs to be made, the table is consulted: a search is made for the most specific destination entry that the target matches, and that entry's next hop is used.
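As a sketch of that behavior, the lookup can be expressed as a scan for the longest matching mask. The struct layout and host-order `uint32_t` addresses here are illustrative assumptions, not the actual BSD radix-tree implementation:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical single-path table entry: network, mask, next hop
 * (IPv4 addresses as host-order 32-bit values). */
struct rt_entry {
    uint32_t net;
    uint32_t mask;
    uint32_t next_hop;
};

/* Return the next hop of the most specific matching entry, or 0 if none.
 * "Most specific" means the longest mask covering the destination; since
 * mask bits are contiguous from the MSB, a longer mask compares larger. */
uint32_t rt_lookup(const struct rt_entry *tbl, int n, uint32_t dst)
{
    uint32_t best_hop = 0, best_mask = 0;
    int found = 0;
    for (int i = 0; i < n; i++) {
        if ((dst & tbl[i].mask) == tbl[i].net &&
            (!found || tbl[i].mask > best_mask)) {
            best_mask = tbl[i].mask;
            best_hop = tbl[i].next_hop;
            found = 1;
        }
    }
    return best_hop;
}
```

A default route (network and mask both zero) matches every destination but loses to any more specific entry, which is exactly the "most specific wins" rule described above.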
The rules on the current routing table prevent the system from ever knowing there is more than one way to reach a remote destination. This means you cannot have two default routes, period. Many OSes have set up simple hacks to sidestep this; most use a pseudo-interface to hold the default route. Traffic directed to this pseudo-interface is then transparently directed to two or more real interfaces, usually with a simple round-robin load-balancing implementation. While this can achieve decent results with heavy tuning, it does not expose the core routing engine to a fuller routing table, and it is generally quite rigid in what it can accomplish. Failure-mode performance is usually catastrophic as well. In the end you gain very little, at the cost of complexity and unpredictable behavior under failure. This is not the ideal.
To improve upon the existing solution, a few goals need to be met. The first is reliable utilization of multiple pipes. Simply put, if I have access to both DSL and cable at home, my machine should be able to take advantage of both, maximizing my throughput to external sites while minimizing congestion on both links by spreading load across them as much as possible. To do this, the new engine will need to be aware of interface utilization at the least, and ideally should be able to import routing tables from external routing processes (OSPF, BGP, etc.) as directly as possible. While performing its job, the routing engine must follow normally accepted behavior wherever possible: more specific routes take precedence over less specific routes, and so on. The new engine also needs to be consistent and predictable in its routing choices. Given a view of the routing engine's state, one should be able to determine in their head what route(s) will be used for a given path, without resorting to calculators, dice, and cross-reference tables. Finally, the engine must route packets such that, whatever definition of a 'flow' is settled upon, ALL packets in that flow go out the same path.
Our theoretical testbed is a computer acting as a workstation. It is an end node on the Internet: no traffic is routed through the machine, only to or from it. The simple initial test places two identical links on the machine, each with a 'default' route associated with it. With the standard BSD schema, when connections are opened to various outside hosts, the system will put all traffic that isn't to a local subnet on one of the links. The preferred behavior we are looking for is a 50% spread of traffic not destined for a local subnet. Should we open 4 connections to different FTP servers at the same time and start downloading, 2 of those connections should be using one pipe, the other 2 on the other pipe.
This test should be scalable, such that N connections results in even traffic distribution over N pipes.
The failure mode test is to disconnect a pipe. All traffic that was being routed over it should immediately be redistributed evenly over the remaining links.
The second test scenario has our workstation again hooked up to 2 pipes, but this time one has less bandwidth available. As before, the standard BSD schema will use one pipe for all outbound traffic not destined for a locally connected subnet, and will keep trying to use a saturated link for new connections even when an unutilized link is available. This test is passed by a routing implementation that behaves as in the first test, opening each new connection round-robin style across the links, until the bandwidth-constrained link reaches saturation. At that point, all new connections should open only on the unconstrained link. As old connections terminate and bandwidth utilization drops on the constrained link, new connections can be opened on it until it hits saturation again.
To demonstrate proper scalability, with N links of different bandwidths hooked up to the machine, each link should at any given time show roughly the same utilization, as connection granularity allows. For example, if you have a 300kbps link, a 600kbps link, and a 1200kbps link, and you have 8 connections each pushing roughly the same amount of bandwidth, only 1 should be on the first link, 2 on the second, and the rest on the 1200kbps link.
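The expected spread can be checked with a small greedy model: each new, equal-bandwidth connection lands on the link whose proportional load (connections per unit of capacity) would stay lowest. This is only an illustration of the arithmetic; the real engine would work from measured utilization, not connection counts:

```c
#include <assert.h>

/* Greedy sketch: place each new, equal-bandwidth connection on the link
 * whose proportional load would remain lowest after the placement.
 * Cross-multiplication avoids floating point:
 * (c_j+1)/cap_j < (c_best+1)/cap_best  <=>
 * (c_j+1)*cap_best < (c_best+1)*cap_j. Ties keep the lower-numbered link. */
void distribute(const int *cap_kbps, int nlinks, int nconns, int *conns_out)
{
    for (int i = 0; i < nlinks; i++)
        conns_out[i] = 0;
    for (int c = 0; c < nconns; c++) {
        int best = 0;
        for (int j = 1; j < nlinks; j++) {
            if ((conns_out[j] + 1) * cap_kbps[best] <
                (conns_out[best] + 1) * cap_kbps[j])
                best = j;
        }
        conns_out[best]++;
    }
}
```

For capacities of 300, 600, and 1200kbps and 8 connections, the model yields the 1/2/5 split described above.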
The failure mode test is the same as above: termination of a link should result in even utilization of the remaining links.
Both of these tests assume that each link injects only two routes: an entry for its local subnet, and a default. No external routing protocol (RIP, OSPF, etc.) is used. Passing these tests demonstrates a routing engine that a home user can use to take advantage of different media, say DSL and cable, to achieve multi-homed outbound reliability and bandwidth aggregation, without pushing their ISPs for specific support.
My proposed solution:
To address the problems above, and come up with a solution that will behave as needed to pass the tests, I have decided to emulate commercial routers and their routing engines. The main change from how the BSDs do things now is the expansion of the primary routing table, working from the theory that the more information the network stack has about the routes available, the better the decisions it makes will be.
So, the primary routing table will be expanded to include the following chunks of data for each routing entry:
* Destination Network/Mask
* Next Hop Address
* Metric
* Expiry Counter
* State flags
* Source
Overlapping entries are not only allowed but encouraged with this system. For a simple network, this could mean two default routes, each with a 0.0.0.0/0 destination network but a differing next-hop address. Most of the data is similar to what exists in the current routing table, but there are a few new additions. The metric field is an arbitrary integer value where, comparatively speaking, a lower number is better than a higher number. Most external routing protocols attach a metric of this sort to every route; unfortunately, those metrics usually are not directly comparable, so provisions will be needed to let a user define what functions to perform on metrics before they are injected into the route table. The Expiry Counter allows routes to be retired from the routing table after a set period of time; for entries that are not to expire, I'm defining -1 as the marker value for this field. The source field simply records the source of that particular routing entry, be it an interface definition or an external routing process like iBGP.
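A sketch of what such an expanded entry might look like in C. The field widths, the flag layout, and the `rt_source` enum values are assumptions for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Possible sources of a routing entry; the set of values is an assumption. */
enum rt_source { RT_SRC_IFACE, RT_SRC_STATIC, RT_SRC_OSPF, RT_SRC_BGP };

#define RT_EXPIRE_NEVER (-1)   /* marker value: entry never expires */

struct route_entry {
    uint32_t dest_net;     /* destination network */
    uint32_t dest_mask;    /* network mask */
    uint32_t next_hop;     /* gateway address */
    int32_t  metric;       /* lower is better; external metrics normalized on import */
    int32_t  expiry;       /* time until retirement, or RT_EXPIRE_NEVER */
    uint16_t flags;        /* state flags (up/down, congested, ...) */
    enum rt_source source; /* who injected this entry */
};
```

Because duplicate destinations are allowed, two `route_entry` values can share the same `dest_net`/`dest_mask` while differing in `next_hop`, which is exactly the two-default-routes case above.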
The rules for choosing a route from the above table are similar to those used on the existing BSD route table. In order, the rules are:
1) Local Interface Check
2) Cache Check (I'll explain below)
3) Build a list of all candidate routes from the master table, matched on the destination network field. ALL routes whose destination network includes the target host are included.
4) If the list of candidates is only one route entry, jump to step 9.
5) Perform a congestion check (explained below); all failing routes are removed from the candidate list. If ALL routes fail, none are removed, and we proceed to step 6. If only one route is left, we proceed to step 9.
--- For steps 6 through 8, once the candidate list is reduced to one route, we jump to step 9 and skip the remaining tests. For any test that results in a tie, all the winning entries stay, the rest are dropped from the candidate pool, and we move on to the next test.
6) Compare scopes. The most specific route entry wins.
7) Compare metrics, the lowest wins.
8) Compare expiry counters, the highest wins.
9) Any candidate routes that make it this far are considered equal. The default will be to choose one candidate at random, but we can just as easily support multiple, user-configurable methods for making the determination at this point.
10) Once you reach this step, there is only one candidate route left. The packet is routed using the next hop specified, and a cache entry (explained below) is generated.
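The selection steps above can be sketched as follows, with steps 1-2 (the local interface and cache checks) assumed to have already missed. The fixed 64-candidate cap and the decision to treat the -1 "never expires" marker as the highest possible counter are assumptions:

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

#define F_CONGESTED 0x1

struct route {
    uint32_t net, mask, next_hop;
    int32_t  metric;   /* step 7: lower wins */
    int32_t  expiry;   /* step 8: higher wins; -1 = never expires */
    uint16_t flags;
};

/* Keep only the candidates whose key is maximal; used for steps 6-8. */
static void keep_best(const struct route *tbl, int *cand, int *nc,
                      long long (*key)(const struct route *))
{
    long long best = key(&tbl[cand[0]]);
    for (int i = 1; i < *nc; i++)
        if (key(&tbl[cand[i]]) > best)
            best = key(&tbl[cand[i]]);
    int nk = 0;
    for (int i = 0; i < *nc; i++)
        if (key(&tbl[cand[i]]) == best)
            cand[nk++] = cand[i];
    *nc = nk;
}

static long long key_mask(const struct route *r)   { return r->mask; }              /* longest mask */
static long long key_metric(const struct route *r) { return -(long long)r->metric; } /* lowest metric */
static long long key_expiry(const struct route *r)                                   /* highest counter */
{
    return r->expiry == -1 ? LLONG_MAX : r->expiry;  /* 'never expires' treated as highest (assumption) */
}

/* Steps 3-9 against a table of n routes; returns the chosen index or -1.
 * The nc > 1 guards implement the early exits of steps 4 through 8. */
int select_route(const struct route *tbl, int n, uint32_t dst)
{
    int cand[64], nc = 0;

    for (int i = 0; i < n && nc < 64; i++)          /* step 3: all covering routes */
        if ((dst & tbl[i].mask) == tbl[i].net)
            cand[nc++] = i;
    if (nc == 0)
        return -1;

    if (nc > 1) {                                   /* step 5: drop congested routes... */
        int keep[64], nk = 0;
        for (int i = 0; i < nc; i++)
            if (!(tbl[cand[i]].flags & F_CONGESTED))
                keep[nk++] = cand[i];
        if (nk > 0) {                               /* ...unless every route is congested */
            for (int i = 0; i < nk; i++)
                cand[i] = keep[i];
            nc = nk;
        }
    }
    if (nc > 1) keep_best(tbl, cand, &nc, key_mask);    /* step 6 */
    if (nc > 1) keep_best(tbl, cand, &nc, key_metric);  /* step 7 */
    if (nc > 1) keep_best(tbl, cand, &nc, key_expiry);  /* step 8 */

    return cand[nc > 1 ? rand() % nc : 0];              /* step 9: pick among equals */
}
```

With two default routes and a more specific /24 in the table, traffic inside the /24 always takes the specific route, while other traffic falls through to the metric comparison, which is the predictable behavior the goals section asks for.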
If we used the above as is, we'd end up doing per-packet load balancing on any traffic with multiple matching routes. Sometimes this is desirable, if you know both links are identical and go to the same place; many times it is very undesirable. Full route tests for each packet can be expensive in CPU time, and per-packet balancing can result in packets arriving at the remote destination out of order. To eliminate this issue, we need per-flow load balancing. For simplicity and speed, I've chosen to define a flow as all packets going to a specific destination. This information will be tracked in a route cache, or more correctly, an established-flows table.
The cache table will contain the following information:
* Destination host
* Next Hop
* Expiry Counter
The table is built dynamically by the decisions made against the master routing table. (There is also precedent for the school of thought that says to pre-compute a cache table beforehand and only tweak it whenever a change to the master table occurs, a la Cisco CEF.) A cache check simply means checking the cache for a matching destination and, if one is present, using the next hop listed. The expiry counter is there to keep stale, unused entries from littering the table: each time a cache entry gets hit, its expiry counter is reset, and when the counter hits zero, the entry is removed. Cache entries are also purged as appropriate when a change to the master table occurs. If an interface drops out of the master table, we cannot continue to route packets over it; yet until the route cache is cleaned up, that is exactly what will happen.
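A minimal sketch of such a flow cache, assuming a fixed-size table, an invented `CACHE_TTL`, and a periodic timer that calls `cache_tick()`:

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_SIZE 64
#define CACHE_TTL  30          /* ticks a flow survives without traffic (assumed) */

struct flow_entry {
    uint32_t dst;       /* destination host */
    uint32_t next_hop;
    int      expiry;    /* 0 = slot free */
};

/* Look up dst; on a hit, refresh the entry and return its next hop (0 = miss). */
uint32_t cache_lookup(struct flow_entry *c, uint32_t dst)
{
    for (int i = 0; i < CACHE_SIZE; i++)
        if (c[i].expiry > 0 && c[i].dst == dst) {
            c[i].expiry = CACHE_TTL;    /* hit: reset the counter */
            return c[i].next_hop;
        }
    return 0;
}

/* Record the decision made against the master table (step 10). */
void cache_insert(struct flow_entry *c, uint32_t dst, uint32_t next_hop)
{
    for (int i = 0; i < CACHE_SIZE; i++)
        if (c[i].expiry == 0) {
            c[i] = (struct flow_entry){ dst, next_hop, CACHE_TTL };
            return;
        }
}

/* Periodic timer: age out idle flows. */
void cache_tick(struct flow_entry *c)
{
    for (int i = 0; i < CACHE_SIZE; i++)
        if (c[i].expiry > 0)
            c[i].expiry--;
}
```

Purging on master-table changes would be a separate walk over the array that clears any entry whose next hop is no longer valid; it is omitted here for brevity.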
The congestion check is based on user configuration and is what allows this system to work for a multi-homed end user without an external routing process. By putting bandwidth numbers into place, the route engine can determine when an interface is 'congested' and act accordingly.
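One possible shape for that check, assuming the user supplies a bandwidth figure per interface and a 90% saturation threshold (both numbers are illustrative):

```c
#include <assert.h>
#include <stdint.h>

struct iface_cfg {
    uint32_t configured_kbps;   /* bandwidth the user says this pipe has */
    uint32_t measured_kbps;     /* current outbound utilization estimate */
};

#define CONGESTION_PCT 90       /* assumed threshold; would be user-tunable */

/* An interface counts as congested once measured utilization crosses the
 * configured fraction of its declared bandwidth. Integer math avoids FP. */
int iface_congested(const struct iface_cfg *ifc)
{
    return ifc->measured_kbps * 100 >= ifc->configured_kbps * CONGESTION_PCT;
}
```

This result would feed step 5 of the route selection above, steering new flows away from a saturated DSL line toward an idle cable line, which is what the second test scenario requires.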
This system is easily upgraded as well. The flow definition could be expanded to destination and port, giving web servers finer granularity in their load balancing. The cache table could be expanded to include source information, with entries generated as incoming packets arrive, to prevent asymmetric routing. The master table could be expanded to include source network/mask info, to perform custom source-based routing for systems operating as a midpoint router.