Azure Hub And Spoke 2.0

I have recently had a couple of conversations that made me reconsider the way we traditionally implement the hub and spoke Virtual Network design in Azure, which has some limitations. The idea is to introduce a relatively simple but powerful modification to the design that achieves these objectives:

  • Granular control of the prefixes advertised via ExpressRoute or BGP site-to-site IPsec connections. Today all the spoke VNet prefixes are advertised to on-premises and to other regions; instead, the goal is to advertise only a summary route per region
  • A more flexible architecture where spokes can be peered to multiple hubs for redundancy
  • A simpler configuration, removing as much as possible the need to override routes in Azure subnets

TL;DR: The main modification to the traditional design is the absence of gateway transit in the VNet peerings between the hub and the spoke VNets (the UseRemoteGateways / AllowGatewayTransit settings), together with advertising specific prefixes to ExpressRoute and BGP S2S VPNs via an Azure Route Server in each region.

When to use this design?

Before getting into the weeds, let’s first clarify to which network designs this pattern is applicable. Using the Azure Network architecture framework that I first published in this post, you initially start from the most flexible design, Virtual WAN with indirect spokes (design number 1 in the diagram below). To simplify the architecture, you can optionally collapse your tiers to either design 2 (Virtual WAN, typically when you don’t need sophisticated functionality in the hub) or design 3 (hub and spoke, for example when you need maximum flexibility to bring network services into the hub):

A deep discussion of when to use each design is out of scope here. In this post we will focus on design number 3, often called “traditional hub and spoke”.

The topology

I have used the Megaport Cloud Router to provision my ExpressRoute connections, and Google Cloud to simulate my on-prem environment. The initial test bed looks like this (you can see the exact Azure CLI commands I used to deploy the topology in https://github.com/erjosito/azcli/blob/master/routeserver_2hubs.azcli):

Traditional behavior

In the traditional hub and spoke design, the VNet peerings are configured with gateway transit enabled (as explained in https://learn.microsoft.com/azure/virtual-network/virtual-network-peering-overview#gateways-and-on-premises-connectivity). Spoke VNets have route tables associated to their subnets pointing to the hub NVA (no NVA redundancy in the lab for simplicity; otherwise you would probably have them pointing to an internal Load Balancer). Here you can see the effective routes for a NIC in a spoke in region 1. The routes learnt by the VPN and ExpressRoute gateways are not shown, since the route table has gateway route propagation disabled:

❯ az network nic show-effective-route-table --ids $spoke11_vm_nic_id -o table
Source    State    Address Prefix    Next Hop Type      Next Hop IP
--------  -------  ----------------  -----------------  -------------
Default   Active   10.1.16.0/24      VnetLocal
Default   Active   10.1.0.0/20       VNetPeering
Default   Invalid  0.0.0.0/0         Internet
User      Active   0.0.0.0/0         VirtualAppliance   10.1.1.4
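
For reference, such a spoke route table could be created roughly like this (a minimal sketch assuming hypothetical route table, VNet and subnet names; 10.1.1.4 is the NVA in hub 1):

# Route table with gateway route propagation disabled
❯ az network route-table create -g $rg -n spoke11 --disable-bgp-route-propagation true
# A single default route pointing to the hub NVA
❯ az network route-table route create -g $rg --route-table-name spoke11 -n default \
      --address-prefix 0.0.0.0/0 --next-hop-type VirtualAppliance --next-hop-ip-address 10.1.1.4
# Associate the route table to the spoke subnet
❯ az network vnet subnet update -g $rg --vnet-name spoke11 -n vm --route-table spoke11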

The NVAs in the hub need to know how to reach everything: they need gateway route propagation enabled to receive the routes from the VPN/ExpressRoute gateways and from the Route Server (all these routes will be marked as sourced by a VirtualNetworkGateway). Below you can see the effective routes of the NVA in region 1:

❯ az network nic show-effective-route-table --ids $hub1_nva1_nic0_id -o table
Source                 State    Address Prefix    Next Hop Type          Next Hop IP
---------------------  -------  ----------------  ---------------------  -------------
Default                Active   10.1.0.0/20       VnetLocal
Default                Active   10.1.16.0/24      VNetPeering
Default                Active   10.1.17.0/24      VNetPeering
VirtualNetworkGateway  Active   10.251.0.0/16     VirtualNetworkGateway  10.1.3.4
VirtualNetworkGateway  Active   10.251.0.0/16     VirtualNetworkGateway  10.1.3.5
VirtualNetworkGateway  Active   10.2.0.0/16       VirtualNetworkGateway  10.1.1.4
VirtualNetworkGateway  Active   10.1.0.0/16       VirtualNetworkGateway  10.1.1.4
VirtualNetworkGateway  Active   10.2.16.0/24      VirtualNetworkGateway  10.2.146.34
VirtualNetworkGateway  Active   10.2.17.0/24      VirtualNetworkGateway  10.2.146.34
VirtualNetworkGateway  Active   10.4.2.0/24       VirtualNetworkGateway  10.2.146.34
Default                Active   0.0.0.0/0         Internet
Default                Active   10.0.0.0/8        None
Default                Active   100.64.0.0/10     None
Default                Active   172.16.0.0/12     None
Default                Active   25.48.0.0/12      None
Default                Active   25.4.0.0/14       None
Default                Active   198.18.0.0/15     None
Default                Active   157.59.0.0/16     None
Default                Active   192.168.0.0/16    None
Default                Active   25.33.0.0/16      None
Default                Active   40.109.0.0/16     None
Default                Active   104.147.0.0/16    None
Default                Active   104.146.0.0/17    None
Default                Active   40.108.0.0/17     None
Default                Active   23.103.0.0/18     None
Default                Active   25.41.0.0/20      None
Default                Active   20.35.252.0/22    None
Default                Active   10.2.0.0/20       VNetGlobalPeering

Remarks:

  • The routes from the spoke VNets in the remote region (10.2.16.0/24 and 10.2.17.0/24) are reflected by the ExpressRoute edge router (MSEE or Microsoft Enterprise Edge) and learned in hub 1.
  • You can see that there are some regional summaries advertised by the local NVA: 10.1.0.0/16 and 10.2.0.0/16. The local prefixes (10.1.0.0/16 in this example) are not a problem, because they are overridden by the more specific routes from the VNet peerings; the summaries for the remote regions (10.2.0.0/16) will have to be overridden with UDRs (or with some encapsulation between the NVAs in different regions, see this post for more details). We will see this overriding later in this post, and the sketch right after this list shows how the NVA can feed these summaries to the Route Server.
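
As a reference, the BGP adjacency between the NVA and the hub’s Route Server could be built along these lines (a minimal sketch assuming a hypothetical Route Server name hub1-rs; 65515 is the fixed Route Server ASN, and 10.1.0.4/10.1.0.5 are the Route Server instances of this lab, visible as iBGP peers in the gateway output below):

# Peer the hub Route Server with the NVA (ASN 65001, IP 10.1.1.4)
❯ az network routeserver peering create -g $rg --routeserver hub1-rs -n nva1 \
      --peer-asn 65001 --peer-ip 10.1.1.4

# On the NVA itself (assuming it runs FRR), advertise only the regional summaries.
# import-check is disabled so the /16 summaries can be originated without matching routes.
❯ sudo vtysh -c 'conf t' \
      -c 'router bgp 65001' \
      -c 'no bgp network import-check' \
      -c 'neighbor 10.1.0.4 remote-as 65515' -c 'neighbor 10.1.0.4 ebgp-multihop 2' \
      -c 'neighbor 10.1.0.5 remote-as 65515' -c 'neighbor 10.1.0.5 ebgp-multihop 2' \
      -c 'address-family ipv4 unicast' \
      -c 'network 10.1.0.0/16' -c 'network 10.2.0.0/16'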

We can check that the ExpressRoute gateway in region 1 sees the spoke routes from region 2 with the AS path 12076 12076 (12076 is the ASN associated to ExpressRoute MSEEs):

❯ az network vnet-gateway list-learned-routes -n $hub1_ergw_name -g $rg --query 'value[].{LocalAddress:localAddress, Peer:sourcePeer, Network:network, NextHop:nextHop, ASPath: asPath, Origin:origin, Weight:weight}' -o table
LocalAddress    Peer       Network            ASPath              Origin    Weight    NextHop
--------------  ---------  -----------------  ------------------  --------  --------  ---------
10.1.3.12       10.1.3.12  10.1.0.0/20                            Network   32768
10.1.3.12       10.1.3.12  10.1.16.0/24                           Network   32768
10.1.3.12       10.1.3.12  10.1.17.0/24                           Network   32768
10.1.3.12       10.1.0.4   10.251.0.0/16      65100               IBgp      32768     10.1.3.4
10.1.3.12       10.1.0.5   10.251.0.0/16      65100               IBgp      32768     10.1.3.4
10.1.3.12       10.1.0.4   10.1.0.0/16        65001               IBgp      32768     10.1.1.4
10.1.3.12       10.1.0.5   10.1.0.0/16        65001               IBgp      32768     10.1.1.4
10.1.3.12       10.1.0.4   10.2.0.0/16        65001-65002         IBgp      32768     10.1.1.4
10.1.3.12       10.1.0.5   10.2.0.0/16        65001-65002         IBgp      32768     10.1.1.4
10.1.3.12       10.1.3.6   169.254.168.16/30  12076-133937        EBgp      32779     10.1.3.6
10.1.3.12       10.1.3.7   169.254.168.16/30  12076-133937        EBgp      32779     10.1.3.7
10.1.3.12       10.1.3.6   169.254.168.20/30  12076-133937        EBgp      32779     10.1.3.6
10.1.3.12       10.1.3.7   169.254.168.20/30  12076-133937        EBgp      32779     10.1.3.7
10.1.3.12       10.1.3.6   10.2.0.0/20        12076-12076         EBgp      32779     10.1.3.6
10.1.3.12       10.1.3.7   10.2.0.0/20        12076-12076         EBgp      32779     10.1.3.7
10.1.3.12       10.1.3.6   10.2.16.0/24       12076-12076         EBgp      32779     10.1.3.6
10.1.3.12       10.1.3.7   10.2.16.0/24       12076-12076         EBgp      32779     10.1.3.7
10.1.3.12       10.1.3.6   10.2.17.0/24       12076-12076         EBgp      32779     10.1.3.6
10.1.3.12       10.1.3.7   10.2.17.0/24       12076-12076         EBgp      32779     10.1.3.7
10.1.3.12       10.1.3.6   169.254.68.112/29  12076-133937        EBgp      32779     10.1.3.6
10.1.3.12       10.1.3.7   169.254.68.112/29  12076-133937        EBgp      32779     10.1.3.7
10.1.3.12       10.1.3.6   10.4.2.0/24        12076-133937-16550  EBgp      32779     10.1.3.6
10.1.3.12       10.1.3.7   10.4.2.0/24        12076-133937-16550  EBgp      32779     10.1.3.7

As described earlier, I am deploying the ExpressRoute connections using Megaport’s Cloud Router (MCR). We can verify that the MCR sees both the /16 summary routes and the more specific /24 routes for the spokes. I am using this script to provision, troubleshoot and delete MCRs using Megaport’s REST API:

❯ $megaport_script_path -q -s=jomore-${hub1_er_pop} -a=bgp_routes | jq -r '.[] | {prefix,best,source,asPath} | join("\t")'
10.1.0.0/20     false   169.254.168.22  12076
10.1.0.0/20     true    169.254.168.18  12076
10.1.0.0/16     true    169.254.168.18  12076
10.1.16.0/24    false   169.254.168.22  12076
10.1.16.0/24    true    169.254.168.18  12076
10.1.17.0/24    false   169.254.168.22  12076
10.1.17.0/24    true    169.254.168.18  12076
10.2.0.0/20     false   169.254.168.22  12076
10.2.0.0/20     true    169.254.168.18  12076
10.2.0.0/16     true    169.254.168.18  12076
10.2.16.0/24    false   169.254.168.22  12076
10.2.16.0/24    true    169.254.168.18  12076
10.2.17.0/24    false   169.254.168.22  12076
10.2.17.0/24    true    169.254.168.18  12076
10.251.0.0/16   true    169.254.168.18  12076
169.254.68.112/29       true    0.0.0.0
169.254.168.16/30       true    0.0.0.0
169.254.168.20/30       true    0.0.0.0

Notice that all the /24 prefixes from the spoke VNets are there too, as expected. Finally, these are the routes that arrive at the on-premises network (simulated with a VPC in Google Cloud):

❯ gcloud compute routers get-status $router1_name --region=$region1 --format=json | jq -r '.result.bestRoutesForRouter[]|{destRange,routeType,nextHopIp} | join("\t")'
10.1.0.0/16             BGP     169.254.68.114
10.1.0.0/20             BGP     169.254.68.114
10.1.16.0/24            BGP     169.254.68.114
10.1.17.0/24            BGP     169.254.68.114
10.2.0.0/16             BGP     169.254.68.114
10.2.0.0/20             BGP     169.254.68.114
10.2.16.0/24            BGP     169.254.68.114
10.2.17.0/24            BGP     169.254.68.114
10.251.0.0/16           BGP     169.254.68.114
169.254.68.112/29       BGP     169.254.68.114
169.254.168.16/30       BGP     169.254.68.114
169.254.168.20/30       BGP     169.254.68.114

Again, the on-premises router sees all the individual spoke prefixes, which might not be a good idea if there are many spoke VNets in Azure. For example, in this specific case Google Cloud has a limit of 100 on the number of routes received (https://cloud.google.com/network-connectivity/docs/router/quotas#limits). Consequently, we would have a problem if we had 100 spoke VNets or more in Azure.

Disabling Gateway Transit

The next step is changing the peerings between the spoke and hub VNets to not use gateway transit, which amounts to disabling UseRemoteGateways on the spoke side and AllowGatewayTransit on the hub side of each peering, roughly like this (a sketch assuming hypothetical peering names):
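
# Spoke side: stop using the hub's remote gateways
❯ az network vnet peering update -g $rg --vnet-name spoke11 -n spoke11-to-hub1 \
      --set useRemoteGateways=false
# Hub side: stop offering gateway transit to the spoke
❯ az network vnet peering update -g $rg --vnet-name hub1 -n hub1-to-spoke11 \
      --set allowGatewayTransit=false

This is what the effective routes in the NVA in hub 1 now look like: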

❯ az network nic show-effective-route-table --ids $hub1_nva1_nic0_id -o table
Source                 State    Address Prefix    Next Hop Type          Next Hop IP
---------------------  -------  ----------------  ---------------------  -------------
Default                Active   10.1.0.0/20       VnetLocal
Default                Active   10.1.16.0/24      VNetPeering
Default                Active   10.1.17.0/24      VNetPeering
VirtualNetworkGateway  Active   10.251.0.0/16     VirtualNetworkGateway  10.1.3.4
VirtualNetworkGateway  Active   10.251.0.0/16     VirtualNetworkGateway  10.1.3.5
VirtualNetworkGateway  Active   10.1.0.0/16       VirtualNetworkGateway  10.1.1.4
VirtualNetworkGateway  Active   10.4.2.0/24       VirtualNetworkGateway  10.2.146.34
VirtualNetworkGateway  Invalid  10.2.0.0/16       VirtualNetworkGateway  10.1.1.4
Default                Active   0.0.0.0/0         Internet
Default                Active   10.0.0.0/8        None
Default                Active   100.64.0.0/10     None
Default                Active   172.16.0.0/12     None
Default                Active   25.48.0.0/12      None
Default                Active   25.4.0.0/14       None
Default                Active   198.18.0.0/15     None
Default                Active   157.59.0.0/16     None
Default                Active   192.168.0.0/16    None
Default                Active   25.33.0.0/16      None
Default                Active   40.109.0.0/16     None
Default                Active   104.147.0.0/16    None
Default                Active   104.146.0.0/17    None
Default                Active   40.108.0.0/17     None
Default                Active   23.103.0.0/18     None
Default                Active   25.41.0.0/20      None
Default                Active   20.35.252.0/22    None
User                   Active   10.2.0.0/16       VirtualAppliance       10.2.1.4
Default                Active   10.2.0.0/20       VNetGlobalPeering

Note that a static route has been added as well to override the summary prefix for the spoke VNets in region 2 (10.2.0.0/16); that is why the previous route for 10.2.0.0/16 now shows as Invalid. This summary route works now, because the more specific /24 prefixes are not there anymore (before, we would have had to override every. single. spoke.).
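
For reference, that override could be configured along these lines (a sketch assuming a hypothetical route table hub1-nva associated to the NVA subnet; 10.2.1.4 is the NVA in hub 2):

# Send traffic for the region 2 summary to the NVA in hub 2
❯ az network route-table route create -g $rg --route-table-name hub1-nva -n region2 \
      --address-prefix 10.2.0.0/16 --next-hop-type VirtualAppliance --next-hop-ip-address 10.2.1.4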

Now we can have a look at the onprem routes:

❯ gcloud compute routers get-status $router1_name --region=$region1 --format=json | jq -r '.result.bestRoutesForRouter[]|{destRange,routeType,nextHopIp} | join("\t")'
10.1.0.0/16     BGP     169.254.68.114
10.1.0.0/20     BGP     169.254.68.114
10.2.0.0/16     BGP     169.254.68.114
10.2.0.0/20     BGP     169.254.68.114
10.251.0.0/16   BGP     169.254.68.114
169.254.68.112/29       BGP     169.254.68.114
169.254.168.16/30       BGP     169.254.68.114
169.254.168.20/30       BGP     169.254.68.114

Here again the spoke prefixes are gone, so we can be confident that adding new spoke VNets in Azure will not compromise the scale of our onprem routers (Google virtual routers in the example).

Avoiding the need for UDRs in the spokes

As described at the beginning of the article, the basic design requires a route table associated to the subnets of every spoke VNet to send traffic to the NVA(s) in the same region, containing a single 0.0.0.0/0 route and with gateway route propagation disabled (as sketched earlier in this post).

An additional Route Server could be introduced in every region if no UDRs are desired, following the pattern in Different Route Servers to advertise routes to VNGs and VNets (Azure docs). This second Route Server would inject the 0.0.0.0/0 route into every spoke VNet.
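
A rough sketch of that variant, assuming hypothetical names and variables for the second Route Server (the NVA would advertise only 0.0.0.0/0 over this additional BGP session):

# Deploy a second, spoke-facing Route Server and peer it with the NVA
❯ az network routeserver create -g $rg -n hub1-rs-spokes \
      --hosted-subnet $rs2_subnet_id --public-ip-address $rs2_pip
❯ az network routeserver peering create -g $rg --routeserver hub1-rs-spokes \
      -n nva1 --peer-asn 65001 --peer-ip 10.1.1.4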

Dual-homing spokes to multiple hubs

Since the VNet peerings between hubs and spokes do not require the gateway transit settings anymore, spokes can be peered to more than one hub VNet. This design results in higher resiliency, since a given spoke can be reached via more than one region, but it reduces the scalability of any given hub.
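
Attaching a spoke to a second hub is then just another pair of standard peerings, with no gateway transit settings involved (a sketch with hypothetical names):

❯ az network vnet peering create -g $rg --vnet-name spoke11 -n spoke11-to-hub2 \
      --remote-vnet hub2 --allow-vnet-access --allow-forwarded-traffic
❯ az network vnet peering create -g $rg --vnet-name hub2 -n hub2-to-spoke11 \
      --remote-vnet spoke11 --allow-vnet-access --allow-forwarded-traffic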

Scalability

The main scalability limit of this design is going to be the route table in the gateway subnet where the VPN and ExpressRoute gateways are deployed: in every hub you will need a route table associated to its GatewaySubnet with one route for each directly peered spoke, as sketched below. You cannot use summary routes, because the VNet peerings will introduce more specific routes corresponding to the prefixes defined in every spoke VNet.
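
For illustration, the GatewaySubnet route table in hub 1 would carry one route per spoke, all pointing to the local NVA (a sketch with hypothetical names, using the spoke prefixes of this lab):

❯ az network route-table create -g $rg -n hub1-gw
❯ az network route-table route create -g $rg --route-table-name hub1-gw -n spoke11 \
      --address-prefix 10.1.16.0/24 --next-hop-type VirtualAppliance --next-hop-ip-address 10.1.1.4
❯ az network route-table route create -g $rg --route-table-name hub1-gw -n spoke12 \
      --address-prefix 10.1.17.0/24 --next-hop-type VirtualAppliance --next-hop-ip-address 10.1.1.4
❯ az network vnet subnet update -g $rg --vnet-name hub1 -n GatewaySubnet --route-table hub1-gw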

Since route tables are limited to 400 routes at the time of this writing (see Azure Networking Limits), the maximum number of spokes that you can attach to any given hub doesn’t change with this pattern: it stays at 400.

Downsides

So far, we have spoken about benefits, particularly the more granular control of what gets advertised from Azure to other network locations (on-premises networks and other Azure regions). As with every design, this one comes with its own set of caveats:

  • The main drawback is the complexity associated with maintaining an NVA that feeds the Azure Route Server with BGP routes. If your design already incorporates a Network Virtual Appliance able to speak BGP, and you are familiar with this protocol, this additional complexity might be negligible. If on the contrary your NVA doesn’t support BGP (like Azure Firewall), or you are not willing to swim through the BGP waters, this design might be overly complex for you.
  • Another limitation is the fact that Azure Route Server doesn’t support IPv6 today (https://learn.microsoft.com/azure/route-server/route-server-faq#does-azure-route-server-support-ipv6), so if you need IPv6 in your VNets, this is not the design you are looking for (read with a Jedi hand move).

Summing up

Azure Route Server gives you finer control over what you advertise over BGP from Azure, making the gateway transit settings in VNet peerings unnecessary, or even undesirable in certain situations. Consider this option the next time you are confronted with a hub and spoke topology in Azure.

Please let me know in the comments below if you see other benefits or disadvantages of this design that I haven’t described.

Thanks for reading!
