Cloud adoption was supposed to solve disaster recovery. Spin up workloads in distant regions, replicate data across availability zones, let the cloud provider handle infrastructure resilience.
Simple, right?
Except organizations that went all-in on cloud DR are discovering it doesn’t work quite as advertised. Cloud regions fail. Cross-region replication costs add up quickly.
Recovery time objectives that looked good on paper don’t hold up during actual outages. And for workloads that can’t move to public cloud due to compliance, performance, or cost reasons, the promise of cloud-based DR doesn’t help at all.
The reality is that effective disaster recovery still requires geographic diversity of physical infrastructure.
Whether you’re running workloads on-premises, in colocation, in cloud, or some hybrid combination, having infrastructure positioned in geographically diverse locations protects against regional failures that take down everything in a single area.
The difference between 2025 and 2005 isn’t whether you need geographic diversity – you absolutely do – it’s how you implement it and what technologies enable faster, more reliable recovery.
Most organizations haven’t thought critically about disaster recovery positioning in years. They made decisions a decade ago based on conventional wisdom about primary and DR sites, and they’ve maintained those same locations through inertia rather than strategic thinking.
Meanwhile, the infrastructure landscape changed dramatically. New markets emerged with excellent connectivity and lower costs. Hybrid cloud architectures created new DR possibilities. Edge computing pushed workloads to distributed locations. But the DR strategy stayed frozen in 2015.
Disaster recovery fundamentally depends on distance. Place your primary and DR sites too close together, and regional disasters affect both. Hurricanes impact entire coastal regions. Earthquakes damage areas hundreds of miles across. Winter storms and power grid failures affect multiple states. Even localized events like construction accidents cutting fiber, utility failures, or urban flooding can take down facilities within the same metro area.
The industry rule of thumb suggests a minimum of 100 miles between primary and DR sites. This distance provides protection against most regional events while keeping latency manageable for data replication. Some organizations push this to 200-500 miles for better disaster separation, accepting slightly higher replication latency in exchange for stronger geographic protection.
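As a quick sanity check on that rule of thumb, the straight-line separation between two candidate sites can be computed from their coordinates. Here's a minimal sketch in Python using the haversine formula; the coordinates are illustrative, and great-circle distance understates actual road and fiber-route mileage:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3959  # mean Earth radius

def great_circle_miles(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance between two points, in miles."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

# Illustrative coordinates: a Houston primary site paired with a Kansas City DR site.
primary = (29.76, -95.37)   # Houston
dr_site = (39.10, -94.58)   # Kansas City

distance = great_circle_miles(*primary, *dr_site)
print(f"Separation: {distance:.0f} miles")   # great-circle; fiber and road routes run longer
assert distance >= 100, "Sites are too close for meaningful regional separation"
```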
But distance alone doesn’t ensure protection. Two facilities 200 miles apart in the same hurricane zone don’t provide real diversity – when the hurricane hits, both sites face similar risks. Two facilities in different cities but on the same power grid face correlated failure risk during grid emergencies. True geographic diversity requires thinking about the different risk profiles of potential DR locations.
Regional Risk Profiles
Different parts of the country face different disaster scenarios. The Gulf Coast deals with hurricanes. California has earthquakes and wildfires. The Midwest sees tornadoes. The Northeast experiences winter storms and occasional hurricanes. Understanding these regional risks helps you select DR locations that don’t share your primary site’s vulnerabilities.
A company with primary infrastructure in Houston faces hurricane risk. Placing DR in Dallas provides some geographic separation but both locations sit in the same general region affected by Gulf Coast hurricanes. Moving DR to Kansas City or Indianapolis creates true diversity – these mid-country locations face completely different weather patterns and disaster scenarios than coastal Texas.
Power grid reliability varies by region too. Some markets have aging infrastructure prone to failures. Others invested heavily in grid modernization. Markets with diverse power generation sources maintain better reliability than those dependent on single fuel types or generation methods. When evaluating DR locations, utility infrastructure reliability matters as much as facility quality.
Network Infrastructure Concentration
Less obvious but equally important: network infrastructure concentration creates correlated failure risks. Markets where multiple carriers depend on common facilities or fiber routes create correlation in network failures. One incident affecting shared infrastructure impacts multiple supposedly diverse carriers.
The best DR markets have network infrastructure that developed over decades through organic growth rather than recent buildouts. These markets tend to have truly diverse fiber routes, multiple carrier facilities, and network infrastructure that evolved with real redundancy rather than just paperwork claiming diverse paths.
Geography creates natural advantages for certain markets as DR locations. Mid-country positioning between the coasts provides balanced connectivity and positions infrastructure away from coastal disaster risks that affect many primary sites.
The Kansas City Example
Kansas City sits near the geographic center of the country – roughly 1,400 miles from Los Angeles, 1,100 miles from New York, and 750 miles from Houston. This positioning creates several DR advantages.
For companies with primary sites on either coast, Kansas City provides genuine geographic diversity. A company based in Northern Virginia or New York gets true separation without going so far that replication latency becomes problematic. A company based in California gets similar benefits.
The Netrality Kansas City data center demonstrates how mid-country positioning combines strategic location with carrier-neutral connectivity. Access to 120+ network providers means you can build diverse network paths for replication and establish direct connectivity to cloud platforms for hybrid DR strategies.
Latency from Kansas City to either coast runs roughly 12-15 milliseconds, low enough to keep replication lag, and therefore potential data loss, minimal for applications with aggressive recovery point objectives. Organizations that need near-zero data loss during failover get far closer to it with mid-country DR than with a coast-to-coast pairing, where much higher replication latency forces fully asynchronous replication and a larger data loss window.
Cost Considerations
Mid-country markets typically offer 20-30% lower costs than coastal primary markets. Power costs less, real estate costs less, labor costs less. When you’re maintaining DR infrastructure that sits mostly idle until disasters strike, cost efficiency matters more than for primary production sites.
Lower costs let you provision more capacity than minimum requirements. Instead of running DR infrastructure at the absolute minimum needed to maintain operations during outages, you can afford to provision 1.5x or 2x capacity. This headroom means DR failover doesn’t require immediately optimizing resource usage – you can run less efficiently during the emergency and optimize later once primary sites recover.
Disaster Scenario Differences
Mid-country locations face different disaster scenarios than coasts. No hurricanes. Lower earthquake risk than California. Different weather patterns mean correlated weather events affecting both primary and DR sites become less likely. Power grid infrastructure serves different load patterns and faces different stress scenarios.
This diversity in risk profiles means the scenarios most likely to trigger DR failover at coastal sites – hurricanes, wildfires, earthquakes – won’t affect mid-country DR sites. Your DR infrastructure stays available during the exact scenarios you designed it to handle.
Cloud services changed what’s possible for disaster recovery, but effective implementation requires more thought than just spinning up instances in distant regions.
The Economics of Cloud DR
Cloud providers charge for everything – compute, storage, data transfer. A hot DR site running in cloud to match on-premises primary infrastructure costs as much as running production in cloud. Warm DR that maintains infrastructure but doesn’t actively process workloads still incurs substantial storage and minimal compute costs. Even cold DR requires paying for storage of recovery images and data.
Data transfer charges hit during failover when you’re moving traffic from failed primary sites to cloud DR, and again during failback when restoring operations to primary infrastructure. Large environments might face five or six-figure data transfer charges during failover events.
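To see how those charges add up, a rough back-of-the-envelope estimate helps; the per-gigabyte egress rate below is purely illustrative and not a quote from any provider's price list:

```python
# Rough estimate of data transfer charges for a failover/failback cycle.
# The $0.09/GB rate is illustrative only; actual cloud pricing varies by
# provider, region, volume tier, and whether traffic uses private links.
data_moved_tb = 500            # data pulled back out of the cloud during failback
egress_rate_per_gb = 0.09      # illustrative on-demand internet egress rate

estimated_cost = data_moved_tb * 1024 * egress_rate_per_gb
print(f"Estimated egress charges: ${estimated_cost:,.0f}")   # roughly $46,000
```

Actual rates vary widely, which is why large environments model this before committing to a cloud-heavy DR design.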
Hybrid approaches that combine colocation DR with cloud burst capacity provide better economics for many organizations. Maintain core DR infrastructure in colocation where costs stay fixed regardless of activation state. Use cloud for additional capacity during failovers or for workloads that benefit from cloud characteristics during recovery.
Replication Architecture
Hybrid cloud DR requires connectivity between on-premises or colocation environments and cloud platforms. This is where working with a hybrid cloud colocation provider creates advantages – you need facilities with direct connectivity to AWS, Azure, or Google Cloud to enable efficient replication without pushing all traffic over public internet.
Direct connectivity through services like AWS Direct Connect or Azure ExpressRoute provides the bandwidth and reliability needed for continuous replication. Private circuits eliminate concerns about internet congestion impacting replication performance and provide the consistent latency replication systems need for efficient operation.
Organizations implementing hybrid cloud DR should establish direct cloud connectivity at both primary and DR sites. This enables DR orchestration that can fail over between colocation sites, fail over to cloud, or even fail over from one cloud region to another based on what failure scenario actually occurs.
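Once those circuits exist, verifying they stay healthy becomes part of DR readiness. As a small illustration, the sketch below polls the state of AWS Direct Connect links from both sites using boto3; the profile names are placeholders, and the equivalent check applies to ExpressRoute or other private interconnects:

```python
import boto3

# Placeholder AWS CLI profile names for the primary and DR colocation sites.
SITES = {"primary": "primary-site-profile", "dr": "dr-site-profile"}

def check_direct_connect(profile_name: str) -> list[tuple[str, str]]:
    """Return (connection name, state) for each Direct Connect link visible to this profile."""
    session = boto3.Session(profile_name=profile_name)
    dx = session.client("directconnect")
    connections = dx.describe_connections()["connections"]
    return [(c["connectionName"], c["connectionState"]) for c in connections]

for site, profile in SITES.items():
    for name, state in check_direct_connect(profile):
        status = "OK" if state == "available" else "ALERT"
        print(f"[{status}] {site}: {name} is {state}")
```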
Application-Layer Considerations
Not every application fits the same DR model. Databases with synchronous replication need low latency between sites – typically under 10 milliseconds round-trip. Applications that can tolerate asynchronous replication handle higher latency but accept potential data loss during failover.
Stateless applications that depend on shared storage or databases can fail over quickly because the compute layer can start anywhere that can access the data layer. Stateful applications that maintain local data require replicating that state before they can fail over effectively.
Modern microservices architectures create both opportunities and challenges for DR. Individual services can fail over independently, providing granular DR capabilities. But the web of dependencies between services means coordinated failover that maintains service relationships becomes more complex than monolithic application DR.
How you architect DR determines both your recovery capabilities and your costs.
Active-Passive Approaches
Traditional DR uses active-passive architecture where primary sites run production workloads and DR sites stand ready to take over but don’t normally process production traffic. This approach minimizes DR costs because you’re not running full workloads at both sites continuously.
The tradeoff is recovery time. When primary sites fail, you need to promote DR sites to active status, update DNS or load balancers to redirect traffic, and handle the transition period where some requests might fail. Recovery time objectives of 15-60 minutes are typical for well-designed active-passive DR.
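The traffic redirection step is the part most worth scripting in advance. Here's a minimal sketch of repointing a DNS record at the DR site through the Route 53 API; the zone ID, record name, and address are placeholders, and global load balancers or traffic managers can perform the equivalent change automatically:

```python
import boto3

# Placeholder values for illustration.
HOSTED_ZONE_ID = "Z0000000000000EXAMPLE"
RECORD_NAME = "app.example.com."
DR_SITE_IP = "203.0.113.10"   # public endpoint at the DR site (documentation address)

route53 = boto3.client("route53")

# Repoint the application record at the DR site with a short TTL so clients
# pick up the change quickly once their cached records expire.
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Comment": "DR failover: redirect traffic to DR site",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": RECORD_NAME,
                "Type": "A",
                "TTL": 300,
                "ResourceRecords": [{"Value": DR_SITE_IP}],
            },
        }],
    },
)
```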
For many applications this tradeoff makes sense. The cost savings from not running full infrastructure at both sites continuously justify accepting some downtime during failover. You maintain DR capability without the expense of active-active operation.
Active-Active Benefits
Active-active DR runs production workloads at both sites simultaneously. Both locations serve production traffic, and failover happens automatically when one site becomes unavailable because the other site is already handling production load.
This provides the best recovery time objectives – often under 60 seconds because you’re not promoting inactive infrastructure. For applications where even brief outages cause significant business impact, active-active architecture justifies its higher costs.
The challenge is that not all applications support active-active operation. Databases that require single-writer configurations can’t run active-active without complex conflict resolution. Applications that maintain local state need distributed state management. Building applications that work correctly across active-active sites requires design decisions made early in development.
Determining the Right Model
Recovery time objectives and recovery point objectives drive architecture decisions. If you need recovery in under 5 minutes with zero data loss, you probably need active-active architecture with synchronous replication. If you can tolerate 30-60 minutes and minimal data loss, active-passive with asynchronous replication costs less while meeting requirements.
Business impact analysis clarifies which applications need aggressive DR and which can accept relaxed requirements. Not everything needs active-active DR. Tier applications based on business criticality and implement appropriate DR architectures for each tier rather than applying one-size-fits-all approaches.
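One lightweight way to keep that tiering explicit is to record recovery targets alongside the application inventory so the DR architecture for each application follows from its tier. A sketch with hypothetical tiers and applications:

```python
from dataclasses import dataclass

@dataclass
class DrTier:
    name: str
    rto_minutes: int      # target recovery time
    rpo_minutes: int      # target data loss window
    architecture: str     # DR pattern used for this tier

# Illustrative tier definitions; real targets come from business impact analysis.
TIERS = {
    "tier-1": DrTier("tier-1", rto_minutes=5, rpo_minutes=0, architecture="active-active, synchronous replication"),
    "tier-2": DrTier("tier-2", rto_minutes=60, rpo_minutes=15, architecture="active-passive, asynchronous replication"),
    "tier-3": DrTier("tier-3", rto_minutes=1440, rpo_minutes=240, architecture="cold restore from backups"),
}

# Hypothetical application inventory mapped to tiers.
APPLICATIONS = {
    "payments-api": "tier-1",
    "order-history": "tier-2",
    "internal-wiki": "tier-3",
}

for app, tier in APPLICATIONS.items():
    t = TIERS[tier]
    print(f"{app}: RTO {t.rto_minutes} min, RPO {t.rpo_minutes} min -> {t.architecture}")
```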
The best DR plan fails if you haven't tested it. But testing disrupts operations when it requires failing over to DR sites or redirecting production traffic. Organizations that can't test without business impact typically don't test adequately, which means their DR plans fail during actual disasters.
Non-Disruptive Testing Approaches
Test failover of individual application components without redirecting production traffic. Promote DR database replicas to read-write mode, verify they accept writes, then demote back to replica status. Start application servers at DR sites and verify they can access data and dependencies, then shut them down. Test network failover by activating backup circuits without moving production traffic to them.
These component-level tests validate most of the DR plan without requiring full production cutover. You verify that infrastructure works, connectivity functions, and procedures are correct. The only thing you don’t test is actual production traffic handling at scale, but you’ve validated everything else.
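For the database piece, much of that validation can be done read-only. The sketch below checks that a PostgreSQL DR replica is still in recovery mode and measures its replication lag without promoting anything; the connection details are placeholders, and a full promote-and-demote exercise goes further than this check:

```python
import os
import psycopg2   # assumes a PostgreSQL streaming replica at the DR site

# Placeholder connection details; the password comes from the environment.
conn = psycopg2.connect(
    host="dr-db.example.internal",
    dbname="appdb",
    user="dr_check",
    password=os.environ.get("DR_CHECK_PASSWORD"),
    sslmode="require",
)

with conn, conn.cursor() as cur:
    # Confirm the instance is still acting as a replica (in recovery mode).
    cur.execute("SELECT pg_is_in_recovery();")
    in_recovery = cur.fetchone()[0]

    # Measure how far behind the primary the replica is replaying changes.
    # Note: on an idle primary this interval can look large or return NULL.
    cur.execute("SELECT now() - pg_last_xact_replay_timestamp();")
    lag = cur.fetchone()[0]

print(f"replica mode: {in_recovery}, apply lag: {lag}")
conn.close()
```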
Scheduled Maintenance as Testing Opportunity
Use planned maintenance windows for more complete DR testing. When primary sites need maintenance that requires taking systems offline, actually fail over to DR sites rather than just taking downtime. This tests real failover while the business impact happens during scheduled maintenance rather than unexpected outages.
Some organizations schedule quarterly or semi-annual DR tests that involve full production failover during off-peak periods. Run production from DR sites for a weekend, verify everything works, then fail back. These tests validate the complete DR process including failback, which many organizations neglect to test despite its importance after real disasters.
Automated Testing
Modern orchestration tools enable automated DR testing that doesn’t require manual procedures. Scripts can fail over test workloads to DR sites, verify functionality, and fail back automatically. This automation enables more frequent testing because you’re not consuming significant staff time for each test cycle.
The frequency of automated testing can increase from quarterly manual tests to weekly or even daily automated validation of DR readiness. More frequent testing catches configuration drift and ensures DR stays current as production systems evolve.
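A scheduled readiness check might look something like the sketch below. The hostnames and ports are placeholders, and a real orchestration tool would layer actual test-workload failover and failback on top of simple reachability probes like these:

```python
import datetime
import socket

# Illustrative DR readiness probes; hostnames and ports are placeholders.
CHECKS = [
    ("dr-db.example.internal", 5432),    # database replica reachable
    ("dr-app.example.internal", 443),    # application tier reachable
    ("dr-api.example.internal", 8443),   # internal API gateway reachable
]

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def run_readiness_checks() -> bool:
    all_ok = True
    for host, port in CHECKS:
        reachable = port_open(host, port)
        print(f"{'PASS' if reachable else 'FAIL'}: {host}:{port}")
        all_ok = all_ok and reachable
    return all_ok

if __name__ == "__main__":
    print(f"DR readiness check at {datetime.datetime.now(datetime.timezone.utc).isoformat()}")
    raise SystemExit(0 if run_readiness_checks() else 1)
```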
Disaster recovery infrastructure faces the same security and compliance requirements as primary sites, but meeting those requirements across distributed infrastructure creates additional complexity.
Data Protection During Replication
Data traveling between primary and DR sites needs encryption in transit. Even when using private circuits rather than public internet, encryption protects against interception. Many organizations implement encryption at the application layer or use VPN connections over private circuits to ensure end-to-end protection.
Compliance frameworks often specify encryption requirements for data replication. HIPAA requires encryption for protected health information. PCI DSS requires it for cardholder data. Industry-specific regulations may add additional requirements. Implementing consistent encryption across all replication flows ensures compliance regardless of which data types you’re protecting.
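At the application layer, enforcing encryption in transit means refusing connections that don't negotiate verified TLS. A generic sketch for an application-level replication stream; the endpoint, port, and CA bundle path are assumptions:

```python
import socket
import ssl

# Placeholder endpoint for an application-level replication stream.
DR_HOST, DR_PORT = "dr-replication.example.internal", 8443
CA_BUNDLE = "/etc/ssl/certs/internal-ca.pem"   # internal CA used for site-to-site TLS

# Require TLS with certificate and hostname verification, even over private circuits.
context = ssl.create_default_context(cafile=CA_BUNDLE)
context.minimum_version = ssl.TLSVersion.TLSv1_2

with socket.create_connection((DR_HOST, DR_PORT)) as raw_sock:
    with context.wrap_socket(raw_sock, server_hostname=DR_HOST) as tls_sock:
        print(f"Negotiated {tls_sock.version()} with cipher {tls_sock.cipher()[0]}")
```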
Access Controls at DR Sites
DR sites need the same access controls as primary sites. Physical access restrictions, biometric controls, video surveillance, and escort requirements should match primary site security. The fact that a site primarily serves DR workloads doesn’t reduce security requirements – unauthorized access at DR sites creates the same data exposure risks as primary site breaches.
Network access controls and firewall rules need to function identically at both sites. During failover, applications should enforce the same security policies regardless of which site they’re running in. This means maintaining consistent security configurations across sites even when DR infrastructure sits mostly idle.
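Automation makes that consistency verifiable. The sketch below diffs firewall-style rule sets exported from each site; the JSON export format is an assumption, but the same approach works for security groups, ACLs, or vendor firewall configs:

```python
import json

def load_rules(path: str) -> set[tuple]:
    """Load an exported rule set and normalize each rule into a hashable tuple."""
    with open(path) as f:
        rules = json.load(f)
    return {(r["direction"], r["protocol"], r["port"], r["source"]) for r in rules}

# Placeholder export files produced by whatever tooling manages each site's firewalls.
primary_rules = load_rules("primary-site-rules.json")
dr_rules = load_rules("dr-site-rules.json")

for rule in sorted(primary_rules - dr_rules):
    print(f"DRIFT: rule present at primary but missing at DR: {rule}")
for rule in sorted(dr_rules - primary_rules):
    print(f"DRIFT: rule present at DR but not at primary: {rule}")
```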
Compliance Audits
Auditors examine DR plans and test results as part of compliance assessments. Organizations need documented procedures, test records showing regular validation, and evidence that DR actually works. Missing or outdated DR documentation creates audit findings even when technical DR capability exists.
SOC 2 audits specifically evaluate disaster recovery and business continuity controls. PCI DSS requires testing DR at least annually. HIPAA expects documented contingency plans with regular testing. Meeting these requirements means treating DR as a compliance control that needs the same rigor as other security measures.
Applications don’t exist in isolation. They depend on databases, file storage, network services, authentication systems, and other infrastructure that all need to fail over together.
Dependency Mapping
Effective DR requires understanding what each application needs to function. A web application might depend on a database, caching system, API gateway, authentication service, and external payment processor. If any dependency fails to come up during DR failover, the application doesn’t work even if the application servers themselves started successfully.
Document these dependencies explicitly. Many organizations discover critical dependencies only during actual outages when applications fail in unexpected ways because some obscure service they depend on didn’t fail over correctly.
Orchestrated Failover
Dependencies create ordering requirements during failover. Databases need to come up before applications that query them. Authentication services need to be available before applications that require authentication. DNS and network services need to function before anything else starts.
Orchestration tools automate these ordering dependencies and handle the complexity of bringing up interconnected systems in the right sequence. Manual DR procedures that require dozens of steps in specific sequences are error-prone during the stress of actual disasters. Automation handles the complexity reliably.
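A dependency map expressed as a simple graph can be turned into a startup order programmatically. Here's a minimal sketch using Python's standard-library graphlib; the services and their dependencies are illustrative:

```python
from graphlib import TopologicalSorter

# Illustrative dependency map: each service lists the services it depends on.
DEPENDENCIES = {
    "dns":          set(),
    "network":      set(),
    "auth-service": {"dns", "network"},
    "database":     {"network"},
    "cache":        {"network"},
    "api-gateway":  {"dns", "auth-service"},
    "web-app":      {"database", "cache", "api-gateway"},
}

# static_order() yields services so every dependency starts before its dependents.
startup_order = list(TopologicalSorter(DEPENDENCIES).static_order())
print("Failover startup order:", " -> ".join(startup_order))
```

Real orchestration adds health checks and timeouts between steps, but the ordering problem itself is exactly this kind of topological sort.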
Partial Failover Scenarios
Sometimes only specific components fail, not entire sites. A storage array failure might require failing over databases while applications continue running. A network outage might need route failover without moving workloads.
DR plans should accommodate partial failover rather than assuming all-or-nothing scenarios. The flexibility to fail over individual components independently provides options during complex failure scenarios where some systems remain available.
Organizations tend to underinvest in disaster recovery because it only proves its value during actual disasters – which hopefully don’t happen often. But the cost of inadequate DR shows up in less obvious ways.
Insurance and Risk Management
Business interruption insurance costs reflect your disaster recovery capabilities. Organizations with tested, proven DR get better rates than those with paper plans that haven’t been validated. The insurance cost difference over years can exceed the cost of implementing proper DR.
Customer contracts increasingly require DR provisions with specific recovery time objectives and recovery point objectives. Government contracts often mandate DR capability. Enterprise customers expect vendors to maintain DR that protects customer data and services. Missing these requirements limits business opportunities.
Market Confidence
Public companies face scrutiny about disaster recovery after any operational incident. Organizations that can demonstrate quick recovery from outages maintain market confidence. Those that suffer extended outages without effective recovery see stock price impacts beyond the direct business costs of downtime.
Private companies face similar dynamics with customers and prospects. Word spreads quickly in industries when vendors suffer extended outages. Competitors use these incidents in sales conversations. The reputational damage from inadequate DR compounds over time as the story becomes part of your company’s narrative.
Strategic Flexibility
Effective DR infrastructure provides flexibility beyond pure disaster scenarios. Planned maintenance at primary sites becomes straightforward when you can fail over to DR sites temporarily. Capacity expansion happens more smoothly when DR sites can absorb load during primary site upgrades.
Organizations with robust DR can be more aggressive about infrastructure improvements and changes because they have fallback options if changes go wrong. Those without effective DR make more conservative infrastructure decisions to avoid risk, which can slow innovation and limit competitive positioning.
Disaster recovery continues evolving as infrastructure becomes more distributed and hybrid architectures become standard. The organizations that will handle future disruptions effectively are those thinking strategically about DR positioning now rather than maintaining legacy DR architectures designed for past infrastructure models.
Geographic diversity remains fundamental regardless of how much infrastructure moves to cloud. The physics of distance and the reality of regional disasters don’t change just because workloads run in different environments. But how you implement geographic diversity can adapt to use modern capabilities while maintaining the protection distance provides.
Mid-country positioning offers advantages that weren’t obvious when most organizations made their original DR location decisions. Balanced connectivity, cost efficiency, and genuine geographic separation from coastal primary sites create DR capabilities that serve modern hybrid architectures while providing protection against the regional disasters that DR fundamentally exists to handle.
The companies that will look smart in five years are those revisiting DR strategies now based on current infrastructure options rather than maintaining decade-old decisions made for different technologies and different threat landscapes.
Q: How far apart should primary and DR sites be located? The industry standard recommends a minimum of 100 miles between primary and DR sites to protect against regional disasters while maintaining reasonable latency for data replication. Organizations with aggressive recovery point objectives requiring synchronous replication typically stay within 300-500 miles (maintaining under 10ms round-trip latency). Companies prioritizing maximum geographic diversity may place sites 1,000+ miles apart and accept asynchronous replication with the small data loss window it implies during failover.
Q: What latency is required for synchronous replication? Synchronous replication typically requires round-trip latency under 10 milliseconds to maintain acceptable performance. This translates to approximately 500-600 miles maximum distance between sites given fiber optic transmission speeds. Beyond this distance, most organizations switch to asynchronous replication which accepts slightly higher latency but introduces potential data loss during failover (typically measured in seconds to minutes depending on replication lag).
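The distance figure in that answer follows directly from the speed of light in fiber. A rough calculation, with illustrative values for fiber speed and equipment overhead:

```python
# Light in optical fiber travels at roughly two-thirds of its speed in a vacuum.
FIBER_SPEED_MILES_PER_MS = 186_000 * (2 / 3) / 1000   # about 124 miles per millisecond, one way

rtt_budget_ms = 10            # synchronous replication budget from the answer above
equipment_overhead_ms = 2     # illustrative allowance for routers, optics, and the storage stack

one_way_ms = (rtt_budget_ms - equipment_overhead_ms) / 2
max_fiber_miles = one_way_ms * FIBER_SPEED_MILES_PER_MS
print(f"Roughly {max_fiber_miles:.0f} fiber-route miles fit in a {rtt_budget_ms} ms round trip")
# Prints roughly 496 miles, consistent with the 500-600 mile rule of thumb;
# straight-line separation is shorter still, since fiber rarely runs point to point.
```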
Q: How often should DR plans be tested? Best practice recommends testing DR at least quarterly for business-critical systems, with full production failover tests at least annually. Compliance frameworks specify minimum testing frequencies – PCI DSS requires annual testing, SOC 2 audits evaluate testing frequency as part of availability controls. Many organizations implement monthly component-level testing (databases, network failover, application startup) supplemented by quarterly or semi-annual full failover tests. Automated testing can increase frequency to weekly or daily validation of DR readiness.
Q: What’s the difference between hot, warm, and cold DR sites? Hot DR sites run identical configurations to production with data continuously replicated and systems ready to assume load instantly (recovery time under 60 seconds). Warm DR sites maintain infrastructure ready to run with current data replicated, but systems aren’t actively processing production transactions (recovery time 15-60 minutes). Cold DR sites provide space and basic infrastructure but don’t maintain current data or running systems (recovery time hours to days). Hot sites cost most but provide fastest recovery; cold sites cost least but require longest recovery times.
Q: Can cloud completely replace physical DR infrastructure? Cloud provides viable DR for many workloads, particularly cloud-native applications designed for that environment. However, limitations exist: cloud DR costs accumulate through storage, compute, and data transfer charges; some compliance requirements prevent moving certain data to public cloud; applications with extreme performance requirements may need physical infrastructure; and large-scale failover can trigger substantial data transfer costs. Many organizations implement hybrid approaches combining colocation DR for core systems with cloud DR for workloads suited to that model.
Q: What recovery time objectives are realistic for different DR architectures? Active-active architectures with load balancing across sites achieve recovery times under 60 seconds (often seconds) since workload already runs at both locations. Active-passive with hot DR typically delivers 5-15 minute recovery times for automated failover. Warm DR with manual procedures typically requires 30-60 minutes. Cold DR requiring data restoration from backups and manual system configuration often takes 4-24 hours. Required business recovery times should drive architecture selection – don’t over-invest in active-active if 30-minute recovery meets business needs.
Q: How do you handle DNS during DR failover? Most organizations use short DNS TTL (time-to-live) values for critical services (300 seconds or less) to enable faster DNS-based failover. Global load balancers with health checking can automatically update DNS or use anycast routing to redirect traffic during failures. Some organizations use GeoDNS or traffic management services that route users to available sites without manual DNS changes. For critical applications requiring instant failover, consider load balancer-based approaches rather than DNS-dependent strategies, as DNS changes face propagation delays even with low TTL values.
Q: What should be included in DR runbooks and procedures? Comprehensive DR runbooks should include: complete dependency maps showing what must start in what order; specific commands and procedures for failing over each component; contact information for key personnel and vendors; network configuration changes needed during failover; validation steps to confirm systems work correctly after failover; and failback procedures for restoring primary site operations. Runbooks should be detailed enough that staff unfamiliar with specific systems can execute procedures during emergencies. Regular updates as systems change prevent runbooks from becoming outdated and ineffective during actual disasters.