June 18, 2024


Priceline has been a leader in online travel for twenty-five years, offering hotels, flights, and more. Priceline's proprietary deals technology pairs negotiation with innovation to analyze billions of data points and generate deep discounts for customers that they can't find anywhere else. In fact, Priceline has saved customers over $15B on travel. In this blog post, we're highlighting Priceline for the DevOps achievements that earned them the 'Leveraging Loosely Coupled Architecture' award in the 2022 DevOps Awards. If you want to learn more about the winners and how they used DORA metrics and practices to grow their businesses, start here.

Priceline was created to help customers find the best travel deals exactly when they need them, which means that any delay or disruption in searching for those deals not only potentially impacts sales but also leaves travelers with fewer options. This is why the company realized that performing periodic regional failovers (for maintenance, troubleshooting, or simply stress testing a region) was causing delays due to the inability of auto-scalable compute resources to scale up quickly enough in response to their own changing performance metrics. These delays were resulting in the following issues:

  • Latency and even intermittent search failures

  • Hesitancy to perform regional failovers unless necessary

  • Overhead in manually scaling Google resources up and then back down to compensate

  • Increased costs from over-provisioning resources to avoid customer issues

Solving for the future

To reduce delays in regional failovers, Priceline's technology leadership realized that we needed a DevOps transformation with participation from both executive leadership and individual practitioners. Specifically, we needed a way to ensure that there would be enough capacity to handle large amounts of platform traffic shifting from region to region, while also guaranteeing that we only paid for the capacity being used.

This goal required a solution that was:

  • Dynamic

  • Responsive

  • Configurable

  • API-driven

  • Resource-aware

Efficient cluster management

To address these challenges and improve platform stability, we partnered with Google Cloud to find, implement, and validate a solution based on DevOps Research and Assessment (DORA) research. DORA is the largest and longest-running research program of its kind, seeking to understand the capabilities that drive software delivery and operations performance within an organization. Not only that, but Google Cloud helped our teams apply DORA capabilities, which led to improved organizational performance.

To compensate for auto-scalable compute resources that couldn't scale up quickly enough based on metrics alone, we implemented a mechanism that leverages two separate maximizer components working in tandem through a loosely coupled architecture.

The first component is the Python-based Bigtable maximizer, which can optimize clusters before platform traffic becomes an issue. This maximizer uses the Google Bigtable APIs to find, for each Bigtable cluster in a specified project and region, the cluster's current minimum and maximum node count. It can then raise the minimum number of nodes so that it temporarily matches the maximum node count, and, with a command, revert to the original minimum count.
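The post doesn't share Priceline's actual code, but the maximize-and-restore cycle it describes can be sketched in plain Python. The `Cluster` record and in-memory state below are hypothetical stand-ins for what the real tool would read and write through the Bigtable admin API:

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical stand-in for a Bigtable cluster's autoscaling bounds."""
    name: str
    min_nodes: int
    max_nodes: int


class BigtableMaximizer:
    """Raise each cluster's minimum node count to its maximum, remembering
    the original minimums so a later command can restore them."""

    def __init__(self, clusters):
        self.clusters = clusters
        self._saved_minimums = {}

    def maximize(self):
        for c in self.clusters:
            self._saved_minimums[c.name] = c.min_nodes
            c.min_nodes = c.max_nodes  # force capacity up before traffic arrives

    def restore(self):
        for c in self.clusters:
            c.min_nodes = self._saved_minimums.pop(c.name, c.min_nodes)


clusters = [Cluster("search-cache", min_nodes=3, max_nodes=10)]
maximizer = BigtableMaximizer(clusters)
maximizer.maximize()
print(clusters[0].min_nodes)  # 10
maximizer.restore()
print(clusters[0].min_nodes)  # 3
```

In the real component, the two loop bodies would be replaced by Bigtable admin API calls that read and update each cluster's autoscaling configuration.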

The second component, the Python-based Google Kubernetes Engine (GKE) deployment maximizer, uses the Kubernetes API to find the maximum number of replicas for each Horizontal Pod Autoscaler (HPA) object in a GKE cluster. This time, it looks for the maximum number of replicas needed for the HPA object and sets the minimum to match that maximum, so that each HPA is temporarily maximized to handle the influx of traffic from the opposite region. And similarly, with a subsequent command, the minimum number of replicas for each HPA can be restored to its original value. The GKE deployment maximizer also lets our teams choose what to maximize by specifying individual HPA objects, entire namespaces within an environment and region, or multiple clusters, namespaces, and HPA objects designated within a JSON file.
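The flexible targeting described above can be sketched as a selection step that resolves a JSON targeting file against the HPAs the Kubernetes API reports. The JSON shape and field names here are illustrative assumptions, not Priceline's published format:

```python
import json

# Illustrative targeting file; "*" selects every HPA in a namespace.
TARGETS_JSON = """
{
  "clusters": [
    {"cluster": "us-east-prod", "namespace": "flights", "hpas": ["search-api"]},
    {"cluster": "us-east-prod", "namespace": "hotels",  "hpas": ["*"]}
  ]
}
"""


def select_targets(targets_json, live_hpas):
    """Return (cluster, namespace, hpa) tuples to maximize.

    live_hpas maps (cluster, namespace) to the HPA names the Kubernetes
    API would report for that namespace.
    """
    selected = []
    for entry in json.loads(targets_json)["clusters"]:
        key = (entry["cluster"], entry["namespace"])
        names = live_hpas.get(key, [])
        wanted = names if entry["hpas"] == ["*"] else entry["hpas"]
        # Keep only targets that actually exist in the cluster.
        selected.extend((*key, hpa) for hpa in wanted if hpa in names)
    return selected


live = {
    ("us-east-prod", "flights"): ["search-api", "pricing"],
    ("us-east-prod", "hotels"): ["inventory", "reviews"],
}
print(select_targets(TARGETS_JSON, live))
```

Each selected tuple would then be handled the same way as in the Bigtable component: read the HPA's `maxReplicas`, set `minReplicas` to match, and record the original value for later restoration.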

With this two-pronged strategy, we can automatically and seamlessly maximize clusters and deployments before receiving platform traffic, and then optimize them back down immediately after a brief pause to ensure that they don't incur expenses for capacity they don't need. This lets our teams mitigate issues immediately and reliably, with logs that they can inspect after the fact, virtually eliminating any maintenance burden or risk of misconfiguration that could affect production.


By optimizing for speed without sacrificing stability, Priceline teams could confidently schedule regional failovers without concern that they might disrupt customer searches. This has led to measurable improvements, as demonstrated by significant gains in DORA's four key metrics:

  • Deployment frequency: With unified, reusable CI/CD pipelines built for both GKE and Compute Engine-based applications, we could deploy applications to production multiple times a day if necessary, with the added safety of built-in rollback capabilities. This stability has led to at least 70% more frequent deployments.

  • Lead time for changes: We have achieved a 30% reduction in the time needed to perform changes with regional failover.

  • Change failure rate: Migrating to the cloud enabled a uniform CI/CD process with multiple gates to ensure that teams follow a proper, repeatable process with the necessary testing and without configuration drift. Now, by finding and mitigating issues earlier in the software development lifecycle, at least 90% less time is spent investigating potential production incidents.

  • Time to restore service: By taking advantage of templated, customized APM alerts, we could learn about and respond to anomalous behavior, application performance drops, and outright failures in real time. This has reduced the duration of regional failovers and failbacks by at least 43%, with most production failure recoveries completed in minutes.

Stay tuned for the rest of the series highlighting the DevOps Award winners, and read the 2022 State of DevOps report to dive deeper into the DORA research.

