Defying Murphy’s Law - Three Keys to Live-Stream Performance

    In my ten years of streaming major events, including Super Bowls and Olympics, I’ve found that every big event is unique – and at every one, something unexpected happens. But successful streaming always rests on three essential ingredients – clear objectives, comprehensive testing, and operational playbooks.

    Know Your KPIs
    The ultimate objective may be to make the live-stream experience flawless. Realistically, that can’t happen for all viewers all the time in all places. Audience expectations, while rising steadily overall, vary locally. And there are cost-performance tradeoffs to navigate.

    Amid all that variability, you need to establish specific KPIs as the benchmark for performance measurement, comprehensive testing, and continuous improvement. Start with the basics – viewers’ time-to-access the stream and rebuffering percentage. Include audience satisfaction and feedback if you measure them. And be precise about time-to-recovery objectives. For example, when servers go down, software components fail, or unexpected things happen on the Internet, does the workflow have the resiliency to recover quickly, even imperceptibly to the viewer?
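    In practice, KPIs like these end up encoded as thresholds that monitoring checks against. Here is a minimal sketch of that idea – the metric names and threshold values are illustrative assumptions, not industry standards:

```python
# Hypothetical KPI thresholds for a live event. The names and numbers
# are illustrative assumptions, not industry standards.
KPI_THRESHOLDS = {
    "time_to_first_frame_s": 2.0,   # viewer's time-to-access the stream
    "rebuffer_ratio_pct": 0.5,      # rebuffering time / total watch time
    "time_to_recovery_s": 10.0,     # time until failover restores output
}

def evaluate_kpis(measured: dict) -> list:
    """Return the names of KPIs that breached their threshold."""
    return [name for name, limit in KPI_THRESHOLDS.items()
            if measured.get(name, 0.0) > limit]

breaches = evaluate_kpis({"time_to_first_frame_s": 1.4,
                          "rebuffer_ratio_pct": 1.2,
                          "time_to_recovery_s": 4.0})
print(breaches)  # -> ['rebuffer_ratio_pct']
```

    Keeping the thresholds in data rather than buried in code makes them easy to refine as testing teaches you where the real limits are.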

    A True Test of Strength
    The success of an event often hinges on how thoroughly you’ve tested the entire workflow in advance. Complete functional testing accomplishes three things:

    - Finding bottlenecks, possible points of failure, the weakest links in the delivery chain. Once weaknesses are identified, you can then engineer for greater resiliency and minimize the chances of failures.

    - Refining KPIs and identifying leading indicators of potential failure. It takes rigorous testing to establish thresholds for the performance indicators which will then function as an early warning system.

    - Anticipating what to do when failures do occur. Testing provides the foundation for a playbook of how to respond under a variety of scenarios.

    Functional testing starts with the basics: given a particular input, does the output stream come out correctly? So far, so good. Now, what happens when you rip out one of the components? Does the system fail over to a backup? Can you still get the output you want? Is the workflow resilient enough to maintain continuity in the viewer experience? You get answers based not on theory but on realistic testing of the actual system and components.
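    The “rip out a component” question above can be sketched as a tiny failover test. The component names and behavior here are invented for illustration – a sketch of the pattern, not a real streaming stack:

```python
# Minimal failover sketch, assuming a primary/backup encoder pair.
# Component names and behavior are hypothetical.
class Encoder:
    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy

    def process(self, frame: str) -> str:
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return f"{frame} encoded by {self.name}"

def deliver(frame: str, primary: Encoder, backup: Encoder) -> str:
    """Try the primary; on failure, fail over to the backup."""
    try:
        return primary.process(frame)
    except RuntimeError:
        return backup.process(frame)

primary, backup = Encoder("primary"), Encoder("backup")
print(deliver("frame-1", primary, backup))  # -> frame-1 encoded by primary
primary.healthy = False                     # "rip out" the primary
print(deliver("frame-2", primary, backup))  # -> frame-2 encoded by backup
```

    A test like this asserts not just that output appears, but that it keeps appearing after the failure – which is exactly the continuity question the viewer cares about.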

    By testing the system from many angles and under many conditions, you can validate assumptions about the workflow, understand what happens when something fails, and devise the procedures to address it – both in advance and during the event.

    Testing Done Right
    You can’t test the Internet in a lab – it must happen in production. The challenge is to test realistically without using the audience as guinea pigs and without posing risk to the business. Functional testing therefore needs three key ingredients:

    - Scale safely. Stress testing workflows and their components requires the ability to generate sometimes massive loads from the cloud at and beyond the limits of anticipated operation. It can’t be just “dumb load,” but rather must represent the actual audience, including geographic distribution and device and connectivity combinations.

    - Real-time feedback. The response time for test data should be a few seconds, not several minutes or a quarter hour. Instant feedback makes it easier and faster to isolate problems and eliminates the risk that high-volume testing will overload the infrastructure.

    - Dynamic control. Most streaming and gaming companies have narrow windows in which to test their systems, and one problematic component shouldn’t halt the entire test. With real-time feedback and dynamic control over testing, problematic components can be pulled out so the rest of the test can continue.
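    The “not dumb load” requirement above amounts to sampling synthetic viewers from a realistic audience distribution. A minimal sketch, assuming an invented region and device mix (the weights are placeholders, not measured data):

```python
import random

# Illustrative audience model -- regions, devices, and their weights
# are assumptions for the sketch, not measured data.
REGIONS = {"us-east": 0.4, "eu-west": 0.35, "ap-south": 0.25}
DEVICES = {"smart-tv": 0.5, "mobile": 0.3, "desktop": 0.2}

def synthetic_viewer(rng: random.Random) -> dict:
    """One simulated viewer drawn from the audience distribution."""
    region = rng.choices(list(REGIONS), weights=list(REGIONS.values()))[0]
    device = rng.choices(list(DEVICES), weights=list(DEVICES.values()))[0]
    return {"region": region, "device": device}

rng = random.Random(42)  # seeded so a test run is reproducible
load = [synthetic_viewer(rng) for _ in range(10_000)]
us_share = sum(v["region"] == "us-east" for v in load) / len(load)
print(round(us_share, 2))  # close to the configured 0.4 weight
```

    Driving load generators from a model like this – rather than hammering one endpoint from one region – is what makes the test results predictive of the real event.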

    Playing by the Book
    Once your event is live, it’s time to put the engineering aside and operate the system. If something goes wrong, there’s no time to re-engineer. You’ve got to know how to operate.

    The antidote is to have a well-documented plan – and follow it. An operational playbook anticipates problems that may arise and prescribes what to do about them. It incorporates the lessons learned in functional testing. It helps people expect the unexpected and manage their responses based on the data, the testing, and the groundwork that was laid, rather than on emotion. It helps them make good decisions in the clutch.
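    At its simplest, a playbook is a mapping from observed symptoms to documented responses, with a defined escalation path for anything it doesn’t cover. A sketch, with hypothetical example entries:

```python
# Sketch of a playbook as data: observed symptoms mapped to documented
# responses. The entries are hypothetical examples, not real runbooks.
PLAYBOOK = {
    "origin_5xx_spike": [
        "Shift traffic to the backup origin",
        "Page the origin on-call engineer",
    ],
    "rebuffer_ratio_breach": [
        "Check mid-tier CDN health dashboards",
        "Consider capping the top bitrate to relieve congestion",
    ],
}

def respond(symptom: str) -> list:
    """Return the documented steps, or escalate if the symptom is new."""
    return PLAYBOOK.get(symptom, ["Escalate to the incident commander"])

print(respond("origin_5xx_spike")[0])  # -> Shift traffic to the backup origin
```

    The point isn’t the code – it’s that every branch was written down, and tested, before the event started.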

    With clear objectives, thorough testing, and a comprehensive playbook you can have confidence in your workflow’s ability to perform at scale and deliver a successful event even when things go awry.