Hello! I’m James “WxWatch” Glenn and I’m a software engineer on the Riot Developer Experience: Operability (RDX:OP) team. My team focuses on providing tools for Riot engineers and operations teams that help them better understand the state of their live services across the globe.
Some of these tools include Riot’s service metrics, logging, and alerting pipelines. In this article, I’ll be talking about our one-stop-shop tool for Rioters operating services: Console.
Building Console has allowed us to deprecate and remove many of the standalone custom tools that we discussed in a previous blog post. But before we get into the details of Console, let’s set the context of this problem space using an example of a troubleshooting experience an engineer would have had using these tools prior to the creation of Console.
For this example, we’ll use a service my team owns called the “opmon.collector”. This service is the primary interface that services at Riot use to send logs and metrics to our monitoring platform. The opmon.collector is deployed in datacenters across the globe.
In this scenario, an alert triggers for opmon.collector and we need to figure out why.
A map to guide you through this example.
To view the alert details, we’ll need to open up a browser, navigate to our alerting tool, and enter the service name and location to view the active alerts. From the active alerts, we see an alert saying we’re exceeding our allowed number of API timeouts, so let’s go check the logs to see if we can pinpoint the issue further.
To view the logs, we’ll log in to our monitoring platform log viewer and type in the service name and location, only to discover that there are no logs for this service instance in our monitoring platform! Since the service’s logs aren’t making it to our monitoring system, we’ll need to look at the container logs directly.
To do that, we turn to our container visualizer, Toolbox. Inside Toolbox, we once again drill down to the appropriate cluster, find the service, and open it up. Looking at the container logs, we’re able to see that our issue is that the service is unable to connect to a dependency service. To diagnose this further, we need to look into our service’s network rules.
Navigating to our Network Viewer, we again need to search for the service. Once found, we can open up the network rules and, upon inspection, discover that our service is missing a rule to allow it to talk with a dependency. From here, we can add that rule and resolve the issue.
Great! We were able to use these tools to identify the issue. Each new tool we used, however, required us to reestablish the context of our search, which, in this case, was our service’s name and location. A more subtle inconvenience is that it required us to know about the existence of (and have access to) all of these tools in order to discover the cause of our issue. Over time, this adds up to a significant amount of inconvenience, not only day-to-day as an engineer, but also for one-time events like onboarding.
We built Console to solve these inconveniences. We took the core functionality of these bespoke tools (and many more) and bundled them into a single tool with a unified context and UI.
This means that you find your service once via the search bar and everything you view is within that context. In addition to removing many of the tools that were mentioned in the previous section, we’ve been able to include features that would be nearly impossible to manage across multiple tools (e.g. Console has Dark Mode).
To illustrate this, let’s go through the same example as the section above, but this time we’ll use Console.
Treasure obtained in a fraction of the time.
First, let’s check the logs:
As before, we see there’s a network issue. Looking at the network rules, we again find that our service doesn’t have the necessary rule to reach other.service.
These are the same triage steps as before, but thanks to Console, we’re able to easily navigate between the features we needed and more quickly determine the cause of the problem.
Combining all these tools into a common interface was not as simple as it initially seemed. To get the best experience, there were two main goals we needed to accomplish. First, we needed to distill all the useful features from every tool while leaving behind or rethinking the features that typically went unused. And second, we needed to provide a way for other teams to get their data and features into Console.
To accomplish this first goal, we adopted a “player experience first” mindset. My team’s audience (our version of “players”) is Riot engineers across the entire company, from game developers to infrastructure teams. If we can improve their experience by decreasing the amount of friction when using tooling, then we’re increasing the amount of time they have to work on features and games for players. To figure out everyone’s wants and needs, we just, well, asked them. We created design documents and wireframes and interviewed and surveyed engineers across Riot. This gave us a solid picture of what was (and wasn’t) important to developers.
An early Console wireframe
Providing an easy path for other teams to build features in Console boiled down to one major hurdle: not all teams have dedicated front-end engineers, and teams don’t want to spend a lot of time designing and building a user interface. The manifestation of this hurdle in the past was the collection of pre-Console tools we mentioned earlier. They were typically built using whichever JS framework (React, Angular, etc.) and UI framework (Material, Bootstrap, etc.) the team chose at the time, meaning no two tools looked or felt the same.
Technology and Template Time
Now that we know what we wanted to accomplish with Console, let’s talk about how we did it. Console’s back-end is a Golang service that gathers data from services across Riot, caches it, and communicates it to the front-end via REST APIs. Console also provides a proxy that the front-end can use to communicate with other services directly, in cases where no additional processing is needed on the back-end. This eliminates the need for an engineer to write boilerplate APIs simply to fetch data from services. Console’s front-end is a React application using TypeScript for type checking (we originally used Flow but recently migrated to TypeScript) and Ant Design for its UI components.
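To make the gather, cache, and proxy split concrete, here is a minimal sketch. It is written in Python for brevity (Console’s actual back-end is Go), and everything in it is an assumption for illustration: the `CachingGateway` class, its `register`, `get`, and `proxy` methods, and the fixed time-to-live policy are hypothetical and are not Console’s real API.

```python
import time
from typing import Callable, Dict, Tuple

# Hypothetical type: a fetcher takes a service name and returns that
# source's data for it (for example alerts, logs, or network rules).
Fetcher = Callable[[str], dict]


class CachingGateway:
    """Illustrative only: gathers data from upstream sources and caches it."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the cache can be tested
        self._fetchers: Dict[str, Fetcher] = {}
        self._cache: Dict[Tuple[str, str], Tuple[float, dict]] = {}

    def register(self, source: str, fetcher: Fetcher) -> None:
        """Attach an upstream data source under a short name."""
        self._fetchers[source] = fetcher

    def get(self, source: str, service: str) -> dict:
        """Cached path: reuse a stored response until the TTL expires."""
        key = (source, service)
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        data = self._fetchers[source](service)
        self._cache[key] = (now, data)
        return data

    def proxy(self, source: str, service: str) -> dict:
        """Pass-through path: no caching and no extra processing."""
        return self._fetchers[source](service)
```

In this sketch a team would register a fetcher once and let the front-end choose between `get` (cached) and `proxy` (direct), which mirrors the idea that nobody has to write a boilerplate endpoint just to relay data.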
This architecture allows us to focus on having a consistent UI across the entire application. To help maintain consistency, we built a series of templates that teams can use when integrating their own features into Console. These templates give teams a framework to work with and allow engineers with less front-end experience to still quickly build out good, consistent UIs within Console. They also lower the barrier to entry, since they eliminate the need for engineers to come up with content from scratch.
Consistency alone isn’t good enough though. We knew we needed to prioritize a good overall user experience so people would be motivated to use Console. It’s a tool that engineers use every day, so any inconveniences, no matter how small, add up over time, generating a lot of pain and annoyance. Because of this, we focused on making sure Console not only has the right data, but is also easy to navigate and understand. For navigation, each feature in Console is scoped to specific service types and is only visible when viewing a service of that type, ensuring relevant features are easily accessible. Also, since Console collects data from different sources, we help the user understand the origin of the data they’re viewing by offering unobtrusive tooltips that show the data’s source.
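The scoping and tooltip ideas can be sketched in a few lines. This is an illustration under assumptions, not Console’s implementation: the `Feature` record, the `visible_features` and `tooltip` helpers, and the example service types and source names are all made up.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Feature:
    """Hypothetical description of one Console feature."""
    name: str
    service_types: FrozenSet[str]  # service types the feature applies to
    data_source: str               # system the data comes from


def visible_features(features: Iterable[Feature],
                     service_type: str) -> List[Feature]:
    """Keep only the features scoped to the service type being viewed."""
    return [f for f in features if service_type in f.service_types]


def tooltip(feature: Feature) -> str:
    """Text for an unobtrusive tooltip naming the data's origin."""
    return f"Source: {feature.data_source}"


# Example registry; every name below is invented for illustration.
FEATURES = [
    Feature("Logs", frozenset({"container", "vm"}), "log pipeline"),
    Feature("Network Rules", frozenset({"container"}), "network viewer"),
]
```

With this registry, viewing a hypothetical "vm" service would list only Logs, while a "container" service would list both features.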
Now that we have all the data in one place, we can begin to correlate data that we weren’t able to previously. For example, since Console knows if a service is alerting, it can display that alert as a notification on the service summary page.
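As a rough sketch of that correlation, the summary for a service can be built by looking up active alerts under the same service name and location the user already searched for. The `Summary` record, the `build_summary` helper, the alert text, and the "dc1" location are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# A service is identified by the same context used everywhere else:
# (service name, location).
ServiceKey = Tuple[str, str]


@dataclass
class Summary:
    """Hypothetical model of a service summary page."""
    name: str
    location: str
    notifications: List[str] = field(default_factory=list)


def build_summary(name: str, location: str,
                  active_alerts: Dict[ServiceKey, List[str]]) -> Summary:
    """Attach any active alerts for this service as notifications."""
    summary = Summary(name, location)
    for message in active_alerts.get((name, location), []):
        summary.notifications.append(f"ALERT: {message}")
    return summary
```

Because the lookup key is the shared service context, the alerting data and the summary page never need to be matched up by hand.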
Console has many additional features that aren’t covered in this article:
Ability to view service specification and deployment status/logs
Configuration view, including when a configuration value was last changed
Kill/Restart individual instances of a service
Ability to schedule service deployments
Service health viewer
Deployment logs for our service
To make sure our investments have paid off, we look at analytics and metrics. When we first launched, Console only had a couple dozen users each month. Now, we’re up to well over 300 engineers per month!
In addition to these metrics, we also conduct periodic surveys and interviews with users to gather direct feedback on the current state of Console. We engage with relevant engineers when we’re considering future features and enhancements we have planned. My team wants to make sure we’re always working on the features that teams and engineers need most.
The two themes of Console moving forward will be integrating more teams’ features into Console and correlating the data that Console already has in useful ways.
Here are a few ideas we’re interested in exploring further in the future:
Currently in beta, the Efficiency Tool measures how efficiently a service is using its resources. It uses a combination of CPU, memory, and other utilization metrics to give an overall score (out of 100) to a service. This will help teams know whether their services are requesting too many resources from the cluster. Metrics like these can also help with auto-scaling, load testing, and capacity planning.
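The article does not say how the score is calculated, so the following is only one plausible shape for such a metric: a weighted average of how much of each requested resource a service actually uses, scaled to 0 through 100. The `efficiency_score` function, its weights, and the capping rule are assumptions, not the Efficiency Tool’s formula.

```python
from typing import Dict, Optional


def efficiency_score(used: Dict[str, float],
                     requested: Dict[str, float],
                     weights: Optional[Dict[str, float]] = None) -> int:
    """Assumed scoring rule: weighted mean utilization, out of 100."""
    weights = weights or {}
    total_weight = 0.0
    weighted_sum = 0.0
    for resource, amount_requested in requested.items():
        if amount_requested <= 0:
            continue  # nothing requested, so nothing to score
        weight = weights.get(resource, 1.0)
        # Cap at 1.0 so bursting past the request cannot inflate the score.
        utilization = min(used.get(resource, 0.0) / amount_requested, 1.0)
        weighted_sum += weight * utilization
        total_weight += weight
    if total_weight == 0:
        return 0
    return round(100 * weighted_sum / total_weight)
```

Under this rule, a service using a quarter of its requested CPU and three quarters of its requested memory would score 50, a hint that its resource requests could probably shrink.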
Console knows who users are (because they have to log in) and which services they’ve looked at recently (so they can quickly navigate back to where they’ve been) but doesn’t do anything else with that data. Personalization, however, could allow Console to immediately show you services that your team owns, along with any messages, alerts, or other issues that are present, and let a user favorite any services or other entities.
Every service at Riot has services that it depends on and, conversely, services that depend on it. With dependency correlation, when a service has an outage or other issue, Console could show users of other services within the affected service’s dependency chain that there’s an active issue. This could help engineers when triaging their own services, as well as allow operations teams to better understand the effects of issues on other services and products at Riot.
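One way to picture dependency correlation is as a graph walk: starting from the failing service, follow the “depends on” edges backwards to collect every service that relies on it, directly or transitively. The sketch below assumes a simple dictionary of declared dependencies. The `affected_services` function and the service names in the usage note are hypothetical, and it covers only downstream dependents, which is one reading of “dependency chain”.

```python
from collections import deque
from typing import Dict, List, Set


def affected_services(depends_on: Dict[str, List[str]],
                      failing: str) -> Set[str]:
    """Return every service that directly or transitively depends on `failing`."""
    # Invert the edges: for each service, who depends on it directly?
    dependents: Dict[str, Set[str]] = {}
    for service, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents.setdefault(dependency, set()).add(service)

    # Breadth-first walk outward from the failing service.
    affected: Set[str] = set()
    queue = deque([failing])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service != failing and service not in affected:
                affected.add(service)
                queue.append(service)
    return affected
```

For example, if a made-up "game.api" depends on opmon.collector and a made-up "store" depends on "game.api", an issue in opmon.collector would flag both of them, while unrelated services stay quiet.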
As you can see, Console has become a highly usable one-stop shop for Riot engineers. Throughout its development we’ve prioritized feedback from engineers and teams across Riot, and as we look to the future, we continue to integrate input from the audience that will use our tools every day. As more teams add features, Console will continue to improve, and we’re invested in ensuring an excellent experience for developers across Riot so they can focus on what they do best.
Thanks for reading! If you have questions or comments, feel free to post them below.