By Nick Gambino
As users, we often evaluate the quality of a service by its uptime. As engineers, we know that 100% uptime is not as simple as it sounds. Here are some strategies for dealing with unexpected outages in an IoT project.
When building an IoT product, fundamentally what you are doing is building infrastructure that allows decentralized systems to pass data back and forth, in addition to the applications that can track and collectively monitor those systems to determine what they are doing. When considering this architecture, there are two main choices you need to make. Are you going to query these systems periodically for their data (polling), or will you be expecting the devices to routinely send their data to a specified location (pushing)?
As with everything in software development, it depends.
Usually, a hybrid solution works best:
IoT systems are an ideal use case for pushing data, since devices are usually purpose-built and need to be efficient with battery and bandwidth. This is especially true for sensors or small devices that need to manage things like power consumption or processing cycles.
But how exactly does this help with device downtime? If a device is down, doesn’t that mean that messages won’t be processed, regardless of how you are getting this data into your system?
The unfortunate truth is that if a device is down, the device is down, and more often than not it will require some sort of on-premise troubleshooting. The best we can do from an application perspective is to accurately determine when a device is down, and then account for that outage accordingly. You can think of tracking when there is “no data” as being just as important as tracking when there “is data” in your application.
One strategy is to keep track of the time intervals between data packets sent from the IoT device. If the time since the last packet exceeds the usual interval by more than a standard deviation, you can assume that either the device is down, or there is something else wrong (maybe the device isn’t executing its data upload correctly). In this scenario, you can then validate whether a device is actually down by querying it, or “pulling” data from it. This pull could give you more information about the state of the device you are tracking.
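As a rough illustration of the interval check, here is a minimal Python sketch. The function name `is_overdue`, the multiplier `k`, and the use of plain timestamps in seconds are assumptions made for this example, not part of any particular library or of a specific production setup.

```python
from statistics import mean, stdev

def is_overdue(arrival_times, now, k=1.0):
    """Return True when the silence since the last packet is unusually long.

    arrival_times: ascending timestamps (seconds) of packets already received.
    now: the current timestamp, in the same units.
    k: how many standard deviations of slack to allow (an assumed default).
    """
    if len(arrival_times) < 3:
        # Fewer than two gaps: not enough history to estimate a deviation.
        return False
    gaps = [later - earlier
            for earlier, later in zip(arrival_times, arrival_times[1:])]
    threshold = mean(gaps) + k * stdev(gaps)
    return (now - arrival_times[-1]) > threshold

# A device that normally reports about once a minute.
arrivals = [0, 60, 121, 180, 241]
on_time = is_overdue(arrivals, now=300)  # 59 seconds of silence
late = is_overdue(arrivals, now=310)     # 69 seconds of silence
```

With a device this regular, a one-deviation threshold is tight, so a larger `k` or a minimum grace period may be worth considering to avoid false alarms.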
Once your system validates the state of the device, you can imagine it triggering some sort of business process to then enter into further troubleshooting.
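One way to picture that validation step is a small classifier that only polls a device after its pushes have gone quiet. This is a sketch under stated assumptions: `DeviceState`, `classify`, and the `poll` callable are hypothetical names, and how a real poll is performed (HTTP, MQTT, a vendor SDK) depends entirely on the device.

```python
from enum import Enum

class DeviceState(Enum):
    HEALTHY = "healthy"
    PUSH_FAILING = "push_failing"  # answers a poll but has stopped pushing
    DOWN = "down"                  # neither pushing nor answering a poll

def classify(overdue, poll):
    """Combine the push signal with an on-demand poll.

    overdue: True when expected pushed data has not arrived in time.
    poll: hypothetical zero-argument callable that returns True when the
          device responds; it may raise OSError on network failure.
    """
    if not overdue:
        return DeviceState.HEALTHY
    try:
        reachable = poll()
    except OSError:
        # Timeouts and refused connections count as "no answer".
        reachable = False
    return DeviceState.PUSH_FAILING if reachable else DeviceState.DOWN

def unreachable_poll():
    # Stand-in for a poll that times out.
    raise TimeoutError("no response from device")

state_ok = classify(False, lambda: True)
state_upload_bug = classify(True, lambda: True)
state_down = classify(True, unreachable_poll)
```

The resulting state can then decide which business process to start, for example dispatching a technician for `DOWN` but opening an engineering ticket for `PUSH_FAILING`.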
The point of building an IoT application is usually to provide some sort of business value, and this can be achieved with an effective monitoring solution. Once you are able to logically analyze data from your devices via pushing and polling strategies, you will need to quickly and accurately send messages to the appropriate people so they can troubleshoot the devices when needed. These messages could be in the form of an email to a sensor technician to be dispatched, a push notification to an end-user to reboot their device, or a Slack notification to an internal engineering troubleshooting channel.
The business logic of exactly how these notifications should go out will be unique to each application’s needs, but they will all need to be fast, accurate, and well documented, so that stakeholders can have visibility into how and why each notification went out.
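To make the “how and why” of each notification visible, one option is to record every routing decision alongside its reason. The sketch below is illustrative only: the `ROUTES` table, the recipient names, and the `Notifier` class are hypothetical, and the actual delivery (email, push, Slack) is deliberately left out.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical routing table: device state -> (channel, recipient).
ROUTES = {
    "down": ("email", "sensor-technician"),
    "push_failing": ("push", "device-owner"),
    "unknown": ("slack", "#iot-troubleshooting"),
}

@dataclass
class Notifier:
    audit_log: list = field(default_factory=list)

    def notify(self, device_id, state, reason):
        """Pick a route for the state and record why the message went out."""
        channel, recipient = ROUTES.get(state, ROUTES["unknown"])
        record = {
            "device_id": device_id,
            "state": state,
            "channel": channel,
            "recipient": recipient,
            "reason": reason,
            "sent_at": datetime.now(timezone.utc).isoformat(),
        }
        # Real delivery would happen here; this sketch only keeps the record.
        self.audit_log.append(record)
        return record

notifier = Notifier()
first = notifier.notify("sensor-17", "down", "no push for 10 min, poll timed out")
second = notifier.notify("sensor-42", "rebooting", "state not in routing table")
```

Keeping the reason next to the route means a stakeholder can later answer “why did this technician get paged?” from the log alone.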
There are several great monitoring solutions out there, but here are some to consider:
Message Queues provide a lightweight way to build a publish/subscribe architecture for your IoT system. They are a way to collect “push data” from your IoT sensors and process it efficiently. Message queues are a component of what’s called a Message Broker; Brokers “validate, store, route, and deliver messages to the appropriate destinations”. The Queue is the actual component that stores the data packets for retrieval. Think of the Broker as a traffic cop, while the Queues are the stop signs. Here’s an interesting article by IBM that does a great job of explaining different kinds of Message Brokers.
It’s inevitable that eventually, applications and devices will crash. With an IoT project, this is particularly problematic, as you are often relying on a constant stream of data (or lack of data) from your IoT sensors. Message Brokers, like RabbitMQ, AWS SQS, GCP Pub/Sub, or Azure Service Bus, provide temporary storage for these data payloads, either for when you must prioritize other system processes or as a buffer for whenever your application is busy or disconnected. When your application is restored, it can then read the pending data from the message queue and backfill its storage where needed.
Many of these solutions we mentioned behave in similar ways, but here’s a quick rundown of some of the things you might consider:
A message broker fits into an IoT system’s architecture as follows. A message producer (say, an IoT device) drops a data payload onto the message queue, where the broker then orders this data in a FIFO (first-in-first-out) manner. Each consumer application then reads from the message queue, rather than directly from the IoT device. Besides ensuring that payloads are processed in the correct order, this means that should any consumer application fail, it can simply read the data from the message broker in order to backfill anything that was missed during the outage.
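The flow can be sketched with a tiny in-memory queue. This is only a stand-in to show the FIFO and backfill behavior; `InMemoryQueue` and the sample readings are made up for the example, and a real deployment would use a broker such as the ones named earlier, each with its own client API.

```python
from collections import deque

class InMemoryQueue:
    """Toy FIFO queue standing in for a real message broker."""

    def __init__(self):
        self._messages = deque()

    def publish(self, payload):
        # Producers (devices) append to the tail of the queue.
        self._messages.append(payload)

    def drain(self):
        # Consumers read from the head, so the oldest payload comes first.
        while self._messages:
            yield self._messages.popleft()

queue = InMemoryQueue()
storage = []  # the consumer application's own data store

# The device keeps publishing while the consumer application is offline.
for reading in (
    {"seq": 1, "temp_c": 21.5},
    {"seq": 2, "temp_c": 21.7},
    {"seq": 3, "temp_c": 22.0},
):
    queue.publish(reading)

# Once the consumer is restored, it backfills everything it missed, in order.
storage.extend(queue.drain())
```

Because the device only ever talks to the queue, the consumer can be down for a while without the device needing to know or retry.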
The unfortunate truth is that when a device is down, it won’t be communicating data as you need it to. The best an engineering team can do is to design a system that can accommodate these outages in the best way possible. These are some design choices that we have made in order to ensure that our systems provide reliable and actionable streams of data, even when there are outages in applications and sensors.
Ultimately, the system should be able to provide business value even when parts of it are down. Outages are an inevitable part of any IoT system, and being able to navigate them effectively will often result in a successful and valuable system for the end user.