Doing random things over at http://musteat.org
205 stories
·
4 followers

Stack

3 Comments and 13 Shares
Gotta feel kind of bad for nation-state hackers who spend years implanting and cultivating some hardware exploit, only to discover the entire target database is already exposed to anyone with a web browser.
Read the whole story
smarkwell
4 days ago
reply
Share this story
Delete
3 public comments
tante
4 days ago
reply
The modern tech stack
Oldenburg/Germany
jepler
4 days ago
reply
#toomuchtruth
Earth, Sol system, Western spiral arm
alt_text_bot
4 days ago
reply
Gotta feel kind of bad for nation-state hackers who spend years implanting and cultivating some hardware exploit, only to discover the entire target database is already exposed to anyone with a web browser.

Comments

3 Comments and 10 Shares
NPR encourages you to add comments to their stories using the page inspector in your browser's developer tools. Note: Your comments are visible only to you, and will be lost when you refresh the page.
Read the whole story
smarkwell
20 days ago
reply
Share this story
Delete
3 public comments
srsly
20 days ago
reply
here you go have fun: https://xkcd.com/1019/
Atlanta, Georgia
mcarson
20 days ago
reply
I really hope Randall let's us know how many hits 1019 gets over the next little while.
alt_text_bot
20 days ago
reply
NPR encourages you to add comments to their stories using the page inspector in your browser's developer tools. Note: Your comments are visible only to you, and will be lost when you refresh the page.

Tarn Adams

1 Share

Tarn Adams

Who are you, and what do you do?

I'm Tarn Adams, and I make computer games with my brother. Bay 12 Games is our website, and the game we are most known for is Dwarf Fortress.

What hardware do you use?

The computer I do everything on is my Toshiba Satellite P55T-C5114 laptop, and I also have a Mac Mini which I use for porting Dwarf Fortress. I have an Insignia mouse I can attach to the laptop when I get tired of using the little finger mousepad.. so it's always plugged in!

When I need to talk to somebody online or record something, I have a Turtle Beach headset. My most recent cellphone is a iPhone 8, which I got after dipping my Samsung in the Pacific Ocean by accident.

I have a PS4 and a Switch.

And what software?

My main operating system is Windows 10, and I use Mac OS X and Ubuntu for porting. All of my game programming is done in C++ using Microsoft Visual Studio Community, though I have some helpful utilities that use Python 2.5. I gather there are very newer Pythons, but 2.5 is the most recent compatible one with the scripts. I use OpenGL and SDL for graphics -- the upside is their portability. I use an older free version of FMOD for sound, though that'll change now that Dwarf Fortress is going to Steam and itch.

I do all of my text editing in WordPad and Notepad, depending on how ugly the line breaks look when I open up the file. I use Paint to crop screenshots and whatever else I need to do.

I mostly do interviews and things over Skype, though they have other sites they use sometimes now. I use Audacity and Open Broadcaster when I need to work with audio or video, though that hardly ever comes up for work these days.

For the website, I use the pad programs to do all the HTML. The website has a Simple Machines Forum as well as a Mantis bug tracker to track all of the issues with the game. I move things to and from the website with FileZilla. I currently host Bay 12 Games with Linode and Lunar Pages. I use PuTTY to log in to and mess with my servers.

I listen to music pretty much non-stop, and it'll either be on Spotify or using Winamp.

I use Gmail and the 7-Zip compression utility.

What would be your dream setup?

I always end up getting fairly inexpensive replacement computers and not really missing anything, so I don't think I have a dream hardware setup. The only thing I find myself thinking about at times is a really good music sequencer. For a brief time some years ago, I had Ableton and was super happy with it, but I never ended up using my tracks for anything and didn't renew my license when that old computer died.

Read the whole story
smarkwell
29 days ago
reply
Share this story
Delete

Avalanche Noise Generator Notes

1 Comment and 2 Shares

Good sources of entropy (noise) are an essential part of modern cryptographic systems. I designed a mobile-friendly avalanche noise generator as part of the background work I’ve been doing for the betrusted project (more on that project later). I had to do a new design because the existing open-source ones I could find were too large and power hungry to integrate into a mobile device. I also found it hard to find solid theory pieces on avalanche noise generators, so in the process of researching this I wrote up all my notes in case someone needs to do a ground-up redesign of the system again in the future.

Here’s an excerpt from the notes:

Avalanche breakdown is essentially a miniature particle accelerator, where electrons that enter a PN junction’s depletion region (through mechanisms that include thermal noise) are accelerated across an electrical field, to the point where new electron-hole pairs are generated when these high-energy electrons collide with atoms in the depletion region, creating an amplification cascade with low reproducibility.

An approximate analogy is an inflatable pool filled with water. The height of the pool is the potential barrier of the reverse-biased PN junction. A hose feeding water into the pool represents a constant current of electrons. The volume of the pool can be thought of as the depletion capacitance, that is, the capacitor created by the region of the junction that is void of carriers due to natural drift and diffusion effects. As water trickles into the pool, the water level rises and eventually forms a meniscus. Random disturbances, such as ripples on the surface due to wind, eventually cause the meniscus to crest over the edge of the pool. The water flowing over the edge pushes down on the inflatable pool’s side, causing more water to flow, until the level has reduced to a point where the inflatable pool’s side can snap back into its original shape, thus restarting the cycle of filling, cresting, and breakdown. The unpredictability of when and where the breakdown might happen, and how much water flows out during the event, is analogous to the entropy generated by the avalanche effect in a PN junction.

The electrical characteristic of avalanche noise biased by a constant current source is a “sawtooth” waveform: a linear ramp up in voltage as the capacitance of the depletion region charges to the point where the electric field becomes large enough to initiate the cascade, and then a sharp drop off in voltage as the cascade rapidly discharges the junction capacitance. The cascade then abruptly halts once the field is no longer strong enough to sustain the cascade effect, leading to a subsequent cycle of charging and breakdown.

The site also includes detailed schematics and measurement results, such as this one.

The final optimized design takes <1cm^2 area and draws 520uA at 3.3V when active and 12uA in standby (mostly 1.8V LDO leakage for the output stage, included in the measurement but normally provided by the system), and it passes preliminary functional tests from 2.8-4.4V and 0-80C. The output levels target a 0-1V swing, meant to be sampled using an on-chip ADC from a companion MCU, but one could add a comparator and turn it into a digital-compatible bitstream I suppose. I opted to use an actual diode instead of a NPN B-E junction, because the noise quality is empirically better and anecdotes on the Internet claim the NPN B-E junctions fail over time when operated as noise sources. I’ll probably go through another iteration of tweaking before final integration, but afaik this is the smallest, lowest power open-source avalanche noise generator to date (slightly smaller than this one).

Read the whole story
smarkwell
65 days ago
reply
Share this story
Delete
1 public comment
jepler
68 days ago
reply
I made an avalanche noise RNG way back in the day, so this is fascinating for me.
Earth, Sol system, Western spiral arm

Supply Chain Security Talk

1 Comment and 5 Shares

I recently gave an invited talk about supply chain security at BlueHat IL 2019. I was a bit surprised at the level of interest it received, so I thought I’d share it here for people who might have missed it.

In the talk, I relay some of my personal trials authenticating my supply chains, then I go into the why of the supply chain attacks to establish some scenarios for evaluating different approaches. The talk attempts to broadly categorize the space of possible attacks, ranging from attacks that cost a penny and a few seconds to pull off to hundreds of thousands of dollars and months. Finally, I try to outline the depth of the supply chain attack surface, highlighting the overall TOCTOU (time of check, time of use) problem that is the supply chain.

The main insight is that transparency or openness of design by itself does little to secure a supply chain, because the entire situation is one huge TOCTOU problem. Checking hardware design files, locking down the assembly line, and Fedexing the product to your office is like hashing and signing your source code, running it through a trusted compiler, and then sending the binary unencrypted over the Internet and trusting it because it was “thoroughly checked”.

The inverse analysis is equally daunting: in software, one may copy each binary into RAM, hash and check its cryptographic signature, and run it only if it checks out. For hardware, there is no equivalent of “hash this instance of hardware and check its cryptographic signature” before use: “hashing” hardware involves taking it apart and inspecting every transistor and wire, which is both impractical and likely to render the hardware non-functional.

Thus while open source hardware does engender some benefits for security (such as disclosing μ-state for Spectre side-channel analysis and ensuring no backdoors due to design oversight), it addresses a separate problem domain from supply chain attacks. While an open source hardware phone is arguably more trustable than a closed source one, open source is necessary but not sufficient for it to be trusted.

I do have some ideas on the practical mitigation of supply chain attacks, but they are still a bit too green to blog about. Stay tuned…

Read the whole story
smarkwell
111 days ago
reply
denubis
118 days ago
reply
Sydney, Australia
Share this story
Delete
1 public comment
StunGod
116 days ago
reply
I recently had a discussion about this with an industrial manufacturer. Bunnie is a smart dude, and his approach is spot-on.
Portland, Oregon, USA, Earth

The log/event processing pipeline you can't have

1 Share

Let me tell you about the still-not-defunct real-time log processing pipeline we built at my now-defunct last job. It handled logs from a large number of embedded devices that our ISP operated on behalf of residential customers. (I wrote and presented previously about some of the cool wifi diagnostics that were possible with this data set.)

Lately, I've had a surprisingly large number of conversations about logs processing pipelines. I can find probably 10+ already-funded, seemingly successful startups processing logs, and the Big Name Cloud providers all have some kind of logs thingy, but still, people are not satisfied. It's expensive and slow. And if you complain, you mostly get told that you shouldn't be using unstructured logs anyway, you should be using event streams.

That advice is not wrong, but it's incomplete.

Instead of doing a survey of the whole unhappy landscape, let's just ignore what other people suffer with and talk about what does work. You can probably find, somewhere, something similar to each of the components I'm going to talk about, but you probably can't find a single solution that combines it all with good performance and super-low latency for a reasonable price. At least, I haven't found it. I was a little surprised by this, because I didn't think we were doing anything all that innovative. Apparently I was incorrect.

The big picture

Let's get started. Here's a handy diagram of all the parts we're going to talk about:

The ISP where I worked has a bunch of embedded Linux devices (routers, firewalls, wifi access points, and so on) that we wanted to monitor. The number increased rapidly over time, but let's talk about a nice round number, like 100,000 of them. Initially there were zero, then maybe 10 in our development lab, and eventually we hit 100,000, and later there were many more than that. Whatever. Let's work with 100,000. But keep in mind that this architecture works pretty much the same with any number of devices.

(It's a "distributed system" in the sense of scalability, but it's also the simplest thing that really works for any number of devices more than a handful, which makes it different from many "distributed systems" where you could have solved the problem much more simply if you didn't care about scaling. Since our logs are coming from multiple sources, we can't make it non-distributed, but we can try to minimize the number of parts that have to deal with the extra complexity.)

Now, these are devices we're monitoring, not apps or services or containers or whatever. That means two things: we had to deal with lots of weird problems (like compiler/kernel bugs and hardware failures), and most of the software was off-the-shelf OS stuff we couldn't easily control (or didn't want to rewrite).

(Here's the good news: because embedded devices have all the problems from top to bottom, any solution that works for my masses of embedded devices will work for any other log-pipeline problem you might have. If you're lucky, you can just leave out some parts.)

That means the debate about "events" vs "logs" was kind of moot. We didn't control all the parts in our system, so telling us to forget logs and use only structured events doesn't help. udhcpd produces messages the way it wants to produce messages, and that's life. Sometimes the kernel panics and prints whatever it wants to print, and that's life. Move on.

Of course, we also had our own apps, which means we could also produce our own structured events when it was relevant to our own apps. Our team had whole never-ending debates about which is better, logs or events, structured or unstructured. In fact, in a move only overfunded megacorporations can afford, we actually implemented both and ran them both for a long time.

Thus, I can now tell you the final true answer, once and for all: you want structured events in your database.

...but you need to be able to produce them from unstructured logs. And once you can do that, exactly how those structured events are produced (either from logs or directly from structured trace output) turns out to be irrelevant.

But we're getting ahead of ourselves a bit. Let's take our flow diagram, one part at a time, from left to right.

Userspace and kernel messages, in a single stream

Some people who have been hacking on Linux for a while may know about /proc/kmsg: that's the file good old (pre-systemd) klogd reads kernel messages from, and pumps them to syslogd, which saves them to a file. Nowadays systemd does roughly the same thing but with more d-bus and more corrupted binary log files. Ahem. Anyway. When you run the dmesg command, it reads the same messages (in a slightly different way).

What you might not know is that you can go the other direction. There's a file called /dev/kmsg (note: /dev and not /proc) which, if you write to it, produces messages into the kernel's buffer. Let's do that! For all our messages!

Wait, what? Am I crazy? Why do that?

Because we want strict sequencing of log messages between programs. And we want that even if your kernel panics.

Imagine you have, say, a TV DVR running on an embedded Linux system, and whenever you go to play a particular recorded video, the kernel panics because your chipset vendor hates you. Hypothetically. (The feeling is, hypothetically, mutual.) Ideally, you would like your logs to contain a note that the user requested the video, the video is about to start playing, we've opened the file, we're about to start streaming the file to the proprietary and very buggy (hypothetical) video decoder... boom. Panic.

What now? Well, if you're writing the log messages to disk, the joke's on you, because I bet you didn't fsync() after each one. (Once upon a time, syslogd actually did fsync() after each one. It was insanely disk-grindy and had very low throughput. Those days are gone.) Moreover, a kernel panic kills the disk driver, so you have no chance to fsync() it after the panic, unless you engage one of the more terrifying hacks like, after a panic, booting into a secondary kernel whose only job is to stream the message buffer into a file, hoping desperately that the disk driver isn't the thing that panicked, that the disk itself hasn't fried, and that even if you do manage to write to some disk blocks, they are the right ones because your filesystem data structure is reasonably intact.

(I suddenly feel a lot of pity for myself after reading that paragraph. I think I am more scars than person at this point.)

ANYWAY

The kernel log buffer is in a fixed-size memory buffer in RAM. It defaults to being kinda small (tends or hundreds of kBytes), but you can make it bigger if you want. I suggest you do so.

By itself, this won't solve your kernel panic problems, because RAM is even more volatile than disk, and you have to reboot after a kernel panic. So the RAM is gone, right?

Well, no. Sort of. Not exactly.

Once upon a time, your PC BIOS would go through all your RAM at boot time and run a memory test. I remember my ancient 386DX PC used to do this with my amazingly robust and life-changing 4MB of RAM. It took quite a while. You could press ESC to skip it if you were a valiant risk-taking rebel like myself.

Now, memory is a lot faster than it used to be, but unfortunately it has gotten bigger more quickly than it has gotten faster, especially if you disable memory caching, which you certainly must do at boot time in order to write the very specific patterns needed to see if there are any bit errors.

So... we don't. That ended years ago. If you reboot your system, the memory mostly will contain the stuff it contained before you rebooted. The OS kernel has to know that and zero out pages as they get used. (Sometimes the kernel gets fancy and pre-zeroes some extra pages when it's not busy, so it can hand out zero pages more quickly on demand. But it always has to zero them.)

So, the pages are still around when the system reboots. What we want to happen is:

  1. The system reboots automatically after a kernel panic. You can do this by giving your kernel a boot parameter like "panic=1", which reboots it after one second. (This is not nearly enough time for an end user to read and contemplate the panic message. That's fine, because a) on a desktop PC, X11 will have crashed in graphics mode so you can't see the panic message anyway, and b) on an embedded system there is usually no display to put the message on. End users don't care about panic messages. Our job is to reboot, ASAP, so they don't try to "help" by power cycling the device, which really does lose your memory.) (Advanced users will make it reboot after zero seconds. I think panic=0 disables the reboot feature rather than doing that, so you might have to patch the kernel. I forget. We did it, whatever it was.)

  2. The kernel always initializes the dmesg buffer in the same spot in RAM.

  3. The kernel notices that a previous dmesg buffer is already in that spot in RAM (because of a valid signature or checksum or whatever) and decides to append to that buffer instead of starting fresh.

  4. In userspace, we pick up log processing where we left off. We can capture the log messages starting before (and therefore including) the panic!

  5. And because we redirected userspace logs to the kernel message buffer, we have also preserved the exact sequence of events that led up to the panic.

If you want all this to happen, I have good news and bad news. The good news is we open sourced all our code; the bad news is it didn't get upstreamed anywhere so there are no batteries included and no documentation and it probably doesn't quite work for your use case. Sorry.

Open source code:

  • logos tool for sending userspace logs to /dev/klogd. (It's logs... for the OS.. and it's logical... and it brings your logs back from the dead after a reboot... get it? No? Oh well.) This includes two per-app token buckets (burst and long-term) so that an out-of-control app won't overfill the limited amount of dmesg space.

  • PRINTK_PERSIST patch to make Linux reuse the dmesg buffer across reboots.

Even if you don't do any of the rest of this, everybody should use PRINTK_PERSIST on every computer, virtual or physical. Seriously. It's so good.

(Note: room for improvement: it would be better if we could just redirect app stdout/stderr directly to /dev/kmsg, but that doesn't work as well as we want. First, it doesn't auto-prefix incoming messages with the app name. Second, libc functions like printf() actually write a few bytes at a time, not one message per write() call, so they would end up producing more than one dmesg entry per line. Third, /dev/kmsg doesn't support the token bucket rate control that logos does, which turns out to be essential, because sometimes apps go crazy. So we'd have to further extend the kernel API to make it work. It would be worthwhile, though, because the extra userspace process causes an unavoidable delay between when a userspace program prints something and when it actually gets into the kernel log. That delay is enough time for a kernel to panic, and the userspace message gets lost. Writing directly to /dev/kmsg would take less CPU, leave userspace latency unchanged, and ensure the message is safely written before continuing. Someday!)

(In related news, all of syslogd is kinda extraneous for this reason. So is whatever systemd does. Why do we make everything so complicated? Just write directly to files or the kernel log buffer. It's cheap and easy.)

Uploading the logs

Next, we need to get the messages out of the kernel log buffer and into our log processing server, wherever that might be.

(Note: if we do the above trick - writing userspace messages to the kernel buffer - then we can't also use klogd to read them back into syslogd. That would create an infinite loop, and would end badly. Ask me how I know.)

So, no klogd -> syslogd -> file. Instead, we have something like syslogd -> kmsg -> uploader or app -> kmsg -> uploader.

What is a log uploader? Well, it's a thing that reads messages from the kernel kmsg buffer as they arrive, and uploads them to a server, perhaps over https. It might be almost as simple as "dmesg | curl", like my original prototype, but we can get a bit fancier:

  • Figure out which messages we've already uploaded (eg. from the persistent buffer before we rebooted) and don't upload those again.

  • Log the current wall-clock time before uploading, giving us sync points between monotonic time (/dev/kmsg logs "microseconds since boot" by default, which is very useful, but we also want to be able to correlate that with "real" time so we can match messages between related machines).

  • Compress the file on the way out.

  • Somehow authenticate with the log server.

  • Bonus: if the log server is unavailable because of a network partition, try to keep around the last few messages from before the partition, as well as the recent messages once the partition is restored. If the network partition was caused by the client - not too rare if you, like us, were in the business of making routers and wifi access points - you really would like to see the messages from right before the connectivity loss.

Luckily for you, we also open sourced our code for this. It's in C so it's very small and low-overhead. We never quite got the code for the "bonus" feature working quite right, though; we kinda got interrupted at the last minute.

Open source code:

  • loguploader C client, including an rsyslog plugin for Debian in case you don't want to use the /dev/kmsg trick.

  • devcert, a tool (and Debian package) which auto-generates a self signed "device certificate" wherever it's installed. The device certificate is used by a device (or VM, container, whatever) to identify itself to the log server, which can then decide how to classify and store (or reject) its logs.

One thing we unfortunately didn't get around to doing was modifying the logupload client to stream logs to the server. This is possible using HTTP POST and Chunked encoding, but our server at the time was unable to accept streaming POST requests due to (I think now fixed) infrastructure limitations.

(Note: if you write load balancing proxy servers or HTTP server frameworks, make sure they can start processing a POST request as soon as all the headers have arrived, rather than waiting for the entire blob to be complete! Then a log upload server can just stream the bytes straight to the next stage even before the whole request has finished.)

Because we lacked streaming in the client, we had to upload chunks of log periodically, which leads to a tradeoff about what makes a good upload period. We eventually settled on about 60 seconds, which ended up accounting for almost all the end-to-end latency from message generation to our monitoring console.

Most people probably think 60 seconds is not too bad. But thanks to the awesome team I was working with, we managed to squeeze all the other pipeline phases down to tens of milliseconds in total. So the remaining 60 seconds (technically: anywhere from 0 to 60 seconds after a message was produced) was kinda embarrassing.

The log receiver

So okay, we're uploading the logs from client to some kind of server. What does the server do?

This part is both the easiest and the most reliability-critical. The job is this: receive an HTTP POST request, write the POST data to a file, and return HTTP 200 OK. Anybody who has any server-side experience at all can write this in their preferred language in about 10 minutes.

We intentionally want to make this phase as absolutely simplistic as possible. This is the phase that accepts logs from the limited-size kmsg buffer on the client and puts them somewhere persistent. It's nice to have real-time alerts, but if I have to choose between somewhat delayed alerts or randomly losing log messages when things get ugly, I'll have to accept the delayed alerts. Don't lose log messages! You'll regret it.

The best way to not lose messages is to minimize the work done by your log receiver. So we did. It receives the uploaded log file chunk and appends it to a file, and that's it. The "file" is actually in a cloud storage system that's more-or-less like S3. When I explained this to someone, they asked why we didn't put it in a Bigtable-like thing or some other database, because isn't a filesystem kinda cheesy? No, it's not cheesy, it's simple. Simple things don't break. Our friends on the "let's use structured events to make metrics" team streamed those events straight into a database, and it broke all the time, because databases have configuration options and you inevitably set those options wrong, and it'll fall over under heavy load, and you won't find out until you're right in the middle of an emergency and you really want to see those logs. Or events.

Of course, the file storage service we used was encrypted-at-rest, heavily audited, and auto-deleted files after N days. When you're a megacorporation, you have whole teams of people dedicated to making sure you don't screw this up. They will find you. Best not to annoy them.

We had to add one extra feature, which was authentication. It's not okay for random people on the Internet to be able to impersonate your devices and spam your logs - at least without putting some work into it. For device authentication, we used the rarely-used HTTP client-side certificates option and the devcert program (linked above) so that the client and server could mutually authenticate each other. The server didn't check the certificates against a certification authority (CA), like web clients usually do; instead, it had a database with a whitelist of exactly which certs we're allowing today. So in case someone stole a device cert and started screwing around, we could remove their cert from the whitelist and not worry about CRL bugs and latencies and whatnot.

Unfortunately, because our log receiver was an internal app relying on internal infrastructure, it wasn't open sourced. But there really wasn't much there, honest. The first one was written in maybe 150 lines of python, and the replacement was rewritten in slightly more lines of Go. No problem.

Retries and floods

Of course, things don't always go smoothly. If you're an ISP, the least easy thing is dealing with cases where a whole neighbourhood gets disconnected, either because of a power loss or because someone cut the fiber Internet feed to the neighbourhood.

Now, disconnections are not such a big deal for logs processing - you don't have any. But reconnection is a really big deal. Now you have tens or hundreds of thousands of your devices coming back online at once, and a) they have accumulated a lot more log messages than they usually do, since they couldn't upload them, and b) they all want to talk to your server at the same time. Uh oh.

Luckily, our system was designed carefully (uh... eventually it was), so it could handle these situations pretty smoothly:

  1. The log uploader uses a backoff timer so that if it's been trying to upload for a while, it uploads less often. (However, the backoff timer was limited to no more than the usual inter-upload interval. I don't know why more people don't do this. It's rather silly for your system to wait longer between uploads in a failure situation than it would in a success situation. This is especially true with logs, where when things come back online, you want a status update now. And clearly your servers have enough capacity to handle uploads at the usual rate, because they usually don't crash. Sorry if I sound defensive here, but I had to have this argument a few times with a few SREs. I understand why limiting the backoff period isn't always the right move. It's the right move here.)

  2. Less obviously, even under normal conditions, the log uploader uses a randomized interval between uploads. This avoids traffic spikes where, after the Internet comes back online, everybody uploads again exactly 60 seconds later, and so on.

  3. The log upload client understands the idea that the server can't accept its request right now. It has to, anyway, because if the Internet goes down, there's no server. So it treats server errors exactly like it treats lack of connectivity. And luckily, log uploading is not really an "interactive" priority task, so it's okay to sacrifice latency when things get bad. Users won't notice. And apparently our network is down, so the admins already noticed.

  4. The /dev/kmsg buffer was configured for the longest reasonable outage we could expect, so that it wouldn't overflow during "typical" downtime. Of course, there's a judgement call here. But the truth is, if you're having system-wide downtime, what the individual devices were doing during that downtime is not usually what you care about. So you only need to handle, say, the 90th percentile of downtime. Ignore the black swans for once.

  5. The log receiver aggressively rejects requests that come faster than its ability to write files to disk. Since the clients know how to retry with a delay, this allows us to smooth out bursty traffic without needing to either over-provision the servers or lose log messages.

    (Pro tip: if you're writing a log receiver in Go, don't do the obvious thing and fire off a goroutine for every incoming request. You'll run out of memory. Define a maximum number of threads you're willing to handle at once, and limit your request handling to that. It's okay to set this value low, just to be safe: remember, the uploader clients will come back later.)

Okay! Now our (unstructured) logs from all our 100,000 devices are sitting safely in a big distributed filesystem. We have a little load-balanced, multi-homed cluster of log receivers accepting the uploads, and they're so simple that they should pretty much never die, and even if they do because we did something dumb (treacherous, treacherous goroutines!), the clients will try again.

What might not be obvious is this: our reliability, persistence, and scaling problems are solved. Or rather, as long as we have enough log receiver instances to handle all our devices, and enough disk quota to store all our logs, we will never again lose a log message.

That means the rest of our pipeline can be best-effort, complicated, and frequently exploding. And that's a good thing, because we're going to start using more off-the-shelf stuff, we're going to let random developers reconfigure the filtering rules, and we're not going to bother to configure it with any redundancy.

Grinding the logs

The next step is to take our unstructured logs and try to understand them. In other words, we want to add some structure. Basically we want to look for lines that are "interesting" and parse out the "interesting" data and produce a stream of events, each with a set of labels describing what categories they apply to.

Note that, other than this phase, there is little difference between how you'd design a structured event reporting pipeline and a log pipeline. You still need to collect the events. You still (if you're like me) need to persist your events across kernel panics. You still need to retry uploading them if your network gets partitioned. You still need the receivers to handle overloading, burstiness, and retries. You still would like to stream them (if your infrastructure can handle it) rather than uploading every 60 seconds. You still want to be able to handle a high volume of them. You're just uploading a structured blob instead of an unstructured blob.

Okay. Fine. If you want to upload structured blobs, go for it. It's just an HTTP POST that appends to a file. Nobody's stopping you. Just please try to follow my advice when designing the parts of the pipeline before and after this phase, because otherwise I guarantee you'll be sad eventually.

Anyway, if you're staying with me, now we have to parse our unstructured logs. What's really cool - what makes this a killer design compared to starting with structured events in the first place - is that we can, at any time, change our minds about how to parse the logs, without redeploying all the software that produces them.

This turns out to be amazingly handy. It's so amazingly handy that nobody believes me. Even I didn't believe me until I experienced it; I was sure, in the beginning, that the unstructured logs were only temporary and we'd initially use them to figure out what structured events we wanted to record, and then modify the software to send those, then phase out the logs over time. This never happened. We never settled down. Every week, or at least every month, there was some new problem which the existing "structured" events weren't configured to catch, but which, upon investigating, we realized we could diagnose and measure from the existing log message stream. And so we did!

Now, I have to put this in perspective. Someone probably told you that log messages are too slow, or too big, or too hard to read, or too hard to use, or you should use them while debugging and then delete them. All those people were living in the past and they didn't have our fancy log pipeline. Computers are really, really fast now. Storage is really, really cheap.

So we let it all out. Our devices produced an average of 50 MB of (uncompressed) logs per day, each. For the baseline 100,000 devices that we discussed above, that's about 5TB of logs per day. Ignoring compression, how much does it cost to store, say, 60 days of logs in S3 at 5TB per day? "Who cares," that's how much. You're amortizing it over 100,000 devices. Heck, a lot of those devices were DVRs, each with 2TB of storage. With 100,000 DVRs, that's 200,000 TB of storage. Another 300 is literally a rounding error (like, smaller than if I can't remember if it's really 2TB or 2TiB or what).

Our systems barfed up logs vigorously and continuously, like a non-drunken non-sailor with seasickness. And it was beautiful.

(By the way, now would be a good time to mention some things we didn't log: personally identifiable information or information about people's Internet usage habits. These were diagnostic logs for running the network and detecting hardware/software failures. We didn't track what you did with the network. That was an intentional decision from day 1.)

(Also, this is why I think all those log processing services are so badly overpriced. I wanna store 50 MB per device, for lots of devices. I need to pay S3 rates for that, not a million dollars a gigabyte. If I have to overpay for storage, I'll have to start writing fewer logs. I love my logs. I need my logs. I know you're just storing it in S3 anyway. Let's be realistic.)

But the grinding, though

Oh right. So the big box labeled "Grinder" in my diagram was, in fact, just one single virtual machine, for a long time. It lasted like that for much longer than we expected.

Whoa, how is that possible, you ask?

Well, at 5TB per day per 100,000 devices, that's an average of 57 MBytes per second. And remember, burstiness has already been absorbed by our carefully written log receivers and clients, so we'll just grind these logs as fast as they arrive or as fast as we can, and if there are fluctuations, they'll average out. Admittedly, some parts of the day are busier than others. Let's say 80 MBytes per second at peak.

80 MBytes per second? My laptop can do that on its spinning disk. I don't even need an SSD! 80 MBytes per second is a toy.

And of course, it's not just one spinning disk. The data itself is stored on some fancy heavily-engineered distributed filesystem that I didn't have to design. Assuming there are no, er, collossal, failures in provisioning (no comment), there's no reason we shouldn't be able to read files at a rate that saturates the network interface available to our machine. Surely that's at least 10 Gbps (~1 GByte/sec) nowadays, which is 12.5 of those. 1.25 million devices, all processed by a single grinder.

Of course you'll probably need to use a few CPU cores. And the more work you do per log entry, the slower it'll get. But these estimates aren't too far off what we could handle.

And yeah, sometimes that VM gets randomly killed by the cluster's Star Trek-esque hive mind for no reason. It doesn't matter, because the input data was already persisted by the log receivers. Just start a new grinder and pick up where you left off. You'll have to be able to handle process restarts no matter what. And that's a lot easier than trying to make a distributed system you didn't need.

As for what the grinder actually does? Anything you want. But it's basically the "map" phase in a mapreduce. It reads the data in one side, does some stuff to it, and writes out postprocessed stuff on the other side. Use your imagination. And if you want to write more kinds of mappers, you can run them, either alongside the original Grinder or downstream from it.

Our Grinder mostly just ran regexes and put out structures (technically protobufs) that were basically sets of key-value pairs.

(For some reason, when I search the Internet for "streaming mapreduce," I don't get programs that do this real-time processing of lots of files as they get written. Instead, I seem to get batch-oriented mapreduce clones that happen to read from stdin, which is a stream. I guess. But... well, now you've wasted some perfectly good words that could have meant something. So okay, too bad, it's a Grinder. Sue me.)

Reducers and Indexers

Once you have a bunch of structured events... well, I'm not going to explain that in a lot of detail, because it's been written about a lot.

You probably want to aggregate them a bit - eg. to count up reboots across multiple devices, rather than storing each event for each device separately - and dump them into a time-series database. Perhaps you want to save and postprocess the results in a monitoring system named after Queen Elizabeth or her pet butterfly. Whatever. Plug in your favourite.

What you probably think you want to do, but it turns out you rarely need, is full-text indexing. People just don't grep the logs across 100,000 devices all that often. I mean, it's kinda nice to have. But it doesn't have to be instantaneous. You can plug in your favourite full text indexer if you like. But most of the time, just an occasional big parallel grep (perhaps using your favourite mapreduce clone or something more modern... or possibly just using grep) of a subset of the logs is sufficient.

(If you don't have too many devices, even a serial grep can be fine. Remember, a decent cloud computer should be able to read through ~1 GByte/sec, no problem. How much are you paying for someone to run some bloaty full-text indexer on all your logs, to save a few milliseconds per grep?)

I mean, run a full text indexer if you want. The files are right there. Don't let me stop you.

On the other hand, being able to retrieve the exact series of logs - let's call it the "narrative" - from a particular time period across a subset of devices turns out to be super useful. A mini-indexer that just remembers which logs from which devices ended up in which files at which offsets is nice to have. Someone else on our team built one of those eventually (once we grew so much that our parallel grep started taking minutes instead of seconds), and it was very nice.

And then you can build your dashboards

Once you've reduced, aggregated, and indexed your events into your favourite output files and databases, you can read those databases to build very fast-running dashboards. They're fast because the data has been preprocessed in mostly-real time.

As I mentioned above, we had our pipeline reading the input files as fast as they could come in, so the receive+grind+reduce+index phase only took a few tens of milliseconds. If your pipeline isn't that fast, ask somebody why. I bet their program is written in java and/or has a lot of sleep() statements.

Again here, I'm not going to recommend a dashboard tool. There are millions of articles and blog posts about that. Pick one, or many.

In conclusion

Please, please, steal these ideas. Make your log and event processing as stable as our small team made our log processing. Don't fight over structured vs unstructured; if you can't agree, just log them both.

Don't put up with weird lags and limits in your infrastructure. We made 50MB/day/device work for a lot of devices, and real-time mapreduced them all on a single VM. If we can do that, then you can make it work for a few hundreds, or a few thousands, of container instances. Don't let anyone tell you you can't. Do the math: of course you can.

Epilogue

Eventually our team's log processing infrastructure evolved to become the primary monitoring and alerting infrastructure for our ISP. Rather than alerting on behaviour of individual core routers, it turned out that the end-to-end behaviour observed by devices in the field were a better way to detect virtually any problem. Alert on symptoms, not causes, as the SREs like to say. Who has the symptoms? End users.

We had our devices ping different internal servers periodically and log the round trip times; in aggregate, we had an amazing view of overloading, packet loss, bufferbloat, and poor backbone routing decisions, across the entire fleet, across every port of every switch. We could tell which was better, IPv4 or IPv6. (It's always IPv4. Almost everyone spends more time optimizing their IPv4 routes and peering. Sorry, but it's true.)

We detected some weird configuration problems with the DNS servers in one city by comparing the 90th percentile latency of DNS lookups across all the devices in every city.

We diagnosed a manufacturing defect in a particular batch of devices, just based on their CPU temperature curves and fan speeds.

We worked with our CPU vendor to find and work around a bug in their cache coherency, because we spotted a kernel panic that would happen randomly every 10,000 CPU-hours, but for every 100,000 devices, that's still 10 times per hour of potential clues.

...and it sure was good for detecting power failures.

Anyway. Log more stuff. Collect those logs. Let it flow. Trust me.

Read the whole story
smarkwell
114 days ago
reply
Share this story
Delete
Next Page of Stories