My name is Philipp C. Heckel and I write about nerdy things.

Blog


  • Oct 19 / 2021
Distributed Systems

Lossless MySQL semi-sync replication and automated failover

MySQL is a really mature technology. It’s been around for a quarter of a century and it’s one of the most popular DBMS in the world. As such, as an engineer, one expects basic features such as replication and failover to be fleshed out, stable and ideally even easy to set up.

And while MySQL comes with replication functionality out of the box, automated failover and topology management are not part of its feature set. On top of that, it turns out that it is rather difficult to not shoot yourself in the foot when configuring replication.

In fact, without careful configuration and the right tools, a failover from a source to a replica server will almost certainly lose transactions that have been acknowledged as committed to the application.

This is a blog post about setting up lossless MySQL replication with automated failover, i.e. ensuring that not a single transaction is lost during a failover, and that failovers happen entirely without human intervention.


  • Jun 20 / 2021
Programming

elastictl: Import, export, re-shard and performance-test Elasticsearch indices

At work, I use Elasticsearch a lot. Elasticsearch is pretty famous by now, so I doubt that it needs an introduction. But if you happen to not know what it is: it’s a document store with unique search capabilities and incredible scalability.

Despite its incredible features though, it has its rough edges. And no, I don’t mean the horrific query language (honestly, who thought that was a good idea?). I mean the fact that without external tools it’s quite impossible to import, export, copy, move or re-shard an Elasticsearch index. Indices are very final, unfortunately.

This is often very inconvenient if you have a growing index in which each Elasticsearch shard is outgrowing its recommended size (2 billion documents). The opposite problem is just as annoying: an ES cluster that has too many shards (~800 shards per host is the recommendation I think), because you have too many indices.

This is why I wrote elastictl: elastictl is a simple tool to import/export Elasticsearch indices into a file, and/or reshard an index. In this short post, I’ll show a few examples of how it can be used.


  • Dec 17 / 2020
Code Snippets, Programming

Snippet 0x0F: Recursive search/replace tool “re”

Two and a half years ago, I wrote my first Go program. I wanted to learn another language, and Go looked like a ton of fun: straightforward, easy to learn, and a static binary with no runtime shenanigans. I picked a project and I started hacking. Looking back, the code I wrote is a little cringy, but not terrible. I’d surely do things differently these days, now that I have more Go experience. But we all start somewhere.

However, the tool that I wrote, a recursive search/replace tool which I intelligently dubbed re, is actually incredibly useful: to my own surprise, I use it every day. I haven’t made a single modification to it in all that time (until today for this post). And since I’m in the sharing mood today, I thought I’d share it with the millions of people (cough) that come here every day. Ha!


  • Dec 13 / 2020
Scripting, Security

Go: Calculating public key hashes for public key pinning in curl

Something occurred to me the other day. This is my blog, and that means I can write about whatever I want. Now you may think that’s totally obvious, but it’s not. For the longest time I wouldn’t blog about anything that I didn’t deem blog-worthy. Small things, like “this is a cool function I found” or “I learned this thing today”, were not blog-worthy in my mind for some reason.

Well today I am changing that. I like writing, but not necessarily so much that I always want to write a super long post. Sometimes, things should be short. Like this one.

So in this super short post I’m gonna show you a cool thing I figured out: how to calculate the value that curl’s --pinnedpubkey option needs in Go.


  • Oct 08 / 2020
Linux

Reliably rebooting Ubuntu using watchdogs

Rebooting Ubuntu is hard. I don’t really know why, but in my twelve years as an Ubuntu user, I’ve encountered countless “stuck at reboot” scenarios. Somehow, typing reboot always comes with that extra special feeling of uncertainty and the thrill of danger — Will it come back? Where will it get stuck this time? If it’s your home computer or your laptop, that’s fine, because you can always manually hard reset. If it’s a remote computer to which you have IPMI access, it’s a little bit annoying, but not tragic. But if you’re attempting to reboot tens of thousands of devices across the globe, that level of uncertainty is nothing short of terrifying.

I know I’m being unfair, because more often than not, rebooting Ubuntu actually completes successfully. However, my incredibly unscientific estimate of how often things get stuck forever on shutdown or reboot is this: 1-3%. That’s how often I believe reboots hang. That’s shockingly high, right? Well, I pulled that out of my hat, but that estimate is based on many hundreds of thousands of reboots I’ve witnessed in our fleet of backup devices. That number is not too terrible when you deal with a handful of machines that you rarely ever reboot. It is, however, incredibly terrible if you reboot tens of thousands of devices running Ubuntu every two weeks as part of an upgrade process (I wrote about our image based upgrade mechanism in another post).

This post describes the short story of how we managed to make Ubuntu machines reliably reboot.


  • Nov 19 / 2019
Cloud Computing, Distributed Systems, Programming

Providing remote access to devices via SSH tunnels

At my work, the backup appliances are typically physically located inside the LAN of our end users, much like other appliances such as routers, NAS devices or switches. Under normal circumstances that means that they are behind a NAT and are not reachable from the public Internet without a VPN or other tunneling mechanisms. For my employer’s customers, the Managed Service Providers (MSPs), only being able to access their devices with direct physical access would be a major inconvenience.

Fortunately we’ve always provided a remote management feature called “Remote Web” for our customers: Remote Web lets them remotely access the device’s web interface as well as other services (mainly RDP, VNC, SSH), even when the device is behind a NAT.

Internally we call this feature RLY (pronounced: “relay”, like the owl, get it?). In this post, I’d like to talk about how we implemented the feature, what challenges we faced and what lessons we learned.


  • Sep 18 / 2019
Linux

Image based upgrades: Upgrading software and OS of 80k servers every two weeks

Anyone who’s ever managed a few dozen or hundreds of physical servers knows how hard it can become to keep all of them up-to-date with security updates, or in general to keep them in sync with their configuration and state. Sysadmins typically solve this problem with Puppet or Salt, or by putting applications in a container. While those are great options if you control your environment, they are less applicable when you think about other cases (such as an appliance or server that doesn’t reside in your infrastructure). On top of that, replacing the kernel, major distribution upgrades or any larger upgrades that require a reboot are not covered by these solutions.

Being faced with this problem for work, we started exploring alternative options and came up with something that has worked reliably for almost two years for a fleet of now over 80,000 devices. In this blog post, I’d like to talk about how we solved this problem using images, loop devices and lots of Grub-magic. If you’d like to know more, keep reading.


  • Jul 22 / 2019
Programming

Deduplicating NTFS file systems (fsdup)

At my work, we store hundreds of thousands of block-level backups for our customers. Since our customer base is mostly Windows focused, most of these backups are copies of NTFS file systems. As of today, we’re not performing any data deduplication on these backups, which is pretty crazy considering how well you’d think a Windows OS would dedup.

So I started on a journey to attempt to dedup NTFS. This blog post briefly describes my journey and thoughts, but also introduces a tool called fsdup that I developed as part of a 3-week proof-of-concept. Please note that while the tool works, it’s highly experimental and should not be used in production!


  • Aug 05 / 2018
Linux, Security

Using Let’s Encrypt for internal servers

Let’s Encrypt is a revolutionary new certificate authority that provides free certificates in a completely automated process. These certificates are issued via the ACME protocol. Over the last 2 years or so, the Internet has widely adopted Let’s Encrypt — over 50% of the web’s SSL/TLS certificates are now issued by Let’s Encrypt.

But while there are many tools to automatically renew certificates for publicly available webservers (certbot, simp_le; I wrote about how to do that 3 years back), it’s hard to find any useful information about how to issue certificates for internal, non-Internet-facing servers and/or devices with Let’s Encrypt.

This blog post describes how to issue Let’s Encrypt certificates for internal servers. At my work, we issued a certificate for each of our 90,000+ BCDR appliances using this exact mechanism.

