Unnecessarily Hard

Denes and I have been trying to figure out a bug with our Starchart DNS code. After hooking up the Let's Encrypt Certificate flow, we were finally in a position to test working with _acme-challenge TXT records in Route53.

Locally and in CI, we've been using the amazing moto route53 server to test things. It's given us the ability to develop the majority of the code quickly, but production is obviously different from a mock. Route53, and DNS in general, is hard to simulate. There's just so much that can go wrong: so many interconnected pieces, timing issues, etc.

Over the weekend we got into a loop of finding a bug and fixing a bug, finding another, fixing another. Eventually this has to work, right? How many can there be...

The process of finding, debugging, and fixing these bugs with AWS is unnecessarily hard. I really like AWS and I even teach an upper-semester course on it. But I feel like AWS makes things harder than they need to be for no apparent benefit.

Let me show you what I mean with one of the bugs. Let's Encrypt needs us to set these _acme-challenge TXT records in Route53. I need to put the string value 8cV4hs2A8VmH3a2f2QYkvANYtXZWm9I93kUXYZtiGgE into _acme-challenge.whatever.com. as a TXT record. To do this in Node.js, you need to use the AWS SDK and the ChangeResourceRecordSetsCommand. It lets you specify an array of changes you want to apply, each with an Action of CREATE, DELETE, or UPSERT. For example:

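import { Route53Client, ChangeResourceRecordSetsCommand } from "@aws-sdk/client-route-53";

// Assumes credentials are available from the environment; the region here is a placeholder.
const client = new Route53Client({ region: "us-east-1" });

// The following example creates a resource record set that routes Internet traffic to a resource with an IP address of 192.0.2.44.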
const input = {
  "ChangeBatch": {
    "Changes": [
      {
        "Action": "CREATE",
        "ResourceRecordSet": {
          "Name": "example.com",
          "ResourceRecords": [
            {
              "Value": "192.0.2.44"
            }
          ],
          "TTL": 60,
          "Type": "A"
        }
      }
    ],
    "Comment": "Web server for example.com"
  },
  "HostedZoneId": "Z3M3LMPEXAMPLE"
};
const command = new ChangeResourceRecordSetsCommand(input);
const response = await client.send(command);

This creates an A record with a single IP address Value. We need to use a Type of TXT, but there is no example or mention of this. Surely that means it's the same, right? The API docs seem to imply this, showing only the following for Value:

ResourceRecords: [
  {
    Value: "STRING_VALUE", // required
  },
],

"So, ResourceRecords is an Array of Objects with a string Value, got it." There are no other details or examples of working with TXT records.

Off we go, and despite everything we throw at it, back comes a 400 error with InvalidChangeBatch. Let's check the docs on this error:

Throws: InvalidChangeBatch (client fault)

This exception contains a list of messages that might contain one or more error messages. Each error message indicates one error in the change batch.

"Might contain," eh? I'll skip ahead and tell you that it does not contain anything useful:

"Code":"InvalidChangeBatch", "Type":"Sender", "name":"InvalidChangeBatch"

Alright, so we're doing something wrong, but what? We try half-a-dozen things, and fix some other bugs, but nothing will unlock this InvalidChangeBatch problem. Now we're neck deep in AWS browser tabs, with articles about every possible way to deal with Route53 other than what we need to do.
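When an error is this sparse, it's worth dumping every field it carries before concluding there's nothing there. Here's a minimal sketch of that idea. It assumes the thrown error exposes a `name`, and optionally a `messages` array (the "list of messages" the docs describe) and a `message` string. The `summarizeChangeBatchError` helper is a hypothetical name of my own, not something the AWS SDK provides:

```javascript
// Hypothetical helper: collect whatever detail an error object carries.
// Assumes the error exposes `name`, and optionally `messages` (array) and `message` (string).
function summarizeChangeBatchError(err) {
  const details = [];
  if (Array.isArray(err.messages)) {
    details.push(...err.messages);
  }
  if (typeof err.message === "string" && err.message.length > 0) {
    details.push(err.message);
  }
  const name = err.name || "UnknownError";
  return details.length > 0
    ? `${name}: ${details.join("; ")}`
    : `${name}: (no detail provided)`;
}

// An error shaped like the one we got back has nothing extra to report.
const sparse = { Code: "InvalidChangeBatch", Type: "Sender", name: "InvalidChangeBatch" };
console.log(summarizeChangeBatchError(sparse));
// prints: InvalidChangeBatch: (no detail provided)
```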

Eventually, deep in a Stack Overflow comment (not the answer!) I see someone toss out a lifeline: "FYI, I needed to wrap my value in \"...\" in case that helps anyone." That not only helps me, but it's what should have been written in the official API docs to begin with! Why is this so hard to find?
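In practice, that means the double quotes have to be part of the Value string itself. Here's a rough sketch of what the change batch input might look like for our challenge record. The `buildTxtChange` helper name, the UPSERT action, and the TTL of 60 are my own choices for illustration, and the hosted zone ID is the placeholder from the docs example:

```javascript
// Hypothetical helper: build the input object for a ChangeResourceRecordSetsCommand
// that sets a TXT record. UPSERT and a TTL of 60 are assumptions for illustration.
function buildTxtChange(hostedZoneId, name, value) {
  return {
    HostedZoneId: hostedZoneId,
    ChangeBatch: {
      Changes: [
        {
          Action: "UPSERT",
          ResourceRecordSet: {
            Name: name,
            Type: "TXT",
            TTL: 60,
            // The double quotes are part of the value, not just JavaScript string syntax.
            ResourceRecords: [{ Value: `"${value}"` }],
          },
        },
      ],
    },
  };
}

const input = buildTxtChange(
  "Z3M3LMPEXAMPLE", // placeholder zone ID
  "_acme-challenge.whatever.com.",
  "8cV4hs2A8VmH3a2f2QYkvANYtXZWm9I93kUXYZtiGgE"
);
const record = input.ChangeBatch.Changes[0].ResourceRecordSet;
console.log(record.ResourceRecords[0].Value);
// prints the challenge string wrapped in double quotes
```

The resulting object can be handed to `new ChangeResourceRecordSetsCommand(input)` the same way as the A record example.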

Armed with this new crumb of information, I again go searching in the AWS documentation labyrinth, and discover this:

A TXT record contains one or more strings that are enclosed in double quotation marks (").

Furthermore, there's all kinds of other special cases you should deal with, including:

A single string can include up to 255 characters, including the following:
  - a-z
  - A-Z
  - 0-9
  - Space
  - - (hyphen)
  - ! " # $ % & ' ( ) * + , - / : ; < = > ? @ [ \ ] ^ _ ` { | } ~ .

If you need to enter a value longer than 255 characters, break the value into strings of 255 characters or fewer, and enclose each string in double quotation marks ("). In the console, list all the strings on the same line:

"String 1" "String 2" "String 3"

For the API, include all the strings in the same Value element:

<Value>"String 1" "String 2" "String 3"</Value>

The maximum length of a value in a TXT record is 4,000 characters.
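The splitting and quoting rules above can be sketched in a few lines. This is only my reading of the rules, not an official implementation. `formatTxtValue` is a hypothetical helper. I'm assuming the 4,000 character limit applies to the final quoted value, and the sketch refuses input containing double quotes rather than trying to escape them. It also does not validate the allowed character list:

```javascript
const MAX_STRING_LENGTH = 255;
const MAX_VALUE_LENGTH = 4000;

// Hypothetical helper: turn a raw string into a quoted TXT record Value.
// Assumptions: the 4,000 character limit is checked against the final quoted
// value, and input containing double quotes is rejected (no escaping here).
function formatTxtValue(raw) {
  if (raw.length === 0) {
    throw new Error("formatTxtValue: empty values are not handled by this sketch");
  }
  if (raw.includes('"')) {
    throw new Error("formatTxtValue: double quotes in the input are not handled by this sketch");
  }
  // Break the value into strings of 255 characters or fewer.
  const chunks = [];
  for (let i = 0; i < raw.length; i += MAX_STRING_LENGTH) {
    chunks.push(raw.slice(i, i + MAX_STRING_LENGTH));
  }
  // Enclose each string in double quotes and put them all in one Value.
  const value = chunks.map((chunk) => `"${chunk}"`).join(" ");
  if (value.length > MAX_VALUE_LENGTH) {
    throw new Error(
      `formatTxtValue: value is ${value.length} characters, limit is ${MAX_VALUE_LENGTH}`
    );
  }
  return value;
}

console.log(formatTxtValue("8cV4hs2A8VmH3a2f2QYkvANYtXZWm9I93kUXYZtiGgE"));
// prints the challenge string wrapped in one pair of double quotes
console.log(formatTxtValue("a".repeat(300)).length);
// prints 305: 300 characters, two pairs of quotes, and one separating space
```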

So, in other words, a bit more than Value: "STRING_VALUE", // required. There's no mention of any of this in the API "docs."

As we're looking through this list of DOs and DON'Ts for a TXT record value, I ask myself: "surely someone has written this code already, I wonder where we can get it?" Denes jokingly says, "you mean like having it in the AWS SDK?" I laugh. Of course it isn't there. Why would this be part of the SDK? But seriously, AWS knows that I'm setting a record with a Type of TXT, and when I give them a Value, why not simply do the right thing? Or at the very least, expose a function that formats a TXT record according to their own specifications! Or give me an error message that says "invalid TXT value, USE QUOTES!"
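For what it's worth, the check I'm asking for is tiny. Here's a rough sketch of a pre-flight validator that produces that kind of message. `assertQuotedTxtValue` is a hypothetical name. It assumes a valid Value is one or more double-quoted strings separated by single spaces, each 255 characters or fewer, and it ignores escaped quotes for simplicity:

```javascript
// Hypothetical pre-flight check for a TXT record Value.
// Assumes: one or more double-quoted strings separated by single spaces,
// each 255 characters or fewer, with no embedded double quotes (a simplification).
function assertQuotedTxtValue(value) {
  const pattern = /^"[^"]{0,255}"( "[^"]{0,255}")*$/;
  if (!pattern.test(value)) {
    throw new Error(
      `Invalid TXT value ${JSON.stringify(value)}: ` +
        "wrap each string (255 characters or fewer) in double quotes"
    );
  }
}

// The unquoted challenge string fails with a message that says what to fix.
try {
  assertQuotedTxtValue("8cV4hs2A8VmH3a2f2QYkvANYtXZWm9I93kUXYZtiGgE");
} catch (err) {
  console.log(err.message);
}

// The quoted version passes silently.
assertQuotedTxtValue('"8cV4hs2A8VmH3a2f2QYkvANYtXZWm9I93kUXYZtiGgE"');
```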

It doesn't need to be this hard. These docs could link to each other, so you can find the details instead of relying on the kindness of strangers on the internet. The errors could actually tell you what's wrong, even link to URLs with the info you need. The API could expose methods to help you get your job done and show you examples for edge cases (are TXT records really an edge case?).

If AWS were some fledgling startup that had just shipped a beta and hadn't had time to get to the documentation, or if Route53 were some new service that hadn't been battle-tested yet, I'd be more sympathetic. But that's not what this is. This is unnecessarily hard.