+++
title = "The allure of cloud services - AWS Transfer Server"
date = "2023-01-05"
+++

## Build vs buy

I like cloud services as much as the next person. Definitely not _too_ much but
definitely, umm, much?

Anyway, I think cloud-specific services can _sometimes_ be worth the sort of
vendor lock-in that they facilitate.

Vendor lock-in sucks, but so does reinventing the wheel. As mentioned in an
[earlier post](/posts/cloudwatch-metric-filters/), it can be pretty nice to
create alerts and application metrics passively from your log files.

Knowing when to use vendor-specific services and APIs versus rolling your own
is, in some ways, more art than science. Reasonable teams won't change their
infrastructure provider very many times, and the difference between the two
choices is often less about lock-in and more about whether the problem you're
solving is really the special unicorn that you want it to be.

Sometimes, the cloud vendor solution is just bad, though.

## Case in point - AWS Transfer Server

On my team, we need to accept SFTP file transfers from about 20-30 vendors.

Sadly we can't avoid SFTP for this. It's an industry standard for the kind of
data we're receiving and we aren't in a position to dictate otherwise.

AWS offers a suite of products that includes a hosted SFTP solution, [AWS Transfer](https://aws.amazon.com/aws-transfer-family/).

SFTP is simple enough and I'd ultimately like the uploaded files in an S3
bucket, so it seems like a great fit, right?

Wrong.

You have to jump through [a hundred](https://aws.amazon.com/secrets-manager/)
[different](https://aws.amazon.com/lambda/)
[hoops](https://aws.amazon.com/iam/) to even attempt to get password-based
authentication working on this service.

The services needed to make password-based authentication work here aren't in
and of themselves a problem. The problem is the multiple points of failure they
represent. Each service can fail on its own, the permissions between the
services can fail, or the code in the Lambda itself can fail.
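
For a sense of the moving parts, here is a minimal sketch of the kind of Lambda
this setup needs. It is not the code from the setup described here. The
`SFTP/<username>` secret name, the `Password`, `Role` and `HomeDirectory` keys
inside the secret, and the `fetch_secret` and `authenticate` helpers are all
assumptions made up for illustration.

```python
import hmac
import json


def fetch_secret(username):
    """Load one user's secret from Secrets Manager.

    Assumption: secrets are named SFTP/<username> and hold a JSON object.
    """
    import boto3  # imported lazily so the logic below can be tested without AWS
    from botocore.exceptions import ClientError

    client = boto3.client("secretsmanager")
    try:
        response = client.get_secret_value(SecretId=f"SFTP/{username}")
    except ClientError:
        # Unknown user, or the Lambda's own role can't read the secret.
        return None
    return json.loads(response["SecretString"])


def authenticate(event, get_secret=fetch_secret):
    """Return the user's settings on success, or {} to reject the login."""
    username = event.get("username", "")
    password = event.get("password", "")
    secret = get_secret(username)
    if not secret or not password:
        return {}

    expected = secret.get("Password", "")
    if not hmac.compare_digest(password.encode(), expected.encode()):
        return {}

    # Role is the IAM role the transfer server assumes to write to S3
    # on behalf of this user.
    return {"Role": secret["Role"], "HomeDirectory": secret["HomeDirectory"]}


def lambda_handler(event, context):
    # An empty response is treated as a failed login.
    return authenticate(event)
```

Even in a sketch this small there are at least four separate things that have
to line up: the secret itself, the Lambda's permission to read it, the transfer
server's permission to invoke the Lambda, and the role handed back for S3
access.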

Debugging it was a nightmare. It ultimately worked but I wasn't sure _why_. I
had a strange permissions error about either the IAM role used to launch the
Lambda or the role used to access the S3 bucket for uploads. I don't remember
at this point. But it was really clear that if it ever broke again, then I or
whatever poor soul had to work on it was going to regret the choice of
technologies used to make this work.

You know what is easy and simple and well understood? OpenSSH and Linux user
accounts.

In the end, the AWS Transfer option proved to be just complicated and enticing
enough for me to waste about a day and a half on it before I realized, in a
drunken stupor, that I'd be better off creating a snowflake EC2 instance and
calling it a day.

## Resisting temptation

I do think I could have identified that this was a bad idea before I wasted
over a day. One doesn't have to be doomed to repeat this mistake to learn the
lesson here.

The way to avoid this is to interrogate the path before you start.

Empathy is a good tool here.

Picture the thing working as intended. Then imagine someone else (or future you
a year from now) having to search for how to change something or track down an
error.

- Are they likely to find many other people using these services? (In this case, no)
- Are the services involved purpose-built for the task at hand? (In this case, no, none of them)
- Does your team have pre-existing expertise around the services involved? (Again, no)
- What benefits does this approach yield over a traditional solution? (Upload to S3 is nice)
- Are those benefits important to your use case? (Not for us)

I didn't ask myself any of those questions. I _really_ should have. Because the
answers are easy to get and they clearly indicate AWS Transfer Server is not
the best answer.

## Timeboxes

Another tool that could have helped here is a timebox.

A timebox is when you decide to give yourself a fixed, and often pretty short,
amount of time to accomplish something or to at least get meaningful insight
into a potential solution.

Sometimes I will demand that a working solution be done inside a given
timebox, or else the approach is abandoned. But other times I might just say
that I'm going to spend X amount of time on an approach and then make a point
to re-evaluate how long the complete solution will take.

It's important to remember how valuable timeboxes can be. The more time we
waste on things like this, the more attached we become to them. [Sunk
cost](https://en.wikipedia.org/wiki/Escalation_of_commitment) can get even the
best of us.

For me the simplest tool for avoiding the sunk cost fallacy is to timebox risky
endeavors like this one. The time inside a timebox starts out as a write-off. I
never feel bad about discarding failed results inside a timebox. I should use
them more aggressively.