How My Bash Color Settings Broke edeliver

A story about debugging a popular Elixir deploy tool and its unlikely solution.

Kate Travers
Flatiron Labs
Jul 10, 2019

This post was originally published on the author’s personal blog.

Yep, you read that right. My bash color settings broke edeliver, the tool my team uses to deploy our Elixir apps.

Now, anyone who’s tinkered with their .bash_profile knows there’s an infinite number of ways to totally bork your system. But this bug was well camouflaged, hiding inside a common, seemingly benign bash setting I’d had in place for over two and a half years without issue. It’s a setting you, too, might have on your machine RIGHT NOW 😱

But don’t worry, I tracked down the little bugger, so read on to save yourself the same hassle I went through. And for those of you in a rush, skip ahead to the tl;dr.

The Problem

I’d just started working on one of our Elixir projects, doing my best to learn a new codebase and a new language all at once. Things were going well until it came time to deploy changes. I could build a release, no problem, but I couldn’t deploy; the command would fail every time with the following output:

My teammates could deploy using this command; I was the only one getting this error. I tried running the command from my work and personal computers, and it errored out the same way consistently on both machines (I use shared dotfiles, so this wasn’t super surprising).

Debugging Steps

1. Hard Reset

My first recourse was the same I use any time I hit a weird error: start over fresh. I deleted the project directory from my machine, re-cloned it back down, and tried deploying the master branch (which was already running on production). No dice.

Outcome:
✅ Confirmed issue wasn’t with something I’d changed in the codebase
❌ Didn’t work. Deploy still failed.

2. Get Verbose

I was running the deploy command with the --verbose flag, but the edeliver error message still wasn’t telling me much about why the command was failing (more on that later). I needed more info.

On my teammate Steven’s excellent advice, I opened up my_app/deps/edeliver, found the command that was failing by searching for the error message (“Uploading release file failed”), and flipped it into verbose debug mode by adding the -vvvvv option (more v’s == more verbose):
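The edited snippet isn’t reproduced here, but the search step is easy to sketch. The example below builds a throwaway directory with a made-up stand_in.sh script (not edeliver’s real source) and uses grep -rn to find which file and line print the error message.

```shell
# Throwaway directory standing in for my_app/deps/edeliver (hypothetical contents).
deps_dir="$(mktemp -d)"
mkdir -p "$deps_dir/edeliver"
cat > "$deps_dir/edeliver/stand_in.sh" <<'EOF'
# made-up script, only here so grep has something to find
upload() {
  false || echo "Uploading release file failed"
}
EOF

# -r searches recursively, -n prints line numbers, -F treats the pattern as a literal string.
match="$(grep -rnF 'Uploading release file failed' "$deps_dir/edeliver")"
printf '%s\n' "$match"

rm -rf "$deps_dir"
```

In a real project you would point the same grep at deps/edeliver and open whichever file it reports.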

This gave me way more insight into what this script was trying (and failing) to do. Honestly, probably too much insight. Re-running the deploy command in verbose debug mode gave me back ~200 lines of output:

Steven did the same thing on his machine, so we’d have his successful debug log for comparison.

His output also gave us another valuable piece of information: the actual bash command $_remote_job built by the upload_release_archive function, i.e. the one that copies the release from our build server to our production server, i.e. the one throwing the error. We’d follow that lead next.

Outcome:
✅ Better output with verbose debug mode
✅ Confirmed exact point of failure
✅ Learned a new debugging technique (thanks Steven!)
❌ Hard to separate signal from noise

3. Manual Workaround

We’d pinpointed the command that was failing locally, so why not try running it directly from the build server?

I ssh’d into our build server, switched to our deployer user, and ran the $_remote_job command above. What do you know, it worked! The release archive was copied from the build server to our production server, as expected.

I dropped back onto my local machine, commented out the upload_release_archive function from the edeliver script (since the archive was already copied over), and ran the deploy command. Success!

So now things are getting interesting. When I run the command from my local as my kate user, it fails. When I run it from our build server as deployer, it succeeds. Conclusion: something’s messed up with my user.

Outcome:
✅ We found a usable (albeit super manual) workaround
✅ Confirmed problem wasn’t with our project’s edeliver config
❌ Still can’t deploy from my local

4. Sanity Check

We’d narrowed down the issue to my user, so the next step was to check my ssh config. I confirmed that my keys were on the required servers by ssh’ing in as my user… no problems there. Then I compared my .ssh/config file to my teammates’ config… nothing out of whack there either.

I also tried deploying after removing everything from the following config files (testing each in isolation, one-by-one), still with zero success:

  • ~/.ssh/known_hosts
  • ~/.bashrc
  • ~/.profile

Outcome:
✅ Confirmed problem wasn’t with my ssh config
❌ Running out of ideas

The Turning Point

At this point, desperation was setting in. My teammates and I were stumped, and we’d run out of things to try.

Enter Andres, our team’s devops contractor at the time. I’d sent him Steven’s and my output logs, and he was the first to notice a pretty important difference: in my output, the archive filename had a bunch of weird characters in it.

What was going on here? Where were those extra characters coming from?

A quick Google search revealed that these are terminal escape sequences, or ANSI escape codes. The terminal normally interprets these sequences as functions, not characters, so you can use them to format output, like adding color to the output of grep or ls commands.
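To make that concrete, here’s a minimal sketch of what an escape sequence looks like as raw bytes. The string and variable names are just for illustration: ESC [ 31 m turns the text red and ESC [ 0 m resets it, and cat -v shows the ESC byte as ^[ instead of letting the terminal interpret it.

```shell
# ESC [ 31 m switches the foreground to red; ESC [ 0 m resets it.
red_hello="$(printf '\033[31mhello\033[0m')"

# A terminal interprets the codes and shows a red "hello".
printf '%s\n' "$red_hello"

# cat -v makes the escape bytes visible instead of interpreting them.
visible="$(printf '%s' "$red_hello" | cat -v)"
printf '%s\n' "$visible"   # prints ^[[31mhello^[[0m

# The codes are real bytes: "hello" is 5 bytes, the colored string is 14.
byte_count="$(printf '%s' "$red_hello" | wc -c | tr -d ' ')"
printf '%s\n' "$byte_count"
```

Those nine extra bytes are invisible in a terminal, which is exactly why they’re easy to miss.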

I like me some colors in my terminal output, so way back in the day (like, two and a half years ago), I added settings to my .bash_profile to colorize my ls and grep output:

I’d never given these settings a second thought… until now. Turns out they’ve been a ticking time bomb, just waiting for this specific scenario.

The Explanation

The settings above work their colorizing magic by adding ANSI escape codes to output from ls and grep commands. For example, let’s run grep on example.txt below.
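Here’s a rough sketch of that idea with a made-up example.txt (the contents are an assumption, not the original file). It passes --color=always on the command line rather than through GREP_OPTIONS, so the demo doesn’t depend on whether your grep honors that variable, and it pipes through cat -v so the added bytes are visible.

```shell
# Hypothetical example.txt in a scratch directory.
workdir="$(mktemp -d)"
printf 'apple\nbanana\ncherry\n' > "$workdir/example.txt"

# Without color: just the matching line.
plain="$(grep --color=never banana "$workdir/example.txt")"

# With forced color: the same line, wrapped in escape sequences.
colored="$(grep --color=always banana "$workdir/example.txt")"

# cat -v shows the escape bytes as ^[ so the two results can be compared by eye.
printf '%s\n' "$plain" | cat -v
printf '%s\n' "$colored" | cat -v

rm -rf "$workdir"
```

The exact sequences differ between grep implementations, but the colored result is never byte-for-byte equal to the plain one.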

Seems harmless, right? The escape codes are properly “escaped”, interpreted as functions instead of characters… EXCEPT when your terminal doesn’t know how to interpret the sequence. For example, look what happens when you pipe colorized grep output into less:

Here’s the problem. If you pipe colorized output into a program like less that doesn’t interpret the escape codes (less doesn’t by default), they get treated as if they’re regular ol’ characters. Same thing happens if you’re colorizing ls or grep in a terminal that doesn’t support whatever escape codes you’ve set through the LSCOLORS or GREP_COLOR variables (see this helpful post for a longer explanation).
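The difference between --color=always and --color=auto is easiest to see when the output goes into a pipe instead of a terminal. A small sketch, using a made-up input line:

```shell
# Inside $(...) grep's stdout is a pipe, not a terminal.

# always: escape codes are emitted no matter where the output goes.
always_out="$(printf 'release\n' | grep --color=always release | cat -v)"

# auto: escape codes are emitted only when stdout is a terminal, so none here.
auto_out="$(printf 'release\n' | grep --color=auto release | cat -v)"

printf 'always: %s\n' "$always_out"
printf 'auto:   %s\n' "$auto_out"
```

With auto, the piped result is the bare word release. With always, it carries the ^[ sequences along into whatever reads the pipe next.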

And that’s what was happening to me in the edeliver deploy script.

Looking back at the edeliver source code, the archived release file name is built from the output of ls and grep commands. Because I had those colorization variables set in my .bash_profile, my machine added the color escape codes to the output, which were “misinterpreted” as extra characters in the release archive filename, breaking the deploy.
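The failure mode can be reproduced in a few lines. This is a simplified stand-in, not edeliver’s actual code: the directory, the archive name, and the check helper are all made up for illustration. It builds a filename from ls piped into grep, once with forced color and once without, then checks whether that filename exists on disk.

```shell
# Hypothetical release directory with one archive in it.
release_dir="$(mktemp -d)"
touch "$release_dir/my_app_0.1.0.release.tar.gz"

# Build the filename from ls + grep, with and without forced color.
colored_name="$(ls "$release_dir" | grep --color=always 'my_app')"
plain_name="$(ls "$release_dir" | grep --color=never 'my_app')"

# Report whether a file with the given name exists in the release directory.
check() {
  if [ -f "$release_dir/$1" ]; then echo "found"; else echo "missing"; fi
}

colored_result="$(check "$colored_name")"
plain_result="$(check "$plain_name")"
echo "with color:    $colored_result"   # missing
echo "without color: $plain_result"     # found

rm -rf "$release_dir"
```

The colored name looks identical when echoed to a terminal, but the embedded escape bytes mean no file by that name exists.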


.
.
.
🎤 ::mic drop::

The Solution

Remove any ls or grep color settings from your .bash_profile:

[update] Based on this helpful comment from Charles Miller, an even simpler solution is to change export GREP_OPTIONS='--color=always' to export GREP_OPTIONS='--color=auto'.
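If you want to check whether your own shell has the risky setting, here’s a minimal sketch. has_forced_grep_color is a name made up for this example, and it only inspects GREP_OPTIONS, so it won’t catch color forced through an alias or some other route.

```shell
# Hypothetical helper: succeeds when GREP_OPTIONS forces color on.
has_forced_grep_color() {
  case "${GREP_OPTIONS:-}" in
    *--color=always*|*--colour=always*) return 0 ;;
    *) return 1 ;;
  esac
}

if has_forced_grep_color; then
  echo "GREP_OPTIONS forces color; consider --color=auto"
else
  echo "no forced grep color found"
fi
```

Worth knowing: as far as I know, recent GNU grep releases ignore GREP_OPTIONS entirely, so on those systems the variable is harmless and an alias is the usual way to get color.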

Learnings

Debugging is hard, especially when you’re working in a new codebase and/or new language. One thing that can really save you is good error messaging, something Elixir does well and edeliver can do better.

Next steps from here will be to raise this issue for edeliver and try to contribute some error message improvements (and maybe a fix for this escape sequence nonsense… think we’ll just need to disable the color settings before running the deploy command).
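One way that idea could look, as a sketch rather than anything edeliver actually does: a wrapper that strips the grep color variables from the environment for a single command. without_grep_color is a hypothetical name, and it relies on env -u, which removes a variable for the command it launches.

```shell
# Hypothetical wrapper: run a command with the grep color variables removed.
without_grep_color() {
  env -u GREP_OPTIONS -u GREP_COLOR "$@"
}

# Simulate the problematic profile setting.
export GREP_OPTIONS='--color=always'

# The wrapped command no longer sees the variable.
seen="$(without_grep_color sh -c 'echo "${GREP_OPTIONS-unset}"')"
echo "GREP_OPTIONS inside wrapped command: $seen"   # unset
```

You would put your usual deploy command after without_grep_color. The parent shell keeps its settings, so interactive grep stays colorful.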

Thanks for reading! Want to work on a mission-driven team that loves bug hunts and Elixir? We’re hiring!
