Headaches of UTF-8 BOM!

A UTF-8 BOM (byte order mark) can throw a wrench into any REST API and cause nothing but headaches. Learn how to avoid that headache here!

By Praveen Kumar

Jan 10, 2019 • 3 Minute Read

Introduction

I love Windows, but not when developing PHP applications. Recently, I started working on several mainstream PHP and Node.js applications and REST APIs. With REST APIs, the output must be perfect, because a single stray byte can make a response unparseable.

So when designing API endpoints, we must make sure that the output doesn’t contain any encoding marks or other unnecessary characters. These invisible characters are not typed by any user; they come from how the editor saves the file to the file system. More specifically, the problem is the byte order mark (BOM), a short byte sequence at the start of a file that signals its encoding and, for UTF-16 and UTF-32, its byte order.

Problem

The real problem lies in the way text editors save files to the file system. On Windows, text editors typically offer UTF-8 (with or without BOM), UTF-16 (with or without BOM, little or big endian), or ANSI (usually Windows-1252 on Western systems). The worst case is an editor that defaults to a BOM variant: every file it saves gets a byte order mark prepended at the very beginning of the file.

This is a serious problem when developing API endpoints, where a single UTF-8 BOM can make the JSON or HTTP response invalid. PHP sends everything that appears before the opening <?php tag straight to the client, so a BOM at the top of a script ends up at the top of the response. My current system (not my permanent or personal one) is a Windows machine with Vagrant (+VirtualBox) installed. I write the files on the host (Windows) machine, where the editor automatically saves them with a BOM.

I found the issue when the response from the server looked unusual. Both images are from Chrome Developer Tools: the first is from inspecting the element, and the second is the response from the server.

The UTF-8 BOM can be identified by its three-byte signature, EF BB BF in hexadecimal.
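
That signature makes the check easy to script: read the first three bytes and compare them with EF BB BF. The following is a minimal sketch, assuming head, od and tr are available. The names first_three_hex and has_bom are hypothetical helpers, and the URL in the usage comment is only a placeholder.

```shell
# Print the first three bytes of a file (or of stdin when no file is
# given) as lowercase hex, e.g. "efbbbf" for a UTF-8 BOM.
first_three_hex() {
    if [ "$#" -gt 0 ]; then
        head -c 3 -- "$1"
    else
        head -c 3
    fi | od -An -tx1 | tr -d ' \n'
}

# Succeed (exit status 0) only when the input starts with EF BB BF.
has_bom() {
    [ "$(first_three_hex "$@")" = "efbbbf" ]
}

# Usage (file name and URL are placeholders):
#   has_bom index.php && echo "BOM found in index.php"
#   curl -s http://localhost/api/example | has_bom && echo "BOM in response"
```

Because has_bom only reports an exit status, it can be dropped into an if statement or a loop over many files.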

Solution

The most reliable fix is to get rid of the UTF-8 BOM completely, using two simple commands (Mac or Linux only; they are written for bash and GNU sed, so the sed bundled with macOS may need adjustments). The first command finds the files that contain the BOM:

      grep -rl $'\xEF\xBB\xBF' .

The second one uses sed to remove the BOM from the first line of each file by replacing it with an empty string. The command is as follows:

      sed -i '1 s/^\xef\xbb\xbf//' *.php

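You may replace *.php with another pattern such as *.inc, or with a single file such as index.php. It is more of a pain if the files are spread across multiple sub-folders, because *.php only matches the current directory and the command has to be applied to each folder, but the approach itself still works.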
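
For files spread across sub-folders, one option is to let find walk the tree and hand every match to the same sed expression. This is a minimal sketch, assuming GNU sed as in the command above. The name strip_bom_recursive is a hypothetical helper, and LC_ALL=C is an added assumption so that sed compares raw bytes regardless of the locale.

```shell
# strip_bom_recursive DIR PATTERN
# Hypothetical helper: removes a leading UTF-8 BOM (EF BB BF) from every
# regular file under DIR whose name matches PATTERN. Assumes GNU sed.
strip_bom_recursive() {
    # LC_ALL=C makes sed treat the file as raw bytes, independent of locale.
    # "{} +" passes many files to one sed run; with -i, line 1 is per file.
    LC_ALL=C find "$1" -type f -name "$2" \
        -exec sed -i '1 s/^\xef\xbb\xbf//' {} +
}

# Example: clean every PHP file below the current directory.
#   strip_bom_recursive . '*.php'
```

Files without a BOM are left with the same content, so the command is safe to run more than once.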

Once I had done the above, the unusual characters no longer appeared in the output. I hope this saves someone some time.

Thanks for reading this quick guide on UTF-8 BOM.

Praveen K.
