Praveen Kumar

Headaches of UTF-8 BOM!

  • May 17, 2017
  • 3 Min read
  • 1,060 Views
PHP

Introduction

I love Windows, but not when developing PHP applications. Recently, I started working on several mainstream PHP and NodeJS applications and REST APIs. With REST APIs, the output must be perfect. Even a single stray character will screw the 💩 out of the system. 😩

So when designing API end-points, we must make sure that the output doesn’t contain any encoding marks or unnecessary characters. We should also be aware that these invisible buggers are not typed in by anyone; they come from how the file gets saved to the file system. More specifically, the problem comes from the Byte Order Mark (BOM), a short byte sequence at the start of a file that tells readers which encoding (and, for UTF-16, which byte order) the file uses.

Problem

The real problem lies in the way text editors save files to the file system. On Windows, text editors typically save in UTF-8 (with or without BOM), UTF-16 (with or without BOM, little endian, etc.), ANSI, or Windows-1252. The worst part is that, with some of these settings, every file you save gets a Byte Order Mark (BOM) prepended to the very beginning of the file.

This poses a serious problem while developing API end-points, where a single UTF-8 BOM can make the JSON / HTTP response totally invalid. My current system (not my permanent or personal one) is a Windows machine with Vagrant (+VirtualBox) installed. I am writing the files on the host (Windows) machine, where the editor automatically saves them with the BOM.

I found the issue when I noticed that the response from the server looked unusual. Both images are from Chrome Developer Tools: the first is from inspecting the element and the second is the response from the server.

Before-InspectElement

Before-NetworkResponse

The UTF-8 BOM can be identified by its three-byte signature, EF BB BF in hexadecimal.
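To make the signature check concrete, here is a minimal sketch that tests whether a single file starts with those three bytes. The function name has_bom is my own placeholder, not something from a library, and the sketch assumes the usual head, od and tr utilities are available.

```shell
#!/bin/sh
# has_bom: hypothetical helper name, used only for illustration.
# Exits 0 when the first three bytes of the file are EF BB BF, non-zero otherwise.
has_bom() {
  # head -c 3 reads the first three bytes, od -An -tx1 prints them as hex,
  # and tr removes the whitespace that od puts between the bytes.
  [ "$(head -c 3 < "$1" | od -An -tx1 | tr -d '[:space:]')" = "efbbbf" ]
}
```

For example, `has_bom index.php && echo "index.php starts with a BOM"` prints the message only when the signature is present. Files shorter than three bytes simply fail the comparison.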

Solution

The only way to avoid this is to get rid of the UTF-8 BOM completely, using two simple commands (Mac or Linux only). 😁 The first command finds the files that contain the BOM:

```bash
grep -rl $'\xEF\xBB\xBF' .
```

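The second one removes the BOM by replacing it with an empty string using sed. As written, it assumes GNU sed (the usual sed on Linux); the BSD sed that ships with macOS expects a backup-suffix argument after -i and may not understand the \x escapes. The command is as follows: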

```bash
sed -i '1 s/^\xef\xbb\xbf//' *.php
```

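You may replace *.php with anything like *.inc or a single file such as index.php. Note that *.php only matches files in the current directory, so it is a pain if the files are inside multiple sub-folders: you have to repeat the command in each folder or feed it the file list from the grep command above.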
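For the sub-folder case, one possible approach is to let find walk the tree and strip the BOM file by file. The sketch below is only an illustration: strip_bom_tree is a name I made up, and it swaps sed for tail -c +4 (which outputs everything from the fourth byte onward), because sed's -i flag and \x escapes can differ between GNU and BSD implementations. It also assumes mktemp is available.

```shell
#!/bin/sh
# strip_bom_tree: hypothetical helper, not part of the original commands.
# Usage: strip_bom_tree [directory] [name-pattern]
# Defaults to the current directory and *.php.
strip_bom_tree() {
  find "${1:-.}" -type f -name "${2:-*.php}" -exec sh -c '
    for file do
      # Compare the first three bytes (as hex) with the BOM signature.
      sig=$(head -c 3 < "$file" | od -An -tx1 | tr -d "[:space:]")
      if [ "$sig" = "efbbbf" ]; then
        tmp=$(mktemp) || exit 1
        # Copy everything after the BOM, then write it back in place
        # so the original file keeps its permissions.
        tail -c +4 "$file" > "$tmp" && cat "$tmp" > "$file"
        rm -f "$tmp"
        printf "stripped BOM: %s\n" "$file"
      fi
    done
  ' sh {} +
}
```

Calling `strip_bom_tree . '*.inc'` would process every .inc file under the current directory. Only files that actually begin with EF BB BF are rewritten, so running it twice should be harmless. Back up or commit your files first, since the edit happens in place.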

I hope this saves someone some time. Once I had done the above, the unusual characters no longer appeared in the output. 😁

After-InspectElement

After-NetworkResponse


Thanks for reading this quick guide on UTF-8 BOM.