Author avatar

Ben Sullins

How to Analyze Employee Data Using SQL

Ben Sullins

  • May 11, 2019
  • 3 Min read
  • 7,078 Views
  • May 11, 2019
  • 3 Min read
  • 7,078 Views
Data
SQL

Introduction

When it comes to how to analyze employee data, there are few more complex data structures to work with. Probably the biggest challenge that comes from working with an employee data set is how unassuming it may seem to start. Let's take a typical employee table from a Human Resources Information System (HRIS) and see how we can run some common analytical queries against it.

Here is a table structure for how our data is stored. This is typical of many HRIS systems.

Our Table

Table Name: emps

EmployeeIDHireDateRehireDateTermDateManagerIDName
12004-06-171900-01-013000-01-01-1Washington
22004-07-011900-01-013000-01-011Adams
32008-05-011900-01-012008-07-012Rosevelt
42007-01-012009-01-012008-01-013Carter

Queries

How Many People Were Active on a Given Day (Ex. 2005-07-01)?

If we were answering these questions independently it the following SQL would suffice:

1select count(1)
2from emps
3where HireDate <= '2005-07-01'
4and TermDate > '2005-07-01'
sql

Rinse and repeat this for every date. But what about employees that have been rehired?

First, based on this data, we must assume that when a person is rehired, their TermDate remains the same and does not reset to 3000-01-01 as if they were active.

1select count(1)
2from emps
3where (HireDate >= '2005-07-01'
4and TermDate < '2005-07-01')
5or (RehireDate >= '2005-07-01'
6and (
7    TermDate < '2005-07-01'
8  or
9    TermDate <= RehireDate
10)
sql

The Plot Thickens

This is all easy enough, but what if we wanted to create a view that has the running total of active employees for every day so we could trend it over time and ask more complex questions like Year over Year (YoY) growth?

We would need to include another table which has nothing but dates. In Data Warehousing we typically create this as a Date dimension. Here is a basic structure for us to work with.

Table Name: Dates

DateDayMonthYear
2005-01-01152005
2005-01-02252005
etc...

We'll use this as our starting point since it won't be constrained by the employee status at the time or dates in our emp table. From there, we'll do a left join to our emp table to get the count of active employees similar to how we did it manually above. Here's the SQL to do this:

1select
2DayDate, count(hires.EmployeeId) as ActiveCount
3from
4    dates d
5-- get hires
6left join
7 (select 'active' as Status, * from emps) hires
8on
9 ( d.Date >= hires.HireDate and d.Date < hires.TermDate) -- normal active
10or
11  (d.Date
12      between (case when hires.RehireDate = '1900-01-01' then null else hires.RehireDate end)
13      and hires.TermDate)
14where
15  d.Date between '2004-06-17' and now() -- use the function to pull the current date here
sql

Alas, We Have Trend Data!

The results of our query above will give us a daily count of Active employees to trend over time. This is just the first step in analyzing employee data of course, but a solid foundation to start and a good example of why having a date dimension in your data warehouse is important.