When it comes to how to analyze employee data, there are few more complex data structures to work with. Probably the biggest challenge that comes from working with an employee data set is how unassuming it may seem to start. Let's take a typical employee table from a Human Resources Information System (HRIS) and see how we can run some common analytical queries against it.
Here is a table structure for how our data is stored. This is typical of many HRIS systems.
EmployeeID | HireDate | RehireDate | TermDate | ManagerID | Name |
---|---|---|---|---|---|
1 | 2004-06-17 | 1900-01-01 | 3000-01-01 | -1 | Washington |
2 | 2004-07-01 | 1900-01-01 | 3000-01-01 | 1 | Adams |
3 | 2008-05-01 | 1900-01-01 | 2008-07-01 | 2 | Rosevelt |
4 | 2007-01-01 | 2009-01-01 | 2008-01-01 | 3 | Carter |
If we were answering these questions independently it the following SQL would suffice:
1select count(1)
2from emps
3where HireDate <= '2005-07-01'
4and TermDate > '2005-07-01'
Rinse and repeat this for every date. But what about employees that have been rehired?
First, based on this data, we must assume that when a person is rehired, their TermDate remains the same and does not reset to 3000-01-01 as if they were active.
1select count(1)
2from emps
3where (HireDate >= '2005-07-01'
4and TermDate < '2005-07-01')
5or (RehireDate >= '2005-07-01'
6and (
7 TermDate < '2005-07-01'
8 or
9 TermDate <= RehireDate
10)
This is all easy enough, but what if we wanted to create a view that has the running total of active employees for every day so we could trend it over time and ask more complex questions like Year over Year (YoY) growth?
We would need to include another table which has nothing but dates. In Data Warehousing we typically create this as a Date dimension. Here is a basic structure for us to work with.
Date | Day | Month | Year |
---|---|---|---|
2005-01-01 | 1 | 5 | 2005 |
2005-01-02 | 2 | 5 | 2005 |
etc... |
We'll use this as our starting point since it won't be constrained by the employee status at the time or dates in our emp table. From there, we'll do a left join to our emp table to get the count of active employees similar to how we did it manually above. Here's the SQL to do this:
1select
2DayDate, count(hires.EmployeeId) as ActiveCount
3from
4 dates d
5-- get hires
6left join
7 (select 'active' as Status, * from emps) hires
8on
9 ( d.Date >= hires.HireDate and d.Date < hires.TermDate) -- normal active
10or
11 (d.Date
12 between (case when hires.RehireDate = '1900-01-01' then null else hires.RehireDate end)
13 and hires.TermDate)
14where
15 d.Date between '2004-06-17' and now() -- use the function to pull the current date here
The results of our query above will give us a daily count of Active employees to trend over time. This is just the first step in analyzing employee data of course, but a solid foundation to start and a good example of why having a date dimension in your data warehouse is important.