Pluralsight Technology Index: Methodology
With insights from the Index, technologists and businesses can determine where they may want to invest their time and resources.
The information listed below gives visibility into our methodology and shows how we applied and calculated the data.
The Pluralsight Technology Index uses eight data sources, which were selected based on their relevance to software developers, IT Ops, security and data professionals.
GitHub provides a strong perspective on the popularity of open source projects and programming languages, especially among open source developers. This data is retrieved from public data sets hosted on Google BigQuery, which are updated quarterly. For technologies that have a related GitHub repository, the Index captures the number of stars on the repo. To capture programming language popularity, the Index measures the number of bytes pushed to repositories in each language.
Google AdWords has broad reach and gives a timely view of which technologies are searched for most. This data is retrieved from its API. For all relevant search keywords associated with a technology, the Index sums the monthly search volume.
YouTube is a hugely popular video platform. Measuring the number of video views related to different technologies gives us a robust estimate of interest in those topics. This data is pulled from its API. For the search query associated with each technology, the PTI sums the view counts of the top 50 videos.
Google Search provides a measure of how much written content has been created for any given technology and is extremely comprehensive. Google Search data is retrieved by submitting searches and scraping the “hit” count from the returned result. For each query associated with each technology, the Index captures the number of returned results.
Indeed and Dice
Indeed and Dice provide a comprehensive measure of how much job-related content has been posted for a given technology. Data from Indeed and Dice is retrieved and processed in the same way as it is for Google Search.
Reddit was chosen as a data source to help the Technology Index better capture which technologies have high popularity and engagement on social platforms. Data is retrieved from the PushShift API and from the public Reddit API. Within the subreddits related to each technology, the Index aggregates the number of results returned when searching for that technology (using hand-tuned search queries to compensate for generic technology names). The Index counts the number of times the technology is mentioned in submissions and comments, as well as the number of comments and submissions to technology-specific subreddits. The Index also captures moment-in-time snapshots of the subscriber count and active user count for technology-specific subreddits.
Our goal is to provide an unbiased perspective on technology demand. To achieve this, we don’t apply any weighting techniques to these technologies. Early in the creation of the Index, we realized that if we were to arbitrarily assign weights to either older/mature technologies or newer/unknown technologies, it might undermine that central principle and lead to justifiable critique that we were twisting the data to our own ends.
Along the same lines, we don’t say why certain technologies are listed higher than others. Why? Because we believe that part of the issue learners and leaders have with identifying trends in the technology landscape is seeing past individual changes in any given source signal.
If we exposed the underlying popularity metrics of each individual technology on each individual data source, learners and organizations would tend to apply their own subjective biases when interpreting the overall Index. Some users might value the Stack Overflow signal more than YouTube, or GitHub more than Google Search, or Google AdWords more than Indeed. It is easy to see how this would undermine our goal of offering an objective, quantitative perspective on the relative popularity of different technologies.
To the greatest extent possible, we always opt to provide a true, unvarnished perspective on the technology landscape. We believe that doing so will lend our work a great deal of credibility.
CALCULATING THE INDEX
For each data source, the PTI finds the proportional popularity of a technology versus all other technologies. To reduce volatility, we apply a simple moving average at the data source level: we average a technology's proportional popularity over a rolling three-month window for each data source. Then, to reduce the effect of outliers, the Index takes the median of these proportional popularities across the data sources for each technology and rescales the medians so the results sum to one.
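The first two steps, applied to a single data source, can be sketched in Python as follows. This is a minimal illustration rather than the Index's actual implementation: the function names and the raw counts are hypothetical, and it assumes a technology missing from a given month counts as zero for that month.

```python
from statistics import mean


def proportional_popularity(raw_counts):
    """Turn one month's raw counts for one data source into shares that sum to 1."""
    total = sum(raw_counts.values())
    if total == 0:
        # Assumption: with no signal at all, every technology gets a share of 0.
        return {tech: 0.0 for tech in raw_counts}
    return {tech: count / total for tech, count in raw_counts.items()}


def smoothed_popularity(monthly_raw_counts, window=3):
    """Average each technology's share over the most recent `window` months."""
    recent = monthly_raw_counts[-window:]
    shares = [proportional_popularity(month) for month in recent]
    techs = shares[-1].keys()
    return {tech: mean(s.get(tech, 0.0) for s in shares) for tech in techs}


# Hypothetical raw counts (for example, view counts) for two technologies,
# oldest month first.
months = [
    {"A": 100, "B": 300},
    {"A": 200, "B": 200},
    {"A": 300, "B": 100},
]
smoothed = smoothed_popularity(months)
print(smoothed)
```

Averaging the monthly shares, rather than the raw counts, follows the description above, where the moving average is taken over the proportional popularity.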
Check out an example of how this works in practice:
Example Technology A
Consider a technology where the proportional popularity by data source was 0.10 (YouTube), 0.30 (GitHub), 0.50 (Stack Overflow), 0.70 (Google Search), 0.80 (Google AdWords), 0.50 (Indeed), 0.25 (Dice), 0.75 (Reddit). The median proportional popularity would be 0.50.
Example Technology B
Now consider another technology with the following proportional popularity: 0.90 (YouTube), 0.70 (GitHub), 0.60 (Stack Overflow), 0.30 (Google Search), 0.10 (Google AdWords), 0.40 (Indeed), 0.33 (Dice), and 0.66 (Reddit). Here, as well, the median proportional popularity would be 0.50. The individual values differ from those of Example A, but the median (the midpoint of the two middle values, 0.40 and 0.60) is the same.
Example Technology C
Finally, to add some dimension to the examples, let’s consider a case where there is the following proportional popularity for a technology: 0.20 (YouTube), 0.25 (GitHub), 0.35 (Stack Overflow), 0.60 (Google Search), 0.55 (Google AdWords), 0.35 (Indeed), 0.35 (Dice), and 0.45 (Reddit). The median proportional popularity would be 0.35.
The goal is to fit a 0-1 scale. The Index accomplishes this by finding the share of the total that each median proportional popularity represents. In other words, the Index divides each individual median proportional popularity by the sum of all the median proportional popularities.
If the entirety of the Index was only the three examples presented above, the PTI would find the overall Index values by rescaling as follows:
Example Technology A
- Example A median proportional popularity / (Example A prop pop + Example B prop pop + Example C prop pop) = 0.50 / 1.35 = 0.370.
- The interpretation is that this specific technology, across all eight data sources, captures 37% of the popularity.
Example Technology B
- Example B median proportional popularity / (Example B prop pop + Example A prop pop + Example C prop pop) = 0.50 / 1.35 = 0.370.
- Again, this example technology captures 37% of the overall popularity.
Example Technology C
- Example C median proportional popularity / (Example C prop pop + Example A prop pop + Example B prop pop) = 0.35 / 1.35 = 0.259.
- The interpretation is that this technology captures 25.9% of the overall popularity.
If the rounded numbers are added (0.370 + 0.370 + 0.259), the total is 0.999, which equals 1 up to rounding.
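The median and rescaling steps for the three example technologies can be reproduced with a short Python sketch. The function name `index_values` and the dictionary layout are illustrative assumptions, not part of the Index itself.

```python
from statistics import median


def index_values(popularity_by_source):
    """Take each technology's median across sources, then rescale to sum to 1."""
    medians = {
        tech: median(by_source.values())
        for tech, by_source in popularity_by_source.items()
    }
    total = sum(medians.values())
    return {tech: m / total for tech, m in medians.items()}


examples = {
    "A": {"YouTube": 0.10, "GitHub": 0.30, "Stack Overflow": 0.50,
          "Google Search": 0.70, "Google AdWords": 0.80, "Indeed": 0.50,
          "Dice": 0.25, "Reddit": 0.75},
    "B": {"YouTube": 0.90, "GitHub": 0.70, "Stack Overflow": 0.60,
          "Google Search": 0.30, "Google AdWords": 0.10, "Indeed": 0.40,
          "Dice": 0.33, "Reddit": 0.66},
    "C": {"YouTube": 0.20, "GitHub": 0.25, "Stack Overflow": 0.35,
          "Google Search": 0.60, "Google AdWords": 0.55, "Indeed": 0.35,
          "Dice": 0.35, "Reddit": 0.45},
}

index = index_values(examples)
for tech, value in index.items():
    print(f"{tech}: {value:.3f}")
# A: 0.370
# B: 0.370
# C: 0.259
```

Before rounding, the three values sum to exactly 1; the 0.999 total appears only after each value is rounded to three decimal places.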
CALCULATING GROWTH RATES
The PTI also captures the growth rates of these technologies, using the CMGR (compound monthly growth rate) to smooth potentially volatile growth rates.
We calculate the CMGR for each technology at the data source level, computing a CMGR value for each of our eight data sources. To get an overall Index-level CMGR view, we then take the median of the data-source level CMGR values.
We use a rolling three-month window for our calculations, and the data source-level CMGR is calculated as (ending value / beginning value)^(1/4) - 1. Since we calculate the CMGR at the data source level, our beginning and ending values are specific to the metric that we track for a particular data source.
We then calculate the median CMGR value for our technology.
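The growth calculation can be sketched in Python as follows. This is an illustration under stated assumptions: the function names and the sample values are hypothetical, and the default of four periods simply mirrors the 1/4 exponent in the formula above.

```python
from statistics import median


def cmgr(beginning, ending, periods=4):
    """Compound monthly growth rate: (ending / beginning)^(1 / periods) - 1."""
    if beginning <= 0:
        raise ValueError("beginning value must be positive")
    return (ending / beginning) ** (1 / periods) - 1


def index_cmgr(values_by_source, periods=4):
    """Median of the per-source CMGR values for one technology."""
    return median(
        cmgr(beginning, ending, periods)
        for beginning, ending in values_by_source.values()
    )


# Hypothetical (beginning, ending) metric values for one technology.
# Each source uses its own metric, so the scales differ.
values = {
    "GitHub": (1000, 1000),                 # no growth
    "YouTube": (50000, 50000 * 1.1 ** 4),   # 10% compound growth per period
    "Reddit": (200, 200 * 1.2 ** 4),        # 20% compound growth per period
}
overall = index_cmgr(values)
print(f"{overall:.3f}")
```

Because each source's rate is a ratio of its own ending and beginning values, sources with very different scales (stars, views, subscribers) can be compared directly before the median is taken.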
WHAT THE INDEX MEANS
Each value in the Index represents how popular a technology is, relative to the other tracked technologies, as measured across the eight data sources. For the PTI, the definition of popularity encompasses multiple dimensions of technology engagement. This means the Index captures:
- interest around the origin and usage of a given technology
- interest around learning a given technology
- the actual utilization of a given technology
WHAT THE INDEX IS NOT
The Index is neither 1) all-inclusive nor 2) 100% exact. Here is why:
NOT ALL-INCLUSIVE
The Index isn't all-inclusive because there are numerous technologies in active use today, and the Index currently tracks just over 850 of them. However, the PTI makes every effort to track the most popular technologies.
Since we’re releasing the Index in phases, certain technologies may not be present yet. In the future, the Index will scale the number of technologies it tracks and the number of data sources it uses.
NOT 100% EXACT
The Index is an estimate of popularity and engagement around the technologies it tracks, specifically relative to the other tracked technologies. Because of real-world constraints, choices were made about how to efficiently and effectively capture signals from these data sources. Its performance depends upon the ability to:
- Expertly hand-tune relevant queries (submitted to search-based data sources) for Google Search, Indeed, Dice, Reddit and YouTube in a consistent way to yield comprehensive and relevant results
- Find and map related repositories for GitHub signals
- Identify all the relevant tags for Stack Overflow signals
- Identify all related search keywords for Google AdWords signals
- Specify all technology-specific Reddit subreddits
With all the above in mind, it is possible for the carefully hand-tuned search queries to miss some relevant results and to pick up irrelevant ones. It is also possible for the Index to miss related GitHub repositories, overlook some related Stack Overflow tags (and include others erroneously), and miss some of the relevant search keywords from Google AdWords.