Pluralsight Technology Index: Methodology

The Pluralsight Technology Index (PTI) ranks software development technologies by their popularity and engagement relative to one another and gives a sense of whether that popularity is growing or declining. Want to know how popular JavaScript is relative to Blockchain? Swift to Kotlin? Django to Drupal? The index has the answers for these technologies and hundreds more.

With insights from the index, technologists and businesses can determine where they may want to invest their time and resources.

This resource explains the methodology behind the index: which data is used, how it is applied, and how the values are calculated.

 

DATA SOURCES


The Pluralsight Technology Index uses six data sources, which were selected based on their relevance to software developers.

GitHub

GitHub provides a strong perspective on open source project and programming language popularity—especially as it pertains to the open source developer. This data is retrieved from public data sets hosted on Google BigQuery, and these data sets are updated quarterly. For those technologies that have a related GitHub repository, the index captures the number of stars on the repo. In order to capture programming language popularity, the index measures the number of bytes pushed to repositories in each language.

Stack Overflow

Stack Overflow’s popular Q&A format provides an in-depth understanding of which technologies generate the most questions. This leans towards programming languages and related topics (like frameworks), but other technologies are present as well. Stack Overflow’s Q&A data is also not specific to proprietary or open source technologies. The data comes from public data sets hosted on BigQuery, which are updated quarterly. For each technology, the PTI captures the relevant Stack Overflow tag count. For example, the tag associated with JavaScript is “javascript.” As of today, it has a total tag count of 1,497,447.

Google AdWords

Google AdWords has broad reach and gives a timely view of which technologies are the most searched. This data is retrieved from its APIs. For all relevant search keywords associated with a technology, the index sums the monthly search volume.

YouTube

YouTube is a hugely popular video platform. Measuring the quantity of video views related to different technologies gives us a robust estimate of interest in those topics. This data is pulled from its APIs. For the search query associated with each technology, the PTI sums the view counts of the top 50 videos.

Google Search

Google Search provides a measure of how much written content has been created for any given technology and is extremely comprehensive. Google Search data is retrieved by submitting searches and scraping the “hit” count from the returned result. For each query associated with each technology, the index captures the number of returned results.

Indeed

Indeed provides a comprehensive measure of how much job-related content has been posted for a given technology. Data from Indeed is retrieved in the same way as it is for Google Search.

GUIDING PRINCIPLES


Our goal is to provide an unbiased perspective on technology demand. To achieve this, we don’t apply any weighting techniques to these technologies. Early on in the creation of the index, we realized that if we were to arbitrarily assign weights to either older/mature technologies or newer/unknown technologies, it might undermine that central principle and lead to justifiable critique that we were twisting the data to our own ends.

Along the same lines, we don’t say why certain technologies are listed higher than others. Why? Because we believe that part of the issue learners have with identifying trends in the technology landscape is seeing past individual changes in any given source signal.  

If we expose the underlying popularity metrics of each individual technology on each individual data source, then learners and organizations would have a tendency to apply their own subjective biases to the interpretation of the overall Index. Some users might value the Stack Overflow signal higher than YouTube, or GitHub higher than Google Search, or Google AdWords higher than Indeed. It’s easy to see how this would undermine our goal of offering an objective, quantitative perspective on the relative popularity of different technologies.

To the greatest extent possible, we always opt to provide a true, unvarnished perspective on the technology landscape. We believe that doing so will lend our work a great deal of credibility.

CALCULATING THE INDEX


For each data source, the PTI finds the proportional popularity of a technology versus all other technologies. To reduce some of the volatility, we apply a simple moving average at the data source level with a rolling three-month window: for each data source, we average the technology’s proportional popularity over three months to get its smoothed proportional popularity for that source. Then, to reduce the effect of any outliers, the index takes the median of each technology’s proportional popularities across the various data sources and rescales so the results sum to one.
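The first two steps, the per-source proportion and the three-month moving average, might be sketched as follows. This is a minimal illustration and not the PTI's actual code: the technology names, month labels, raw counts, and function names are all invented for the example.

```python
from statistics import mean

# Hypothetical raw monthly counts for ONE data source (for example, tag counts).
# Technology names, month labels, and numbers are invented for illustration.
raw_counts = {
    "month_1": {"tech_a": 500, "tech_b": 300, "tech_c": 200},
    "month_2": {"tech_a": 600, "tech_b": 300, "tech_c": 100},
    "month_3": {"tech_a": 400, "tech_b": 400, "tech_c": 200},
}


def proportional_popularity(counts):
    """Share of each technology within one month of one data source."""
    total = sum(counts.values())
    return {tech: value / total for tech, value in counts.items()}


def three_month_average(monthly_counts, months):
    """Average each technology's monthly share over the given months."""
    shares = [proportional_popularity(monthly_counts[m]) for m in months]
    return {tech: mean(s[tech] for s in shares) for tech in shares[0]}


smoothed = three_month_average(raw_counts, ["month_1", "month_2", "month_3"])
for tech, share in smoothed.items():
    print(f"{tech}: {share:.3f}")
```

Because each month's shares sum to one, the averaged shares for a data source also sum to one.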

Check out an example of how this works in practice:

Example Technology A

Consider a technology where the proportional popularity by data source was 0.10 (YouTube), 0.30 (GitHub), 0.50 (Stack Overflow), 0.70 (Google Search), 0.80 (Google AdWords), and 0.50 (Indeed). The median proportional popularity would be 0.50. 

Example Technology B

Now consider another example of a technology with the following proportional popularity: 0.90 (YouTube), 0.70 (GitHub), 0.60 (Stack Overflow), 0.30 (Google Search), 0.10 (Google AdWords), and 0.40 (Indeed). Here, as well, the median proportional popularity would be 0.50. With six data sources, the median is the average of the two middle values, in this case 0.40 and 0.60. The individual numbers and the sources they come from differ from Example A, yet the median is the same.

Example Technology C

Finally, to add some dimension to the examples, let’s consider a case where there is the following proportional popularity for a technology: 0.20 (YouTube), 0.25 (GitHub), 0.35 (Stack Overflow), 0.60 (Google Search), 0.55 (Google AdWords), and 0.35 (Indeed). The median proportional popularity would be 0.35.
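The three medians above can be checked with a short sketch. It uses only the numbers from the examples; the variable names are illustrative.

```python
from statistics import median

# Proportional popularity per example, in the order:
# YouTube, GitHub, Stack Overflow, Google Search, Google AdWords, Indeed
examples = {
    "A": [0.10, 0.30, 0.50, 0.70, 0.80, 0.50],
    "B": [0.90, 0.70, 0.60, 0.30, 0.10, 0.40],
    "C": [0.20, 0.25, 0.35, 0.60, 0.55, 0.35],
}

# With an even number of values, median() averages the two middle values.
medians = {name: median(values) for name, values in examples.items()}

for name, value in medians.items():
    print(f"Example {name}: median = {value:.2f}")
```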

 

INDEX GOAL

The goal is to fit a 0-1 scale. The index accomplishes this by finding the overall proportion represented by the median proportional popularity. This means the index divides each individual median proportional popularity by the sum of all the median proportional popularities. 

If the entirety of the index was only the three examples presented above, the PTI would find the overall index values by rescaling as follows:

Example Technology A

  • Example A median proportional popularity / (Example A prop pop + Example B prop pop + Example C prop pop) = 0.50 / 1.35 = 0.370. 
  • The interpretation is that this technology, across all six data sources, captures 37% of the overall popularity.

Example Technology B

  • Example B median proportional popularity / (Example B prop pop + Example A prop pop + Example C prop pop) = 0.50 / 1.35 = 0.370.
  • Again, this example technology captures 37% of the overall popularity.

Example Technology C 

  • Example C median proportional popularity / (Example C prop pop + Example A prop pop + Example B prop pop) = 0.35 / 1.35 = 0.259.
  • The interpretation is that this technology captures 25.9% of the overall popularity.

Adding the rounded values (0.370 + 0.370 + 0.259) gives 0.999, which equals 1 up to rounding.
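The rescaling step for the three-technology example can be reproduced with a few lines. This is a sketch of the arithmetic only, with illustrative variable names.

```python
# Median proportional popularity of the three example technologies.
medians = {"A": 0.50, "B": 0.50, "C": 0.35}

# Divide each median by the sum of all medians so the index sums to one.
total = sum(medians.values())
index = {name: value / total for name, value in medians.items()}

# Rounded to three decimals, as reported in the text.
rounded = {name: round(value, 3) for name, value in index.items()}

for name in index:
    print(f"Example {name}: {rounded[name]:.3f}")
```

The unrounded values sum to one; the 0.999 in the text comes from rounding each value to three decimals before adding.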

CALCULATING GROWTH RATES

The PTI also captures the growth rates of these technologies, using the CMGR (compound monthly growth rate) to smooth potentially volatile growth rates.

We calculate the CMGR for each technology at the data source level, computing a CMGR value for each of our six data sources. To get an overall index-level CMGR view, we then take the median of the data-source level CMGR values.

We use a rolling four-month window for our calculations, so the data-source-level CMGR is calculated as (ending value / beginning value)^(1/4) - 1. Because the CMGR is calculated at the data source level, the beginning and ending values are those of the metric we track for that particular data source.

We then calculate the median CMGR value for our technology.
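A minimal sketch of the growth-rate calculation, assuming the formula given above. The beginning and ending values per data source are invented for illustration, and the function and variable names are hypothetical rather than taken from the PTI.

```python
from statistics import median


def cmgr(beginning, ending, periods=4):
    """Compound monthly growth rate: (ending / beginning)^(1/periods) - 1."""
    if beginning <= 0:
        raise ValueError("beginning value must be positive")
    return (ending / beginning) ** (1 / periods) - 1


# Hypothetical (beginning, ending) metric values for one technology,
# one pair per data source. All numbers are made up.
window_values = {
    "GitHub": (1000, 1100),
    "Stack Overflow": (5000, 5200),
    "Google AdWords": (800, 760),
    "YouTube": (20000, 26000),
    "Google Search": (300000, 310000),
    "Indeed": (150, 150),
}

# CMGR at the data source level, then the median for the index-level view.
per_source = {src: cmgr(begin, end) for src, (begin, end) in window_values.items()}
overall_cmgr = median(per_source.values())

for src, rate in per_source.items():
    print(f"{src}: {rate:+.2%}")
print(f"Index-level CMGR (median): {overall_cmgr:+.2%}")
```

Taking the median means a single source with an unusually large swing (YouTube in this made-up data) does not dominate the index-level growth rate.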

WHAT THE INDEX MEANS


Each value in the index represents how popular a technology is relative to the other tracked technologies, as measured across the six data sources. For the PTI, the definition of popularity encompasses multiple dimensions of technology engagement. This means the index captures:

  • interest around the origin and usage of a given technology
  • interest around learning a given technology
  • the actual utilization of a given technology

WHAT THE INDEX IS NOT


The index is not 1) all-inclusive and not 2) 100% exact. Here’s why:

NOT ALL-INCLUSIVE

The index isn’t all-inclusive because there are numerous technologies in active use today, and, currently, the index tracks just 300+ of them. However, the PTI makes every effort to track the most popular technologies.

Since we’re releasing the index in phases, with the first being focused on software development, certain technologies may not be present yet. For example, TensorFlow may be widely used by developers now; however, we’ve currently associated TensorFlow with the data role that will exist in a coming phase of the index. In the future, the index will scale the number of technologies it tracks and the number of data sources it uses.

NOT 100% EXACT

The index is an estimate of popularity and engagement around the technologies it tracks, specifically relative to the other tracked technologies. Because of real-world constraints, choices were made about how to efficiently and effectively capture signal from these data sources. Its performance depends upon the ability to:

  • Expertly and consistently hand-tune the queries submitted to the search-based data sources (Google Search, Indeed, and YouTube) so that they yield comprehensive and relevant results
  • Find and map related repositories for GitHub signals
  • Identify all the relevant tags for Stack Overflow signals
  • Identify all related search keywords for Google AdWords signals

With all the above in mind, it’s possible the carefully hand-tuned search queries miss some relevant results, and also pick up irrelevant results; or miss related GitHub repositories; or overlook some related Stack Overflow tags, and include others erroneously; or miss some of the relevant search keywords from Google AdWords.