Online Resources for Python Data Analytics

By Kimaru Thagana

Jun 15, 2020 • 10 Minute Read

Introduction

With the advent of the information age, fueled by the increasing availability of the internet and technology, many businesses now generate vast amounts of data. This data usually contains enormous hidden value that, if unearthed, could provide businesses with insights that would better their operational efficiency, boost competitive advantage, or increase revenues. Data analytics is the scientific approach to analyzing and visualizing this data in order to draw conclusions. Any company that generates data and wishes to use it to inform its decision-making process requires data analytics. A professional who performs data analytics is commonly referred to as a data analyst.

This guide offers online resources and materials that will solidify your knowledge in Python for data analytics. It is broken down into sections in general chronological order of data analysis steps. In each subsection, you will learn about various functions and packages that perform particular tasks. The guide assumes at least intermediate knowledge of Python.

General Workflow for Python Data Analytics

Being a science, data analytics follows best practices in terms of methodology. There is a general consensus on the chronology of tasks. This order is outlined below with resources for further reading and research.

The most common tools in a Python data analyst's toolbox are these libraries:

Pandas: Mainly used for data loading and manipulation. Commonly imported as import pandas as pd.
NumPy: Mainly used for scientific and matrix computations. Commonly imported as import numpy as np.
Matplotlib: A visualization library. Commonly imported as from matplotlib import pyplot as plt.
scikit-learn: An extensive library mainly used for machine learning tasks.

Obtain Datasets

The first step in any data analytics work, as the name suggests, is obtaining data. Data is the fuel for analytics. Currently, there are many online repositories that provide datasets across various domains. These include:

Kaggle: One of the most well known resources in the data community. To download datasets, you must be signed in.
UCI ML Repository: Maintains 400+ datasets, which can generally be downloaded without a login.
Data.gov: The US government's open data catalog contains datasets dealing with the governance, health, administrative, and census domains. Many other countries, such as Canada, maintain open data portals of their own.
Lionbridge AI: A good resource for well-curated, well-organized datasets.
Google Dataset Search: A specialized search engine from Google that lets users search for datasets much as they would run any other Google search.

The above are good resources, but sometimes your problem may be so unique that you need to generate your own datasets. This is commonly done via web scraping, as this tutorial demonstrates. For image datasets, use a similar approach, but instead of downloading text and numeric data, download images.
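As a rough illustration of the scraping idea, the sketch below pulls the rows of an HTML table into a list of records using only the Python standard library. The HTML string, the TableParser class, and the column names are all made up for this example. In a real project you would first download the page with an HTTP client, and many analysts prefer a third-party parser such as Beautiful Soup.

```python
from html.parser import HTMLParser

# Made-up HTML standing in for a page you have already downloaded.
SAMPLE_HTML = """
<table>
  <tr><th>city</th><th>visitors</th></tr>
  <tr><td>Alpha</td><td>100</td></tr>
  <tr><td>Beta</td><td>250</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Hypothetical helper: collects the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # completed rows
        self._row = None        # cells of the row being read, None outside a row
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._in_cell:
            self._row[-1] += data.strip()


parser = TableParser()
parser.feed(SAMPLE_HTML)

# The first row holds the column names; the rest are data records.
header, *records = parser.rows
scraped = [dict(zip(header, record)) for record in records]
print(scraped)
```

Every scraped value arrives as a string, so a type conversion step is usually needed before analysis.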

Exploratory Data Analysis (EDA)

This is the process of examining your data at a glance to get a feel for it. This could include printing out some rows or displaying some basic visualizations. To do this, you must first load the dataset. This is commonly done in pandas, where the dataset is loaded into a dataframe.

Dataset loading in pandas: General function syntax is pd.read_csv(filepath)

To inspect the dataframe, you can print the first five rows using print(dataframe.head()) or the last five rows using print(dataframe.tail()). For numeric columns, basic statistics such as the mean or median are usually important. To check them, use the describe function: dataframe.describe().

In terms of visualizations, the main aim at this stage is to check distributions and, where possible, identify outliers or general trends in the data. Code tutorials on Kaggle and Medium demonstrate these visualizations well.
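A minimal sketch of this first look, assuming pandas is installed. The CSV text and its column names are invented for the example, and io.StringIO stands in for a file path so the snippet runs on its own.

```python
import io

import pandas as pd

# Made-up CSV text; io.StringIO lets pd.read_csv treat it like a file on disk.
csv_text = """name,age,income
Ann,34,52000
Ben,45,61000
Cat,29,48000
Dan,52,75000
Eve,41,58000
Fay,38,990000
"""
dataframe = pd.read_csv(io.StringIO(csv_text))

print(dataframe.head())    # first five rows
print(dataframe.tail(2))   # last two rows; tail() with no argument returns five

# Count, mean, standard deviation, min, quartiles, and max for each numeric column.
summary = dataframe.describe()
print(summary)
```

In the summary, the maximum income sits far above the upper quartile, which is the kind of hint that a value may be an outlier worth plotting.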

Data Cleaning

Real-world data is often messy. Sometimes values are missing (shown as NaN in pandas), sometimes rows are duplicated, and sometimes the data contains unwanted values. For accurate and reliable analysis, data cleaning is paramount. Some techniques include:

Dropping columns: In some cases, some columns are irrelevant to the goal of the analysis and hence have no use. You can drop whole columns or rows, or only the columns or rows that contain null values. The general syntax is dataframe.drop() for the former and dataframe.dropna() for the latter.

Imputation: A technique for filling in missing data with reasonable values. This is employed when too much data is missing to simply drop it. Common imputation tools include scikit-learn's KNNImputer, SimpleImputer, and IterativeImputer.

Drop duplicates: A technique used to get rid of similar rows or columns that might impact the quality of your analysis. General syntax is dataframe.drop_duplicates().

Regex Filtering: Sometimes, the data has some unwanted strings and needs to be replaced or eliminated. The best approach is usually regex, a vital tool in a data analyst's kit.

Type Conversion: Sometimes the data type of your columns may not suit the needs of the analysis. This calls for changing the data type. The general syntax is dataframe.astype().
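The cleaning techniques above can be sketched on one small, made-up dataframe. The column names and values are assumptions for illustration, and the median fill is a simple stand-in for the scikit-learn imputers, not a replacement for them.

```python
import numpy as np
import pandas as pd

# Made-up raw data: a duplicated row, a currency symbol, a missing quantity,
# ids stored as text, and a column that is irrelevant to the analysis.
raw = pd.DataFrame({
    "id": ["1", "2", "2", "3", "4"],
    "price": ["$10.50", "$7.25", "$7.25", "$4.00", "$3.00"],
    "qty": [2.0, 4.0, 4.0, np.nan, 1.0],
    "notes": ["a", "b", "b", "c", "d"],
})

# Dropping columns: remove a column that has no use in the analysis.
clean = raw.drop(columns=["notes"])

# Drop duplicates: the second and third rows are identical, so keep only one.
clean = clean.drop_duplicates()

# Regex filtering: strip every character that is not a digit or a decimal point.
clean["price"] = clean["price"].str.replace(r"[^0-9.]", "", regex=True)

# Type conversion: turn the text columns into numeric types.
clean = clean.astype({"id": "int64", "price": "float64"})

# Option 1: drop the rows that still contain a null value.
dropped = clean.dropna()

# Option 2 (imputation): fill the missing quantity with the column median.
clean["qty"] = clean["qty"].fillna(clean["qty"].median())

print(clean)
```

Whether to drop or impute depends on how much data is missing: dropping is simplest, but it discards the other values in the affected rows.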

Data Enrichment and Processing

At this stage, all your data is clean. Before performing analytics, you want to enrich your data for better results, depending on your goal.

Combining datasets: Techniques include joining, appending, concatenating and merging.

Elementwise Computation: The apply and map functions let you process the rows or columns of a dataframe, or the individual values of a column, with a function you pass in. The function could be anything from aggregation to data transformation.

Pre-Processing: A quality assurance step that ensures the data you feed to your analytics is in the best form possible.
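The combining and elementwise techniques above might look like the following in pandas. All frames, column names, and values are invented for this sketch.

```python
import pandas as pd

# Made-up data: orders, a late batch of orders, and a customer lookup table.
orders = pd.DataFrame({"customer_id": [1, 2, 1, 3], "amount": [20.0, 35.0, 15.0, 50.0]})
late_orders = pd.DataFrame({"customer_id": [2], "amount": [10.0]})
customers = pd.DataFrame({"customer_id": [1, 2, 3], "region": ["east", "west", "east"]})

# Concatenating: stack rows from two frames that share the same columns.
all_orders = pd.concat([orders, late_orders], ignore_index=True)

# Merging: attach each customer's region to their orders (a SQL-style left join).
enriched = all_orders.merge(customers, on="customer_id", how="left")

# map: transform a single column value by value, here with a lookup dictionary.
enriched["region_code"] = enriched["region"].map({"east": "E", "west": "W"})

# apply with axis=1: build a new value from each row.
enriched["label"] = enriched.apply(
    lambda row: f"{row['region_code']}-{row['amount']:.0f}", axis=1
)

# Aggregation: total order amount per region.
totals = enriched.groupby("region")["amount"].sum()
print(enriched)
print(totals)
```

A left join keeps every order even when a customer is missing from the lookup table, which makes gaps in the enrichment data visible as NaN rather than silently dropping rows.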

Insight Extraction and Predictions

This is the stage where you derive insights and hidden information from your data. The most common library for this stage is the scikit-learn library.

Statistical Inference: Performing inference techniques to gain insight.

Unsupervised Learning: Using algorithms for data that has no initial labels.

Prediction using Regression: Regression can be used for both prediction and classification. Linear and lasso regression predict continuous values, while logistic regression predicts class labels. More on using regression with a dataset can be found here.
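As a small illustration of prediction with regression, the sketch below fits scikit-learn's LinearRegression to synthetic data. The data is generated from a known line (y = 3x + 2) with a little noise, purely so the fitted values can be checked; a real project would use its own features and target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 3x + 2 plus a small amount of random noise.
rng = np.random.default_rng(0)
X = np.arange(20, dtype=float).reshape(-1, 1)   # one feature, 20 samples
y = 3.0 * X.ravel() + 2.0 + rng.normal(scale=0.1, size=20)

model = LinearRegression()
model.fit(X, y)

slope = model.coef_[0]
intercept = model.intercept_

# Predict the target for an unseen input, x = 25.
prediction = model.predict(np.array([[25.0]]))[0]

print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction={prediction:.2f}")
```

The fitted slope and intercept should land close to 3 and 2. On real data there is no known answer to compare against, so it is usual to hold out part of the data and measure the error on it.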

Meaningful Visualizations

At this point, you have extracted your insights and predictions. It is time to communicate your findings. You have several tools at your disposal. Whichever you choose, you should know why and be guided by best practices. Visualization options include:

Folium: Best for visualization of map-based data.
Bokeh: Interactive plotting. Also supports shapefiles and map plotting.
Matplotlib: A visualization library for Python.
Seaborn: Based on matplotlib but provides a high-level interface.
Plotly: Interactive visualizations for different scenarios.
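Whichever library you pick, a chart that communicates findings needs a title and labeled axes. Here is a minimal matplotlib sketch under that assumption. The revenue figures and the output file name are made up, and the Agg backend is chosen only so the script runs without a display.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-interactive backend: render to a file, no display needed
from matplotlib import pyplot as plt

# Made-up values for illustration, in thousands.
months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [12.0, 15.5, 14.0, 18.2]

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(months, revenue, marker="o")
ax.set_title("Monthly revenue")
ax.set_xlabel("Month")
ax.set_ylabel("Revenue (thousands)")
fig.tight_layout()

# Save to a temporary directory; the file name is a hypothetical choice.
output_path = os.path.join(tempfile.gettempdir(), "revenue.png")
fig.savefig(output_path)
plt.close(fig)
```

In a Jupyter notebook you would typically skip the backend line and the savefig call, since figures render inline.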

Serve Project in Production

At this point, all is done. The final step is to deploy your project or share it. Options include:

Jupyter notebooks: Allows sharing of scripts in a notebook format.
Streamlit: Allows you to set up and serve your analytics project over the web as an app. You can deploy it on hosting platforms such as Heroku.

Data Analytics Online

With advancements in web technologies, some companies now offer data analytics platforms as a service. These platforms help with the end-to-end processing of your data and provide analytics tools. Most are enterprise products, so the best features may be paywalled. They include Databricks, Dataiku, and Google Data Studio, among others.

Conclusion

The knowledge available in the online resources highlighted in this guide can greatly improve your data analytics skills, helping you land real-world roles in companies or startups. These roles include Data Analyst, Business Intelligence Developer, Data Analytics Consultant, Operations Analyst, and many more. To build on the knowledge gained, build data analytics projects or sign up for a Python course focusing on the data analytics track.

Kimaru T.

Kimaru is a firm believer in education as a tool of self-sufficiency. As a software development consultant living in Kenya, he mainly works to bring small and medium-sized businesses to the internet with custom solutions ranging from data processing to business digitization. Away from the field of coding and computer science, he participates as a mentor for young university students. In his free time, he prefers peace and quiet, away from screens but close to nature.
