Data Science for Beginners

This guide will help you understand what Data Science is, typical processes, and the various applications of Data Science.

By Gaurav Singhal

Jan 27, 2020 • 9 Minute Read


Introduction

Nowadays almost everybody knows what data is, at least in a layman's sense. Data drives decisions everywhere, and data science is steadily gaining traction by offering new algorithmic solutions to the challenges that come with it.

Data is often called the new oil. At our current pace, we generate about 2.5 quintillion bytes of data each day, and that pace is only accelerating with the growth of the Internet of Things (IoT). Ninety percent of the data in the world was generated over the last two years alone. On their own, the words "data science" give only a vague idea of what it means to work with all of this data. After reading this guide, I hope you will have a deeper understanding of the field of data science.

What is Data Science?

Wikipedia defines data science as a field focused on extracting knowledge and insights from data by using scientific methods. It is an interdisciplinary field that allows you to obtain knowledge from structured or unstructured data. Data science is not a single discipline, but rather a combination of several disciplines focused on analyzing data. Initially, these tasks were handled by mathematicians or statisticians. In time, experts began to use machine learning, deep learning, and artificial intelligence, which added optimization and computer science as methods for analyzing data.

Artificial intelligence, machine learning, deep learning, and data science are among the most frequently used terms in the field today, so it is worth understanding how they differ.

Artificial Intelligence (AI)

Artificial intelligence focuses on creating intelligent machines that act or solve problems like humans. In 1950, Alan Turing proposed what is now known as the Turing test as a way to judge whether a machine exhibits intelligent behavior. As technology advanced, earlier benchmarks that defined AI became outdated. For example, machines that calculate basic functions or recognize text through optical character recognition are no longer considered to embody AI.

Machine Learning (ML)

Machine learning is the field of study that gives computers the capacity to learn from data without being explicitly programmed. It draws on statistics and mathematical algorithms and is seen as a subset of AI. ML algorithms build mathematical models that find patterns in given data, called training data. The models then make predictions or decisions on unseen data, called test data.

Machine learning offers many types of algorithms, and nobody can tell which one to use without looking at the data and understanding the problem to be solved. The sections below cover the most common categories.

Supervised Learning

Supervised learning is the most common form of machine learning. It is the type of learning in which we train a model using data that is well labeled, meaning that the right answer is tagged in the training dataset. As the name indicates, the labels act as a supervisor or teacher. Once the model is trained on the labeled data, it is applied to a new set of data, the test data, to predict results.

Supervised learning involves two categories of algorithms:

Classification: The output variable is a category, such as a 'Man' or 'Woman', 'Adult' or 'Child'
Regression: The output variable is a real value, such as weight or height.
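
The two categories above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production approach: the height data, the function names, and the choice of a nearest-neighbor classifier and a least-squares line are all assumptions made for the example.

```python
# Hypothetical labeled training data: (height in cm, label).
TRAINING_HEIGHTS = [
    (95, "Child"), (110, "Child"), (120, "Child"),
    (165, "Adult"), (172, "Adult"), (180, "Adult"),
]


def classify_nearest_neighbor(height_cm, training=TRAINING_HEIGHTS):
    """Classification: return the label of the closest training example."""
    _, label = min(training, key=lambda pair: abs(pair[0] - height_cm))
    return label


def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # The classifier outputs a category; the regression outputs a real value.
    print(classify_nearest_neighbor(100))
    slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
    print(slope * 5 + intercept)
```

In both cases the "right answers" (labels or target values) are part of the training data, which is what makes the learning supervised.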

Unsupervised Learning

As the name suggests, there is no supervision in unsupervised learning, which means a model is trained on data that is neither classified nor labeled. Unsupervised learning allows a model to act upon information without any guidance. Here, the task of a model is to find patterns in the input data and combine data that is similar.

Unsupervised learning involves two categories of algorithms:

Clustering: Groups data into clusters based on one or more factors, such as grouping consumers by age.
Association: Uses rules to describe relationships in a large portion of the input data; for example, if a consumer buys item X, then they also tend to buy item Y.
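
As a rough sketch of both categories, the code below clusters consumer ages with a basic one-dimensional k-means loop and measures how often a "buys X, then buys Y" rule holds. The function names, the way starting centers are chosen, and the sample ages and baskets are assumptions made for illustration.

```python
def kmeans_1d(values, k, iterations=20):
    """Clustering: group numbers into k clusters with a basic k-means loop."""
    ordered = sorted(values)
    # Assumed initialization: take the middle value of each of k equal chunks.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


def rule_confidence(baskets, x, y):
    """Association: share of baskets containing x that also contain y."""
    with_x = [b for b in baskets if x in b]
    if not with_x:
        return 0.0
    return sum(1 for b in with_x if y in b) / len(with_x)


if __name__ == "__main__":
    ages = [18, 21, 22, 25, 58, 61, 64, 70]
    print(kmeans_1d(ages, k=2))
    baskets = [{"bread", "butter"}, {"bread", "butter", "milk"}, {"bread", "jam"}, {"milk"}]
    print(rule_confidence(baskets, "bread", "butter"))
```

Note that neither function is given labels: the groups and the rule strength come only from the structure of the input data.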

Reinforcement Learning (RL)

Reinforcement learning is about taking actions to maximize a reward in a particular situation. Various software and machines use it to find the best possible behavior or path for a specific situation. RL is very different from supervised learning: in supervised learning, a model is trained on correct answers, but in RL, labeled data is not used. Instead, an agent decides what to do to perform a given task and learns from experience.

A classic example of RL is a computer learning to play a video game in which it earns rewards for passing successive stages.
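
To make "learning from experience" concrete, here is a minimal sketch of tabular Q-learning, one common RL algorithm, on a made-up game: an agent in a five-cell corridor earns a reward only when it reaches the last cell. The environment, the reward, and all parameter values are assumptions chosen for illustration.

```python
import random

N_STATES = 5      # positions 0..4 in a corridor; position 4 is the goal
ACTIONS = (0, 1)  # 0 = step left, 1 = step right


def step(state, action):
    """Hypothetical environment: move one cell, reward 1.0 on reaching the goal."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def train_q_table(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning: estimate the value of each action by trial and error."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            best = max(q[state])
            greedy = [a for a in ACTIONS if q[state][a] == best]
            # Explore with probability epsilon, otherwise act greedily
            # (ties are broken at random so the untrained agent still moves around).
            action = rng.choice(ACTIONS) if rng.random() < epsilon else rng.choice(greedy)
            next_state, reward = step(state, action)
            # Nudge the estimate toward reward + discounted best future value.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


if __name__ == "__main__":
    for state, values in enumerate(train_q_table()):
        print(state, values)
```

No one tells the agent that "right" is the correct answer. It discovers that stepping right has the higher value in every cell because that is what eventually leads to the reward.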

Deep Learning

Deep learning is a branch of machine learning based on artificial neural networks, which are loosely inspired by the neurons, axons, and dendrites of the human brain. Multi-layer neural networks are used in areas where more advanced or faster analysis is needed. Deep learning finds intricate, hidden patterns in various types of data, such as images, text, documents, and video.
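
The sketch below shows the basic building block, an artificial neuron, and why stacking neurons in layers matters. The weights here are hand-picked for illustration, which is an assumption of this example; in real deep learning, the weights are learned from data during training. A single neuron of this kind cannot compute XOR ("one input or the other, but not both"), while a two-layer network can.

```python
def step_activation(x):
    """Fire (1) when the weighted input is positive, otherwise stay silent (0)."""
    return 1 if x > 0 else 0


def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs plus bias, then activation."""
    return step_activation(sum(i * w for i, w in zip(inputs, weights)) + bias)


def xor_network(x1, x2):
    """Two-layer network with hand-picked (not learned) weights that computes XOR."""
    h_or = neuron((x1, x2), (1, 1), -0.5)        # hidden neuron: fires if either input is 1
    h_and = neuron((x1, x2), (1, 1), -1.5)       # hidden neuron: fires only if both are 1
    return neuron((h_or, h_and), (1, -1), -0.5)  # output neuron: OR but not AND


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xor_network(a, b))
```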

The Data Science Process

This section is about the flow of the entire data science process, from obtaining data to making accurate calculations and predictions.

Data Accumulation

This step involves acquiring or extracting data from internal and external sources.

Data can come from various sources:

Data streamed from online sources using APIs
Logs from web servers
Census data, terrain data, or weather data
Data gathered from social media

Required skills include:

Database management: Either SQL or NoSQL, depending on your needs and requirements
Querying datasets
Retrieving unstructured data in the form of video, audio, text, documents, etc.
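
As a small illustration of querying a dataset with SQL, the sketch below loads a few web server log entries into an in-memory SQLite database using Python's built-in sqlite3 module and counts successful hits per page. The table name, column names, and sample rows are hypothetical.

```python
import sqlite3


def top_pages(log_rows, limit=2):
    """Load (path, status) rows into an in-memory database and count hits per page."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE requests (path TEXT, status INTEGER)")
        conn.executemany("INSERT INTO requests VALUES (?, ?)", log_rows)
        cursor = conn.execute(
            "SELECT path, COUNT(*) AS hits FROM requests "
            "WHERE status = 200 GROUP BY path "
            "ORDER BY hits DESC, path ASC LIMIT ?",
            (limit,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    rows = [
        ("/home", 200), ("/home", 200), ("/about", 200), ("/home", 404),
        ("/pricing", 200), ("/about", 200), ("/home", 200),
    ]
    print(top_pages(rows))
```

The same query would work against a larger SQL database; only the connection step would change.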

Data Wrangling

Data can have inconsistencies, such as missing values, incorrect data formats, and blank columns or rows, and you have to fix or remove them. This must be given top priority before modeling because the quality of your model's predictions depends on clean data.
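
The following sketch shows what handling these inconsistencies can look like, using only Python's standard csv module so that it runs anywhere. The sample data, the column names, and the decision to mark unusable values as None are assumptions for illustration. In practice, a library such as pandas is usually used for this step.

```python
import csv
import io

# Hypothetical raw data with a blank row, a missing age, a badly formatted age,
# stray whitespace, and a missing city.
RAW_CSV = """name,age,city
Asha,34,Pune
,,
Ravi,,Delhi
Meera,abc,Mumbai
Tom, 29 ,
"""


def clean_rows(raw_text):
    """Drop blank rows, convert age to int, and mark unusable values as None."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        values = {key: (value or "").strip() for key, value in row.items()}
        if not any(values.values()):
            continue  # skip rows where every column is blank
        age_text = values["age"]
        values["age"] = int(age_text) if age_text.isdigit() else None
        values["city"] = values["city"] or None
        cleaned.append(values)
    return cleaned


if __name__ == "__main__":
    for row in clean_rows(RAW_CSV):
        print(row)
```

Whether to drop a row, fill in a default, or keep a missing marker depends on the problem. The point is that the choice is made deliberately before modeling.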

Skills required:

Scripting languages: Python, R, SAS
Data wrangling tools: pandas (Python), R
Distributed processing: Hadoop

Exploratory Data Analysis

This step is time intensive and largely decides the result of your machine learning model. In this step, you come to understand the data through statistical tests and visualizations. This phase aims to uncover the hidden meaning in the data, which will give you an idea of which algorithm to use and with what parameters.
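
A first pass at understanding a dataset often starts with descriptive statistics and a look at how two columns move together. The sketch below uses Python's built-in statistics module. The column values (hours studied and exam scores) are made up for illustration.

```python
import statistics


def summarize(values):
    """Basic descriptive statistics for one numeric column."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    hours = [1, 2, 3, 4, 5]          # hypothetical hours studied
    scores = [52, 55, 61, 64, 68]    # hypothetical exam scores
    print(summarize(scores))
    print(pearson_correlation(hours, scores))
```

A correlation close to 1 or -1 suggests a roughly linear relationship, which hints that a simple linear model may be a reasonable starting point.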

Skills Required:

Inferential Statistics
R libraries: ggplot2, dplyr
Python libraries: NumPy, Matplotlib, pandas, SciPy
Data Visualization Python libraries: Bokeh, Matplotlib, Seaborn

Modeling

This is a crucial step. Here you start building your model by splitting the dataset into a training set and a test set. Different techniques, such as association, classification, and clustering, are applied to the training set. After training, the model is evaluated against the test set.
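
The split, train, and test workflow can be sketched as follows. The toy dataset, the 80/20 split, and the very simple threshold "model" are assumptions chosen to keep the example short. Libraries such as scikit-learn provide ready-made versions of each piece.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_threshold(train_rows):
    """Toy model: the midpoint between the mean feature value of each class."""
    zeros = [x for x, label in train_rows if label == 0]
    ones = [x for x, label in train_rows if label == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def accuracy(threshold, rows):
    """Share of rows whose predicted label (x >= threshold) matches the true label."""
    correct = sum(1 for x, label in rows if int(x >= threshold) == label)
    return correct / len(rows)


if __name__ == "__main__":
    # Hypothetical dataset: one feature x, label 1 when x is at least 10.
    data = [(x, int(x >= 10)) for x in range(20)]
    train, test = train_test_split(data)
    model = fit_threshold(train)  # learn only from the training set
    print("test accuracy:", accuracy(model, test))
```

Keeping the test rows out of training is the important design choice here: accuracy measured on data the model has already seen would overstate how well it handles new data.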

Skills Required:

Machine Learning Libraries: Python (scikit-learn) / R (caret)
Machine Learning: Supervised/unsupervised/reinforcement learning algorithms
Linear algebra and calculus

Results

In this step, you deliver the final baselined model with reports, code, and technical documents. The model is deployed into a real-time production environment after thorough testing. The key findings are communicated to all stakeholders.

Applications of Data Science

Internet search: Google processes around 3.5 billion search queries in a day with the help of data science.
Image and Speech Recognition: The face-locking system in mobile phones runs with the help of data science. Speech recognition systems like Siri, Google Assistant, and Alexa also rely on data science.
Recommendation systems: Almost every recommendation system runs with the help of data science. Companies like Amazon and Netflix use these types of systems to suggest products from billions of possibilities.
Airline route planning: Data science enables airlines to predict flight delays, decide which classes of airplanes to buy, and determine when to land at a destination. These decisions help make airline travel more cost effective.
Gaming: EA Sports, Sony, and Nintendo are using data science technology to enhance the gaming experience. Games are now developed using machine learning techniques.
Banking: Banking is one of the most prominent applications of data science. Big data and data science have enabled banks to keep up with the competition and manage their resources efficiently.
Health care: Medical image analysis, genetics and genomics, drug discovery, predictive modeling for diagnosis, health bots, and virtual assistants are all applications of data science in health care.

Conclusion

Data science is a fascinating and vast topic, and much has already been done in this field. Still, we have a long way to go as data science takes us to the world's next singularity. If you are from a tech background and have an interest in data, this could be an exciting field for you.

Gaurav S.

Gaurav is a Data Scientist with a strong background in computer science and mathematics. He has extensive research experience in data structures, statistical data analysis, and mathematical modeling. With a solid background in web development, he works with Python, Java, Django, HTML, Struts, Hibernate, Vaadin, web scraping, Angular, and React. His data science skills include Python, Matplotlib, TensorFlow, Pandas, NumPy, Keras, CNN, ANN, NLP, recommenders, and predictive analysis. He has built systems that use both basic machine learning algorithms and complex deep neural networks. He has worked on many data science projects, including product recommendation, user sentiment analysis, Twitter bots, information retrieval, predictive analysis, data mining, image segmentation, SVMs, and random forests.
