When you are working on a data science project or trying to find data insights to strategize your plans, there are two key steps that cannot be avoided – Data Exploration and Data Visualization.
Data Exploration is an integral part of EDA (Exploratory Data Analysis). Whatever you decide to do in the later phases (creating/selecting a machine learning model or summarizing your findings) will depend on the assumptions you make in the exploration phase. It's not a single-step phase; during data exploration we determine a lot about our data, e.g. checking the data distribution, finding correlations, and spotting outliers and missing values.
Data Visualizations aren't tied to any specific phase of a data analytics project; we can use visuals to represent the data at any point. Data visualization is simply a mapping between data (inputs or outputs) and its visual representation, and it comes in two forms – tabular and graphical.
We need visualization as a visual summary of the data because it's easier to understand, which makes it simpler to identify relations and patterns. Many visuals are used in the data exploration phase to find outliers, correlations between features, and so on. We also use charts and graphs to check the performance of models, or while categorizing or clustering the data.
Choosing the correct chart to communicate your findings about data is also important – using a line chart where a scatter chart belongs might not make sense. There are some basic, widely used charts which we use or see in our day-to-day work, in data science and otherwise, such as bar charts, line charts, scatter plots, histograms, and pie charts.
While trying to make accurate assumptions, we need the best tools to explore and visualize the data. There are several tools and libraries available in the market; it's nearly impossible to remember all of them, and it can be confusing to decide which one to use. The aim of this article is to walk through some of the best tools for data exploration and visualization, starting with Matplotlib.
Matplotlib was introduced to imitate the graphics supported by MATLAB, but in a simpler form. Over the years, many functionalities have been added to the library. Not just that – many visualization libraries and tools with new, interactive, and attractive visuals are built on top of Matplotlib.
To learn more about Matplotlib, let's work with a dataset and see how some of its functions work:
# Load the dataset
import pandas as pd

netflix_df = pd.read_csv('netflix_titles.csv')
netflix_df.head(2)
We have the type of content, title, date added, and other information. But what do we want to do with this information? We could find how many shows and movies are on Netflix (according to the dataset), or we could see which country has produced the most content.
# Import matplotlib (install it first with: pip install matplotlib)
import matplotlib.pyplot as plt

# Find the count of shows and movies
counts = netflix_df["type"].value_counts()
plt.bar(counts.index, counts.values)
plt.show()
In the above code, you can see we've imported Matplotlib's pyplot as plt. Each pyplot function makes some change to a figure – creating a figure, creating a plotting area, plotting lines, adding labels to the plot, etc. We then used plt to draw a bar chart and visualize the data inline.
One thing to remember here is that you have to use the plt.show() command every time a new plot is created. If you want to avoid this repetitive step in a Jupyter notebook, you can use the below command after importing Matplotlib.
%matplotlib inline
There's a lot you can do beyond just creating a simple bar chart. You could provide x and y labels, or give different colors to the bars according to their values. You can change markers, line styles, and widths; add or alter text, legends, and annotations; change the limits and layout of your plots; and much more.
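For instance, here's a minimal sketch of a few of those customizations, reusing the counts variable from the earlier example (the colors and labels are just illustrative choices):

# Sketch: customizing the earlier bar chart
import matplotlib.pyplot as plt

plt.bar(counts.index, counts.values, color=["steelblue", "salmon"])
plt.xlabel("Type of content")
plt.ylabel("Number of titles")
plt.title("Movies vs. TV shows on Netflix")
plt.show()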
We can use Matplotlib to find anomalies in the data too. Let’s try to create a customized plot.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston

boston = load_boston()

# Create the dataframe
boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)

fig = plt.figure(figsize=(10, 7))

# Creating axes instance
ax = fig.add_axes([0, 0, 1, 1])
ax.set_xlabel('Distance')

# Creating plot
bp = ax.boxplot(boston_df['DIS'])
plt.title("Customized box plot")

# Show plot
plt.show()
As this package provides so much flexibility, it can be a bit tricky to choose or even remember things when you start working with it. Luckily, the documentation contains real-life examples, details about each plot's arguments, and all the other information we need. Don't feel overwhelmed; just remember that there can be more than one solution to a problem.
Now that we have some idea what Matplotlib is, let’s discuss the pros and cons, and which tools integrate with it.
A lot of popular Python visualization libraries are built on Matplotlib. For example, Seaborn uses Matplotlib to display the plot once the figure is created. Not just this, but many tools have also integrated with Matplotlib – neptune.ai is one of them.
The first image of a black hole was produced using NumPy and Matplotlib. It's also used in sports for data analysis.
Scikit-learn started as a Google Summer of Code project by David Cournapeau. Later, in 2010, INRIA took it to another level and released a beta version of the library. Scikit-learn has come a long way since; it's now one of the most useful and robust machine learning libraries, built in Python on top of NumPy, SciPy, and Matplotlib.
It doesn't focus on just one aspect of a data science project; it provides a vast collection of efficient tools for data cleaning, curation, modelling, and more.
It has tools for classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.
Where do data exploration and visualization fit in? Scikit-learn has a collection of tools to meet exploratory data analysis requirements – discovering problems and fixing them by transforming the raw data.
If you're looking for datasets to experiment on, Scikit-learn has a datasets module with some popular dataset collections. You can load a dataset as below, without having to download it to your local machine.
from sklearn.datasets import load_iris

data = load_iris()
Scikit-learn plays an important role when it comes to pre-processing, i.e. cleaning and curating. Assume you have a few missing values in your dataset. There are two ways to handle them: drop the rows/columns that contain them, or impute replacement values.
Dropping rows/columns is not always a good choice, so we impute values – zeroes, the mean, the median, etc.
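For completeness, here's a quick sketch of the dropping approach; note this uses pandas, since dropping rows is a dataframe operation rather than a Scikit-learn one:

# Sketch: dropping missing values with pandas
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3], 'b': [4, 5, np.nan]})
df.dropna()        # drop every row that contains a null value
df.dropna(axis=1)  # or drop every column that contains a null value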
Let's have a look at how imputation works using Scikit-learn's impute module.
# Create a dataframe
import numpy as np
import pandas as pd

X = pd.DataFrame(
    np.array([1, 2, 3, np.NaN, np.NaN, np.NaN, -7, 0, 50, 111,
              1, -1, np.NaN, 0, np.NaN]).reshape((5, 3)))
X.columns = ['feature1', 'feature2', 'feature3']

# Impute values where nulls are found
from sklearn.impute import SimpleImputer

imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit_transform(X)
Above, we used the SimpleImputer class to replace null values with the mean. Scikit-learn stands out by providing functions/modules for almost every step of the workflow – few other tools offer anything as convenient as SimpleImputer.
When it comes to feature scaling or normalizing distributions, Scikit-learn has functions available in the preprocessing module: StandardScaler, MinMaxScaler, etc. It has modules for feature engineering as well. Note that Scikit-learn only deals with numeric data, so you will need to convert categorical variables to numeric ones before exploring the data.
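As a rough sketch of both ideas, with a tiny made-up dataframe:

# Sketch: scaling a numeric column and encoding a categorical one
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({'age': [25, 32, 47], 'city': ['Delhi', 'Pune', 'Delhi']})

age_scaled = StandardScaler().fit_transform(df[['age']])                # zero mean, unit variance
city_encoded = OneHotEncoder(sparse=False).fit_transform(df[['city']])  # one column per category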
While Scikit-learn leads in data exploration, it has minimal support for data visualization. Its visual modules are only for visualizing metrics such as confusion matrices, ROC curves, or precision-recall curves. In the next example, we'll see how to use one of these visualization functions.
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import plot_confusion_matrix

iris = datasets.load_iris()
X = iris.data
y = iris.target
class_names = iris.target_names

# Create training and test data sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Deliberately over-regularise the model with a low C to create more errors
lr = LogisticRegression(C=1, random_state=0)
lr.fit(X_train, y_train)

# Predict on the test set and plot the confusion matrix
plot_confusion_matrix(lr, X_test, y_test, display_labels=class_names,
                      cmap=plt.cm.Blues)
plt.show()
Even though Scikit-learn has some visualization modules, it still doesn't offer any visualization for regression problems. But, without a doubt, it's a highly effective, easily adaptable data mining tool.
The previous two tools don't have any interactive visualization; like most tools built purely in Python, they have limited flexibility in terms of visuals.
Plotly develops online data analytics and visualization tools. It offers graphing and analytics libraries for platforms and frameworks like Python, R, and MATLAB. Its data visualization library plotly.js is an open-source JS library for creating graphs, and plotly.py is built on top of it so Python can use its utilities.
It supports 40+ unique chart types covering statistical, financial, geographic, scientific, and 3D use cases. Built on D3.js, HTML, and CSS, it integrates many interactive features, such as zooming in and out or mouse hover.
Let’s check out how we can introduce interactivity in the plots using plotly.
# Install plotly first (from the command line):
# pip install plotly==4.14.3

# Load the iris dataset
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data)
iris_df.columns = ['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width']

# Data distribution check - histogram
import plotly.graph_objs as go

data = [go.Histogram(x=iris.data[:, 0])]
layout = go.Layout(
    title='Iris Dataset - Sepal.Length',
    xaxis=dict(title='Sepal.Length'),
    yaxis=dict(title='Count')
)
fig = go.Figure(data=data, layout=layout)
fig
You can see above that Plotly's plot lets you save the image, zoom in and out, autoscale, and more. On mouse hover, you can also see the x and y axis values.
Let’s draw some more plots using plotly to understand how it can help end users.
To understand the relationship between two variables we need a scatter plot, but it can be difficult to read when there are many data points. The mouse-hover function helps you read the data without too much effort.
data = [go.Scatter(x=iris_df["Sepal.Length"], y=iris_df["Sepal.Width"],
                   mode='markers')]
layout = go.Layout(title='Iris Dataset - Sepal.Length vs Sepal.Width',
                   xaxis=dict(title='Sepal.Length'),
                   yaxis=dict(title='Sepal.Width'))
fig = go.Figure(data=data, layout=layout)
fig
If you want your charts to be interactive, attractive, and readable, plotly is the answer.
Matplotlib is the base for many tools, and Seaborn is one of them. With Seaborn, you can create attractive charts with minimal effort: it provides high-level functions for common statistical plots that make them informative and attractive.
It integrates closely with pandas and accepts input in pandas data structures. Seaborn hasn't reimplemented any plots; it has tweaked Matplotlib's functions so that we can create plots with a minimum of parameters.
Seaborn has collected common plots from Matplotlib and categorized them: relational (relplot), distributional (displot), and categorical (catplot).
What was the need to categorize plots if we could just use them directly? Here's the twist! Seaborn still lets you use the underlying plots directly, which is called axes-level plotting. These plots, like histplot() and lineplot(), are self-contained and a direct replacement for their Matplotlib counterparts, though with some conveniences, such as adding axis labels and legends automatically. When you want to combine plots, or play around more to build customized figures, you'll use the plot categories: figure-level plotting.
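To make the distinction concrete, here's a small sketch using Seaborn's built-in tips dataset; the same histogram is drawn once at the axes level and once at the figure level:

# Sketch: axes-level vs. figure-level plotting
import seaborn as sns

tips = sns.load_dataset("tips")

sns.histplot(data=tips, x="total_bill")            # axes-level: draws on a single axes
sns.displot(data=tips, x="total_bill", col="sex")  # figure-level: owns the figure, supports faceting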
Let's try some of the plots to see how easy Seaborn is to use.
# Load the data set
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

breast_cancer_df = pd.read_csv("data.csv")

# Create heatmap
plt.figure(figsize=(10, 10), dpi=100)
sns.heatmap(breast_cancer_df.corr())
Just two lines to create a heatmap! Now let's try some plots which we've already created above with other tools.
# Count plot
plt.figure(figsize=(8, 5))
ax = sns.countplot(x="diagnosis", data=breast_cancer_df)
plt.show()
We just created a count plot without counting anything ourselves – quite unlike Matplotlib, where we computed the counts first.
The library is not limited to the above-mentioned plots. It also has functions like jointplot() and regplot() that help create customized, statistical plots with minimal coding.
Pandas is one of the most popular Python libraries for data analysis and manipulation. It started off as a tool for quantitative analysis of financial data, which is why it's very popular in time series use cases.
Most data scientists and analysts work with table-format data like .csv or .xlsx files. Pandas provides SQL-like commands that make it easy to load, process, and analyze such data. It supports two types of data structures: Series and DataFrame, both of which can hold different data types. A Series is a one-dimensional indexed array; a DataFrame is a two-dimensional, table-format data structure, popular when dealing with real-life data.
Let's see how a Series and a DataFrame can be defined, and explore some of their features.
# Creating a series from a dataframe column
ser1 = pd.Series(breast_cancer_df['area_mean'])
ser1.head()
You can perform almost all the operations, and use all the functions we'll discuss further, with a pandas Series as well. You can also provide your own index for a Series.
data = pd.Series([5, 2, 3, 7], index=['a', 'b', 'c', 'd'])
data
You can also pass dictionary data (a key-value object), and it will be converted into a Series too.
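For example, a one-line sketch (the population figures are purely illustrative):

# Sketch: a dictionary's keys become the series index
city_population = {'Delhi': 31.0, 'Mumbai': 20.4, 'Bangalore': 12.7}
pd.Series(city_population)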
# Describe the dataframe - take a peek inside the data
breast_cancer_df.describe()
With one line of code, we were able to have a look at the data. That’s the power of pandas.
Say we want to create a subset of the main dataframe – that can also be done with a couple of lines of code.
subset_df = breast_cancer_df[["id", "diagnosis"]]
subset_df
# Select data by column and by position
print("print data for one column id: ", breast_cancer_df["id"])
print("print all the data for one row: ", breast_cancer_df.iloc[3])
Let's see how pandas handles missing data. First, let's check which columns have missing values.
data = {'Col1': [1, 2, 3, 4, 5, np.nan, 6, 7, np.nan, np.nan, 8, 9, 10, np.nan],
        'Col2': ['a', 'b', np.nan, np.nan, 'c', 'd', 'e', np.nan, np.nan,
                 'f', 'g', np.nan, 'h', 'i']}
df = pd.DataFrame(data, columns=['Col1', 'Col2'])
df.info()
The non-null count column shows how many non-null values each column has. You can drop the rows with null values or impute some values.
We can also perform statistical calculations with pandas, like computing the mean, median, standard deviation, etc. String values can be handled as well – there are many string functions available, such as converting to lower/upper case, taking substrings, replacing strings, and using regular expressions for pattern matching.
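A quick sketch of a few of these operations on the breast cancer data we loaded earlier:

# Sketch: statistical and string operations
breast_cancer_df['area_mean'].mean()                         # mean
breast_cancer_df['area_mean'].median()                       # median
breast_cancer_df['diagnosis'].str.lower()                    # lower case
breast_cancer_df['diagnosis'].str.replace('M', 'Malignant')  # replace string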
Pandas provides functions for viewing data (head or tail), creating subsets, searching and sorting, finding correlations between variables, handling missing data, reshaping – joining, merging – and more.
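For instance, a sketch of sorting and merging, reusing the dataframes defined above:

# Sketch: sorting and merging
sorted_df = breast_cancer_df.sort_values(by='radius_mean', ascending=False)
merged_df = pd.merge(subset_df, breast_cancer_df[['id', 'radius_mean']], on='id')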
Not just this, but pandas has visualization tools too. It only offers basic plots, but they're easy to use: unlike with Matplotlib or other tools, you call a plot function directly on the dataframe, at most adding plt.show() to display the result.
breast_cancer_df[['area_mean','radius_mean','perimeter_mean']].plot.box()
The above plot identifies the outliers with a single command. Pandas also lets you alter the plots – their colors, labels, and more.
corr = breast_cancer_df[['area_mean', 'radius_mean', 'perimeter_mean']].corr()
corr.style.background_gradient(cmap='coolwarm').set_precision(2)
The two charts above were easy to create, but imagine we want a bar chart of the breast cancer data showing the count of each type of diagnosis. We'd first need to find the counts, and only then could we plot the bars. Pandas doesn't provide customized plots; to use a plot of your choice, you first have to manipulate the data, and then feed the appropriate data into the plot function.
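A sketch of that two-step workflow might look like this:

# Sketch: manipulate first, then plot
diagnosis_counts = breast_cancer_df['diagnosis'].value_counts()
diagnosis_counts.plot.bar()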
D3.js is a JavaScript library for creating dynamic and interactive visualizations in web browsers. It uses HTML, CSS, and SVG to create visual representations of data. D3 stands for Data-Driven Documents, and it was created by Mike Bostock. It's one of the best data visualization tools for online analytics, as it manipulates the DOM by combining visual components with a data-driven approach.
We can use the Django or Flask web frameworks to create a website, taking advantage of Python's simplicity and D3's amazing plot collection: Python works as the backend, while D3 integrates with HTML, CSS, and SVG on the frontend. If your requirement is a dashboard, you can simply take the data you want to analyze and use D3.js to display it.
Explaining an example of a website, webpage or dashboard with D3 code here would be a bit difficult, but let’s look at what D3 has to offer.
For one thing, a chord diagram can represent relationships or network flow in an aesthetically pleasing circular layout.
Another example is a diverging stacked bar chart, which stacks negative categories to the left and positive categories to the right.
There's also a chart for visualizing hierarchies, where the sizes adjust as you change the depth. You can find the source code here.
D3 has such a large collection of plots that you'll rarely have to code one from scratch – you can pick any plot and make the changes you want. There's no question that you'll write lots of code, but more code also means more flexibility to change things.
Bokeh is a Python data visualization library that lets users generate interactive charts and plots. It's similar to Plotly in that both libraries let you create JavaScript-powered charts and plots without writing any JS code. Like Plotly and D3.js, Bokeh offers active interaction support, such as zooming, panning, selecting, and saving the plot.
Bokeh comes with two different interfaces/layers, which developers can combine based on their needs and how much time they want to spend coding. Let's look at the difference between these interfaces and their usage through some examples.
The first, bokeh.models, provides a low-level interface for developers. Charts are configured by setting values for various properties, so developers can manipulate properties exactly as they require.
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.models import HoverTool

output_notebook()

# Mouse-hover tool
hover = HoverTool(
    tooltips=[
        ("(x,y)", "($x, $y)"),
    ]
)

# Step 1 - create a plot using figure
p = figure(plot_width=400, plot_height=400, tools=[hover])

# Step 2 - add a triangle renderer with size and color
p.triangle([5, 3, 3, 1, 10], [6, 7, 2, 4, 5],
           size=[10, 15, 20, 25, 30], color="blue")

# Show the plot
show(p)
The second interface, bokeh.plotting, gives you the freedom to create plots by combining visual elements (circles, triangles, lines, etc.) and adding interaction tools (zooming, panning, etc.). The interaction elements are added with the help of bokeh.models.
from bokeh.io import output_notebook, show
from bokeh.plotting import figure  # import figure to create the plot object

output_notebook()  # output mode

# Step 1 - create a plot using figure
p = figure(plot_width=400, plot_height=400)

# Step 2 - add a triangle renderer with size and color
p.triangle([5, 3, 3, 1, 10], [6, 7, 2, 4, 5],
           size=[10, 15, 20, 25, 30], color="blue")

# Show the plot
show(p)
There used to be one more interface, bokeh.charts, with pre-built visuals like line charts, bar charts, area plots, and heatmaps, but it has been deprecated.
In many ways, Bokeh can be a good choice for data visualization, as it gives you Matplotlib-like simplicity with the option to make your charts more interactive.
Altair is a declarative data visualization library. It's built on Vega-Lite, which lets you create visualizations for data analysis by defining properties in JSON format. You won't be writing any JSON declarations yourself, though – you write Python, and Altair converts your inputs into dictionary format for Vega-Lite.
It's basically a Python interface for Vega-Lite. Altair also supports data transformation within the chart definition.
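As a small, self-contained sketch of such a transformation (the dataframe is made up for illustration):

# Sketch: filtering rows inside the chart definition itself
import altair as alt
import pandas as pd

df = pd.DataFrame({'a': ['Col1', 'Col2', 'Col3'], 'b': [28, 55, 43]})

alt.Chart(df).mark_bar().encode(
    x='a',
    y='b'
).transform_filter(
    alt.datum.b > 30  # keep only rows where b is above 30
)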
Altair provides inbuilt charts: bar chart, line chart, area chart, histogram, scatter plot, and more. Let's draw some plots to see how Altair can help us explore data through visuals.
import altair as alt
import pandas as pd

# Create a dataframe (or load data from a dataset)
source = pd.DataFrame({
    'a': ['Col1', 'Col2', 'Col3', 'Col4', 'Col5', 'Col6'],
    'b': [28, 55, 43, 50, 30, 99]
})

# Define the altair chart
alt.Chart(source).mark_bar().encode(
    x='a',
    y='b'
)
You can see Altair gives you options to save the image, view the source (data), and open the chart in the Vega editor.
When you do, your Python code is translated into JSON format so you can play around with it in Vega. Altair has more to offer than just simple charts: it lets you combine two charts and create dependencies between them, as in the sketch below.
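Here's a rough sketch of that idea, reusing the source dataframe from the bar chart example (the add_selection call reflects the Altair 4 API):

# Sketch: concatenating two charts and linking them with an interval selection
brush = alt.selection_interval()

points = alt.Chart(source).mark_point().encode(
    x='a', y='b'
).add_selection(brush)

bars = alt.Chart(source).mark_bar().encode(
    x='a', y='b'
).transform_filter(brush)

points | bars  # horizontal concatenation; brushing the points filters the bars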
YellowBrick is a machine learning visualization library with two primary dependencies: Scikit-learn and Matplotlib. It's highly focused on feature engineering and evaluating ML model performance. Its visualization capabilities cover feature analysis, classification, regression, and clustering models, model selection, target analysis, and text data.
This list can help you identify which plot/utility should be used for what kind of requirement. To understand more about YellowBrick, let’s look at some examples.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from yellowbrick.features import FeatureImportances

# Load some sample data (any feature matrix X and target y will do)
X_sample, y_sample = load_iris(return_X_y=True)

clf = DecisionTreeClassifier()
viz = FeatureImportances(clf)
viz.fit(X_sample, y_sample)
viz.poof()  # display the plot (renamed show() in newer versions)
YellowBrick supports data exploration before, during, and after modelling – a data exploration tool in the truest sense.
Folium is a Python library for visualizing geospatial data, and a wrapper around Leaflet.js, an open-source JS library for interactive maps. Folium combines Python's data wrangling strengths with the mapping capabilities of Leaflet.js.
The library uses tilesets from OpenStreetMap, Mapbox, and the CloudMade API. You can customize the map by adding tile layers, plotting markers, or showing directions. With the help of plugins, Folium really helps developers create customized maps easily.
Visualizing geospatial data on maps helps us understand the data better: location data points get a visual representation that's easy to relate to the real world. Take the number of sickness cases – showing that information on a map by country, state, and city makes the information much easier to digest.
Let's draw our first map with Folium and see how easy it can be.
import folium

m = folium.Map(location=[28.7041, 77.1025], zoom_start=10)
popup = "Delhi"
marker = folium.Marker([28.7041, 77.1025], popup=popup)
m.add_child(marker)
m
By just inputting a latitude and longitude, we were able to draw a map and mark a location on it. Now let's add the ability to view the map in different styles by adding tile layers.
import folium
from branca.element import Figure

fig = Figure(width=500, height=300)
m = folium.Map(location=[28.7041, 77.1025])
fig.add_child(m)

folium.TileLayer('Stamen Terrain').add_to(m)
folium.TileLayer('Stamen Toner').add_to(m)
folium.TileLayer('Stamen Water Color').add_to(m)
folium.LayerControl().add_to(m)
m
Folium lets developers avoid the hassle of wiring up Google Maps, putting markers on it, and showing directions. With Folium, you just import a few libraries, draw a map, and focus on inputting and understanding the data.
Tableau is one of the best data visualization tools. It makes organizing, managing, visualizing, and understanding data extremely easy. It offers simple drag-and-drop functionality, along with tools that help discover patterns and find insights in data.
With Tableau, you can create a dashboard – a collection of different visuals in one place. A dashboard is like a storyboard: you can include multiple plots, use a variety of layouts and formats, and easily enable filters to select specific data. For example, you could create a dashboard to check the performance of a brand's marketing campaign.
Integrating with different types of data sources in Python can take lots of coding and effort, but with a business intelligence tool like Tableau it's a one-click job. Tableau offers many data connectors, including Amazon Athena, Redshift, Google Analytics, Salesforce, and more.
As a business intelligence tool, it has limited support for curating data, but it lets analysts use Python or R: through scripting, an analyst can feed clean data to Tableau and create better visuals. To connect Python with Tableau, you can check out this blog on Tableau's website.
Here's a featured example of a Tableau dashboard – doesn't it look like a newspaper clip?
There are many tools and libraries in the market, and we choose them based on our requirements, capabilities, and budget. Throughout this article, I discussed some of the best tools for data exploration and visualization. Each of these tools is best in its own way, with its own systems and structures for digging deeper into the data and making sense of it.
Data exploration is important for business, management, and data analysts. Without exploration, you will often find yourself with blind spots. So, before you make any big decision, it's a good idea to analyze what can happen, or what has happened in the past. In other words: visualize your data to make better decisions.