Python is a favorite among developers because it comes with a rich and powerful ecosystem of data science tools. Unfortunately, to outsiders that ecosystem can look like a complicated jungle. This post is an organized, systematic reference guide to getting started with Python for data science.
There are a great many Python tools for anyone interested in data science. You may well ask what is wrong with the many lists of PyData packages that are already available. The problem is that too many options can overwhelm a beginner who is just getting started and does not yet know what they need. To keep things simple, I will narrow the scope and focus on the 10% of tools that accomplish 90% of the work. Once you have mastered these fundamentals, browsing the long list of PyData packages and picking your favorites becomes much easier.
The tools covered here will introduce you to data science and handle most of what a data scientist does day to day, from data munging to data analysis.
Installation
You have probably heard that Python is an amazing way to learn data science. What you hear less often is that installing Python and the other necessary modules by hand can take up to two days. You do have to install Python before you can learn it, but installing the full PyData stack manually is quite an undertaking, especially for a beginner, and I strongly recommend against it.
Fortunately, the Anaconda Python distribution, created by Continuum, installs most of the PyData stack and the other relevant modules in one go. You just saved yourself a two-day wait!
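Once the installation has finished, a quick sanity check can confirm that the packages discussed in this post are importable. The snippet below is a minimal sketch using only the standard library; the package list and the helper name `check_packages` are my own choices for illustration and are not part of Anaconda.

```python
import importlib.util

# Import names of the packages covered in this post (an illustrative list).
PACKAGES = ["numpy", "pandas", "matplotlib", "seaborn", "IPython"]


def check_packages(names):
    """Return a dict mapping each import name to True if Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Print a one-line status per package.
for name, found in check_packages(PACKAGES).items():
    print(f"{name:12s} {'ok' if found else 'MISSING'}")
```

Anything reported as MISSING can then be installed with the package manager of your distribution.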
IPython Notebook
After installing Python, most people simply launch the plain interpreter. That is reasonable, but few realize it is the wrong move. IPython and the IPython Notebook are much more powerful Python shells, and they are used everywhere in PyData. I strongly recommend that you start using the IPython Notebook (IPyNB) as soon as the installation is complete and not bother with anything else; you will not regret it. In short, the IPyNB is a Python shell that you access through your web browser. It lets you mix code, text, and graphics, including interactive ones. For instance:
In [1]:
print('Hello Python World')
Hello Python World
The IPython Notebook is superb, and it has made working on data science much easier. The following example shows one of its richer capabilities.
If you simply want to display an image you can achieve this using the IPython.display.Image class.
>>> from IPython.display import Image
>>> image = Image('tiny_tjelvar.png')
>>> image
Pandas
Several developers have recommended that I start data science with NumPy, a library that provides multi-dimensional arrays. A couple of years ago that would have been good advice. Today, however, you will hardly ever need to use NumPy directly, because it has become more of a core library that other libraries build on.
Thus, Pandas is the main library for working with data in Python. With Pandas you can read data from and write data to all kinds of formats and databases, perform joins and other SQL-like operations to shape the data, handle missing values, and work with time series. Pandas also has basic plotting capabilities and statistical functionality. Here are a few tricks that are especially useful for data munging.
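As a first sketch of the joins and missing-value handling mentioned above, the snippet below uses two small made-up tables. The column names and values are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: orders and customers sharing a customer_id key.
orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "amount": [10.0, np.nan, 5.0, 7.5],  # one amount is missing
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Ann", "Bob", "Cy"],
})

# SQL-like left join on the shared key column.
merged = orders.merge(customers, on="customer_id", how="left")

# Replace the missing amount with 0 before aggregating.
merged["amount"] = merged["amount"].fillna(0.0)

# Total amount per customer name.
totals = merged.groupby("name")["amount"].sum()
print(totals)
```

Whether a missing value should be filled with 0, dropped, or interpolated depends on the data; `fillna` is just one of the options Pandas offers. The notebook cells below continue with a small DataFrame built from scratch.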
In [18]:
import pandas as pd
df = pd.DataFrame({'A': 1.,
                   'B': pd.Timestamp('20130102'),
                   'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                   'D': pd.Series([1, 2, 1, 2], dtype='int32'),
                   'E': pd.Categorical(["test", "train", "test", "train"]),
                   'F': 'foo'})
In [19]:
df
Out[19]:
   A          B  C  D      E    F
0  1 2013-01-02  1  1   test  foo
1  1 2013-01-02  1  2  train  foo
2  1 2013-01-02  1  1   test  foo
3  1 2013-01-02  1  2  train  foo
Columns can be accessed by name:
In [17]:
df.B
Out[17]:
0 2013-01-02
1 2013-01-02
2 2013-01-02
3 2013-01-02
Name: B, dtype: datetime64[ns]
Compute the sum of D for each category in E:
In [21]:
# Select D before summing so that non-numeric columns such as B are not aggregated
df.groupby('E')['D'].sum()
Out[21]:
E
test 2
train 4
Name: D, dtype: int32
Doing this in NumPy (or *gasp* Matlab!) would be much more clunky.
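To make the comparison concrete, here is one possible NumPy-only version of the same grouped sum. It is a sketch of one approach rather than the only way to do it; the arrays simply repeat the values of columns D and E from the DataFrame above.

```python
import numpy as np

# Same values as columns D and E in the DataFrame above.
d = np.array([1, 2, 1, 2], dtype="int32")
e = np.array(["test", "train", "test", "train"])

# np.unique returns the sorted category labels and, with return_inverse=True,
# an integer code for each row that points at its label.
labels, codes = np.unique(e, return_inverse=True)

# np.bincount adds up the weights (here, d) that share the same code.
# Note that the result is a float array.
sums = np.bincount(codes, weights=d)

for label, total in zip(labels, sums):
    print(label, total)
```

The bookkeeping between labels and integer codes is exactly what `groupby` hides, which is why the Pandas one-liner reads so much more naturally.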
Seaborn
Matplotlib is the main plotting library for Python, and it is extremely powerful. I nevertheless recommend that you use Seaborn, which is built on top of Matplotlib. Seaborn creates aesthetically pleasing plots by default, creates statistically meaningful plots, and understands the Pandas DataFrame.
Statistically meaningful plots
In [5]:
# IPython magic to create plots within cells
%matplotlib inline
In [7]:
import seaborn as sns
# Load one of the data sets that come with seaborn
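tips = sns.load_dataset("tips")
# Pass the DataFrame and column names as keywords; recent seaborn releases no longer accept them positionally
sns.jointplot(data=tips, x="total_bill", y="tip", kind="reg");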
Notice that just a few lines of code produce a fairly complex plot: the best-fitting linear regression line, confidence intervals, the correlation coefficient, and the marginal distributions. Building the same plot with Matplotlib alone would be tedious.
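To get a feel for the numbers behind such a plot, the regression line and the correlation coefficient can be computed by hand with NumPy. This is a minimal sketch on synthetic stand-in data (an assumption on my part, not the seaborn "tips" data set), and it is not how Seaborn itself is implemented.

```python
import numpy as np

# Synthetic stand-in data (an assumption; not the seaborn "tips" data set).
rng = np.random.default_rng(0)
total_bill = rng.uniform(5, 50, size=200)
tip = 1.0 + 0.1 * total_bill + rng.normal(0, 0.5, size=200)

# Least-squares fit of a straight line: tip ~ slope * total_bill + intercept.
slope, intercept = np.polyfit(total_bill, tip, deg=1)

# Pearson correlation coefficient between the two variables.
r = np.corrcoef(total_bill, tip)[0, 1]

print(f"slope={slope:.3f} intercept={intercept:.3f} r={r:.3f}")
```

Drawing the scatter points, the fitted line, the confidence band, and the marginal histograms on top of this would take a good deal more Matplotlib code, which is the work the joint plot does for you.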
Works well with Pandas DataFrame
Seaborn works well with Pandas DataFrames, because data has structure and a DataFrame makes that structure explicit.
Preliminaries
import random
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
df = pd.DataFrame()
df['x'] = random.sample(range(1, 100), 25)
df['y'] = random.sample(range(1, 100), 25)
df.head()
    x   y
0  14  52
1  88  92
2  39  69
3  19  98
4  60  76
Histogram
plt.hist(df.x, alpha=.3)
sns.rugplot(df.x);
For everyday data analysis, Seaborn is often the more convenient choice, even though it builds on Matplotlib.
I hope this post gives you a clear idea of how much Python can help with data science.