Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython

Get complete instructions for manipulating, processing, cleaning, and crunching datasets in Python. Updated for Python 3.6, the second edition of this hands-on guide is packed with practical case studies that show you how to solve a broad set of data analysis problems effectively. You’ll learn the latest versions of pandas, NumPy, IPython, and Jupyter in the process.

Written by Wes McKinney, the creator of the Python pandas project, this book is a practical, modern introduction to data science tools in Python. It’s ideal for analysts new to Python and for Python programmers new to data science and scientific computing. Data files and related material are available on GitHub.

- Use the IPython shell and Jupyter notebook for exploratory computing
- Learn basic and advanced features in NumPy (Numerical Python)
- Get started with data analysis tools in the pandas library
- Use flexible tools to load, clean, transform, merge, and reshape data
- Create informative visualizations with matplotlib
- Apply the pandas groupby facility to slice, dice, and summarize datasets
- Analyze and manipulate regular and irregular time series data
- Learn how to solve real-world data analysis problems with thorough, detailed examples

724 pages, Kindle Edition

First published December 30, 2011

About the author

Wes McKinney

9 books · 35 followers

Ratings & Reviews

Community Reviews

5 stars: 904 (40%)
4 stars: 923 (40%)
3 stars: 349 (15%)
2 stars: 58 (2%)
1 star: 18 (<1%)
Displaying 1 - 30 of 172 reviews
Profile Image for Ben.
40 reviews · 9 followers
October 1, 2012
A better title for this book might be Pandas and NumPy in Action

As the creator of the pandas project, a Python data analysis framework, Wes McKinney is well placed to write this book. His experience with and vision for the pandas framework are clear, and he explains the main functions and inner workings of both pandas and another package, NumPy, very well.

Although the title of the book suggests a broad look at the Python language for data analysis, McKinney almost exclusively focuses on an in-depth exploration of pandas. The book started with a great deal of promise, but as McKinney delved into the detail of NumPy and pandas, the ideas and examples of data analysis were replaced with random-number datasets.

The book became a tiresome parade of pandas feature after pandas feature. Each example was stripped of meaning without any real world basis. It would have been great to see more real world cases drawn from McKinney's experience as a day to day user of pandas and Python for data analysis.

This book would be ideal if you're using, or thinking about using NumPy or pandas. If you're looking for a broader introduction to Data Analysis with Python, this might not be the book for you.
Profile Image for Sebastian Gebski.
1,040 reviews · 1,013 followers
April 1, 2020
It's a hell of a book & it took me a lot of time to get through, but it was worth it.

Two key points:
1. it's not time-consuming because it's hard to comprehend or something - quite the opposite - but because it's very practical: examples, examples & examples, so it barely makes any sense to read it while not being in front of the keyboard (to check the stuff out)
2. people understand terms like "data analysis", "artificial intelligence", "machine learning" & "data science" very differently - this book is about (rather) straightforward operations on data - reading, sanitizing, filtering, grouping, pivoting, etc. - no advanced statistics, just the "mundane" (but totally necessary) stuff - I'd call it a "super flexible equivalent of SQL, but in Python and on any data sets"
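
To make the SQL comparison in point 2 concrete, here is a minimal sketch of those operations in pandas. It is an illustration only, not code from the book: the `orders` table, its column names, and its values are all made up.

```python
import pandas as pd

# Hypothetical order data; names and values are invented for illustration.
orders = pd.DataFrame({
    "region": ["north", "south", "north", "south", "north"],
    "product": ["a", "a", "b", "b", "a"],
    "amount": [10.0, None, 5.0, 7.0, 3.0],
})

# Sanitize: drop rows with a missing amount (roughly WHERE amount IS NOT NULL).
clean = orders.dropna(subset=["amount"])

# Filter: keep orders of at least 5 (roughly WHERE amount >= 5).
large = clean[clean["amount"] >= 5]

# Group: total amount per region (roughly SUM(amount) ... GROUP BY region).
totals = clean.groupby("region")["amount"].sum()

# Pivot: regions as rows, products as columns, summed amounts as cells.
pivot = clean.pivot_table(
    index="region", columns="product", values="amount", aggfunc="sum"
)

print(totals)
print(pivot)
```

The filter and group steps map loosely onto SQL clauses, while the pivot is the kind of reshaping that tends to be more awkward in plain SQL.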

The book is based mainly on NumPy & pandas. Several other libraries are mentioned (with some examples), but the only ones you can learn for real are the 2 I've listed above.

If you want to learn more about working with data using NumPy & pandas, look no further - this book is for you. 5/5 stars.
Profile Image for Louis.
225 reviews · 28 followers
November 10, 2012
For some time now I have been using R and Python for data analysis. I long ago discovered the Python technical stack of IPython, NumPy, SciPy, and Matplotlib, and I thought I knew what I was doing. I even dipped my toe into pandas as my data structure for analysis. But Python for Data Analysis showed me entire worlds of improvement in my workflow and my ability to work with data in the messy form that is found in the real world.

Python, like most interpreted languages, is slow compared to compiled languages. But with the technical stack that started with the NumPy libraries and has grown to include SciPy, Matplotlib (graphing), IPython (shell), and pandas, you get high-quality, fast algorithms and data structures from Fortran and C libraries underneath Python. But while these libraries are designed to be used together, documentation tends to be only about one at a time, and very little puts it all together as an integrated whole. McKinney's Python for Data Analysis fills that gap.

Even though I have been using IPython, NumPy, SciPy, and Matplotlib for years, and pandas for about half a year, going through this book made me feel like a rank novice. I learned how to use the shell efficiently as a development tool, to the point that I have stopped automatically using the IPython notebook or PyDev (Eclipse) when starting new projects and use the shell instead, because its introspection and debugging capabilities make it much easier to work. I had started using pandas for a data structure because I liked the similarities with R data frames; this book showed me where pandas goes well beyond that. With Matplotlib I could make specific plots; this book showed me how to use the pandas interface to make them a natural part of the workflow (even if it is not yet at the level of a grammar such as ggplot).

Python for Data Analysis does not just teach how to use the Python scientific stack, it also teaches a workflow for technical computing. And this is beyond what you can get from reading off the web; it probably really requires the opportunity to work alongside someone who knows what they are doing to see the practices that make them productive. As such, I would recommend it for anyone who does scientific and technical computing, whether in the sciences, engineering, finance, or other areas where quantitative computing using Python is done.

Disclaimer: I received a free electronic copy of this book from the O'Reilly Blogger Program.
May 13, 2017
Just a more verbose documentation. After a promising introduction showing several real-world uses of data manipulation, the book is nothing more than documentation of pandas and libraries like NumPy and matplotlib. Moreover, many of the functions described there are already deprecated, so just be aware of that. Perhaps the best way of "reading" this book is just scanning it quickly for a general overview of pandas functionality, so it can be used as a point of reference when needed.
Profile Image for Moeen Sahraei.
29 reviews · 46 followers
December 5, 2020
It couldn’t have been better. A comprehensive book with a lot of detail on data wrangling. It is taught step by step, so there is no confusion in figuring out the code. The author explains complex Python subjects very intuitively, so anyone can read this book and learn data wrangling with some practice on the data sets the book provides.
Profile Image for Thiago D.
5 reviews · 5 followers
December 1, 2021
i still can’t believe that i:
- bought a physical programming book
- actually read it cover to cover
- liked it so much i rated 5 stars

2021 is indeed an odd year
897 reviews · 19 followers
March 3, 2013
This book is a reasonably comprehensive tutorial to pandas - the Python library for data wrangling. As a tutorial, it works well.

But it wasn't quite what I was expecting. I was expecting less tutorial and more case studies - taking meaningful datasets (instead of makey-upy ones) and using pandas and other tools to pose and answer questions. For me, this would have made the book a much more practical resource.
Profile Image for Terran M.
78 reviews · 100 followers
October 20, 2018
This book is a well-written, verbose introduction to Pandas by the main author of that library. Don't expect to learn much besides Pandas - matplotlib gets a brief mention, and there is a short NumPy section, but broadcasting is relegated to an appendix.

This book is a peer of Python Data Science Handbook by Jake VanderPlas, and they are more alike than different. They both start with long sections on manipulating data in NumPy and Pandas, on mostly made-up examples of random numbers. This book is the more verbose of the two; it does have more complete coverage of Pandas functionality (albeit less coverage of NumPy), and it also takes longer to read. It's only 4 stars because it's not very engaging: I prefer a book like this to introduce some real data early and to motivate the learning of techniques by showing how they help answer questions in the data, like R for Data Science does.

I find that matplotlib is unusably low-level for modern data science, and you should skip that section in any of the books and learn either Altair or plotnine (a clone of ggplot) for your plotting work in Python.
Profile Image for runzhi xiao.
3 reviews
March 7, 2019
I started my career in data science with this book. The book is very easy to understand, and practical. Having a little bit of Python knowledge would help you read this book too. The book covers a lot of the tasks that a data analyst does in their daily job. It is mostly about pandas, NumPy, and matplotlib, so if you are already familiar with these tools, you can skip this one.
Profile Image for Raimundas.
10 reviews
July 28, 2021
Enormously useful, interesting. It took me quite some time to read this book, but it was worth it!
Profile Image for Martijn.
82 reviews · 8 followers
Read
August 14, 2020
A good, thorough introduction to using Python (and in particular the NumPy and pandas libraries) for data analysis. As with pretty much all books of this kind, after a while the mixture of text and examples makes it hard to follow, but then, maybe it's not supposed to be read while sitting down in an armchair.
Profile Image for Jake Losh.
206 reviews · 27 followers
September 17, 2020
A very good book. I'd been using many of these tools for a while, sometimes using snippets cobbled together from dozens of disparate Stack Overflow threads, so it was really nice to have the material presented in a clear, thoughtful and organized way.
Profile Image for 박은정 Park.
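Author 3 books · 42 followers
June 26, 2015
It feels like several official docs merged together, and it's a pity that, apart from chapter 12 or so, I couldn't find any parts that were better than those docs. It would have been nice if it had shown even just a few fun applications!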
6 reviews
July 7, 2022
As what I would call a junior Python developer, I really felt like Python for Data Analysis helped expand my knowledge of the Python language and its uses in analyzing large data sets.

With that being said, I would encourage anyone thinking about a career as a data analyst/data scientist to use this book as a complementary tool in conjunction with an actual course that allows you to put these skills into practice. I personally found a tremendous amount of overlap between the beginning of Dataquest’s online Data Science in Python course and Wes McKinney’s book. It was a really helpful experience to read about a topic in the book and then learn how to implement it with the course.

While I think Wes does a great job in teaching the basics of Pandas, NumPy, and various other libraries, computer programming is a field where you have to get hands-on practice. It would be very difficult to begin a career as a data scientist by simply reading this book and expecting to absorb every bit of information that Wes has to offer.
Profile Image for Britt O'Duffy.
335 reviews · 36 followers
March 3, 2023
If I were taking the mastery of animal languages (Python, pandas) more seriously and not simply hoping to pass a computational methods + data wrangling class, I'd probably have read this book cover-to-cover. Instead, it gets a dnf for a casual, albeit helpful, reference text in my journey to learn Parseltongue.

print("hisssss")
Profile Image for Illia.
196 reviews · 2 followers
July 2, 2017
I knew that Pandas was some kind of hellish statistical combine harvester, but it turned out that I had no idea of its real size and capabilities.
Profile Image for Rob.
Author 2 books · 410 followers
August 24, 2012
I did copy editing on this book, so my review is of an unfinished (but close to finished) version. That being said: McKinney is the principal author of pandas, a Python package for doing data transformation and statistical analysis. The book is largely about pandas (and NumPy), but also delves into general methodologies for munging data and performing analytical operations on them (e.g., normalizing messy data and turning it into graphs and tables); he also delves into some (semi) esoteric information about how Python works at very low levels, and discusses ways to optimize data structures so that you can get maximum performance from your programs. This book won't be useful for someone looking for a book that discusses data analysis in a broad sense, nor would it be useful for someone looking for a generalist's book on Python -- however, if you've already selected Python as your analytical tool (and it sounds like it's more or less the de facto analytical tool in many circles) then this could be just the book for you.
Profile Image for Joshua Hruzik.
17 reviews · 5 followers
February 9, 2017
Great for Transition
As an R user I always hear people say that one should also learn Python as a secondary language. So I gave this book a shot and it did not disappoint.
Wes McKinney is the creator of Pandas, a framework for working with structured data in Python. Accordingly, a great deal of the book deals with Pandas and solving classical data science tasks (cleaning and munging your data). His style of writing is very clear and one can easily grasp the concepts by applying them in one's own Python console. The book's content is clear cut and all important aspects are included.
One thing that left me a bit unsatisfied was the wide use of randomly generated data sets. Using data that has a logical structure and real-world meaning would help in understanding the concepts.

Overall, it's a fantastic book and very well suited if you are familiar with typical data science methods.
Profile Image for Stephen.
1,111 reviews · 14 followers
November 26, 2020
The O'Reilly (animal) book that is the essential reference to pandas and NumPy, as used in IPython and Jupyter notebooks. This book is a complete overview of the APIs and packages, with hints and tips and some data sources for use with these first-class data analysis tools.

Do you need this book? Maybe not. There is so much reference information on the web, I tend to just google it. Also it is not amazingly readable. The Open University course "Learn to code for data analysis" is a better introduction than this book. However, if you have some understanding of IPython or Jupyter and the pandas library, and if you have time to sit down and read it, this book is an excellent and comprehensive source.
Profile Image for Steve.
20 reviews · 37 followers
Read
April 13, 2013
Good introduction to Python Pandas and other libraries for data analysis. However, the book goes directly from the introduction into pretty complicated examples. As a reader new to R, Pandas, and statistical languages, it was hard work to learn the data structures and semantics. After working through several web-based tutorials, I had a better intuitive sense for how to solve problems with the framework presented by the author.

As documentation for Pandas alone, this book is useful.
Profile Image for Nancy.
72 reviews · 20 followers
February 13, 2014
This book was the perfect set of training wheels for me, especially since my main goal was to operate on economic and financial data. By chapter 4 (practically the beginning of this book), I was able to sample random stocks and run correlations between stocks and commodities. I think that the TimeSeries chapter should be read just before or after chapter 4, to avoid some time groping in the dark with this datatype. Chapter 11 is also very useful, with a focus on data munging for financial data.
Profile Image for Adil Khashtamov.
23 reviews · 2 followers
February 23, 2017
Frankly speaking, this book is not about data analysis; it is more about pandas as an instrument for doing data analysis. The book also covers IPython, NumPy, and matplotlib superficially.
The reason why I gave 2 stars is that it is a little bit out of date and has almost no practical examples as it walks through pandas' functionality.
I would strongly recommend diving into the official documentation instead (10 minutes intro and tutorials) if you want to master pandas.
Profile Image for Jascha.
151 reviews
September 10, 2016
Not bad, but it doesn't provide anything more than the official documentation. Being the only book about pandas, you can't compare it. And you don't have alternatives. I honestly expected more. The only way to learn pandas is spending time with IPython and searching on Stack Overflow, creating your own code snippets.
Profile Image for Fang.
19 reviews · 7 followers
November 30, 2017
Covers a good amount of knowledge, interesting examples too.
But GOSH... the fact that python is a higher level language doesn't grant a free use of generic words and loose conceptual demarcations. This is a textbook-like read after all. Can we put in some more effort to be scientific here?

Also: get the up-to-date version.
Profile Image for Ferhat Culfaz.
243 reviews · 12 followers
September 27, 2018
By the author and creator of Pandas. A must-read book for anyone working in science, engineering, statistics, data science and machine learning. Covers all the feature engineering where people spend most of their time. Also an excellent book for future reference to look stuff up. Clearly written, with lots of practical examples and included Jupyter Notebooks.
Profile Image for ktsn.
71 reviews · 1 follower
November 21, 2020
Compared with data analysis books written by people like Hadley Wickham, the author of this one is obviously more in the "software development" clan, i.e. caring more about "how" to use something than about "why" to use it. So this book is more like an expanded version of a man page.
21 reviews
August 6, 2023
"Datenanalyse mit Python: Auswertung von Daten mit pandas, NumPy und Jupyter" by Wes McKinney is a groundbreaking book for anyone working with data analysis and data processing in Python. The author, Wes McKinney, is the creator of the popular pandas library, and his extensive knowledge and experience are reflected in this work. In general, this book is also suitable for newcomers, although I already had basic knowledge of Python.
A notable aspect of the book is its practice-oriented projects, which can easily be accessed with Jupyter. Jupyter is an interactive development environment that lets the reader combine data and code in a single document. This combination of theory and practice creates an incredibly powerful learning environment, and notes in Markdown can also be added to the code.
The book begins with a solid introduction to the basics of Python, NumPy, and pandas. McKinney makes sure the reader has the knowledge required to understand and apply the concepts presented. The explanations are clear and well structured, which makes them easy to follow even for beginners.
The main part of the book focuses on pandas, a powerful library for data manipulation and analysis in Python. The various functions and methods of pandas are covered in detail, and McKinney addresses the efficient handling of large amounts of data. He shows how to read in, filter, group, transform, and visualize data. The hands-on projects, which can be accessed interactively with Jupyter, give the reader the opportunity to put what they have learned into practice immediately and to develop a deeper understanding of how pandas is applied.
Another plus of this book is its comprehensive coverage of data wrangling and data cleaning. These important steps in data analysis are often overlooked, but they are crucial for obtaining reliable and meaningful results. McKinney explains how to clean data, handle missing values, and prepare data for further analysis.
The practical orientation and application examples in this book make it an excellent learning tool for data scientists, analysts, and anyone who wants to process data in Python. The projects provided with Jupyter let the reader experiment with the data interactively and see the effects of the analyses immediately.
In summary, "Datenanalyse mit Python: Auswertung von Daten mit pandas, NumPy und Jupyter" by Wes McKinney is an invaluable tool for anyone who wants to work with data analysis in Python. The practice-oriented projects and the use of Jupyter make learning interactive and engaging. McKinney's clear explanations and his ability to convey complex concepts in an understandable way make this book a must for anyone who wants to take their data analysis skills to a higher level.

PS: This review refers to the newer edition of this book from 2023.
Profile Image for Xanan.
59 reviews · 6 followers
August 14, 2019
The book describes pandas: a Python library that supports data analysis and is also used in some Python machine learning libraries.
The book also briefly mentions other libraries, including NumPy and matplotlib.
Everything you read in this book is certainly available in the online documentation of the libraries discussed.
However, the author does an excellent job of providing an accessible introduction to these libraries in a single place, using uniform terminology and paying attention to explaining concepts incrementally (something that is often lacking in the online documentation).
Simple and complete examples illustrate everything discussed.
If you're looking for a handy reference to pandas this is a nice book to have at hand.


Two initial chapters recap the Python language and the most important data structures.

A chapter is then devoted to numpy: a library to represent multidimensional arrays.
Array operations and indexing are adequately covered but linear algebra is given very little space.

One of the chapters describes matplotlib, another Python library that can be used to plot graphs.
This chapter covers the very basics of the library, omitting many details.

The rest (main part) of the book is dedicated to the pandas library, which the author created.
The Series and DataFrame objects are described in detail, as well as indexes, data selection, sorting, filtering, filling missing data, and various data manipulation functions.
Separate chapters are devoted to merging data sets and performing data aggregation.
Time series are also given their own chapter with details on resampling operations.
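
As a rough illustration of the topics listed above (filling missing data, merging, and resampling), here is a small sketch. It is not taken from the book; the series, frames, and column names are invented, and only widely used pandas calls are assumed.

```python
import pandas as pd

# Hypothetical daily readings with one missing value (made-up data).
idx = pd.date_range("2024-01-01", periods=6, freq="D")
readings = pd.Series([1.0, 2.0, None, 4.0, 5.0, 6.0], index=idx)

# Fill the gap by carrying the previous value forward.
filled = readings.ffill()

# Resample the daily series into 2-day bins and average each bin.
two_day = filled.resample("2D").mean()

# Merge two small frames on a shared key, keeping only matching rows.
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "d"], "y": [10, 20, 40]})
merged = left.merge(right, on="key", how="inner")

print(two_day)
print(merged)
```

Forward fill is only one of several ways to treat gaps; which one is appropriate depends on the data.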