Distributed Computing
with Python
Harness the power of multiple computers using Python
through this fast-paced informative guide
Francesco Pierfederici
BIRMINGHAM - MUMBAI
Distributed Computing with Python
Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, without the prior written
permission of the publisher, except in the case of brief quotations embedded in
critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy
of the information presented. However, the information contained in this book is
sold without warranty, either express or implied. Neither the author, nor Packt
Publishing, and its dealers and distributors will be held liable for any damages
caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
First published: April 2016
Production reference: 1060416
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78588-969-1
www.packtpub.com
Credits
Author: Francesco Pierfederici
Reviewer: James King
Commissioning Editor: Veena Pagare
Acquisition Editor: Aaron Lazar
Content Development Editor: Parshva Sheth
Technical Editor: Abhishek R. Kotian
Copy Editor: Neha Vyas
Project Coordinator: Nikhil Nair
Proofreader: Safis Editing
Indexer: Rekha Nair
Graphics: Disha Haria
Production Coordinator: Melwyn Dsa
Cover Work: Melwyn Dsa
About the Author
Francesco Pierfederici is a software engineer who loves Python. He has been
working in the fields of astronomy, biology, and numerical weather forecasting for
the last 20 years.
He has built large distributed systems that make use of tens of thousands of cores
at a time and run on some of the fastest supercomputers in the world. He has also
written many applications that are of dubious usefulness but great fun. Mostly,
he just likes to build things.
I would like to thank my wife, Alicia, for her unreasonable patience
during the gestation of this book. I would also like to thank Parshva
Sheth and Aaron Lazar at Packt Publishing and the technical reviewer,
James King, who were all instrumental in making this a better book.
About the Reviewer
James King is a software developer with a broad range of experience in distributed
systems. He is a contributor to many open source projects including OpenStack and
Mozilla Firefox. He enjoys mathematics, horsing around with his kids, games, and art.
Table of Contents
Preface
Chapter 1: An Introduction to Parallel and Distributed Computing
  Parallel computing
  Distributed computing
  Shared memory versus distributed memory
  Amdahl's law
  The mixed paradigm
  Summary
Chapter 2: Asynchronous Programming
  Coroutines
  An asynchronous example
  Summary
Chapter 3: Parallelism in Python
  Multiple threads
  Multiple processes
  Multiprocess queues
  Closing thoughts
  Summary
Chapter 4: Distributed Applications – with Celery
  Establishing a multimachine environment
  Installing Celery
  Testing the installation
  A tour of Celery
  More complex Celery applications
  Celery in production
  Celery alternatives – Python-RQ
  Celery alternatives – Pyro
  Summary
Chapter 5: Python in the Cloud
  Cloud computing and AWS
  Creating an AWS account
  Creating an EC2 instance
  Storing data in Amazon S3
  Amazon elastic beanstalk
  Creating a private cloud
  Summary
Chapter 6: Python on an HPC Cluster
  Your typical HPC cluster
  Job schedulers
  Running a Python job using HTCondor
  Running a Python job using PBS
  Debugging
  Summary
Chapter 7: Testing and Debugging Distributed Applications
  The big picture
  Common problems – clocks and time
  Common problems – software environments
  Common problems – permissions and environments
  Common problems – the availability of hardware resources
  Challenges – the development environment
  A useful strategy – logging everything
  A useful strategy – simulating components
  Summary
Chapter 8: The Road Ahead
  The first two chapters
  The tools
  The cloud and the HPC world
  Debugging and monitoring
  Where to go next
Index
Preface
Parallel and distributed computing is a fascinating subject that, only a few years
ago, was the preserve of developers at a handful of large companies and national labs.
Things have changed dramatically in the last decade or so, and now everybody can
build small- and medium-scale distributed applications in a variety of programming
languages including, of course, our favorite one: Python.
This book is a very practical guide for Python programmers who are starting to build
their own distributed systems. It starts off by introducing the minimum set of theoretical
concepts needed to understand parallel and distributed computing, which lays the
foundations for the rest of the (more practical) chapters.
It then looks at some first examples of parallelism using nothing more than modules
from the Python standard library. The next step is to move beyond the confines of a
single computer and start using more and more nodes. This is accomplished using a
number of third-party libraries, including Celery and Pyro.
The remaining chapters investigate a few deployment options for our distributed
applications. The cloud and classic High Performance Computing (HPC) clusters,
together with their strengths and challenges, take center stage.
Finally, the thorny issues of monitoring, logging, profiling, and debugging are
touched upon.
All in all, this is very much a hands-on book, teaching you how to use some of the
most common frameworks and methodologies to build parallel and distributed
systems in Python.