Web Scraping with Python: Collecting More Data from the Modern Web : 2nd Edition Paperback – 4 April 2018
If programming is magic, then web scraping is surely a form of wizardry. By writing a simple automated program, you can query web servers, request data, and parse it to extract the information you need. The expanded edition of this practical book not only introduces you to web scraping, but also serves as a comprehensive guide to scraping almost every type of data from the modern web.
Part I focuses on web scraping mechanics: using Python to request information from a web server, performing basic handling of the server’s response, and interacting with sites in an automated fashion. Part II explores a variety of more specific tools and applications to fit any web scraping scenario you’re likely to encounter.
Book features:
- Parse complicated HTML pages
- Develop crawlers with the Scrapy framework
- Learn methods to store data you scrape
- Read and extract data from documents
- Clean and normalize badly formatted data
- Read and write natural languages
- Crawl through forms and logins
- Use and write image-to-text software
- Avoid scraping traps and bot blockers
- Use scrapers to test your website
About the Author
Ryan Mitchell is a Software Engineer at LinkeDrive in Boston, where she develops their API and data analysis tools. She is a graduate of Olin College of Engineering and a master's degree student at Harvard University School of Extension Studies. Prior to joining LinkeDrive, she was a Software Engineer working on web scraping and data analysis at Abine.
From the Publisher
From the Preface
What Is Web Scraping?
The automated gathering of data from the internet is nearly as old as the internet itself. Although web scraping is not a new term, in years past the practice has been more commonly known as screen scraping, data mining, web harvesting, or similar variations. General consensus today seems to favor web scraping, so that is the term I use throughout the book, although I also refer to programs that specifically traverse multiple pages as web crawlers or refer to the web scraping programs themselves as bots.
In theory, web scraping is the practice of gathering data through any means other than a program interacting with an API (or, obviously, through a human using a web browser). This is most commonly accomplished by writing an automated program that queries a web server, requests data (usually in the form of HTML and other files that compose web pages), and then parses that data to extract needed information.
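The request-then-parse loop described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the book's own code: the class name `TitleAndLinkParser`, the helper names `fetch` and `parse`, and the `SAMPLE_HTML` page are all hypothetical and exist only for this example. The parsing step runs against the inline sample so the sketch works without network access.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleAndLinkParser(HTMLParser):
    """Collects the page title and every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors that carry no href attribute
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url):
    """Query a web server and return the response body as text.

    Hypothetical helper; calling it requires network access.
    """
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def parse(html):
    """Extract the title and link targets from an HTML string."""
    parser = TitleAndLinkParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.links


# Made-up sample page standing in for a server response.
SAMPLE_HTML = """
<html>
  <head><title>Example Catalogue</title></head>
  <body>
    <a href="/books/1">First book</a>
    <a href="/books/2">Second book</a>
    <a>No link here</a>
  </body>
</html>
"""

title, links = parse(SAMPLE_HTML)
print(title)
print(links)
```

To scrape a live page, the same `parse` function could be given the result of `fetch(url)` for a URL you are permitted to request.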
In practice, web scraping encompasses a wide variety of programming techniques and technologies, such as data analysis, natural language parsing, and information security. Because the scope of the field is so broad, this book covers the fundamentals of web scraping and crawling in Part I and delves into advanced topics in Part II. I suggest that all readers carefully study the first part and delve into the more specific topics in the second part as needed.
About This Book
This book is designed to serve not only as an introduction to web scraping, but as a comprehensive guide to collecting, transforming, and using data from uncooperative sources. Although it uses the Python programming language and covers many Python basics, it should not be used as an introduction to the language.
If you don’t know any Python at all, this book might be a bit of a challenge. Please do not use it as an introductory Python text. With that said, I’ve tried to keep all concepts and code samples at a beginning-to-intermediate Python programming level in order to make the content accessible to a wide range of readers. To this end, there are occasional explanations of more advanced Python programming and general computer science topics where appropriate. If you are a more advanced reader, feel free to skim these parts!
If you’re looking for a more comprehensive Python resource, 'Introducing Python' by Bill Lubanovic (O’Reilly) is a good, if lengthy, guide. For those with shorter attention spans, the video series 'Introduction to Python' by Jessica McKellar (O’Reilly) is an excellent resource. I’ve also enjoyed 'Think Python' by a former professor of mine, Allen Downey (O’Reilly). This last book in particular is ideal for those new to programming, and teaches computer science and software engineering concepts along with the Python language.
Technical books are often able to focus on a single language or technology, but web scraping is a relatively disparate subject, with practices that require the use of databases, web servers, HTTP, HTML, internet security, image processing, data science, and other tools. This book attempts to cover all of these, and other topics, from the perspective of 'data gathering.' It should not be used as a complete treatment of any of these subjects, but I believe they are covered in enough detail to get you started writing web scrapers!
Product details
- Publisher : O'Reilly Media, Inc, USA; 2nd edition (4 April 2018)
- Language : English
- Paperback : 300 pages
- ISBN-10 : 1491985577
- ISBN-13 : 978-1491985571
- Dimensions : 17.78 x 1.65 x 23.34 cm
- Best Sellers Rank: 163,426 in Books
Top reviews from other countries
I was also disappointed that the introduction to the new edition didn't reference the first book at all. This rather suggests the publisher wanted to get something out of the door quickly. Compare with other O'Reilly books where the author discusses the additions to the previous work.
Overall, it is not bad as an introduction if you are prepared to play around with the code and look things up online when you get stuck. If you already have the first edition, you might look elsewhere for a follow-up.
I am just over halfway through and learning a lot. There is a fair share of mistakes, which makes me wonder whether every example was tested. None are severe or broken beyond repair. Debugging is a valuable skill, so it checks that you are paying attention.
The writing is not as accessible as the No Starch Press publications I have read, but writing in that style would increase the page count.
Recommended, despite the flaws. Looking forward to the second half.
The code is built up and extended step by step. You should have some prior knowledge of Python or another language. Otherwise, the book is also suitable for beginners, provided they are willing to do some additional research. It is great that the code is available online, so it can be tested and also rewritten to implement your own ideas where needed.
The book is self-contained as such, but for me it lacks some depth in the area of databases. Databases are not the subject of the book, but more coverage would be quite helpful.