The “Basic Example” section below goes through all of these steps to illustrate the process. To download a page the “get” http method is required, I have used the requests package to implement the get. Parsing the page and using xpath expressions to navigate the tree is accomplished with classes within the lxml package. Cleaning and data analysis has been performed with string manipulation and the pandas package. The appendix section contains untility functions that are used to help with the analysis presented but are not central to it. THE CELLS IN THE APPENDIX SHOULD BE RUN FIRST!!
The “xpath Tutorial” section below is based on the w3 schools xpath tutorial, with the xpath commands performed via the lxml package.
The lxml Package is a Pythonic binding for the C libraries libxml2 & libxslt, it is compatible but similar to the well known ElementTree API. The latest release works with all CPython versions from 2.6 to 3.6.
For more information about this package including installation notes visit the links below:
Credit: Source link