Deadlink crawler
================

This is a small crawler that searches a website for dead links.

Dependencies
------------

All dependencies are listed in the `requirements.txt` file. You can create an
environment with:

```bash
virtualenv env
source env/bin/activate
pip install -r requirements.txt
```

Via command line
----------------

The crawler comes with a command-line interface. You **must** pass a URL as the starting point for crawling; this might be the home page of your website.

The following additional options are available:

- `--restrict`: Restrict the crawl to pages whose URLs match the given regular expression.
  - If not specified, this defaults to all pages within the domain of the start URL.
- `--wait`: Time (in seconds) to wait between opening each URL. Default: 0
- `--politeness`: Time (in seconds) to wait between two requests to the same domain. Default: 1
- `--exclude`: Exclude URLs matching the given regular expression from the crawl and from dead-link checking.
- `--silent`: Turn off verbose output; only print a summary at the end.
- `--debug`: Be extra verbose, printing all links found on each page.
- `--report40x`: Report additional 40x status codes as errors; by default, only 404 is reported.

Examples:

```bash
# Crawl all subsites of http://stefan-koch.name/ for dead links (including external dead links)
# Wait one second between opening each URL
python crawler.py --wait 1 http://stefan-koch.name/

# Crawl all article pages of example.com for dead links.
# We assume that there are linked articles on the main page
python crawler.py --restrict 'http://example.com/article/.+' http://example.com/

# Crawl all subdomains of example.com, in silent mode and reporting HTTP 40x as dead
python crawler.py --silent --report40x --restrict 'http://.*\.example\.com/.*' http://www.example.com/

# Crawl example.com, excluding print pages and calendars
python crawler.py --exclude 'print|calendar' http://www.example.com/
```

A further example that exercises `--politeness` and `--debug` is sketched under "Additional examples" at the end of this README.

Using an instance of the class
------------------------------

You can also use the crawler programmatically by creating an instance of the `Crawler` class and running it. The class supports several options:

```python
# Begin crawling at example.com
c = Crawler("http://example.com/")

# Restrict crawling to your own domain
c.set_url_restrict("http://example.com/.*")

# Wait one second between URLs to avoid putting too much load
# on the website. For a single, non-distributed crawler running
# on a personal machine this usually does not matter much.
c.set_wait_time(1)

# Start the crawling process
c.crawl()
```

A self-contained variant of this snippet, including the import, is sketched under "Additional examples" below.

License
-------
The crawler is licensed under the Apache Software License v2.0; see [LICENSE.txt](LICENSE.txt) for details.

Version history
---------------
See [CHANGES.md](CHANGES.md) for the complete version history.
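
Additional examples
-------------------

The command-line examples above do not exercise `--politeness` or `--debug`. The following is a minimal sketch using only the options documented above; the host name and the two-second politeness value are placeholders, not recommendations.

```bash
# Hypothetical run: wait two seconds between requests to the same domain
# and print every link found on each page while crawling
python crawler.py --politeness 2 --debug http://www.example.com/
```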
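
The class-based snippet above assumes that `Crawler` is already in scope. A self-contained variant might look as follows; the `from crawler import Crawler` import is an assumption based on the CLI entry point `crawler.py`, so adjust it to your actual module layout.

```python
# Minimal self-contained sketch; the import path is an assumption (see above)
from crawler import Crawler

if __name__ == "__main__":
    c = Crawler("http://example.com/")           # start page
    c.set_url_restrict("http://example.com/.*")  # stay within example.com
    c.set_wait_time(1)                           # one second between URLs
    c.crawl()                                    # run the crawl
```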