Deadlink crawler
================

This is a small crawler that searches a website for dead links.

Dependencies
------------

All dependencies are listed in the `requirements.txt` file. You can create an
environment and install them with:

```bash
virtualenv env
source env/bin/activate
pip install -r requirements.txt
```
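
If `virtualenv` is not available, the standard-library `venv` module should give an equivalent setup (a minimal sketch, assuming the crawler runs under Python 3):

```bash
# Create and activate a virtual environment with the built-in venv module,
# then install the pinned dependencies
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
```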

Via command line
----------------

There is a CLI interface to use the crawler. You **must** pass a URL as the starting point for crawling. This might be the home page of your website.

Additional options are:

- `--restrict`: Restrict the crawl to pages whose URLs match the given regular expression
  - If not specified, defaults to all pages within the domain of the start URL
- `--wait`: Time in seconds to wait between opening each URL. Default: 0
- `--politeness`: Time in seconds to wait between two requests to the same domain. Default: 1
- `--exclude`: Exclude URLs matching the given regex from the crawl and from dead-link checking
- `--silent`: Turn off verbose output; only print a summary at the end
- `--debug`: Be extra verbose, printing all links found on each page
- `--report40x`: Report other 40x status codes as errors as well; by default only 404 is reported
Examples:
```bash
# Crawl all pages of http://stefan-koch.name/ for dead links (including external dead links),
# waiting one second between opening each URL
python crawler.py --wait 1 http://stefan-koch.name/

# Crawl all article pages of example.com for dead links.
# We assume that articles are linked from the main page.
python crawler.py --restrict 'http://example.com/article/.+' http://example.com/

# Crawl all subdomains of example.com, in silent mode, reporting HTTP 40x as dead
python crawler.py --silent --report40x --restrict 'http://.*\.example\.com/.*' http://www.example.com/

# Crawl example.com, excluding print pages and calendars
# (quote the regex so the shell does not treat | as a pipe)
python crawler.py --exclude 'print|calendar' http://www.example.com/
```
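
The `--politeness` flag is not used in the examples above. As an additional, hypothetical invocation (the domain is a placeholder), it combines with the other flags in the same way:

```bash
# Wait two seconds between requests to the same domain and
# only print the summary at the end
python crawler.py --politeness 2 --silent http://www.example.com/
```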

Using an instance of the class
------------------------------

You can also use the crawler from Python by creating an instance of the crawler class and running it. The class supports several configuration options.

```python
from crawler import Crawler  # assuming the Crawler class is defined in crawler.py

# Begin crawling at example.com
c = Crawler("http://example.com/")

# Restrict crawling to your own domain
c.set_url_restrict("http://example.com/.*")

# Wait one second between each URL to avoid putting too much load
# on the target website. On a personal PC this usually does not matter,
# because the crawler is not distributed and your bandwidth is limited.
c.set_wait_time(1)

# Start the crawling process
c.crawl()
```

License
-------
The crawler is licensed under the Apache Software License v2.0; see [LICENSE.txt](LICENSE.txt) for details.

Version history
---------------
See [CHANGES.md](CHANGES.md) for the complete version history.