
WebCheck: monitor text strings on websites

WebCheck is a program to check Web pages for changes to specific phrases of text.  Some web monitoring programs and "watchlist" facilities etc will tell you when any change is made to a page, but that's of limited use when you are interested in only a few specific phrases, especially when these are surrounded by many other items which change far more frequently than the one that actually interests you.  So WebCheck lets you check for changes to a particular item on a page.

Note that this is not a "foolproof" method.  If a page lists "old news", or otherwise incorporates an old version of the item you're monitoring, WebCheck might fail to spot the new situation.  You have to use your judgement about when this program can reasonably be used.

WebCheck runs from the command line, usually from a cron job or similar, and writes any changes it finds to standard output, which can then be emailed or otherwise passed on (if using ImapFix, try its --maybenote option).

webcheck.list

The list of sites to check is in a text file called webcheck.list.  Each line (apart from blank lines and comments) specifies a URL to fetch and some text to check, optionally followed by a comment.  For example:

http://nice-program.example.com The latest version is 1.0

or

http://nice-program.example.com The latest version is 1.0 # otherwise we'd better upgrade

If the text starts with a * then the rest of it is treated as a regular expression, otherwise it is treated as a simple case-sensitive search.

You can check for the absence of certain text by prepending a ! to it:

http://wiki-page.example.org !spam

By default, the searches are made against the text on the page, not against its source code.  If you want to check the source code, prepend a > to the text or !text.
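The prefix rules above can be sketched in Python as follows. This is only an illustration and not WebCheck's own code: the function names are made up, and it assumes the prefixes come in the order >, then !, then *, which the text above suggests but does not spell out.

```python
import re

def parse_search_text(spec):
    """Split a search specification into (use_source, negate, is_regex, pattern).

    Assumed prefix order: optional '>' first, then optional '!', then optional '*'.
    """
    use_source = spec.startswith(">")
    if use_source:
        spec = spec[1:]
    negate = spec.startswith("!")
    if negate:
        spec = spec[1:]
    is_regex = spec.startswith("*")
    if is_regex:
        spec = spec[1:]
    return use_source, negate, is_regex, spec

def check(spec, page_text, page_source):
    """Return True if the check described by spec passes."""
    use_source, negate, is_regex, pattern = parse_search_text(spec)
    haystack = page_source if use_source else page_text
    if is_regex:
        found = re.search(pattern, haystack) is not None
    else:
        found = pattern in haystack  # simple case-sensitive search
    # A '!' check passes when the text is absent.
    return found != negate
```

For instance, check("!spam", text, source) passes only while the word spam does not appear in the page text.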

If you need to make more than one test on the same page, simply add multiple lines with the same URL. A shortcut for this is to specify also: on the second and subsequent lines, in place of the repeated URL. WebCheck does of course perform all tests on a page using a single fetch (the fetch itself is not duplicated for each test).
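As a rough sketch of how tests might be grouped so that each URL is fetched only once, consider the following. It is a simplified illustration rather than WebCheck's actual parser: group_tests is a hypothetical name, whole-line comments are assumed to start with #, and header lines, days lines and trailing comments are not handled.

```python
def group_tests(lines):
    """Map each URL to the list of search texts to run against one fetch of it."""
    tests = {}
    current_url = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or (assumed) whole-line comment
        first, _, rest = line.partition(" ")
        if first == "also:":
            # Shortcut: re-use the URL from the previous test line.
            if current_url is None:
                raise ValueError("also: used before any URL")
            url = current_url
        else:
            url = current_url = first
        tests.setdefault(url, []).append(rest.strip())
    return tests
```

Each key of the returned dictionary would then be fetched once, with all of its listed tests run against that one response.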

It is possible to add arbitrary HTTP headers (such as Accept-Language: en) on lines of their own; these apply to all subsequently-listed URLs.

RSS feeds and other items

You can follow new items on RSS/Atom feeds: give the feed URL and no search text.

If the site lists new items but does not support RSS, you can also extract items, by setting the search text to {START...END} where START and END are starting and ending strings that surround each item. (By default this is done on the parsed version of the page; to do it on the HTML source, add a > before the { at the start of the search text.)
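The item extraction described above could look something like this in Python. It is a minimal sketch under assumptions: the function names are hypothetical, and it assumes the START and END strings themselves are not kept as part of each item, which the description does not confirm.

```python
def parse_item_spec(spec):
    """Split '{START...END}' into (start, end)."""
    if not (spec.startswith("{") and spec.endswith("}") and "..." in spec):
        raise ValueError("expected {START...END}")
    start, end = spec[1:-1].split("...", 1)
    return start, end

def extract_items(text, start, end):
    """Return every piece of text found between a start string and the next end string."""
    if not start or not end:
        raise ValueError("start and end must be non-empty")
    items = []
    pos = 0
    while True:
        i = text.find(start, pos)
        if i == -1:
            break
        i += len(start)
        j = text.find(end, i)
        if j == -1:
            break
        items.append(text[i:j].strip())
        pos = j + len(end)  # continue searching after this item
    return items
```

Comparing the extracted list with the one saved from the previous run would then show which items are new.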

You can also check for DNS changes (useful if you're maintaining a hosts file somewhere due to unreliable DNS or an awkward proxy situation): URLs starting dns:// will return a list of all current IPs, each enclosed in parentheses. So for example to be alerted if 93.184.216.34 ceases to be one of the IP addresses of example.com, use dns://example.com (93.184.216.34)
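To illustrate the idea, here is a small Python sketch that produces text of the kind described. The function names and the choice of a space as separator are assumptions, and this version looks up IPv4 addresses only. Enclosing each address in parentheses means that a search for (93.184.216.34) cannot accidentally match a longer address such as 93.184.216.345.

```python
import socket

def format_ips(ips):
    """Join IP addresses into one string, each enclosed in parentheses."""
    return " ".join("(%s)" % ip for ip in sorted(set(ips)))

def dns_text(hostname):
    """Look up hostname and return its current IPv4 addresses in parentheses."""
    _name, _aliases, ips = socket.gethostbyname_ex(hostname)
    return format_ips(ips)
```

A plain text search for "(93.184.216.34)" in the result of dns_text("example.com") then corresponds to the check shown above.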

Also if a server has been unreachable for a long time and you want to be alerted if it ever becomes reachable again, you can place up:// before the URL (e.g. up://http://www.example.com) which will return yes or no and not report an error if the server is not reachable.
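A reachability check of this kind might be sketched as below. This is not taken from WebCheck itself: up_status and its fetch parameter are hypothetical, the 10-second timeout is arbitrary, and treating an HTTP error response as "reachable" is an assumption.

```python
import urllib.error
import urllib.request

def up_status(spec, fetch=urllib.request.urlopen):
    """Return 'yes' if the server answered at all, 'no' if it could not be reached."""
    url = spec[len("up://"):] if spec.startswith("up://") else spec
    try:
        fetch(url, timeout=10).close()
    except urllib.error.HTTPError:
        return "yes"  # assumption: an HTTP error status still means the server is up
    except OSError:
        return "no"   # connection refused, DNS failure, timeout and similar
    return "yes"
```

A normal text check for yes against this result would then report the moment the server comes back.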

Using webdriver

If the page you wish to check requires interaction with complex JavaScript before it is reached (for example if you need to "log in" to the site and perform other actions before it becomes available), then you can use the 'webdriver' interface via Selenium and PhantomJS. If you need to set it up in your home directory, use pip install selenium --root $HOME/whatever, set PYTHONPATH appropriately, and put the phantomjs binary in your PATH before running webcheck. This is less efficient than simple URL fetching, but some sites make it necessary.

An instruction to fetch data via webdriver looks like this:

{ http://site.example.org/ [Click here to show the login form] #txtUsername=me@example.com [#okButton] [Show results] "Results" }

where the first word is the starting URL, and items in square brackets will click either a link with that exact text or an element with the ID specified after a # (check for id= in a browser's Document Inspector or similar). #id=text sends keystrokes text to an input field with ID id, and text in quotes causes the browser to wait until the page source contains it. Also available is #id->text to select from a drop-down (by visible text; blank means deselect all; add quotes if multiple words) and #id*n to set a checkbox to state n (0 or 1).

Some sites make you click each item on a results page to reveal an individual result. To automate this, use /start/5 where 'start' is the start of each item ID and 5 is the number of seconds to wait after clicking. A snapshot of the page after each click will be added to that of the final page, and the checks (or item extractions) that you specify will occur on the combined result. It's assumed that no 'back' button needs to be pressed between clicks.
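The instruction syntax in this section can be broken into steps along the following lines. This sketch only classifies the parts of an instruction and does not drive a browser. The action names are invented for illustration, and where the description leaves details open (such as which of =, -> or * wins when a token contains more than one) the choice made here is an assumption.

```python
import re

# A token is [bracketed text], "quoted text", #id->"quoted value", or a run of non-spaces.
TOKEN = re.compile(r'\[[^\]]*\]|"[^"]*"|#[^\s"]*->"[^"]*"|\S+')
FIELD = re.compile(r'^#(.+?)(->|=|\*)(.*)$')  # earliest operator wins (assumption)
EACH = re.compile(r'^/(.+)/(\d+)$')

def parse_instruction(instruction):
    """Turn '{ URL action action ... }' into a list of (kind, ...) tuples."""
    body = instruction.strip()
    if body.startswith("{") and body.endswith("}"):
        body = body[1:-1]
    tokens = TOKEN.findall(body)
    if not tokens:
        raise ValueError("no starting URL")
    actions = [("open", tokens[0])]
    for tok in tokens[1:]:
        each = EACH.match(tok)
        field = FIELD.match(tok)
        if tok.startswith("[#"):
            actions.append(("click_id", tok[2:-1]))
        elif tok.startswith("["):
            actions.append(("click_link", tok[1:-1]))
        elif tok.startswith('"'):
            actions.append(("wait_for", tok[1:-1]))
        elif each:
            actions.append(("click_each", each.group(1), int(each.group(2))))
        elif field:
            elem_id, op, value = field.groups()
            if op == "->":
                actions.append(("select", elem_id, value.strip('"')))
            elif op == "*":
                actions.append(("checkbox", elem_id, value == "1"))
            else:
                actions.append(("keys", elem_id, value))
        else:
            actions.append(("unknown", tok))
    return actions
```

Applied to the example instruction above, this yields an open step, a link click, keystrokes for txtUsername, a click on okButton, another link click, and finally a wait for Results.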

Efficiency

To be as efficient as is reasonable for this kind of program, WebCheck has features such as connection re-use and last-modified handling. However, these are not performed when using webdriver (except within each session of course).

You can also change the frequency of specific checks with the days command, which must appear on a line of its own, for example:

days 5

which specifies that the addresses below that line will be checked only if the day they were previously checked was at least 5 days ago (unless they are also listed in sections that require more frequent checks).  For convenience, daily, weekly and monthly are short for days 1, days 7 and days 30 respectively.
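The days rule can be expressed as a small date comparison. The sketch below uses hypothetical names and assumes that the date of the previous check is stored somewhere between runs, and that an address never checked before is always due.

```python
import datetime

ALIASES = {"daily": 1, "weekly": 7, "monthly": 30}

def parse_days(line):
    """Return the interval in days for a 'days N' line or an alias, else None."""
    words = line.split()
    if len(words) == 1 and words[0] in ALIASES:
        return ALIASES[words[0]]
    if len(words) == 2 and words[0] == "days" and words[1].isdigit():
        return int(words[1])
    return None

def is_due(last_checked, today, interval_days):
    """True if never checked, or last checked at least interval_days ago."""
    if last_checked is None:
        return True
    return (today - last_checked).days >= interval_days
```

If an address appears under several intervals, taking the smallest of them before calling is_due would match the "more frequent checks" rule described above.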

Download

webcheck.py (License: GPL v3; contact me if you need a different license)
All material © Silas S. Brown unless otherwise stated.