*
Which version of Python do I need?
Python 2.4, 2.5, 2.6, or 2.7. Python 3 is not yet supported.
*
Does mechanize depend on BeautifulSoup?
No. mechanize offers a few classes that make use of BeautifulSoup, but
these classes are not required to use mechanize. mechanize bundles
BeautifulSoup version 2, so that module is no longer required. A future
version of mechanize will support BeautifulSoup version 3, at which point
mechanize will likely no longer bundle the module.
*
Does mechanize depend on ClientForm?
No, ClientForm is now part of mechanize.
*
Which license?
mechanize is dual-licensed: you may pick either the [BSD
license](http://www.opensource.org/licenses/bsd-license.php), or the [ZPL
2.1](http://www.zope.org/Resources/ZPL) (both are included in the
distribution).
Usage
-----
*
I'm not getting the HTML page I expected to see.
[Debugging tips](hints.html)
*
`Browser` doesn't have all of the forms/links I see in the
HTML. Why not?
Perhaps the default parser can't cope with invalid HTML. Try using the
included BeautifulSoup 2 parser instead:
~~~~{.python}
import mechanize
browser = mechanize.Browser(factory=mechanize.RobustFactory())
browser.open("http://example.com/")
# .forms() is a method; iterate over it to print each parsed form
for form in browser.forms():
    print form
~~~~
Alternatively, you can process the HTML (and headers) arbitrarily:
~~~~{.python}
import mechanize
browser = mechanize.Browser()
browser.open("http://example.com/")
# For example, normalise self-closing <br/> tags before the HTML is parsed
html = browser.response().get_data().replace("<br/>", "<br />")
response = mechanize.make_response(
    html, [("Content-Type", "text/html")],
    "http://example.com/", 200, "OK")
browser.set_response(response)
~~~~
*
Is JavaScript supported?
No, sorry. See the related FAQs below on [changing hidden field
values](#change-value) and on [JavaScript](#script).
*
My HTTP response data is truncated.
`mechanize.Browser`'s response objects support the `.seek()` method, and
can still be used after `.close()` has been called. Response data is not
fetched until it is needed, so navigation away from a URL before fetching all
of the response will truncate it. Call `response.get_data()` before navigation
if you don't want that to happen.
*
I'm *sure* this page is HTML, why does `mechanize.Browser`
think otherwise?
~~~~{.python}
import mechanize
b = mechanize.Browser(
    # mechanize's XHTML support needs work, so is currently switched off. If
    # we want to get our work done, we have to turn it on by supplying a
    # mechanize.Factory (with XHTML support turned on):
    factory=mechanize.DefaultFactory(i_want_broken_xhtml_support=True)
)
~~~~
*
Why don't timeouts work for me?
Timeouts are ignored with versions of Python earlier than 2.6.
Timeouts do not apply to DNS lookups.
*
Is there any example code?
Look in the `examples/` directory. Note that the examples on the [forms
page](./forms.html) are executable as-is. Contributions of example code
would be very welcome!
Cookies
-------
*
Doesn't the standard Python library module, `Cookie`, do
this?
No: module `Cookie` does the server end of the job. It doesn't know when
to accept cookies from a server or when to send them back. Part of
mechanize has been contributed back to the standard library as module
`cookielib` (there are a few differences, notably that `cookielib` contains
thread synchronization code; mechanize does not use `cookielib`).
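To illustrate what "the server end of the job" means, here is a minimal sketch using only the standard library (the cookie names and values are made up). The module builds and parses cookie headers, but nothing in it decides whether a cookie should be accepted from, or returned to, a particular site:

```python
try:
    from Cookie import SimpleCookie        # Python 2
except ImportError:
    from http.cookies import SimpleCookie  # Python 3

# Server end: build the header a server sends to set a cookie.
outgoing = SimpleCookie()
outgoing["session"] = "abc123"
set_cookie_header = outgoing.output()

# Server end: parse the Cookie header that a client sent back.
incoming = SimpleCookie()
incoming.load("session=abc123; theme=dark")
session_value = incoming["session"].value
```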
*
Which HTTP cookie protocols does mechanize support?
Netscape and [RFC 2965](http://www.ietf.org/rfc/rfc2965.txt). RFC 2965
handling is switched off by default.
*
What about RFC 2109?
RFC 2109 cookies are currently parsed as Netscape cookies, and treated
by default as RFC 2965 cookies thereafter if RFC 2965 handling is enabled,
or as Netscape cookies otherwise.
*
Why don't I have any cookies?
See [here](hints.html#cookies).
*
My response claims to be empty, but I know it's not!
Did you call `response.read()` (e.g., in a debug statement), then forget
that all the data has already been read? In that case, you may want to use
`mechanize.response_seek_wrapper`. `mechanize.Browser` always returns
[seekable responses](doc.html#seekable-responses), so it's not necessary to
use this explicitly in that case.
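The failure mode can be reproduced without any network access. This is a minimal sketch that uses `io.BytesIO` from the standard library as a stand-in for a seekable response (an assumption for illustration; a real response would come from mechanize). A second `.read()` returns nothing until the position is rewound with `.seek(0)`:

```python
import io

# Stand-in for a seekable response body (hypothetical content).
response = io.BytesIO(b"<html><body>hello</body></html>")

first = response.read()    # consumes the whole body
second = response.read()   # nothing left, so the response looks "empty"

response.seek(0)           # rewind to the start
third = response.read()    # the full body is available again
```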
*
What's the difference between the `.load()` and `.revert()`
methods of `CookieJar`?
`.load()` *appends* cookies from a file. `.revert()` discards all
existing cookies held by the `CookieJar` first (but it won't lose any
existing cookies if the loading fails).
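The difference can be seen in a short sketch. It uses the standard library's `MozillaCookieJar` as a stand-in, on the assumption that mechanize's file-based cookie jars behave the same way as described above; `make_cookie` and `cookie_names` are hypothetical helpers written for this example:

```python
import os
import tempfile

try:
    import cookielib                    # Python 2
except ImportError:
    import http.cookiejar as cookielib  # Python 3


def make_cookie(name, value):
    """Build a simple session cookie for example.com (hypothetical helper)."""
    return cookielib.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain="example.com", domain_specified=False,
        domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={})


def cookie_names(jar):
    """Return the sorted names of all cookies held by the jar."""
    return sorted(cookie.name for cookie in jar)


filename = os.path.join(tempfile.mkdtemp(), "cookies.txt")

# Write a file containing one cookie, "saved".
on_disk = cookielib.MozillaCookieJar(filename)
on_disk.set_cookie(make_cookie("saved", "1"))
on_disk.save(ignore_discard=True)

# A second jar already holds "in_memory" before it touches the file.
jar = cookielib.MozillaCookieJar(filename)
jar.set_cookie(make_cookie("in_memory", "2"))

jar.load(ignore_discard=True)     # appends: both cookies are now present
after_load = cookie_names(jar)

jar.revert(ignore_discard=True)   # discards first: only the file's cookie remains
after_revert = cookie_names(jar)
```

`ignore_discard=True` is needed here only because the example cookies are session cookies, which are otherwise skipped when saving and loading.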
*
Is it threadsafe?
No. As far as I know, you can use mechanize in threaded code, but it
provides no synchronisation: you have to provide that yourself.
*
How do I do
Refer to the API documentation in docstrings.
Forms
-----
*
Doesn't the standard Python library module, `cgi`, do this?
No: the `cgi` module does the server end of the job. It doesn't know
how to parse or fill in a form or how to send it back to the server.
*
How do I figure out what control names and values to use?
`print form` is usually all you need. In your code, things like the
`HTMLForm.items` attribute of `HTMLForm` instances can be useful to inspect
forms at runtime. Note that it's possible to use item labels instead of
item names, which can be useful — use the `by_label` arguments to the
various methods, and the `.get_value_by_label()` / `.set_value_by_label()`
methods on `ListControl`.
*
What do those `'*'` characters mean in the string
representations of list controls?
A `*` next to an item means that item is selected.
*
What do those parentheses (round brackets) mean in the string
representations of list controls?
Parentheses `(foo)` around an item mean that item is disabled.
*
Why doesn't a control turn up in the data returned by
`.click*()` when that control has a non-`None` value?
Either the control is disabled, or it is not successful for some other
reason. 'Successful' (see [HTML 4
specification](http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13.2))
means that the control will cause data to get sent to the server.
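As a rough illustration of the idea, the following sketch applies a simplified reading of the HTML 4 rules to controls represented as plain dicts. This is not mechanize's implementation: the dict layout and the `is_successful` function are hypothetical, and cases such as `SELECT` options and file uploads are left out.

```python
def is_successful(control, clicked=None):
    """Return True if `control` would contribute data to a form submission.

    `control` is a plain dict (a hypothetical representation, not a
    mechanize object) with the keys "type" and "name", plus optional
    "disabled" and "checked" flags.  `clicked` is the name of the submit
    button that was activated, if any.
    """
    if control.get("disabled") or not control.get("name"):
        return False                         # disabled or unnamed: never sent
    kind = control["type"]
    if kind in ("reset", "button"):
        return False                         # these buttons send nothing
    if kind in ("checkbox", "radio"):
        return bool(control.get("checked"))  # only "on" items are sent
    if kind in ("submit", "image"):
        return control["name"] == clicked    # only the activated button
    return True                              # text, hidden, and so on


controls = [
    {"type": "text", "name": "q"},
    {"type": "text", "name": "secret", "disabled": True},
    {"type": "checkbox", "name": "agree", "checked": False},
    {"type": "radio", "name": "size", "checked": True},
    {"type": "submit", "name": "go"},
    {"type": "submit", "name": "cancel"},
    {"type": "reset", "name": "clear"},
]
sent = [c["name"] for c in controls if is_successful(c, clicked="go")]
```

Here only `q`, `size` and the clicked `go` button would be submitted.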
*
Why does mechanize not follow the HTML 4.0 / RFC 1866
standards for `RADIO` and multiple-selection `SELECT` controls?
Because by default, it follows browser behaviour when setting the
initially-selected items in list controls that have no items explicitly
selected in the HTML. Use the `select_default` argument to `ParseResponse`
if you want to follow the RFC 1866 rules instead. Note that browser
behaviour violates the HTML 4.01 specification in the case of `RADIO`
controls.
*
Why does `.click()`ing on a button not work for me?
* Clicking on a `RESET` button doesn't do anything, by design - this is a
library for web automation, not an interactive browser. Even in an
interactive browser, clicking on `RESET` sends nothing to the server,
so there is little point in having `.click()` do anything special here.
* Clicking on a `BUTTON TYPE=BUTTON` doesn't do anything either, also by
design. This time, the reason is that this kind of `BUTTON` is only in the
HTML standard so that one can attach JavaScript callbacks to its
events. Executing those callbacks may result in information getting sent back to
the server. mechanize, however, knows nothing about these callbacks,
so it can't do anything useful with a click on a `BUTTON` whose type is
`BUTTON`.
* Generally, JavaScript may be messing things up in all kinds of ways.
See the answer to the next question.
*
How do I change `INPUT
TYPE=HIDDEN` field values (for example, to emulate the effect of JavaScript
code)?
As with any control, set the control's `readonly` attribute false.
~~~~{.python}
form.find_control("foo").readonly = False # allow changing .value of control foo
form.set_all_readonly(False) # allow changing the .value of all controls
~~~~
*
I'm having trouble debugging my code.
See [here](hints.html) for a few relevant tips.
*
I have a control containing a list of integers. How do I
select the one whose value is nearest to the one I want?
~~~~{.python}
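def closest_int_value(form, ctrl_name, value):
    # Item names are strings; compare them as integers.
    values = [int(item.name) for item in form.find_control(ctrl_name).items]
    # Choose the value with the smallest distance from the target
    # (on a tie, the smaller value wins).
    return str(min((abs(v - value), v) for v in values)[1])

form["distance"] = [closest_int_value(form, "distance", 23)]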
~~~~
General
-------
*
I want to see what my web browser is
doing, but standard network sniffers like
[wireshark](http://www.wireshark.org/) or netcat (nc) don't work for HTTPS.
How do I sniff HTTPS traffic?
Three good options:
* Mozilla plugin: [LiveHTTPHeaders](http://livehttpheaders.mozdev.org/).
* [ieHTTPHeaders](http://www.blunck.info/iehttpheaders.html) does
the same for MSIE.
* Use [`lynx`](http://lynx.browser.org/) `-trace`, and filter out
the junk with a script.
*
JavaScript is messing up my
web-scraping. What do I do?
JavaScript is used in web pages for many purposes -- for example: creating
content that was not present in the page at load time, submitting or
filling in parts of forms in response to user actions, setting cookies,
etc. mechanize does not provide any support for JavaScript.
If you come across this in a page you want to automate, you have four
options. Here they are, roughly in order of simplicity.
* Figure out what the JavaScript is doing and emulate it in your Python
code: for example, by manually adding cookies to your `CookieJar`
instance, calling methods on `HTMLForm`s, calling `urlopen`, etc. See
[above](#change-value) re forms.
* Use Java's [HtmlUnit](http://htmlunit.sourceforge.net/) or
[HttpUnit](http://httpunit.sourceforge.net) from Jython, since they
know some JavaScript.
* Instead of using mechanize, automate a browser. For example,
use MS Internet Explorer via its COM automation interfaces, using the
[Python for Windows
extensions](http://starship.python.net/crew/mhammond/), aka pywin32,
aka win32all (e.g. [simple
function](http://vsbabu.org/mt/archives/2003/06/13/ie_automation.html),
[pamie](http://pamie.sourceforge.net/); [pywin32 chapter from the
O'Reilly
book](http://www.oreilly.com/catalog/pythonwin32/chapter/ch12.html)) or
[ctypes](http://python.net/crew/theller/ctypes/)
([example](http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/305273)).
[This](http://www.brunningonline.net/simon/blog/archives/winGuiAuto.py.html)
kind of thing may also come in useful on Windows for cases where the
automation API is lacking. For Firefox, there is
[PyXPCOM](https://developer.mozilla.org/en/PyXPCOM).
* Get ambitious and automatically delegate the work to an appropriate
interpreter (Mozilla's JavaScript interpreter, for instance). This is
what HtmlUnit and HttpUnit do. I did a spike along these lines some
years ago, but I think it would (still) be quite a lot of work to do
well.
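As an example of the first option, a cookie that a page's script would normally set can be recreated by hand. This is a minimal sketch using the standard library's `CookieJar` as a stand-in, on the assumption that mechanize's own `CookieJar` and `Cookie` classes take the same arguments; the cookie name, value and domain are hypothetical:

```python
try:
    import cookielib                    # Python 2
    from urllib2 import Request
except ImportError:
    import http.cookiejar as cookielib  # Python 3
    from urllib.request import Request

jar = cookielib.CookieJar()

# Recreate by hand the cookie that the page's script would have set.
jar.set_cookie(cookielib.Cookie(
    version=0, name="js_token", value="abc123",
    port=None, port_specified=False,
    domain="example.com", domain_specified=False, domain_initial_dot=False,
    path="/", path_specified=True,
    secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={}))

# Check that the jar would send the cookie back with a request to that site.
request = Request("http://example.com/page")
jar.add_cookie_header(request)
cookie_header = request.get_header("Cookie")
```

No network access is involved: `add_cookie_header` only attaches the matching cookies to the request object.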
*
Misc links
* The following libraries can be useful for dealing
with bad HTML: [lxml.html](http://codespeak.net/lxml/lxmlhtml.html),
[html5lib](http://code.google.com/p/html5lib/), [BeautifulSoup
3](http://www.crummy.com/software/BeautifulSoup/CHANGELOG.html),
[mxTidy](http://www.egenix.com/files/python/mxTidy.html) and
[µTidylib](http://utidylib.berlios.de/).
* [Selenium](http://www.openqa.org/selenium/): In-browser web functional
testing. If you need to test websites against real browsers, this is a
standard way to do it.
* O'Reilly book: [Spidering
Hacks](http://oreilly.com/catalog/9780596005771). Very Perl-oriented.
* Standard extensions for web development with Firefox, which are also
handy if you're scraping the web: [Web
Developer](http://chrispederick.com/work/webdeveloper/) (amongst other
things, this can display HTML form information),
[Firebug](http://getfirebug.com/).
* Similar functionality for IE6 and IE7: [Internet Explorer Developer
Toolbar](http://www.google.co.uk/search?q=internet+explorer+developer+toolbar&btnI=I'm+Feeling+Lucky)
(IE8 comes with something equivalent built-in, as does Google Chrome).
* [Open source functional testing
tools](http://www.opensourcetesting.org/functional.php).
* [A HOWTO on web
scraping](http://www.rexx.com/~dkuhlman/quixote_htmlscraping.html) from
Dave Kuhlman.
*
Will any of this code make its way into the Python standard
library?
The request / response processing extensions to `urllib2` from mechanize
have been merged into `urllib2` for Python 2.4. The cookie processing has
been added, as module `cookielib`. There are other features that would be
appropriate additions to `urllib2`, but since Python 2 is heading into
bugfix-only mode, and I'm not using Python 3, they're unlikely to be added.
*
Where can I find out about the relevant standards?
* [HTML 4.01 Specification](http://www.w3.org/TR/html401/)
* [Draft HTML 5 Specification](http://dev.w3.org/html5/spec/)
* [RFC 1866](http://www.ietf.org/rfc/rfc1866.txt) - the HTML 2.0
standard (you don't want to read this)
* [RFC 1867](http://www.ietf.org/rfc/rfc1867.txt) - Form-based file
upload
* [RFC 2616](http://www.ietf.org/rfc/rfc2616.txt) - HTTP 1.1
Specification
* [RFC 3986](http://www.ietf.org/rfc/rfc3986.txt) - URIs
* [RFC 3987](http://www.ietf.org/rfc/rfc3987.txt) - IRIs