Update 2009-10-03:
For further development and improvements, contact me or have a look at the public GitHub repository created by Adam Nelson.


From time to time you may want to create a screenshot of a web page from the command line, for example to create thumbnails for your web application. So you might search for such a program and find tools like webkit2png, which is for Mac OS X only, or khtml2png, which requires a lot of KDE stuff to be installed on your server.

But since Qt Software, formerly known as Trolltech, integrated Safari’s famous rendering engine WebKit (which is based on Konqueror’s khtml engine) into its framework, we are now able to make use of it with the help of some Python and PyQt4.


If you are in a hurry, click here to get a full-featured version of webkit2png.py.

I assume that you have some basic knowledge of Python. If you run into problems with the Qt part of this tutorial, I suggest having a look at the class documentation first. Please note that Qt is a C++ framework, and most of the example code in its documentation has not been ported to Python. So it might be helpful if you have some basic knowledge of C++, too.

Requirements: WebKit and PyQt4 (packages libqt4-webkit and python-qt4 if you’re using Ubuntu 8.10, Intrepid Ibex).

So, run your favourite editor (vim, of course) and start to enter some python code. First, we will have to organize some imports:

#!/usr/bin/env python
import sys # required to exit this program
import signal # required to catch CTRL-C (I'll explain this later)

# Some of the PyQt libs
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *

Qt is highly event-based (it calls its events “signals” and the handlers connected to them “slots”), so we have to prepare a “slot” which gets called when the page has been loaded completely:

def onLoadFinished(result):
    print "loadFinished(%s)" % str(result)
    sys.exit(0) # this is the moment when we have to quit normally

Even though we intend to write a CLI-based application, QtWebKit requires a GUI in the background. This is why we have to use QApplication instead of QCoreApplication. And because we will not have any visible controls, we should ensure that we can still quit this application using CTRL-C (this is why we have to import signal):

app = QApplication(sys.argv)
signal.signal(signal.SIGINT, signal.SIG_DFL)

Now we can create a QWebPage object without any exception or segmentation fault. Connect it to our “onLoadFinished” slot and load the URL you want to take a screenshot of (here I’m using Google):

webpage = QWebPage()
webpage.connect(webpage, SIGNAL("loadFinished(bool)"), onLoadFinished)
webpage.mainFrame().load(QUrl("http://www.google.com"))

If you run this application now, you’ll see… nothing. onLoadFinished might be called, but the result will be “False”. This is because Qt is so extremely event-based, and there is still no main loop to handle these events. So finally you have to start your QApplication:

sys.exit(app.exec_())

If you execute this now, the output should be:

loadFinished(True)

Good, the page is loaded! The next step is to render it into a file by expanding “onLoadFinished” (this means: all the code from now on has to be INSIDE of “onLoadFinished”). First, we should ensure that we do not proceed if we got an error:

def onLoadFinished(result):
    print "loadFinished(%s)" % str(result)
    if not result:
        print "Request failed"
        sys.exit(1)

Otherwise, we should enlarge the viewport (that is our virtual browser window) to the desired size. If you want to create a picture of the whole page, you should use the “preferred” size of the contents:

        print "Request failed"
        sys.exit(1)

    # Set the size of the (virtual) browser window
    webpage.setViewportSize(webpage.mainFrame().contentsSize())

And finally, render this into a QImage object and store it in a file:

    # Set the size of the (virtual) browser window
    webpage.setViewportSize(webpage.mainFrame().contentsSize())

    # Paint this frame into an image
    image = QImage(webpage.viewportSize(), QImage.Format_ARGB32)
    painter = QPainter(image)
    webpage.mainFrame().render(painter)
    painter.end()
    image.save("output.png")
    sys.exit(0) # quit this application

Done. Pretty easy, isn’t it? Oh, wait! QWebPage depends on QtGui, and QtGui depends on a running X server (at least on Unix systems). So how can we make use of this on a headless server machine? The answer is Xvfb, a framebuffer-based X server originally designed for testing purposes. Of course, it requires some X libraries and fonts, too (how should a page be rendered without any fonts?), but it has far less overhead than the real X.Org server and doesn’t need to be running all the time. Just call the script this way:

$ xvfb-run --server-args="-screen 0, 640x480x24" python webkit2png-simple.py

The screen size doesn’t matter, but the color depth of 24 bits is important. Otherwise, the resulting screenshot would be limited to 256 colors. For more options, have a look at the man pages of ‘Xvfb’ and ‘xvfb-run’.
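If you would rather start the script from another Python program than from a shell, the same call can be assembled as an argument list. This is only a minimal sketch: the helper names `xvfb_command` and `run_under_xvfb` and their parameters are made up for illustration, and the only xvfb-run option used is the `--server-args` one shown above.

```python
import subprocess


def xvfb_command(script, width=640, height=480, depth=24):
    """Build the xvfb-run argument list shown above (hypothetical helper)."""
    screen = "-screen 0, %dx%dx%d" % (width, height, depth)
    # Passing a list avoids shell quoting problems with the embedded spaces.
    return ["xvfb-run", "--server-args=" + screen, "python", script]


def run_under_xvfb(script):
    """Run the script under Xvfb and return its exit code."""
    return subprocess.call(xvfb_command(script))
```

Calling `run_under_xvfb("webkit2png-simple.py")` should then behave like the shell command above, provided xvfb-run is installed and on the PATH.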

Last, but not least, I’ll provide you with two versions of this script. webkit2png-simple.py is exactly the result of this tutorial, while webkit2png.py is a much improved version with command line arguments, written in OOP style.

Update 2009-04-01
Here’s another guy who had the same idea earlier than me.

39 Responses to “Create screenshots of a web page using Python and QtWebKit”

  1. Uniblogs · Uniblogs in the rear-view mirror: what mattered in week 49 Says:

    [...] has written. The topics are as varied as ever: speculation about the Uniblogs, a search for election helpers, screenshot automation with Python and Qt’s WebKit (very handy!), the 2009 university elections, nVidia and the Intrepid Ibex, and [...]

  2. Screenshot a URL with Python and Qt and WebKit « the renaissance man Says:

    [...] when I found the work of Roland Tapken. His script and explanation were the solution I needed. It made nice screenshots, had the [...]

  3. Alex Ezell Says:

    Roland, I am having trouble using this script. All of the screenshots turn out fine, but it seems like the Xvfb servers are not killed or exited properly. So, as I create screenshots, Xvfb processes are left behind every time the script runs. Do you have any thoughts why this might be happening?

  4. Roland Says:

    This sounds strange because the application exits immediately after the image is written to disk. Might be a problem with xvfb-run.

    I’ve read in your blog that you modified the script. This should not be necessary when you make the file executable and run it like this:

    ./webkit2png.py --xvfb [...]

    Please try this and tell me if it helps. If not, we can try to change the code so that it starts Xvfb by itself and kills the process before exit.

  5. Alex Ezell Says:

    Thanks for taking time to look at it Roland. The part I changed is the part that handles starting Xvfb. This is what I have:

    if options.xvfb:
        # Start 'xvfb' instance by replacing the current process
        newArgs = ["xvfb-run", "-a", "--server-args=-screen 0 1024x768x24", "python"]
        for i in range(0, len(sys.argv)):
            if sys.argv[i] not in ["-x", "--xvfb"]:
                newArgs.append(sys.argv[i])
        logging.debug("Executing %s" % " ".join(newArgs))
        os.execvp(newArgs[0], newArgs)
        raise RuntimeError("Failed to execute '%s'" % newArgs[0])

    I’ve added the “-a” and the “python” arguments to xvfb-run. If I don’t have “-a,” it will fail to start xvfb because it’s already running with that server id. If I don’t have “python” the command passed through xvfb-run is incorrect.

    I have tried it with your original script with no changes and Xvfb still doesn’t die. Perhaps, it’s some environment problem? I noticed that I have to use kill -9 to get the Xvfb process to die. A simple kill won’t work.

  6. Roland Says:

    Alex, I’ll reply to you by mail. If we find a solution, I’ll update the article.

  7. Roland Says:

    The bug described by Alex seems to be reported here:

    https://bugs.launchpad.net/ubuntu/+source/xorg-server/+bug/294454

    This has to be fixed by the Ubuntu people. However, we’re testing a workaround that tries to determine the PID of the active xvfb-run instance at the end of the script and then kills it with signal 9:

    m = re.match(r".*xvfb-run\.(\d+).*", os.environ['XAUTHORITY'])
    if m:
        os.kill(int(m.group(1)), 9)

    This code has to be injected near line 203, just before “sys.exit(0)”. It requires you to import the module “re” (regular expressions) at the beginning of the script.

    I will not add this to webkit2png.py as I’m really convinced that this bug has to be fixed in Ubuntu’s xvfb package.
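The workaround from the comment above can be wrapped in two small functions so that the parsing step can be checked on its own. This is a sketch under the same assumption the comment makes, namely that the XAUTHORITY path contains `xvfb-run.<PID>`. The function names are hypothetical, and a path in some other format simply yields `None`.

```python
import os
import re
import signal


def xvfb_run_pid(xauthority):
    """Return the PID embedded in an xvfb-run XAUTHORITY path, or None."""
    match = re.match(r".*xvfb-run\.(\d+).*", xauthority)
    return int(match.group(1)) if match else None


def kill_xvfb_run():
    """Send SIGKILL (signal 9) to the xvfb-run instance, if one is found."""
    pid = xvfb_run_pid(os.environ.get("XAUTHORITY", ""))
    if pid is not None:
        os.kill(pid, signal.SIGKILL)
```

Using `os.environ.get` instead of indexing avoids a KeyError when the script is run without xvfb-run and XAUTHORITY is not set.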

  8. Hubert Says:

    Hi,

    webkit2png.py always fails for me with “failed to load”:

    # ./webkit2png.py -x -o test.png --debug http://news.bbc.co.uk
    DEBUG:root:Executing xvfb-run --server-args=-screen 0, 640x480x24 ./webkit2png.py -o test.png --debug http://news.bbc.co.uk
    DEBUG:root:Initializing class WebkitRenderer
    DEBUG:root:render(http://news.bbc.co.uk, timeout=0)
    DEBUG:root:Processing result
    ERROR:root:Failed to load http://news.bbc.co.uk

    The simple version works fine, I have written a .sh wrapper for it.

    Although it seems to fail on some sites, e.g.:

    ./webkit2png.sh http://www.rbsdigital.com
    QPainter::begin: Paint device returned engine == 0, type: 3
    QPainter::renderHints: Painter must be active to set rendering hints
    [...]

    I’m using libqt4-webkit 4.4.3-2, python-qt4 4.4.2-4 on Debian 5.0.

  9. Roland Says:

    I can reproduce issue #2, although I’m very busy at the moment and will not be able to analyse it right now.

    Problem #1 works for me. Please try to modify the script near line 63 and report the results:

    self._page.mainFrame().load(QUrl(url))
    self.__loading = True
    while self.__loading:

    Keep in mind that this is Python, so don’t mix up the indentation.

  10. Roland Says:

    Update: Problem #2 occurs because the page does not report a “contentsSize”, and the reason is that the site uses a frameset. You can override the contents size with “--geometry WIDTH HEIGHT”, but this results in an empty image. As I said, I’ll have a look at this as soon as I’m not so busy anymore.

    If you want to hack this yourself: I assume that you have to define the geometry of self._page or self._page.mainFrame() at some point before the rendering.

  11. thomas Says:

    Big, big thanks for this solution. I ran into exactly the same problem and was wondering how to solve it quickly. Thanks for a good start with that.

  12. zz Says:

    shameless:p
    http://www.insecure.ws/2008/09/16/xserver-less-webpage-screenshot

  13. Roland Says:

    @zz: Kang’s post is from September 16th, mine is from December. So if there is somebody to blame for stealing code it’s me, but I swear that I never saw Kang’s script earlier :-)

  14. Paul Says:

    This works:

    __self = True
    
    class WebkitRenderer(QObject):
    
        # Initializes the QWebPage object and registers some slots
        def __init__(self):
            def __on_load_finished(result):
                __self.__on_load_finished(result)
            def __on_load_started():
                __self.__on_load_started()
    
            __self = self
            logging.debug("Initializing class %s", self.__class__.__name__)
            self._page = QWebPage()
            self.connect(self._page, SIGNAL("loadFinished(bool)"), __on_load_finished)
            self.connect(self._page, SIGNAL("loadStarted()"), __on_load_started)
    
  15. VidJa Says:

    Hi Roland,

    Thanks for this excellent piece of work. I integrated it in my Django based website. Somewhere in 2007 I had a version of khtml2png2 working, but after a switch to mod_wsgi and various server upgrades I couldn’t get it working anymore.

    I ran into some xvfb issues, however. When running a test script on the command line of my server, your script runs without error messages using --xvfb, but when I run it from the mod_wsgi environment it generates an error message: Xvfb failed to start.
    When running with --display :0.0 it works from the wsgi script, but with an error message: style cannot be used together with the GTK_Qt engine. Anyway, the last one works for me.

    (Ubuntu 9.04)

    # testscript
    import os, sys, subprocess

    options=['webkit2png.py',
    '--display', ':0.0',
    '-g', '1024', '768',
    u'http://www.dpreview.com',
    '--scale','128','92',
    '-o','dpreview.png']

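    # capture stdout and stderr; without pipes communicate() returns (None, None)
    p=subprocess.Popen(options, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    output,errors=p.communicate()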

  16. Roland Says:

    Hi VidJa,

    Thanks for this report. I’ll have a look at it later.

    Update: I think this is an issue with mod_wsgi. Sadly, xvfb-run does not provide some sort of --verbose flag. Can you run it with “strace” (by modifying webkit2png.py)?

    Maybe xvfb-run does not have the permission to write the authority-file? The man page says that this file is written to the directory defined by TMPDIR or /tmp.

    Another reason might be that the memory is limited by mod_wsgi.
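The permission theory from the comment above can be checked from inside the mod_wsgi process before xvfb-run is started. The following is a rough sketch: the function names are hypothetical, and the only fact taken from the comment is that the authority file goes into the directory named by TMPDIR, or /tmp if that is not set.

```python
import os


def xvfb_auth_dir(environ=None):
    """Directory in which xvfb-run would create its authority file."""
    environ = os.environ if environ is None else environ
    # Fall back to /tmp when TMPDIR is unset or empty.
    return environ.get("TMPDIR") or "/tmp"


def can_write_auth_file(environ=None):
    """Return True if the current user may create files in that directory."""
    directory = xvfb_auth_dir(environ)
    return os.path.isdir(directory) and os.access(directory, os.W_OK | os.X_OK)
```

Logging the result of `can_write_auth_file()` from the wsgi script would show whether the web server user is allowed to write there. It says nothing about the memory limit mentioned as the other possible cause.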

  17. Ariya Says:

    Check also similar Qt/C++ code I wrote some time ago:
    http://labs.trolltech.com/blogs/2008/11/03/thumbnail-preview-of-web-page/
    http://labs.trolltech.com/blogs/2009/01/15/capturing-web-pages/

  18. Jorge Pereira Says:

    Hi everyone,

    Regarding the issue with Xvfb staying up, it’s enough to pass “-terminate” to the server args. So, line 154 would look like:
    newArgs = ["xvfb-run", "--server-args=-terminate -screen 0, 640x480x24", sys.argv[0]]

    However, xvfb-run is already trying to kill Xvfb, so using this will trigger a warning message from xvfb-run.

    An option to skip this message would be to skip xvfb-run (it’s just a simple shell script anyway) and call Xvfb directly. As for xvfb, one of the following could be done:
    - change xvfb-run to use -terminate instead of issuing a kill (recommended?)
    - change xvfb-run to use kill -9

    Regards,

  19. Anonymous Says:

    For those of you who might be getting the error:
    “QPainter::begin: Paint device returned engine == 0, type: 3”

    There are a couple possible reasons:
    - The page is greater than 32,768 pixels (2^15 px) in any dimension (http://doc.trolltech.com/4.5/qpainter.html#limitations)
    - The page is framed and messing with the image dimensions.

    Hope this saves someone a massive headache.
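Both causes named in the comment above can be caught before the QImage is created by checking the size first. This is only a sketch: the function name `can_paint` is made up, the limit is the 2^15 figure quoted in the comment rather than a value checked against Qt, and the empty-size case corresponds to the frameset problem described in comment 10.

```python
# Limit quoted in the comment above (2^15 pixels per dimension).
MAX_DIMENSION = 2 ** 15


def can_paint(width, height, limit=MAX_DIMENSION):
    """Return True if an image of this size should be paintable."""
    # A frameset page may report an empty contents size.
    if width <= 0 or height <= 0:
        return False
    # The comment reports failures for pages greater than the limit.
    return width <= limit and height <= limit
```

In the tutorial script the values would come from `webpage.mainFrame().contentsSize()`, for example `can_paint(size.width(), size.height())`, so that the script can print a readable error instead of the QPainter warning.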

  20. Rob Sanderson Says:

    Is there an easy way to fire this multiple times from a single script? For example, a crawler that takes snapshots of all of the pages that it visits? Other than the obvious commands.getoutput() of course :)

    Many thanks!

  21. Adam Nelson Says:

    Roland,

    Would you consider getting this script onto PyPI as well as GitHub, BitBucket, or Google Code?

    It’s the best script I’ve come across for this job and it would be great to see it built out by the community. If you don’t want to, do you mind if I do? I’d like to use this in a few places, and if it were available from PyPI it would be great.

    Cheers,
    Adam

  22. Roland Says:

    At the moment I’m still too busy to package this for PyPI myself, but I don’t mind if you do so!

  23. Marc Says:

    This script ROCKS!

    I got this working finally and it renders great. Wish I could make it faster. I had this working on a Mac before and it was quite fast. Now running on Linux (yea!)…

    anyway, I can’t get Flash to render. Any ideas? I am pretty certain flash is installed on the server, but maybe need to put it somewhere.

  24. Charlie Clark Says:

    I’m caught between this and simply calling websnap or CutyCapt as a subprocess. Anyone struggling with xvfb-run might try adding -f to the command list, as this stops xvfb complaining that it can’t start the server.

  25. Cole Says:

    I made some modifications to your script and thought I would share: http://pastie.org/609626
    And the diff: http://pastie.org/609631

    Added a simple networkAccessManager to handle bad ssl certificates (we use self-signed certs on some pages I wanted to thumbnail). It could easily be extended to do something more intelligent, but it works for us.

    Added another option for aspect ratio: crop. This renders the full page the same as expand, then crops to the desired size. This gives better results for short pages like google than setting the browser size and using ignore aspect ratio.

    If anyone knows how to do a higher quality resize in QT I would be interested to hear. It seems to be doing simple linear interpolation which gives very poor results especially for text.

  26. Bob Says:

    Does anyone know how to get this to display Flash plugins?

    I’ve tried enabling plugins in the script and also using Adobe Flash 32bit and 64bit or swfdec and gnash. None of them seem to work.

  27. Adam Nelson Says:

    As per Roland’s comment, I moved this to a public repository so people can collaborate on this.

    http://github.com/AdamN/python-webkit2png

    This includes Cole’s modifications.

    Feel free to make updates, fork, etc…

  28. Luay Says:

    Good Day everyone,

    thank you for your effort, the idea looks really nice.

    i will start a website soon, in which i need a snapshot functionality, so i landed in this page.

    my website as i got from the host, will be hosted on linux and supports python,

    MY PROBLEM :) is that i come from a windows background with experience in ASP, and a little bit of PHP (which i will use for the website).

    questions are:
    what are the prerequisites to use your project on a linux host with python and php support (i read things about Qt but i don’t know what it is).

    and the second question is: are there some steps for how to set this up on the host and use it from within PHP.

    thank you very much and accept my best regards,
    Luay

  29. Roland Says:

    Hi Luay,

    Besides Python, you should have installed the “webkit” library of the Qt package and the PyQt4 package for Python. Beyond that you’ll need an X11 server; “Xvfb” should be sufficient for a headless machine.

    I suggest using your distribution’s package management to install these dependencies. If you tell me what distribution you are using, I might be able to tell you the package names.

    Qt is a library for GUI programming which comes with its own HTML rendering engine, WebKit. Please have a look at Wikipedia for further information.

    Good luck!
    Roland

  30. Luay Says:

    Hello Roland,

    thank you very much for the response, do you know the name of a host which supports such packages?

    i asked the host i was going to host with, and they have absolutely no idea :)

    Thank you,
    Luay

  31. Roland Says:

    Oh ok, I assumed you were running your own server. Sorry, I don’t think I can help you with that question.

  32. Luay Says:

    Nevertheless, thank you very much

  33. Ruby On Rails Entwicklung Says:

    I just released a ruby-package to generate thumbshots using your script:
    http://github.com/digineo/thumbshooter

  34. Adam Nelson Says:

    @Luay http://webfaction.com has great support for Python stuff – you could try them.

  35. Ben Standefer Says:

    Roland,

    I am having the exact same issue as Hubert. Looks like something with the Debian install of Qt4 makes the simple script work, but webkit2png.py reports “Failed to load” messages on all pages. I debugged for about 2 hours, but I am no Qt expert, and I only got “Failed to load” messages, indefinite hanging, or blank renders.

    I documented on the github repo:
    http://github.com/AdamN/python-webkit2png/issues/#issue/2

    Nice work though, looks excellent!

    -Ben Standefer

  36. Loic Says:

    I got the same problem that Hubert reported in March:
    # ./webkit2png.py -x -o test.png --debug http://news.bbc.co.uk
    DEBUG:root:Executing xvfb-run --server-args=-screen 0, 640x480x24 ./webkit2png.py -o test.png --debug http://news.bbc.co.uk
    DEBUG:root:Initializing class WebkitRenderer
    DEBUG:root:render(http://news.bbc.co.uk, timeout=0)
    DEBUG:root:Processing result
    ERROR:root:Failed to load http://news.bbc.co.uk

    script version is from github.
    my python version is : Python 2.5.2

    The webkit2png-simple script works fine.
    And I think i nailed the problem down to the callbacks not being called back….

    __on_load_started is never called, it seems…

    if i change
    - self.connect(self._page, SIGNAL("loadStarted()"), self.__on_load_started)
    + self.connect(self._page, SIGNAL("loadStarted()"), onLoadStarted)
    with :
    +def onLoadStarted():
    + print "load started"

    I get a nice log :
    DEBUG:root:Initializing class WebkitRenderer
    DEBUG:root:render(http://www.google.com, timeout=20)
    load started
    ERROR:root:Request timed out

    So, is it because my python is too old?
    do object method callbacks work?

  37. Roland Says:

    Strange, two people reporting the same issue. Ben, what Python and Qt versions are you using?

  38. asdfa Says:

    hi,

    thank you. i’m using the same approach. but i want the captured picture be exactly the size of the web page. if i use your approach, the screen shot will be the size of the web frame, and i often see the scroll bar, because the frame is smaller than the web page.

    so how can i make a screen shot of the entire web page?

    thanks.

  39. Roland Says:

    I think this might be a problem with the size of the “virtual desktop”. Maybe I have a chance to spend more time with this script in the near future.
