Create screenshots of a web page using Python and QtWebKit
December 3, 2008
Update 2009-10-03:
For further development and improvements, contact me or have a look at this public github repository created by Adam Nelson.
From time to time you may want to create a screenshot of a web page from the command line, for example to create thumbnails for your web application. So you might search for such a program and find tools like webkit2png, which is for Mac OS X only, or khtml2png, which requires a lot of KDE stuff to be installed on your server.
But since Qt Software, formerly known as Trolltech, integrated Safari’s famous rendering engine WebKit (which is based on Konqueror’s khtml engine) into its framework, we are now able to make use of it with the help of some Python and PyQt4.
If you are in a hurry, click here to get a full-featured version of webkit2png.py.
I assume that you have some basic knowledge of Python. If you run into problems with the Qt part of this tutorial, I suggest having a look at the class documentation first. Please note that Qt is a C++ framework, and most of the example code in its documentation has not been ported to Python. So it might be helpful if you have some basic knowledge of C++, too.
Requirements: Webkit and PyQt4 (packages libqt4-webkit and python-qt4 when you’re using Intrepid Ibex).
So, run your favourite editor (vim, of course) and start to enter some python code. First, we will have to organize some imports:
#!/usr/bin/env python
import sys     # required to exit this program
import signal  # required to catch CTRL-C (I'll explain this later)

# Some of the PyQt libs
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *
Qt is heavily event-based (the mechanism is called “signals” and “slots”), so we have to prepare a “slot” which gets called when the page has been loaded completely:
def onLoadFinished(result):
    print "loadFinished(%s)" % str(result)
    sys.exit(0) # this is the moment when we have to quit normally
Even if we intend to write a CLI based application, QtWebkit requires a GUI in the background. This is why we have to use QApplication instead of QCoreApplication. And because we will not have any visible controls, we should ensure that we can still quit this application using CTRL-C (this is why we have to import signal):
app = QApplication(sys.argv)
signal.signal(signal.SIGINT, signal.SIG_DFL)
Now we can create a QWebPage-Object without any exception or segmentation fault. Connect it with our “onLoadFinished”-slot and load the url you want to make a screenshot of (here I’m using Google):
webpage = QWebPage()
webpage.connect(webpage, SIGNAL("loadFinished(bool)"), onLoadFinished)
webpage.mainFrame().load(QUrl("http://www.google.com"))
If you run this application now, you’ll see… nothing. onLoadFinished might be called, but the result will be “False”. This is because Qt is so extremely event-based, and there is still no main loop to handle these events. So finally you have to start your QApplication:
sys.exit(app.exec_())
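To illustrate why nothing happens before the main loop runs, here is a toy model in plain Python. It is not how Qt is implemented: `ToyEventLoop` and `ToySignal` are made-up names for this sketch, and it simply assumes that emitted signals are queued until a loop dispatches them.

```python
from collections import deque


class ToyEventLoop:
    """Hypothetical stand-in for an event loop; not Qt's implementation."""

    def __init__(self):
        self._queue = deque()

    def post(self, callback, *args):
        # Queue the call instead of running it right away.
        self._queue.append((callback, args))

    def run(self):
        # Dispatch queued events until none are left.
        while self._queue:
            callback, args = self._queue.popleft()
            callback(*args)


class ToySignal:
    """Hypothetical signal: connected slots are invoked through the loop."""

    def __init__(self, loop):
        self._loop = loop
        self._slots = []

    def connect(self, slot):
        self._slots.append(slot)

    def emit(self, *args):
        for slot in self._slots:
            self._loop.post(slot, *args)


results = []
loop = ToyEventLoop()
load_finished = ToySignal(loop)
load_finished.connect(results.append)  # results.append plays the "slot"

load_finished.emit(True)
pending_before_run = list(results)  # snapshot taken before the loop has run
loop.run()                          # only now is the slot actually called
```

Back in the real script, `app.exec_()` plays the role of `loop.run()`.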
If you execute this now, the output should be:
loadFinished(True)
Good, the page is loaded! The next step is to render it into a file by expanding “onLoadFinished” (this means: all the code from now on has to be INSIDE of “onLoadFinished”). First, we should ensure that we do not proceed if we got an error:
def onLoadFinished(result):
    print "loadFinished(%s)" % str(result)
    if not result:
        print "Request failed"
        sys.exit(1)
Otherwise, we should enlarge the viewport (that is our virtual browser window) to the desired size. If you want to create a picture of the whole page, you should use the “preferred” size of the contents:
        print "Request failed"
        sys.exit(1)
    # Set the size of the (virtual) browser window
    webpage.setViewportSize(webpage.mainFrame().contentsSize())
And finally, render this into a QImage object and store it in a file:
    # Set the size of the (virtual) browser window
    webpage.setViewportSize(webpage.mainFrame().contentsSize())
    # Paint this frame into an image
    image = QImage(webpage.viewportSize(), QImage.Format_ARGB32)
    painter = QPainter(image)
    webpage.mainFrame().render(painter)
    painter.end()
    image.save("output.png")
    sys.exit(0) # quit this application
Done. Pretty easy, isn’t it? Oh, wait! QWebPage depends on QtGui, and QtGui depends on a running X server (at least on Unix systems). So how can we make use of this on a headless server machine? The answer is Xvfb, a framebuffer-based X server, originally designed for testing purposes. Of course, it requires some X libs and fonts, too (how should a page be rendered without any fonts?), but it has far less overhead than the real X.Org server and doesn’t need to be running all the time. Just call the script this way:
$ xvfb-run --server-args="-screen 0, 640x480x24" python webkit2png-simple.py
The screen size doesn’t matter, but the color depth of 24 bit is important. Otherwise, the resulting screenshot would be limited to 256 colors. For more options, have a look at the man pages of ‘Xvfb’ and ‘xvfb-run’.
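If you want to start the script from another Python program instead of a shell, the same call can be sketched with the subprocess module. `xvfb_run_command` and `run_under_xvfb` are hypothetical helper names, the defaults mirror the command line above, and the screen specification is written without the comma after the screen number, which is the form the Xvfb man page documents (`-screen screennum WxHxD`).

```python
import subprocess


def xvfb_run_command(script, width=640, height=480, depth=24):
    """Build an xvfb-run argument list for the given script (hypothetical helper)."""
    server_args = "-screen 0 %dx%dx%d" % (width, height, depth)
    # Passing a list avoids shell quoting problems with the spaces in server_args.
    return ["xvfb-run", "--server-args=" + server_args, "python", script]


def run_under_xvfb(script):
    """Run the script under xvfb-run and return its exit code (needs Xvfb installed)."""
    return subprocess.call(xvfb_run_command(script))


command = xvfb_run_command("webkit2png-simple.py")
```

Only the color depth matters for the result, so `depth` is the one default worth keeping at 24.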
Last, but not least, I’ll provide you with two versions of this script. webkit2png-simple.py is exactly the result of this tutorial, while webkit2png.py is a much improved version with command line arguments, written in OOP style.
Update 2009-04-01
Here’s another guy who had the same idea earlier than me.
December 9th, 2008 at 21:20
[...] wrote. The topics are, as always, very varied: speculation about the university blogs, a search for election helpers, screenshot automation with Python and Qt’s WebKit (very handy!), the 2009 university elections, nVidia and the Intrepid Ibex and [...]
February 13th, 2009 at 08:22
[...] when I found the work of Roland Tapken. His script and explanation were the solution I needed. It made nice screenshots, had the [...]
February 18th, 2009 at 04:24
Roland, I am having trouble using this script. All of the screenshots turn out fine, but it seems like the Xvfb servers are not killed or exited properly. So, as I create screenshots, Xvfb processes are left behind every time the script runs. Do you have any thoughts why this might be happening?
February 18th, 2009 at 11:01
This sounds strange because the application exits by itself immediately after the image is written to disk. It might be a problem with xvfb-run.
I’ve read in your blog that you modified the script. This should not be necessary when you make the file executable and run it like this:
./webkit2png.py --xvfb [...]
Please try this and tell me if it helps. If not, we can try to change the code so that it starts Xvfb by itself and kills the process before exit.
February 18th, 2009 at 17:13
Thanks for taking time to look at it Roland. The part I changed is the part that handles starting Xvfb. This is what I have:
if options.xvfb:
    # Start 'xvfb' instance by replacing the current process
    newArgs = ["xvfb-run", "-a", "--server-args=-screen 0 1024x768x24", "python"]
    for i in range(0, len(sys.argv)):
        if sys.argv[i] not in ["-x", "--xvfb"]:
            newArgs.append(sys.argv[i])
    logging.debug("Executing %s" % " ".join(newArgs))
    os.execvp(newArgs[0], newArgs)
    raise RuntimeError("Failed to execute '%s'" % newArgs[0])
I’ve added the “-a” and the “python” arguments to xvfb-run. If I don’t have “-a,” it will fail to start xvfb because it’s already running with that server id. If I don’t have “python” the command passed through xvfb-run is incorrect.
I have tried it with your original script with no changes and Xvfb still doesn’t die. Perhaps, it’s some environment problem? I noticed that I have to use kill -9 to get the Xvfb process to die. A simple kill won’t work.
February 18th, 2009 at 19:53
Alex, I’ll reply to you by mail. If we find a solution, I’ll update the article.
February 20th, 2009 at 13:54
The bug described by Alex seems to be reported here:
https://bugs.launchpad.net/ubuntu/+source/xorg-server/+bug/294454
This has to be fixed by the Ubuntu people. However, we’re testing a workaround that tries to determine the PID of the active xvfb-run instance at the end of the script and then kills it with signal 9:
m = re.match(r".*xvfb-run\.(\d+).*", os.environ['XAUTHORITY'])
if m:
    os.kill(int(m.group(1)), 9)
This code has to be injected near line 203, just before “sys.exit(0)”. It requires you to import the module “re” (regular expressions) at the beginning of the script.
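The PID lookup in that workaround can be tried out on its own. The sketch below wraps the same regular expression in a hypothetical helper, `xvfb_run_pid`, and only reports the number instead of killing anything; the example path is made up and assumes the authority file lives in a directory named `xvfb-run.<PID>`, which is what the workaround relies on.

```python
import re


def xvfb_run_pid(xauthority):
    """Return the PID embedded in an xvfb-run authority path, or None."""
    match = re.match(r".*xvfb-run\.(\d+).*", xauthority or "")
    return int(match.group(1)) if match else None


# Made-up example value; in the script it would come from os.environ.get("XAUTHORITY").
example_pid = xvfb_run_pid("/tmp/xvfb-run.12345/Xauthority")
```

Returning `None` when nothing matches keeps the caller from killing an unrelated process if the script was started without xvfb-run.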
I will not add this to webkit2png.py as I’m really convinced that this bug has to be fixed in Ubuntu’s xvfb package.
March 11th, 2009 at 02:06
Hi,
webkit2png.py always fails for me with “failed to load”:
# ./webkit2png.py -x -o test.png --debug http://news.bbc.co.uk
DEBUG:root:Executing xvfb-run --server-args=-screen 0, 640x480x24 ./webkit2png.py -o test.png --debug http://news.bbc.co.uk
DEBUG:root:Initializing class WebkitRenderer
DEBUG:root:render(http://news.bbc.co.uk, timeout=0)
DEBUG:root:Processing result
ERROR:root:Failed to load http://news.bbc.co.uk
The simple version works fine, I have written a .sh wrapper for it.
Although it seems to fail on some sites, e.g.:
./webkit2png.sh http://www.rbsdigital.com
QPainter::begin: Paint device returned engine == 0, type: 3
QPainter::renderHints: Painter must be active to set rendering hints
[...]
I’m using libqt4-webkit 4.4.3-2, python-qt4 4.4.2-4 on Debian 5.0.
March 12th, 2009 at 11:59
I can reproduce issue #2, although I’m very busy at the moment and will not be able to analyse it right now.
Problem #1 works for me. Please try to modify the script near line 63 and report the results:
self._page.mainFrame().load(QUrl(url))
self.__loading = True
while self.__loading:
Keep in mind that this is Python, so don’t mix up the indentation.
March 12th, 2009 at 12:12
Update: Problem #2 occurs because the page does not report a “contentsSize”, and the reason is that the site uses a frameset. You can override the contentsSize with “--geometry WIDTH HEIGHT”, but this results in an empty image. As I said, I’ll have a look at this as soon as I’m not so busy anymore.
If you want to hack this yourself: I assume that you have to define the geometry of self._page or self._page.mainFrame() at some point before the rendering.
March 22nd, 2009 at 10:35
Big, big thanks for this solution. I ran into exactly the same problem and was wondering how to solve it quickly. Thanks for a good start with that.
April 1st, 2009 at 12:36
shameless:p
http://www.insecure.ws/2008/09/16/xserver-less-webpage-screenshot
April 1st, 2009 at 13:22
@zz: Kang’s post is from September 16th, mine is from December. So if there is somebody to blame for stealing code it’s me, but I swear that I never saw Kang’s script earlier
April 19th, 2009 at 12:44
This works:
__self = True

class WebkitRenderer(QObject):
    # Initializes the QWebPage object and registers some slots
    def __init__(self):
        def __on_load_finished(result):
            __self.__on_load_finished(result)
        def __on_load_started():
            __self.__on_load_started()
        __self = self
        logging.debug("Initializing class %s", self.__class__.__name__)
        self._page = QWebPage()
        self.connect(self._page, SIGNAL("loadFinished(bool)"), __on_load_finished)
        self.connect(self._page, SIGNAL("loadStarted()"), __on_load_started)
May 4th, 2009 at 10:24
Hi Roland,
Thanks for this excellent piece of work. I integrated it in my Django based website. Somewhere in 2007 I had a version of khtml2png2 working, but after a switch to mod_wsgi and various server upgrades I couldn’t get it working anymore.
I ran into some xvfb issues, however. When running a test script on the command line of my server, your script runs without error messages using --xvfb, but when I run it from the mod_wsgi environment it generates an error message: Xvfb failed to start.
When running with --display :0.0 it works from the wsgi script, but with an error message: style cannot be used together with the GTK_Qt engine. Anyway, the last one works for me.
(Ubuntu 9.04)
# testscript
import os, sys, subprocess
options=['webkit2png.py',
'--display', ':0.0',
'-g', '1024', '768',
u'http://www.dpreview.com',
'--scale','128','92',
'-o','dpreview.png']
p=subprocess.Popen(options,0)
output,errors=p.communicate()
May 4th, 2009 at 11:39
Hi VidJa,
Thanks for this report. I’ll have a look at it later.
Update: I think this is an issue of mod_wsgi. Sadly, xvfb-run does not provide some sort of --verbose flag. Can you run it with “strace” (by modifying webkit2png.py)?
Maybe xvfb-run does not have the permission to write the authority-file? The man page says that this file is written to the directory defined by TMPDIR or /tmp.
Another reason might be that the memory is limited by mod_wsgi.
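The permission theory can be checked quickly from inside the mod_wsgi process. This is a minimal sketch, assuming (as the man page says) that the authority file goes to the directory named by TMPDIR, or to /tmp when TMPDIR is unset; `authority_dir_status` is a hypothetical helper name.

```python
import os


def authority_dir_status(environ=None):
    """Return (directory, writable) for the directory xvfb-run would write to."""
    environ = os.environ if environ is None else environ
    directory = environ.get("TMPDIR") or "/tmp"
    writable = os.path.isdir(directory) and os.access(directory, os.W_OK)
    return directory, writable


directory, writable = authority_dir_status()
```

The result is only meaningful when the check runs as the same user and with the same environment as the web application.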
June 8th, 2009 at 08:36
Check also similar Qt/C++ code I wrote some time ago:
http://labs.trolltech.com/blogs/2008/11/03/thumbnail-preview-of-web-page/
http://labs.trolltech.com/blogs/2009/01/15/capturing-web-pages/
June 17th, 2009 at 13:01
Hi everyone,
Regarding the issue with Xvfb staying up, it’s enough to pass “-terminate” to the server args. So, line 154 would look like:
newArgs = ["xvfb-run", "--server-args=-terminate -screen 0, 640x480x24", sys.argv[0]]
However, xvfb-run is already trying to kill Xvfb, so using this will trigger a warning message from xvfb-run.
An option to skip this message would be to skip xvfb-run (it’s just a simple shell script anyway) and call Xvfb directly. As for xvfb, one of the following could be done:
- change xvfb-run to use -terminate instead of issuing a kill (recommended?)
- change xvfb-run to use kill -9
Regards,
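The option of skipping xvfb-run and calling Xvfb directly could look roughly like the sketch below. The display number `:99` and both helper names are assumptions made for illustration. The sketch also leaves out two things xvfb-run normally takes care of: it sets up no authority file, and it does not wait until the server accepts connections, so a real script would need a short wait or retry before starting the client.

```python
import os
import subprocess


def xvfb_command(display=":99", width=640, height=480, depth=24):
    """Build the argument list for starting Xvfb directly (hypothetical helper)."""
    screen = "%dx%dx%d" % (width, height, depth)
    return ["Xvfb", display, "-terminate", "-screen", "0", screen]


def run_with_xvfb(command, display=":99"):
    """Start Xvfb, run the command against it, then stop the server."""
    server = subprocess.Popen(xvfb_command(display))
    try:
        # Point the client at the freshly started server.
        env = dict(os.environ, DISPLAY=display)
        return subprocess.call(command, env=env)
    finally:
        # -terminate should already end the server once the client
        # disconnects; terminating it explicitly is a safety net.
        server.terminate()
        server.wait()
```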
July 20th, 2009 at 06:55
For those of you who might be getting the error:
“QPainter::begin: Paint device returned engine == 0, type: 3”
There are a couple possible reasons:
- The page is greater than 32,768 pixels (2^15 px) in any dimension (http://doc.trolltech.com/4.5/qpainter.html#limitations)
- The page is framed and messing with the image dimensions.
Hope this saves someone a massive headache.
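A cheap guard before painting can catch both cases. The sketch below is plain Python with a hypothetical helper name. It assumes the 2^15 limit quoted above, and that a framed page shows up as an empty (zero) contents size, as described earlier in this thread.

```python
def renderable_size(width, height, limit=2 ** 15):
    """Return True if both dimensions are positive and within the limit."""
    return 0 < width <= limit and 0 < height <= limit


# Made-up sizes for illustration.
ok = renderable_size(1024, 768)
empty_frameset = renderable_size(0, 0)
too_tall = renderable_size(1024, 40000)
```

In the script, the width and height would come from the contents size of the main frame, checked before the QImage is created.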
July 20th, 2009 at 17:27
Is there an easy way to fire this multiple times from a single script? For example, a crawler that takes snapshots of all of the pages that it visits? Other than the obvious commands.getoutput() of course
Many thanks!
August 6th, 2009 at 20:41
Roland,
Would you consider getting this script onto PyPI as well as GitHub, BitBucket, or Google Code?
It’s the best script I’ve come across for this job and it would be great to see it built out by the community. If you don’t want to do you mind if I do? I’d like to use this in a few places and if it were available from PyPI it would be great.
Cheers,
Adam
August 10th, 2009 at 15:15
At the moment I’m still too busy to package this for PyPI myself, but I don’t mind if you do so!
August 18th, 2009 at 03:50
This script ROCKS!
I got this working finally and it renders great. Wish I could make it faster. I had this working on a Mac before and it was quite fast. Now running on Linux (yea!)…
anyway, I can’t get Flash to render. Any ideas? I am pretty certain flash is installed on the server, but maybe need to put it somewhere.
September 3rd, 2009 at 16:50
I’m caught between this and simply calling websnap or CutyCapt as a subprocess. Anyone struggling with xvfb-run might try adding -f to the command list as this stops xvfb complaining it can’t start the server.
September 8th, 2009 at 16:33
I made some modifications to your script and thought I would share: http://pastie.org/609626
And the diff: http://pastie.org/609631
Added a simple networkAccessManager to handle bad ssl certificates (we use self-signed certs on some pages I wanted to thumbnail). It could easily be extended to do something more intelligent, but it works for us.
Added another option for aspect ratio: crop. This renders the full page the same as expand, then crops to the desired size. This gives better results for short pages like google than setting the browser size and using ignore aspect ratio.
If anyone knows how to do a higher quality resize in QT I would be interested to hear. It seems to be doing simple linear interpolation which gives very poor results especially for text.
September 12th, 2009 at 03:20
Does anyone know how to get this to display Flash plugins?
I’ve tried enabling plugins in the script and also using Adobe Flash 32bit and 64bit or swfdec and gnash. None of them seem to work.
September 18th, 2009 at 01:14
As per Roland’s comment, I moved this to a public repository so people can collaborate on this.
http://github.com/AdamN/python-webkit2png
This includes Cole’s modifications.
Feel free to make updates, fork, etc…
September 22nd, 2009 at 21:30
Good Day everyone,
thank you for your effort, the idea looks really nice.
I will start a website soon in which I need a snapshot functionality, so I landed on this page.
My website, as I understand from the host, will be hosted on Linux and supports Python.
MY PROBLEM
is that I come from a Windows background with experience in ASP and a little bit of PHP (which I will use for the website).
Questions are:
What are the prerequisites for using your project on a Linux host with Python and PHP support? (I read things about Qt, but I don’t know what it is.)
And the second question: are there some steps describing how to set this up on the host and use it from within PHP?
thank you very much and accept my best regards,
Luay
September 24th, 2009 at 12:18
Hi Luay,
Besides Python, you should have installed the “webkit” library of the Qt package and the PyQt4 package for Python. Beyond that you’ll need an X11 server; “Xvfb” should be sufficient for a headless machine.
I suggest using your distribution’s package management to install these dependencies. If you tell me what distribution you are using, I might be able to tell you the package names.
Qt is a library for GUI programming which comes with its own HTML rendering engine, WebKit. Please have a look at Wikipedia for further information.
Good luck!
Roland
September 25th, 2009 at 19:38
Hello Roland,
thank you very much for the response. Do you know a host which supports such packages?
I asked the host I plan to use, and they have absolutely no idea.
Thank you,
Luay
September 26th, 2009 at 09:09
Oh ok, I assumed you were running your own server. Sorry, I don’t think I can help you in that question.
September 26th, 2009 at 14:31
Nevertheless, thank you very much
September 26th, 2009 at 20:23
I just released a ruby-package to generate thumbshots using your script:
http://github.com/digineo/thumbshooter
September 30th, 2009 at 00:40
@Luay http://webfaction.com has great support for Python stuff – you could try them.
October 6th, 2009 at 02:23
Roland,
I am having the exact same issue as Hubert. It looks like something in the Debian install of Qt4 lets the simple script work, but webkit2png.py reports “Failed to load” messages on all pages. I debugged for about 2 hours, but I am not a Qt expert, and I only got “Failed to load” messages, indefinite hanging, or blank renders.
I documented on the github repo:
http://github.com/AdamN/python-webkit2png/issues/#issue/2
Nice work though, looks excellent!
-Ben Standefer
October 6th, 2009 at 12:08
I got the same problem that Hubert reported in March:
# ./webkit2png.py -x -o test.png --debug http://news.bbc.co.uk
DEBUG:root:Executing xvfb-run --server-args=-screen 0, 640x480x24 ./webkit2png.py -o test.png --debug http://news.bbc.co.uk
DEBUG:root:Initializing class WebkitRenderer
DEBUG:root:render(http://news.bbc.co.uk, timeout=0)
DEBUG:root:Processing result
ERROR:root:Failed to load http://news.bbc.co.uk
script version is from github.
my python version is : Python 2.5.2
The webkit2png-simple script works fine.
And I think i nailed the problem down to the callbacks not being called back….
__on_load_started is never called, it seems…
if i change
- self.connect(self._page, SIGNAL("loadStarted()"), self.__on_load_started)
+ self.connect(self._page, SIGNAL("loadStarted()"), onLoadStarted)
with :
+def onLoadStarted():
+    print "load started"
I get a nice log :
DEBUG:root:Initializing class WebkitRenderer
DEBUG:root:render(http://www.google.com, timeout=20)
load started
ERROR:root:Request timed out
So, is it because my python is too old ?
does object method-callbacks works ?
October 7th, 2009 at 17:47
Strange, two people reporting the same issue. Ben, what Python and Qt versions are you using?
November 10th, 2009 at 23:09
hi,
thank you. i’m using the same approach. but i want the captured picture be exactly the size of the web page. if i use your approach, the screen shot will be the size of the web frame, and i often see the scroll bar, because the frame is smaller than the web page.
so how can i make a screen shot of the entire web page?
thanks.
November 11th, 2009 at 20:38
I think this might be a problem with the size of the “virtual desktop”. Maybe I have a chance to spend more time with this script in the near future.