Yann "Bug" Dubois


Use Xvfb, Selenium and Chrome to drive a web browser in PHP

23 August 2012 By: Yann Dubois Category: English, tech

PHP can remote-control an actual web browser running on a server, which allows server-side scripting of complex web interactions involving, for example, JavaScript or Flash-enabled content. This is a very powerful setup, but it is tricky to put in place. In addition, a proxy such as Squid can log all HTTP interactions and serve as an in-depth traffic analysis or debugging tool. This environment can be used for automated web debugging, continuous integration, unit-test-based web development, non-regression testing, load tests, automated screenshots, etc. Here is a quick setup memo for a running environment under Debian Linux (headless server).

Set up a virtual graphic display with Xvfb

The X virtual framebuffer (Xvfb) is an X11 server that performs all graphical operations in memory, without the need for any actual screen output (headless). It can be simply installed with the Debian package:

sudo apt-get install xvfb

Launch the virtual display with these commands:

Xvfb :1 -screen 5 1024x768x8 &
export DISPLAY=:1.5

This site gives a good example of a script to automate Xvfb startup and shutdowns.
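As a rough illustration of what such a startup/shutdown script might look like, here is a minimal sketch. The function names, the PID file location and the environment variable names are assumptions made for this example; only the Xvfb command line itself comes from the commands above.

```shell
#!/bin/sh
# Hypothetical helper: start and stop Xvfb, tracking its PID in a file.
XVFB_DISPLAY="${XVFB_DISPLAY:-1}"              # X display number (":1" above)
XVFB_SCREEN="${XVFB_SCREEN:-5}"                # screen number ("-screen 5" above)
XVFB_GEOMETRY="${XVFB_GEOMETRY:-1024x768x8}"   # width x height x depth
XVFB_PIDFILE="${XVFB_PIDFILE:-/tmp/xvfb-$XVFB_DISPLAY.pid}"  # assumed location

# Print the DISPLAY value for a display and screen number, e.g. ":1.5".
xvfb_display_name() {
    printf ':%s.%s\n' "$1" "$2"
}

# Start Xvfb in the background, remember its PID and export DISPLAY.
xvfb_start() {
    Xvfb ":$XVFB_DISPLAY" -screen "$XVFB_SCREEN" "$XVFB_GEOMETRY" &
    echo $! > "$XVFB_PIDFILE"
    DISPLAY="$(xvfb_display_name "$XVFB_DISPLAY" "$XVFB_SCREEN")"
    export DISPLAY
}

# Stop the Xvfb instance recorded in the PID file, if there is one.
xvfb_stop() {
    [ -f "$XVFB_PIDFILE" ] || return 0
    kill "$(cat "$XVFB_PIDFILE")" 2>/dev/null || true
    rm -f "$XVFB_PIDFILE"
}
```

The file is meant to be sourced (`. ./xvfb-helper.sh`, a hypothetical file name) rather than executed, so that the exported DISPLAY stays visible to the browser and the Selenium server launched from the same shell.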

Install the Google Chrome Browser for Linux

Go to this page and download the .deb package distribution for your flavor of Linux.

Install it with this command:

sudo dpkg -i google-chrome-stable_current_amd64.deb

If the automatic installation fails because of dependency problems, one simple solution is to install the Chromium browser first (see below), which will pull in most if not all of the necessary libraries. (On a bare system, I also had to install libgconf2-4 manually, which posed no problem with apt-get install libgconf2-4.)

The browser can then be run from the command line with this command:

google-chrome

Never run it as root!

You can stop it with Ctrl+C.

Alternatively, the Chromium web browser can be installed through an official Debian Linux distribution package like this:

sudo aptitude update
sudo aptitude install chromium-browser

However, the stable Chromium version packaged by the distribution tends to be older than the most recent Google Chrome release, and it might not be compatible with the latest Chromedriver, which is needed for setting up browser automation. You can check your version of Chromium on the command line like this:

chromium-browser --product-version

The Xvfb server will have to be running with an exported DISPLAY in order for google-chrome or chromium-browser to run.

The Firefox browser can also be used, without the need for an additional driver. It is usually a bit slower than Chrome with its driver, but some features are better supported in the latest versions of Firefox (e.g. file uploading).

Install Java

If it is not already installed on your system, you will need a Java virtual machine to run the Selenium server. Install it like this:

sudo apt-get install default-jdk

Install the Google Chromedriver

Go to this page and download the appropriate Chromedriver binary.

You will have to save the binary file in a directory that is part of your $PATH environment variable, or reconfigure $PATH so that the file can be found. Alternatively, you can create a symbolic link to the file in your /usr/local/bin directory so that it can be easily found (using ln -s /path/to/chromedriver /usr/local/bin/chromedriver).

Check that you can run it from the command line wherever you are in your file hierarchy.

chromedriver
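The PATH check and the symbolic link can be wrapped in two small shell functions. This is only a sketch: the function names are made up for the example, and /path/to/chromedriver remains a placeholder for wherever you saved the binary.

```shell
# Succeed if directory $1 is an entry of the colon-separated list $2
# (which defaults to the current $PATH).
dir_on_path() {
    case ":${2:-$PATH}:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Link the downloaded binary into /usr/local/bin unless a chromedriver
# is already reachable through $PATH.
install_chromedriver_link() {
    src="$1"
    if command -v chromedriver >/dev/null 2>&1; then
        echo "chromedriver already found at $(command -v chromedriver)"
    else
        sudo ln -s "$src" /usr/local/bin/chromedriver
    fi
}
```

For instance, `dir_on_path /usr/local/bin || echo "/usr/local/bin is not in PATH"` tells you whether the link would be picked up at all, and `install_chromedriver_link /path/to/chromedriver` creates it only when needed.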

Install Selenium Server

Go to this page and download the latest version of the Selenium Server (formerly known as the Selenium RC Server). It is a Java .jar package.

You can run it with this command:

java -jar /path/to/your/selenium-server-standalone-v.v.v.jar

To enable logging to a specific file, add the -log command-line option like this:

java -jar /path/to/your/selenium-server-standalone-v.v.v.jar -log /path/to/logfile.log

Once the Selenium server is running, you can then monitor its logs in real time with this command:

tail -f /path/to/logfile.log

In addition to detailed info that is output to the main terminal, this will give you plenty of debugging information.
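When the server is launched from a script, it is handy to wait until it actually accepts connections before sending it commands. The sketch below retries a check once per second. The function names are invented for this example, the paths are the same placeholders as above, and the /wd/hub/status URL is an assumption about the Selenium 2 server that you should verify against your version.

```shell
# Retry a command once per second until it succeeds, giving up
# after $1 attempts. Returns 0 on success, 1 on timeout.
wait_until() {
    attempts="$1"; shift
    while [ "$attempts" -gt 0 ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        attempts=$((attempts - 1))
        sleep 1
    done
    return 1
}

# Placeholder paths, as in the commands above.
SELENIUM_JAR="/path/to/your/selenium-server-standalone-v.v.v.jar"
SELENIUM_LOG="/path/to/logfile.log"

# Start the server in the background, then wait up to 30 seconds for it.
start_selenium() {
    java -jar "$SELENIUM_JAR" -log "$SELENIUM_LOG" &
    # Assumption: the server answers on port 4444 at /wd/hub/status once ready.
    wait_until 30 curl -sf http://localhost:4444/wd/hub/status
}
```

Remember that DISPLAY must already be exported in the shell that starts the server, since the browsers it launches inherit that environment.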

Installing a Squid proxy for detailed traffic analysis

A proxy can be used to provide detailed logging of all web traffic to and from your remote-controlled web browser. This can be useful for capturing complex interactions that take place between JavaScript, Ajax, or Flash / AS3 in-browser applications or plugins and remote web services.

The default installation of the Squid Debian package is perfectly adequate for this task; we will then programmatically configure a proxy setting for the web browser so that all HTTP and HTTPS traffic goes through the Squid proxy to be logged.

sudo apt-get install squid

Depending on your version of Squid, the traffic log will be either here:

sudo tail -f /var/log/squid3/access.log

or here:

sudo tail -f /var/log/squid/access.log

(once the log file has been created by Squid you can chmod it to allow reading it without sudo privileges)
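The raw log is fairly dense, so a small filter can make it easier to follow. The sketch below assumes Squid's default native log format, in which the fourth field is the result code and HTTP status (for example TCP_MISS/200), the sixth is the request method and the seventh is the URL. If your squid.conf defines a custom logformat, the field numbers will differ. The function name is made up for the example.

```shell
# Print "METHOD STATUS URL" for each request of a Squid native-format
# access log read from standard input.
squid_requests() {
    awk '{
        split($4, result, "/")   # field 4 looks like TCP_MISS/200
        print $6, result[2], $7  # method, HTTP status, requested URL
    }'
}
```

It can be combined with the tail command above, for instance `sudo tail -f /var/log/squid3/access.log | squid_requests`.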

In-browser proxy configuration

To force the remote-controlled browser to use the Squid proxy we just configured, specify a proxy configuration while setting up the connection to the browser in PHP:

<?php
require_once "phpwebdriver/WebDriver.php";

// Connect to the Selenium server listening on localhost:4444
$webdriver = new WebDriver("localhost", "4444");

// Open a Chrome session whose HTTP and HTTPS traffic goes through Squid
$webdriver->connect("chrome", '',
	array(
		'proxy' => array(
			'proxyType' => 'manual',
			'httpProxy' => 'localhost:3128',
			'sslProxy' => 'localhost:3128'
		)
	)
);

// Load a page, then close the browser
$webdriver->get("http://www.google.com/");
$webdriver->close();
?>

3128 is the default TCP port of the Squid proxy. For this example, I am using the 3e Software House PHP implementation of the WebDriver bindings (see link at the end of the article). This gives you an actual example of how simple it is to script a web browser action in PHP. Enjoy!
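If something goes wrong, it can help to test each piece from the shell before involving PHP. The following sketch first checks that Squid answers on its own, then sends the same proxy settings straight to the Selenium server. It assumes that the server speaks the Selenium 2 JSON wire protocol at /wd/hub/session and that the bindings translate the PHP array above into a "desiredCapabilities" object; both are assumptions to verify against your versions, and the function names are invented for the example.

```shell
# Print the HTTP status code obtained when fetching URL $2 through proxy $1.
check_proxy() {
    curl -s -o /dev/null -w '%{http_code}\n' -x "$1" "$2"
}

# Build the JSON body of a new-session request for browser $1 using
# the manual proxy $2 (host:port) for both HTTP and HTTPS.
proxy_session_json() {
    browser="$1"; proxy="$2"
    printf '{"desiredCapabilities":{"browserName":"%s","proxy":{"proxyType":"manual","httpProxy":"%s","sslProxy":"%s"}}}\n' \
        "$browser" "$proxy" "$proxy"
}

# Ask the Selenium server (assumed at localhost:4444) to open such a session.
open_proxied_session() {
    curl -s -X POST -H 'Content-Type: application/json' \
        -d "$(proxy_session_json "$1" "$2")" \
        http://localhost:4444/wd/hub/session
}
```

Running `check_proxy localhost:3128 http://www.google.com/` should leave a matching line in the Squid access log; if it does not, the problem is with the proxy rather than with the browser setup.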

Securing your Selenium server with iptables

By default, your Selenium server, once launched, listens on port 4444 and accepts connections from any host on the Internet. This allows anyone with remote IP access to your machine to use it to remote-control a web browser and perform any task, which is a major security threat. You should ensure that only local applications and/or known hosts get access to your Selenium server. To make it private, you can use firewall rules to block outside incoming traffic to Selenium’s TCP port.

Here’s how to do it with a few iptables rules (iptables being the firewall included with Debian Linux):

sudo iptables -A INPUT -p tcp --destination-port 4444 -i lo -j ACCEPT
sudo iptables -A INPUT -p tcp --destination-port 4444 -j DROP
sudo iptables -L -v --line-numbers

(the first rule allows localhost traffic, the second rule blocks all other incoming traffic to TCP port 4444)

If you need to allow incoming browser control traffic from another host, you can insert another rule between the two aforementioned iptables commands:

sudo iptables -A INPUT -p tcp --destination-port 4444 -s ww.xx.yy.zz -j ACCEPT

(where ww.xx.yy.zz is the IP address to authorize). Use sudo iptables -D INPUT <rule line number> to delete a rule. The order of the rules is significant: ACCEPT rules need to come before the generic DROP rule if you want them to be taken into account as exceptions to the general block on the port.
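Because the order matters, one option is to generate the whole rule set in one go. The sketch below only prints the commands, so that they can be reviewed before being run; the function name is made up, and 192.0.2.10 in the usage line is a documentation address standing in for a real host.

```shell
# Print the iptables commands that restrict TCP port $1 to localhost
# plus the source addresses given as further arguments.
# Nothing is executed: review the output, then pipe it to a root shell.
port_lockdown_rules() {
    port="$1"; shift
    # Local traffic first
    echo "iptables -A INPUT -p tcp --destination-port $port -i lo -j ACCEPT"
    # Then one exception per authorized host
    for host in "$@"; do
        echo "iptables -A INPUT -p tcp --destination-port $port -s $host -j ACCEPT"
    done
    # The generic block always comes last
    echo "iptables -A INPUT -p tcp --destination-port $port -j DROP"
}
```

For example, `port_lockdown_rules 4444 192.0.2.10 | sudo sh` applies the three rules in the right order. If the DROP rule is already in place, a later exception has to be added with iptables -I INPUT <rule line number> rather than -A, so that it lands before the DROP.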

Securing the Squid proxy

The same measures should be taken to prevent your Squid web proxy from being used by anyone with IP access to your server, and to keep it “private”:

sudo iptables -A INPUT -p tcp --destination-port 3128 -i lo -j ACCEPT
sudo iptables -A INPUT -p tcp --destination-port 3128 -j DROP

Issues with Flash

Flash content is fully supported with the described setup. One possible issue is that running Flash-based dynamic content can be quite slow, depending on processor speed and available memory: server hardware is usually not optimized to run a typical client-side, graphics-based application like a Flash applet. I have tried both letting Chrome use its internal version of Flash and configuring an external Flash plugin provided by Adobe (installed with the flashplugin-nonfree package + curl). Both worked.

On a recent 64-bit Ubuntu distribution, Flash content ran flawlessly and Flash-based serialized web-service traffic (.amf files) was exchanged without problems. On an older 64-bit Debian system, I achieved the same result, but with Flash content running much more slowly: I had to introduce very long delays in the PHP script in order to observe the expected traffic between the browser and the web services. Sometimes heavy Flash applications (such as interactive games) cannot finish loading because of system load constraints: they can hang or stop functioning after a while. Upgrading to a more powerful system is the only option for running a full-fledged heavy Flash app inside a remote-controlled browser. The ability to take screenshots at any stage of the process greatly helps debugging.

Also, introducing some “sleep” commands in the PHP scripts can help, by giving the dynamic contents of the page time to fall into place, download additional data, etc.

Issues with file upload

File upload in HTML forms can be achieved by writing the complete local path of the file into the value of the <input type="file"> element. This feature is presently supported by recent versions of Firefox, but not by Chrome under GNU/Linux (this situation might change again in the future, since Google was originally the first to implement this feature in its browser).

File upload cannot be achieved directly when the browser runs on a remote server, because the browser needs local access to the file to be uploaded. The file has to reside on the same server that is running the web browser at the time the upload is performed (when the form is submitted).

Facebook

Yes, this setup works with Facebook: it can be used to automate status posts or image uploads with PHP. Facebook's usage rules might however forbid this use. Please check against the current Facebook usage policy.

General disclaimer

Use at your own risk, and obey the usage rules of the sites you wish to access with a remote-controlled browser.

Please be aware that the use of an automated web browser on third-party sites, if detected, can result in your server’s IP address being blocked or included in security blacklists.

The setup described here is very powerful. I strongly discourage its use for generating spam, DoS attacks, black-hat SEO, and other unfair or unlawful behavior. I will not respond to enquiries concerning such uses.

Useful links
