Useragent browser and OS detection PHP script
PHP Browser detection made easy
Useragent strings are notoriously tricky but essential for gathering statistics. PHP scripts to process them are few and far between, often out of date and difficult to update, or just plain inefficient. I set about writing a simple analysis script that was easy to update and had a small footprint. I can live with 99% accuracy for my stats. I don't need to know if a single visitor in a million uses a version of Netscape Navigator from 1994; as long as the big four are covered, I'm happy.
XML UserAgent Search strings
First off, we need an XML configuration file. I'm going to call this ua.xml. The first thing I want to check is whether the visitor is a bot, so I'll add a few of these first. The script works by searching for a specific string, so I'll call this attribute "search". For my own stats, I can give each bot a name with the optional "name" attribute:
<uasearch type="os">
    <bot search="googlebot" />
    <bot search="bingbot" />
    <bot search="baiduspider" name="Baidu" />
    <bot search="slurp" name="Yahoo" />
    <bot search="msnbot" name="MSN" />
    <bot search="jeeves" name="Ask" />
    <bot search="teoma" name="Ask" />
    ...
Using the same method, I need to find the visitor's operating system and its version number. To find the version number, I'm going to add an optional attribute called "vsearch". If omitted, the script uses the search attribute instead. For OSs like Windows, where several versions exist, I can nest more specific searches and give each one a more meaningful name. So NT 6.1, for example, becomes "WIN7".
In addition, I need to specify how far to look ahead for the version number after the OS string is found. The script defaults to 5 characters, but some version numbers can stretch longer than that, so I'll use a "lookahead" attribute. In the script, searching for version numbers begins immediately after the name of the OS, which works in most cases, but not with Windows, so I need to cancel this out with a "lookback" attribute. The XML below works for most browsers, but can be refined for more accurate detection of say, Linux distros:
    <os search="ipad" name="IPAD" vsearch=" OS " lookahead="7" />
    <os search="iphone" name="IPHN" vsearch=" OS " lookahead="7" />
    <os search="windows" name="WIN" lookback="0">
        <version search="6.1" name="7" />
        <version search="6.0" name="Vista" />
        <version search="5.2" name="XP64" />
        <version search="5.1" name="XP2" />
        <version search="XP" />
        <version search="5.0" name="2K" />
        <version search="4.0" name="NT" />
        <version search="98" />
        <version search="95" />
        <version search="CE" />
    </os>
    <os search="macintosh" name="MAC" />
    <os search="android" name="ANDR" />
    <os search="google desktop" name="GD" />
    <os search="fedora" name="LXFED" lookahead="7" />
    <os search="linux" name="LX" lookahead="7" />
    <os search="java" lookahead="6" />
    <os search="symbian" lookahead="6" />
Add in more obscure OSs as required. The last thing is to detect the user's browser. The big four are covered here, with a fallback for unknown WebKit-based browsers, but if I did want to know more about that time-travelling Netscape user from 1994, I could add an extra trap in the XML without touching the PHP at all.
    <browser search="MSIE" name="IE" />
    <browser search="Firefox" name="FF" />
    <browser search="Chrome" name="CH" />
    <browser search="Safari" name="SF" />
    <browser search="Webkit" name="WK" />
Again, lookahead can be tweaked here for more accurate version numbers. For my purposes, just the major revisions are needed. For example, "FF36" is enough to tell me the user is using Firefox version 3.6. I don't need to know if it's version 3.6.1 or 3.6.8. If you do, add the attribute lookahead="7". Finish up the XML with:
</uasearch>
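To see what "lookahead" and "lookback" actually do, here is a minimal standalone sketch of the version arithmetic. The helper name extractVersion() is my own invention for illustration and is not part of the class built below, and the two useragent strings are just samples.

```php
<?php
// Hypothetical helper (not part of the UserAgent class) that mirrors the
// substr() arithmetic the script uses for version numbers.
function extractVersion($ua, $vsearch, $lookahead = 5, $lookback = 1)
{
    $pos = stripos($ua, $vsearch);
    if ($pos === false) {
        return '';
    }
    // lookback = 1 starts reading just after the matched string,
    // lookback = 0 starts reading at the matched string itself.
    $start = $pos + strlen($vsearch) * $lookback;
    // Keep only the digits found inside the lookahead window.
    return preg_replace('/[^\d]/', '', substr($ua, $start, $lookahead));
}

$firefox = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8';
$iphone  = 'Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_0_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8A306 Safari/6531.22.7';

echo extractVersion($firefox, 'Firefox'), "\n";     // 36
echo extractVersion($firefox, 'Firefox', 7), "\n";  // 368
echo extractVersion($iphone, ' OS ', 7), "\n";      // 401
echo extractVersion($firefox, '6.1', 5, 0), "\n";   // 61
```

The last call shows why Windows needs lookback="0": the search string "6.1" is itself the version, so reading after it would only find "; en-" and return nothing.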
PHP to parse useragent string
I'm going to build a new class for handling this called UserAgent. Later, I may add statistical analysis routines and other whizzy functions for IP locations and suchlike, but for now we will just parse the useragent string using the XML settings above.
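<?php
class UserAgent {
    public $UA;

    public function __construct() {
        // The header is missing for some clients and when run from the command line
        $this->UA = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    }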
Because I'm using an external XML file for the search configs, I need a handler to convert it to an array. I'll use one I found on the web somewhere. I don't take any credit for this function, but it's damn useful:
    // Convert an XML string into a nested array of element names, attributes and text.
    // Regex-based, so only suitable for simple, well-formed files like ua.xml.
    public static function xml2array($xml) {
        $xmlary = array();
        $reels = '/<(\w+)\s*([^\/>]*)\s*(?:\/>|>(.*)<\/\s*\\1\s*>)/s';
        $reattrs = '/(\w+)=(?:"|\')([^"\']*)(?:"|\')/';
        preg_match_all($reels, $xml, $elements);
        foreach($elements[1] as $ie => $xx) {
            $xmlary[$ie]["name"] = $elements[1][$ie];
            if($attributes = trim($elements[2][$ie])) {
                preg_match_all($reattrs, $attributes, $att);
                foreach($att[1] as $ia => $xx)
                    $xmlary[$ie]["attributes"][$att[1][$ia]] = $att[2][$ia];
            }
            // Text and child elements are handled whether or not the element has attributes
            $cdend = strpos($elements[3][$ie], "<");
            if($cdend > 0)
                $xmlary[$ie]["text"] = substr($elements[3][$ie], 0, $cdend - 1);
            if(preg_match($reels, $elements[3][$ie]))
                $xmlary[$ie]["elements"] = self::xml2array($elements[3][$ie]);
            elseif($elements[3][$ie])
                $xmlary[$ie]["text"] = $elements[3][$ie];
        }
        return $xmlary;
    }
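If the regex-based converter ever proves too fragile, PHP's bundled SimpleXML extension can read the same file. This is only a sketch of that alternative, assuming the extension is enabled. The readRules() helper is hypothetical, and it returns a flat list of top-level rules rather than the nested array that detect() expects, so it is not a drop-in replacement.

```php
<?php
// Hypothetical alternative to xml2array(): read the top-level rules with SimpleXML.
function readRules($xmlString)
{
    $rules = array();
    $root = simplexml_load_string($xmlString);
    if ($root === false) {
        return $rules; // malformed XML
    }
    foreach ($root->children() as $node) {
        // 'type' is the element name: bot, os or browser
        $rule = array('type' => $node->getName());
        // Copy every attribute (search, name, vsearch, lookahead, ...)
        foreach ($node->attributes() as $key => $value) {
            $rule[$key] = (string) $value;
        }
        $rules[] = $rule;
    }
    return $rules;
}

$xml = '<uasearch type="os">'
     . '<bot search="slurp" name="Yahoo" />'
     . '<os search="ipad" name="IPAD" vsearch=" OS " lookahead="7" />'
     . '</uasearch>';

print_r(readRules($xml));
```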
Since we're parsing the useragent, we may as well search for the visitor's language settings. I don't need this to be spot on, so the code is crude, and IE lower than version 9.0 doesn't send any language data at all, but it's better than nothing:
    public function lang($ua) { //Detect Language
        $ua = str_replace(")", ";", $ua);
        $parts = explode(";", $ua);
        foreach($parts as $p) {
            $p = trim($p);
            if(strlen($p)===5&&strpos($p, '-')) return strtoupper(substr($p,0,2));
        }
        return false;
    }
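The check above accepts any five-character token with a hyphen in it. If that turns out to be too loose, a slightly stricter variant might look like the sketch below. It assumes language tokens always take the form of two letters, a hyphen and two letters (such as "en-us"). detectLang() is a hypothetical standalone function, not part of the class.

```php
<?php
// Hypothetical stricter variant of lang(): accept only tokens shaped like "en-us".
function detectLang($ua)
{
    // Treat ")" as a separator too, so the last token inside the brackets is split off.
    $parts = explode(';', str_replace(')', ';', $ua));
    foreach ($parts as $p) {
        $p = trim($p);
        if (preg_match('/^[a-z]{2}-[a-z]{2}$/i', $p)) {
            return strtoupper(substr($p, 0, 2));
        }
    }
    return false;
}

// A sample Firefox string carries a language token, a sample IE8 string does not.
var_dump(detectLang('Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8')); // string(2) "EN"
var_dump(detectLang('Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)'));                          // bool(false)
```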
The only thing left is to parse the useragent string using the XML settings from earlier to detect the browser and OS. I figured the easiest way was to simplify the data, remove all of the irrelevant fluff that browsers sometimes add on, and then present the important stuff in a simple-to-use array. Here's the result:
    // Parse the useragent against the rules in $xmlfile.
    // Returns up to three keys: 'bot', or 'os' and 'browser', plus 'lang'.
    public function detect($xmlfile = "ua.xml", $vsep = '') {
        if(!$xml = self::xml2array(file_get_contents($xmlfile))) return array();
        $ua = strtolower($this->UA);
        $data = array();
        foreach($xml as $atts) {
            foreach($atts['elements'] as $rule) {
                $attr = $rule['attributes'];
                // Number of chars to search for the version number
                $lookahead = (isset($attr['lookahead']) && intval($attr['lookahead'])) ? intval($attr['lookahead']) : 5;
                // 1 = start reading after the search string, 0 = start at the search string
                $lookback = isset($attr['lookback']) ? intval($attr['lookback']) : 1;
                $type = strtolower($rule['name']);
                // Keep the first match per type, and skip everything once a bot is found
                if(isset($data[$type]) || isset($data['bot'])) continue;
                // Compare against false: a match at position 0 is still a match
                if(stripos($ua, $attr['search']) === false) continue;
                $data[$type] = isset($attr['name']) ? $attr['name'] : $attr['search'];
                if(!isset($rule['elements'])) {
                    $vsearch = isset($attr['vsearch']) ? $attr['vsearch'] : $attr['search'];
                    if(($stv = stripos($ua, $vsearch)) !== false) {
                        $data[$type] .= $vsep . preg_replace("/[^\d]/", "", substr($ua, $stv + strlen($vsearch) * $lookback, $lookahead));
                    }
                } else {
                    foreach($rule['elements'] as $el) {
                        if($el['name'] != "version") continue;
                        $vs = $el['attributes']['search'];
                        if(($stv = stripos($ua, $vs)) === false) continue;
                        $version = preg_replace("/[^\d]/", "", substr($ua, $stv + strlen($vs) * $lookback, $lookahead));
                        // Parentheses are required: "." binds tighter than "?:"
                        $data[$type] .= $vsep . (isset($el['attributes']['name']) ? $el['attributes']['name'] : $version);
                        break; // the first matching version wins
                    }
                }
            }
        }
        if($lang = $this->lang($ua)) $data['lang'] = $lang;
        return $data;
    }
}
?>
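Because detect() keeps only the first match for each element type, the order of the entries in ua.xml matters. Chrome's useragent also contains the word "Safari", so Chrome has to be listed first. The sketch below illustrates this with a simplified, hypothetical stand-in for the browser loop. firstBrowser() and the sample useragent string are mine, for illustration only.

```php
<?php
// Hypothetical, simplified stand-in for the browser loop in detect():
// the first rule whose search string occurs in the useragent wins.
function firstBrowser($ua, array $rules)
{
    foreach ($rules as $search => $name) {
        if (stripos($ua, $search) !== false) {
            return $name;
        }
    }
    return null;
}

// Sample Chrome useragent; note that it mentions Safari as well.
$chrome = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.142 Safari/535.19';

// Same order as ua.xml: Chrome is tested before Safari.
$good = array('MSIE' => 'IE', 'Firefox' => 'FF', 'Chrome' => 'CH', 'Safari' => 'SF', 'Webkit' => 'WK');
// Safari moved ahead of Chrome.
$bad  = array('MSIE' => 'IE', 'Firefox' => 'FF', 'Safari' => 'SF', 'Chrome' => 'CH', 'Webkit' => 'WK');

echo firstBrowser($chrome, $good), "\n"; // CH
echo firstBrowser($chrome, $bad), "\n";  // SF
```

The same reasoning explains why "fedora" sits above "linux" and why the generic "Webkit" fallback comes last.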
Testing
The script above returns an array containing anything between 0 and 3 elements. A zero element array means that the script was unable to parse anything useful from the useragent string, so it can be discarded or filed under "Others". This happens with unknown bots, browsers that don't pass any UA detail, or referral mechanisms designed to hide the useragent. For example, hits from links contained within a Facebook page may have a useragent string containing the information "Facebookexternal" but no other relevant details. This will be discarded unless a trap is set in the XML to keep this data.
If the useragent carries a bot signature, the returned data will be a single element array with the key 'bot' and the search value that matches one of our XML bot traps, plus the version number if one exists. For example:
Array( 'bot' => 'Googlebot21' )
Internet Explorer pre version 9.0 user agent strings will usually return a 2 element array containing the OS and the browser detail:
Array( 'os' => 'WINXP', 'browser' => 'IE80' )
This data is self-evident, indicating an OS of Windows XP and Microsoft Internet Explorer 8.0. Useragent data from modern browsers will usually return 3 elements:
Array( 'os' => 'WIN7', 'browser' => 'IE90', 'lang' => 'EN' )
We now have the additional language information for the user, in this case English. If you have a typical Apache log lying around, you can run the entire script like so:
$ua = new UserAgent();
echo '<pre>';
foreach(file("access.log") as $line) {
    if(strlen($line) > 20) {
        $ua->UA = trim($line);
        if(count($data = $ua->detect())) {
            $str = implode("/", $data) . " (".implode("/", array_keys($data)) . ")\n";
            echo $str;
        } else {
            $str = "Nothing found.\n";
            echo $str;
        }
    }
}
echo '</pre>';
The result will look something like this:
BING20 (bot)
MJ1213 (bot)
WINXP2/EN (os/lang)
WIN7/EN (os/lang)
YAHOO (bot)
WINVista/IE60 (os/browser)
WINXP2/IE80 (os/browser)
ANDR21/SF530/EN (os/browser/lang)
ANDR21/SF530/EN (os/browser/lang)
ANDR21/SF530/EN (os/browser/lang)
Nothing found.
Nothing found.
WIN7/IE80 (os/browser)
WIN7/IE80 (os/browser)
WIN7/IE80 (os/browser)
IPHN401/SF6531/EN (os/browser/lang)
SYMBIAN8 (os)
WIN7/FF36/EN (os/browser/lang)
WIN7/FF36/EN (os/browser/lang)
WIN7/FF36/EN (os/browser/lang)
WINVista/IE80 (os/browser)
BAIDU (bot)
JAVA14 (os)
JAVA14 (os)
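The loop above feeds each whole log line to the detector. That works because the script only searches for substrings, but a request path or referrer that happens to contain "firefox" or "linux" could produce a false match. One way to narrow it down is sketched here, assuming Apache's "combined" log format, where the useragent is the last double-quoted field on the line. uaFromLogLine() is a hypothetical helper, the log line is made up, and useragents containing escaped quotes are not handled.

```php
<?php
// Hypothetical helper: pull the useragent out of an Apache "combined" log line.
// Assumes the useragent is the last double-quoted field on the line.
function uaFromLogLine($line)
{
    if (preg_match('/"([^"]*)"\s*$/', $line, $m)) {
        return $m[1];
    }
    return trim($line); // no quoted field found: fall back to the whole line
}

// Made-up sample line in combined format
$line = '203.0.113.7 - - [29/Mar/2012:14:53:00 +0000] "GET /index.php HTTP/1.1" 200 5120 '
      . '"-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8"';

echo uaFromLogLine($line), "\n";
```

In the loop, this would replace the trim($line) assignment with $ua->UA = uaFromLogLine($line).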
All of the log data is now in a nice concise and readable format ready for further processing or storage. If you use a database to record your access logs, or archive your existing logs, you will want to keep them as efficiently as possible. The above data can now be compressed further. In fact, depending on how accurate a record you need, the above can be contained in a single byte of data if for example you can live with just the major browser version and major OS version and ignore bots and language data. If you do need this, 2 bytes is plenty.
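As a rough illustration of that kind of packing, here is one possible layout: the OS and browser share the first byte (4 bits each), and the second byte holds a language or bot code. The lookup tables, the bit layout and the packHit()/unpackHit() names are all assumptions made for this sketch. Any stable numbering of up to 16 OS and 16 browser entries would do.

```php
<?php
// Hypothetical lookup tables mapping detector output to small integers.
$osCodes      = array('WIN7' => 1, 'WINVista' => 2, 'WINXP2' => 3, 'ANDR21' => 4);
$browserCodes = array('IE80' => 1, 'FF36' => 2, 'SF530' => 3);
$langCodes    = array('EN' => 1);

// Byte 1: OS in the high 4 bits, browser in the low 4 bits.
// Byte 2: language or bot code (0 = none).
function packHit($os, $browser, $extra = 0)
{
    return pack('CC', (($os & 0x0F) << 4) | ($browser & 0x0F), $extra & 0xFF);
}

function unpackHit($bytes)
{
    $b = unpack('Cfirst/Csecond', $bytes);
    return array(
        'os'      => $b['first'] >> 4,
        'browser' => $b['first'] & 0x0F,
        'extra'   => $b['second'],
    );
}

// "WIN7/FF36/EN" from the sample output
$bytes = packHit($osCodes['WIN7'], $browserCodes['FF36'], $langCodes['EN']);
echo strlen($bytes), "\n";  // 2
print_r(unpackHit($bytes)); // os 1, browser 2, extra 1
```

Dropping the second byte gives the single-byte version, at the cost of losing the bot and language data.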
I use this information (and some extra Geo-IP data) to generate statistical data on website usage within our customers' CMS. Here's an example:



March 29th, 2012 - 14:53
Hi, this looks awesome, exactly what I am looking for. Are you planning to release the files (including the bot XML) somewhere – on GitHub or as a ZIP file?
April 10th, 2012 - 20:16
Nice article thanks man, using a bit of the code in my CMS.