The focus of our final project is improving both the ease of use and the functionality
of AdBlock, a popular plugin for the Firefox web browser.
The first thing that we attempt is to remove the requirement that users have a technical background. AdBlock performs its blocking by using wildcard strings, which only a technically inclined user is able to manipulate directly.
Our solution is to use an algorithm to combine similar URLs into a single wildcard URL. Rather than approaching the problem from an artificial intelligence (Bayesian) direction, we chose to use the Longest Common Substring (LCS) algorithm and some interchangeable heuristics. We decided to take this approach since a URL itself does not necessarily reveal any information about its contents, so an artificial intelligence approach might misjudge URLs too often. There is an existing project that uses Bayesian networks, called AdBlock Learner, which according to its authors is not very successful and exhibits many false positives. We are considering Bayesian filtering for other parts of the project, but not for the generation of wildcard URLs.
The common substrings of an arbitrary minimum length are concatenated and surrounded with asterisks, so two URLs such as http://www.bob.com/url?id=12345 and http://www.bob.com/url?id=678910 produce a wildcard string of http://www.bob.com/url?id=*. Thus, the wildcard pattern matches all similar URLs in the future.
Existing wildcard URLs generate themselves when compared to URLs which match them, e.g., http://www.bob.com/url?id=* with http://www.bob.com/url?id=27 generates http://www.bob.com/url?id=* again. In this fashion, many URLs can be combined by this algorithm into a single generic matching wildcard URL.
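The combination step described above can be sketched in JavaScript as follows. This is a minimal illustration under stated assumptions, not the code used in the plugin: the function names `longestCommonSubstring` and `mergeUrls` and the `minLen` parameter are hypothetical, and applying the LCS search recursively to the text on either side of each match is one possible reading of "concatenating the common substrings".

```javascript
// Find the longest common substring of a and b with dynamic programming.
// Returns its length and its start offsets in a and in b.
function longestCommonSubstring(a, b) {
  let best = { len: 0, ai: 0, bi: 0 };
  let prev = new Array(b.length + 1).fill(0);
  for (let i = 1; i <= a.length; i++) {
    const cur = new Array(b.length + 1).fill(0);
    for (let j = 1; j <= b.length; j++) {
      if (a[i - 1] === b[j - 1]) {
        cur[j] = prev[j - 1] + 1;
        if (cur[j] > best.len) {
          best = { len: cur[j], ai: i - cur[j], bi: j - cur[j] };
        }
      }
    }
    prev = cur;
  }
  return best;
}

// Merge two URLs (or wildcard URLs) into one wildcard pattern.
// Common pieces of at least minLen characters are kept literally;
// everything that differs between them collapses into a single "*".
function mergeUrls(a, b, minLen) {
  if (a === b) return a; // identical text (including two empty strings)
  const m = longestCommonSubstring(a, b);
  if (m.len < minLen) return "*"; // nothing worth keeping in this region
  const left = mergeUrls(a.slice(0, m.ai), b.slice(0, m.bi), minLen);
  const right = mergeUrls(a.slice(m.ai + m.len), b.slice(m.bi + m.len), minLen);
  const merged = left + a.slice(m.ai, m.ai + m.len) + right;
  return merged.replace(/\*+/g, "*"); // never emit adjacent asterisks
}

// minLen = 4 is an arbitrary choice for this example.
console.log(mergeUrls("http://www.bob.com/url?id=12345",
                      "http://www.bob.com/url?id=678910", 4));
// prints "http://www.bob.com/url?id=*"
console.log(mergeUrls("http://www.bob.com/url?id=*",
                      "http://www.bob.com/url?id=27", 4));
// prints "http://www.bob.com/url?id=*"
```

Because an asterisk in an input is treated like any other character, an existing wildcard URL compared with a URL it matches reproduces itself, as in the second call.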
One problem with this algorithm is that one needs a heuristic to determine when URLs are “sufficiently similar”. Two URLs such as http://ads.doubleclick.net/?ABZCE and http://www.collegehumor.com/ads/banner0102.jpg certainly share substrings, such as “ads” and “http://”, but the resulting wildcard URL http://*ads* is not useful.
To prevent the combination of two unrelated URLs, we are developing a heuristic
algorithm. One heuristic that we have found to work well applies when combining two
URLs that already contain wildcards: for example, http://www.doubleclick.net/banners/* and http://ads.collegehumor.com/banners/*
combine to */banners/*, using the criterion that the resulting wildcard
string must not have fewer asterisks than both of the source URLs.
To improve the quality of wildcard URLs generated from fully specified URLs,
we are currently experimenting with dropping the http:// prefix and not allowing
combination if the remaining substring is just a combination of individual letters.
We are also considering only combining URLs for the same domain.
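The two experiments mentioned here can be sketched as simple checks in JavaScript. This is only an illustration under assumptions: the helper names, the regular expression used to extract the host, and the `minPieceLen` threshold are all hypothetical, and "same domain" is interpreted as "identical host name".

```javascript
const PREFIX = "http://";

// Drop the http:// prefix so that it cannot count as a meaningful match.
function stripPrefix(s) {
  return s.startsWith(PREFIX) ? s.slice(PREFIX.length) : s;
}

// Reject patterns whose literal pieces are all single letters, e.g. "*a*d*".
// minPieceLen is an assumed threshold, not a value from the project.
function hasSubstantialPiece(pattern, minPieceLen = 2) {
  return stripPrefix(pattern)
    .split("*")
    .some(piece => piece.length >= minPieceLen);
}

// Extract the host part of a fully specified URL, or null if there is none.
function hostOf(url) {
  const m = /^[a-z][a-z0-9+.-]*:\/\/([^\/?#]+)/i.exec(url);
  return m ? m[1].toLowerCase() : null;
}

// Only allow combination when both URLs point at the same host.
function sameHost(a, b) {
  const host = hostOf(a);
  return host !== null && host === hostOf(b);
}

console.log(hasSubstantialPiece("http://*a*d*")); // prints false
console.log(sameHost("http://ads.doubleclick.net/?ABZCE",
                     "http://www.collegehumor.com/ads/banner0102.jpg")); // prints false
```

Note that the single-letter check alone would still accept http://*ads*, since "ads" is three characters long; in this sketch it is the same-host check that keeps those two URLs apart.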
We also want to make it possible to statically block elements of an HTML page. The element would be specified by a URL, which might contain wildcards, and a document object model (DOM) path identifying the element in the page that should be excluded. From a language perspective, the HTML document creates an abstract syntax tree, so a DOM path can uniquely label one of its nodes. For ease of use, we are considering encoding this path within the same string as the URL; therefore, the DOM path may contain wildcards as well.
Identifying an element for exclusion results in the exclusion of all contained elements. Excluding a table row therefore makes all of the contained cells disappear. Using this feature, users can remove entire advertisement sections from their favorite websites.
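To illustrate why excluding one element removes everything it contains, here is a small JavaScript sketch on a simplified tree. The node shape `{ tag, children }` and the representation of a path as an array of child indices are assumptions made for this example; they are not the path syntax of the plugin, and wildcards in paths are not modelled.

```javascript
// Return a copy of the tree with the node at `path` removed.
// A path is a list of child indices, starting at the root.
function excludeAt(node, path) {
  if (path.length === 0) return null; // this node is the one to exclude
  const [head, ...rest] = path;
  const children = node.children
    .map((child, i) => (i === head ? excludeAt(child, rest) : child))
    .filter(child => child !== null);
  return { tag: node.tag, children };
}

// Count a node together with all of its descendants.
function countNodes(node) {
  return 1 + node.children.reduce((n, child) => n + countNodes(child), 0);
}

const cell = () => ({ tag: "td", children: [] });
const row = () => ({ tag: "tr", children: [cell(), cell()] });
const table = { tag: "table", children: [row(), row()] };

const without = excludeAt(table, [0]); // exclude the first row
console.log(countNodes(table), countNodes(without)); // prints 7 4
```

Removing the single row node takes its two cells with it, which is the behaviour described above for table rows.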
It also enables us to experiment with additional heuristics and perhaps Bayesian
filters to determine if an HTML element contains a high enough fraction of blocked
URLs and only a small fraction of potentially important information so that
the entire element should be blocked. In that case, our program could place
the element in a “recommended for exclusion” list.
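One way such a recommendation could be computed is sketched below in JavaScript. It is an assumption-laden illustration only: the function names and the `threshold` value are hypothetical, wildcard patterns are assumed to match a whole URL with `*` standing for any run of characters, and the "potentially important information" part of the heuristic is not modelled at all.

```javascript
// Turn a wildcard pattern into a regular expression (assumed semantics:
// "*" matches any run of characters, the pattern must match the whole URL).
function wildcardToRegExp(pattern) {
  const escaped = pattern
    .split("*")
    .map(piece => piece.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp("^" + escaped + "$");
}

// Fraction of the URLs referenced by an element that the block list matches.
function blockedFraction(urls, patterns) {
  if (urls.length === 0) return 0;
  const regexps = patterns.map(wildcardToRegExp);
  const blocked = urls.filter(url => regexps.some(re => re.test(url)));
  return blocked.length / urls.length;
}

// Recommend the element for exclusion when enough of its URLs are blocked.
function recommendExclusion(urls, patterns, threshold = 0.8) {
  return blockedFraction(urls, patterns) >= threshold;
}

const urlsInElement = [
  "http://www.bob.com/url?id=1",
  "http://www.bob.com/url?id=2",
  "http://example.org/logo.png",
];
const blockList = ["http://www.bob.com/url?id=*"];
console.log(recommendExclusion(urlsInElement, blockList, 0.5)); // prints true
```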
Much of our experimentation to date has used Java, but we believe it will be easy to port our programs to JavaScript, which AdBlock uses, since our algorithms are extremely lightweight and do not use any special language features.
We have begun to modify the AdBlock plugin for Mozilla Firefox. The plugin uses JavaScript and XUL, but since there is no dedicated development environment for this combination, the edit-run-debug cycle is somewhat painful. The tasks at hand are not that difficult in themselves, but it is hard to become familiar with the systems within Firefox, which are non-trivial. The code is spread out: different files determine different runtime features of a single dialog box, for example, and many different aspects of Firefox figure even into seemingly easy tasks.
A feature we have added is the ability to get lists of pattern-matching URLs from the web. This is a simple dialog in the preferences window that allows the user to enter a URL. It connects to the specified web server, downloads and parses the file, and adds its contents to the user’s block list. This allows users with no technical inclination to get their AdBlock lists from somewhere else, rather than worrying about creating them themselves. We think that this is a considerable step forward in usability, since most users simply want to set up ad-blocking software and be done with it, not continuously interact with it as hackers might be inclined to do. We suspect that certain people will establish reputations as providers of good, correct block lists and that most users will simply accept those lists. This could be extended into a ranking or karma scheme as we proposed earlier, but that is beyond the scope of this class project.
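The parse-and-add step can be illustrated with a short JavaScript sketch. The format of the downloaded file is not specified here, so the sketch assumes plain text with one wildcard pattern per line; the function names are hypothetical and the download itself is left out.

```javascript
// Split the downloaded text into patterns (assumed: one pattern per line).
function parseFilterList(text) {
  return text
    .split(/\r?\n/)
    .map(line => line.trim())
    .filter(line => line.length > 0);
}

// Append new patterns to the user's block list, skipping duplicates.
// Returns a new array and leaves the original list untouched.
function addToBlockList(blockList, newPatterns) {
  const seen = new Set(blockList);
  const merged = blockList.slice();
  for (const pattern of newPatterns) {
    if (!seen.has(pattern)) {
      seen.add(pattern);
      merged.push(pattern);
    }
  }
  return merged;
}

const downloaded = "http://www.bob.com/url?id=*\r\n\r\n*/banners/*\n";
const userList = ["*/banners/*"];
console.log(addToBlockList(userList, parseFilterList(downloaded)));
// prints [ '*/banners/*', 'http://www.bob.com/url?id=*' ]
```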
We are currently working on converting the Java code developed earlier into JavaScript, and we are continuing our attempts to determine which non-wildcard URLs should be combined into a wildcard URL.
We are also going to integrate our enhancements into the AdBlock install script.
We have successfully ported our Java sources to JavaScript and integrated them into a testbed within Firefox/AdBlock. We are now modifying the AdBlock GUI to better integrate these features.
Since the development of the web update feature progressed faster than expected, we have decided to add an auto-update feature that downloads new filters every few days.
We now have a version of automatic blocking that works. All the user needs to do is click on an image, and its URL gets added to the list of URLs that serves as the source for pattern generation. We still need to improve certain aspects, though: often, the auto-generated patterns get too general and block images the user may want to keep. Also, we believe we have to build in better treatment for file extensions, protocols, and top-level domains. Right now, we treat these parts of the URL as a plain series of characters instead of as special features, which sometimes causes URLs to be merged that should be kept separate.
We have also noticed that a whitelist would definitely be useful. The makers of AdBlock have promised to add one, but we are considering adding it ourselves. We also seem to have identified the right place to start for blocking DOM paths.
For the last few days, we have worked mainly on our presentation. The PowerPoint slides are complete now, and we have practiced our delivery several times already. Tomorrow we will actually use a projector in one of the classrooms. We had to favor working on our presentation over finishing the coding part, so the DOM path blocking has not yet been implemented; we only discuss it.
Our presentation went well, and we are satisfied. We also got some interesting questions that gave us ideas for improving our project. We will look into merging a URL with the best match, for example, instead of with the first possible match. We will also add a shortcut to temporarily disable ad blocking to show the ads blocked on a page. A list of the last n items blocked by a particular filter would also be useful.
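The "best match" idea could look roughly like the following JavaScript sketch. What counts as the best match is not defined above, so the scoring used here, the length of the longest common substring, is purely an assumption, as are the function names.

```javascript
// Length of the longest common substring of a and b.
function lcsLength(a, b) {
  let best = 0;
  let prev = new Array(b.length + 1).fill(0);
  for (let i = 1; i <= a.length; i++) {
    const cur = new Array(b.length + 1).fill(0);
    for (let j = 1; j <= b.length; j++) {
      if (a[i - 1] === b[j - 1]) {
        cur[j] = prev[j - 1] + 1;
        if (cur[j] > best) best = cur[j];
      }
    }
    prev = cur;
  }
  return best;
}

// Pick the candidate that shares the longest substring with the new URL,
// instead of stopping at the first candidate that could be merged.
// Returns null when there are no candidates.
function bestMatch(url, candidates) {
  let best = null;
  let bestLen = -1;
  for (const candidate of candidates) {
    const len = lcsLength(url, candidate);
    if (len > bestLen) {
      best = candidate;
      bestLen = len;
    }
  }
  return best;
}

console.log(bestMatch("http://www.bob.com/url?id=27",
                      ["http://*ads*", "http://www.bob.com/url?id=*"]));
// prints "http://www.bob.com/url?id=*"
```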
The slides for our presentation are available for download now.
Happy Thanksgiving, everyone.
Paper: Download
Software: adblock.jar
Here is our final submission, submitted on Monday.
To experiment with our software, please install Firefox, install the original AdBlock, and then replace the adblock.jar in your Firefox profile directory. In Windows, this is C:\Documents and Settings\username\Application Data\Mozilla\Firefox\Profiles\random stuff.default\extensions\{random stuff}\chrome\.
We still have to generalize a few things. Autoblocking, for example, only works for images right now; we still need to add it for Flash as well. Right-click support for DOM path blocking is also only rudimentary.
This project has definitely captured our interest, though, so we will continue to improve it in our spare time. For future directions, please read our paper.
Thank you and Merry Christmas!
Justin and Mathias
I noticed that figure 1 in the PDF above had been rendered with too little resolution. Here is a link to a regenerated PDF of the paper with better quality. The old file has been maintained as well to preserve the time stamp.