Archive for June, 2008

30
Jun

De-Obfuscation Revisited

Consider if you will this Russian proxy list.

This fellow went to a lot of trouble to prevent people like me from harvesting his data.

All the address:port combinations are GIFs, which is bad enough, but that’s not all!

The GIF file names are based on your session cookie and are unique for every visit. So first you get your cookie, then you request the page (it helps if you strip the Accept-Encoding: gzip,deflate header and use HTTP/1.0 instead of 1.1) to get the unique GIF names. Then you download the GIFs and throw them at your OCR program.
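The steps above can be sketched in Python roughly as follows. This is only an illustration, not the script actually used: the helper names (build_request, fetch, session_cookie, gif_urls) are made up here, and the regular expressions assume a conventional Set-Cookie header and plain img tags, which the real page may not use.

```python
import re
import socket
from urllib.parse import urljoin


def build_request(host, path, cookie=None):
    """Build a raw HTTP/1.0 GET that deliberately omits Accept-Encoding."""
    lines = [f"GET {path} HTTP/1.0", f"Host: {host}"]
    if cookie:
        lines.append(f"Cookie: {cookie}")
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def fetch(host, path, cookie=None, port=80, timeout=10):
    """Send the request over a plain socket; return (headers, body) as text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request(host, path, cookie))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # HTTP/1.0: the server closes when it is done
                break
            chunks.append(data)
    raw = b"".join(chunks).decode("latin-1")
    head, _, body = raw.partition("\r\n\r\n")
    return head, body


def session_cookie(head):
    """Return the first name=value pair from a Set-Cookie header, if any."""
    m = re.search(r"^Set-Cookie:\s*([^;\r\n]+)", head, re.I | re.M)
    return m.group(1) if m else None


def gif_urls(base_url, body):
    """Return absolute URLs for every img tag whose src ends in .gif."""
    srcs = re.findall(r'<img[^>]+src=["\']?([^"\'\s>]+\.gif)', body, re.I)
    return [urljoin(base_url, s) for s in srcs]
```

The first fetch() call would be made without a cookie just to collect one; the second passes it back so the page returns the GIF names tied to that session.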

GNU OCR (or simply “gocr”) couldn’t handle the 7s in these GIFs, but I piped them through a utility called “gifsicle” and scaled them up by a factor of 10. After that, it only had a problem with colons, but that was taken care of with a quick sed script.
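A rough Python equivalent of that scale-then-OCR-then-cleanup chain might look like the sketch below. Several things here are assumptions: the function names are invented, ocr_gif() assumes both gifsicle and gocr are installed and that the local gocr can read GIF input, and fix_ocr_line() guesses that the colon comes out as one to three non-digit characters. The post does not say what the sed script actually matched.

```python
import re
import subprocess

# Four dotted octets, up to three junk characters where the colon was, then a port.
ADDR_RE = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3})\D{1,3}(\d{2,5})")


def ocr_gif(src, scaled, factor=10):
    """Scale a GIF with gifsicle, then run gocr on the scaled copy."""
    subprocess.run(
        ["gifsicle", "--scale", str(factor), src, "-o", scaled], check=True
    )
    result = subprocess.run(
        ["gocr", "-i", scaled], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def fix_ocr_line(text):
    """Rebuild 'address:port' from OCR text with a misread separator.

    Returns None when no plausible address and port can be found.
    """
    m = ADDR_RE.search(text)
    if not m:
        return None
    addr, port = m.group(1), m.group(2)
    if any(int(octet) > 255 for octet in addr.split(".")):
        return None
    if int(port) > 65535:
        return None
    return f"{addr}:{port}"
```

For example, fix_ocr_line("10.0.0.1;8080") gives "10.0.0.1:8080". Rejecting out-of-range octets and ports also catches some outright OCR misreads before they reach the database.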

Most of the proxies were already in my database, but I got about 10 out of the 100 or so he had listed. A 10% hit rate is pretty damned good (almost unheard of, in my experience), so this site is going into the permanent rotation.
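The "hit rate" here is just the share of harvested entries that were not already known. A minimal sketch, assuming entries are plain "address:port" strings and the database can be treated as a set (the function name is hypothetical):

```python
def new_proxies(harvested, known):
    """Return (fresh entries in original order, hit rate).

    Hit rate = fresh entries / harvested entries, counting duplicates
    within the harvest only once as fresh.
    """
    seen = set(known)
    fresh = []
    for entry in harvested:
        if entry not in seen:
            seen.add(entry)
            fresh.append(entry)
    rate = len(fresh) / len(harvested) if harvested else 0.0
    return fresh, rate
```

With about 10 fresh entries out of roughly 100 harvested, the rate comes out at about 0.10.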

30
Jun

Check, check, re-check

At this moment, the second run of the Automatic Dead Proxy Eliminator (ADPE) is about half-way done.

ADPE is currently scheduled to run on Mondays, Wednesdays, and Fridays, and its purpose is to weed dead proxies out of The List.

There were 13 pages – 630+ proxies – before it started. Now it’s down to 9 pages and 420-odd proxies. Proof positive, if you ever wondered about it in the first place, that they go fast!
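How ADPE decides that a proxy is dead isn't described here. As a simple stand-in, the sketch below treats an entry as alive if its port still accepts a TCP connection. That is a weaker test than actually relaying a request through the proxy, and the names is_alive and prune are invented for the example.

```python
import socket


def is_alive(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, DNS failure
        return False


def prune(proxies, timeout=5.0):
    """Keep only the 'host:port' entries that still accept connections."""
    alive = []
    for entry in proxies:
        host, _, port = entry.rpartition(":")
        if host and port.isdigit() and is_alive(host, int(port), timeout):
            alive.append(entry)
    return alive
```

Entries that can't be parsed as host:port are dropped along with the dead ones. A scheduled job could call prune() over the whole list and write back whatever survives.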

Earlier, I speculated that the Bahrainian Proxy Problem might be taken care of on this run, but so far those boxes still check out as up and running. I’m still curious as to what, exactly, they are, since port scans seem to imply they’re devices rather than computers.

29
Jun

Extreme-DM Note

A nice side effect from the Extreme-DM “Web bug” planted in the Proxy List has been the list of search engine referrers.

Not surprisingly, they are all Google hits so far (at this point the list has only been up for a week).

What is surprising is how exceedingly bad some of the search requests are, such as:

11033 11055 11022 port proxies for my ip

On crappy searches like this the List really shines, and it can be the only pertinent result on the first page. Weird.
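Pulling the search phrase out of a referrer is mostly a matter of reading the query string. The sketch below assumes Google passes the search terms in a q parameter and that the referrer URLs look like the hypothetical one in the usage note; search_terms is a made-up name, and this is not how Extreme-DM itself reports them.

```python
from urllib.parse import parse_qs, urlsplit


def search_terms(referrer):
    """Return the search phrase from a Google referrer URL, or None."""
    parts = urlsplit(referrer)
    host = parts.hostname or ""
    if "google." not in host:
        return None
    values = parse_qs(parts.query).get("q")
    return values[0] if values else None
```

Given a referrer such as http://www.google.com/search?q=11033+11055+11022+port+proxies+for+my+ip, this returns the search phrase quoted above, with the plus signs decoded back to spaces.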

Other search engine referrers have taken me to lists I haven’t seen before (and with 208,000+ addresses in the database, I have seen quite a few). Some of these have borne fruit and have gone into the daily harvesting runs.