Hinky’s Ultimate Proxy Regex

 

[1-2]?[0-9]{1,3}.[1-2]?[0-9]
{1,3}.[1-2]?[0-9]{1,3}.[1-2]
?[0-9]{1,3}[: ][1-9]?[0-9]{1,5}

It’s wrapped, but you’ll figure it out.

It doesn’t look like much, but it does the trick: it picks out address:port pairs in any page, unless they have been obfuscated.
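For anyone who would rather not unwrap it by hand, here is a minimal sketch of the regex joined back into one pattern and run over a line of text. It uses Python's re module rather than sed, and the sample string and the names PROXY_RE, sample and found are made up for illustration. The dots are unescaped exactly as published, so each one matches any single character, not only a literal period.

```python
import re

# The three wrapped lines joined into one pattern, exactly as written above.
PROXY_RE = re.compile(
    r"[1-2]?[0-9]{1,3}.[1-2]?[0-9]{1,3}.[1-2]?[0-9]{1,3}.[1-2]?[0-9]{1,3}"
    r"[: ][1-9]?[0-9]{1,5}"
)

# Hypothetical input: one pair separated by a colon, one by a space.
sample = "proxy one 192.168.1.10:8080 and proxy two 10.0.0.5 3128 are up"

# Collect every address:port (or "address port") match in order.
found = [m.group(0) for m in PROXY_RE.finditer(sample)]
print(found)
```

The second match still has a space between address and port; the General Use steps below turn that into a colon.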

Some dickhead proxy list maintainers like to keep each octet three characters wide and to pad the port with zeros, like this:

010.001.010.066:0080

That is stupid, and most correctly written shell tools will treat those octets as octal numbers since they start with a zero (that’s just the way it is: the address above is equal to 8.1.8.54 in decimal, and the port would be an illegal octal number with no decimal value at all).  The regex should not recognize this format, but if you want these funky addresses, change every occurrence of [1-2] to [0-2].  However, additional post-processing will be needed to get rid of the leading zeros or you will end up with junk IP addresses.

(Revised 7/26/2008 to remove leading zeros from :port spec)

How can you tell if it’s supposed to be octal?  Easy.  Nobody uses octal.  Padding with zeroes is just a way of making a list look “pretty” to anal retentive idiots and spreadsheet jockeys.  Unfortunately your shell tools don’t know that.
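The post-processing mentioned above can be sketched roughly like this. It is a Python illustration, not a shell one-liner, and strip_leading_zeros is a hypothetical helper name. It assumes the entry is already in address:port form, reads every field as decimal and drops the padding. The last few lines show what the octal reading of the same entry would be, for comparison.

```python
def strip_leading_zeros(entry: str) -> str:
    """Read each zero-padded field as decimal and drop the padding."""
    address, port = entry.split(":")
    octets = [str(int(octet, 10)) for octet in address.split(".")]
    return ".".join(octets) + ":" + str(int(port, 10))

padded = "010.001.010.066:0080"
cleaned = strip_leading_zeros(padded)

# What an octal-minded tool would make of the same address.
octal_reading = ".".join(
    str(int(octet, 8)) for octet in padded.split(":")[0].split(".")
)

# The padded port contains an 8, so it cannot be parsed as octal at all.
try:
    int("0080", 8)
    port_is_valid_octal = True
except ValueError:
    port_is_valid_octal = False

print(cleaned, octal_reading, port_is_valid_octal)
```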

 

General Use

Here’s how it’s done:

  • Grab the page and save it to a file (using wget, curl, html2text, links2 or equivalent).
  • Strip out all HTML (unless you used html2text or links2 in the first place).
  • Turn any character that’s greater than ASCII 127 into a space.  This can be a pain in the ass.
  • Strip out all newlines/CRLFs, making the file One Big Line.
  • Collapse every run of multiple spaces down to one space (sed).
  • Break out the address:port using the regex (sed) by adding a newline before and after each regex match.
  • Replace all remaining spaces with a colon (sed).
  • Remove all lines that don’t begin with a number (grep).

Essentially, you turn the page into a single line, then break the data you need out onto separate lines, removing junk in the process.

After this you have nothing but address:port combinations in your file.  Remove the octal crap as needed.
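The whole recipe, minus the page download, can be sketched in one small function. This is a Python stand-in for the sed/grep pipeline, not the pipeline itself. The tag-stripping regex is a crude assumption that will not cope with every page, and extract_proxies and sample_page are hypothetical names invented for the example.

```python
import re

# The wrapped regex from the top of the page, joined into one pattern.
PROXY_RE = re.compile(
    r"[1-2]?[0-9]{1,3}.[1-2]?[0-9]{1,3}.[1-2]?[0-9]{1,3}.[1-2]?[0-9]{1,3}"
    r"[: ][1-9]?[0-9]{1,5}"
)

def extract_proxies(page: str) -> list[str]:
    text = re.sub(r"<[^>]*>", " ", page)        # strip HTML tags (crude)
    text = re.sub(r"[^\x00-\x7f]", " ", text)   # anything above ASCII 127 -> space
    text = re.sub(r"[\r\n]+", " ", text)        # newlines/CRLFs -> One Big Line
    text = re.sub(r" {2,}", " ", text)          # multiple spaces -> one space
    # Put a newline before and after every regex match.
    text = PROXY_RE.sub(lambda m: "\n" + m.group(0) + "\n", text)
    text = text.replace(" ", ":")               # remaining spaces -> colons
    # Keep only the lines that begin with a number.
    return [line for line in text.split("\n") if re.match(r"[0-9]", line)]

# Hypothetical page: one proxy split across table cells, one already joined.
sample_page = (
    "<html><body><p>Fast proxies:</p>\n"
    "<table><tr><td>192.168.1.10</td><td>8080</td></tr>\n"
    "<tr><td>10.0.0.5:3128</td></tr></table>\u00a9 2008</body></html>"
)

print(extract_proxies(sample_page))
```

The order of the steps matters: the spaces have to be collapsed before the regex runs, because the pattern allows only a single space or colon between address and port.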

This will allow a few FQDNs (Fully Qualified Domain Names) through, which I don’t use.  They tend to be expired names anyway.  In my setup they generate an error and are ignored.

One problem with html2text is that it can add a lot of garbage.  For some reason you get this crap:

[char][0x08][char]

This prints a character, rubs it out (ASCII 8, a.k.a. “backspace”), then prints the same character.  It looks fine on a terminal, but junk gets in your file.
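One way to clean that up, sketched in Python: drop every "character followed by backspace" pair, which leaves only the second copy of each character. remove_overstrikes is a hypothetical helper name, and the sample string is made up to mimic the pattern described above.

```python
import re

def remove_overstrikes(text: str) -> str:
    # Delete each "any character + backspace" pair; the repeated
    # character that follows the backspace is kept.
    return re.sub(r"[^\x08]\x08", "", text)

# "Proxy" written the overstruck way: P, backspace, P, r, backspace, r, ...
garbage = "P\x08Pr\x08ro\x08ox\x08xy\x08y"
clean = remove_overstrikes(garbage)
print(clean)
```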

 

Notes on Proxy Obfuscation

Obfuscation takes various forms.  So far the most popular I’ve run into is Javascript code that scrambles the order of the octets.

At least two sites that I am aware of use an image of the address and port in GIF format.  This is not an insurmountable problem, and there are various open source utilities available for OCR (Optical Character Recognition).

The idiots who use these methods are trying to keep page-scrapers like the Dinkster from “stealing” their already stolen data.  It’s an entirely useless practice because all the data is usually available somewhere else and the proxies are dead anyway.