Tag Archive for 'database'

15
Mar

BEWARE THE IDES OF MARCH!

The database is two years old today!

Two years ago today it all came together and the first runs were made.  That day, a grand total of 596 proxies were collected.  351 timed out, 107 were there but not answering, and 138 were open, which is an incredibly high percentage (23%) of open proxies.

So high it makes me think things weren’t quite ready yet.  And they probably weren’t.  Even back then you’d be lucky to get 1-2% active proxies from any given proxy list.  It didn’t matter much since The List Itself wouldn’t exist for a few months.

The oldest proxy, the very first entry in the database, is 200.43.223.227, a host somewhere in Buenos Aires that wasn’t alive then and isn’t alive now.

A week later there were over 4700 entries in the database.  In two weeks, over ten thousand.  Two years later, over two and a half million.

10
Jul

1.999 Million Proxies!

Should roll over this weekend, if not today!

I seem to be getting a steady stream using the usual feeds.  Knock on wood.  I am discontinuing the resurrection for a while.  I was doing it about twice a week, which is way too often.  I’m still wondering why two thirds of the active proxies are on again off again like that.  I had always suspected the proxy judges but whenever I checked them they were always fine.

But, I haven’t done that in quite some time.  And considering it’s always the same ratio no matter what, that’s probably not it.

Anyway, that’ll be my weekend project.

13
Jun

Brazil

It looks like our Russian friend has set his sights on Brazil.

After the 4AM/5AM run there were no fewer than three pages of Brazilian proxies.  From my past experience, Brazilian proxies are about as worthwhile as Chinese proxies.  They never seem to be working by the time I get around to needing one.

But that was then, this is now.  In those days every other Brazilian proxy was on TCP port 6588, which I believe was WinGate or WinProxy or something similar.  This time around the ports are mostly 8080 and 3128, with a smattering of oddballs.

Still, old prejudices die hard.  I’m going to run them through the same tests I did with the Chinese proxies.

24
May

order by rand()

Duh.  Who knew? 

I never claimed to be a MySQL expert.  

In fact I stumbled across “order by rand()” completely by accident while doing research on Postgres.  I had no clue similar functionality was available in MySQL.  This changes – and simplifies – a lot of code.
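
For anyone else who missed it, here is a minimal sketch of the idea.  It uses Python's built-in sqlite3 module so it runs anywhere, and SQLite spells the function RANDOM() where MySQL spells it RAND().  The table name, column names and sample rows are made up for illustration and are not the real schema.

```python
import sqlite3

# MySQL form of the query discussed above (shown for reference, not executed here).
MYSQL_QUERY = "SELECT address, port FROM proxies ORDER BY RAND() LIMIT 3"


def pick_random_proxies(conn, how_many):
    """Return `how_many` rows chosen at random, letting the database shuffle."""
    # SQLite's equivalent of MySQL's ORDER BY RAND() is ORDER BY RANDOM().
    cur = conn.execute(
        "SELECT address, port FROM proxies ORDER BY RANDOM() LIMIT ?",
        (how_many,),
    )
    return cur.fetchall()


# Hypothetical table filled with documentation-range addresses.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE proxies (address TEXT, port INTEGER)")
conn.executemany(
    "INSERT INTO proxies VALUES (?, ?)",
    [("192.0.2.%d" % i, 8080) for i in range(1, 11)],
)

sample = pick_random_proxies(conn, 3)
print(sample)
```

One caution: the database has to assign a random value to every row and sort on it, so this can get slow on a table with millions of rows.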

And speaking of stumbling upon things, I did finally find the kitchen sink.

23
May

Pushing 1.8M

Since I started tracking URLs, the Google Hack has been much more productive, mostly because it’s faster now.

In all I have collected over 5,000 URLs since I started.  Of these, about 200 are top-level links.  I have just begun to pull these out at regular intervals to re-scan.  So far that has been very productive.

You may or may not have noticed Google Page Creator (this site) is going down soon.  Google has graciously decided to force me into a Google Sites redirect.  Hopefully Google will notice I already have a Google Site (http://www.mrhinkydink.net) and they will put it all there.  Are you listening, Google?

However, I have found over the years that when you expect someone to do the “smart thing” you are always sadly disappointed.  With that in mind, I’m keeping a backup.  But without the WYSIWYG editor it’s doubtful I’ll be able to keep the project journal updated.

I’m keeping my options open.  I’m not entirely impressed with Google Sites.  In fact, it sucks.  As an option I have reserved a spot at WordPress but I’ve done nothing with it so far.

But be advised the next time you come here it may be someplace else.

10
May

1.75M Proxies

Another milestone. Well, whoop dee fucking doo!

It’s going to be a long march to the second million, but we’re well under way, boys and girls!

Since most of the proxy harvest happens over Google these days, I added a URL tracking table to the proxy database. A lot of URLs get hit over and over again. I don’t really care about proxy lists from 2003 (those were the days), but there are hundreds of those out there. 99.9% of the address/ports are already in the database, so scouring those old lists is a simple waste of resources.

This table puts an end to that. All it has is three columns, the url, the sha-1 hash of the url, and a count of how many times the url is seen.
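
A rough sketch of what such a table could look like, assuming SQLite in place of the real database.  The table name (url_seen), the column names and the helper functions are all hypothetical; only the three-column layout (url, SHA-1 of the url, count) comes from the description above.

```python
import hashlib
import sqlite3


def url_hash(url):
    """SHA-1 hex digest of the URL, used as a fixed-length lookup key."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def open_tracker():
    """Create an in-memory tracking table with the three columns described."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE url_seen ("
        " url TEXT NOT NULL,"
        " sha1 TEXT PRIMARY KEY,"
        " hits INTEGER NOT NULL)"
    )
    return conn


def record_url(conn, url):
    """Count one sighting of `url`; return True only if it was never seen before."""
    digest = url_hash(url)
    cur = conn.execute(
        "UPDATE url_seen SET hits = hits + 1 WHERE sha1 = ?", (digest,)
    )
    if cur.rowcount:  # a row was updated, so this URL is a repeat
        return False
    conn.execute(
        "INSERT INTO url_seen (url, sha1, hits) VALUES (?, ?, 1)", (url, digest)
    )
    return True


conn = open_tracker()
for url in ["http://example.com/list-2003.html", "http://example.com/list-2003.html"]:
    if record_url(conn, url):
        print("new, worth fetching:", url)
    else:
        print("seen before, skipping:", url)
```

Keying on the hash rather than the url itself keeps the lookup column a fixed 40 characters no matter how long the url gets.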

This helps a lot for harvesting proxy forums with long histories (clean that old shit up, guys!). It doesn’t help so much when the current list is at the top level of the site (such as, for example, “http://www.niceproxy.com” – which is a parked domain so don’t bother with it).

As those pile up in the table I can chop them out and do some dedicated runs.

The first time through on this I finally discovered why the “.net” TLD (top level domain) was always the most fruitful. I found a cgi proxy page that is updated hourly! No graphics, no crap, just an ASCII list of addresses and ports!

Nice. That site is now in the daily 4AM schedule.

29
Jan

1.5 Million

That was a bit of a long haul.  We first hit a million proxies last August, five months and a couple of weeks after the project started.  Now, five months later, we have added another half a million.

Obviously the rate of discovery has dropped by half.

This is not a big surprise considering that ~800,000 proxies were found in a single file back in July.
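
The "dropped by half" figure is easy to sanity-check.  The sketch below assumes "five months and a couple of weeks" means 5.5 months; that number is my rounding, not a measured value.

```python
# Rough monthly discovery rates for the two periods described above.
first_million_months = 5.5      # assumption: "five months and a couple of weeks"
next_half_million_months = 5.0

early_rate = 1_000_000 / first_million_months
recent_rate = 500_000 / next_half_million_months
ratio = recent_rate / early_rate

print(f"{early_rate:,.0f} per month then, {recent_rate:,.0f} per month now, ratio {ratio:.2f}")
# prints: 181,818 per month then, 100,000 per month now, ratio 0.55
```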

In other news, some clown with a residential DSL account in Sweden recently whined to my ISP that I was using his “Web server” as a proxy.

The extent of this “use” was checking the address with one of the public proxy judges I use.  This box had been an open proxy since last June.

If anyone needed to get in hot water with their ISP, it was that guy, not me.  The guy’s running an open proxy, for crying out loud!  I ran a Google check on the IP and found it listed at antichat.ru (it’s been off my list since the 25th – the asshole probably finally figured out how to use a firewall).

I would think he’d be more surprised to find out someone was using his open proxy as a Web server. 

I blackholed his IP so it will never get hit by a resurrection run again, but if you’re interested, here it is:

85.195.15.126

The proxy was on port 80.  The “Web site” is a joke.  Check this link and you’ll probably find your own IP address on the “offenders” list.

30
Aug

Million Mark Hit!

We hit it, but the page doesn’t show it yet.

I did a very ugly thing to compensate for GeoCity Lite’s tendency to do nothing with an address it can’t find.  I ran an SQL statement on the entire database to fix the blank data.  These days it’s taking a long, long time.

It was a cheap hack, what can I say?  In the end it took less time to alter the GeoCity code.

So I rewrote the test-geo-city.c program that comes with the binary version of their database to spit out the values I want.  One more “clean” of the database and I can stop doing it.

Great, but right now it’s almost 4PM and the 2PM run hasn’t finished yet.

Also, I got a call from GoDaddy and they’re moving me to another server.  There may be some disruption in service.

The IS-1 suck is awesome.  A few bugs to work out but it’s running fine.  It appears they reset the file every now and then so I have to hack around that.

I plan to rewrite the main page on the Web site to reflect the fact that most of the data no longer comes from proxy lists.  The majority of all proxies in the database came from the Google Hack and the “Interesting Sites” found using it.

I am convinced now more than ever that all online proxy lists, with the exception of the Dinkster’s, are PURE CRAP.  They have nothing on me.  I am the Proxy King.

04
Aug

400K Mark Surpassed

Last Friday I didn’t think this would happen this month, but after the links2 bug was fixed Google Hack data started going up again.

“Interesting Site II” continues to supply working proxies as well.

Bahrain hasn’t shown up this week yet.

I’ve been looking all over hell for the speed bug but I can’t find it.  It’s starting to get annoying.

SOCKS is still on my mind.  After the SOCKS proxies are identified I can work on publishing my results.  After that I will probably put a fork in this project.  If the data keeps coming in – and I do believe it will finally dry up because this is the twilight of the open proxy – I may change the domain to proxyobession.com and put the whole thing into maintenance mode.

28
Jul

It must be Monday

Why?

Because the purge ran and one third of the proxies are gone.

Also, this is the first run with the fixed speed calculations.  Those are finally back to normal.  You may recall I upped the TIMEOUT to 45 seconds (from 30) last week.  Besides screwing up all the speed values it helped to add to the total proxy count.  There are a few agonizingly slow proxies listed in there but they’re not the majority and they might be of use to somebody, somewhere.

I have been using the DualCore AMD64 (my Mythbuntu system) all weekend for the Google Runs because it’s just so darned fast.  I can run about 70 database checks per second on it, even with the database on the other end of the network and on a VM.  I may in fact turn the TIMEOUT up to 60 seconds and start retesting some old data. 

You may well ask “How does slowing down a fast machine get more work done?”

It’s all in the forks, boys and girls.  The system can fork more processes even though they’re only waiting for a TIMEOUT.  The AMD64x2 has more RAM and more cycles to dedicate to that.  The VM can’t touch it in that regard. 
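
Here is a toy illustration of why more concurrent waiters means more work done.  It is only a sketch: it uses Python threads as a stand-in for forked processes, the "check" just sleeps instead of talking to a real proxy, the host names are fake, and the TIMEOUT is scaled way down from the 45 to 60 seconds discussed above.

```python
import time
from concurrent.futures import ThreadPoolExecutor

TIMEOUT = 0.1  # seconds; a scaled-down stand-in for the real 45-60 s timeout


def slow_check(target):
    """Stand-in for a proxy check that hangs until the timeout expires."""
    time.sleep(TIMEOUT)  # waiting costs almost no CPU
    return target, "timed out"


def run_checks(targets, workers):
    """Run slow_check on every target with `workers` concurrent waiters."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(slow_check, targets))
    return results, time.monotonic() - start


targets = ["host-%d" % i for i in range(20)]  # fake hosts, nothing is contacted
results_few, few = run_checks(targets, workers=2)
results_many, many = run_checks(targets, workers=20)
print(f"2 workers: {few:.2f}s, 20 workers: {many:.2f}s")
```

Every check still waits out the full timeout, but with 20 waiters in flight the whole batch finishes in roughly the time of a single timeout.  A box with more RAM can simply keep more of those waiters alive at once.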

In fact, I’m just about Googled out.  With the faster machine doing all the Google Hacking I’m getting more and more dry runs.  Of course, this whole business is cyclical (look at Bahrain for instance), so just taking a break for a day or two is probably a good thing.

At least one of my “Proxy Judge” sites decided it was a REAL proxy judge after all and changed its format to be helpful.  In so doing it turned itself into a worthless proxy judge, at least as far as I’m concerned.  As a result, there are more “Undefined” servers than is usual.  That site is going out of the proxy judge rotation permanently.

This week I will be taking a closer look at the “Undefined” sites to see if I can get rid of them once and for all.