We hit it, but the page doesn’t show it yet.
I did a very ugly thing to compensate for GeoCity Lite’s tendency to do nothing with an address it can’t find. I ran an SQL statement on the entire database to fix the blank data. These days it’s taking a long, long time.
It was a cheap hack, what can I say? In the end it took less time to alter the GeoCity code.
So I rewrote the test-geo-city.c program that comes with the binary version of their database to spit out the values I want. One more “clean” of the database and I can stop doing it.
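For the record, the "fix the blank data" cleanup is roughly the kind of statement sketched below. This is only an illustration: the table name `proxies`, the column `country`, and the `Unknown` placeholder are assumptions, not the real schema, and SQLite stands in for whatever database the site actually runs on.

```python
import sqlite3


def fill_blank_geo(conn, placeholder="Unknown"):
    """Replace empty or NULL country values with a placeholder.

    The 'proxies' table and 'country' column are hypothetical names.
    Returns the number of rows that were changed.
    """
    cur = conn.execute(
        "UPDATE proxies SET country = ? WHERE country IS NULL OR country = ''",
        (placeholder,),
    )
    conn.commit()
    return cur.rowcount
```

A full-table pass like this gets slower as the table grows, which is why filling in the value at lookup time is the cheaper long-term fix.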
Great, but right now it’s almost 4PM and the 2PM run hasn’t finished yet.
Also, I got a call from GoDaddy and they’re moving me to another server. There may be some disruption in service.
The IS-1 suck is awesome. A few bugs to work out but it’s running fine. It appears they reset the file every now and then so I have to hack around that.
I plan to rewrite the main page on the Web site to reflect the fact that most of the data no longer comes from proxy lists. The majority of all proxies in the database came from the Google Hack and the “Interesting Sites” found using it.
I am convinced now more than ever that all online proxy lists, with the exception of the Dinkster’s, are PURE CRAP. They have nothing on me. I am the Proxy King.
I woke up this morning to find a new file on IS-1. I downloaded it and started banging on it.
An hour later I refreshed the page and the same file’s timestamp had changed. I never noticed this before so I’m starting to wonder whether it hasn’t done this all along. If so, this site has the richest supply of proxies on the Internet.
I’m at the limit of my processing power importing three files simultaneously on the AMD64x2 box, so I may have to enlist another VM if the file updates again today. Or I can just start stockpiling data and catch-as-catch-can.
-= UPDATE 12:00PM =-
I have implemented a check once every 15 minutes on this file and it appears it is refreshed every 30 minutes, like clockwork. It’s not a new file, but an update. The file always has about 250,000 proxies so I’ll need to hack out a diff to make this manageable.
-= UPDATE 1:15PM =-
I hacked out the diff. Using – surprise – diff!
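The same idea can be sketched in Python without calling diff at all: keep the previous snapshot, and only pass along the lines that were not in it. This is a set-difference stand-in for the diff approach, not the actual script, and the function name and the sample addresses are made up for illustration.

```python
def new_entries(previous_lines, current_lines):
    """Return entries in current_lines that are absent from previous_lines.

    Order of first appearance is preserved, and repeats within the
    current snapshot are reported only once.
    """
    seen = {line.strip() for line in previous_lines if line.strip()}
    fresh = []
    for line in current_lines:
        entry = line.strip()
        if entry and entry not in seen:
            seen.add(entry)
            fresh.append(entry)
    return fresh


if __name__ == "__main__":
    old = ["192.0.2.1:8080", "192.0.2.2:3128"]
    new = ["192.0.2.2:3128", "192.0.2.3:1080", "192.0.2.1:8080"]
    print(new_entries(old, new))  # ['192.0.2.3:1080']
```

Unlike a line-by-line diff, this does not care whether the file was reordered between refreshes, which matters if the 250,000 lines are not written in a stable order.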
This site just may max out my processing capabilities. Right now the page says we have 995,000 proxies, but we’ve probably already gone over a million.
The page updates are taking almost an hour with the extra data. The twelve o’clock run didn’t make it to the server until 12:46. I may have to look at that code. It checks the new proxies sequentially, and with a 45-second timeout on each one that can slow things down considerably. There must be some multitasking opportunities in there somewhere.
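One possible shape for that multitasking, sketched with a thread pool so that many 45-second waits overlap instead of queueing up. This is not the existing code: `port_is_open` is a hypothetical stand-in for the real proxy test (it only checks that a TCP connection succeeds), and the worker count of 32 is an arbitrary assumption.

```python
import socket
from concurrent.futures import ThreadPoolExecutor

TIMEOUT_SECONDS = 45  # the per-check timeout mentioned above


def port_is_open(host, port, timeout=TIMEOUT_SECONDS):
    """Hypothetical stand-in check: True if a TCP connection succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_all(entries, check=port_is_open, workers=32):
    """Run check(host, port) for every (host, port) pair concurrently.

    Returns a dict mapping each (host, port) tuple to the check result.
    """
    def run(entry):
        host, port = entry
        return entry, check(host, port)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(run, entries))
```

Run sequentially, 100 dead hosts at 45 seconds each cost 75 minutes; with 32 workers the same batch is bounded by roughly four rounds of timeouts, a little over three minutes.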
While I was playing catch-up with IS-1 (Interesting Site #1), they sent another update. I’m cranking on it right now. The One Million Mark is in sight and approaching faster than I thought.
This project is, in fact, turning into IS-1’s proxy list. That is, if they had a proxy list (they seem to be in the SPAM business). I’m not really keeping track, but at least two thirds of the database came from that site.
Thanks guys, whoever you are. I don’t agree with what you do but your data is primo!
In other news, I got my first Unknownian site. Turns out it’s in Argentina in a /15 CIDR block.
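For a sense of scale, a /15 leaves 17 host bits, so the block covers 2^17 addresses. A quick check with Python's standard `ipaddress` module, using `10.0.0.0/15` purely as a placeholder since the actual block is not given here:

```python
import ipaddress

# Placeholder network; the real Argentinian block is not stated in the post.
block = ipaddress.ip_network("10.0.0.0/15")

print(block.num_addresses)  # 131072
print(block[0], block[-1])  # 10.0.0.0 10.1.255.255
```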
The system is now stable. No surprises in the morning, no sudden lockups. Life is good.
I have been playing around with a new version of nmap that is very slick. You can read about it here if you are so inclined. You’ll have to compile it yourself if you want to check it out, but it’s worth the effort (you’ll need Subversion).
I woke up this morning and decided to check up on Interesting Site I (IS-1). Sure enough, there was a new file, dated today!
I downloaded it and let the AMD64x2 have at it.
It’s still running.
So far it’s added ~50,000 proxies to the database (with ~200 good ones so far). Even proxies with “weird” ports are turning out to be OK, so I may revisit my decision not to add the other 400,000 proxies in the other files from IS-1. Plus there are a lot of port 1080 systems in there, so that could be more grist for the SOCKS mill.
If I decide to do it, we should hit the million mark by the weekend or early next week.
Yesterday I was hacking away like a madman on the code all morning. I was on a serious roll, boys and girls. Then, just about 1PM, the power went out (yes, I do this all from the comfort of my own home) and stayed out until 5PM.
Extremely aggravating.
And to top it off the lease expired on the IP address I’ve had since… well… since the last power outage, whenever that was (I have my gateway box on a UPS, but it’s only good for about 90 minutes). So I lost half a day and then another hour and a half getting everything on the new IP address.
The good news is the 1G of RAM I took out seems to have stabilized the box running the system’s VM. If it lasts through the weekend I’ll probably put the 80G drive back in.
After all the fun with Interesting Site II, I went back to Interesting Site I to see if anything new (and/or “interesting”) had happened.
There was one new file, dated yesterday, with 70,000+ IP:port combinations in it.
Like the files before it, it had a lot of suspect ports, so I trimmed it down to the “usual” proxy and SOCKS ports (a little over a fifth – 16,000+ lines – of the file) and threw it at the database.
One hit.
One that I know of for sure. I got tired and went to bed.
The rest were mostly new entries, never before seen by the database.
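The trimming step mentioned above amounts to a simple filter on the port half of each IP:port line. A minimal sketch follows; the set of "usual" ports is an assumption, since the exact ports kept are not listed here.

```python
# Assumed set of common HTTP proxy and SOCKS ports; not the author's exact list.
USUAL_PORTS = {80, 1080, 3128, 8000, 8080}


def keep_usual_ports(lines, allowed=USUAL_PORTS):
    """Keep only well-formed host:port lines whose port is in `allowed`."""
    kept = []
    for line in lines:
        host, sep, port = line.strip().rpartition(":")
        if sep and host and port.isdigit() and int(port) in allowed:
            kept.append(f"{host}:{int(port)}")
    return kept


if __name__ == "__main__":
    sample = ["192.0.2.1:8080", "192.0.2.2:31337", "192.0.2.3:1080"]
    print(keep_usual_ports(sample))  # ['192.0.2.1:8080', '192.0.2.3:1080']
```

Lines without a colon or with a non-numeric port are dropped rather than raising an error, which suits a scraped file of uneven quality.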
IS-1 (Interesting Site I), you may recall, had over 460K total “proxies” in various text files, and over 14 million email addresses tucked away in two RAR archives.
Comparing IS-1 and IS-2, we can say they have at least one thing in common, besides stockpiling proxies:
They’re both up to No Good.
Considering the sheer volume of data both sites have contributed to this project, we can safely say that the people who run Proxy Lists in general are amateurs.
The 460K Random Run has completed – faster than I anticipated – and the results are in:
- CLOSED PORTS: 441,497
- DUPE ENTRIES: 5,532
- NEW PROXIES: 35
Is that pathetic or what? Of the new proxies most were end-user type DSL or cable systems in South America, Poland, or Spain (judging by the FQDN).
Here is the interesting part: the 441K hosts with “CLOSED” ports are live hosts. Maybe they were proxies last week. Maybe they’ll be proxies next week. Maybe they are simply IP addresses that have changed hands via DHCP.
This is also the reason it ran faster than I expected. It was programmed to bypass any testing on closed ports and just go to the next one.
I did a random sampling (nmap) of a few addresses and found – I hate to say it again – “interesting” results. One address was 100% filtered. The next had a single (non-proxy) port open. One had MySQL, VNC, NetBIOS, and HTTP ports open. That one smelled like a honeypot.
Very curious. And someone went to a lot of trouble to compile that list.
I ran across this unsort utility earlier in the week and it was perfect for the 460K Run (the proxy list from the “Interesting Site”).
Perfect because there were thousands of dupes and in order to get rid of them the file had to be sorted for unique entries. After it was sorted it was, well… sorted. Testing all those ports sequentially is simply bad form. It sends warning signs to both my ISP and the remote ISP that Something’s Up. Randomized, it’s just so much background noise to the remote ISP.
Locally it doesn’t really help with my ISP, but I’ve been doing this for three months and they don’t seem to care, although it is an exponential increase in activity.
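The sort-for-unique-then-unsort step can be sketched in a few lines of Python. This is a stand-in for the unsort utility, not a description of how it works internally; the function name and the optional `seed` parameter are illustrative additions.

```python
import random


def dedupe_and_shuffle(lines, seed=None):
    """Drop blank and duplicate lines, then return the rest in random order.

    `seed` is optional and only there to make a run repeatable.
    """
    unique = sorted({line.strip() for line in lines if line.strip()})
    random.Random(seed).shuffle(unique)
    return unique
```

Sorting the set first is what makes a seeded run reproducible, since set iteration order is not something to rely on.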
I figure this should take about 5000-7500 seconds running on the DualCore AMD64 box. It’s been running for about ten minutes and the vast majority of the ports are closed. The ones that are “open|filtered” (per nmap) are already in the database (whether they’re active or not, regardless of how they’re listed in the database, will be determined in a future run). So far there are no open ports I don’t already have, but this is just the beginning.
I have a feeling this is a meaningless exercise in futility, but I have to get this list behind me.
It’s an obsession.
Speaking of which I have recently registered the domain names proxyobsession.com and proxyobsession.net. If you go to either one you will end up at The List for now.
***UPDATE 10:40AM***
The 460K Random Run has been running smoothly for three hours and I’m getting approximately 3-4 new live, open proxies per hour. That may not seem like a lot (and it isn’t) but it’s about what I expected.
In total, the site I mentioned yesterday had over 450,000 proxies tucked away in text files. I thought a long time about adding that stuff to the database, but on closer inspection it looked like mostly junk.
I know. I’ve said it a hundred times. The database has mostly junk in it already. I just can’t see tripling the size of it with this particular junk. There are far too many oddball ports for my taste. And there are IPs with 4 or more different ports listed. No, it just doesn’t look right.
I had an idea to just run through it and find any and all open ports in that list and to Hell with the rest, so I cooked up some quick bash kiddie scripts and ran with it… for about five minutes. It simply ran too fast. That kind of activity throws up red flags, so I shut it down and backed off. But still… it’s tempting. If the numbers I’ve run across are any indication, there could be anywhere from 300 to 600 live proxies in all that mess. I may chop it down into smaller files and give it another whack sometime. A slow, leisurely, measured whack. Or rather, whacks. Spread out over a few months. Sounds like a weekend project.
The other interesting thing about that site was 14 million email addresses stuffed into .RAR archives (I didn’t count them but the filenames themselves indicated the total numbers).
OK, so we have:
An “abandoned” Web site with…
14 million email addresses, and…
nearly half a million proxy addresses
Hmmm… ya think maybe there was some spamming going on here?
Those half a million proxies could have been a rented bot army, which would account for the oddball port numbers. The bot theory is good because I randomly tested a handful and found live hosts with closed ports. And the ones I tested all had ISP type DNS names.
You certainly can find some peculiar things on the Intertubes!
Once upon a time, when I was doing manual, ad hoc Google grazing, I ran across a list – actually a single text file – with over 70,000 proxies in it. Well, those went into the database long ago but today’s Google Hack hit it again, so I had to kill the run and twiddle the hack so that it will ignore the site from now on.
70,000 proxies, at this point, is over a quarter of the total database and to process them would just be so much wheel-spinning, even with the recent performance enhancements (I didn’t say… ugh… tweaks).
But this time I took a closer look at the site. I can only say it’s… very, very interesting.
It looks like an abandoned Web site, but there are relatively fresh files just sitting there and most of them are text files full of proxies.
I think I will pull down the entire site and put the other VM to work on them to see what happens.
BTW, the Google Hack snarfed down about 120 Bahraini proxies before I killed it.