I deployed an instance of this for my own use. Been using it as my daily search engine for a little over a month now. No complaints. One overlooked feature is that Whoogle supports DDG-style bangs!
If I just instruct the browser to delete sessions, cookies, localStorage, etc. after I close a Google search tab, would I still need to self-host Whoogle? This assumes I'll never log in to Google in that browser and that third-party cookies are disabled.
They can through browser fingerprinting, yes. Canvas/WebGL/fonts/IP/accelerometer/basically every other web API.
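As a toy illustration of how those signals become an identifier, a server can fold them into a single stable hash that survives cookie deletion. The signal names and values below are invented for the sketch, not any real tracker's schema:

```python
import hashlib

def fingerprint(signals: dict) -> str:
    """Combine browser-exposed signals into one stable identifier.
    Signal names here are illustrative, not a real tracker's schema."""
    canonical = "|".join(f"{k}={signals[k]}" for k in sorted(signals))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

# Two visits from identical hardware/software produce the same ID,
# even with cookies and localStorage wiped in between.
visit_a = fingerprint({
    "canvas_hash": "a91f03",  # rendering quirks of GPU/driver
    "webgl_renderer": "ANGLE (Intel UHD 620)",
    "fonts": "Arial,Calibri,Consolas",
    "screen": "1920x1080x24",
    "timezone": "America/Chicago",
})
visit_b = fingerprint({
    "canvas_hash": "a91f03",
    "webgl_renderer": "ANGLE (Intel UHD 620)",
    "fonts": "Arial,Calibri,Consolas",
    "screen": "1920x1080x24",
    "timezone": "America/Chicago",
})
assert visit_a == visit_b
```

That's why clearing browser state alone doesn't defeat this: the entropy lives in the hardware and configuration, not in stored data.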
> If I can...
You can! Install Firefox, Multi account containers (add-on by Mozilla), and Temporary Containers (third party addon). You can configure Temporary Containers to spawn a new container for every tab, or every google tab, etc. Each container is like a new browser session. It can clear the data of a closed container after 15 minutes.
What kind of settings? If you want per-site settings there is (was) uMatrix which allows for extremely granular configuration. Sadly it is discontinued right now. But it still works.
Let's say for example I'd like one container to have JavaScript disabled, and only enable the cookie autodelete add-ons. In another, enable JavaScript and some specific add-ons (like a password manager). Also, would be nice to be able to control cookie settings per container.
One question: let's assume I've added Firefox containers and instructed them to open every tab in a new container. If I open a link from a Google search result in a new tab (that is, inside a new container), can Google still trace the flow? The opened links point to Google rather than to the actual search result, so they may contain tracking info.
1) Install ClearURLs. This addon strips tracking identifiers from URLs. If you hover over a Google search link you'll see it doesn't direct you to website.com; it directs you to google.com, which then forwards you to the site you clicked. This addon removes that indirection.
2) Configure Temporary Containers to make a new container for every different subdomain or domain. This way, if you click a link from google search, regardless of using ClearURLs, a new container spawns for any domain/subdomain that does not match (ex: click netflix.com from google and TemporaryContainers identifies this and spawns a second tab for netflix). This makes some things impossible, like SSO, so configuring it properly can be tricky. You might be able to configure it such that only links clicked from google.com spawn a new container and those that redirect to sso sites don't, but I haven't done this. You can always open a private window where the context is shared (temporary containers don't work in private) if you need SSO.
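The redirect-unwrapping from step 1 can be sketched in a few lines. This is a simplified imitation of what ClearURLs does, not its actual rule set, and the tracking-parameter list is just a sample:

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

# A small sample of well-known tracking parameters (ClearURLs ships
# a much larger, per-site rule set).
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}

def clean(url: str) -> str:
    """Unwrap Google's /url redirect and drop common tracking parameters."""
    parsed = urlparse(url)
    # Google wraps result links as google.com/url?q=<real destination>&...
    if parsed.netloc.endswith("google.com") and parsed.path == "/url":
        url = parse_qs(parsed.query)["q"][0]
        parsed = urlparse(url)
    kept = [(k, v[0]) for k, v in parse_qs(parsed.query).items()
            if k not in TRACKING_PARAMS]
    return urlunparse(parsed._replace(query=urlencode(kept)))

print(clean("https://www.google.com/url?q=https://example.com/page%3Futm_source%3Dgoogle&sa=U"))
# → https://example.com/page
```

The same function handles plain links with tracking cruft, e.g. `clean("https://example.com/a?gclid=123&id=5")` keeps only `id=5`.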
Obviously there's more you have to do to be even safer, because with pings on by default and JS enabled on Google, they can still see you clicked a link. Also, with Google Analytics (GA), they can infer that someone searching "x" and then "another user" from the same IP fetching "x"'s GA tracking scripts a second later is the same person. The list goes on, and Google really likes tracking people, so it's very difficult to mitigate. The first and most important thing you can do is GET OFF CHROME/EDGE!
Whether it's evil or not depends on how it's used. Suppose that the top result for some common search is poor, but the second one is better, and this is visible to most users from the search result page. Everyone clicks link #2, hardly anyone clicks #1. That is valuable feedback and the search engine developer then knows that there's something wrong with the first result, and this can be determined without keeping any information on the original user. Often this happens when some clever SEO has caused the search engine to give a high rank to some stupid site.
Yes, Google can still recognize "you" for some variation of "you". Anecdote - the other day my wife searched for an address on her phone while on WiFi. I searched for that same address just one minute later on a different computer (on the same WiFi) and the address was auto-completed by Google before there was enough of the address entered to make it unambiguous.
(Consider living in a neighborhood where all the streets around you start with "Fl". And then you go to search for "Flanders Drive", which you have never searched for before, and it gets auto-completed. Even though you would have expected "Fl" to expand to "Florence Road" since that's the thing you commonly search for. That's what happened here.)
I did not understand the last sentence. If I solve the captchas, get the search results, and then do the cleanup, it'll just keep asking for captchas, and that's the only added pain, no? Will they be able to tell that all the requests are coming from the same user?
Yes. But it's going to keep asking you captchas until the user changes behavior (sample size of 1) :) (and of course the IP address too, like the other poster mentioned. You can try it yourself: search for insurance or anything with good money in it on your phone, then switch to desktop, assuming both are connected through the same router.)
I once had to solve thousands of captchas for an archiving project, and buster helped me with a quarter of the captchas (https://github.com/dessant/buster)
Instead of instructing your browser to delete cookies after you leave, why not instruct your browser to not accept cookies from such sites in the first place? I currently have google.com and youtube.com to "Never allow cookies from this site" in my browser.
I think that may be a bit of an unfair statement. They do note the following just above the list of public instances:
> Note: Use public instances at your own discretion. Maintainers of Whoogle do not personally validate the integrity of these instances, and popular public instances are more likely to be rate-limited or blocked.
Having tested with various search strings, the time to first search-result paint for the first public instance [1] feels just as fast as Google.
No. My self-hosted instance is about as fast as using Google directly. Maybe half a second slower, but barely noticeable if it is. I can't even tell.
This sounds too good to be true in practice. If I deploy it on a server and continuously query it from all of my devices, will Google ban that server IP?
I set this up for my family and me a few months ago and we set all our browsers and devices to use it as the default search engine. We've been really happy with it.
I also set up a small script on a cron job that queries random search strings every few minutes and opens the first few hits in selenium. My theory is that if I can't completely stop them from tracking us, I can at least dilute their data with bogus searches.
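The chaff generator half of such a script can be sketched with the stdlib alone; the word list, instance URL, and schedule are made up, and the Selenium portion is left as an untested comment:

```python
import random

# Toy "chaff" query generator: build plausible-looking queries from a
# small word list. The real script (per the parent comment) would feed
# these to a self-hosted Whoogle instance and click the top hits.
WORDS = ["weather", "recipe", "python", "hiking", "mortgage", "guitar",
         "camping", "laptop", "history", "garden", "insurance", "movie"]

def random_query(rng: random.Random, n_words: int = 2) -> str:
    """Pick a few distinct words and join them into a search string."""
    return " ".join(rng.sample(WORDS, n_words))

rng = random.Random(42)  # seeded only so the sketch is reproducible
queries = [random_query(rng) for _ in range(5)]
print(queries)

# With Selenium, driven by cron every few minutes (untested sketch,
# "whoogle.example" is a placeholder hostname):
#   driver.get(f"https://whoogle.example/search?q={query}")
#   for link in driver.find_elements(By.CSS_SELECTOR, ".result a")[:3]:
#       link.click()
```

Whether this actually dilutes a profile is debatable (uniformly random word pairs are themselves a detectable pattern), but it raises the cost of trusting any single query.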
I'm on a shared IP with millions of people using the same public IP (T-Mobile CGNAT), there is one IP (many of them, actually) doing that right now from every T-Mobile customer. Your one server will be a blip on their radar if it even registers.
Dude, thanks for letting me know that everyone uses Google.
I'm talking about static IPs like servers in the cloud, not your home. That stuff is automated and I am sure static IPs get banned, but there must be a quota or something.
Dude, CGNAT is handled at the ISP layer, I do not have an IPv4 address at all locally, it's a 464XLAT done on T-Mobile's side. All users come from a shared IPv4 on their network, not mine. Dude.
Dude - Each of those users behind the NAT will have a different set of cookies, user agents, screen sizes, among other fingerprints that qualify them as unique. ISPs also routinely place their CGNAT addresses on specific whitelists so that services don't block them for abuse (you can look through the NANOG mailing list to find examples of this). IP addresses are also classified as residential, cloud/server, etc. If Google sees rapid requests from the same IP classified as a server that's sending a Python Requests user-agent, they can absolutely block it.
Obviously I can't answer how long Google should block an abusive IP address since I'm not Google.
A CGNAT IP address is not reassigned to a home, it's shared among many homes. If you meant from my example the cloud server IP, that is one issue that comes up pretty often on cloud services and there's not a clean way around it.
For example I use Linode as my VPN server, I used to have all sorts of trouble with Google making me enter captchas or blocking my search just because of the abuse coming from the same IP range. I actually can't even login to some apps while on my VPN, and I've had this same IP address on my Linode for close to 10 years, so it's not an issue with my /32 specifically.
You'll see the same thing on AWS, many of those ec2 instances can't be used for sending email or for VPN services because previous users of the IP space abused their way onto block lists.
I get "you appear to be a robot" whenever I use my DigitalOcean box as an exit node. I'd imagine you'd have to host this at home or get really lucky there
Same here. The moment I switch it on and use Google I get that page, with no excessive searches or anything.
So, I set it up this weekend on a VPS and subdomain, and today it's blocked :(.
> About this page
> Our systems have detected unusual traffic from your computer network. This page checks to see if it's really you sending the requests, and not a robot. Why did this happen?
> IP address: 137.x.x.x
> Time: 2021-08-30T16:56:45Z
> URL: https://www.google.com/search?gbv=1&q=broken&lr=&hl=en&safe=off
Isn't normal usage scraping in this case? Looking at whoogle-search/request.py it's "scraping" google urls via the python requests module. I'm reasonably sure google fingerprints requests and assigns different weights for "probably scraping". I wouldn't be surprised if this has a lower threshold for triggering their captchas and/or blocking.
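One of the easiest fingerprints is the default User-Agent a Python HTTP client sends ("python-requests/2.x", "Python-urllib/3.x"). A client that wants to blend in sends browser-like headers instead; the header values below are illustrative, not copied from whoogle-search:

```python
import urllib.request

SEARCH_URL = "https://www.google.com/search?gbv=1&q=broken&hl=en"

# Browser-like headers. A bare urllib/requests fetch would instead
# announce "Python-urllib/3.x" or "python-requests/2.x" -- trivially
# flaggable. These values are made up for the sketch.
BROWSER_HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/92.0.4515.131 Safari/537.36"),
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
}

req = urllib.request.Request(SEARCH_URL, headers=BROWSER_HEADERS)
print(req.get_header("User-agent"))
# Note: headers alone don't make you a browser. Request pacing, TLS
# fingerprint, and header consistency still give a classifier plenty
# to work with, which fits the "lower threshold" theory above.
```

(No request is actually sent here; the `Request` object is just constructed to show the header shape.)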
I often trigger Google's captchas during normal usage. It seems suspicious of the more "advanced" operators like "intitle:" or "inurl:", or of searching too rapidly. I take being mistaken for a machine as a compliment!
That's understandable; a lot of exploit seekers use those operators to find vulnerable sites, e.g. "powered by [cms with known exploit]". Google (and Bing) are definitely more prone to showing you a captcha for those searches, especially if you're looking beyond the first page.