Whoogle Search: A self-hosted, ad-free, privacy-respecting metasearch engine (github.com/benbusby)
201 points by wsc981 on Aug 27, 2021 | 48 comments


I deployed an instance of this for my own use. Been using it as my daily search engine for a little over a month now. No complaints. One overlooked feature is that Whoogle supports DDG-style bangs!


Side question:

If I can just instruct the browser to delete sessions, cookies, localStorage, etc. after I close a Google search tab, would I still need to self-host Whoogle? Assume I'll never log in to Google in that browser and third-party cookies are disabled.

Or, can Google still recognize me?


They can through browser fingerprinting, yes: canvas, WebGL, fonts, IP, accelerometer, basically every other web API.
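For a concrete illustration of the canvas part, here's a minimal sketch driven from Python via Selenium (assuming Firefox and geckodriver are installed; real trackers run the JS directly in-page, this just demonstrates the idea):

    # Minimal canvas-fingerprint sketch: render fixed text to a canvas and
    # hash the resulting pixels. GPU, driver, OS, and font differences make
    # the hash semi-unique per machine.
    import hashlib
    from selenium import webdriver

    JS = """
    const c = document.createElement('canvas');
    const ctx = c.getContext('2d');
    ctx.textBaseline = 'top';
    ctx.font = '14px Arial';
    ctx.fillText('fingerprint test', 2, 2);
    return c.toDataURL();
    """

    driver = webdriver.Firefox()
    try:
        driver.get("about:blank")
        data_url = driver.execute_script(JS)
        print(hashlib.sha256(data_url.encode()).hexdigest())
    finally:
        driver.quit()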

> If I can...

You can! Install Firefox, Multi-Account Containers (an add-on by Mozilla), and Temporary Containers (a third-party add-on). You can configure Temporary Containers to spawn a new container for every tab, every Google tab, etc. Each container is like a new browser session, and it can clear the data of a closed container after 15 minutes.


I love Firefox containers. One feature I've been waiting on for ages is per-container security settings.


What kind of settings? If you want per-site settings there is (was) uMatrix, which allows for extremely granular configuration. Sadly it's discontinued now, but it still works.


Let's say, for example, I'd like one container to have JavaScript disabled and only the cookie auto-delete add-on enabled; in another, JavaScript enabled along with some specific add-ons (like a password manager). It would also be nice to control cookie settings per container.


As a malicious JS researcher, ^THIS^!


Thanks for the info!

One question: let's assume I've added Firefox containers and instructed them to open every tab in a new container. If I open a link from the Google search results in a new tab (inside a new container), can Google still trace the flow? The opened links point to Google rather than to the actual result, so they may contain tracking info.


You should do 2 things to mitigate this:

1) Install ClearURLs. This add-on strips tracking identifiers from URLs. Without it, hovering over a Google search result shows the link doesn't point to website.com; it points to google.com, which then forwards you to the site you clicked (see the sketch after this list).

2) Configure Temporary Containers to make a new container for every different domain or subdomain. This way, when you click a link from Google search, regardless of ClearURLs, a new container spawns for any domain/subdomain that doesn't match (e.g., click netflix.com from Google and Temporary Containers detects this and spawns a second tab for Netflix). This makes some things impossible, like SSO, so configuring it properly can be tricky. You might be able to configure it so that only links clicked from google.com spawn a new container while redirects to SSO sites don't, but I haven't done this. You can always open a private window where the context is shared (Temporary Containers don't work in private browsing) if you need SSO.
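For what it's worth, here's a rough sketch of what "de-googling" a result link involves, assuming the classic https://www.google.com/url?q=<target>&... wrapper format (ClearURLs itself is a browser add-on; this just illustrates the idea):

    # Unwrap a Google redirect link to recover the real destination.
    from urllib.parse import urlparse, parse_qs

    def unwrap_google_redirect(href):
        parsed = urlparse(href)
        if parsed.netloc.endswith("google.com") and parsed.path == "/url":
            target = parse_qs(parsed.query).get("q")
            if target:
                return target[0]  # real destination; tracking params dropped
        return href

    print(unwrap_google_redirect(
        "https://www.google.com/url?q=https://example.com/page&sa=U"))
    # -> https://example.com/page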

Obviously there's more you have to do to be even safer, because with hyperlink-auditing pings on by default and JS enabled on Google, they can still see that you clicked a link. Also, with Google Analytics (GA), they can infer that someone who searches for "x" and "another user" on the same IP who fetches x's GA tracking scripts a second later are the same person. The list goes on, and Google really likes tracking people, so it's very difficult to mitigate. The first and most important thing you can do is GET OFF CHROME/EDGE!


Yes. There are add-ons that de-googlify the links, though. It's such an evil practice they've introduced.


Whether it's evil or not depends on how it's used. Suppose that the top result for some common search is poor, but the second one is better, and this is visible to most users from the search result page. Everyone clicks link #2, hardly anyone clicks #1. That is valuable feedback and the search engine developer then knows that there's something wrong with the first result, and this can be determined without keeping any information on the original user. Often this happens when some clever SEO has caused the search engine to give a high rank to some stupid site.
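As a toy illustration of that kind of aggregate feedback (a hypothetical sketch, not any search engine's actual pipeline), per-position click counts suffice, with no user identity attached:

    # Toy rank-quality signal: flag queries where result #2 out-clicks
    # result #1. Only (query, clicked_position) pairs are stored; nothing
    # about the users themselves. Purely illustrative.
    from collections import Counter, defaultdict

    def flag_misranked(click_log, min_clicks=100):
        per_query = defaultdict(Counter)
        for query, position in click_log:  # position is 1-indexed
            per_query[query][position] += 1
        return [q for q, counts in per_query.items()
                if sum(counts.values()) >= min_clicks
                and counts[2] > counts[1]]

    log = [("seo spam query", 2)] * 70 + [("seo spam query", 1)] * 30
    print(flag_misranked(log))  # -> ['seo spam query']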


Assuming the Whoogle server has a fixed IP address, doesn't that provide Google enough on its own to fingerprint you?


Install Firefox and then install the Google Container extension [1]. It keeps all your Google related stuff separate from the rest of the world.

[1] https://addons.mozilla.org/en-US/firefox/addon/google-contai...


Do containers get around the fingerprinting issue though?


Yes, Google can still recognize "you" for some variation of "you". Anecdote - the other day my wife searched for an address on her phone while on WiFi. I searched for that same address just one minute later on a different computer (on the same WiFi) and the address was auto-completed by Google before there was enough of the address entered to make it unambiguous.

(Consider living in a neighborhood where all the streets around you start with "Fl". And then you go to search for "Flanders Drive", which you have never searched for before, and it gets auto-completed. Even though you would have expected "Fl" to expand to "Florence Road" since that's the thing you commonly search for. That's what happened here.)


If you consistently do that, you will start hitting the captcha wall of hell. Google gets its pound of flesh one way or the other.


I did not understand the last sentence. If I solve the captchas, get the search results, and then do the cleanup, it'll just continuously ask for captchas, and that's the only added pain, no? Will they be able to conclude that all the requests are coming from the same user?


Yes. But it's going to keep asking you captchas until the user changes behavior (sample of 1) :) (and of course the IP address matters too, like the other poster mentioned; you can try it by searching for insurance or anything with good money in it on your phone and then switching to desktop, assuming both are connected through the same router).


It might even temporarily block you and put out the message that they are seeing suspicious "bot" activity from your device.


IP and other fingerprinting techniques are enough to identify you.


I once had to solve thousands of captchas for an archiving project, and Buster helped me with about a quarter of them (https://github.com/dessant/buster).


Instead of instructing your browser to delete cookies after you leave, why not instruct it not to accept cookies from such sites in the first place? I currently have google.com and youtube.com set to "Never allow cookies from this site" in my browser.


The available public instances [1] seem to take 5-6 seconds to return a search results page.

Is this to be expected with self-hosted too?

[1] https://github.com/benbusby/whoogle-search#public-instances
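If you want to measure it yourself, here's a quick sketch (assuming Whoogle's /search?q= endpoint; this measures full response time, not time-to-first-paint):

    # Rough latency check against a public instance from the list in [1].
    import time
    import requests

    INSTANCE = "https://whoogle.sdf.org"  # one of the listed instances

    start = time.monotonic()
    resp = requests.get(f"{INSTANCE}/search", params={"q": "test"},
                        timeout=30)
    print(f"{resp.status_code} in {time.monotonic() - start:.2f}s")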


I think that may be a bit of an unfair statement. They do note the following just above the list of public instances:

> Note: Use public instances at your own discretion. Maintainers of Whoogle do not personally validate the integrity of these instances, and popular public instances are more likely to be rate-limited or blocked.

Having tested with various search strings, the time to first search-result paint for the first public instance [1] feels just as fast as Google.

[1] https://whoogle.sdf.org/


No. My self-hosted instance is about as fast as using Google directly. Maybe half a second slower, but barely noticeable if it is; I can't even tell.


Same; I just threw it up on my home lab box, switched my default search, and don't notice a difference.


What's the difference from SearX [1]? SearX seems more popular and also supports other search engines besides Google.

[1]: https://github.com/searx/searx


It's mentioned in their FAQ [1]: less config, easier to set up.

[1]: https://github.com/benbusby/whoogle-search#faq


This sounds too good to be true in practice. If I deploy it on a server and continuously query it from all of my devices, will Google ban that server IP?


I set this up for my family and me a few months ago, and we set all our browsers and devices to use it as the default search engine. We've been really happy with it.

I also set up a small script on a cron job that queries random search strings every few minutes and opens the first few hits in Selenium. My theory is that if I can't completely stop them from tracking us, I can at least dilute their data with bogus searches.

We haven't had any issues from Google.
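For anyone curious, a rough sketch of what such a decoy-search job can look like (the Whoogle URL, word list, and selector here are illustrative placeholders, not the commenter's actual script; assumes Firefox and geckodriver are installed):

    # Query the Whoogle instance with random terms and visit a few hits,
    # diluting any per-IP search profile with noise. Run from cron.
    import random
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    WHOOGLE_URL = "http://localhost:5000"  # hypothetical instance address
    WORDS = ["weather", "recipe", "python", "news", "hiking", "guitar"]

    def decoy_search(hits_to_open=3):
        driver = webdriver.Firefox()
        try:
            query = "+".join(random.sample(WORDS, 2))
            driver.get(f"{WHOOGLE_URL}/search?q={query}")
            links = [a.get_attribute("href") for a in
                     driver.find_elements(By.CSS_SELECTOR, "a[href^='http']")]
            for url in links[:hits_to_open]:
                driver.get(url)  # open a few results to look realistic
        finally:
            driver.quit()

    if __name__ == "__main__":
        decoy_search()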


I guess it will be just like Searx: https://searx.me/

If an IP generates too many search queries, Google will block or throttle it…


I'm on a shared IP with millions of other people using the same public IP (T-Mobile CGNAT); there is one IP (many of them, actually) doing exactly that right now on behalf of every T-Mobile customer. Your one server will be a blip on their radar, if it even registers.


Dude, thanks for letting me know that everyone uses Google.

I'm talking about static IPs like servers in the cloud, not your home. That stuff is automated, and I'm sure static IPs get banned, but there must be a quota or something.


Dude, CGNAT is handled at the ISP layer. I don't have an IPv4 address at all locally; it's 464XLAT done on T-Mobile's side. All users come from a shared IPv4 on their network, not mine. Dude.


Dude, each of those users behind the NAT will have a different set of cookies, user agents, screen sizes, and other fingerprints that qualify them as unique. ISPs also routinely place their CGNAT addresses on specific whitelists so that services don't block them for abuse (you can look through the NANOG mailing list to find examples of this). IP addresses are also classified as residential, cloud/server, etc. If Google sees rapid requests from an IP classified as a server that's sending a Python Requests user agent, they can absolutely block it.
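A toy version of that last heuristic (purely hypothetical; nobody outside Google knows their actual logic):

    # Hypothetical server-side check: datacenter IP + bot-like user agent
    # + high request rate => block. Illustrative only.
    BOT_AGENT_PREFIXES = ("python-requests/", "curl/", "Go-http-client/")

    def looks_like_scraper(ip_class, user_agent, reqs_per_minute):
        if ip_class != "residential" and reqs_per_minute > 30:
            return True
        return user_agent.startswith(BOT_AGENT_PREFIXES)

    print(looks_like_scraper("cloud/server", "python-requests/2.26.0", 50))
    # -> True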


How long should Google block the IP? When the ISP reassigns it to a new home, that home would also be blocked from Google.


Obviously I can't answer how long Google should block an abusive IP address since I'm not Google.

A CGNAT IP address is not reassigned to a home; it's shared among many homes. If you meant the cloud-server IP from my example, that's an issue that comes up pretty often on cloud services, and there's no clean way around it.

For example, I use Linode as my VPN server. I used to have all sorts of trouble with Google making me enter captchas or blocking my searches just because of the abuse coming from the same IP range. I actually can't even log in to some apps while on my VPN, and I've had this same IP address on my Linode for close to 10 years, so it's not an issue with my /32 specifically.

You'll see the same thing on AWS: many of those EC2 instances can't be used for sending email or for VPN services because previous users of the IP space abused their way onto block lists.


I get "you appear to be a robot" whenever I use my DigitalOcean box as an exit node. I'd imagine you'd have to host this at home or get really lucky there

It happens the moment I switch it on and use Google; no excessive searches or anything.


So, I set it up this weekend on a VPS with a subdomain, and today it's blocked :(.

    About this page

    Our systems have detected unusual traffic from your computer network. This page checks to see if it's really you sending the requests, and not a robot. Why did this happen?

    IP address: 137.x.x.x
    Time: 2021-08-30T16:56:45Z
    URL: https://www.google.com/search?gbv=1&q=broken&lr=&hl=en&safe=off


Scraping, yes. Normal usage, no.


Isn't normal usage scraping in this case? Looking at whoogle-search/request.py, it's "scraping" Google URLs via the Python requests module. I'm reasonably sure Google fingerprints requests and assigns different weights for "probably scraping". I wouldn't be surprised if this has a lower threshold for triggering their captchas and/or blocking.
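Roughly the kind of request involved, judging from request.py and the blocked URL quoted upthread (the User-Agent value here is an illustrative placeholder; Whoogle's actual headers may differ):

    # Fetch Google's basic-HTML results page (gbv=1) with requests, the
    # same module Whoogle uses. When Google decides this is a bot, it
    # answers with a captcha/"unusual traffic" page instead of results.
    import requests

    resp = requests.get(
        "https://www.google.com/search",
        params={"gbv": "1", "q": "broken", "hl": "en", "safe": "off"},
        headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"},
        timeout=10,
    )
    print(resp.status_code, "unusual traffic" in resp.text)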


I often trigger Google's captchas during normal usage. It seems to be suspicious of the more "advanced" features like "intitle:" or "inurl:", or of searching too rapidly. I take being mistaken for a machine as a compliment!


That's understandable; a lot of exploit seekers use those features to find exploits, e.g. "powered by [cms with known exploit]". Google (and Bing) are definitely more prone to showing you a captcha for those searches, especially if you're looking beyond the first page.



Looks like the public instance at https://whoogle.sdf.org is hacked or corrupted, since it returns different results from the other public instances.


If you don't like the "meta" in the title, have a look at https://yacy.net.


It's missing an "I'm Feeling Lucky" button :).

And shareable links for search results, or did I miss that?


I just deployed a version for personal use, let's see how it works out!



