My theory is that Google lost the spam war. All the leetcode in the world won't teach someone to build an ML model that can tell whether a website is spam. So Google outsourced the spam detection problem -- they heavily bias their results towards only the most popular sites, who either have human moderators or paid contributors, and those sites do the spam management that Google's automated approach is incapable of.
Google's recency bias in search results hurts them too. There are many older resources which are still valid out there, but you won't find them (easily) via Google Search. Instead, you get the SEO spam which doesn't match the search as well, but is newer.
All of the Google sections seem to be terrible for their use case. Images is lacking filters and is filled with Pinterest results that send people in circles when they try to get to the actual image to save. News will sometimes show old news when it's obvious there should be new information about whatever you're searching, forcing you to explicitly sort by date. Videos will prioritize certain domains even if the video itself is irrelevant to your search (e.g. searching for an actress, it'll show an IMDb video first, even if the video is very old and is just a generic trailer for some movie she was in ages ago). If you search something relatively generic, the new search bubbles will hide the other categories (e.g. searching for Adele hides categories like shopping or books). The finance option literally just redirects you to Google Finance now, and doesn't even retain your search.
They've really let the core search experience deteriorate so extensively that we can't blame all of it on SEO.
I haven't given it much thought until just now, but it really is surprising how terrible Google's image search is. There are so many images on the Internet, yet the image search often fails to retrieve good examples even on straightforward searches. Never mind difficult searches, higher-resolution images, or wanting to quickly download an image. It was never very good, and it's gotten perceptibly worse.
Side-thought: I have at times wished Google had a time machine function, for example, "show me the results that this search would have returned in 2008".
It would open up a whole new world of chronological meta analysis; a new dimension of cross-referencing.
[I understand that Google has, or had, some limited incarnation of this in the form of its Time Range search, but - without looking deeply into it - I suspect that is an algorithmically different procedure than the one I am describing. That is, I expect it applies a simple filter to a current search, rather than being The Actual Old Search Results]
there have been many times when i'm searching for something, i'll go into "Tools" and limit results to 1990-2010, and find exactly what i'm after among a handful of results, vs. a ton of blogspam and other crap hiding it if i don't
Is Google biased, or are its users biased? If someone is looking up something technical or diagnosing a problem, the most recent post on Stack Overflow would be more relevant. Old stuff can be important, but if users' searches are mostly for recent stuff - what's the point?
I have some old articles on my blog that always decrease in readership. I do a pass every year, change a paragraph here and there, change the alias/url, and then get 5x the readership for a few months. Rinse and repeat. Most of my traffic comes from Google. Also, most of my content is in such a specific niche that it doesn’t really age much in 5 years.
So recency bias kinda works against people discovering useful content. But whatever. Google is in decline anyways.
I just ran a Google search on some code I was working on 15 years ago, posted to blogspot.com. It was about card game X, using programming language Y. The relevant keywords appear in the blog post. However, it completely failed to show up in a Google search. Just... nothing, even when I scrolled through the "more results" stuff. There were relevant and more recent links to the same topic (and likely better implemented, but let's leave that aside).
I did find my page on a Bing search after getting it to not ignore an important keyword, so that's something.
There are way too many Google searches where the best result was the top result 10 years ago, and should still be the top result. Instead it's something from this year that has no information.
Googling is now like talking to that person who repeats your question back to you as a statement, as if they added to the conversation. The results are some social media manager whose job it is to make "content" by writing filler.
Content is clearly being auto-generated. Even for the new Zelda game: try searching anything and it's tons of pages that are all extremely verbose, with a consistent grammatical style, and not actually saying anything useful.
Most people aren't looking up techniques for working with cutting-edge software. Usually, they're looking for almost anything else.
I think it's user bias. It's shockingly hard to get Google to default to docs for Java 17 over Java 8. Users want 8, so Google serves that even if it isn't as new.
Can you give an example? For many things (software, hardware, buying things,...) older resources might still be valid but dated. For more settled subjects, e.g. encyclopedic, I also have not had Google be worse than Bing.
Google results didn't decline to become worse than Bing; Google results declined irrespective of and without reference to Bing. Google results are far worse than Google results once were.
Doesn't help that they just give answers to a query - answers that are often very wrong. Then, instead of more results, they show related questions, which are often nonsensical.
This is not even getting into censorship of any political thing that google employees dislike.
It is more than the spam war...a lot of content is just no longer produced and exposed to the open web (think about how much content goes into tiktok, discord, etc and you will never get that into your search results). Google has less useful content to index, algorithms can't fix that. There is more spam only because that's the only open content that gets added massively.
The winners of this battle will be the places where content is generated (or curated) - and Reddit is perhaps the most important content hub (it's not just an aggregator anymore - comments about some news can often be more interesting than the news itself). Indexers (and language models) are useless without content to scrape.
I can't believe that. There are still so many personal blogs by real people. They just don't show up in the first 3 pages unless you query for them very specifically.
Seems most queries result in something like 30% SEO spam sites, 30% quora, 30% reddit, 10% other.
Edit: I don't disagree that Discord/Youtube/Other closed gardens have taken open searchable data away, but it's not like there's now no authentic searchable data at all. Perhaps Google also needs to learn to search those closed gardens better.
Google flourished because it could find forums (and blogs) and mine those, but much of that content has disappeared into Facebook and Discord (and YouTube - we must not discount how many things that would have been easily parseable blogs are now buried in livestreams and videos).
Discord is probably the worst of all. I'm not a gamer, and I hate that a lot of tech content is now locked behind private Discord channels. Even Facebook is more discoverable than that.
Even when you're already on Discord, searching and trying to read old conversations is awful, because that's not at all what Discord was made to do.
So I've been working on a side project to make the content of a YouTube channel I watch more discoverable through text. I've had great results by scraping the YouTube transcription and running it for a few passes through GPT 3.5 with some prompts that essentially tell it to act as an editor. The original transcription was often terrible in spots - whole phrases or multiple words mistranscribed throughout. For almost all of them, GPT 3.5 was able to clean them up and restore the original meaning by understanding the context of the monologue and fixing obviously incorrect words or phrases.
I've watched through a sample of about 20 of the 3,000 videos I'm working through, and the corrected transcription really did an amazing job of restoring the original meaning of the spoken words, which was hard to get from the original machine transcription.
That is exactly where LLMs are useful. (People thinking of them as "AI", meaning AGI, is just so wrong. Writing legal briefs??) Using them ex post facto to adjust transcripts in order to make them available and searchable is great.
>we must not discount how many things that would have been easily parseable blogs are now buried in livestreams and videos
on the flip side, would that content have been created at all if the creators weren't financially motivated by streaming/video to provide it?
there's a lot of discussion here about internet communities, but this comment raises the question of why blogs started to die down to begin with. At least with Reddit you get clout if you share stuff (useless clout, but sometimes you just want a pat on the back).
Blogs are parallel to research papers in a sense. They're useless without peer review unless you're already intimately familiar with the source material and able to critically evaluate the contents.
So blogs are more useful when they're aggregated through a site like Reddit, where users have already done the vetting on whether the linked page is valuable. Reddit comments add invaluable context to pages, noting when the content has become dated or inaccurate due to external changes, etc. Sites like Brian Krebs's blog are the exception, as the author is well known and respected. But the general blogs? It takes time to earn that community respect.
Then beyond that, how often have you gone on the hunt for something obscure only to run across 3 or more blog pages which look entirely unique, but have the exact same article pasted into them? It isn't that the contents are bad/wrong/inaccurate, but rather: who do you trust? How much effort are you going to put into finding which blog was the original, written by the expert, and which ones are bots copying the info?
>where users have already done the vetting on whether the linked page is valuable.
and ironically enough, if you post your own blog on reddit to be critiqued, there's a good chance it is removed for "self promotion". Funny how that "vetting" works, huh? So you get back to "how do I make my blog discoverable so it can be peer reviewed" and we're at square 1 again.
>How much effort are you going to put in to finding which blog was the original, written by the expert and which ones are bots copying the info?
A lot, if it's important. Because as it is, I already have to do that muckraking on Reddit to see who is trying to understand (or even read) the article and who just wants to soapbox their tangential pet rant. Tracing a source back is child's play in comparison.
For me YouTube is always at the top. Instead of the text pages where I can read the answer in a few seconds, Google pushes me their video platform, probably in the hope of making money. I am logged in, so I do not understand how those geniuses working at Google would think that videos in a language I do not know might be more relevant than text content.
>For me YouTube is always at the top. Instead of the text pages where I can read the answer in a few seconds, Google pushes me their video platform, probably in the hope of making money.
To be fair, I have the same problem with Duck.
I wish I could blacklist sites from my search results. YouTube and Pinterest are not helpful for the things I look for.
How great is your wish? If you host your own instance of Whoogle, which gives Google search results, you can set one of the environment variables to block particular websites from search results.
yeah, the GP really reads like it was regurgitating the notes of someone who attended an internal Googs meeting on why they rank newer content higher as their mantra
>a lot of content is just no longer produced and exposed to the open web (think about how much content goes into tiktok, discord, etc and you will never get that into your search results)
I see this all the time when trying to find information about old computers. So many of the good vintage computing resources are locked in social services or mailing lists that the information never shows up in search engines.
It feels a lot like the days when information was balkanized between AOL, GEnie, CompuServe, American PeopleLink, Delphi, etc.
Search engines were supposed to fix that and make all the world's information discoverable. They didn't.
There certainly is content. Often it's the pages I could find two years ago but now cannot.
That's because the web is full of juvenile sub-normie content such as geeksforgeeks (if you consider programming topics, for example). It overshadows the very specific queries with highly SEO'd juvenile stuff.
I think this causes further problems as well. These big companies know they can easily be listed at the top of Google, and therefore they pump out low quality articles for every popular key phrase. They can write a "Top 10 X of 2023" list and bring in lots of traffic and referral sales, while a smaller site has no chance of doing the same. Then, the big site takes all that income, and pumps out more low quality articles. It's a rich get richer scenario.
If I search, "The top 10 movies on Netflix", I get sent to this page...
Is this really the best the world has to offer for my search query? I could spend a week putting together a far better page for that query, but what's the point? It's not going to rank anywhere notable in the Google results. There's no incentive for me to produce that quality content.
The real problem isn't that Google can't detect what spam is; it's that this spam drives so much of their traffic, and thus their revenue, that they cannot remove it.
Google at this point is literally teaming up with spammers. Just go on an Android phone, swipe left from the home screen to the "Google" feed, and count how many clicks it takes to get to a 100% AI-generated spam page.
We have all those posts kvetching about how small email providers have become collateral damage in the arms race against email spam. I don’t think anyone is truly successful in that space.
The issue is that ML is very expensive compared to spam generation. And the moment another search engine becomes popular, then the spam cannons will be used against it instead.
for those of us without an android device to try this, care to enlighten us? the way it is described, it sounds like a lot of action is required, which seems like a good thing, but your point seems to be that it shows goog's results are bad. i'm really not following your point
This shows a bunch of articles based on your search history - many of which are incredibly poor quality and appear to be generated in some way.
I tried to curate mine for a bit and thumbs down useless sites or content I don't care about. I didn't really succeed in making this feed helpful so I mostly ignore it.
Google appears to think that I have one sole interest, which is The Lord of the Rings. And that I want to read endless poorly researched and sensationally headlined articles on differences between the novel and the films. It never recommends me anything else in that section.
I thought it was only me who thought that. Those articles are so much worse than, say, the Google News app. Once in a while there's something relevant, but it's always clickbait stuff.
I wonder if Google wants to prioritize video to try to address this spam problem.
Sometimes when I search a topic all I get is video in the natural results, not a video search, and I suspect I'm part of an A versus B test.
I personally don't find it to be a compelling solution and click away, because I'm not willing to sift through ten 10-minute videos to see if there is relevant content.
Google lost the spam war when Matt Cutts left. Google effectively lost its community outreach, and simultaneously pivoted to aggressive moat protection through AMP, Chrome, Jedi Blue, etc. It’s less that the spammers won and more that Google lost their aim at high-quality content.
For sure. They threw in the towel around '08 or '09. The prior years of yo-yoing between front page results containing ~60% webspam and 0% webspam settled into a permanent 60%, and they evidently just heavily downranked and/or started ignoring low-traffic sites to keep it from growing to 100%.
Their shift from text-only "ethical" ads to being a more-ordinary web ad service, and putting ads inline with results, roughly corresponds with that, IIRC, which probably isn't a coincidence. Your ad-"results" probably get more clicks if much of the rest of the first result page is crap. Many of the webspammers probably funnel money to Google one way or another now. It screwed up their incentives to keep fighting that battle, I'd guess, which is likely part of why they stopped putting so much effort into it.
Expecting Google's code to flawlessly understand the truth and trustworthiness of all possible sources for every conceivable question, in a fair/unbiased way, for all topics past, present, and future, is a very high bar. This is a really hard problem, and unlike many other hard problems (rocket science, neurosurgery, particle physics, etc.), Google has millions of financially motivated adversaries actively trying to trick them.
Google won the spam war; they were just fighting for the opposite side. I know it's trite, but it bears repeating: Google only cares about user satisfaction if it positively impacts their bottom line.
As long as they can monetize the remaining users more than what they lose from users abandoning search, the trend will continue.
I mean, I think you're mostly right, but to be fair to Google, it's an incredibly adversarial environment. Thousands of people have made it their full-time job to try to trick Google into thinking their blogspam is good content.