Curious to know since you mentioned that it was fast for thousands of PDFs... any rough timing information on some of your queries for that kind of dataset?
I'm really reaching here to recall, but the short version is that actual searches never took more than a second. All I really cared about was how noticeable a delay to expect, and it was never more than that.
A bulk import of 1,000+ PDFs took a couple of minutes to ingest. This was all on a $20/month VPS.
It took me a long time to realize that you can see the articles related to a topic by clicking on a year; that instruction is hidden below the fold. It'd also be nice to see the actual numbers (along with the %) for each year on hover.
It'd also be useful to see a curated list of topics to select from, instead of just randomly picking some related topics on page load. Hopefully they harvest some interesting suggestions they receive through @nyt. There are some fun topic-suggestions in this thread already.
Overall, a nifty tool that I wish existed for all news sources in the world and worked across all languages.
We've had professors with over 350,000 emails and years of email history use the system, and it has handled that reasonably well. So, don't worry about the network being over-populated. We do limit the number of nodes that the network shows initially. You can adjust that later if you wish to do so.
As for your time counter of 20 mins, that is a bit odd. Has the first version of the network not loaded for you? If you are on the /viz page, just refresh it. If the server has fetched at least the first batch of metadata, then the corresponding network will load immediately, and the rest of the fetching will happen in the background. Please note that we are also having heavy traffic at the moment; so the fetching process could be a bit slow.
The 'Play' button is something we'd already implemented in a previous prototype, but it wasn't ready enough to be included in this release. It will show up in a future release though. :)
Also, by _slow_, were you referring to the actual rendering of the network when you adjusted the time-slider? Or the initial loading of the network itself? Because the former is actually pretty fast (if you are using Chrome), and the latter is something that we've tried to optimize as much as possible, but there are limitations.
The latter. Have you seen this tool? https://github.com/mbostock/d3/wiki/Force-Layout Try e.g. the 'build your own' demo, and add a few hundred nodes. There are quite a few demos out there using this layout with canvas, SVG, etc. that perform extremely well. Further in the past I've also seen some that offload the calculation onto a pool of Web Workers, some of them handling several thousand nodes (not all tightly connected, but still) without noticeably delaying the page. I can't find those at the moment, though :/
Without profiling, I'd guess the biggest speedup is from the Barnes-Hut approximation [1], which is simple enough conceptually that you might even be able to implement it by hand (if you have control over your current layout algorithm).
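For anyone curious what that looks like, here's a rough sketch of the Barnes-Hut idea in Python. To be clear, this is not d3's code and not whatever this tool runs; the names (`Quad`, `build_tree`, `repulsion`, `MIN_HALF`) are made up for illustration, and I'm assuming unit-mass nodes with a plain inverse-square repulsion. The trick is that a quadtree cell that is small relative to its distance from a node (width / distance < `theta`) gets treated as one lump at its center of mass, instead of summing over every node inside it.

```python
import math

MIN_HALF = 1e-9  # assumed cutoff: stop splitting so coincident points cannot recurse forever


class Quad:
    """One square quadtree cell, tracking its point count and coordinate sums."""

    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half
        self.mass = 0         # number of (unit-mass) points inside this cell
        self.sum_x = 0.0      # coordinate sums, used to derive the center of mass
        self.sum_y = 0.0
        self.point = None     # the single point held by an unsplit cell
        self.children = None  # four sub-cells once the cell has been split

    def _child_for(self, x, y):
        return self.children[(1 if x >= self.cx else 0) + (2 if y >= self.cy else 0)]

    def insert(self, x, y):
        if self.children is not None:
            self._child_for(x, y).insert(x, y)
        elif self.mass == 0:
            self.point = (x, y)
        elif self.half > MIN_HALF:
            # Second point arrives: split and push both points down one level.
            h = self.half / 2
            self.children = [Quad(self.cx + sx * h, self.cy + sy * h, h)
                             for sy in (-1, 1) for sx in (-1, 1)]
            px, py = self.point
            self.point = None
            self._child_for(px, py).insert(px, py)
            self._child_for(x, y).insert(x, y)
        # Otherwise the cell is too small to split; it just aggregates.
        self.mass += 1
        self.sum_x += x
        self.sum_y += y

    def force_on(self, x, y, theta):
        """Repulsive force on (x, y) from this cell's points (unit masses, 1/d^2)."""
        if self.mass == 0:
            return 0.0, 0.0
        dx = x - self.sum_x / self.mass
        dy = y - self.sum_y / self.mass
        dist2 = dx * dx + dy * dy
        # Leaf, or far enough away (width / distance < theta): use the center of mass.
        if self.children is None or (2 * self.half) ** 2 < theta * theta * dist2:
            if dist2 < 1e-12:  # the node itself, or points sitting exactly on it
                return 0.0, 0.0
            scale = self.mass / (dist2 * math.sqrt(dist2))
            return dx * scale, dy * scale
        fx = fy = 0.0
        for child in self.children:
            cfx, cfy = child.force_on(x, y, theta)
            fx += cfx
            fy += cfy
        return fx, fy


def build_tree(points):
    """Build a quadtree over a square that bounds all points (with a small margin)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    half = max(max(xs) - min(xs), max(ys) - min(ys)) / 2 + 1e-6
    root = Quad((max(xs) + min(xs)) / 2, (max(ys) + min(ys)) / 2, half)
    for x, y in points:
        root.insert(x, y)
    return root


def repulsion(points, theta=0.5):
    """Approximate repulsive force on every point; theta=0 falls back to the exact sum."""
    root = build_tree(points)
    return [root.force_on(x, y, theta) for x, y in points]
```

You'd call `repulsion(points)` once per layout tick and add the result to the spring forces from the edges. Larger `theta` means more lumping (faster, less accurate); with `theta=0` it degrades to the exact all-pairs sum, which is a handy way to sanity-check it.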
edit: hm, and it's noticeably quicker in Chrome than Firefox. That's a little surprising, given that I'm running the nightly build, and it's usually competitive :/
edit2: ah, you are using that. strange... I wonder where the slowdown is.
That's the first we've heard of that error. When the user logs out, they are also presented with the link to revoke access via Gmail. Sorry you weren't able to get to that page. If you want to make sure that your data is deleted, we can do a manual delete of your metadata (if it exists on our server) for you. Just write to us at the address on the website.