• 0 Posts
  • 34 Comments
Joined 1 year ago
Cake day: July 15th, 2023






  • I use an app called Recipe Keeper. It’s amazing because I just share the page to the app and it extracts the recipe without any nonsense, and now I have a copy for later if I want to reuse it. I literally never bother scrolling recipe pages because of how terrible they all are, and I decide in the app whether the recipe is one I want to keep.

    It also bypasses paywalls and registration requirements for many sites because the recipe data is still on the page for crawlers even if it’s not rendered for a normal visitor.
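
    That works because many recipe sites embed the recipe as schema.org structured data (JSON-LD) for search crawlers, whether or not it’s rendered for visitors. As a rough illustration (not how Recipe Keeper actually works, and with a placeholder URL), a sketch like this can pull the recipe straight out of the page source:

```python
# Sketch: extract schema.org Recipe JSON-LD from a page's HTML.
# The URL is a placeholder; real sites may nest the Recipe object or need
# different headers, so treat this as an illustration only.
import json
import urllib.request
from html.parser import HTMLParser

class JsonLdCollector(HTMLParser):
    """Collects the contents of <script type="application/ld+json"> tags."""
    def __init__(self):
        super().__init__()
        self.blocks = []
        self._in_jsonld = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks.append(data)

html = urllib.request.urlopen("https://example.com/some-recipe").read().decode("utf-8")
collector = JsonLdCollector()
collector.feed(html)

for block in collector.blocks:
    data = json.loads(block)
    items = data if isinstance(data, list) else [data]
    for item in items:
        if item.get("@type") == "Recipe":
            print(item.get("name"))
            print(item.get("recipeIngredient"))
            print(item.get("recipeInstructions"))
```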





  • I recently went through these exact pains trying to contribute to a project that exclusively ran through Discord and eventually had to give up when it was clear they would never enable issues in their GitHub repos for “reasons.”

    It was impossible to discover the history behind anything. Even current information was lost within days, and we ended up rehashing aspects that had already been investigated and decided upon.




  • It’s a really interesting question and I imagine scaling a distributed solution like that with commodity hardware and relatively high latency network connections would be problematic in several ways.

    There are several orders of magnitude between the population of people who would participate in providing the service and those who would consume the service.

    Those populations aren’t local to each other. In other words, your search is likely global across such a network, especially given the size of the indexed data.

    To put some rough numbers together for perspective, assuming search somewhere near Google’s scale:

    • A single copy of a 100PB index would require 10,000 network participants each contributing 10TB of reliable and fast storage.

    • 100K searches / sec, if evenly distributed and resolvable by a single node, would be at least 10 req/sec/node. Realistically it’s much higher than that, depending on how many copies of the index exist, how requests are routed, and how many nodes participate in a single query (probably on the order of hundreds). Of that 10TB of storage per node, a substantial amount would need to be kept in memory to sustain the hundreds of req/sec a node might see on average.

    • The index needs to be updated. Let’s suppose the index is 1/10th the size of the crawled data and the oldest data is 30 days old (which is pretty stale for popular sites). That’s at least 33PB of data to crawl per day, or roughly 3,000Gbps of minimum sustained data ingestion. Spread across those 10,000 nodes, each would need a connection on the order of 1Gbps just to keep the index fresh.

    These are all rough numbers but this is not something the vast majority of people would have the hardware and connection to support.
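
    To make that arithmetic easy to check, here’s a small back-of-the-envelope sketch. Every input is one of the assumptions from the bullets above (100PB index, 10TB per node, 100K searches/sec, a 10:1 crawl-to-index ratio, 30-day freshness), not a measured figure:

```python
# Back-of-the-envelope sketch of the numbers above; all inputs are assumptions.
INDEX_PB = 100                 # single copy of the index, in petabytes
STORAGE_PER_NODE_TB = 10       # storage each participant contributes
SEARCHES_PER_SEC = 100_000     # global query load
CRAWL_TO_INDEX_RATIO = 10      # crawled data is ~10x the index size
REFRESH_DAYS = 30              # oldest acceptable data

nodes = INDEX_PB * 1000 / STORAGE_PER_NODE_TB            # 100 PB = 100,000 TB
req_per_node = SEARCHES_PER_SEC / nodes                  # best case: one node answers a query alone
crawl_pb_per_day = INDEX_PB * CRAWL_TO_INDEX_RATIO / REFRESH_DAYS
crawl_gbps = crawl_pb_per_day * 1e15 * 8 / 86_400 / 1e9  # PB/day -> Gbps sustained
gbps_per_node = crawl_gbps / nodes

print(f"nodes needed for one index copy: {nodes:,.0f}")       # ~10,000
print(f"queries/sec/node (best case):    {req_per_node:.0f}") # ~10
print(f"crawl volume: {crawl_pb_per_day:.0f} PB/day ≈ {crawl_gbps:,.0f} Gbps sustained")
print(f"ingest per node: ~{gbps_per_node:.1f} Gbps sustained, i.e. roughly a 1Gbps link in practice")
```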

    You’d also need many copies of this setup around the world for redundancy and lower latency, and you’d want to protect the network against DDoS, abuse, and malicious participants. On top of that, you’d need some form of organizational oversight to support removal of certain data.

    Probably the best way to support such a distributed system in an open manner would be to have universities and other public organizations run the hardware and support the network (at a non-trivial expense).





  • I disagree. You should have validation at each layer, as it’s easier to handle bad inputs and errors the earlier they are caught.

    It’s especially important in this case with email because often one or more of the following comes into play when you’re dealing with an email input:

    • You’re doing more than sending an email (for ex, creating a record for a new user).
    • The UI isn’t waiting for you to send that email (for ex, it’s handled through a queue or some other background process).
    • The API call to send an email has a cost (both time and money).
    • You have multiple email recipients (better hope that external API error tells you which one failed).

    I’m not suggesting that validation of an email should attempt to be exhaustive, but a well thought-out implementation validates all user inputs. Even the API in this example validates the email you give it before trying to send through its own underlying API.

    Passing obvious garbage inputs down is just bad practice.
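
    As a concrete (and deliberately non-exhaustive) illustration, here’s roughly what that boundary check might look like in Python. The names register_user and create_user_record are hypothetical stand-ins for whatever your persistence and queueing layers actually do:

```python
# Sketch: reject obvious garbage at the boundary, before doing anything
# expensive (creating a record, queueing a send, calling a paid API).
# The shape check is intentionally loose; the mail provider is still the
# final authority on deliverability.
import re
import queue

# One "@", something on both sides, a dot somewhere in the domain.
_EMAIL_SHAPE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def looks_like_email(value: str) -> bool:
    return bool(_EMAIL_SHAPE.match(value.strip()))

def create_user_record(email: str) -> int:
    # Hypothetical persistence step.
    return hash(email) & 0xFFFF

def register_user(email: str, outbox: "queue.Queue") -> int:
    if not looks_like_email(email):
        # Fail fast here, while the UI is still attached to the request.
        raise ValueError(f"invalid email address: {email!r}")
    user_id = create_user_record(email)
    # The actual send happens later, in a background worker.
    outbox.put({"user_id": user_id, "email": email})
    return user_id

if __name__ == "__main__":
    register_user("someone@example.com", queue.Queue())   # accepted
    try:
        register_user("obvious garbage", queue.Queue())   # rejected up front
    except ValueError as err:
        print(err)
```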




  • You should get 33% more pay, as the full work force’s productivity would be 4/3 of the original in your example.

    This difference might be clearer with an example where only half of the work force is required to match the original productivity. In this case, if the full work force continues to work, productivity is presumably doubled. That’s not a 50% increase. It’s 200% of the original or a 100% increase. So the trade-off should be between 50% fewer working hours and 100% more pay.
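
    In general, if a fraction f of the work force can match the original output, the full work force produces 1/f of the original, i.e. a (1/f − 1) increase. A quick check of both cases:

```python
# If a fraction f of the work force matches the original output, everyone
# continuing to work yields 1/f of the original output.
def productivity_gain(fraction_needed: float) -> float:
    """Fractional increase in output if the full work force keeps working."""
    return 1 / fraction_needed - 1

print(f"{productivity_gain(3/4):.0%} more")   # 33% more: the "4/3 of the original" case
print(f"{productivity_gain(1/2):.0%} more")   # 100% more: the half-work-force example, not 50%
```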

    Of course, instead you’ll work the same hours for the same pay and some shareholders pocket that 100% difference.