GigglyPanda

zepto tech is facing really bad tech debt

stumbled on zepto's blog post about their postgres infra

here's why they're facing really bad tech debt:

  • db connections are maintained at the application layer (with 3 workers per pod); a setup like this should ideally have a proxy service that maintains the connection pool

  • i've no experience with managed databases, but shouldn't there be pre-configured connection thresholds and alerts if you rely on such services? they didn't have even the most rudimentary alerts for the database?

  • "as latency increased, new relic started fetching more EXPLAIN plans": this inherently looks like an incorrect implementation on new relic's end (a service they must be paying thousands of $/month for); rather, it should fire alerts when the explain queries are failing (which is still not fixed)

all of their metrics are scattered (these are the ones mentioned in the blog post; who knows how many they actually use):

  • "AWS Aurora with Performance Insights"
  • "RDS Postgres CloudWatch metrics"
  • "New Relic"

how could one even dig through so many platforms during downtime or escalations? it must be a nightmare working with this system, at the very least

they call this system "high-throughput", but you're hardly doing a million deliveries per day. scale is subjective, but if you genuinely feel that way, you need to do a reality check

1mo ago
ZoomyNarwhal

This isn’t tech debt, this is tech bankruptcy. Even New Relic’s confused about who’s debugging who

GigglyPanda

true lol, i'm genuinely surprised

that they can publicise these facts

let alone have those issues in the first place

SparklyCupcake

That's right

SillyPanda

Every young company is built like that. You cannot scale as quickly as they have while also ensuring good software practices. As they mature as a company, they'll fix a lot of things. Also, the things you've mentioned are subjective choices and not necessarily bad tech debt.

  • Nothing wrong with maintaining db connections on the application layer. Every application can maintain its own connection pool and scale accordingly. A central proxy service can be more efficient in managing connections but it can also become a bottleneck and a single point of failure.
  • Correct but I'm not sure if they have proper alerts or not.
  • that's a tall claim. New Relic in my experience has been an excellent service and it's hard to say if their implementation is wrong.
  • that is common for many orgs. It is ideal to have your metrics in a centralised place but it's just a minor inconvenience that devs get used to fairly easily.
GigglyPanda

zepto isn't a young company at this point, it's been 4 years now and 300+ folks in engineering

scaling quickly means that you create processes, and that those processes run in an automated manner

unless you follow this, you end up with crumbling, disaggregated components (meaning: today one service faces an issue and that team identifies and fixes it for themselves; tomorrow another team faces the same issue, and the cycle continues)

you're right when you say there's nothing wrong with managing db connections at the application layer, but if you check the blog post, every pod has only 3 gunicorn workers (and 10 connections), and they peak at over 5k connections overall. based on this, do you still think managing connections at the application layer makes sense?
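a rough back-of-the-envelope sketch of those numbers, in Python. the 3 workers, 10 connections per pod, and the 5k peak are the figures quoted above; the pod count is derived here (it is not stated in the blog post), and the idea that each worker runs at most one query at a time assumes synchronous gunicorn workers, which is my assumption:

```python
import math

# Figures quoted in this thread; "over 5k" is rounded down to 5000.
PEAK_CONNECTIONS = 5000
CONNS_PER_POD = 10
WORKERS_PER_POD = 3


def estimate_pods(peak_connections: int, conns_per_pod: int) -> int:
    """Pods needed to account for the peak, if every pod holds a full pool."""
    return math.ceil(peak_connections / conns_per_pod)


def max_busy_connections(pods: int, workers_per_pod: int) -> int:
    """Upper bound on connections in use at once.

    Assumption (not from the blog post): each worker is synchronous and
    therefore runs at most one query at a time.
    """
    return pods * workers_per_pod


pods = estimate_pods(PEAK_CONNECTIONS, CONNS_PER_POD)
busy = max_busy_connections(pods, WORKERS_PER_POD)
idle = PEAK_CONNECTIONS - busy

print(f"estimated pods: {pods}")                 # 500
print(f"connections that can be busy: {busy}")   # 1500
print(f"connections held but idle: {idle}")      # 3500
```

under those assumptions, roughly 70% of the connections the database holds open at peak cannot be doing work at the same time, which is the gap a shared pooling proxy is meant to close.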

FluffyWaffle

I'm starting out in backend dev and would love some stack-agnostic book recommendations covering the conceptual lessons of the past 2 decades, to help me build that mindset. I was originally an iOS developer, so recommendations on everything from the Internet to SQL to deployment are welcome! @Micheal_Scott if you could lend me one of your BE devs for some guidance, it’d be quite helpful

JumpyPretzel

As for the EXPLAIN part, why even enable that in the first place? That itself was a huge mistake, so they are right to disable it. I wonder why that is even a feature in New Relic. Long-running queries should just get printed in logs and can be analysed by devs later on; DB timeouts beyond a certain number should trigger email and/or messaging alerts. But I didn’t understand the point of triggering an EXPLAIN query.
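A minimal sketch of the "timeouts beyond a certain number should alert" idea, in Python. The class name, the threshold, and the window length are all hypothetical, and this is not how New Relic or Zepto implement it; it only counts timeouts in a sliding time window and reports when the limit is crossed, leaving the actual email or messaging call to the caller:

```python
import time
from collections import deque
from typing import Deque, Optional


class TimeoutAlert:
    """Counts DB timeouts in a sliding window and flags when a limit is crossed.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """

    def __init__(self, max_timeouts: int, window_seconds: float) -> None:
        self.max_timeouts = max_timeouts
        self.window_seconds = window_seconds
        self._events: Deque[float] = deque()

    def record_timeout(self, now: Optional[float] = None) -> bool:
        """Record one timeout; return True if the window now exceeds the limit."""
        now = time.monotonic() if now is None else now
        self._events.append(now)
        # Drop timeouts that have fallen out of the window.
        while self._events and now - self._events[0] > self.window_seconds:
            self._events.popleft()
        return len(self._events) > self.max_timeouts


if __name__ == "__main__":
    alert = TimeoutAlert(max_timeouts=2, window_seconds=60.0)
    for t in (0.0, 1.0, 2.0):
        if alert.record_timeout(now=t):
            # Replace with a real email/messaging hook.
            print(f"ALERT: too many DB timeouts at t={t}")
```

The caller decides what "alert" means; passing `now` explicitly keeps the sketch easy to test without waiting on a real clock.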

GigglyPanda

i doubt you can do without it

explain queries are for runtime profiling and optimisation of queries, although you need to keep them in check so they don't lead to mishaps

ZoomyBagel

Digging between platforms would be simpler if they had trace spans enabled along with flame graphs.

DB conns at the app layer are ok in some cases. For an org, it might not be ok past a certain critical mass
