JazzyPotato

BigData - How does data parsing happen in your company?

Hi community,

Context: In my current company, we have a data pipeline that, in short, works like this: raw events from Kafka are dumped into S3. An Airflow batch job picks up the raw JSONs from S3 and applies our parser logic (a service written in Python where we explicitly define which attributes we want from the raw JSON, and those attributes are parsed accordingly). The parsed data is then converted to CSV/Parquet, dumped into another S3 folder, and later loaded into tables used for analytics, etc.
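
Roughly, each parser looks something like this (a minimal sketch; the event and field names are made up for illustration, not our actual schema):

```python
# One hand-written parser per event type: any new event or schema
# change means editing code like this and redeploying the service.
def parse_order_created(raw: dict) -> dict:
    return {
        "order_id": raw["payload"]["order"]["id"],
        "amount": raw["payload"]["order"]["amount"],
        "user_email": raw["payload"]["user"]["email"],
    }
```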

Problem: Today, for every new event we generate, we have to write parser logic from scratch if the event structure is different. For small changes we can update the attributes we want to parse in the code itself, but after that we have to deploy the changes, which takes time. Is there a smarter way of doing this? For example, a UI where we select the attributes from the JSON (including nested attributes), and the events are parsed and dumped into S3 accordingly, with loading happening later. And if we want to update the parser, we could do so from the UI itself instead of going into the code, updating things, deploying, etc.
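
One way I'm imagining it (just a sketch, with made-up column names and paths): keep the attribute selection in a spec that lives outside the code, e.g. a JSON file in S3 or a DB row that a UI writes to, and have a single generic parser walk it, so spec edits don't need a deploy:

```python
# Hypothetical spec, stored outside the code (S3/DB, editable from a UI).
# Keys are output column names; values are dot-paths into the raw event.
SPEC = {
    "order_id": "payload.order.id",
    "user_email": "payload.user.email",
    "city": "payload.user.address.city",
}

def get_path(obj, path, default=None):
    """Walk a dot-separated path through nested dicts."""
    for key in path.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj

def parse_event(raw: dict, spec: dict) -> dict:
    """Project a raw event onto the columns named in the spec."""
    return {col: get_path(raw, path) for col, path in spec.items()}

raw = {"payload": {"order": {"id": 42},
                   "user": {"email": "a@b.com",
                            "address": {"city": "Pune"}}}}
print(parse_event(raw, SPEC))
# {'order_id': 42, 'user_email': 'a@b.com', 'city': 'Pune'}
```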

Do we have any open-source alternatives here? Or any good engineering blogs that have covered such/similar scenarios?

23mo ago
PeppyBanana
JazzyPotato
Amazon · 22mo

Noted. Thanks buddy.

JazzyPotato
Amazon · 22mo

In simple words, I want to have an abstraction over the raw data I want to parse, and to make things language-agnostic.
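
For example (a sketch, not what we run today), the spec could be plain JSONPath strings, which any language can evaluate; in Python, something like the jsonpath-ng library would do:

```python
from jsonpath_ng import parse  # pip install jsonpath-ng

# Because the spec is just strings, it can live in a DB or UI and be
# evaluated from Python now, or from any other runtime later.
SPEC = {
    "order_id": "$.payload.order.id",
    "item_skus": "$.payload.order.items[*].sku",
}

raw = {"payload": {"order": {"id": 42,
                             "items": [{"sku": "A1"}, {"sku": "B2"}]}}}

row = {}
for col, path in SPEC.items():
    values = [m.value for m in parse(path).find(raw)]
    row[col] = values[0] if len(values) == 1 else values

print(row)  # {'order_id': 42, 'item_skus': ['A1', 'B2']}
```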
