
Big Data: How does data parsing happen in your company?
Hi community,
Context: In my current company, we have a data pipeline which, in short, works like this: raw events from Kafka are dumped into S3. A batch job (Airflow) picks up the raw JSON files from S3 and applies our parser logic. The parser is a service written in Python where we explicitly define which attributes we want from the raw JSON, and those attributes are parsed accordingly. The parsed data is then converted to CSV/Parquet and dumped into another folder in S3, and later loaded into tables that are used for analytics, etc.
Problem: Today, for every new event we generate, we have to write the parser logic from scratch if the event structure is different. For small changes we can update the attributes we want to parse in the code itself, but after that we have to deploy the change, which takes time. Is there a smarter way of doing this? For example, a UI where we select the attributes we want from the JSON (including nested attributes), and those are parsed and dumped into S3, with loading happening later as it does now. If we want to update a parser, we could do it from the UI itself rather than going into the code, updating things, deploying, etc.
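To make the idea concrete, here is a minimal sketch of a config-driven parser, assuming the parser definition is stored as plain JSON that maps output columns to dotted paths in the raw event. The spec layout, the event and field names, and the helpers `get_path` and `parse_event` are all hypothetical, made up for illustration; this is not our actual service.

```python
import json

# Hypothetical parser spec: output column -> dotted path into the raw event.
# Assumption: this JSON could live in S3 or a database and be edited from a UI,
# so changing the parsed attributes would not need a code deploy.
SPEC_JSON = """
{
  "event": "order_created",
  "fields": {
    "order_id": "order.id",
    "user_email": "user.contact.email",
    "first_item_sku": "order.items.0.sku"
  }
}
"""


def get_path(obj, path, default=None):
    """Walk a dotted path through nested dicts and lists."""
    current = obj
    for key in path.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        elif isinstance(current, list) and key.isdigit() and int(key) < len(current):
            current = current[int(key)]
        else:
            # Missing attribute: fall back to the default instead of failing.
            return default
    return current


def parse_event(raw_event, spec):
    """Return one flat row containing only the attributes named in the spec."""
    return {
        column: get_path(raw_event, path)
        for column, path in spec["fields"].items()
    }


spec = json.loads(SPEC_JSON)
raw = {
    "order": {"id": 42, "items": [{"sku": "A-1", "qty": 2}]},
    "user": {"contact": {"email": "a@example.com"}},
}
row = parse_event(raw, spec)
print(row)
```

With something like this, the batch job would only need to load the right spec for each event type, and a new event or a changed attribute would be a spec edit rather than a code change.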
Are there any open-source alternatives here? Or any good engineering blogs that have covered this or a similar scenario?
Try asking in this subreddit: https://www.reddit.com/r/dataengineering/s/mcFo9ng1t0

Noted. Thanks buddy.

In simple words, I want an abstraction over the raw data I wanna parse, one that makes things language-agnostic.