JazzyPotato

Big Data: How does data parsing happen in your company?

Hi community,

Context: At my current company, we have a data pipeline that works roughly like this: raw events from Kafka are dumped into S3. A batch job (Airflow) picks up the raw JSON files from S3 and applies our parser logic (a service written in Python where we explicitly define which attributes we want from the raw JSON, and those attributes are parsed accordingly). The parsed data is then converted to CSV/Parquet, written to another folder in S3, and later loaded into tables that are used for analytics, etc.

Problem: Today, for every new event we generate, we have to write parser logic from scratch if the event structure is different. For small changes, we can update the attributes we want to parse in the code itself, but we then have to deploy the change, which takes time. Is there a smarter way of doing this? For example, a UI where we select the attributes from the JSON (including nested attributes), which are then parsed and dumped into S3, with loading happening later. If we wanted to update a parser, we could do it from the UI itself rather than going into the code, updating things, deploying, etc.
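To make the idea concrete, here is a rough sketch of what I have in mind (this is not our actual code). The parser is driven by a mapping from output column to a dotted path into the raw JSON, so changing what gets parsed means editing the mapping rather than the code. The names `PARSER_CONFIG`, `get_path`, and `parse_event`, and the sample event, are all made up for illustration. In practice the mapping could live in a file or a table that a UI edits.

```python
import json
from typing import Any

# Hypothetical mapping: output column -> dotted path into the raw event.
# In a real setup this could be loaded from a file or database instead of
# being hardcoded, so it can change without a deploy.
PARSER_CONFIG = {
    "user_id": "user.id",
    "country": "user.address.country",
    "event_type": "type",
}


def get_path(event: dict, path: str, default: Any = None) -> Any:
    """Walk a dotted path through nested dicts; return default if any key is missing."""
    current: Any = event
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def parse_event(raw_json: str, config: dict) -> dict:
    """Extract only the configured attributes from one raw JSON event."""
    event = json.loads(raw_json)
    return {column: get_path(event, path) for column, path in config.items()}


# Made-up sample event, just to show the shape of the output row.
raw = '{"type": "signup", "user": {"id": 42, "address": {"country": "IN"}}}'
row = parse_event(raw, PARSER_CONFIG)
print(row)
```

Each parsed row is a flat dict, so it could be written out as CSV/Parquet the same way as today. This sketch only handles nested objects, not arrays or type casting, which a real tool would also need to cover.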

Are there any open-source alternatives here? Or any good engineering blogs that have covered this or a similar scenario?

26mo ago