Understanding User Behavior demonstrates a data pipeline that runs in Docker, with all required services coordinated via Zookeeper. This demo was presented as a final project for the course Fundamentals of Data Engineering in the MIDS program. Authors: Carolina Arriaga, Daniel Ryu, Jonathan Moges, Ziwei Zhao.
Because the business question shapes the design of the pipeline, we chose to begin our presentation with it. Our questions focused on trends in user behavior and internet host providers. To keep the pipeline simple and affordable, we chose daily batch processing of our events, since we would not expect strong deviations in host providers or user events from day to day.
Before we describe the pipeline, a note for those who wish to replicate it: run the described steps in a terminal rather than in a Jupyter notebook.
- Sourcing: Events enter our system through two Python files, “game_api_with_extended_json_events” and “event_generator”, run through Flask. The game_api file logs two main event types to Kafka: purchasing an item and joining a guild. These are the two main actions a user can take in the game. The event_generator takes the universe of use cases and serves as a proxy for daily user events. The file gives us the flexibility to create as many events as we want. The events are as flat as possible to reduce the number of data-munging steps.
- Ingesting: Events are ingested through Kafka, where we created a topic called “events.” Given the event structure and the fact that the “event_generator” file represents the universe of events, notifications were not needed and we used only one partition. The topic did not require additional brokers.
- Storing: Once all events have been created, we read the associated Kafka topic (events) and use the “filtered_writes” Python file to unroll the JSON and write the data to a Parquet file. This Parquet file is the starting point for the analysis needed to answer the business questions.
- Analysis: Analysis begins with PySpark. We also use NumPy, pandas, and Spark SQL to prepare the data to answer the business questions.
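
The flow through the stages above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the field names (`event_type`, `host`, `user_id`), the event type and host values, and the helper functions are all hypothetical, and an in-memory list stands in for the Kafka topic and the Parquet file so the sketch runs without any services.

```python
import json
import random
from collections import Counter

# Hypothetical event types and host providers; the real files may use other names.
EVENT_TYPES = ["purchase_item", "join_guild"]
HOSTS = ["provider-a.example", "provider-b.example"]


def generate_events(n, seed=0):
    """Stand-in for event_generator: emit n flat, JSON-encoded events."""
    rng = random.Random(seed)
    return [
        json.dumps({
            "event_type": rng.choice(EVENT_TYPES),
            "host": rng.choice(HOSTS),
            "user_id": rng.randint(1, 5),
        })
        for _ in range(n)
    ]


def unroll(raw_messages, keep=("purchase_item",)):
    """Stand-in for filtered_writes: decode each message, keep selected event types."""
    rows = []
    for raw in raw_messages:
        event = json.loads(raw)
        if event.get("event_type") in keep:
            rows.append(event)
    return rows


def count_by(rows, key):
    """Count rows per value of `key`, e.g. purchases per host provider."""
    return Counter(row[key] for row in rows)


messages = generate_events(100)   # plays the role of the "events" topic
purchases = unroll(messages)      # plays the role of the stored, filtered table
print(count_by(purchases, "host"))
```

Because the events are flat, unrolling is a single `json.loads` per message with no nested fields to expand, which is the benefit the Sourcing step describes.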