An Analytics Data Pipeline To Get You Started
Basic Data pipeline to start with
When your product is launched, you are expected to collect analytics of the usage of various features. It’s important for the entire company to have access to analytics to refine the existing features and predict the behavior of the users.
In the beginning, you might ask yourself “How do others build their data pipelines?”. You will come across many solutions that are being used in organizations having tons of users and a lot of traffic. It’s hard to implement and incorporate these designs when you are just starting with few users. And it does not make sense to spend so many resources when you need something simple.
The main problems are:
- Data ingestion from multiple sources.
- Data processing.
- Which queues to use?
- Which data warehouse to use?
- Visualization of the data
Proposed Architecture
The overall pipeline has the following building blocks
Events from various data sources: Your primary goal is to gather all the events happening in the system. These events can come from
- transaction logs of the DBs — the changes in the database entries and new entries in the database.
- events from API hits.
- events coming from the phone/web apps.
Data Processing: All the events which are gathered in the previous step are processed and then added to a queue. Different types of data sources require different types of processing. Here are my preferences:
- To get changes from MySQL DB, reading binlogs is a great idea. There are many open-source programs that can read the changes happening to the data in the database.
- Queries might be required in some cases to pull the data out from the DB.
- Few of the events might need data mapping to a format in which you would want this data to be stored in the warehouse.
- Clients also have very valuable data which they send via events. These events can be gathered and transformed using an HTTP service.
Message Queues: Queues are used to solve the producer-consumer problem. There are many options, here are the popular ones:
- Kafka — Apache Kafka is an open-source distributed event streaming platform that is recommended for high workloads.
- RabbitMQ — One of the most widely used message brokers or queue managers.
- SQS — If you don’t have a large number of events AWS SQS is a good option to consider. It’s a fully managed service and light on your pockets.
- and many more…
Batch Processing of Messages: In this step, the messages from the queue are picked up in a batch, processed, and stored in the data warehouse and the backup store (like S3). Recommended options are
- Pandas: Python library that is great for data manipulation and analysis.
- Apache Spark: Spark is an analytics engine for large-scale data processing. If the number of messages is high, Spark is highly efficient.
Data Warehouse: There are a bunch of options. It entirely depends on which cloud platform you are using, on the query interface they provide, and on your preferences. A few of the top used warehouses are:
- Amazon Redshift
- Snowflake
- Google BigQuery
- IBM Db2 Warehouse
- Microsoft Azure Synapse
- Oracle Autonomous Warehouse
- SAP Data Warehouse Cloud
Charts, Visualization, and Analysis: Finally, you need a dashboard where the product and business teams get to see detailed analytics for their analysis and decision making. There are many tools that connect to your database and provide various kinds of charts, widgets, and visualization ready to use.
- Tableau
- Domo
- GoodData
- Looker
- Cluvio
- Metabase: Free software which you host yourself. Has a lot of options for charts and visualization, supports cache and hot reload. The only limitation is it does not support R.
Wrapping it Up
There are tons of options for various elements in the data pipeline, you need to choose based on your current requirements. Similarly, there are various designs catering to different workloads. This scheme can get you started. As the data grows, upgrades of components will be required.