Introduction.
Building scalable real-time data pipelines is crucial for businesses that need to process and analyze large volumes of data as it streams in. Docker, a platform that allows developers to package applications and their dependencies into containers, has become an essential tool for managing complex systems. By utilizing Docker, you can create isolated, consistent environments for each component of a real-time data pipeline, enabling scalability and flexibility.
In this blog, we will explore how Docker can streamline the development and deployment of real-time data pipelines. We’ll cover the key steps for setting up a scalable pipeline, from containerizing data processing tools to integrating them into a microservices architecture. By the end, you’ll have a clearer understanding of how Docker enables faster, more efficient data processing, without compromising on performance or reliability.
Whether you are working with real-time analytics, streaming data, or complex event processing, Docker can help you manage workloads more efficiently. This article will also discuss best practices, common pitfalls, and how to optimize Docker containers for real-time data workflows. Let’s dive into the world of scalable real-time data pipelines with Docker!
STEP 1: Create an EC2 instance running Amazon Linux 2.

STEP 2: Connect to the instance over SSH.
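The steps that follow assume the Docker Engine is already installed on the instance. If it is not, a minimal sketch for Amazon Linux 2 (the distribution used in this guide) looks like this:
sudo amazon-linux-extras install -y docker
sudo systemctl enable --now docker
docker --version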
- Install Docker Compose.
sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
docker-compose --version

STEP 3: Create the project directory.
mkdir real-time-data-pipeline
cd real-time-data-pipeline

STEP 4: Create the docker-compose.yml file.
nano docker-compose.yml
version: '3.8'
services:
  zookeeper:
    image: wurstmeister/zookeeper:3.4.6
    ports:
      - "2181:2181"
  kafka:
    image: wurstmeister/kafka:2.13-2.7.0
    ports:
      - "9092:9092"
      - "9094:9094" # OUTSIDE listener, for clients connecting from the host
    environment:
      # The two listeners must use different ports: INSIDE is for containers on the
      # Compose network, OUTSIDE is for clients on the host.
      KAFKA_LISTENERS: INSIDE://0.0.0.0:9092,OUTSIDE://0.0.0.0:9094
      KAFKA_ADVERTISED_LISTENERS: INSIDE://kafka:9092,OUTSIDE://localhost:9094
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INSIDE:PLAINTEXT,OUTSIDE:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: INSIDE
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
    depends_on:
      - zookeeper
  producer:
    build: ./producer
    restart: on-failure # retry if the producer starts before Kafka is ready
    depends_on:
      - kafka
  consumer:
    build: ./consumer
    restart: on-failure # retry if the consumer starts before Kafka is ready
    depends_on:
      - kafka
  db:
    image: postgres:13
    environment:
      POSTGRES_USER: user
      POSTGRES_PASSWORD: password
      POSTGRES_DB: my_database # This creates the database
    ports:
      - "5432:5432"
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana-storage:/var/lib/grafana
    depends_on:
      - db
volumes:
  grafana-storage:
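Before moving on, it is worth validating the file; docker-compose config parses the YAML and prints the resolved configuration, so indentation or syntax mistakes surface immediately:
docker-compose config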


STEP 5: Create the producer service's Dockerfile.
mkdir producer
nano producer/Dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt
COPY producer.py .
CMD ["python", "producer.py"]


STEP 6: Create the producer's requirements.txt file.
nano producer/requirements.txt
kafka-python


STEP 7: Create the producer.py file.
nano producer/producer.py
from kafka import KafkaProducer
import time
import random
import json

producer = KafkaProducer(
    bootstrap_servers='kafka:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

while True:
    data = {
        'sensor_id': random.randint(1, 100),
        'temperature': random.uniform(20.0, 30.0)
    }
    producer.send('sensor_data', data)
    print(f'Sent: {data}')
    time.sleep(2)
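The producer only starts running once the whole stack is brought up later with docker-compose up. At that point its output can be followed with the command below; the restart: on-failure policy in the Compose file re-launches it if it happens to start before Kafka is ready.
docker-compose logs -f producer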


STEP 8: Create the consumer service's Dockerfile.
mkdir consumer
nano consumer/Dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt
COPY consumer.py .
CMD ["python", "consumer.py"]


STEP 9: Create the consumer's requirements.txt file.
nano consumer/requirements.txt
kafka-python
psycopg2-binary


STEP 10: Create the consumer.py file.
nano consumer/consumer.py
from kafka import KafkaConsumer
import psycopg2
import json

# Set up the Kafka consumer to listen to the 'sensor_data' topic
consumer = KafkaConsumer(
    'sensor_data',
    bootstrap_servers='kafka:9092',
    value_deserializer=lambda v: json.loads(v)
)

# Establish a connection to the PostgreSQL database
# (the default "postgres" database; the my_database created by Compose
# is used later for the Grafana sample data)
conn = psycopg2.connect(
    dbname="postgres",
    user="user",
    password="password",
    host="db"
)
cur = conn.cursor()

# Create the table for storing sensor data if it doesn't exist
cur.execute('''CREATE TABLE IF NOT EXISTS sensor_data (
    sensor_id INT,
    temperature FLOAT
);''')
conn.commit()

# Process messages from the Kafka topic and write each one to PostgreSQL
for message in consumer:
    data = message.value
    cur.execute(
        "INSERT INTO sensor_data (sensor_id, temperature) VALUES (%s, %s)",
        (data['sensor_id'], data['temperature'])
    )
    conn.commit()
    print(f"Inserted: {data}")



STEP 11: Install the PostgreSQL 13 client on Amazon Linux 2.
sudo yum update -y
sudo amazon-linux-extras install -y postgresql13
sudo yum install -y postgresql13 postgresql13-server
psql --version


STEP 12: Start the pipeline.
- Add your user to the docker group so docker-compose can run without sudo, then bring the stack up from the real-time-data-pipeline directory:
sudo usermod -aG docker $USER
newgrp docker
docker-compose up -d
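A quick sanity check that all six services (zookeeper, kafka, producer, consumer, db, grafana) came up:
docker-compose ps
docker-compose logs --tail=20 producer consumer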





STEP 13: Access PostgreSQL Container.
docker ps
docker exec -it <container_id> /bin/bash
psql -U user -d my_database
Create a table inside the database (this is the table Grafana will chart; it is separate from the sensor_data table the consumer writes to in the default postgres database):
CREATE TABLE sensor_data (
    id SERIAL PRIMARY KEY,
    timestamp TIMESTAMPTZ NOT NULL,
    temperature FLOAT,
    humidity FLOAT
);
Insert sample data into the table:
INSERT INTO sensor_data (timestamp, temperature, humidity) VALUES
    (NOW(), 22.5, 60.0),
    (NOW() - INTERVAL '1 hour', 21.0, 65.0);
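To confirm the sample rows landed, the same query can also be run from the host through Compose (from the real-time-data-pipeline directory):
docker-compose exec db psql -U user -d my_database -c "SELECT * FROM sensor_data ORDER BY timestamp DESC;"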




STEP 14: Configure Grafana to connect to this database.
- Open http://<EC2_PUBLIC_IP>:3000 in your browser.
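If the page does not load, make sure port 3000 is open in the instance's security group; Grafana's health endpoint can also be checked from the instance itself:
curl -s http://localhost:3000/api/health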

STEP 15: Log in with the default credentials (admin / admin) and set a new password.

STEP 16: In the left-hand menu, select Connections and click Add new connection.

STEP 17: Search for PostgreSQL and select it.

STEP 18: Click Add new data source.

STEP 19: Configure the data source: set the Host URL to db:5432 and the Database name to my_database (matching the db service in docker-compose.yml), enter the Username "user" and Password "password", and set TLS/SSL Mode to "disable".
- Click Save & test.




STEP 20: Create Grafana Dashboards.
- Click the + icon in the top-left corner and select New dashboard.
- Select Add visualization.


STEP 21: Select data source as PostgreSQL.

STEP 22: Enter the SQL query below and give the panel a title.
SELECT
    timestamp AS "time",
    temperature,
    humidity
FROM
    sensor_data
ORDER BY
    timestamp DESC
LIMIT 10;

STEP 23: Save the panel, and you will see the dashboard.

Conclusion.
In conclusion, Docker offers a powerful and efficient way to build and scale real-time data pipelines. By leveraging its containerization capabilities, developers can create isolated, consistent environments for each component, making it easier to manage complex data workflows. Docker’s flexibility and scalability are key when handling large volumes of streaming data, ensuring high performance and reliability.
As we’ve seen, Docker simplifies the deployment of real-time data processing tools, streamlining integration and reducing potential issues related to compatibility and resource management. With the right strategies and best practices in place, Docker can significantly enhance the efficiency of your data pipeline, allowing you to process and analyze data faster than ever before.
Whether you’re building an analytics platform or working with continuous data streams, Docker can empower you to scale your operations smoothly. Embracing containerized environments is a smart step towards optimizing your real-time data pipelines, improving performance, and ensuring seamless scalability as your data needs grow.