
Author: Steve Lock
Keep reading for an overview of the most common automated solutions for data pipelines.
You’ll learn about a range of options, from orchestration tools and no-code solutions to offerings from major cloud providers and more fully featured platforms.
NOTE: In recent years, this space has exploded with new solutions emerging almost daily, so this is surely not an exhaustive list. If you think we’ve missed a great one, please reach out and let us know.
Apache Airflow/Google Cloud Composer/Astronomer
Apache Airflow is an open source orchestration technology that powers a number of managed services, such as Google Cloud Composer and Astronomer. It’s widely regarded as the industry standard for data orchestration and acts as the backbone for scheduling and executing jobs, such as extracting data.
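To make "orchestration" a little more concrete, here is a minimal sketch of the core idea: running jobs in dependency order. It uses only the Python standard library and is not Airflow's API. The task names, sample rows and the run_pipeline helper are hypothetical and exist purely for illustration. A real orchestrator layers scheduling, retries, logging and monitoring on top of this.

```python
from graphlib import TopologicalSorter


def extract():
    # Stand-in for pulling rows from a source system (hypothetical sample data).
    return [{"order_id": 1, "amount": "19.99"}, {"order_id": 2, "amount": "5.00"}]


def transform(rows):
    # Convert the amount strings to numbers.
    return [{**row, "amount": float(row["amount"])} for row in rows]


def load(rows):
    # Stand-in for writing to a warehouse; returns the number of rows "loaded".
    return len(rows)


# Each task maps to (function, name of the upstream task it depends on, or None).
TASKS = {
    "extract": (extract, None),
    "transform": (transform, "extract"),
    "load": (load, "transform"),
}


def run_pipeline(tasks):
    # Build a graph of task -> set of upstream tasks, then run upstream tasks first.
    graph = {
        name: ({upstream} if upstream else set())
        for name, (_, upstream) in tasks.items()
    }
    order = list(TopologicalSorter(graph).static_order())
    results = {}
    for name in order:
        func, upstream = tasks[name]
        results[name] = func(results[upstream]) if upstream else func()
    return order, results


if __name__ == "__main__":
    order, results = run_pipeline(TASKS)
    print("Execution order:", order)
    print("Rows loaded:", results["load"])
```

In this sketch each task has at most one upstream dependency, which keeps the example short. Tools like Airflow and Dagster handle more complex dependency graphs.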
Dagster
Dagster is another open source orchestration option, and it can be particularly useful for observability. Depending on your needs and your engineering team’s preferences, it may be a good alternative to Apache Airflow.
No-Code SaaS Connectors
For many businesses, it increasingly makes sense to run data pipelines on managed services that handle data integrations on a SaaS model. This is a particularly good option if you are light on engineering resources or want to outsource the maintenance of your data integrations.
You will need to research which sources are supported and how you would work with the data once extracted. Keep in mind that features can vary quite a bit, especially around manipulating data.
There are too many to mention, but some of the most common are below. (FYI: There’s a lot of variety on this list, from connectors to fully featured platforms.)
- Fivetran
- Stitch Data
- MuleSoft
- Talend
- Hevo
- Windsor.ai
- Informatica
- Funnel.io
Major Cloud Platforms
It probably comes as no surprise that the major cloud platforms also have a series of offerings to help streamline and automate working with data.
We’ve already mentioned Google Cloud Composer. Other offerings worth exploring include:
- AWS Glue
- Azure Data Factory
- Google Cloud Dataflow
This solution category also includes a huge number of options, depending on your use case. If your organization already has a strong preference for GCP, AWS or Azure, we’d recommend starting your research with that platform. We also recommend working with a developer, as each cloud platform often has multiple services for the same use case.
Qlik
This solution is well worth considering if you’re looking for a platform that handles data extraction, data management, transformation and visualization.
If we had to pick just one solution in this category, Qlik is hard to beat because it can act like a Swiss Army knife. It’s one of the few options we’re aware of that doesn’t require using multiple systems, which is especially appealing for organizations with modest requirements.
Alteryx
We’ve heard several success stories where Alteryx has been rolled out at scale and non-technical users have been trained to complete data engineering workloads. As with all no-code solutions, there are trade-offs in terms of flexibility, but if you’re looking for a solution to roll out to a non-technical team, this one is well worth evaluating.
PS. Automating data pipelines is just one piece of the puzzle. To fully leverage the insights flowing through your systems, check out these 8 First-Party Data Collection Strategies and learn better ways to capture and use first-party data.