Develops and maintains scalable data pipelines and builds out new API integrations to support continuing increases in data volume and complexity.
Collaborates with analytics and business teams to improve data models that feed business intelligence tools, increasing data accessibility and fostering data-driven decision making across the organization.
Implements processes and systems to monitor data quality, ensuring production data remains accurate and available for the key stakeholders and business processes that depend on it.
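As a rough illustration of the data quality monitoring described above, the sketch below runs simple rule-based checks on a batch of records in plain Python; the function, field, and threshold names are hypothetical, not part of any specific framework:

```python
def check_data_quality(rows, required_fields, max_null_rate=0.01, min_rows=1):
    """Run basic quality checks on a batch of records (list of dicts).

    Returns a dict of check name -> bool so a monitor can alert on failures.
    """
    results = {"row_count_ok": len(rows) >= min_rows}
    for field in required_fields:
        nulls = sum(1 for r in rows if r.get(field) is None)
        null_rate = nulls / len(rows) if rows else 1.0
        results[f"{field}_null_rate_ok"] = null_rate <= max_null_rate
    return results

# Example batch: one record is missing user_id, so that check fails
# at a 1% null-rate threshold while the amount check passes.
batch = [{"user_id": 1, "amount": 10.0},
         {"user_id": None, "amount": 5.5},
         {"user_id": 3, "amount": 7.2}]
report = check_data_quality(batch, ["user_id", "amount"])
```

In practice checks like these would run inside the pipeline framework itself (e.g. as a post-load validation step) and feed an alerting system rather than return a dict.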
Writes unit/integration tests, contributes to the engineering wiki, and documents work. Performs the data analysis required to troubleshoot and resolve data-related issues.
Works closely with a team of frontend and backend engineers, product managers, and analysts.
Defines company data assets (data models) and the Spark, Spark SQL, and Hive SQL jobs that populate them.
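A job that populates a data model is typically an aggregation query materialized into a target table. The sketch below uses the stdlib sqlite3 module as a runnable stand-in for a Spark SQL or Hive SQL job; the table and column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Raw events table (source) and the data model to populate (target).
cur.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
cur.execute("CREATE TABLE user_spend (user_id INTEGER PRIMARY KEY, total REAL)")
cur.executemany("INSERT INTO events VALUES (?, ?)",
                [(1, 10.0), (1, 5.0), (2, 7.5)])

# The populate step: aggregate source rows into the model -- the same
# shape as a Spark SQL "INSERT ... SELECT ... GROUP BY" job.
cur.execute("""
    INSERT INTO user_spend (user_id, total)
    SELECT user_id, SUM(amount) FROM events GROUP BY user_id
""")
conn.commit()
rows = cur.execute(
    "SELECT user_id, total FROM user_spend ORDER BY user_id").fetchall()
```

In Spark the same query would run distributed over partitioned data, but the logical shape of the populate job is identical.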
Designs data integrations and data quality framework.
Designs and evaluates open source and vendor tools for data lineage.
Works closely with all business units and engineering teams to develop a strategy for long-term data platform architecture.
Area of Expertise:
Experience with cloud platforms such as AWS, GCP, or Azure.
Experience with distributed data processing tools such as SQL, Spark, Python, and PySpark.
Performance tuning: optimizing SQL and PySpark jobs for performance.
Experience with Airflow, a workflow scheduling tool, for creating data pipelines.
Experience with GitHub for source control and with creating and configuring Jenkins pipelines.
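As a rough illustration of what a workflow scheduler such as Airflow provides, the sketch below executes tasks in dependency order in plain Python; a real Airflow DAG declares the same structure with operators and `>>` dependencies, and all task names here are hypothetical:

```python
def run_pipeline(tasks, deps):
    """Execute tasks respecting dependencies (a tiny scheduler stand-in).

    tasks: dict of task name -> zero-arg callable
    deps:  dict of task name -> list of upstream task names
    Returns the order in which tasks ran.
    """
    done, order = set(), []
    while len(done) < len(tasks):
        progressed = False
        for name, fn in tasks.items():
            if name not in done and all(d in done for d in deps.get(name, [])):
                fn()
                done.add(name)
                order.append(name)
                progressed = True
        if not progressed:
            raise ValueError("cycle detected in task dependencies")
    return order

# extract -> transform -> load, as an Airflow DAG would express with >>.
log = []
order = run_pipeline(
    {"extract": lambda: log.append("E"),
     "transform": lambda: log.append("T"),
     "load": lambda: log.append("L")},
    {"transform": ["extract"], "load": ["transform"]},
)
```

Airflow adds what this sketch omits: scheduling, retries, backfills, and per-task observability, which is why it is the standard choice for production pipelines.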