Responsibilities:
1. Data Pipeline Development: Design, build, and maintain robust data pipelines to collect, process, and transport data from various sources to the data warehouse or data lake.
2. Data Modeling: Develop and optimize data models, schemas, and structures to support efficient data storage and retrieval, ensuring data is organized for analytics and reporting.
3. ETL Process Management: Manage complex ETL (Extract, Transform, Load) processes, including data transformation, validation, and cleansing, while ensuring data quality and integrity (a minimal sketch of such a batch ETL step follows this list).
4. Performance Optimization: Optimize data pipelines and database systems for improved performance, scalability, and responsiveness to meet growing data demands.
5. Data Integration: Implement data integration solutions for real-time and batch data streaming, ensuring that data is accessible and available to users and applications.
6. Data Security and Compliance: Implement and maintain data security measures and ensure compliance with relevant data protection and privacy regulations, such as GDPR or HIPAA.
7. Automation: Automate repetitive data engineering tasks and processes to reduce manual effort and improve efficiency.
8. Monitoring and Troubleshooting: Set up monitoring systems to track data pipeline health and troubleshoot issues to minimize downtime and data loss.
9. Collaboration: Collaborate with data scientists, analysts, and other stakeholders to understand data requirements and business processes, and provide support for their analytics and reporting needs.
10. Data Warehouses and Data Lakes: Familiarity with data warehouses and data lakes used to store and manage large volumes of pharmaceutical data efficiently.
11. Pharma Datasets: Understanding of pharma-related datasets such as IQVIA, PLD, OneKey, sales data, activity data, clinical trial data, manufacturing data, and transactional data.
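For illustration only, the following is a minimal PySpark sketch of the kind of batch ETL step described in items 1 and 3; the paths, table, and column names are hypothetical, not part of any prescribed implementation.

    # Minimal batch ETL sketch (illustrative; paths and column names are hypothetical)
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("sales_etl").getOrCreate()

    # Extract: read raw sales activity files from a landing zone
    raw = spark.read.option("header", True).csv("/landing/sales_activity/")

    # Transform: deduplicate, validate keys, and standardize dates
    clean = (
        raw.dropDuplicates(["transaction_id"])
           .filter(F.col("transaction_id").isNotNull())
           .withColumn("sale_date", F.to_date("sale_date", "yyyy-MM-dd"))
    )

    # Load: write partitioned Parquet into the curated layer of the data lake
    clean.write.mode("overwrite").partitionBy("sale_date").parquet("/lake/curated/sales/")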
Additional Responsibilities:
Advanced SQL, including window functions, joins, and query optimization techniques (an illustrative window-function query follows below). Advanced analytics and machine learning experience is a plus. CI/CD pipelines, Git or DevOps, and hands-on Unity Catalog/Purview capability are nice to have.
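As an illustration of the window-function skills referenced above, here is a short Spark SQL sketch; the table and column names (curated.sales_monthly, territory_id, net_sales) are hypothetical.

    # Rank products by net sales within each territory (illustrative query;
    # table and column names are hypothetical)
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    ranked = spark.sql("""
        SELECT territory_id,
               product_id,
               net_sales,
               RANK() OVER (PARTITION BY territory_id ORDER BY net_sales DESC) AS sales_rank
        FROM curated.sales_monthly
        WHERE sale_month = '2024-01'
    """)
    ranked.show()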
Multiple Locations (All Infosys Locations in India)
Technical and Professional Requirements:
a. Understanding of Apache Spark architecture and how it behaves under heavy data loads.
b. Familiarity with BI/DW concepts, data modeling, and entity-relationship diagrams.
c. Demonstrated proficiency in developing KPIs and visual controls.
d. Optimization methodologies such as partitioning, caching, shuffling, and broadcast variables (see the illustrative sketch after this list).
e. The concept of parameterization and why it is used, with real-world examples.
f. Strong knowledge of data integration solutions, including Adeptia and SnapLogic.
g. Pipeline triggers, their types, and use cases; types of activities in ADF Studio.
h. Hands-on experience with Python/PySpark/SQL and Azure Logic Apps.
i. Azure Databricks, Python libraries, and batch load and transformation using scripts.
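The sketch below illustrates the optimization techniques named in item d (broadcast join, repartitioning, caching) on hypothetical DataFrames; it is a minimal example, not a prescribed implementation.

    # Illustrative Spark optimizations: broadcast join, repartition, cache
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.getOrCreate()

    facts = spark.read.parquet("/lake/curated/sales/")       # large fact table (hypothetical path)
    dims = spark.read.parquet("/lake/reference/products/")   # small dimension table (hypothetical path)

    # Broadcast the small dimension so the large fact table is not shuffled for the join
    joined = facts.join(broadcast(dims), on="product_id", how="left")

    # Repartition by a frequently used key and cache for repeated downstream aggregations
    joined = joined.repartition("territory_id").cache()

    joined.groupBy("territory_id").agg(F.sum("net_sales").alias("total_sales")).show()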