Big Data Developer

Responsibilities:

  • Design, build, test, and maintain scalable, stable off-the-shelf applications to support
    distributed processing using the Hadoop ecosystem
  • Implement ETL and data processes for structured and unstructured data
  • Build pipelines for optimal extraction of data from a wide variety of data sources,
    covering ingestion, transformation, conversion, and validation
  • Conduct root cause analysis and advanced performance tuning for complex business
    processes and functionality
  • Review frameworks and design principles for suitability in the project context
  • Client orientation:
      • Propose the right solutions to the client by identifying and understanding critical
        pain points
      • Contribute to the entire implementation process, including driving the definition of
        improvements based on business needs and architecture
      • Propose, pitch, sell, implement, and prove success in continuous improvement
        initiatives
      • Work and collaborate with multiple teams and stakeholders
  • Agile orientation:
      • Take part in Agile ceremonies to groom stories and develop defect-free code for
        those stories
      • Review code for quality and implementation best practices
      • Promote coding, testing, and deployment best practices through hands-on research
        and demonstration
      • Write testable code that enables extremely high levels of code coverage (a minimal
        sketch of this style follows this list)
      • Mentor young engineers and guide them to become great engineers
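
The testability point above implies transformations written as pure DataFrame-in, DataFrame-out functions. Below is a minimal sketch of that style, assuming a hypothetical "events" dataset; the function name, columns, and assertion are illustrative only, not part of any existing codebase.

    from pyspark.sql import DataFrame, SparkSession
    from pyspark.sql import functions as F


    def clean_events(raw: DataFrame) -> DataFrame:
        # Pure DataFrame-in / DataFrame-out transform: drop rows without an id,
        # parse the timestamp, and de-duplicate on the id column.
        return (
            raw.filter(F.col("event_id").isNotNull())
               .withColumn("event_ts", F.to_timestamp("event_ts"))
               .dropDuplicates(["event_id"])
        )


    if __name__ == "__main__":
        # Tiny local run standing in for a pytest-style unit test.
        spark = SparkSession.builder.master("local[1]").appName("clean-events").getOrCreate()
        raw = spark.createDataFrame(
            [("e1", "2024-01-01 00:00:00"),
             ("e1", "2024-01-01 00:00:00"),
             (None, "2024-01-02 00:00:00")],
            ["event_id", "event_ts"],
        )
        assert clean_events(raw).count() == 1  # null id dropped, duplicate collapsed
        spark.stop()

Because the transform takes and returns a DataFrame, it can be exercised with tiny in-memory inputs under a test runner, which is what makes very high code coverage practical.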

Desired Skills / Experience:

  • Preferably 4 to 7 years of experience
  • Highly skilled in:
      • PySpark and Spark
      • The PySpark SQL and DataFrame APIs (see the first sketch after this list)
      • Interpreting the Spark execution DAG as displayed in the ApplicationMaster
      • Writing optimal PySpark code, plus deep knowledge of Spark parameter tuning for
        execution optimization (see the second sketch after this list)
      • Python (2 and 3), including knowledge of libraries like NumPy, pandas, etc.
      • Writing Sqoop scripts for ETL from Teradata
      • SQL and analytical thinking
  • Strong understanding of:
      • Hadoop and Spark architectures and the MapReduce framework
      • Big data stores like HDFS, HBase, and Cassandra
      • Data formats like Avro, Parquet, ORC, etc.
  • Exposure to at least one big data platform like Hortonworks HDP, Cloudera, AWS EMR,
    MapR, etc.
  • Prior experience with:
      • Monitoring and administration tools like Ambari, Ganglia, etc.
      • Scheduling big data applications using Oozie (including workflow and coordinator
        properties)
  • Good OO skills, including good knowledge of design patterns
  • Good understanding of technologies like Hive, Pig, Presto, Impala, etc.
  • Prior experience in building Spark infrastructure (cluster setup, administration,
    performance tuning), on-premise (bare metal) and/or cloud-based
  • Knowledge of software best practices, like Test-Driven Development (TDD) and
    Continuous Integration (CI)
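
First sketch: a hedged illustration of the PySpark SQL and DataFrame API items above, showing the same aggregation expressed both ways with a columnar read and write. The paths, table name, columns, and the shuffle-partition value are illustrative assumptions, not prescribed settings.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .appName("orders-aggregation")
        # Shuffle parallelism is a common tuning knob; 64 is purely illustrative.
        .config("spark.sql.shuffle.partitions", "64")
        .getOrCreate()
    )

    orders = spark.read.parquet("/data/orders")        # columnar input (Parquet)

    # DataFrame API version of a simple aggregation.
    daily = orders.groupBy("order_date").agg(F.sum("amount").alias("total_amount"))

    # Equivalent Spark SQL version; both compile to the same execution plan.
    orders.createOrReplaceTempView("orders")
    daily_sql = spark.sql(
        "SELECT order_date, SUM(amount) AS total_amount "
        "FROM orders GROUP BY order_date"
    )

    daily.write.mode("overwrite").orc("/data/daily_orders")   # columnar output (ORC)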
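
Second sketch: a small illustration of the parameter-tuning and DAG-reading items, broadcasting the small side of a join and printing the physical plan. The datasets, join key, and threshold are hypothetical; the same stages appear as the DAG in the Spark UI reachable from the YARN ApplicationMaster.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .appName("join-tuning")
        # Raise the auto-broadcast threshold so small dimension tables are broadcast
        # instead of shuffled; the value is an example, not a recommendation.
        .config("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))
        .getOrCreate()
    )

    facts = spark.read.parquet("/data/facts")
    dims = spark.read.parquet("/data/dims")

    # Explicit broadcast hint for the small side of the join.
    joined = facts.join(F.broadcast(dims), on="dim_id", how="left")

    # explain() prints the physical plan; the same stages show up as the DAG in the
    # Spark UI linked from the ApplicationMaster.
    joined.explain()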