Python Automation: Streamlining Data Workflows

Category Data Engineering

In today’s data-driven world, the ability to automate routine tasks is a game-changer. Python, with its extensive libraries and scripting capabilities, empowers professionals to form streamlined data workflows, save time, reduce errors, and improve efficiency. In this document, we’ll delve into the technical aspects of Python automation, exemplified by a use case.

The Power of Python Automation

Consider a scenario where a data analyst is tasked with collecting data, performing analysis, and sending daily reports to stakeholders. The process, if done manually, is laborious and prone to errors. Python automation can revolutionize this workflow.

Use Case: Data Analyst’s Dilemma

Meet Alex, a data analyst grappling with a substantial daily workload. Gathering data from various sources, performing analysis, and distributing reports was a time-consuming, repetitive process. The need for timely and accurate reports often led to stress and inefficiency. Determined to optimize this workflow, Alex turned to Python automation.

How Python Automation Transformed Alex’s Workflow

Alex took a systematic approach to address the data collection, analysis, and reporting aspects of the task:

Import the required Lib -

#importing required libraries
import pandas as pd
import numpy as np
import pytz
import gspread
import pygsheets
import datetime
from datetime import timedelta
from datetime import datetime
#import oauth2client
from queryrunner_client import Client|
from oauth2client.service_account import ServiceAccountCredentials
gs = pygsheets.authorize(service_file 'example. j son' )
c Client(user_email 'example@abc.com' )

STEP 1-

Explanation as to what each does and why is it necessary -

1. pandas (`import pandas as pd`):

- Purpose: pandas is a powerful data manipulation library. It provides data structures like DataFrames for efficient data manipulation and analysis.

- Why Required: Used to handle and manipulate tabular data, such as reading/writing data, merging, filtering, and performing various data operations.

2. numpy (`import numpy as np`): 

- Purpose: numpy is a numerical computing library. It provides support for large, multi-dimensional arrays and matrices, along with mathematical functions to operate on these arrays.

- Why Required: Used for numerical operations and working with arrays, especially in scenarios where efficient numerical computations are needed.

3. pytz (`import pytz`):

- Purpose: pytz is a library for working with time zones. It allows the localization of datetimes to different time zones.

- Why Required: Used to handle and convert time zones, ensuring accurate representation and manipulation of datetime values.

4. gspread (`import gspread`):

- Purpose: gspread is a Python API for Google Sheets. It allows for interaction with Google Sheets, such as reading and writing data.

- Why Required: Used to connect to and manipulate Google Sheets data, which might be the source or destination for data.

5. pygsheets (`import pygsheets`):

- Purpose: pygsheets is another library for working with Google Sheets, providing additional functionality and ease of use.

- Why Required: Used for authorization and access to Google Sheets. Offers a convenient way to work with Google Sheets using Python.

6. datetime (`import datetime`):

- Purpose: datetime is a module in the standard library for working with dates and times.

- Why Required: Used for handling datetime objects, calculating time differences, and formatting timestamps.

7. timedelta (`from datetime import timedelta`):

- Purpose: timedelta is a class in the datetime module for representing differences between two dates or times.

- Why Required: Used for adding or subtracting time intervals, such as calculating a date in the future or past.

8. Client (`from queryrunner_client import Client`):

- Purpose: queryrunner_client is a library for interacting with a query runner service, allowing the execution of queries and retrieval of results.

- Why Required: Used for querying a data source or executing specific operations, depending on the service it connects to.

9. ServiceAccountCredentials (`from oauth2client.service_account import ServiceAccountCredentials`):

- Purpose: ServiceAccountCredentials is used for OAuth 2.0 authentication with Google services.

- Why Required: Used to authenticate the application and authorize access to Google services, such as Google Sheets.

10. pygsheets authorization (`gs = pygsheets.authorize(service_file=’gcp_creds_example.json’)`):

- Purpose: Authorizes access to Google Sheets using credentials from a service account file.

- Why Required: Enables the Python script to interact with Google Sheets on behalf of the service account.

STEP 2-

Now give the Gsheet that you want to play around with or share it across to your managers for their view —
Sheet = “Link of the sheet which is shared to your managers”

Then we authorise and open the sheet -

If you are looking to use this sheet in your Python code, you might want to extract the sheet key from the URL. The sheet key is the long alphanumeric string between “/d/” and “/edit” in the URL.

For example, in the URL you provided:

`https://docs.google.com/spreadsheets/example

The sheet key is `djskhuej6778d9e3nbsk`.

If you intend to use this sheet in your Python code with the `gspread` library, you would typically use the key to open the sheet. Here’s an example:

import gspread

# Extract the sheet key from the URL

sheet_url = ‘https://docs.google.com/spreadsheets/example

sheet_key = sheet_url.split(“/d/”)[1].split(“/”)[0]

# Authorize and open the sheet

gc = gspread.authorize(credentials)

worksheet = gc.open_by_key(sheet_key).sheet1

Make sure to replace `credentials` with your actual Google Sheets credentials. If you have a service account JSON file, you can use `gspread.service_account(filename=’path/to/credentials.json’)` for authorization.

STEP 3-

Now we write a function which is designed to update a specific range of cells in a Google Sheet with the data provided (which is basically the task you’re supposed to do. It can be scraping the data, running your queries and updating the sheet with that data or anything for that matter). Let’s break down each part of the function:

def update_cells(sheet_link, tab_name, start_cell, end_cell, data):

book = pygsheets_client.open_by_url(sheet_link)

ws = book.worksheet_by_title(tab_name)

ws.clear(start=start_cell, end=end_cell)

ws.set_dataframe(df=data, start=start_cell)

1. sheet_link`:

- This parameter is the URL of the Google Sheet.

2. `tab_name`:

- This parameter is the name of the specific worksheet (tab) within the Google Sheet that you want to update.

3. `start_cell` and `end_cell`:

- These parameters specify the range of cells to clear and update in the worksheet.

4. `data`:

- This parameter is the DataFrame (`df`) containing the data you want to update in the specified range.

5. `book = pygsheets_client.open_by_url(sheet_link)`:

- Opens the Google Sheet using the provided URL.

6. `ws = book.worksheet_by_title(tab_name)`:

- Gets the specific worksheet (tab) within the Google Sheet by its name.

7. `ws.clear(start=start_cell, end=end_cell)`:

- Clears the specified range of cells in the worksheet to ensure a clean update.

8. `ws.set_dataframe(df=data, start=start_cell)`:

- Sets the provided DataFrame (`data`) starting from the specified cell. This updates the content in the specified range.

Note: Make sure that the `pygsheets_client` variable is defined and contains the necessary authorization information to access the Google Sheet.

Analyze your data, generate the necessary numbers and reports in your code. Once ready, invoke the `def_update_cells()` function, providing the required inputs. Presto! Your Google Sheet is seamlessly updated using Python.

STEP 4-

But how do I run my Python Script everyday at 5PM well there is a solution for that too —

### Method 1: Using `cron` on Unix/Linux/Mac

1. Open the crontab configuration:

```bash

crontab -e

```

2. Add a new entry to run your script daily at 5 PM:

```bash

0 17 * * * /path/to/python /path/to/your_script.py

```

Replace `/path/to/python` with the path to your Python interpreter and `/path/to/your_script.py` with the full path to your Python script.

3. Save and exit the crontab editor.

### Method 2: Using Task Scheduler on Windows

1. Open Task Scheduler:

- Press `Win + R`, type `taskschd.msc`, and press Enter.

2. Create a Basic Task:

- Click on “Create Basic Task…” on the right-hand side.

3. Follow the Wizard:

- Name your task and provide a description.

- Choose “Daily” and click Next.

4. Set the Daily Trigger:

- Specify the start date and time (5:00 PM) and set it to recur every 1 day. Click Next.

5. Choose Action:

- Select “Start a Program” and click Next.

6. Configure the Program:

- Browse and select the Python executable (`python.exe`) in the “Program/script” field.

- In the “Add arguments (optional)” field, provide the full path to your Python script (`/path/to/your_script.py`).

- Click Next.

7. Review and Finish:

- Review your settings and click Finish.

Now, your Python script will be scheduled to run daily at 5 PM.

Choose the method that corresponds to your operating system, and adjust the paths and file names accordingly.

STEP 5-

Now how do we let our end users know that sheet is updated with most recent data again Python is here to help you -

Here are two different ways to set up mail sending using Python: one using the built-in `smtplib` library for sending emails through a Gmail account and another using a third-party library called `yagmail`.

### Method 1: Using `smtplib` (with Gmail)

Step 1: Enable “Less Secure App Access” in your Gmail account

- Go to your Google Account settings: [Google Account](https://myaccount.google.com/).

- Navigate to the “Security” tab.

- Under “Less secure app access,” turn on access.

Step 2: Write Python Code using `smtplib`

import smtplib

from email.mime.text import MIMEText

from email.mime.multipart import MIMEMultipart

def send_email(subject, body, to_email):

# Your Gmail credentials

sender_email = ‘your_email@gmail.com’

sender_password = ‘your_password’

# Create the email message

msg = MIMEMultipart()

msg[‘From’] = sender_email

msg[‘To’] = to_email

msg[‘Subject’] = subject

# Attach the body of the email

msg.attach(MIMEText(body, ‘plain’))

# Connect to the Gmail server

with smtplib.SMTP(‘smtp.gmail.com’, 587) as server:

server.starttls()

server.login(sender_email, sender_password)

# Send the email

server.sendmail(sender_email, to_email, msg.as_string())

# Example usage

send_email(‘Test Subject’, ‘Hello, this is a test email.’, ‘recipient_email@example.com’)

```

Replace `’your_email@gmail.com’` and `’your_password’` with your Gmail credentials. Ensure you keep your credentials secure.

### Method 2: Using `yagmail` Library

Step 1: Install `yagmail`**

```bash

pip install yagmail

```

Step 2: Write Python Code using `yagmail`**

```python

import yagmail

def send_email(subject, body, to_email):

# Your Gmail credentials

sender_email = ‘your_email@gmail.com’

sender_password = ‘your_password’

# Connect to the Gmail server using yagmail

with yagmail.SMTP(sender_email, sender_password) as yag:

# Send the email

yag.send(to=to_email, subject=subject, contents=body)

# Example usage

send_email(‘Test Subject’, ‘Hello, this is a test email.’, ‘recipient_email@example.com’)

```

Replace `’your_email@gmail.com’` and `’your_password’` with your Gmail credentials. Ensure you keep your credentials secure.

These examples assume you’re using a Gmail account. If you’re using a different email provider, you may need to adjust the SMTP server and port accordingly. Always be cautious with your email credentials and consider using environment variables or a configuration file for a more secure setup.

Author: Abhishek Singh

Ready to embark on a transformative journey? Connect with our experts and fuel your growth today!