How to perform groupby using Python Itertools

itertools is a powerful module in the Python standard library. It provides a set of fast, memory-efficient functions for working with iterators. You can learn more about them by referring to this article.

# Sample Data 
data = [
        {'id': 1, 'name': 'abc', 'child': {'id': 2, 'name': 'child1'}},
        {'id': 1, 'name': 'abc', 'child': {'id': 3, 'name': 'child2'}},
        {'id': 2, 'name': 'def', 'child': {'id': 4, 'name': 'child3'}},
        {'id': 2, 'name': 'def', 'child': {'id': 5, 'name': 'child4'}}
      ]

Problem statement: For each (id, name) pair, the child entries of all matching records should be collected into a single list of dictionaries.

# Expected output
 [
   {'id': 1, 'name': 'abc', 'child': [
                                      {'id': 2, 'name': 'child1'}, 
                                      {'id': 3, 'name': 'child2'}
                                     ]
   },
   {'id': 2, 'name': 'def', 'child': [
                                      {'id': 4, 'name': 'child3'},
                                      {'id': 5, 'name': 'child4'}
                                      ]
   }
 ]

I used the following code to get the expected output:

import pprint
pp = pprint.PrettyPrinter(indent=4)

from itertools import groupby
from operator import itemgetter
# Define group by key
grouper = itemgetter("id", "name")
result = []

# groupby only merges consecutive items that share a key,
# so sort the input on the same key first
for key, grp in groupby(sorted(data, key=grouper), grouper):
    temp_dict = dict(zip(["id", "name"], key))
    # Collect the child dictionary of every item in this group
    temp_dict['child'] = [item['child'] for item in grp]
    result.append(temp_dict)

# print the result
pp.pprint(result)
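
The sorting step matters because groupby starts a new group every time the key changes, rather than collecting all equal keys from the whole input. The small sketch below illustrates this with hypothetical records (the `rows` list and the `by_id` name are made up for this example and are not part of the data above):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical records, deliberately not sorted by id
rows = [
    {"id": 1, "child": "a"},
    {"id": 2, "child": "b"},
    {"id": 1, "child": "c"},
]
by_id = itemgetter("id")

# Without sorting, id 1 shows up as two separate groups
unsorted_keys = [key for key, _ in groupby(rows, by_id)]

# After sorting on the same key, each id forms exactly one group
sorted_groups = {
    key: [row["child"] for row in grp]
    for key, grp in groupby(sorted(rows, key=by_id), by_id)
}

print(unsorted_keys)   # [1, 2, 1]
print(sorted_groups)   # {1: ['a', 'c'], 2: ['b']}
```

If the input is already ordered by the grouping key, the sorted call can be skipped.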

AWS Glue Python shell job timeout with custom Libraries

This is a short post on the timeout errors I faced when using custom libraries with an AWS Glue Python shell job. I followed the steps listed in the AWS docs to create a custom library and submitted the job with a timeout of 5 minutes. The job timed out without any errors in the logs. The CloudWatch log reported the following messages:


2020-06-13T12:02:28.821+05:30 Installed /glue/lib/installation/redshift_utils-0.1-py3.7.egg
2020-06-13T12:02:28.822+05:30 Processing dependencies for redshift-utils==0.1
2020-06-13T12:12:45.550+05:30 Searching for redshift-module==0.1
2020-06-13T12:12:45.550+05:30 Reading https://pypi.org/simple/redshift-module/

While searching for the error, I came across this AWS Forum post, where it was recommended to use Python 3.6. I went back to the documentation, which confirmed that AWS Glue Python shell jobs are compatible with Python 2.7 and 3.6. I was using a Python 3.7 virtualenv for my testing, so this had to be fixed.

To easily manage multiple environments, I installed Miniconda on my Mac, which makes it possible to create virtual environments with different Python versions. After installation, I created a new Python 3.6 environment with conda and built the egg file:

conda create -n venv36 python=3.6
conda activate venv36
python setup.py bdist_egg
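
To avoid repeating this mistake, a small guard can be run before building the egg to check the interpreter version. This is only a sketch: the `SUPPORTED_VERSIONS` tuple and the `is_supported` helper are hypothetical names, and the version list simply mirrors the 2.7 and 3.6 compatibility mentioned above, so adjust it to whatever the current documentation states.

```python
import sys

# Assumption: versions reported above as compatible with Glue Python shell
# jobs at the time of writing; adjust this tuple for your own setup.
SUPPORTED_VERSIONS = ((2, 7), (3, 6))


def is_supported(version_info=sys.version_info):
    """Return True if the major.minor version is in SUPPORTED_VERSIONS."""
    return tuple(version_info[:2]) in SUPPORTED_VERSIONS


if not is_supported():
    # Warn rather than fail, since this is only a convenience check
    print("Warning: building with Python %d.%d" % sys.version_info[:2])
```

Running this inside the activated venv36 environment should print nothing, while a Python 3.7 environment would print the warning.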