Archive for November, 2018
November 30, 2018
One of the nice things about AWS EMR is that when you spin up a cluster and run pyspark, a SparkSession is set up for you. You’re also easily able to use the Glue catalog as the metastore for Hive tables. If you’re working on building out scripts[…]
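As a rough sketch of what EMR wires up for you: the Glue Data Catalog integration comes from the `hive-site` and `spark-hive-site` configuration classifications. The snippet below builds those settings as they would be passed to a cluster launch; the cluster name and the commented `boto3` call are illustrative assumptions, not part of the original post.

```python
# Hypothetical sketch: the configuration classifications that make EMR use the
# Glue Data Catalog as the Hive metastore (the console exposes the same option).
glue_metastore = {
    'hive.metastore.client.factory.class':
        'com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory'
}
configurations = [
    {'Classification': 'hive-site', 'Properties': glue_metastore},
    {'Classification': 'spark-hive-site', 'Properties': glue_metastore},
]
# import boto3
# boto3.client('emr').run_job_flow(Name='my-cluster',
#                                  Configurations=configurations, ...)
```

With this in place, `spark.sql(...)` on the cluster resolves table names against the Glue catalog instead of a local metastore.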
November 26, 2018
Before starting, there are a few things you’ll need to have in place. Primarily, you’ll need PyDev installed and a download of the version of Spark that you want to use. In this case, we are doing our setup on a Mac and[…]
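For a local PyDev setup like this, the interpreter needs to find the Python bindings inside the Spark download. A minimal sketch, assuming a hypothetical download path (the Spark version and location are placeholders, not from the post):

```python
import os, sys, glob

# Placeholder: point this at wherever your Spark download was unpacked.
SPARK_HOME = os.path.expanduser('~/spark-2.3.2-bin-hadoop2.7')

# In PyDev these entries would go on the interpreter's PYTHONPATH;
# the same paths can be appended programmatically:
sys.path.append(os.path.join(SPARK_HOME, 'python'))
for zip_path in glob.glob(os.path.join(SPARK_HOME, 'python', 'lib', 'py4j-*-src.zip')):
    sys.path.append(zip_path)
os.environ['SPARK_HOME'] = SPARK_HOME
```

After this, `from pyspark.sql import SparkSession` works in scripts run from the IDE.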
November 23, 2018
Have been pulling a lot of data out of Hana lately. Due to network restrictions we haven’t been able to use pip, so we are uploading the pyhdb module into a directory called ‘python’ in the user’s home directory and using it to connect to Hana.
import os, sys
from pyspark.sql.types import StructField, StructType, StringType

# pyhdb module is uploaded into a 'python' directory in the home directory.
sys.path.append(os.path.join(os.path.expanduser('~'), 'python'))
import pyhdb

# connection details are placeholders
connection = pyhdb.connect(host='hana-host', port=30015, user='USER', password='PASSWORD')
cursor = connection.cursor()
query = "SELECT * FROM MY_TABLE"  # placeholder query
cursor.execute(query)
schemaString = 'col1,col2'  # placeholder: comma-separated column names matching the query
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split(',')]
schema = StructType(fields)
df = spark.createDataFrame(cursor.fetchall(), schema)  # spark: existing SparkSession
November 20, 2018
There are times when you might have a lot of data files in an S3 location that you update periodically. If they are stored in a format that does not hold any schema information (such as CSV with no header), then each time you run a crawler[…]
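One way around repeated crawls is to define the schema once in code and register the Glue table yourself. A minimal sketch using boto3: the table name, column list, and S3 location below are all hypothetical placeholders.

```python
# Sketch: instead of re-crawling headerless CSVs, define the schema once and
# register the Glue table directly. All names and paths here are placeholders.
def csv_table_input(name, columns, s3_location):
    """Build a Glue TableInput for a headerless CSV table (all columns string)."""
    return {
        'Name': name,
        'StorageDescriptor': {
            'Columns': [{'Name': c, 'Type': 'string'} for c in columns],
            'Location': s3_location,
            'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
            'SerdeInfo': {
                'SerializationLibrary': 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe',
                'Parameters': {'field.delim': ','},
            },
        },
    }

table = csv_table_input('events', ['id', 'ts', 'payload'], 's3://my-bucket/events/')
# import boto3
# boto3.client('glue').create_table(DatabaseName='my_db', TableInput=table)
```

Since the schema is fixed in code, new files dropped into the S3 prefix are picked up on the next query with no crawler run needed.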